Posts

Showing posts from 2017

Day 25 R Programming

https://stackoverflow.com/questions/18222286/dynamically-select-data-frame-columns-using-and-a-vector-of-column-names https://stackoverflow.com/questions/12614953/how-to-create-a-numeric-vector-of-zero-length-in-r https://stackoverflow.com/questions/7355187/error-in-if-while-condition-missing-value-where-true-false-needed http://www.dummies.com/programming/r/how-to-add-observations-to-a-data-frame-in-r/ https://stackoverflow.com/questions/11561856/add-new-row-to-dataframe-at-specific-row-index-not-appended https://stackoverflow.com/questions/22235809/append-value-to-empty-vector-in-r

Day 24 (quite a break) Back to Git Issues

So after three weeks of my sister-in-law's wedding rituals, and a return to an office loaded with tons of work, I finally have some time to get back to this. Only to realise that it has been more than a month and I'm already beginning to forget things. First day back and the Git issues start. Apparently, before leaving I had pulled the working code and downloaded RStudio and R so that I could learn something during my vacation (high hopes); I didn't touch a thing. Now that I'm back, I realise that I had made certain changes on this machine and had even committed them from RStudio, but they don't show up in my GitHub repo, or on the other machine when I pull. Something was wrong. I figured out it was because of the origin: I hadn't added .git to the end of the repo URL when adding the remote origin, and that indeed caused some issues. Surprisingly, RStudio didn't error out, and its history even shows the commits. So I finally recreated the origin, pulled and pushed the code, thankfully so

Day 23 Git Crashes

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: empty ident name (for <(NULL)>) not allowed

Git - fatal: Unable to create '/path/my_project/.git/index.lock': File exists.
http://stackoverflow.com/questions/7860751/git-fatal-unable-to-create-path-my-project-git-index-lock-file-exists

Day 23 R Objective Functions. Plotting , Date Time

Objective functions can be thought of as the essence of a constructor inside a function: they let you declare the parameters once and keep the function instantiated for reuse, instead of passing the parameters each time.

Learned a little bit of plotting and the different types of plot, probably the easiest of the visualizations.

Mode, Class and Type of R objects
https://stats.stackexchange.com/questions/3212/mode-class-and-type-of-r-objects
https://cran.r-project.org/doc/manuals/r-patched/R-intro.html#Object-orientation

t2, like all POSIXlt objects, is just a list of values that make up the date and time. Use str(unclass(t2)) to have a more compact view.

> str(unclass(t2))
List of 11
 $ sec   : num 18.5
 $ min   : int 32
 $ hour  : int 16
 $ mday  : int 23
 $ mon   : int 3
 $ year  : int 117
 $ wday  : int 0
 $ yday  : int 112
 $ isdst : int 0
 $ zone  : chr "AST"
 $ gmtoff: int 10800
 - attr(*, "tzone&q
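The POSIXlt fields shown above can be a little surprising, since some of them are offsets rather than calendar values. Here is a minimal sketch, assuming a made-up timestamp and the UTC time zone (the original t2 and its time zone are not reproduced here), that builds a similar object and reads a few fields back:

```r
# t2 here is a hypothetical timestamp chosen for illustration;
# it is not the value from the original session.
t2 <- as.POSIXlt("2017-04-23 16:32:18", tz = "UTC")

# unclass() strips the POSIXlt class and exposes the underlying list
parts <- unclass(t2)
str(parts)

# mon counts months from 0, so April is 3
month_number <- parts$mon + 1

# year counts years since 1900, so 2017 is stored as 117
calendar_year <- parts$year + 1900

# wday counts days of the week from 0 (Sunday); yday counts days of the year from 0
weekday_index <- parts$wday
day_of_year <- parts$yday + 1
```

Adding 1 to mon and yday, and 1900 to year, gives the values you would read off a calendar.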

Day 22 R Coursera

Day 21 R Continuing in Coursera

So I'm just done with the GCP course on Coursera: a brief introduction to the set of tools provided by the Google Cloud Platform, with practical hands-on labs that show how easy it is to get things started.

Partial matching of function argument
http://stackoverflow.com/questions/14153904/partial-matching-of-function-argument
The three-dots construct in R
http://www.burns-stat.com/the-three-dots-construct-in-r/
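The two R links above go together: partial matching applies to arguments that come before the three dots, while arguments after the dots must be named in full. A small sketch, using two hypothetical functions (scale_values and paste_all are made up for illustration):

```r
# scale_values is a hypothetical function: `multiplier` comes BEFORE `...`,
# so it can be matched by an abbreviated name such as `mult`.
scale_values <- function(values, multiplier = 2, ...) {
  extras <- list(...)  # anything not matched ends up in the dots
  list(result = values * multiplier, n_extras = length(extras))
}

out <- scale_values(1:3, mult = 10, "a", "b")

# paste_all is a hypothetical wrapper: `sep` comes AFTER `...`,
# so it is matched only when spelled out exactly.
paste_all <- function(..., sep = "-") paste(..., sep = sep)

exact   <- paste_all("a", "b", sep = "+")  # sep is matched
partial <- paste_all("a", "b", se = "+")   # se falls into the dots instead
```

In the last call the abbreviated name is not treated as sep, so "+" is simply pasted as a third string with the default separator.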

Day 20 Google APIs, Google Application Default Credentials

Searching for objects by attribute value: it has to be Datastore. Remember that with Bigtable you can only search by key.
High-throughput writes of wide-column data: that is Bigtable, because it supports high-throughput writes.
Warehousing structured data: the data warehouse technology on Google Cloud is BigQuery.
Creating and testing new machine learning methods: if you're writing new machine learning methods, then TensorFlow.
Developing Big Data algorithms interactively in Python: interactive development in Python is done best with Datalab.
No-ops, custom machine learning applications at scale: no-ops ML at scale is a role for Cloud ML.
Automatically rejecting inappropriate image content: that could be the Vision API. You could use the Vision API to see whether content is safe or not.

Day 19 TensorFlow

http://www.kdnuggets.com/2015/11/google-tensorflow-deep-learning-disappoints.html
http://www.businessinsider.com/what-is-google-tensorflow-2015-11
http://playground.tensorflow.org/#activation=linear&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.98991&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false

Feature Engineering
https://datasciencedojo.com/data-wrangling-in-r/
https://en.wikipedia.org/wiki/Feature_engineering

For machine learning in TensorFlow, the data used to train the model has to be numeric (and not a categorical variable, which holds no weight for its value and is just a representation or classification, like day of the

Day 18 GCP DataLab, Big Query from Client Side , Pandas- Python

'sudo su -' vs 'sudo -i' vs 'sudo /bin/bash' - when does it matter which is used, or does it matter at all?
https://askubuntu.com/questions/376199/sudo-su-vs-sudo-i-vs-sudo-bin-bash-when-does-it-matter-which-is-used

docker ps will show only running containers by default. To see all containers: docker ps -a
https://docs.docker.com/v1.11/engine/reference/commandline/ps/

https://8081-dot-2337103-dot-devshell.appspot.com/tree/datalab root1234 - paraphrase

DataLab gives you the ability to share a notebook with other people while using the cloud for computing and storage.
https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-usage-python
https://github.com/google/google-api-javascript-client
http://stackoverflow.com/questions/12479895/obtaining-bigquery-data-from-javascript-code

Python Data Analysis Library
http://pandas.pydata.org/

Day 17 Periodic Data Science

Becoming a Data Scientist: Profiling Cisco’s Data Science Certification Program http://blog.kaggle.com/2017/03/02/becoming-a-data-scientist-profiling-ciscos-data-science-certification-program/?utm_source=Mailing+list&utm_campaign=8ed002c926-Kaggle_Newsletter_04-11-2017&utm_medium=email&utm_term=0_f42f9df1e1-8ed002c926-402242277 Wow, found this on R-Bloggers, quite awesome. I have certain blocks from each legend, but still a long way to go.

Day 16 DataProc , Google Cloud Solutions

The Beginner’s Guide to Nano, the Linux Command-Line Text Editor
https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/

So far, what I have gathered from the Google Cloud Platform course is that the main advantage of GCP does not depend on the technology, whether SQL or NoSQL databases, Spark or Hadoop: GCP lets you run your programs for just a specific amount of time, unlike other scenarios where one would end up needing a lot of hardware or software to perform these heavy processing tasks.

Microservices Architecture on Google App Engine
https://cloud.google.com/appengine/docs/standard/python/microservices-on-app-engine

Day 15 GCP CloudSql Lab, Hadoop

bash ./find_my_ip.sh
cd training-data-analyst/CPB100/lab3a

Note: If you lose your Cloud Shell VM due to inactivity, you will have to reauthorize your new Cloud Shell VM with Cloud SQL. For your convenience, lab3a includes a script called authorize_cloudshell.sh that you can run.
https://cloud.google.com/certification/data-engineer
https://cloud.google.com/sql/docs/mysql/connect-compute-engine
https://cloud.google.com/solutions/mysql-remote-access

You have to make sure the console IP is specified in the Authorized Networks.

PySpark Cheat Sheet
http://www.datasciencecentral.com/profiles/blogs/pyspark-cheat-sheet-spark-in-python
https://www.datacamp.com/community/blog/pyspark-cheat-sheet-python#gs.0vl89jQ

DataProc is Google's managed service for Hadoop, Spark, Pig and Hive programs.

The Hadoop Ecosystem Table
https://hadoopecosystemtable.github.io/
HBase, Sqoop, Flume and More: Apache Hadoop Defined
http://wikibon.org/wiki/v/HBase,_Sqoop,_Flume_and_More:_Apache_Hadoop_Defined
http://www

Day 15 GCP Recommendations , Cloud SQL PySpark DataProc

Collaborative Filtering - RDD-based API https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html PySpark https://spark.apache.org/docs/0.9.0/python-programming-guide.html Managed Hadoop & Spark https://cloud.google.com/dataproc/ Fully-Managed PostgreSQL  BETA  & MySQL https://cloud.google.com/sql/ Cloud sql can run 1 petabit per second

Day 14 Swirl

install.packages("swirl")
https://github.com/swirldev/swirl_courses#swirl-courses
https://en.wikipedia.org/wiki/YAML
http://yaml.org/

| You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key. If you are
| already at the prompt, type bye() to exit and save your progress. When you exit properly, you'll see a
| short message letting you know you've done so.

| When you are at the R prompt (>):
| -- Typing skip() allows you to skip the current question.
| -- Typing play() lets you experiment with R on your own; swirl will ignore what you do...
| -- UNTIL you type nxt() which will regain swirl's attention.
| -- Typing bye() causes swirl to exit. Your progress will be saved.
| -- Typing main() returns you to swirl's main menu.
| -- Typing info() displays these options again.

| Let's get started!

sqrt() function and to take the absolute value, use the abs() function
Vector of unequal length Arithmetic Op

Day 14 Subsetting Matrix , Partial Matching , Missing NA Values

In a matrix, subsetting a single row or column returns a vector by default.

Remove rows with NAs (missing values) in data.frame
http://stackoverflow.com/questions/4862178/remove-rows-with-nas-missing-values-in-data-frame
Find Complete Cases
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/complete.cases.html
Repeating a repeated sequence
http://stackoverflow.com/questions/11180125/repeating-a-repeated-sequence
http://stackoverflow.com/questions/3672527/r-generate-a-repeating-sequence-based-on-vector
Error in complete.cases(x, y) : not all arguments have the same length
http://stackoverflow.com/questions/4740244/chisq-test-error-message
True Matrix Multiplication
https://stat.ethz.ch/R-manual/R-devel/library/base/html/matmult.html
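The points in this post can be tried out in a few lines. This is a minimal sketch with made-up data (the matrices and the data frame are illustrative only), showing the default drop-to-vector behaviour, complete.cases(), and element-wise versus true matrix multiplication:

```r
# Subsetting one row of a matrix drops the dimensions by default
m <- matrix(1:6, nrow = 2, ncol = 3)
row_vec <- m[1, ]                # plain integer vector
row_mat <- m[1, , drop = FALSE]  # stays a 1 x 3 matrix

# complete.cases() flags the rows that have no missing values
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
clean <- df[complete.cases(df), ]

# `*` multiplies element by element; `%*%` is true matrix multiplication
sq <- matrix(1:4, nrow = 2)
elementwise <- sq * sq
true_product <- sq %*% sq
```

Passing drop = FALSE is the usual way to keep a result as a matrix when later code relies on dim().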

Day 13 Saving R Data , Subsetting

Saving R Data
http://thomasleeper.com/Rcourse/Tutorials/savingdata.html
Difference between dput and dump? (self.Rlanguage)
https://www.reddit.com/r/Rlanguage/comments/2po2i3/difference_between_dput_and_dump/
How can I time my code? | R FAQ
http://stats.idre.ucla.edu/r/faq/how-can-i-time-my-code/

Subsetting a list with a single bracket [ always returns an object of the same class (a list).
Subsetting a list with [[ may or may not return the same type.
There is only one exception to [[ vs $
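The list-subsetting notes above are easier to remember with a concrete list. A minimal sketch, assuming a small made-up list (the element names are illustrative); it also shows two well-known differences between [[ and $, which may or may not be the exception the note had in mind:

```r
x <- list(aardvark = 1:5, bar = 0.6)

single <- x["bar"]     # still a list, of length 1
double <- x[["bar"]]   # the element itself, a numeric value

# $ allows partial matching of names; [[ is exact by default
by_dollar  <- x$a                      # matches "aardvark"
by_bracket <- x[["a"]]                 # NULL, no element is named exactly "a"
by_inexact <- x[["a", exact = FALSE]]  # partial matching switched on

# [[ works with a computed name, $ looks for a literal name
name <- "bar"
computed <- x[[name]]  # 0.6
literal  <- x$name     # NULL, there is no element called "name"
```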

Day 13 R on Coursera , Reading Datasets

Attributes in R - names, length, class, dimensions
Complex vectors: 1+0i (imaginary number)
Vector function: vector()
Coercion when mixing vectors
Explicit coercion: as.numeric(), as.logical(), as.character(), as.complex()
attributes(), dim()
Attach 2 columns or 2 rows: cbind(), rbind()
table(x)
is.na(), is.nan()
nrow(), ncol()

Reading strings that contain whitespace into R from tab delimited .txt file
http://stackoverflow.com/questions/11199496/reading-strings-that-contain-whitespace-into-r-from-tab-delimited-txt-file
Difference between read.table and read.delim functions
http://stackoverflow.com/questions/10599708/difference-between-read-table-and-read-delim-functions

read.delim(file, header = FALSE, sep = "\t", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",", fill = TRUE, comment.ch
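The coercion and binding functions listed in this post can be checked quickly at the prompt. A short sketch with made-up values, covering implicit coercion when mixing types, explicit coercion, cbind()/rbind(), and the difference between is.na() and is.nan():

```r
# Implicit coercion: mixing types in c() falls back to the most general type
mixed_chr <- c(1.7, "a")   # character
mixed_num <- c(TRUE, 2)    # numeric, TRUE becomes 1
mixed_lgl <- c("a", TRUE)  # character, TRUE becomes "TRUE"

# Explicit coercion: values that cannot be converted become NA (with a warning)
nums <- suppressWarnings(as.numeric(c("1", "2", "x")))
cplx <- as.complex(1)      # 1+0i

# cbind() attaches columns, rbind() attaches rows
by_col <- cbind(1:3, 4:6)  # 3 x 2
by_row <- rbind(1:3, 4:6)  # 2 x 3

# NaN counts as missing for is.na(), but NA is not NaN
na_of_nan <- is.na(NaN)
nan_of_na <- is.nan(NA)
```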

Day 12 R Programming on Coursera, History

Beginner trying to figure out how to import a simple csv file into R
http://stackoverflow.com/questions/10417938/beginner-trying-to-figure-out-how-to-import-a-simple-csv-file-into-r
How to Set Working Directory in R
http://rprogramming.net/set-working-directory-in-r/
A short list of the most useful R commands
https://www.personality-project.org/r/r.commands.html

The purpose of S and R was to let people use the language easily, without going deep into programming. Once users are familiar with basic statistics, they should be able to program more efficiently and grow into programmers.
https://www.r-bloggers.com/ross-ihaka-on-the-history-of-the-r-project/

Day 12 GCP , SDK , Open Data

Interacting with Cloud Storage
https://cloud.google.com/storage/docs/
We can upload data to the cloud using gsutil or GCP Console SSH
Google Cloud SDK
https://cloud.google.com/sdk/docs/
Open Data for UAE
http://opendata.fcsa.gov.ae/
https://cloud.google.com/sdk/docs/quickstart-windows
Transfer Services
Latency and Zones - Distribute the data across different zones

4 Practical “less” Command Examples and tips for effective navigation in Linux:
http://www.sanfoundry.com/4-practical-less-command-examples-and-tips-effective-navigation-in-linux/
What's a .sh file?
http://stackoverflow.com/questions/13805295/whats-a-sh-file
What is a file with extension .sh? It is a Bourne shell script. They are used in many variations of UNIX-like operating systems. They have no "language" and are interpreted by your shell (interpreter of terminal commands) or if the first line is in the form #!/path/to/interpreter the

Day 11 GCP First Project Compute Engine

So I created my first project. A credit card is required to create a Google Cloud account, and $300 of credit is given for a one-year trial. It has a really cool dashboard, and most of the words are jargon to me at this point; hopefully I will become an expert on these.

Why Google Cloud Platform?
https://cloud.google.com/why-google/
Custom Machine Types
https://cloud.google.com/custom-machine-types/

Instead of purchasing physical hardware and continually upgrading it, a cloud solution lets you select only the configuration you need at the moment, and upgrading is a lot easier. You also get all the services along with it.

Preemptible VMs
https://cloud.google.com/preemptible-vms/
https://codelabs.developers.google.com/codelabs/cpb100-compute-engine/#0

To find some information about the Compute Engine instance, type the following into the command-line: cat /proc/cpuinfo

Day 10 Google Cloud Platform P2, Code Labs

Atomic Fiction Walks “The Walk” https://cloudplatform.googleblog.com/2015/10/Atomic-Fiction-walks-The-Walk.html Market Reconstruction 2.0: A Financial Services Application of Google Cloud Bigtable and Google Cloud Dataflow https://cloud.google.com/customers/fis/ https://www.fisglobal.com/Solutions/Institutional-and-Wholesale/Broker-Dealer/-/media/FISGlobal/Files/Whitepaper/A-Financial-Services-Application-of-Google-Cloud-Bigtable-and-Google-Cloud-Dataflow.pdf Google Analytics Premium + Google BigQuery for Predictive Digital Marketing https://cloud.google.com/solutions/google-analytics-bigquery CPB100 https://codelabs.developers.google.com/cpb100 https://codelabs.developers.google.com/codelabs/cpb100-free-trial/index.html?index=..%2F..%2Fcpb100#0 https://console.cloud.google.com/freetrial?pli=1&page=0

Day 10 Google Cloud Platform

The 3rd wave of cloud is about using the maximum processing power available for a given task and paying only for the amount required, unlike the 2nd wave, where you had to own dedicated machines and were limited by the processing power of those machines. In the 3rd wave you pay only for the amount of time a task requires, e.g. Spotify Engineering.

Spotify's journey to cloud: why Spotify migrated its event delivery system from Kafka to Google Cloud Pub/Sub
https://cloud.google.com/blog/big-data/2016/03/spotifys-journey-to-cloud-why-spotify-migrated-its-event-delivery-system-from-kafka-to-google-cloud-pubsub

Day 10 Coursera Short Course Data and Machine Learning on Google Cloud Platform

Coursera Short Course Data and Machine Learning on Google Cloud Platform https://www.coursera.org/learn/gcp-big-data-ml-fundamentals/lecture/EewWO/introduction-to-the-data-and-machine-learning-specialization MapReduce Applications and Limitations of MapReduce http://mapreduce-specifics.wikispaces.asu.edu/Applications+and+Limitations+of+MapReduce HADOOP – ADVANTAGES AND DISADVANTAGES http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html Google I/O: Hello Dataflow, Goodbye MapReduce http://www.informationweek.com/cloud/software-as-a-service/google-i-o-hello-dataflow-goodbye-mapreduce/d/d-id/1278917 GOOGLE CLOUD BIG DATA AND MACHINE LEARNING BLOG https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/ Colossus: Successor to the Google File System (GFS)

Day 9 MOOCs

What Meaningful Careers Exist In Data Science? https://www.forbes.com/sites/quora/2017/03/31/what-meaningful-careers-exist-in-data-science/?utm_content=buffereeb38&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer#6ebe81e3d266 What are the best data science MOOCs? https://www.quora.com/What-are-the-best-data-science-MOOCs?ref=forbes&rel_pos=1

Day 8 Data Frames

invalid factor level, NA generated http://stackoverflow.com/questions/16819956/invalid-factor-level-na-generated Convert data.frame columns from factors to characters http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters?noredirect=1&lq=1
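The warning named in the first link is easy to reproduce, and the second link describes one common way around it. A minimal sketch with made-up data (the sizes vector and the df data frame are illustrative only):

```r
sizes <- factor(c("small", "large"))

# Assigning a value that is not an existing level produces NA and the warning
# "invalid factor level, NA generated" (suppressed here to keep the output quiet)
sizes_bad <- sizes
suppressWarnings(sizes_bad[2] <- "medium")

# Option 1: work with characters instead of factors
sizes_chr <- as.character(sizes)
sizes_chr[2] <- "medium"

# Option 2: add the new level first, then assign
sizes_ok <- sizes
levels(sizes_ok) <- c(levels(sizes_ok), "medium")
sizes_ok[2] <- "medium"

# Converting every factor column of a data frame to character
df <- data.frame(size = factor(c("small", "large")), n = 1:2)
is_fac <- sapply(df, is.factor)
df[is_fac] <- lapply(df[is_fac], as.character)
```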

Day 8 Include R Source files , Google Data Studio , R Data Frames , Completed Module 1

http://stackoverflow.com/questions/6456501/how-to-include-source-r-script-in-other-scripts

Missing Values (NA)
Sometimes values in a vector are missing and you have to show them using NA, which is a special value in R for "Not Available". For example, if you don't know the age restriction for some movies, you can use NA.

In [5]:
age_restric <- c(14, 12, 10, NA, 18, NA)
age_restric
is.na(age_restric)
Out[5]:
[1] 14 12 10 NA 18 NA
Out[5]:
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE

Google Data Studio
https://www.google.com/analytics/data-studio/
How to delete multiple values from a vector?
http://stackoverflow.com/questions/9665984/how-to-delete-multiple-values-from-a-vector

Completed Module 1 on Coursera. It has a brief introduction to Data Science, and the focus is more on getting the tools ready. Can't wait for Module 2 :P
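The NA example from this post extends naturally to counting and dropping missing values, and to the linked question about deleting several values from a vector. A short sketch, assuming the same age restriction values as in the output above:

```r
age_restric <- c(14, 12, 10, NA, 18, NA)

# Count the missing values
n_missing <- sum(is.na(age_restric))

# Keep only the known values
known <- age_restric[!is.na(age_restric)]

# Many summary functions accept na.rm = TRUE to skip missing values
avg_age <- mean(age_restric, na.rm = TRUE)

# Delete several values at once with %in%
kept <- known[!known %in% c(10, 18)]
```

Without na.rm = TRUE, mean() on a vector containing NA returns NA.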

Day 8 Data Scientist resources

Seven Ways to Be More Curious https://www.psychologytoday.com/blog/finding-the-next-einstein/201407/seven-ways-be-more-curious Curiosity: The One Superpower We Don't Use Enough, And How To Use It https://www.forbes.com/sites/lawtonursrey/2014/06/20/curiosity-the-one-superpower-we-dont-use-enough-and-how-to-use-it/#7c82cd95624f 10 Reasons Why You Should Be Curious http://www.marcandangel.com/2007/08/24/10-reasons-why-you-should-be-curious/ Tools for improving structured thinking (for analysts) https://www.analyticsvidhya.com/blog/2014/02/tools-structured-thinking/ https://www.analyticsvidhya.com/blog/2013/06/art-structured-thinking-analyzing/ Critical Thinking: Where to Begin http://www.criticalthinking.org/pages/critical-thinking-where-to-begin/796 How to Use Design Thinking Methods to Improve Your Nonprofit’s Strategy and Measurement http://www.bethkanter.org/design-thinking/ Int

Day 7 The Elements of Data Analytic Style , Supervised vs Unsupervised Learning , Data

Supervised V Unsupervised Machine Learning -- What's The Difference? https://www.forbes.com/sites/bernardmarr/2017/03/16/supervised-v-unsupervised-machine-learning-whats-the-difference/2/#1f636ebc2080 The Elements of Data Analytic Style by Jeff Leek https://leanpub.com/datastyle Darwin Tunes https://en.wikipedia.org/wiki/DarwinTunes The home of the U.S. Government’s open data https://www.data.gov/ This Is How Much Data The Internet Gets Through In One Minute http://www.iflscience.com/technology/this-is-how-much-data-the-internet-gets-through-in-one-minute/ Big Data: Are you ready for blast-off? http://www.bbc.com/news/business-26383058

Day 7 Hands on with Git Repo and RStudio , R 101 BigDataUniversity

Removing a remote http://stackoverflow.com/questions/9224754/how-to-remove-origin-from-git-repository Kickstarting   R  - Writing R scripts https://cran.r-project.org/doc/contrib/Lemon-kickstart/kr_scrpt.html Source on Save https://support.rstudio.com/hc/en-us/articles/200484448-Editing-and-Executing-Code Ctrl+L  — Clear the Console https://support.rstudio.com/hc/en-us/articles/200404846-Working-in-the-Console User Defined Functions in R http://www.statmethods.net/management/userfunctions.html Issue pushing new code in Github http://stackoverflow.com/questions/20939648/issue-pushing-new-code-in-github Git refusing to merge unrelated histories http://stackoverflow.com/questions/37937984/git-refusing-to-merge-unrelated-histories