Let's get started with Python

I always wanted to master one programming language (just as a hobby). Programming can make your life easier by automating all kinds of tasks. I learned C in college, but only to pass the exam -- which I did. Then I never looked back at it.
During my professional life I have used R, so I know R very well and never felt the need to learn another language. But R is mostly for data analysis and data science (which is what I do). Some time ago, while doing some research on data science, I found that R is not enough: if I wanted to be a bad-ass data scientist, I had to learn Python.
Learning Python is trickier than learning R because there is so much to learn and so many branches, unlike R (learn the concept of a data frame and dive in). I want to learn Python both for data science and as a hobby, which made my learning even trickier. After long research and personal experience, here are the steps and resources that worked best.

Steps for understanding Python from the basics:
1. Try_python -- basic -- take this course first.
2. Programming Foundations with Python -- basic -- the best course for starters who want to learn programming from the ground up and are unsure about loops and data types.
3. Python--Codecademy -- basic to advanced -- beautifully designed; you learn by doing.

After taking these courses you can start building whatever projects you like. If you want to dive deep into data:
2. How to get better at data science -- the best blog on this topic.

Need books
1. IT eBooks -- free downloads -- buy the books from Amazon if you can afford to. Support the authors.

Free online courses (just search for Python on these websites)
1. Udacity -- the most interactive and interesting courses -- first-time programmers should try the introductory course -- you can progress at your own pace.
2. Edx -- interactive but can be boring, with a fixed schedule (you can archive a course and take it later too) -- has courses from basic to advanced.
3. Coursera -- first-time programmers, don't try this -- most of the courses are boring if you don't have enthusiasm.
* This is my personal view. Maybe Coursera has very interesting and interactive courses and I always took the boring classes, and Udacity may have boring classes too. Look at all three websites and decide for yourself.

Installing Python and packages can be tricky too
My suggestion is to use the Anaconda distribution
 -- It comes with 200+ popular packages installed, along with Jupyter Notebook (IPython Notebook) and Spyder (an IDE for Python).
 -- Another advantage: you can install both 2.7 and 3.4 side by side, each with a single command.
       conda create -n python2 python=2.7 anaconda

       Activate the environment
       activate python2
 -- You can also install R and use R in Jupyter Notebook with a single line of code

       conda create -n my-r-env -c r r-essentials

Need help:
3. Google -- ha ha 

Remember: if you have to do the same task again and again, always find a way to automate it.
Happy coding :-)

Good coding format and Practices in R

There are many recommended coding standards and layouts. Badly written code is a big pain for any reader, so it's always better to format your code well and follow a few standards. My favorite layout is described below:
  • Always start your code with a description, because when you write a lot of code the names get confusing. The description should have a good name, followed by what the code does, then the files needed to run it. This will save you and others a lot of time in the long run.
######################Daily_mail and dispatch_cockpit###############################
#######open VPN Client ######
##Send a mail to all seller manager and make output for dispatch cockpit
#delisted file from BI, order from BOB
  • Then always load all packages needed for the analysis (this makes it easy to see which packages are required when you share the file). Always use the suppressPackageStartupMessages function; it keeps the output clean.
#################load required package
suppressPackageStartupMessages(require("dplyr"))
suppressPackageStartupMessages(require("mailR"))
suppressPackageStartupMessages(require("lubridate"))
suppressPackageStartupMessages(require("htmlTable"))
suppressPackageStartupMessages(require("googlesheets"))
currentDate = Sys.Date() ##current date to make folder and use in file name

  • Set R's working directory to the folder that holds all the input files. If you run R code daily for a repetitive task, always keep separate folders for input and output (for output you can create a new folder each day and keep that day's input there).
#set working directory to the input folder
setwd("M:/R_Script")
filepath=getwd()
setwd(paste(filepath, "Input", sep="/"))
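
One possible alternative to calling setwd() twice is to build full paths with file.path() and pass them straight to the read functions. This is only a minimal sketch: the folder layout and the file name example.csv are made up for illustration, and a temporary directory stands in for M:/R_Script so the snippet runs anywhere.

```r
# Hypothetical project root; a temp directory stands in for "M:/R_Script"
project_dir <- file.path(tempdir(), "R_Script")
input_dir   <- file.path(project_dir, "Input")
output_dir  <- file.path(project_dir, "Output")

# Create the folders if they do not exist yet
for (d in c(input_dir, output_dir)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Write a tiny made-up input file so the sketch runs on its own
write.csv(data.frame(id = 1:3), file.path(input_dir, "example.csv"),
          row.names = FALSE)

# Read by full path; the working directory never changes
example <- read.csv(file.path(input_dir, "example.csv"))
```

With this approach the script does not depend on which directory R happens to be in when it starts.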

  • If you can, always import all files at the start of the analysis.

seller = read.csv("sellers_delisting.csv", stringsAsFactors = F)
order = read.csv2("order.csv")

  • While writing code, if you are reading a heavy file or reading from a database, always make a copy of the original and work on the copy (say I imported a file as bob: I make a copy of bob and do all the analysis on the copy). You will make mistakes while writing code, and importing the original again is tedious.

order_new = order

  • If you are making many subsets of data, always give the subset the same name, like "temp", and give the summary of the subset a relevant name.
temp = subset(seller, seller$Date.delisted> as.Date(Sys.Date())-30 &
seller$Status =="Delisted", select = c("Seller.Name", "Reason.for.delisting"))
#summarize
seller_delisted = table(temp$Seller.Name, temp$Reason.for.delisting)
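
To see the subset-then-summarize pattern run end to end, here is a small sketch on made-up data. The column names follow the example above, but the sellers, reasons and dates are invented, and the dates are assumed to be stored as YYYY-MM-DD text.

```r
# Made-up data with the same column names as the example above
seller <- data.frame(
  Seller.Name          = c("A", "B", "A", "C"),
  Status               = c("Delisted", "Delisted", "Delisted", "Active"),
  Reason.for.delisting = c("Late", "Quality", "Late", NA),
  Date.delisted        = as.character(Sys.Date() - c(5, 10, 60, 1)),
  stringsAsFactors     = FALSE
)

# Keep only sellers delisted in the last 30 days
temp <- subset(
  seller,
  as.Date(Date.delisted) > Sys.Date() - 30 & Status == "Delisted",
  select = c("Seller.Name", "Reason.for.delisting")
)

# Count delistings per seller and reason
seller_delisted <- table(temp$Seller.Name, temp$Reason.for.delisting)
```

Inside subset() the columns can be referred to by name, so the seller$ prefix is optional.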

  • When you save output, always save it in the output folder or today's folder, with the date in the file name. It will save you from a lot of confusion.



#Save the file
setwd("M:/Daily/Daily")
dir.create(as.character(currentDate)) #new folder named with the current date
setwd(paste("M:/Daily/Daily", currentDate, sep="/"))
csvFileName1 = paste0("Threshold limit and seller delisted ", currentDate, ".csv") #File name with date
write.csv(seller_delisted, file=csvFileName1, row.names = F)
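
The same idea can be sketched without changing the working directory at all. Everything here is an assumption for illustration: a temporary directory stands in for M:/Daily/Daily, and the file name and summary data are made up.

```r
currentDate <- Sys.Date()

# Hypothetical output root; a temp directory stands in for "M:/Daily/Daily"
output_root <- file.path(tempdir(), "Daily")
today_dir   <- file.path(output_root, as.character(currentDate))
dir.create(today_dir, recursive = TRUE, showWarnings = FALSE)

# paste0() joins the pieces without adding spaces between them
csv_name <- paste0("seller_delisted_", currentDate, ".csv")
csv_path <- file.path(today_dir, csv_name)

# Made-up summary to save
summary_df <- data.frame(seller = c("A", "B"), delisted = c(1, 1))
write.csv(summary_df, csv_path, row.names = FALSE)
```

Because the date is part of both the folder and the file name, rerunning the script on another day never overwrites an older report.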

  • When you save code that needs further fine-tuning, always use git to commit it or put a version in the file name, for example text_v1.R, then text_v2.R, and so on.
  • If you are running several scripts one after another, always remove all variables from R once a single analysis is completed, so that old variables do not interfere with the variables of the next script.
rm(list=ls())
Now you are ready to write lucid code.

Vlookup in R

The first thing on my mind when I used R for the first time was: how the hell will I do a Vlookup in R? (All my reports had a vlookup at least once.) I googled it, but the answers were not lucid. If you google it, you will most probably come across merge as the answer. Merge is a base function and, like most base functions (with very few exceptions), it is complicated to use. On top of that, Excel users are not that familiar with relationships between tables. Excel users never think of data as columns; for them the info in each cell is separate. We (Excel users) think about how to add two cells, or how to look up the value of cell A1 in table B1:C10. We never think of it as looking up the values of column A in table B:C, or adding column A to column B.
Advice: If you come from an Excel background, start thinking of all data as columns and start respecting the structure of the data. In Excel you can add any two cells (A1 and A5) and put the result somewhere in C5, or keep different types of data in one column (a number in A1, a date in A2, a string in A3). This is a very bad habit. Always think of an operation as a column operation, not a cell operation. For example, if you have to add two series, put them in different columns and add those to make a third column. Any analysis, reporting or manipulation consists only of joining columns and then summarizing (visualization, modeling). Now when someone asks me for an analysis, I just need to know where the columns with that info are, how I can join them and how to summarize them. That is all there is to any reporting.
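
A minimal sketch of what "column operation" means in R, using made-up numbers; the data frame and column names are invented for illustration.

```r
# Made-up data: each column holds exactly one type of value
sales <- data.frame(
  online  = c(10, 20, 30),
  offline = c(5, 15, 25)
)

# Column operation: add two whole columns to create a third
sales$total <- sales$online + sales$offline

# Check the type of every column
sapply(sales, class)
```

There is no cell-by-cell formula to drag down; one line handles every row of the column.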
Why am I talking so much about columns in a vlookup tutorial? The reason is that in any database or programming language, a Vlookup means getting related info for one column from another table, and both tables must share a common id column.

Let's break down Vlookup.
Vlookup takes a value, say "A", finds that value "A" in another table, then pulls the info related to "A" from that table.
This is called a join in databases and in R: you take a list of values, join (match these values in the other table), then pull the info related to these values.
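
The find-then-pull steps can be sketched in base R with match(), which is probably the closest thing to a literal Vlookup. The table ref_table and its values are made up for illustration.

```r
# Made-up reference table
ref_table <- data.frame(
  id   = 1:5,
  name = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
lookup_ids <- c(4, 2, 9)  # 9 does not exist in ref_table

# Step 1: find the row position of each lookup id (NA when it is missing)
pos <- match(lookup_ids, ref_table$id)

# Step 2: pull the related info from those rows
found_names <- ref_table$name[pos]
```

An id with no match comes back as NA, much like #N/A in Excel.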

Let's take an example
##make data frame
#letters has only 26 entries, so recycle it to get 50 names
master <- data.frame(id = 1:50, name = rep(letters, length.out = 50),
date = seq(as.Date("2016-01-01"), by = "week", length.out = 50))

Now we have a different list which only has ids
##lookup value
lookup = data.frame(id = c(23, 50, 4, 45))

Now we need to look up the names of these ids in the master data.frame.
Merge? I have not used it for ages; there are easier solutions.
##load dplyr
library(dplyr)

dplyr has many user-friendly join functions.



Let's get back to the problem
##lookup
id_lookup = left_join(lookup, master, by="id") # keeps every row of lookup; if no match is found in master, the master columns are NA
or
id_lookup = right_join(master, lookup, by="id") ##both data frames should have a column with a common name
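
For comparison, and for anyone without dplyr installed, roughly the same lookup can be sketched with base merge(). The small tables below are made up, and all.x = TRUE is what makes merge() behave like a left join.

```r
# Made-up tables, smaller than the example above
master_small <- data.frame(
  id   = 1:5,
  name = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
lookup_small <- data.frame(id = c(4, 2, 9))  # 9 has no match

# all.x = TRUE keeps every row of lookup_small, filling NA where nothing matches
id_lookup_small <- merge(lookup_small, master_small, by = "id", all.x = TRUE)
```

One difference to keep in mind: merge() sorts the result by the key column by default, so the rows do not stay in the order of the lookup table.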

If the column names are different (say the lookup column is called id2) you can
id_lookup = right_join(master, lookup, by=c("id"="id2"))

or rename the column first, using either
colnames(lookup)[x] = "id" # x is the column index
lookup = rename(lookup, id=id2) # rename is a dplyr function

The new id_lookup will have the columns "id", "name" and "date". If you don't need date, you can always subset the data.frame and keep only the required columns. Or subset master to only the required columns before the join. Any way you like.

##subset of data
id_lookup = id_lookup[ , !(names(id_lookup) %in% "date")]
or
id_lookup = id_lookup[ , c("id", "name")]
or
id_lookup = id_lookup[ , c(1,2)]
or
id_lookup = subset(id_lookup, condition, select=c("id", "name")) # condition is a placeholder for a row filter
Caution: make sure a column's name matches what it holds, e.g. don't have a column named id whose observations are names. This is where respect for database structure comes in.

Get used to joins; they are all you need to perform any lookup. You rarely look up a single value; it's almost always a column lookup. The best practice is to always make a data.frame of what you have to look up and join it to the other table.


Software and Setup for Analysis (R)

Let's start by downloading all the software and setting up a few accounts.

Software to download
1. R -- Programming language for statisticians, by statisticians
2. RStudio -- IDE for R
3.  -- Data visualization tool

Make an account on
1.  -- Distributed revision control and source code management
2.  -- Markdown (HTML, PDF reports from R) hosting site
3.  -- Hosting your Tableau Public workbook.

Once everything is set up, open RStudio and choose which R version you want to use, 64-bit or 32-bit. Choose 32-bit, as 64-bit has some issues with packages like mailR, ODBC etc. You can switch to the 64-bit version later from the RStudio UI whenever you need it.

Once everything is set up, go through a few basic tutorials on YouTube or Datacamp.
A few of the best resources to learn the basics of R:
1. TryR
2. Datacamp -- play and learn R
3. The Analytics Edge | Edx -- Edx course, very useful
4. R-programming | Coursera -- learn from the fundamentals
5. R-bloggers -- content collected from bloggers
6. R_books -- list of useful R books.
7. Online-learning_R -- blog post

Advice: Don't try to become an expert (i.e. learn everything) before writing programs. After learning the basics, try to write code that helps you in your work. You will write shit code (when I see my old code I feel that too) but you will learn very fast.
If you are not able to solve something, just google it and you will find a solution. Google has answers for everything from simple problems like changing column names to writing complex loops, so always give it a try.

List of the most useful packages in R
1. dplyr, tidyr, reshape2, stringr -- Data manipulation (sqldf if you know SQL)
2. ggplot2, plotly, googleVis, htmlwidgets, shiny -- Data visualization
3. Markdown -- Reporting

Just play around with a few packages and you will get hooked, as they make your life so easy. With just a few lines of code you will have beautiful visualizations, cleaned data and a summary of the data.
Now we are all set to start automating and producing beautiful, lucid visualizations.



