Good coding format and Practices in R

There are many recommended coding standards and layouts. Badly written code is a big pain for any reader, so it is always better to format your code well and follow a few standards. My favorite layout is described below:
  • Always start your code with a description, because once you have written a lot of code, file names alone become confusing. The description should give a good name, followed by what the code does and then the files needed to run it. This will save you and others a lot of time in the long run.
######################Daily_mail and dispatch_cockpit###############################
#######open VPN Client ######
##Send a mail to all seller manager and make output for dispatch cockpit
#delisted file from BI, order from BOB
  • Then always load all the packages needed for the analysis (this makes it easy to see which packages are required to run the code when you share the file). Always use the suppressPackageStartupMessages function; it keeps the output clean.
#################load required package
#library() stops with an error if a package is missing; require() only returns FALSE
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("mailR"))
suppressPackageStartupMessages(library("lubridate"))
suppressPackageStartupMessages(library("htmlTable"))
suppressPackageStartupMessages(library("googlesheets"))
currentDate = Sys.Date() ##current date to make folder and use in file name
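
A missing package usually shows up only as an error halfway through a script. Here is a minimal sketch of one way to check up front. The helper name check_packages is hypothetical, and the package names in the example call are placeholders (the last one is deliberately fake):

```r
# Hypothetical helper: return the names of packages that are not installed.
check_packages = function(pkgs) {
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R; the last name is deliberately fake.
missing_pkgs = check_packages(c("stats", "utils", "notARealPackage123"))
if (length(missing_pkgs) > 0) {
  message("Install these packages first: ", paste(missing_pkgs, collapse = ", "))
}
```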

  • Set R's working directory to the folder that holds all the input files. If you run R code daily for a repetitive task, always keep separate folders for input and output (for output you can create a new folder for each day and keep that day's input there).
#set working directory to the required input folder
setwd("M:/R_Script")
filepath=getwd()
setwd(paste(filepath, "Input", sep="/"))
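
Changing the working directory works, but the same paths can also be built without setwd. This is a sketch, assuming a project folder with Input and Output subfolders. tempdir() stands in for M:/R_Script so the example runs on any machine:

```r
# tempdir() is a stand-in for the real project folder, e.g. "M:/R_Script".
project_dir = tempdir()
input_dir   = file.path(project_dir, "Input")
output_dir  = file.path(project_dir, "Output")

# Create both folders if they do not exist yet.
dir.create(input_dir,  showWarnings = FALSE, recursive = TRUE)
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

# Build the full path of an input file instead of relying on the working directory.
seller_path = file.path(input_dir, "sellers_delisting.csv")
```

The full path can then be passed straight to read.csv, so the script behaves the same whatever the current working directory is.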

  • Whenever you can, import all files at the start of the analysis.

seller = read.csv("sellers_delisting.csv", stringsAsFactors = FALSE)
order = read.csv2("order.csv")
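
When a folder holds many input files, one option is to read them all into a named list in a single step. This sketch assumes every file is a comma-separated CSV. The demo folder and its two small files are created here only to keep the example self-contained:

```r
# Demo folder with two small CSV files (stand-ins for real input files).
demo_dir = file.path(tempdir(), "Input_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2), file.path(demo_dir, "seller.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3), file.path(demo_dir, "order.csv"), row.names = FALSE)

# Read every CSV in the folder; name each list element after its file.
csv_files = list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
inputs = lapply(csv_files, read.csv, stringsAsFactors = FALSE)
names(inputs) = tools::file_path_sans_ext(basename(csv_files))

nrow(inputs$order)  # number of rows read from order.csv
```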

  • While writing code, if you are reading a heavy file or reading from a database, always make a copy of the original and keep the original untouched as you progress (say I imported a file as bob: I then make a copy of bob and do all the analysis on the copy). You will make mistakes while writing code, and importing the original file again each time is tedious.

order_new = order
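
Because R copies an object when the copy is modified, changes to order_new never touch the original. A small illustration with made-up data:

```r
# Made-up stand-in for the imported file.
order = data.frame(id = 1:3, amount = c(100, 250, 75))

order_new = order                 # work on the copy from here on
order_new$amount[2] = NA          # a mistake made while experimenting

order$amount                      # the original still holds all three amounts
order_new = order                 # start over without importing the file again
```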

  • If you are making many subsets of data, always give them the same name, like "temp", and give the summary of each subset a relevant name.
temp = subset(seller, seller$Date.delisted > Sys.Date()-30 &
              seller$Status == "Delisted", select = c("Seller.Name", "Reason.for.delisting"))
#summarize: count delisted sellers by name and reason
seller_delisted = table(temp$Seller.Name, temp$Reason.for.delisting)
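
To try the subset-then-summarize pattern without the real file, here is a self-contained sketch on made-up data. The column names follow the code above, and the dates are set relative to today so the 30-day filter always has something to keep:

```r
# Made-up sellers; dates are relative to today so the example always works.
seller = data.frame(
  Seller.Name = c("A", "B", "C", "A"),
  Status = c("Delisted", "Active", "Delisted", "Delisted"),
  Reason.for.delisting = c("Fraud", NA, "Inactive", "Fraud"),
  Date.delisted = Sys.Date() - c(5, 10, 60, 20),
  stringsAsFactors = FALSE
)

# Keep sellers delisted in the last 30 days, then count by seller and reason.
temp = subset(seller, Date.delisted > Sys.Date() - 30 & Status == "Delisted",
              select = c("Seller.Name", "Reason.for.delisting"))
seller_delisted = table(temp$Seller.Name, temp$Reason.for.delisting)
```

Inside subset the columns can be referred to by name, so the seller$ prefix is optional.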

  • When you save output, always save it in the output folder or today's folder, with the date in the file name. It will save you from a lot of confusion.



#Save the file
setwd("M:/Daily/Daily")
dir.create(as.character(currentDate), showWarnings = FALSE) #new folder named after the current date
setwd(paste("M:/Daily/Daily", currentDate, sep="/"))
csvFileName1 = paste0("Threshold limit and seller delisted ", currentDate, ".csv") #File name with date
write.csv(seller_delisted, file=csvFileName1, row.names = TRUE) #row names of the table hold the seller names, so keep them
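
The same dated-folder idea can be written without changing the working directory. In this sketch tempdir() stands in for the real output root, and the file name and summary data are made up for illustration:

```r
currentDate = Sys.Date()

# tempdir() stands in for the real output root, e.g. "M:/Daily/Daily".
out_dir = file.path(tempdir(), as.character(currentDate))
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# File name that carries the date, built without stray spaces.
out_file = file.path(out_dir, paste0("seller_delisted_", currentDate, ".csv"))

# Made-up summary; a data frame keeps its column headers in the CSV.
summary_df = data.frame(Seller.Name = c("A", "B"), Delisted = c(2, 1))
write.csv(summary_df, file = out_file, row.names = FALSE)
```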

  • When you save code that needs further fine-tuning, always use git to commit it or put a version in the file name, like text_v1.R, then text_v2.R, and so on.
  • If you are running several scripts one after another, always remove all variables from R once a single analysis is complete, so that old variables do not interfere with the variables of the next script.
rm(list=ls())
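
rm(list=ls()) clears everything, including objects you may still want. One alternative, offered as a sketch rather than a rule, is to run each analysis inside local() so that its temporary variables disappear on their own:

```r
# Each analysis runs in its own temporary environment.
result_a = local({
  temp = c(1, 2, 3)        # exists only inside this block
  sum(temp)
})

result_b = local({
  temp = c(10, 20)         # a different temp, no clash with the one above
  mean(temp)
})

exists("temp")             # FALSE when no temp was defined outside the blocks
```

Only result_a and result_b remain afterwards, so there is nothing left over to interfere with the next script.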
Now you are ready to write lucid code.

Software and Setup for Analysis (R)

Let's start by downloading all the software and setting up a few accounts.

Software to download
1. R -- Programming language for statisticians, by statisticians
2. RStudio -- IDE for R
3. Tableau Public -- Data visualization tool

Make an account on
1. GitHub -- Distributed revision control and source code management
2.  -- Markdown(HTML, pdf report from R) hosting site
3. Tableau Public -- Hosting your Tableau Public workbooks

Once everything is set up, open RStudio and choose which R version to use, 64-bit or 32-bit. Choose 32-bit to begin with, as 64-bit has some issues with packages like mailR and ODBC. You can switch to the 64-bit version later from the RStudio UI whenever you need it.

Once everything is set up, go through a few basic tutorials on YouTube or Datacamp.
A few of the best resources to learn the basics of R:
1. TryR
2. Datacamp -- play and learn R
3. The Analytics Edge | Edx -- Edx course, it's very useful
4. R-programming | Coursera -- Learn from the fundamentals
5. R-bloggers -- Content collected from bloggers
6. R_books -- List of useful R books
7. Online-learning_R -- Blog post

Advice: Don't try to become an expert (i.e. learn everything) before writing programs. After learning the basics, try to write code that helps you in your work. You will write shit code (when I see my old code I feel that too), but you will learn very fast.
If you are not able to solve something, just google it. You will find solutions on Google to everything from simple problems, like changing column names, to writing complex loops, so always give it a try.

List of most useful packages in R
1. dplyr, tidyr, reshape2, stringr -- Data manipulation (sqldf if you know SQL)
2. ggplot2, plotly, googleVis, htmlwidgets, shiny -- Data visualization
3. Markdown -- Reporting
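
As a small taste of the data manipulation packages, here is a sketch that totals order amounts per seller with dplyr. The data is made up, and the code runs only if dplyr is installed:

```r
# Runs only if dplyr is installed; otherwise prints a hint.
if (requireNamespace("dplyr", quietly = TRUE)) {
  orders = data.frame(seller = c("A", "B", "A"), amount = c(100, 250, 75))
  # Group the orders by seller and add up the amounts in each group.
  by_seller = dplyr::summarise(dplyr::group_by(orders, seller), total = sum(amount))
  print(by_seller)
} else {
  message("dplyr is not installed; try install.packages(\"dplyr\")")
}
```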

Just play around with a few packages. You will get hooked on them, as they make your life so easy: with just a few lines of code you will have beautiful visualizations, cleaned data and a summary of the data.
Now we are all set to start automating and producing beautiful, lucid visualizations.




About Me and Blog


The TED talk by Hans Rosling (video below) is the most beautiful example of how interesting data can be. I had never thought data could be this interesting and lucid. Since watching it, I have always dreamed of making data this useful and lucid, no matter where I am. This blog is all about that: making data interesting and lucid using various tools, mainly R to automate tasks, build visualizations and run different analyses; Tableau to produce powerful dashboards; Excel for sharing; and Python for analysis and other automation tasks.

Hans Rosling TED Talk

Software to be used:
1. R, Tableau Public, Python: All are free to download.
2. Excel: Not free, but almost everyone has it.
3. GitHub (GitLab is an alternative): All the code will be made freely available on GitHub.

A little history on why R, Tableau and Python
I came across R when I was working at Kaymu; once I got into the basics and wrote a little code, I was hooked. I was (and am) working at Kaymu Nepal as a business analyst, so my daily routine is to produce a few daily reports (which was tedious, as I had to repeat the same steps every day), run various analyses (anyone who has worked in Excel with more than 100k rows of data and vlookup will understand my pain) and develop new systems (anything from a new Google Sheets team headquarters to a new Access database to handle logistics).
I had made a basic rule: no regular report should take me more than 5 minutes. R was a lifesaver. I automated all the input files for my reporting using R and used Excel for data presentation. R is very fast for data manipulation (for me, as I deal with 500k rows at most), and also for data visualization and analysis. Mostly I have been using R to automate my tasks, followed by analysis and then visualization (visualization is a bit of a problem, as I am the only one who runs R in my office, so I take the raw input from R and use Excel to create various dashboards that can be easily shared with anyone).
If you use Excel for any analysis of more than 20k rows and to do repetitive tasks every day, then I would advise you to switch to R. The learning curve of R is steep, but once you get used to it, you will save a lot of time. It used to take me 3 hours a day to produce all my reports; now it takes less than 2 minutes (some reports are generated and emailed automatically, so all I have to do is open my laptop every morning; for the other reports the inputs are produced, and I just put them into a pre-existing Excel template and the reports are ready).

Tableau is a very powerful data visualization tool and is easy to use. What I love about data is its ability to tell a story and its ability to make an impact. These two things can only be achieved if anyone (including someone with zero knowledge of programming) can play around with the data, and this is where Tableau comes in. Simple yet powerful and lucid dashboards can be prepared from a worksheet of 100k rows (up to 1000k per worksheet) in a very short amount of time.

Python, I think, does not need much explanation: it is one programming language that can do almost anything.

Tableau is not a replacement for R in data visualization; the two programs go side by side. Some data visualizations can be made with one line of code in R but take ages to make in Tableau, and vice versa, so it is wise to decide according to your requirements. Nor are Python and R replacements for each other; they too go side by side.

About blog
I will be sharing my experience with R, Python and Tableau: various analyses, automation tricks, data visualization techniques, and data science topics like regression and forecasting. In addition, I will post a series on bank data analysis and beautiful visualization, with step-by-step guides (from scratch to visualizations like those created by Hans, and more).

Welcome to my Blog





