Vlookup in R

The first thing on my mind when I used R for the first time was: how on earth will I do a Vlookup in R? (Every report I built used Vlookup at least once.) I googled it, and the answers were not lucid. Most likely you will come across `merge` as the answer. `merge` is a base function, and like most base functions (with a few exceptions) it is complicated to use. Besides, Excel users are not that familiar with relationships between tables; for them, the information in each cell is separate. Excel users never think of data as columns. We (Excel users) think about how to add two cells, or how to look up the value of cell A1 in table B1:C10. We never think "let's look up the values of column A in table B:C" or "add column A to column B".
Advice: if you come from an Excel background, start thinking of all data as columns and start respecting the structure of your data. In Excel you can add any two cells (A1 and A5) and put the result anywhere (say C5), or keep different types of data in one column (a number in A1, a date in A2, a string in A3). These are bad habits. Always think of any operation as a column operation, not a cell operation. If you have to add two series, put them in separate columns and add those columns to make a third. Any analysis, reporting, or manipulation consists only of joining columns and then summarizing (visualization, modeling). Now when someone asks me for an analysis, I only have to know where the columns with that information are, how to join them, and how to summarize; that is all there is to any reporting.
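As a minimal sketch of that column-wise thinking (the data frame here is made up for illustration):

```r
# made-up data: each series lives in its own column
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
# add two whole columns to make a third, instead of adding cell by cell
df$c <- df$a + df$b   # df$c is now 11 22 33
```

One line operates on the entire column at once; there is no cell-by-cell formula to copy down.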
Why am I talking so much about columns in a Vlookup tutorial? The reason is that in any database or programming language, a Vlookup means getting related information from another column, and both columns must share a common id.

Let's break down Vlookup.
Vlookup takes a value, say "A", finds "A" in another table, and pulls the information related to "A" from that table.
In databases and in R this is called joining: you take a list of values, join them (match the values against the other table), and pull the information related to those values.
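Before reaching for any package, the same idea can be sketched in base R with `match()` (the tiny table here is invented for illustration):

```r
# a tiny lookup table: ids and their related names
master <- data.frame(id = 1:5,
                     name = c("a", "b", "c", "d", "e"),
                     stringsAsFactors = FALSE)
# the values we want to look up (the first argument of a Vlookup)
ids <- c(4, 2)
# match() gives the row positions of ids within master$id;
# indexing master$name with them pulls the related info
master$name[match(ids, master$id)]   # "d" "b"
```

This is exactly Vlookup's exact-match mode: find the position, return the neighbouring column's value at that position.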

Let's take an example.
##make a data frame
master <- data.frame(id = 1:50,
                     name = make.unique(rep(letters, length.out = 50)),
                     date = seq(as.Date("2016-01-01"), by = "week", length.out = 50))

Now we have a different list which only has ids.
##lookup values
lookup <- data.frame(id = c(23, 50, 4, 45))

Now we need to look up the names of these ids in the master data.frame.
Merge? I have not used it for ages; there are easier solutions.
##load dplyr
library(dplyr)

dplyr has many user-friendly join functions.



Let's get back to the problem.
##lookup
id_lookup = left_join(lookup, master, by="id")
# keeps every row of lookup; where no match is found, name and date come back as NA
or
id_lookup = right_join(master, lookup, by="id") ##both tables need a column with the same name

If the column names are different you can join by a named vector:
##say the lookup table calls its id column "id2"
id_lookup = right_join(master, lookup, by=c("id"="id2"))

or rename the column first using
colnames(lookup)[x] = "id" # x is the column index of id2
lookup = rename(lookup, id=id2) # rename is a dplyr function; then join as above

The new id_lookup will have the column names "id", "name", "date". If you don't need date you can always subset the data.frame and keep only the required columns, or subset master to the required columns before the join. Any way you like.

##subset of data: drop date, keep id and name
id_lookup = id_lookup[ , setdiff(names(id_lookup), "date")]
or
id_lookup = id_lookup[ , c("id", "name")]
or
id_lookup = id_lookup[ , c(1, 2)]
or
id_lookup = subset(id_lookup, select=c("id", "name"))
Caution: make sure the same field carries the same name everywhere, not a column named "id" whose observations are actually names. This is where respect for database structure comes in.

Get used to joins; these are all the joins you need to perform any lookup. You rarely look up a single value; it is almost always a column lookup. Best practice: always make a data.frame of the values you have to look up and join it to the other table.
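Putting the whole lookup flow together, here is one runnable sketch (the data is invented, and it assumes dplyr is installed):

```r
library(dplyr)

# master table: ids with their related info (invented data)
master <- data.frame(id = 1:50,
                     name = make.unique(rep(letters, length.out = 50)),
                     stringsAsFactors = FALSE)

# the values we have to look up, kept as a data.frame per the best practice above
lookup <- data.frame(id = c(23, 50, 4, 45))

# the "Vlookup": join, keeping every row of lookup in its original order
result <- left_join(lookup, master, by = "id")
result$name   # "w" "x.1" "d" "s.1"
```

Note that `left_join` preserves the row order of `lookup`, just as dragging a VLOOKUP formula down a column would.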


WBS - Part 3 - lapply to Make an Individual File for Each Topic

WDI is a large set of data. In the last two posts we discussed how to do the cleaning and manipulation along with making beautiful visualizations.
But sometimes we don't have enough resources, or it gets boring to run the same code again and again. What if we had data that was already cleaned (ready for any analysis) and arranged topic-wise, so that we could easily access it when we need it, without having to do all the dirty work in R? Maybe you don't have R on another computer and want to run some analysis in Tableau or Excel on a particular topic. It is always good to have cleaned data.
We will take the big chunk of data, do all the cleaning and manipulation, then produce a CSV for each topic and save it for further access.

Let's get started.
The first part is similar to the old tutorial, so I will just paste the code here.

###download world bank data "http://data.worldbank.org/products/wdi" 
#>> "Data catalog downloads (Excel | CSV)">> "CSV"
##unzip and keep it in a directory of your choice; mine is "M:/R_scripts/Combine"
#################load required package

##if (!require("dplyr")) install.packages('dplyr') 
# use this if you are not sure whether the package is installed
suppressPackageStartupMessages(require("dplyr"))
suppressPackageStartupMessages(require("tidyr"))
suppressPackageStartupMessages(require("reshape2"))
suppressPackageStartupMessages(require("readr"))
suppressPackageStartupMessages(require("googleVis"))
currentDate = Sys.Date()

#########Set the file directory
setwd("M:/")
filepath=getwd()
setwd(paste(filepath, "R_scripts/Combine", sep="/"))

#####readfile from your directory
wdi = read_csv("WDI_Data.csv")
country = read_csv("WDI_Country.csv")
i_name= read_csv("WDI_Series.csv")

#### create subsets of the above data, keeping only the required rows and columns
## required columns from wdi
wdi_sub = wdi[ , c(1,3,5:60)]

##let's run the analysis on country names only;
#the country file also has non-country rows, like regional summaries
country_sub = subset(country, country$`Currency Unit`!="" ,
select = c("Table Name", "Region")) # if Currency Unit is blank, it is not a country
colnames(country_sub) <- c("Country Name", "Region")

Now let's make a list of all topics from i_name.

# lets make list of topic
i_name_sub = as.data.frame(table(i_name$Topic))
i_name_sub = as.character(i_name_sub[,1])
Now we are all set. Let's run a loop that takes the subset of each topic's indicator names, then left joins it with the required data frames.

###let's use lapply on each topic
lapply(i_name_sub, function(x){
## take each list as temp and get Indicator Name related to it
temp = as.character(x)
temp = subset(i_name, i_name$Topic==temp, select="Indicator Name")
##left join to get only those Indicator data and country
wdi_sub_temp = left_join(country_sub, wdi_sub)
wdi_sub_temp = left_join(temp, wdi_sub_temp)
##gather date and expand Indicator Name
wdi_sub_temp = gather(wdi_sub_temp, "years", "sample", 4:59)
colnames(wdi_sub_temp) <- c("Indicator.Name", "Country.Name","Region" ,"years", "Value")
wdi_sub_temp = dcast(wdi_sub_temp, Country.Name+years+Region~Indicator.Name, value.var = "Value", na.rm = T )
##make years as date
wdi_sub_temp$years = paste(wdi_sub_temp$years,"-01-01", sep="")
wdi_sub_temp$years=as.Date(wdi_sub_temp$years, "%Y-%m-%d")
##let make unique ID in each dataset if we want to join later on for any analysis
wdi_sub_temp$ID_for_join = paste(wdi_sub_temp$Country.Name, wdi_sub_temp$years, sep="-")
##save file
setwd(paste(filepath, "R_scripts/Output", sep="/"))
csvname = paste0(gsub(":", ",", x), ".csv") #file names can't contain ":"
write.csv(wdi_sub_temp, file=csvname, row.names = F)
setwd(filepath)
})
###a total of 91 files will be produced
###You can find all 91 files here: https://www.dropbox.com/sh/sk7f7uoz9t7mb38/AACxA8gGTXZJV90CycB4uT_Ka?dl=0
##download any file you need and play around
#happy coding
Advice: don't use 'for' or 'while' loops; avoid them as much as possible (for conditionals, use ifelse). I know you are used to for loops, but they are slow in R. Use the apply family wherever possible, no matter how small the loop. If you want to be good at R, you will have to know apply; don't look for another option. (I used for loops in R for a long time, but I had to go back to basics and learn lapply; it is inevitable.)
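As a minimal illustration of the for-to-lapply switch (toy data, same result both ways):

```r
# for-loop version: grows a result vector step by step
squares_for <- numeric(0)
for (x in 1:5) squares_for <- c(squares_for, x^2)

# lapply version: apply the function to each element, then flatten the list
squares_lapply <- unlist(lapply(1:5, function(x) x^2))

identical(squares_for, squares_lapply)   # TRUE: both give 1 4 9 16 25
```

The lapply version has no bookkeeping variable to initialize or grow, which is also the pattern the topic loop above follows: one anonymous function applied to each element of the list.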

We have all the files ready for analysis. All 91 files are available at Dropbox_WBS; you can download any file and play around.
