World Bank Series - Part 1 - Data Cleaning and Manipulation

January 31, 2016

Let get some open data to play around. World Development Indicators from World bank is good datasheet to get us started.
First download the World Development Indicators from Worldbank Data 
Download the CSV format >>> Data catalog downloads (Excel | CSV)(45 mb)
Once download completes unzip the file to you desired directory.
My is "M:/R_scripts/Combine"
Let set out R directory to same folder
setwd("M:/")
filepath=getwd()
setwd(paste(filepath, "R_Script/Combine", sep="/"))

Now lets load the required packages

##if (!require("dplyr")) install.packages('dplyr') # if you are not sure if package is installed
suppressPackageStartupMessages(require("dplyr"))
suppressPackageStartupMessages(require("tidyr"))
suppressPackageStartupMessages(require("reshape2"))
suppressPackageStartupMessages(require("readr"))
suppressPackageStartupMessages(require("googleVis"))
currentDate = Sys.Date()

Read three files 
1. WDI_data -- has all data 
2. WDI_country -- has info about country
3. WDI_country --  has info about Indicators
#####readfile from your directory
wdi
= read_csv("WDI_Data.csv")
country = read_csv("WDI_Country.csv")
i_name= read_csv("WDI_Series.csv")
str(wdi) # To see structure of data


First thing you will notice is that WDI_Data has more than 333000 obs. which is large-set of data.
Second thing is that its has data from 1960 to 2015, which is also long duration.
Third thing is country name has name like Arab world, low income, which as you guessed are not country name
*str(wdi)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 333560 obs. of  60 variables:
 $ Country Name  : chr  "Arab World" "Arab World" "Arab World" "Arab World" ...
 $ Country Code  : chr  "ARB" "ARB" "ARB" "ARB" ...*

Lets start data cleaning 
Few point to always keep in mind
1. In data.frame or any database column should have similar variables i.e if col name is country name than its better it has only country name.
2. Try your best to put other info into separate column like, arab world looks like classification of country by region and low income by income gap, so its better to have separate column as region and income gap.

Let see head i.e first 10 row all data sheet and see what other data.frame have
##see first 10 rows country
head(country)

This table seems helpful to find only country from WDI table.
Here you see country_name, region and income gap. We can join these two table to get only country from WDI. But there is small problem in row 8 you can see there is arab world that should not be name of country. Solution if you see column "Currency unit", you see its NA for Arab world, so if currency name is "NA" than its not country. Lets get subset of country with country name and with currency unit only.
##lets run anysis on country name only; #country name in wdi file has other names like summary of region
country_sub
= subset(country, country$`Currency Unit`!="" ,
select = c("Table Name", "Region")) # if currency unit is blank its not country
colnames(country_sub) <- c("Country Name", "Region")

Lets only keep required col names in wdi
#### create subset of above data, select only required row
wdi_sub
= wdi[ , c(1,3,5:60)]

Now lets left_join to get only country name
####let join all data to get only required data
wdi_sub
= left_join(country_sub, wdi_sub)

But still we have 283798 row of data which is to much info for any analysis
If you see col "Indicator Name" its seems way to subset data and few names look like of some category 
Let explore more data.frame
##see first 10 rows country
head(i_name)

This data.frame has Topic and category name, Its seems another good way to get sub-category. Lets get only one topic at a time.
##lets get only one Topic
i_name_sub
= subset(i_name, i_name$Topic=="Public Sector: Defense & arms trade", select="Indicator Name")

Other way of  making subset of data is by selecting multiple Indicator Name
##OR get multiple Indicator
#to remove multiple comment just select line click shif+ctrl+c
# i_name_sub = subset(i_name,

i_name$`Indicator Name`=="GDP growth (annual %)" |
i_name$`Indicator Name`=="Population density" |
i_name$`Indicator Name`=="CO2 emissions (kt)" |
i_name$`Indicator Name`=="Access to electricity" |
i_name$`Indicator Name`=="Forest area (% of land area)" |
i_name$`Indicator Name`=="Inflation, consumer prices" |
i_name$`Indicator Name`=="Life expectancy at birth" |
i_name$`Indicator Name`=="Birth rate, crude (per 1,000 people)" , 
select="Indicator Name")
At last lets left_join to get data of only required topic

####let join all data to get only required data
wdi_sub = left_join(i_name_sub, wdi_sub)

Finally its has only 1477 row and only related data. We  are ready for analysis.




You Might Also Like

2 comments

What I like in twitter

Contact Form

Name

Email *

Message *