World Bank Series - Part 1 - Data Cleaning and Manipulation
January 31, 2016
Let get some open data to play around. World Development Indicators from World bank is good datasheet to get us started.
First download the World Development Indicators from Worldbank Data
Download the CSV format >>> Data catalog downloads (Excel | CSV)(45 mb)
Once download completes unzip the file to you desired directory.
My is "M:/R_scripts/Combine"
Let set out R directory to same folder
setwd("M:/") | |
filepath=getwd() | |
setwd(paste(filepath, "R_Script/Combine", sep="/")) |
Now lets load the required packages
| ||
suppressPackageStartupMessages(require("tidyr")) | ||
suppressPackageStartupMessages(require("reshape2")) | ||
suppressPackageStartupMessages(require("readr")) | ||
suppressPackageStartupMessages(require("googleVis")) | ||
currentDate = Sys.Date() |
Read three files
1. WDI_data -- has all data
2. WDI_country -- has info about country
3. WDI_country -- has info about Indicators
| ||
country = read_csv("WDI_Country.csv") | ||
i_name= read_csv("WDI_Series.csv") | ||
str(wdi) # To see structure of data |
First thing you will notice is that WDI_Data has more than 333000 obs. which is large-set of data.
Second thing is that its has data from 1960 to 2015, which is also long duration.
Third thing is country name has name like Arab world, low income, which as you guessed are not country name
*str(wdi)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 333560 obs. of 60 variables:
$ Country Name : chr "Arab World" "Arab World" "Arab World" "Arab World" ...
$ Country Code : chr "ARB" "ARB" "ARB" "ARB" ...*
Lets start data cleaning
Few point to always keep in mind
1. In data.frame or any database column should have similar variables i.e if col name is country name than its better it has only country name.
2. Try your best to put other info into separate column like, arab world looks like classification of country by region and low income by income gap, so its better to have separate column as region and income gap.
Let see head i.e first 10 row all data sheet and see what other data.frame have
|
This table seems helpful to find only country from WDI table.
Here you see country_name, region and income gap. We can join these two table to get only country from WDI. But there is small problem in row 8 you can see there is arab world that should not be name of country. Solution if you see column "Currency unit", you see its NA for Arab world, so if currency name is "NA" than its not country. Lets get subset of country with country name and with currency unit only.
| ||
select = c("Table Name", "Region")) # if currency unit is blank its not country | ||
colnames(country_sub) <- c("Country Name", "Region") |
Lets only keep required col names in wdi
|
Now lets left_join to get only country name
|
But still we have 283798 row of data which is to much info for any analysis
If you see col "Indicator Name" its seems way to subset data and few names look like of some category
Let explore more data.frame
|
This data.frame has Topic and category name, Its seems another good way to get sub-category. Lets get only one topic at a time.
|
Other way of making subset of data is by selecting multiple Indicator Name
| ||
# i_name_sub = subset(i_name, | ||
i_name$`Indicator Name`=="GDP growth (annual %)" | | ||
i_name$`Indicator Name`=="Population density" | | ||
i_name$`Indicator Name`=="CO2 emissions (kt)" | | ||
i_name$`Indicator Name`=="Access to electricity" | | ||
i_name$`Indicator Name`=="Forest area (% of land area)" | | ||
i_name$`Indicator Name`=="Inflation, consumer prices" | | ||
i_name$`Indicator Name`=="Life expectancy at birth" | | ||
i_name$`Indicator Name`=="Birth rate, crude (per 1,000 people)" , select="Indicator Name") | ||
At last lets left_join to get data of only required topic
####let join all data to get only required data | |
wdi_sub = left_join(i_name_sub, wdi_sub) |
4 comments
betmatik
ReplyDeletekralbet
betpark
tipobet
slot siteleri
kibris bahis siteleri
poker siteleri
bonus veren siteler
mobil ödeme bahis
AWCGU5
bolu
ReplyDeletebursa
çanakkale
çorum
denizli
HO5
شركة مكافحة حشرات بالدمام mc4Cdzxw4y
ReplyDeleteتسليك مجاري twuoQrEeTK
ReplyDelete