Titanic: Machine Learning from Disaster -- Part 2 -- Preprocessing and data visualization

March 06, 2016

Let's fill in every NA with a sensible value.

1. Age, Cabin and Embarked contain NA values. Let's start with Embarked. Only 2 values are missing, so we assign them the most frequent port of embarkation, 'S'.


	train$Embarked[which(is.na(train$Embarked))] ='S'
	table(train$Embarked, useNA = "always")

             C    Q    S <NA> 
            168   77  646    0
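
Rather than hard-coding 'S', the most frequent port can also be looked up from the data. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative stand-ins, not the Kaggle data.

```r
# Toy data standing in for train; the values are made up for illustration.
toy <- data.frame(Embarked = c("S", "C", NA, "S", "Q", NA, "S"),
                  stringsAsFactors = FALSE)

# table() ignores NA by default, so which.max() picks the most frequent port.
most_frequent <- names(which.max(table(toy$Embarked)))

# Assign the most frequent port to the missing entries.
toy$Embarked[is.na(toy$Embarked)] <- most_frequent

table(toy$Embarked, useNA = "always")
```

This keeps working if the data changes and a different port becomes the most common one.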

2. Age has 177 missing values. This is trickier: we could fill them with the overall mean age, but that may not be a good estimate. The names carry titles such as Mr., Mrs. and Master., and we can use them to make a better guess. First we find the mean age for each title, then we assign that mean to the NA ages with the same title.


	train$Name = as.character(train$Name)
	table_names = table(unlist(strsplit(train$Name, "\\s+")))
	sort(table_names[grep('\\.', names(table_names))], decreasing = TRUE)

	  Mr. Miss.  Mrs.
	  517   182   125

The remaining titles, in decreasing order of frequency, are Master., Dr., Rev., Col., Major., Mlle., Capt., Countess., Don., Jonkheer., Lady., Mme., Ms. and Sir.
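
Splitting on whitespace counts every token that contains a dot. An alternative is to pull the title out of each name with a single regular expression, which gives one title per passenger. This is only a sketch: it assumes names follow the "Surname, Title. Given names" layout, and the vector `sample_names` and the variable `titles` are illustrative names, not part of the original code.

```r
# A few names in the "Surname, Title. Given names" layout.
sample_names <- c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                  "Heikkinen, Miss. Laina",
                  "Palsson, Master. Gosta Leonard")

# Keep only the text between the first comma and the first dot after it.
titles <- sub("^[^,]*,\\s*([^.]+\\.).*$", "\\1", sample_names)

# One title per passenger, ready to be tabulated.
table(titles)
```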

	# let's get the titles of the passengers whose age is missing
	table_na = train[which(is.na(train$Age)),]
	table_names = table(unlist(strsplit(table_na$Name, "\\s+")))
	sort(table_names[grep('\\.', names(table_names))], decreasing = TRUE)

	  Mr.
	  119

The other titles with missing ages, in decreasing order of frequency, are Miss., Mrs., Master. and Dr.

	# let's get the mean age for each title
	title = c("Mr\\.", "Miss\\.", "Mrs\\.", "Master\\.", "Dr\\.")

	# mean of the known ages for each title
	sapply(title, function(x){
		mean(train$Age[grepl(x, train$Name) & !is.na(train$Age)])
	})

	Mr\.      32.3680904522613
	Miss\.    21.7739726027397
	Mrs\.     35.8981481481481
	Master\.  4.57416666666667
	Dr\.

	# let's use these means to fill the NA ages for each title
	title = c("Mr\\.", "Miss\\.", "Mrs\\.", "Master\\.", "Dr\\.", "Ms\\.")

	for (x in title){
		# rows with this title and a missing age get the mean of the known ages
		train$Age[grepl(x, train$Name) & is.na(train$Age)] = mean(train$Age[grepl(x, train$Name) & !is.na(train$Age)])
	}
	summary(train$Age)

                      Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
                     0.42   21.77   30.00   29.75   35.90   80.00
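
The same fill can be written without a loop once each row has a title of its own. The sketch below uses base R's `ave()` to compute a per-group mean and is run on a made-up data frame; the `Title` column and all values are assumptions for illustration only.

```r
# Toy data: a Title column is assumed to exist already; values are made up.
toy <- data.frame(
  Title = c("Mr.", "Mr.", "Mr.", "Miss.", "Miss.", "Master.", "Master."),
  Age   = c(30, 40, NA, 20, NA, 4, NA)
)

# ave() returns, for every row, the mean of the known ages in that row's title group.
group_mean <- ave(toy$Age, toy$Title, FUN = function(a) mean(a, na.rm = TRUE))

# Keep the known ages and use the group mean only where the age is missing.
toy$Age <- ifelse(is.na(toy$Age), group_mean, toy$Age)

toy
```

Here the missing Mr. age becomes 35, the missing Miss. age becomes 20 and the missing Master. age becomes 4.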

3. Cabin has 687 NA values. Because so many are missing, we will simply drop this column from the analysis.
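
Dropping a column takes one assignment in base R. A minimal sketch on a made-up data frame (`toy` and its values are illustrative only):

```r
# Toy data standing in for train; values are made up for illustration.
toy <- data.frame(PassengerId = 1:3,
                  Cabin = c(NA, "C85", NA),
                  Age = c(22, 38, 26),
                  stringsAsFactors = FALSE)

# Share of missing values in the column, as a quick check before dropping it.
missing_share <- mean(is.na(toy$Cabin))

# Assigning NULL removes the column from the data frame.
toy$Cabin <- NULL

names(toy)
```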

Our data is now clean and ready for analysis. Before we model anything, let's visualize the data. Finding patterns now will help when we develop the model later.

a) Sex vs Survived

	library(ggplot2)

	# Survived is assumed to be a factor here, so the discrete x scale applies
	(ggplot(train, aes(Survived, fill=Sex))
	+ geom_bar(aes(color = Sex) )
	+ xlab("")
	+ ylab("No. of Passengers")
	+ scale_x_discrete(breaks=c("0", "1"),labels=c("Perished","Survived")))

We can see that female passengers had a much higher chance of surviving.
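
The bar chart shows counts; the survival rate per sex can be checked numerically with `prop.table()`. This is a sketch on made-up data, so the rates below are not the real Titanic figures.

```r
# Toy data standing in for train; values are made up for illustration.
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 1, 0)
)

# Cross-tabulate sex against survival.
counts <- table(toy$Sex, toy$Survived)

# margin = 1 turns each row into proportions, i.e. the rate within each sex.
rates <- prop.table(counts, margin = 1)

rates
```

The column named "1" holds the share of survivors within each sex.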

b) Age vs Survived


	# Survived is assumed to be a factor here, so the fill scale is discrete
	(ggplot(train, aes(Age, fill=Survived))
	+ geom_histogram( binwidth = 2 )
	+ xlab("Age")
	+ ylab("No. of Passengers")
	+ scale_fill_discrete(breaks=c("0", "1"),labels=c("Perished","Survived")))

We can see that passengers younger than 10 had a higher survival rate.
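
To put a number on this, ages can be binned with `cut()` and the survival rate computed per bin. The sketch below uses made-up data and an assumed cutoff of 10 years; `AgeGroup` is an illustrative column name.

```r
# Toy data standing in for train; values are made up for illustration.
toy <- data.frame(
  Age      = c(2, 8, 25, 40, 33, 5, 60, 19),
  Survived = c(1, 1, 0, 0, 1, 0, 0, 1)
)

# right = FALSE makes the bins [0, 10) and [10, Inf).
toy$AgeGroup <- cut(toy$Age, breaks = c(0, 10, Inf),
                    labels = c("under 10", "10 and over"), right = FALSE)

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_group <- tapply(toy$Survived, toy$AgeGroup, mean)

rate_by_group
```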

c) Class vs Survived

	# Pclass and Survived are assumed to be factors here
	(ggplot(train, aes(Pclass, fill=Survived))
	+ geom_bar( )
	+ xlab("Class")
	+ ylab("No. of Passengers")
	+ scale_x_discrete(breaks=c("1", "2", "3"),labels=c("First","Second", "Third"))
	+ scale_fill_discrete(breaks=c("0", "1"),labels=c("Perished","Survived")))

First-class passengers had a higher survival rate.

In the next tutorial we will build our first ML model.

Github--Titanic-Machine-Learning-from-Disaster

**Happy Coding**
