Titanic: Machine Learning from Disaster -- Part 2 -- Preprocessing and data visualization

Let's assign sensible values to all the NAs.

We can see there are NAs in Age, Cabin and Embarked. Let's deal with Embarked first: since there are only 2 missing values, we assign them the most frequent port of embarkation.

train$Embarked[which(is.na(train$Embarked))] ='S'
table(train$Embarked, useNA = "always")

             C    Q    S <NA> 
            168   77  646    0 
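Rather than hard-coding 'S', the most frequent level can also be picked programmatically. A minimal sketch on a toy vector (the values below are made up, not the real Embarked column):

```r
# Toy stand-in for train$Embarked (hypothetical values)
embarked = factor(c("S", "C", "S", "Q", NA, "S", NA))
# table() drops NA by default, so which.max picks the most common real level
most_common = names(which.max(table(embarked)))
embarked[is.na(embarked)] = most_common
table(embarked, useNA = "always")
```

This generalizes to any factor column with a small number of missing values.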
2. Age has 177 missing values. This is trickier: we could fill them with the overall mean age, but that may not be a good estimate. The names contain titles such as Mr., Mrs. and Master, and we can use this information: first we find the mean age for each title, then assign that value to the NAs with the same title.
train$Name = as.character(train$Name)
table_names = table(unlist(strsplit(train$Name, "\\s+")))
sort(table_names[grep('\\.', names(table_names))], decreasing = T)



      Mr.     Miss.      Mrs.   Master.       Dr.      Rev.      Col.    Major. 
      517       182       125        40         7         6         2         2 
    Mlle.     Capt. Countess.      Don. Jonkheer.        L.     Lady.      Mme. 
        2         1         1         1         1         1         1         1 
      Ms.      Sir. 
        1         1 


# get the titles of passengers with a missing Age
table_na = train[which(is.na(train$Age)), ]
table_names = table(unlist(strsplit(table_na$Name, "\\s+")))
sort(table_names[grep('\\.', names(table_names))], decreasing = T)


    Mr.   Miss.    Mrs. Master.     Dr. 
    119      36      17       4       1 

# get the mean age for each title
title = c("Mr\\.", "Miss\\.", "Mrs\\.", "Master\\.", "Dr\\.")
# means
sapply(title, function(x){
mean(train$Age[grepl(x, train$Name) & !is.na(train$Age)])
})



     Mr\.    Miss\.     Mrs\.  Master\.      Dr\. 
 32.36809  21.77397  35.89815   4.57417  42.00000 

# use these mean values to fill the NAs for each title
title = c("Mr\\.", "Miss\\.", "Mrs\\.", "Master\\.", "Dr\\.", "Ms\\.")
for (x in title){
  train$Age[grepl(x, train$Name) & is.na(train$Age)] =
    mean(train$Age[grepl(x, train$Name) & !is.na(train$Age)])
}
summary(train$Age)
                      Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
                     0.42   21.77   30.00   29.75   35.90   80.00       
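A more compact alternative to the loop (a sketch, not the approach used above): pull the title out into its own column with a regex and let ave() do the per-group imputation. Title is a hypothetical helper column, and the regex assumes the "Surname, Title. First" name format used in this dataset; the toy frame below stands in for train.

```r
# Toy frame with the same "Surname, Title. First" name format as the Titanic data
df = data.frame(Name = c("Abbing, Mr. Anthony", "Bowen, Mr. David", "Abbott, Mrs. Rosa"),
                Age  = c(30, NA, 40), stringsAsFactors = FALSE)
df$Title = sub(".*, (\\w+\\.).*", "\\1", df$Name)   # "Mr.", "Mrs.", ...
# ave() applies the function within each Title group and keeps row order
df$Age = ave(df$Age, df$Title,
             FUN = function(a) ifelse(is.na(a), mean(a, na.rm = TRUE), a))
df$Age
```

Having the title as an explicit column is also handy later, since the title itself can be a useful feature.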

3. Cabin also has 687 NAs. Since the number of missing values is so high, we will simply drop this column from the analysis.
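Dropping the column is one line in base R (shown here on a toy frame; with dplyr you could use select(train, -Cabin) instead):

```r
# Toy frame standing in for train
df = data.frame(Age = c(22, 38), Cabin = c(NA, "C85"), stringsAsFactors = FALSE)
df$Cabin = NULL   # assigning NULL removes the column
names(df)
```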

Now our data is clean and ready for analysis. Before we fit any model, let's visualize the data; spotting patterns now will help when we develop models later.
a) Sex vs Survived

(ggplot(train, aes(Survived, fill = Sex))
 + geom_bar()
 + xlab("")
 + ylab("No. of Passengers")
 + scale_x_discrete(breaks = c("0", "1"), labels = c("Perished", "Survived")))

We can see that female passengers had a much higher chance of surviving.

b) Age vs Survived

(ggplot(train, aes(Age, fill = Survived))
 + geom_histogram(binwidth = 2)
 + xlab("Age")
 + ylab("No. of Passengers")
 + scale_fill_discrete(breaks = c("0", "1"), labels = c("Perished", "Survived")))

We can see that passengers below age 10 had a higher survival rate.


c) Class vs Survived

(ggplot(train, aes(Pclass, fill = Survived))
 + geom_bar()
 + xlab("Class")
 + ylab("No. of Passengers")
 + scale_x_discrete(breaks = c("1", "2", "3"), labels = c("First", "Second", "Third"))
 + scale_fill_discrete(breaks = c("0", "1"), labels = c("Perished", "Survived")))

First class has a noticeably higher survival rate.

In the next tutorial we will build our first ML model.

**Happy Coding**

Titanic: Machine Learning from Disaster -- Part 1 -- Let's get started

Machine learning has always fascinated me.

I had heard of Kaggle (a platform for predictive modelling and analytics competitions, on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models), but I never participated in any competition until recently (I had created my account 11 months ago). After taking the intro-to-machine-learning and machine-learning-in-15-hours-of-expert-videos courses, and going through the Machine Learning With R Cookbook and Machine Learning with R by Brett Lantz, I wanted to apply my machine learning skills to a real-world problem. Kaggle was the best place to get started.

Anyone who has gone even a little way into machine learning will know about the Titanic dataset. The competition description from the Kaggle website:

 "The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy"

Let's get started by downloading the files (train.csv and test.csv; you need a Kaggle account), or you can read the CSV directly from the copy stored in my GitHub repository:
Titanic Data

Lets load all required packages

### load all required packages
suppressPackageStartupMessages(require('dplyr'))
suppressPackageStartupMessages(require('ggplot2'))
suppressPackageStartupMessages(require('caret'))        # contains various ML utilities
suppressPackageStartupMessages(require('e1071'))        # contains various ML algorithms
suppressPackageStartupMessages(require('party'))        # decision trees
suppressPackageStartupMessages(require('nnet'))         # neural networks
suppressPackageStartupMessages(require('randomForest')) # random forests
suppressPackageStartupMessages(require('pROC'))         # ROC curves
These are most of the packages you will need for ML in R. We will not use all of them in this tutorial; I will explain each package as we go along.

Now let's load the data

# set the working directory to the folder with the downloaded files
setwd("E:/")
filepath = getwd()
setwd(paste(filepath, "R_Script/Input", sep = "/"))
# read the files
train = read.csv("train.csv", na.strings = c("NA", ""))

# or read directly from GitHub
train = read.csv("https://raw.githubusercontent.com/BkrmDahal/All_file_backup_for_analysis/master/train_titanic.csv",
                 na.strings = c("NA", ""))

# check the structure of the data
str(train)

'data.frame': 891 obs. of 12 variables:
 $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int 0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

Let's check the summary

summary(train)
  PassengerId    Survived Pclass                                   Name    
 Min.   :  1.0   0:549    1:216   Abbing, Mr. Anthony                 :  1  
 1st Qu.:223.5   1:342    2:184   Abbott, Mr. Rossmore Edward         :  1  
 Median :446.0            3:491   Abbott, Mrs. Stanton (Rosa Hunt)    :  1  
 Mean   :446.0                    Abelson, Mr. Samuel                 :  1  
 3rd Qu.:668.5                    Abelson, Mrs. Samuel (Hannah Wizosky):  1  
 Max.   :891.0                    Adahl, Mr. Mauritz Nils Martin      :  1  
                                  (Other)                             :885  
     Sex           Age            SibSp           Parch           Ticket    
 female:314   Min.   : 0.42   Min.   :0.000   Min.   :0.0000   1601    :  7  
 male  :577   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   347082  :  7  
              Median :28.00   Median :0.000   Median :0.0000   CA. 2343:  7  
              Mean   :29.70   Mean   :0.523   Mean   :0.3816   3101295 :  6  
              3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6  
              Max.   :80.00   Max.   :8.000   Max.   :6.0000   CA 2144 :  6  
              NA's   :177                                      (Other) :852  
      Fare              Cabin     Embarked  
 Min.   :  0.00   B96 B98    :  4   C   :168  
 1st Qu.:  7.91   C23 C25 C27:  4   Q   : 77  
 Median : 14.45   G6         :  4   S   :644  
 Mean   : 32.20   C22 C26    :  3   NA's:  2  
 3rd Qu.: 31.00   D          :  3            
 Max.   :512.33   (Other)    :186            
                  NA's       :687            
We can see that a few columns have missing values. Most machine learning algorithms simply exclude observations that contain NAs, and we want to keep as many observations as possible so we can train a better model. In the next tutorial we will fill the NAs with appropriate values. Data munging is one of the most important skills for a data scientist: real-world data rarely arrives in the shape we want, so we need to preprocess it and make it ready for analysis. That is where we add value; if the job were just importing data and fitting a model, the machine could do it better than us.
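A quick way to see exactly how many NAs each column has, rather than scanning the summary output, is colSums(is.na(...)). A sketch on a toy frame (on the real data you would pass train):

```r
# Toy frame standing in for train
df = data.frame(Age   = c(22, NA, 26, NA),
                Fare  = c(7.25, 71.28, 7.92, 53.1),
                Cabin = c(NA, "C85", NA, NA), stringsAsFactors = FALSE)
colSums(is.na(df))   # named vector of NA counts per column
```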

Github--Titanic-Machine-Learning-from-Disaster


**Happy Coding**
