Titanic: Machine Learning from Disaster -- Part 1 -- Let's Get Started
March 06, 2016
Machine learning has always fascinated me.
I had heard of Kaggle (a platform for predictive modelling and analytics competitions, on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models) but had never participated in a competition until recently, even though I created my account 11 months ago. After taking intro-to-machine-learning and machine-learning-in-15-hours-of-expert-videos, and working through Machine Learning With R Cookbook and Machine-Learning-R-Brett-Lantz, I wanted to apply my machine learning skills to a real-world problem. Kaggle was the best place to get started.
Anyone who has studied machine learning in even a little depth will know the Titanic dataset. Here is the competition description from the Kaggle website:
"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy"
Let's get started by downloading the files (train.csv and test.csv; you need a Kaggle account). Alternatively, you can read the CSV directly from the link to the copy stored in my GitHub repository.
Titanic Data
Let's load all the required packages.

    suppressPackageStartupMessages(library(ggplot2))       # plotting
    suppressPackageStartupMessages(library(caret))         # common interface to many ML models
    suppressPackageStartupMessages(library(e1071))         # SVM, naive Bayes and other ML functions
    suppressPackageStartupMessages(library(party))         # decision trees
    suppressPackageStartupMessages(library(nnet))          # neural networks
    suppressPackageStartupMessages(library(randomForest))  # random forest
    suppressPackageStartupMessages(library(pROC))          # ROC curves

These are most of the packages you will need for ML in R. We will not use all of them in this tutorial; I will explain each package as we go along.
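If any of these packages is not installed, loading it will fail. The sketch below is one way to check first; the helper name `missing_packages` is hypothetical and not part of the original script.

```r
# Packages named in this post.
pkgs <- c("ggplot2", "caret", "e1071", "party", "nnet", "randomForest", "pROC")

# Hypothetical helper: return the packages that cannot be loaded
# from the current library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

The install call is left commented out so that running the check never changes your library by surprise.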
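Now let's load the data.

    filepath <- getwd()
    setwd(file.path(filepath, "R_Script/Input"))
    # Read the file downloaded from Kaggle; treat "NA" and empty strings as missing.
    # stringsAsFactors = TRUE is needed on R >= 4.0 to get factor columns.
    train <- read.csv("train.csv", na.strings = c("NA", ""), stringsAsFactors = TRUE)
    # OR read the copy hosted on GitHub
    train <- read.csv("https://raw.githubusercontent.com/BkrmDahal/All_file_backup_for_analysis/master/train_titanic.csv", na.strings = c("NA", ""), stringsAsFactors = TRUE)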
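To see why `na.strings = c("NA", "")` matters, here is a small sketch on a made-up three-row CSV (the values are illustrative, not taken from the real file). By default, `read.csv` keeps an empty text field as an empty string; adding `""` to `na.strings` turns it into `NA`.

```r
# Tiny made-up CSV: the second row has a literal "NA" Age and an empty Cabin.
csv_text <- "PassengerId,Age,Cabin
1,22,C85
2,NA,
3,35,E46"

default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
strict_read  <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                         na.strings = c("NA", ""))

is.na(default_read$Cabin)  # the empty Cabin is kept as "", so it is not NA
is.na(strict_read$Cabin)   # the empty Cabin is now NA
```

Without this option, the many blank Cabin entries would look like a valid (empty) category instead of missing data.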
    str(train)

    'data.frame': 891 obs. of 12 variables:
     $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
     $ Survived   : int 0 1 1 1 0 0 0 0 1 1 ...
     $ Pclass     : int 3 1 3 1 3 3 1 3 3 2 ...
     $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
     $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
     $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
     $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
     $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
     $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
     $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
     $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
     $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Let's check the summary.

    summary(train)
      PassengerId    Survived Pclass                                     Name
     Min.   :  1.0   0:549    1:216   Abbing, Mr. Anthony                  :  1
     1st Qu.:223.5   1:342    2:184   Abbott, Mr. Rossmore Edward          :  1
     Median :446.0            3:491   Abbott, Mrs. Stanton (Rosa Hunt)     :  1
     Mean   :446.0                    Abelson, Mr. Samuel                  :  1
     3rd Qu.:668.5                    Abelson, Mrs. Samuel (Hannah Wizosky):  1
     Max.   :891.0                    Adahl, Mr. Mauritz Nils Martin       :  1
                                      (Other)                              :885

         Sex           Age            SibSp           Parch             Ticket
     female:314   Min.   : 0.42   Min.   :0.000   Min.   :0.0000   1601    :  7
     male  :577   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   347082  :  7
                  Median :28.00   Median :0.000   Median :0.0000   CA. 2343:  7
                  Mean   :29.70   Mean   :0.523   Mean   :0.3816   3101295 :  6
                  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6
                  Max.   :80.00   Max.   :8.000   Max.   :6.0000   CA 2144 :  6
                  NA's   :177                                      (Other) :852

          Fare                Cabin     Embarked
     Min.   :  0.00   B96 B98    :  4   C   :168
     1st Qu.:  7.91   C23 C25 C27:  4   Q   : 77
     Median : 14.45   G6         :  4   S   :644
     Mean   : 32.20   C22 C26    :  3   NA's:  2
     3rd Qu.: 31.00   D          :  3
     Max.   :512.33   (Other)    :186
                      NA's       :687
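Note that the summary lists Survived and Pclass as counts per level (0:549, 1:342 and 1:216, 2:184, 3:491), which is how `summary()` reports factors, not integers. The post does not show a conversion step, so the sketch below is an assumption about how those columns could have been turned into factors, shown on a made-up four-row data frame standing in for `train`.

```r
# Made-up rows standing in for `train`; only the two columns of interest.
toy <- data.frame(Survived = c(0L, 1L, 1L, 0L),
                  Pclass   = c(3L, 1L, 3L, 2L))

# Convert the integer codes to factors so summary() counts each level
# instead of reporting min/mean/max.
toy$Survived <- factor(toy$Survived)
toy$Pclass   <- factor(toy$Pclass)

summary(toy)
```

Treating these columns as factors also tells modelling functions that survival is a class label and not a number to regress on.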
We see that a few columns have missing values: Age (177), Cabin (687) and Embarked (2). Most machine learning algorithms exclude observations that contain NA, and we want as many observations as possible for training in order to build a better model. In the next tutorial we will try to fill in the NA values with appropriate ones. Data munging is one of the most important skills for a data scientist, because real-world data is rarely arranged the way we want; we need to pre-process it and make it ready for analysis. That is where we come in. If the job were just importing data and fitting a model, a machine could do it better than us.
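A quick way to quantify the problem is to count the missing values per column and the rows that would survive if every row with an NA were dropped. The sketch below uses a made-up four-row data frame; on the real data you would pass `train` instead.

```r
# Made-up data frame with missing values in every column.
toy <- data.frame(
  Age      = c(22, NA, 35, NA),
  Cabin    = c("C85", NA, NA, NA),
  Embarked = c("S", "C", NA, "Q"),
  stringsAsFactors = FALSE
)

na_per_column <- colSums(is.na(toy))       # missing values in each column
complete_rows <- sum(complete.cases(toy))  # rows with no NA at all

na_per_column
complete_rows
```

Here only one of four rows is complete, which illustrates how quickly dropping incomplete rows can shrink a training set.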
Github--Titanic-Machine-Learning-from-Disaster
**Happy Coding**