Titanic: Machine Learning from Disaster -- Part 1 -- Let's Get Started

Machine learning has always fascinated me.

I had heard of Kaggle (a platform for predictive modelling and analytics competitions, on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models) but had never participated in a competition until recently (I created my account 11 months ago). I wanted to apply my machine learning skills to a real-world problem after taking intro-to-machine-learning and machine-learning-in-15-hours-of-expert-videos and going through Machine Learning With R Cookbook and Machine-Learning-R-Brett-Lantz. Kaggle was the best place to get started.

Anyone who has gone into machine learning in any depth will know about the Titanic dataset. Here is the Titanic competition description from the Kaggle website:

 "The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy"

Let's get started by downloading the files (train.csv and test.csv; you need a Kaggle account), or you can read the CSV directly from the copy of the data stored in my GitHub repository.
Titanic Data

Let's load all the required packages.

### load all required packages
suppressPackageStartupMessages(library(dplyr))        # data manipulation
suppressPackageStartupMessages(library(ggplot2))      # plotting
suppressPackageStartupMessages(library(caret))        # model training and tuning framework
suppressPackageStartupMessages(library(e1071))        # SVM, naive Bayes and other ML functions
suppressPackageStartupMessages(library(party))        # decision trees
suppressPackageStartupMessages(library(nnet))         # neural networks
suppressPackageStartupMessages(library(randomForest)) # random forest
suppressPackageStartupMessages(library(pROC))         # ROC curves
These are most of the packages you will need for ML in R. We will not use all of them in this tutorial; I will explain each package as we go along.

Now let's load the data.

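# set the working directory to the folder with the downloaded files
setwd("E:/")
filepath <- getwd()
setwd(file.path(filepath, "R_Script/Input"))
# read the file, treating "NA" and empty strings as missing;
# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0
train <- read.csv("train.csv", na.strings = c("NA", ""), stringsAsFactors = TRUE)

# OR read it straight from GitHub:
# train <- read.csv("https://raw.githubusercontent.com/BkrmDahal/All_file_backup_for_analysis/master/train_titanic.csv",
#                   na.strings = c("NA", ""), stringsAsFactors = TRUE)
# let's check the structure of the data
str(train)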

'data.frame': 891 obs. of 12 variables:
 $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int 0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
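One detail worth noticing: str() reports Survived and Pclass as int, while the summary further down lists them as counts per level (0:549, 1:342), which is how R summarises factors. The conversion step is not shown in this post, so the following is only a guess at what it may have looked like. It uses a small made-up data frame called toy (a hypothetical stand-in for train) so that it runs on its own.

```r
# Toy stand-in for train; the values are made up for illustration
toy <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 2L)
)

# Convert the integer codes to factors so that summary() reports
# counts per level instead of min/median/mean
toy$Survived <- as.factor(toy$Survived)
toy$Pclass   <- as.factor(toy$Pclass)

str(toy)
summary(toy)
```

Applying the same two as.factor() lines to train would be one way to get the Survived and Pclass columns in the form the summary shows.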

Let's check the summary.

summary(train)
 PassengerId    Survived Pclass  Name
 Min.   :  1.0  0:549    1:216   Abbing, Mr. Anthony                  :  1
 1st Qu.:223.5  1:342    2:184   Abbott, Mr. Rossmore Edward          :  1
 Median :446.0           3:491   Abbott, Mrs. Stanton (Rosa Hunt)     :  1
 Mean   :446.0                   Abelson, Mr. Samuel                  :  1
 3rd Qu.:668.5                   Abelson, Mrs. Samuel (Hannah Wizosky):  1
 Max.   :891.0                   Adahl, Mr. Mauritz Nils Martin       :  1
                                 (Other)                              :885

 Sex         Age            SibSp          Parch           Ticket
 female:314  Min.   : 0.42  Min.   :0.000  Min.   :0.0000  1601    :  7
 male  :577  1st Qu.:20.12  1st Qu.:0.000  1st Qu.:0.0000  347082  :  7
             Median :28.00  Median :0.000  Median :0.0000  CA. 2343:  7
             Mean   :29.70  Mean   :0.523  Mean   :0.3816  3101295 :  6
             3rd Qu.:38.00  3rd Qu.:1.000  3rd Qu.:0.0000  347088  :  6
             Max.   :80.00  Max.   :8.000  Max.   :6.0000  CA 2144 :  6
             NA's   :177                                   (Other) :852

 Fare             Cabin            Embarked
 Min.   :  0.00   B96 B98    :  4  C   :168
 1st Qu.:  7.91   C23 C25 C27:  4  Q   : 77
 Median : 14.45   G6         :  4  S   :644
 Mean   : 32.20   C22 C26    :  3  NA's:  2
 3rd Qu.: 31.00   D          :  3
 Max.   :512.33   (Other)    :186
                  NA's       :687

We see that a few columns (Age, Cabin and Embarked) have missing values. Many machine learning functions either drop observations that contain NA or refuse to run on them, and we want as many observations as possible for training to get a better model. In the next tutorial we will try to fill the NA values with appropriate values. Data munging is one of the most important skills for a data scientist, because real-world data will not be arranged the way we want; we need to pre-process it and make it ready for analysis. That is where we come in: if it were just a matter of importing data and fitting a model, a machine could do it better than us.
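To see the missing values without reading through the whole summary, you can count the NAs per column. This is a minimal sketch on a small made-up data frame called toy (a hypothetical stand-in for train, so the snippet runs on its own). Run against the real train, the counts should line up with the NA's rows in the summary: 177 for Age, 687 for Cabin and 2 for Embarked.

```r
# Toy stand-in for train; the values are made up for illustration
toy <- data.frame(
  Age      = c(22, 38, 26, NA),
  Cabin    = c(NA, "C85", NA, NA),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = TRUE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows left if every incomplete observation were dropped
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Comparing complete_rows with nrow(toy) shows how much training data would be lost by simply dropping incomplete rows, which is why filling the gaps is worth the effort.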

Github--Titanic-Machine-Learning-from-Disaster


**Happy Coding**

Let's get started with Python

I always wanted to master one programming language (just as a hobby). Programming can make your life easier by automating all kinds of tasks. I learned C in college, but only to pass the exam, which I did. Then I never looked back at it.
In my professional life I have used R, so I have excellent knowledge of R (and never felt the need to learn another language), but R is mostly for data analysis and data science (which is what I do). Some time ago, while doing some research on data science, I found that R is not enough: if I wanted to be a bad-ass data scientist, I had to learn Python.
Learning Python is trickier than learning R, because there are so many things to learn and so many branches, unlike R (where you learn the concept of a data frame and dive in). I want to learn Python both for data science and as a hobby, which made my learning even trickier. After long research and from my personal experience, here are the steps and resources that worked best.

Steps for understanding Python from the basics:
1. Try_python -- basic -- take this course first.
2. Programming Foundations with Python -- basic -- the best course for starters if you want to understand programming from the basics, but it lacks coverage of loops and data types.
3. Python--Codecademy -- basic to advanced -- beautifully designed; you will learn by doing.

After taking these courses you can start making projects as you like. If you want to dive deep into data:
2. How to get better at data science -- best blog on this topic.

Need books?
1. It ebooks -- free downloads -- buy the books from Amazon if you can afford to. Support the authors.

Free online courses (just search for Python on these websites)
1. Udacity -- the most interactive and interesting courses -- first-time programmers should try an introductory course -- you can progress at your own pace.
2. edX -- interactive, but can be boring and has fixed schedules (you can archive a course and take it later too) -- has courses from basic to advanced.
3. Coursera -- first-time programmers, don't try this -- most of the courses are boring if you don't have enthusiasm.
* This is my personal view. Maybe Coursera has very interesting and interactive courses and I always took the boring classes, and Udacity may have boring classes too. Look at all three websites and decide for yourself.

Installing Python and its packages can be tricky too
My suggestion is to use the Anaconda distribution
 -- It comes with more than 200 popular packages installed, along with Jupyter notebook (IPython notebook) and Spyder (an IDE for Python)
 -- Another advantage: you can have both Python 2.7 and 3.4 side by side, creating the second environment with a single command.
       conda create -n python2 python=2.7 anaconda

       Activate the environment
       activate python2
 -- You can also install R and use it in Jupyter notebook with a single line of code

       conda create -n my-r-env -c r r-essentials

Need help:
3. Google -- ha ha 

Remember: if you have to do the same task again and again, always find a way to automate it.
Happy coding :-)
