Titanic: Machine Learning from Disaster -- Part 3 -- Logistic Regression

Our first machine learning algorithm will be logistic regression. For background, see the Wikipedia article on Logistic_regression.
I learned regression in high school and during my bachelor's, but I never understood its true power (I just studied to pass exams). Now I see what regression can really do.
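As a quick refresher (my own sketch, not part of the original tutorial code): logistic regression passes a linear predictor through the logistic (sigmoid) function, which squeezes any real number into a probability between 0 and 1.

```r
# Logistic (sigmoid) function: maps any real-valued linear predictor
# to a probability in (0, 1).
sigmoid = function(z) 1 / (1 + exp(-z))

sigmoid(0)    # 0.5 -- a linear predictor of 0 means a 50/50 chance
sigmoid(4)    # close to 1 -- large positive predictor, high probability
sigmoid(-4)   # close to 0 -- large negative predictor, low probability

# R's built-in plogis() computes the same function:
all.equal(sigmoid(2), plogis(2))  # TRUE
```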

In the last two tutorials we did some preprocessing. Let's wrap that preprocessing into a function.

preprocessing = function(file) {
  train = read.csv(file, na.strings = c("NA", ""))
  # Convert strings to factors
  train$Sex = factor(train$Sex)
  train$Pclass = factor(train$Pclass)
  # Fill NAs in Embarked with the most frequent port, 'S'
  train$Embarked[which(is.na(train$Embarked))] = 'S'
  # Get the mean age for each title and use it to fill the missing ages
  title = c("Mr\\.", "Miss\\.", "Mrs\\.", "Master\\.", "Dr\\.", "Ms\\.")
  for (t in title) {
    train$Age[grepl(t, train$Name) & is.na(train$Age)] =
      mean(train$Age[grepl(t, train$Name) & !is.na(train$Age)])
  }
  # Return everything as numeric, as most models take numeric values only
  train$Sex = as.numeric(train$Sex)
  train$Pclass = as.numeric(train$Pclass)
  train$Embarked = as.numeric(factor(train$Embarked))
  train$Fare[is.na(train$Fare)] = median(train$Fare, na.rm = TRUE)
  return(train)
}

This function takes a file name as input and returns a cleaned data frame.
The next step is to split the data into a training set and a test set. We will use the training set to build the model and the test set to evaluate its performance.
library(caret)
train = preprocessing("train.csv")
intrain = createDataPartition(y = train$Survived, p = 0.7, list = FALSE)
trainset = train[intrain, ]
testset = train[-intrain, ]

Let's build a logistic regression model as our first ML model. Logistic regression can be done with the base glm function or with caret; I will use caret's train function with method = "glm".

# The 'family' feature is not created anywhere else in this tutorial; here I
# assume it means family size, i.e. SibSp + Parch + 1.
trainset$family = trainset$SibSp + trainset$Parch + 1
testset$family = testset$SibSp + testset$Parch + 1
fit = train(Survived ~ Pclass + Sex + Age + SibSp + family + Embarked + Fare,
            data = trainset, method = "glm", family = binomial,
            preProcess = "scale")
* The best thing about caret's train function is that you can simply change the method name to train the same model with another algorithm. For details on the train function, see help(train) or a detailed tutorial on caret.
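As a small self-contained illustration of swapping the method (using the built-in iris data rather than the Titanic set, so it runs on its own; "rf" additionally needs the randomForest package):

```r
library(caret)

# A two-class subset of iris as a stand-in binary problem
two_class = droplevels(iris[iris$Species != "setosa", ])

# Logistic regression...
fit_glm = train(Species ~ ., data = two_class, method = "glm")
# ...and a random forest: the only change is the method string.
fit_rf  = train(Species ~ ., data = two_class, method = "rf")
```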

Let's check performance using a confusion matrix.
pred = predict(fit, testset, type = 'raw')
class = ifelse(pred >= 0.5, 1, 0)
tb = table(testset$Survived, class)
confusionMatrix(tb)


Confusion Matrix and Statistics

   class
      0   1
  0 138  26
  1  33  69

               Accuracy : 0.7782
                 95% CI : (0.7234, 0.8267)
    No Information Rate : 0.6429
    P-Value [Acc > NIR] : 1.259e-06

                  Kappa : 0.5247
 Mcnemar's Test P-Value : 0.4347

            Sensitivity : 0.8070
            Specificity : 0.7263
         Pos Pred Value : 0.8415
         Neg Pred Value : 0.6765
             Prevalence : 0.6429
         Detection Rate : 0.5188
   Detection Prevalence : 0.6165
      Balanced Accuracy : 0.7667

       'Positive' Class : 0
We see our accuracy is 77.82%; not bad.

You can also make a ROC curve (this uses the ROCR package).
library(ROCR)
pred.rocr = prediction(pred, testset$Survived)
perf.rocr = performance(pred.rocr, measure = "auc", x.measure = "cutoff")
perf.tpr.rocr = performance(pred.rocr, "tpr", "fpr")
plot(perf.tpr.rocr, colorize = TRUE,
     main = paste("AUC:", unlist(perf.rocr@y.values)))


As the ROC curve also looks good, let's use this model on the real test data.
test = preprocessing("test.csv")
summary(test)
# The test data still has an NA in Age: there is a "Ms." title in the test set
# ("Ms." is equivalent to "Miss.") which we never had in the training set,
# so let's fill it with the mean age of "Miss." too.
test$Age[grepl("Ms\\.", test$Name) & is.na(test$Age)] =
  mean(test$Age[grepl("Miss\\.", test$Name) & !is.na(test$Age)])
## Let's predict
pred = predict(fit, test, type = 'raw')
class = as.data.frame(ifelse(pred >= 0.5, 1, 0))
## Make a data frame of PassengerId and prediction, and save it
passengerid = as.data.frame(test[, 1])
class = cbind(passengerid, class)
colnames(class) = c("PassengerId", "Survived")
write.csv(class, "logit.csv", row.names = FALSE)

Now everything is done: we made a model and tested it on new data.
This is a very simple model. I split the data once for validation, but there are other ways to validate a model, such as cross-validation, and there are tons of other ML algorithms to try.
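For instance, here is a sketch of 10-fold cross-validation with caret's trainControl, shown on made-up stand-in data (the real tutorial would pass trainset instead):

```r
library(caret)
set.seed(42)

# Hypothetical stand-in data: two predictors and a binary outcome
d = data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y = factor(ifelse(d$x1 + d$x2 + rnorm(200) > 0, "yes", "no"))

# 10-fold cross-validation: the model is refit on 10 different
# train/test splits and performance is averaged, instead of
# trusting a single holdout split.
ctrl = trainControl(method = "cv", number = 10)
fit_cv = train(y ~ x1 + x2, data = d, method = "glm", trControl = ctrl)
fit_cv$results  # cross-validated Accuracy and Kappa
```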

I have also applied other algorithms to this same data, such as SVM, random forest, and boosting.



###################################Happy Coding###############################

Titanic: Machine Learning from Disaster -- Part 2 -- Preprocessing and data visualization

Let's assign sensible values to all the NAs.

1. We can see there are NAs in Age, Cabin, and Embarked. Let's start with Embarked:
as there are only 2 missing values, we assign them the most frequent port.

train$Embarked[which(is.na(train$Embarked))] ='S'
table(train$Embarked, useNA = "always")

             C    Q    S <NA> 
            168   77  646    0 
2. Age has 177 missing values. This is trickier: we could fill them with the overall mean age, but that may not be a good estimate. The names contain titles like Mr., Mrs., and Master., and we can use this information: first we find the mean age for each title, then assign that mean to the NAs with the same title.
train$Name = as.character(train$Name)
table_names = table(unlist(strsplit(train$Name, "\\s+")))
sort(table_names[grep('\\.', names(table_names))], decreasing = T)



      Mr.    Miss.     Mrs.  Master.      Dr.     Rev. 
      517      182      125       40        7        6 
     Col.   Major.    Mlle.    Capt. Countess.    Don. 
        2        2        2        1        1        1 
Jonkheer.       L.    Lady.     Mme.      Ms.     Sir. 
        1        1        1        1        1        1 


# Let's get the titles of the passengers with missing ages
table_na = train[which(is.na(train$Age)), ]
table_names = table(unlist(strsplit(table_na$Name, "\\s+")))
sort(table_names[grep('\\.', names(table_names))], decreasing = T)


    Mr.   Miss.    Mrs. Master.     Dr. 
    119      36      17       4       1 

# Let's get the mean age for each title
title = c("Mr\\.", "Miss\\.", "Mrs\\.", "Master\\.", "Dr\\.")
sapply(title, function(x) {
  mean(train$Age[grepl(x, train$Name) & !is.na(train$Age)])
})



     Mr\.    Miss\.     Mrs\.  Master\.      Dr\. 
 32.36809  21.77397  35.89815   4.57417  42.00000 

# Let's use these mean values to fill the NAs for each title
title = c("Mr\\.", "Miss\\.", "Mrs\\.", "Master\\.", "Dr\\.", "Ms\\.")
for (x in title) {
  train$Age[grepl(x, train$Name) & is.na(train$Age)] =
    mean(train$Age[grepl(x, train$Name) & !is.na(train$Age)])
}
summary(train$Age)
                      Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
                     0.42   21.77   30.00   29.75   35.90   80.00       

3. Cabin also has 687 NAs; since the number of missing values is so high, we will simply drop this column from the analysis.
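The drop itself is a one-liner; here is a sketch with a toy data frame standing in for the Titanic train data (hypothetical values):

```r
# Toy stand-in for the Titanic training data (hypothetical values)
df = data.frame(Cabin = c("C85", NA, NA), Age = c(22, 38, 26))

# Setting a column to NULL removes it from the data frame entirely
df$Cabin = NULL
names(df)  # "Age"
```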

Now our data is clean and ready for analysis. Before we model anything, let's visualize the data; spotting patterns now will help when developing models later.
a) Sex vs Survived
library(ggplot2)
(ggplot(train, aes(factor(Survived), fill = Sex))
 + geom_bar(aes(color = Sex))
 + xlab("")
 + ylab("No. of Passengers")
 + scale_x_discrete(breaks = c("0", "1"), labels = c("Perished", "Survived")))

We can see that females had a much higher chance of surviving.

b) Age vs Survived

(ggplot(train, aes(Age, fill = factor(Survived)))
 + geom_histogram(binwidth = 2)
 + xlab("Age")
 + ylab("No. of Passengers")
 + scale_fill_discrete(breaks = c("0", "1"), labels = c("Perished", "Survived")))

We can see that passengers under age 10 had a higher survival rate.


c) Class vs Survived
(ggplot(train, aes(factor(Pclass), fill = factor(Survived)))
 + geom_bar()
 + xlab("Class")
 + ylab("No. of Passengers")
 + scale_x_discrete(breaks = c("1", "2", "3"), labels = c("First", "Second", "Third"))
 + scale_fill_discrete(breaks = c("0", "1"), labels = c("Perished", "Survived")))

First class passengers had a higher survival rate.

In the next tutorial we will build our first ML model.

**Happy Coding**
