Titanic: Machine Learning from Disaster -- Part 3 -- Logistic Regression
April 20, 2016
Our first machine learning algorithm will be logistic regression. For details, see Logistic_regression.
I learned regression in high school and during my bachelor's degree but never understood its true power (I just studied to pass the exams). Now I understand the real power of regression.
In the last two tutorials we did some preprocessing. Let's make a function for preprocessing.
This function takes a file name as input and returns a cleaned data frame.
The next step is to split the data into a training set and a test set. We will use the training set to build the model and the test set to evaluate its performance.
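To make the idea of a stratified split concrete, here is a minimal base R sketch on made-up labels. It is not caret's implementation; the variable names and the 60/40 class balance are assumptions chosen for illustration.

```r
# Made-up outcome vector: 60 non-survivors (0) and 40 survivors (1).
set.seed(7)
y <- c(rep(0, 60), rep(1, 40))

# Sample 70% of the row indices within each class separately,
# so the training and test parts keep the same class balance.
idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}), use.names = FALSE)

train_y <- y[idx]
test_y  <- y[-idx]

print(length(train_y))  # number of training rows
print(mean(train_y))    # share of survivors in the training part
print(mean(test_y))     # share of survivors in the test part
```

Sampling inside each class is what keeps the survival rate the same in both parts, which matters when one class is noticeably rarer than the other.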
Let's build a logistic regression model as our first ML model. Logistic regression can be done using the base glm function or using caret. I will be using the train function with method 'glm' from the caret package.
* The best thing about caret is that you can just change the method name and train the same model with another algorithm. For details on the train function, see help(train): Detail Tutorial on Caret
Let's see the performance using a confusion matrix.
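As a rough illustration of the base glm route mentioned above, here is a minimal sketch on synthetic data. The data frame `toy`, its coefficients, and the chosen predictors are assumptions made up for this example; they are not the model used in this post.

```r
# Synthetic stand-in for the cleaned Titanic data (all values are made up).
set.seed(42)
n <- 200
toy <- data.frame(
  Sex    = sample(1:2, n, replace = TRUE),  # numeric factor codes: 1 = female, 2 = male
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)

# Simulate survival so that women and higher classes survive more often.
lin <- 2.5 - 1.5 * toy$Sex - 0.6 * toy$Pclass - 0.01 * toy$Age
toy$Survived <- rbinom(n, 1, plogis(lin))

# family = binomial selects the logit link, i.e. logistic regression.
fit_glm <- glm(Survived ~ Sex + Pclass + Age, data = toy, family = binomial)

# type = "response" returns predicted probabilities between 0 and 1.
prob <- predict(fit_glm, newdata = toy, type = "response")
class_hat <- ifelse(prob >= 0.5, 1, 0)

print(coef(fit_glm))
print(table(actual = toy$Survived, predicted = class_hat))
```

With caret, the same kind of model would go through train with method = "glm"; note that caret expects a factor outcome if the fit is to be treated as classification.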
prepossessing = function(x) {
  # Read the CSV, treating both "NA" and empty strings as missing
  train = read.csv(x, na.strings = c("NA", ""))
  # Convert categorical columns to factors
  train$Sex = factor(train$Sex)
  train$Pclass = factor(train$Pclass)
  # Fill NA in Embarked with S
  train$Embarked[is.na(train$Embarked)] = "S"
  # Use the mean age for each title to fill NA ages
  title = c("Mr\\.", "Miss\\.", "Mrs\\.", "Master\\.", "Dr\\.", "Ms\\.")
  for (t in title) {
    has_title = grepl(t, train$Name)
    train$Age[has_title & is.na(train$Age)] = mean(train$Age[has_title], na.rm = TRUE)
  }
  # Return everything as numeric data, as most models take numeric values only
  train$Sex = as.numeric(train$Sex)
  train$Pclass = as.numeric(train$Pclass)
  # factor() first, so this also works when read.csv returns character columns (R >= 4.0)
  train$Embarked = as.numeric(factor(train$Embarked))
  train$Fare[is.na(train$Fare)] = median(train$Fare, na.rm = TRUE)
  return(train)
}
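library(caret)
train = prepossessing("train.csv")  # assuming the Kaggle training file is named train.csv
# Stratified 70/30 split on the outcome
intrain <- createDataPartition(y = train$Survived, p = 0.7, list = FALSE)
trainset = train[intrain, ]
testset = train[-intrain, ]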
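# fit is the model returned by caret's train() with method 'glm'
pred = predict(fit, testset, type = 'raw')
# Classify as survived when the predicted value is at least 0.5
class = ifelse(pred >= .5, 1, 0)
# Note: confusionMatrix() treats table rows as predictions and columns as the reference
tb = table(testset$Survived, class)
confusionMatrix(tb)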
Confusion Matrix and Statistics

   class
      0   1
  0 138  26
  1  33  69

               Accuracy : 0.7782
                 95% CI : (0.7234, 0.8267)
    No Information Rate : 0.6429
    P-Value [Acc > NIR] : 1.259e-06

                  Kappa : 0.5247
 Mcnemar's Test P-Value : 0.4347

            Sensitivity : 0.8070
            Specificity : 0.7263
         Pos Pred Value : 0.8415
         Neg Pred Value : 0.6765
             Prevalence : 0.6429
         Detection Rate : 0.5188
   Detection Prevalence : 0.6165
      Balanced Accuracy : 0.7667

       'Positive' Class : 0
Our accuracy is 77.82%, which is not bad.
You can also plot a ROC curve.
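To show where these figures come from, the main statistics can be recomputed by hand from the four counts in the confusion matrix. This is a small base R sketch; the matrix name `cm` is only illustrative.

```r
# Counts copied from the confusion matrix above.
# Rows are what confusionMatrix() treated as predictions, columns as the reference.
cm <- matrix(c(138, 33, 26, 69), nrow = 2,
             dimnames = list(row = c("0", "1"), class = c("0", "1")))

# Accuracy: share of cases on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# With 0 as the 'positive' class:
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # 138 / (138 + 33)
specificity <- cm["1", "1"] / sum(cm[, "1"])  # 69 / (26 + 69)
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, and Balanced Accuracy lines of the output.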
library(ROCR)
pred.rocr = prediction(pred, testset$Survived)
# Area under the ROC curve
perf.rocr = performance(pred.rocr, measure = "auc")
# True positive rate against false positive rate
perf.tpr.rocr = performance(pred.rocr, "tpr", "fpr")
plot(perf.tpr.rocr, colorize = TRUE, main = paste("AUC:", perf.rocr@y.values[[1]]))
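For intuition about what the AUC number means, it can be computed without any package as the probability that a randomly chosen survivor gets a higher score than a randomly chosen non-survivor. The sketch below uses the rank-sum form of that definition on made-up scores; `auc_rank` is a hypothetical helper, not part of ROCR.

```r
# AUC from ranks (Mann-Whitney form); rank() averages ties.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted scores and true labels.
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   0,   1,   0,   0)

print(auc_rank(score, label))  # 13 of the 16 positive/negative pairs are ordered correctly: 0.8125
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect separation of the two classes.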
As the ROC curve also looks good, let's use this model on the real test data.
test = prepossessing("test.csv")
summary(test)
# The test data still has NA in Age because of the title Ms. (Ms. is the same as Miss.),
# which we never had in the train set; let's fill it with the mean age for Miss. too
test$Age[grepl("Ms\\.", test$Name) & is.na(test$Age)] = mean(test$Age[grepl("Miss\\.", test$Name) & !is.na(test$Age)])
## Let's predict (type = 'raw', as in the earlier predict call on testset)
pred = predict(fit, test, type = 'raw')
class = as.data.frame(ifelse(pred >= .5, 1, 0))
## Build a data frame of the predictions and save it
passangerid = as.data.frame(test[, 1])
class = cbind(passangerid, class)
colnames(class) = c("PassengerId", "Survived")
write.csv(class, "rf.csv", row.names = FALSE)
Now everything is done. We built a model and tested it on new data.
This is a very simple model. I split the data to validate the model, but there are other ways of doing validation, such as cross-validation, and tons of ML algorithms to use.
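As a rough sketch of the cross-validation alternative, the following base R example runs 5-fold cross-validation of a logistic regression on synthetic data. The data, the predictors x1 and x2, and the choice of 5 folds are assumptions for illustration only.

```r
# Synthetic binary outcome driven by two made-up predictors.
set.seed(1)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.2 * toy$x1 - 0.8 * toy$x2))

# Assign every row to one of k folds at random.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, measure accuracy on the held-out fold.
acc <- sapply(1:k, function(i) {
  fit_i <- glm(y ~ x1 + x2, data = toy[fold != i, ], family = binomial)
  p <- predict(fit_i, newdata = toy[fold == i, ], type = "response")
  mean(ifelse(p >= 0.5, 1, 0) == toy$y[fold == i])
})

print(acc)        # accuracy per fold
print(mean(acc))  # cross-validated accuracy estimate
```

Every row is used for testing exactly once, so the averaged accuracy depends less on one lucky or unlucky split. In caret, the same scheme can be requested with trainControl(method = "cv", number = 5) passed to train.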
Here are other algorithms I have used on the same data, such as SVM, random forest, and boosting.
