Titanic Part I
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
From the info output, we can see that there are null values in the Age, Cabin and Embarked features. The majority of the values in the Cabin feature are null, and I felt the values were hard to interpret anyway, so I decided not to include it in the model. There are also missing values in the Age (177 of 891) and Embarked (2 of 891) features, which will be imputed later.
| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1|0|3|Braund, Mr. Owen Harris|male|22|1|0|A/5 21171|7.2500|NaN|S|
|1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38|1|0|PC 17599|71.2833|C85|C|
|2|3|1|3|Heikkinen, Miss. Laina|female|26|0|0|STON/O2. 3101282|7.9250|NaN|S|
|3|4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35|1|0|113803|53.1000|C123|S|
|4|5|0|3|Allen, Mr. William Henry|male|35|0|0|373450|8.0500|NaN|S|
|5|6|0|3|Moran, Mr. James|male|NaN|0|0|330877|8.4583|NaN|Q|
|17|18|1|2|Williams, Mr. Charles Eugene|male|NaN|0|0|244373|13.0000|NaN|S|
|19|20|1|3|Masselmani, Mrs. Fatima|female|NaN|0|0|2649|7.2250|NaN|C|
|26|27|0|3|Emir, Mr. Farred Chehab|male|NaN|0|0|2631|7.2250|NaN|C|
|28|29|1|3|O'Dwyer, Miss. Ellen "Nellie"|female|NaN|0|0|330959|7.8792|NaN|Q|
|61|62|1|1|Icard, Miss. Amelie|female|38|0|0|113572|80|B28|NaN|
|829|830|1|1|Stone, Mrs. George Nelson (Martha Evelyn)|female|62|0|0|113572|80|B28|NaN|
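The missing values noted above can be counted and filled with a few lines of pandas. The sketch below is only one possible approach and not necessarily the imputation used later in this series: it assumes the median for Age and the most frequent port for Embarked, and the helper name `impute_basic` is made up for illustration. The small frame is built from a handful of the rows shown above, so the block runs without the Kaggle CSV.

```python
import numpy as np
import pandas as pd

# A few rows copied from the table above (PassengerId 1, 2, 3, 6 and 62),
# restricted to the columns needed here, so no CSV file is required.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 6, 62],
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 3, 1],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, np.nan, 38.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, "B28"],
    "Embarked": ["S", "C", "S", "Q", np.nan],
})


def impute_basic(df):
    """Hypothetical helper: drop Cabin, then fill Age and Embarked.

    Median / most-frequent filling is an assumption, not the post's method.
    """
    out = df.drop(columns=["Cabin"])
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    return out


missing_before = sample.isnull().sum()  # per-column count of nulls
cleaned = impute_basic(sample)
missing_after = cleaned.isnull().sum()

print(missing_before)
print(missing_after)
```

On the full training set, the same `isnull().sum()` call should agree with the non-null counts reported by info().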
So, let’s dive deeper into the data and see which features are likely to matter most for our model. The competition summary hinted that some groups of people were more likely to survive than others, such as women, children, and the upper class. The obvious predictors for women and children are Sex and Age. The Pclass predictor should be a direct representation of a passenger’s socio-economic status. The Fare predictor might also indicate whether a passenger was upper-class.
The first two plots show that females had a significantly higher chance of survival than males, consistent with the “women and children first” maxim. Kudos to the gallant gentlemen who sacrificed themselves for the lives of the women and children. As the male survival histogram shows, males had a low survival rate across nearly all age groups. The one notable exception was the 0 - 10 age group, that is, children, which had the highest male survival rate. Elderly females almost all survived.
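The numbers behind histograms like these can be tabulated directly. The following is a minimal sketch, assuming 10-year age bins and skipping rows with a missing Age; the helper name `survival_by_sex_and_age` and the toy rows are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd


def survival_by_sex_and_age(df, bin_width=10, max_age=80):
    """Mean of Survived per (Sex, age bin); rows with missing Age are skipped."""
    # Edges 0, 10, ..., 90 so that an 80-year-old still falls into a bin.
    edges = list(range(0, max_age + bin_width + 1, bin_width))
    known = df.dropna(subset=["Age"]).copy()
    # right=False gives bins [0, 10), [10, 20), ... so age 10 is not a "child" bin.
    known["AgeBin"] = pd.cut(known["Age"], bins=edges, right=False)
    return known.groupby(["Sex", "AgeBin"], observed=True)["Survived"].mean()


# Toy rows, invented for this example only.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Age": [4, 8, 30, 35, 5, 28, 33, None],
    "Survived": [1, 0, 0, 0, 1, 1, 0, 1],
})

rates = survival_by_sex_and_age(toy)
print(rates)
```

Passing the real training frame instead of `toy` gives one survival rate per sex and age bin, which is the quantity the histograms compare.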
Here I generated two plots to show whether the survival rate of passengers was related to their socio-economic class. From the plot on the left, it is evident that passengers with a Pclass of 3, meaning lower class, had a much lower survival rate than passengers with a Pclass of 1 or 2. The plot on the right shows how the combination of the Pclass and Sex variables influenced a passenger’s survival rate. Notably, socio-economic status only seemed to significantly affect the survival rate of females, not males. The odds of an upper-class lady surviving the Titanic shipwreck were surprisingly high.
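The Pclass and Sex comparison can also be read off a small cross-tabulation instead of a plot. This is a sketch under the assumption that the frame has Pclass, Sex and Survived columns as in the info output; `class_sex_survival` is a hypothetical helper name and the toy rows are invented.

```python
import pandas as pd


def class_sex_survival(df):
    """Survival rate with one row per Pclass and one column per Sex."""
    return df.pivot_table(values="Survived", index="Pclass",
                          columns="Sex", aggfunc="mean")


# Toy rows, invented for this example only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Sex": ["female", "male", "male", "female", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})

table = class_sex_survival(toy)
# Difference between female and male survival rates within each class.
gap = table["female"] - table["male"]

print(table)
print(gap)
```

Reading down a column shows how class changes the rate within one sex, and `gap` shows how far apart the sexes are within one class.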
I also generated two plots to check whether the port of embarkation made a difference in survival rate. Surprisingly, it did: passengers who embarked at Cherbourg had a significantly higher chance of survival than passengers who embarked at Queenstown or Southampton.
With the exploratory visualization from the last post, we have some idea of what our data looks like. We know that the survival rate is related in some way to Pclass, Sex, Age, Fare and Embarked, but we haven’t had a chance to explore the rest of the predictors. So, let’s start building models and making predictions.
Purely Gender-Based Model
From the last post, we know that the most influential predictor in the data set is Sex, as the survival rate differs dramatically between male and female passengers. Given the low survival rate of males and the high survival rate of females, the simplest possible model is to predict 0 for males and 1 for females.
The output of the head() function tells us that our change to the data frame took effect, and we have already made our first prediction! Next, we just need to export the data to a CSV file and submit it to Kaggle.
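For reference, the gender-based prediction and the export step might look like the sketch below. It is not the original code from this post: `gender_submission` is a hypothetical helper, the three-row frame is a stand-in for the Kaggle test set, and the CSV is written to an in-memory buffer so the block runs anywhere.

```python
import io

import pandas as pd


def gender_submission(test_df):
    """Predict 1 for female and 0 for male; keep only PassengerId and Survived."""
    survived = (test_df["Sex"] == "female").astype(int)
    return pd.DataFrame({"PassengerId": test_df["PassengerId"],
                         "Survived": survived})


# Stand-in for the test set; only the two columns used here are included.
test_sample = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

submission = gender_submission(test_sample)
print(submission.head())

# index=False keeps the row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

To produce a file for upload, pass a path to `to_csv` instead of the buffer; the file name is up to you.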
The gender-based prediction scored 0.76555, which is surprisingly good for a single line of prediction code! Of course, we are not going to call it a day. Let’s see if we can incorporate Pclass or Age into the model.
While we could build a pivot table and predict the majority outcome for each gender within each Pclass, I’d say it is time to bring in some models. Let’s try our hand at Logistic Regression first, since it is one of the most interpretable classification models in my opinion. We’ll start with an extremely simple case of just two predictors: Pclass and Sex.
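A two-predictor logistic regression of this kind could be set up as follows. This is a sketch that assumes scikit-learn, treats Pclass as a plain number and encodes Sex as a 0/1 column; those encoding choices and the helper names `make_features` and `fit_pclass_sex_model` are assumptions for illustration. The toy training rows are invented and perfectly separable by sex, so they will not reproduce the Kaggle score.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def make_features(df):
    """Pclass kept as a number; Sex encoded as 1 for female, 0 for male."""
    return pd.DataFrame({
        "Pclass": df["Pclass"],
        "IsFemale": (df["Sex"] == "female").astype(int),
    })


def fit_pclass_sex_model(train_df):
    """Fit a default LogisticRegression on the two encoded predictors."""
    model = LogisticRegression()
    model.fit(make_features(train_df), train_df["Survived"])
    return model


# Toy training rows, invented for this example only.
toy_train = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3] * 2,
    "Sex": ["female", "male"] * 6,
    "Survived": [1, 0] * 6,
})

model = fit_pclass_sex_model(toy_train)
preds = model.predict(make_features(toy_train))
print(preds)
```

With the real data, the same two helpers would be applied to the training frame for fitting and to the test frame for prediction.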
It seems we got the exact same result, 0.76555, again. Why is that? Shouldn’t adding a predictor change some of our predictions? Here is why.
Given Pclass = 1 and Sex = female, predict 1 for survival.
Given Pclass = 1 and Sex = male, predict 0 for survival.
Given Pclass = 3 and Sex = female, predict 1 for survival.
Given Pclass = 3 and Sex = male, predict 0 for survival.
Given Pclass = 2 and Sex = female, predict 1 for survival.
Given Pclass = 2 and Sex = male, predict 0 for survival.
Pclass  Sex
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64
It turns out that the predictions stay the same, with or without the Pclass predictor. Logistic Regression computes a probability for each response and predicts the response with the largest probability. In this case, as the pivot table above shows, the most probable response is 1 for females and 0 for males in every class. Pclass shifts the survival rate within each sex, but not far enough to flip the predicted response (third-class females sit right at the 0.5 boundary in the pivot table, and the model still predicts 1 for them), which is why the predictions stay the same.
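The argument can be checked with the numbers from the pivot table. The sketch below uses the observed survival rates as a stand-in for the fitted probabilities, which is an approximation since a logistic model does not reproduce group averages exactly. Sending a tie at exactly 0.5 to class 1 is also an assumption, chosen to match the prediction listed above for third-class females.

```python
# Observed survival rates from the pivot table above, keyed by (Pclass, Sex).
observed_rate = {
    (1, "female"): 0.968085, (1, "male"): 0.368852,
    (2, "female"): 0.921053, (2, "male"): 0.157407,
    (3, "female"): 0.500000, (3, "male"): 0.135447,
}


def predict_from_rate(rate, threshold=0.5):
    """Predict 1 when the rate reaches the threshold (ties go to 1, by assumption)."""
    return int(rate >= threshold)


# Predictions that use both Pclass and Sex (through the per-group rate).
with_pclass = {key: predict_from_rate(rate) for key, rate in observed_rate.items()}

# Predictions of the purely gender-based model.
sex_only = {key: int(key[1] == "female") for key in observed_rate}

print(with_pclass == sex_only)  # prints True
```

Because no group crosses the threshold in the opposite direction to its sex, both rules label every passenger identically, and the score cannot change.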
Even though we didn’t improve the model by adding the Pclass predictor, our effort was not wasted. Now we can add more variables to the Logistic Regression, hoping it will produce more accurate predictions with the extra data. In the next post, we will explore some more complicated models.