Titanic Part I

Problem Statement

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Titanic: Machine Learning from Disaster

import csv
import numpy as np
import pandas as pd
import pylab as p
%matplotlib inline  
train = pd.read_csv('train.csv', header = 0)
test = pd.read_csv('test.csv', header = 0)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB

From the info output, we can see that there are null values in the Age, Cabin and Embarked features. The majority of the values in the Cabin feature are null, and I felt the values were hard to interpret anyway, so I decided not to include it in the model. There are also a few missing values in the Age and Embarked features, which will be imputed later.
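
Before dropping or imputing anything, it helps to quantify the missingness directly; a minimal check, equivalent to reading the info() output above:

#count missing values per column
print train.isnull().sum()
#fraction of Cabin values that are missing
print train['Cabin'].isnull().mean()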

train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
train[train['Age'].isnull()].head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
train[train['Embarked'].isnull()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38 0 0 113572 80 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0 113572 80 B28 NaN

So, let’s dive deeper into the data and see if we can understand which features are most important for our model. The competition summary hinted that some groups of people were more likely to survive than others, such as women, children, and the upper-class. The obvious predictors in the data for women and children are Sex and Age. The Pclass predictor should be a direct representation of a passenger’s socio-economic status. The Fare predictor might also indicate whether a passenger was upper-class.

fig, axes = p.subplots(nrows=2, ncols=2, figsize=(15,10))
#compare male and female survival rate
male_female1 = pd.DataFrame({'Male':train[train['Sex'] == 'male']['Survived'].value_counts(),
                             'Female':train[train['Sex'] == 'female']['Survived'].value_counts()}, columns = ['Male','Female'])
male_female1.plot(kind = 'bar', title='Male and Female Survival Comparison 1', alpha = 0.3, stacked = True, ax=axes[0,0])

male_female2 = pd.DataFrame({'Survived':train[train['Survived'] == 1]['Sex'].value_counts(),
                             'Died':train[train['Survived'] == 0]['Sex'].value_counts()}, columns = ['Survived','Died'])
male_female2.plot(kind = 'bar', title='Male and Female Survival Comparison 2', alpha = 0.3, stacked = True, ax=axes[0,1])

#compare male and female survival rate across different age groups
male_survived_age = pd.DataFrame({'Survived':train[(train['Sex'] == 'male') & (train['Survived'] == 1)]['Age'],
                                  'Died':train[(train['Sex'] == 'male') & (train['Survived'] == 0)]['Age']}, columns = ['Survived','Died'])
male_survived_age.plot(kind = 'hist', title='Male Survival Age Histogram', alpha = 0.3, stacked = True, ax=axes[1,0])

female_survived_age = pd.DataFrame({'Survived':train[(train['Sex'] == 'female') & (train['Survived'] == 1)]['Age'],
                                    'Died':train[(train['Sex'] == 'female') & (train['Survived'] == 0)]['Age']}, columns = ['Survived','Died'])
female_survived_age.plot(kind = 'hist', title='Female Survival Age Histogram', alpha = 0.3, stacked = True, ax=axes[1,1])

[Figure: Male and Female Survival Comparison 1 & 2 (stacked bars); Male and Female Survival Age Histograms]

The first two plots show that females had a significantly higher chance of survival than males, thanks to the “women and children first” maxim. Kudos to the gallant gentlemen who sacrificed themselves for the lives of the women and children. As we can see from the male survival histogram, males had a uniformly low survival rate across all age groups. It is noteworthy that the age group with the highest male survival rate was 0 - 10 years old, i.e. children, and that elderly females almost all survived.
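
To put numbers behind the histograms, we can bin Age and compare survival rates by sex; a minimal sketch, where the decade-wide bins are my own choice:

#survival rate by sex within ten-year age bins (rows with missing Age are excluded)
age_bins = pd.cut(train['Age'], range(0, 90, 10))
print train.groupby(['Sex', age_bins])['Survived'].mean()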

fig, axes = p.subplots(nrows=1, ncols=2, figsize=(15,5))
#compare survival rate of different socio-economic class
survival_pclass = pd.DataFrame({'Survived':train[train['Survived'] == 1]['Pclass'].value_counts(),
                    'Died':train[train['Survived'] == 0]['Pclass'].value_counts()}, columns = ['Survived','Died'])
survival_pclass.plot(kind = 'bar', title='Pclass Survival Comparison', alpha = 0.3, stacked = True, ax=axes[0])

#compare survival rate of different socio-economic class and sex
class_sex = pd.DataFrame({'High-Mid Class Male':train[(train['Sex'] == 'male') & (train['Pclass'] != 3)]['Survived'].value_counts(),
                          'High-Mid Class Female':train[(train['Sex'] == 'female') & (train['Pclass'] != 3)]['Survived'].value_counts(),
                          'Low Class Male':train[(train['Sex'] == 'male') & (train['Pclass'] == 3)]['Survived'].value_counts(),
                         'Low Class Female':train[(train['Sex'] == 'female') & (train['Pclass'] == 3)]['Survived'].value_counts()},
                         columns = ['High-Mid Class Male','High-Mid Class Female','Low Class Male','Low Class Female'])
class_sex.plot(kind = 'bar', title='Pclass Sex Survival Histogram', alpha = 0.3, ax=axes[1])

[Figure: Pclass Survival Comparison (stacked bar); Pclass Sex Survival Histogram (grouped bar)]

Here I generated two plots to show whether the survival rate of passengers is related to their socio-economic class. From the plot on the left, it is evident that passengers with a Pclass of 3, i.e. lower-class, had a much lower survival rate than passengers with a Pclass of 1 or 2. The plot on the right shows how a combination of the Pclass and Sex variables could influence the survival rate of a passenger. It is notable that socio-economic status only seemed to significantly affect the survival rate of females, not males. The odds of an upper-class lady surviving the Titanic shipwreck are surprisingly high.
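
The class effect can also be read directly off the raw rates; a minimal sketch:

#mean survival rate per passenger class
print train.groupby('Pclass')['Survived'].mean()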

fig, axes = p.subplots(nrows=1, ncols=2, figsize=(15,5))
#compare survival rate across ports of embarkation
embarked_survival1 = pd.DataFrame(train[train['Survived'] == 1]['Embarked'].value_counts()/train['Embarked'].value_counts())
embarked_survival1.plot(linestyle='-', marker='o', title='Embarked Survival Rate', ax=axes[0])
embarked_survival2 = pd.DataFrame({'Survived':train[train['Survived'] == 1]['Embarked'].value_counts(),
                                  'Died':train[train['Survived'] == 0]['Embarked'].value_counts()}, columns = ['Survived','Died'])
embarked_survival2[['Survived','Died']].plot(kind = 'bar', title='Embarked Survival Comparison', alpha = 0.3, stacked = True, ax=axes[1])

[Figure: Embarked Survival Rate (line plot); Embarked Survival Comparison (stacked bar)]

I also generated two plots to check whether the port of embarkation makes a difference in survival rate. It is surprising that it actually does: passengers who embarked at Cherbourg had a significantly higher chance of survival than passengers who embarked at Queenstown or Southampton.
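
One explanation worth checking before reading too much into the port itself is that Embarked may be entangled with Pclass; a minimal sketch of the class mix per port (my own follow-up check, not part of the original analysis):

#how the passenger classes are distributed across the three ports
print pd.crosstab(train['Embarked'], train['Pclass'])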

Basic Models

With the exploratory visualizations above, we have some idea of what our data looks like. We know that the survival rate is in some way related to Pclass, Sex, Age, Fare and Embarked, but we haven’t had a chance to explore the rest of the predictors. So, let’s start building models and making predictions.

Purely Gender-Based Model

From the exploration above, we know that the most influential predictor in the data set is Sex, as the survival rate differs dramatically between male and female passengers. Given the low survival rate of males and the high survival rate of females, the easiest possible model is to simply predict 0 for male and 1 for female.

#simply create a Survived variable, mapped from the Sex variable
test['Survived'] = test['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
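
Let’s verify the change with a quick look at the first few rows:

test.head()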

The output of the head() function tells us our change to the data frame took effect, and we have already made our first prediction! Next, we just need to export the data to a csv file and submit it to Kaggle.

prediction_file = open("Gender_Model.csv", "wb")
out = csv.writer(prediction_file)
out.writerow(["PassengerId", "Survived"])
for i in range(0, len(test.PassengerId)):
    out.writerow([test.PassengerId[i], test.Survived[i]])
prediction_file.close()
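
As an aside, pandas can write the same submission file in one line, skipping the manual csv loop; an equivalent sketch:

#equivalent one-line export with pandas
test[['PassengerId', 'Survived']].to_csv('Gender_Model.csv', index=False)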

The score of the gender-based prediction was 0.76555, which is surprisingly good for a single line of prediction code! Of course, we are not going to call it a day. Let’s see if we can incorporate Pclass or Age into the model.

Gender-Class Model

While we could build a pivot table and predict the majority outcome for each gender within each Pclass, I’d say it is time to bring in some models. Let’s try our hand at Logistic Regression first, since it is one of the most interpretable classification models in my opinion. We’ll start with an extremely simple case of just 2 predictors: Pclass and Sex.

from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
#set up a formula to predict Survived with Pclass and Sex;
#C() marks a variable as categorical, so patsy expands it into dummy columns
formula1 = 'Survived ~ C(Pclass) + C(Sex)'
train_response,train_data = dmatrices(formula1, data=train, return_type="dataframe")
#this works on test because we added a Survived column to it above
test_response,test_data = dmatrices(formula1, data=test, return_type="dataframe")
LRmodel = LogisticRegression()
LRmodel.fit(train_data.values, train_response.values.ravel())
output = LRmodel.predict(test_data).astype(int)
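
To score this model, the predictions can be exported the same way as before; a sketch, where the file name Gender_Class_Model.csv is my own choice:

#write the logistic regression predictions to a submission file
test['Survived'] = output
test[['PassengerId', 'Survived']].to_csv('Gender_Class_Model.csv', index=False)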

It seems like we got the exact same score, 0.76555, again. Why is that? Shouldn’t adding a predictor change some of our predictions? Here is why.

#the design matrix columns are [Intercept, C(Pclass)[T.2], C(Pclass)[T.3], C(Sex)[T.male]],
#so we enumerate the dummy variables i, j, k and decode them back into Pclass and Sex
for i in range(0,2):
    for j in range(0,2):
        for k in range(0,2):
            if i == 1:
                Pclass = 2
            elif j == 1:
                Pclass = 3
            else:
                Pclass = 1
            if i == j == 1:
                break    #both Pclass dummies set is not a valid encoding
            if k == 1:
                sex = "male"
            else:
                sex = "female"
            print "Given Pclass = %d and Sex = %s, predict %d for survival. " % (Pclass, sex, LRmodel.predict([[1,i,j,k]]))
Given Pclass = 1 and Sex = female, predict 1 for survival. 
Given Pclass = 1 and Sex = male, predict 0 for survival. 
Given Pclass = 3 and Sex = female, predict 1 for survival. 
Given Pclass = 3 and Sex = male, predict 0 for survival. 
Given Pclass = 2 and Sex = female, predict 1 for survival. 
Given Pclass = 2 and Sex = male, predict 0 for survival. 
print train.pivot_table('Survived', index=['Pclass','Sex'], aggfunc='mean')
Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64

It turns out that the predictions stay the same, with or without the Pclass predictor. Logistic Regression computes the probabilities for each response and predicts the response with the largest likelihood. In this case, as we can see from the pivot table above, the most probable survival response is 1 for females and 0 for males, despite the change in survival rate due to Pclass, which is why the predictions stay the same.
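
We can confirm this by looking at the predicted probabilities instead of the hard labels; a minimal sketch, with rows ordered as [Intercept, Pclass=2 dummy, Pclass=3 dummy, male dummy]. The probabilities do shift with Pclass, but within each sex they stay on the same side of the 0.5 threshold, so the hard predictions never change.

#survival probability for each valid (Pclass, Sex) design-matrix row
for row in [[1,0,0,0],[1,1,0,0],[1,0,1,0],[1,0,0,1],[1,1,0,1],[1,0,1,1]]:
    print row, LRmodel.predict_proba([row])[0][1]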

Even though we didn’t improve the model with the addition of the Pclass predictor, our effort was not totally worthless. Now we can add more variables to the Logistic Regression, hoping it will produce a more accurate prediction with the extra data. In the next post, we will explore some more complicated models.