# Titanic Part II

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 39.2+ KB


### Data Cleaning

In last post, we have done some exploratory analysis of the data and also built two simple models, which gave us decent prediction accuracy. In this post, our question is, Can we do better? We left off last time with Logistic Regression. And hopefully we just need to feed more predicators to the model for a better accuracy. However, before we can do that, we need to clean our data. We see that there are NAs in the Age, Fare, Cabin, and Embarked variables, so we will have to impute these values. For the Cabin variable, there are way too many NAs so imputing all those values would bring it an significant amount of noise. And the Cabin variable itself is hard to interpret anyways. So we will just impute the Age, Fare and Embarked variables.

Not only did we impute the missing values, we also added in lots of additional predicators that could be correlated to the survival rate, since the correlation between survival rate and predicators might not be fully captured by the variables we had. Note that introducing new predicators, especaiily new categorical predicators, would increases the dimensionality of the data, which could potentially impact our prediction accuracy due to curse of dimensionality. Therefore we will use feature selection technique in our models

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ipykernel/__main__.py:16: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point


## Logistic Regression

We got 0.77990 this times, which is an great improvement over the 2 simple models we had in last post. Hooray, moving up the ladder! Let’s take a look at the classifier that gave us the best results.

C(Pclass)[T.2] -0.905477959537
C(Pclass)[T.3] -1.97896459743
C(Embarked)[T.Q] 0.0
C(Embarked)[T.S] -0.299162084187
C(Title)[T.Miss] -0.573336914082
C(Title)[T.Mr] -2.58797777339
C(Title)[T.Mrs] 0.0
C(AgeGroup)[T.child] 0.284813224479
C(AgeGroup)[T.senior] 0.0
Sex -0.78528900026
Age -0.025393089303
SibSp -0.49647079287
Parch -0.266397881617
Family_Size 0.0
Fare 0.00330362443408


We can tell from the coefficients of the predicators that some of the coefficients has been regularized to 0, which means that the predicator was not selected by the model. For the coefficients we can also observe the effect of the predicators. For example, passenger of Pclass 2 or 3would have a lower survival rate compared with passenger of Pclass 1 due to the negative coefficients.

## Random Forest

Now we will move on to the more complicated Ensemble models, which usually offer the best out-of-box prediction accuracy. Random Forest is one of the most commonly-used Ensemble method that is operated by constructing a random multitude of decision trees at training time and outputting the class that is the mode of the classes.

The most important parameters that we can tune for Random Forest is max_features, which is the number of features to consider when looking for the best split. Decreasing the max_features could improve randomness and reduce variance, while increasing the max_features would move the random forest more towards bagging. Min_samples_split and Min_samples_leaf are two stopping criteria of Random Forest that could contribute to overfitting. The GridSearchCV here is an exhaustive search over all parameter combinations for the estimator that produce the best 10-fold CV error. The Random Forest Method achieved an accuracy of 0.77990 again.

This plot illustrates the feature importantces generated from Random Forest. It is interesting that the dummy variable of Title Mr. is the most important feature, while Title Mrs. and Title Miss. are among the least influential features. The Age, Fare, Pclass and Sex variables also rank high in terms of importances, which is expected.

## Blend Models

We have tried our hands at two models so far and they produced the exact same accuracy. So do both of the models make the same prediction on every test passengers? If not, we could somehow blend these two models together for a better result.

[1, 6, 19, 28, 32, 33, 34, 36, 37, 39, 41, 55, 63, 72, 86, 87, 88, 90, 94, 127, 132, 138, 144, 148, 158, 159, 169, 192, 199, 206, 249, 252, 268, 280, 291, 309, 313, 323, 344, 347, 367, 379, 382, 389, 390, 403, 409, 412, 417]
Proportion of the predictions that the two models disagree on is  0.101871101871


It does turn out that the two models didn’t make the same prediction on every passengers, as the output shows that the models diagree on about 10% of prediction. This is quite desirable since we are very confident on the rest 90% of the predictions. Now we just needed to find a way to blend the two models together.

[[ 0.33960866  0.66039134]
[ 0.29814563  0.70185437]
[ 0.6590034   0.3409966 ]
[ 0.53082693  0.46917307]]


For a ensemble method like Random Forest, the interpretbility is very low. However, the Logistic Regression method is actually very intrepretible, as we can compute the probabiliities of the binary responses for each predictions. Here we can see that for the 2nd prediction that the models differed on, the probabilities for 0 and 1 are 0.298 and 0.702, which means that the logistic regression was quite confident that this passengers was gonna survive, whereas for the 5th prediction that the models differed on, the probabilities for 0 and 1 are 0.530 and 0.469, the logistic regression was a lot less confident. Using the probabilities, we could derive an simple method to predict when the two models disagree.

This gives us 0.79904, the best we have had so far and top 25% on the public leaderboard.

## Areas of Improvement

Of course there is still a lot of room for improvement. - Better tunning of parameters on Logistic Regression and Random Forest. Some people could use one single method to achieve 0.81 or above. Maybe include Pipeline in the CV process. - Do a few models and combine them using majority vote. Maybe SVM, or Gradient Boosting. - Better imputation on missing values - Make use of familiy lastnames as a predicator.