In last post, we have done some exploratory analysis of the data and also built two simple models, which gave us decent prediction accuracy. In this post, our question is, Can we do better?
We left off last time with Logistic Regression. And hopefully we just need to feed more predicators to the model for a better accuracy. However, before we can do that, we need to clean our data. We see that there are NAs in the Age, Fare, Cabin, and Embarked variables, so we will have to impute these values. For the Cabin variable, there are way too many NAs so imputing all those values would bring it an significant amount of noise. And the Cabin variable itself is hard to interpret anyways. So we will just impute the Age, Fare and Embarked variables.
defclean_data(df):# drop predicators that we are not going to usedf.drop(['Ticket','Cabin'],inplace=True,axis=1)# encode the sex class to 0, 1df['Sex']=df.Sex.map({'female':0,'male':1}).astype(int)# simply fill the missing Embarked value with "S" since there are only 2 NAsdf['Embarked']=df.Embarked.fillna('S')# there is a 0 value in Fare column, which is unreasonable. Will have to replace it with NA first.df.Fare=df.Fare.map(lambdax:np.nanifx==0elsex)# build a pivot table to impute NAs with averaged Fare for all Pclassfare_pivot_table=df.pivot_table("Fare",index='Pclass',aggfunc='mean',dropna=True)# use pivot table to impute missing fare valuesdf['Fare']=df[['Fare','Pclass']].apply(lambdax:fare_pivot_table[x['Pclass']]ifpd.isnull(x['Fare'])elsex['Fare'],axis=1)#build a pivot table to impute NAs with median Age according to Pclass and Sexage_pivot_table=df.pivot_table('Age',index=['Pclass','Sex'],aggfunc='median',dropna=True)df['Age']=df[['Sex','Pclass','Age']].apply(lambdax:age_pivot_table[x.Pclass,x.Sex]ifpd.isnull(x.Age)elsex.Age,axis=1)# define a age group categorical variabledf['AgeGroup']='adult'df.loc[(df['Age']<=10),'AgeGroup']='child'df.loc[(df['Age']>=60),'AgeGroup']='senior'# define a fare group categorical variabledf['FareGroup']='low'df.loc[(df['Fare']<=20)&(df['Fare']>10),'FareGroup']='mid'df.loc[(df['Fare']<=30)&(df['Fare']>20),'FareGroup']='mid-high'df.loc[(df['Fare']>30),'FareGroup']='high'# Here we are adding interaction terms between predicators. I have included a few reasonable interaction terms in my mind df['Family_Size']=train['SibSp']+train['Parch']+1# Generating the Title predicatordf['Title']=df['Name'].apply(lambdax:getTitle(x))df['Title']=df.apply(getAndReplaceTitle,axis=1)returndfdefgetTitle(name):# use regex to match the titles in passenger names# titles are the characters preceded by a comma and a space (", ") and succeeded by a period (".") in a passenger's name.# Here we use the positive lookbehind assertion to find the preceding (", ") and the positive lookahead assertion to match the succeeding (".").m=re.search('(?<=,\s)[a-zA-z\s]+(?=\.)',name)ifmisNone:returnnp.nanelse:returnm.group(0)defgetAndReplaceTitle(df):# combine similar titles# the majority of the titles are pretty rare, which doesn't give much information about the passenger.title=df['Title']iftitlein['Mr','Don','Major','Capt','Jonkheer','Rev','Col','Sir']:return'Mr'eliftitlein['the Countess','Mme','Mrs','Lady','Dona']:return'Mrs'eliftitlein['Mlle','Ms','Miss']:return'Miss'eliftitle=='Dr':ifdf['Sex']=='Male':return'Mr'else:return'Mrs'else:returntitle
Not only did we impute the missing values, we also added in lots of additional predicators that could be correlated to the survival rate, since the correlation between survival rate and predicators might not be fully captured by the variables we had. Note that introducing new predicators, especaiily new categorical predicators, would increases the dimensionality of the data, which could potentially impact our prediction accuracy due to curse of dimensionality. Therefore we will use feature selection technique in our models
12
train=clean_data(train)test=clean_data(test)
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ipykernel/__main__.py:16: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
1234567
#define a function in order to simplify future predictions and outputdefpredictAndOutput(estimator,output_filename):estimator.fit(train_data.values,train_response.values.ravel())output=estimator.predict(test_data).astype(int)submission=pd.DataFrame({"PassengerId":test["PassengerId"],"Survived":output})submission.to_csv(output_filename,index=False)returnoutput
Logistic Regression
1234567891011121314
fromsklearn.grid_searchimportGridSearchCVfromsklearn.linear_modelimportLogisticRegressionformula='Survived ~ C(Pclass) + Sex + Age + SibSp + Parch + Family_Size + Fare + C(Embarked)+ C(Title) + C(AgeGroup)'train_response,train_data=dmatrices(formula,data=train,return_type="dataframe")test['Survived']=1#insert a dummy survived variable in order to create dmatricestest_response,test_data=dmatrices(formula,data=test,return_type="dataframe")LRModel=LogisticRegression()# Exhaustive search over all parameter combinations for the estimator that produce the best CV errorparam_grid={'penalty':['l1','l2'],'C':[0.001,0.01,0.1,1,10,100,1000]}GridSearch=GridSearchCV(LRModel,param_grid,cv=10)GridSearch.fit(train_data.values,train_response.values.ravel())bestLRclf=GridSearch.best_estimator_LRprediction=predictAndOutput(bestLRclf,"LROutput.csv")
We got 0.77990 this times, which is an great improvement over the 2 simple models we had in last post. Hooray, moving up the ladder! Let’s take a look at the classifier that gave us the best results.
C(Pclass)[T.2] -0.905477959537
C(Pclass)[T.3] -1.97896459743
C(Embarked)[T.Q] 0.0
C(Embarked)[T.S] -0.299162084187
C(Title)[T.Miss] -0.573336914082
C(Title)[T.Mr] -2.58797777339
C(Title)[T.Mrs] 0.0
C(AgeGroup)[T.child] 0.284813224479
C(AgeGroup)[T.senior] 0.0
Sex -0.78528900026
Age -0.025393089303
SibSp -0.49647079287
Parch -0.266397881617
Family_Size 0.0
Fare 0.00330362443408
We can tell from the coefficients of the predicators that some of the coefficients has been regularized to 0, which means that the predicator was not selected by the model. For the coefficients we can also observe the effect of the predicators. For example, passenger of Pclass 2 or 3would have a lower survival rate compared with passenger of Pclass 1 due to the negative coefficients.
Random Forest
Now we will move on to the more complicated Ensemble models, which usually offer the best out-of-box prediction accuracy. Random Forest is one of the most commonly-used Ensemble method that is operated by constructing a random multitude of decision trees at training time and outputting the class that is the mode of the classes.
The most important parameters that we can tune for Random Forest is max_features, which is the number of features to consider when looking for the best split. Decreasing the max_features could improve randomness and reduce variance, while increasing the max_features would move the random forest more towards bagging. Min_samples_split and Min_samples_leaf are two stopping criteria of Random Forest that could contribute to overfitting. The GridSearchCV here is an exhaustive search over all parameter combinations for the estimator that produce the best 10-fold CV error.
The Random Forest Method achieved an accuracy of 0.77990 again.
This plot illustrates the feature importantces generated from Random Forest. It is interesting that the dummy variable of Title Mr. is the most important feature, while Title Mrs. and Title Miss. are among the least influential features. The Age, Fare, Pclass and Sex variables also rank high in terms of importances, which is expected.
Blend Models
We have tried our hands at two models so far and they produced the exact same accuracy. So do both of the models make the same prediction on every test passengers? If not, we could somehow blend these two models together for a better result.
1234567
# Generating a list of test passenger indexes that have different predictions from the two modelsdifferent=list()foriinrange(len(RFprediction)):ifRFprediction[i]!=LRprediction[i]:different.append(i)printdifferentprint"Proportion of the predictions that the two models disagree on is ",len(different)/float(481)
[1, 6, 19, 28, 32, 33, 34, 36, 37, 39, 41, 55, 63, 72, 86, 87, 88, 90, 94, 127, 132, 138, 144, 148, 158, 159, 169, 192, 199, 206, 249, 252, 268, 280, 291, 309, 313, 323, 344, 347, 367, 379, 382, 389, 390, 403, 409, 412, 417]
Proportion of the predictions that the two models disagree on is 0.101871101871
It does turn out that the two models didn’t make the same prediction on every passengers, as the output shows that the models diagree on about 10% of prediction. This is quite desirable since we are very confident on the rest 90% of the predictions. Now we just needed to find a way to blend the two models together.
For a ensemble method like Random Forest, the interpretbility is very low. However, the Logistic Regression method is actually very intrepretible, as we can compute the probabiliities of the binary responses for each predictions. Here we can see that for the 2nd prediction that the models differed on, the probabilities for 0 and 1 are 0.298 and 0.702, which means that the logistic regression was quite confident that this passengers was gonna survive, whereas for the 5th prediction that the models differed on, the probabilities for 0 and 1 are 0.530 and 0.469, the logistic regression was a lot less confident. Using the probabilities, we could derive an simple method to predict when the two models disagree.
12345678910111213
fromcopyimportdeepcopyCombinePrediction=deepcopy(RFprediction)foriinrange(len(different)):# If the probabilties of survival is greater than 0.7, predict survivedifLRProba[different[i]][0]>0.7:CombinePrediction[different[i]]=0# If the probabilties of survival is less than 0.3, predict perished elifLRProba[different[i]][0]<0.3:CombinePrediction[different[i]]=1else:continuesubmission=pd.DataFrame({"PassengerId":test["PassengerId"],"Survived":CombinePrediction})submission.to_csv("CombineOutput.csv",index=False)
This gives us 0.79904, the best we have had so far and top 25% on the public leaderboard.
Areas of Improvement
Of course there is still a lot of room for improvement.
- Better tunning of parameters on Logistic Regression and Random Forest. Some people could use one single method to achieve 0.81 or above. Maybe include Pipeline in the CV process.
- Do a few models and combine them using majority vote. Maybe SVM, or Gradient Boosting.
- Better imputation on missing values
- Make use of familiy lastnames as a predicator.