KAGGLE ENSEMBLING GUIDE
This is a reading digest of the Kaggle Ensembling Guide.
“This is how you win ML competitions: you take other peoples’ work and ensemble them together.” – Vitaly Kuznetsov, NIPS 2014
Creating ensembles from submission files
The most basic and convenient way to ensemble is to ensemble Kaggle submission CSV files.
Voting ensembles
- Ensembles usually improve as you add more ensemble members.
- Create an averaging ensemble from less-correlated submissions (the same idea as a diversified investment portfolio).
- Majority votes make most sense when the evaluation metric requires hard predictions, for instance with (multiclass-) classification accuracy.
- You can give better-performing models more weight in a weighted majority vote.
- Use cases: 1) CIFAR-10 – Object Recognition in Images 2) Forest Cover Type Prediction
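A minimal sketch of a weighted majority vote over hard class labels; the prediction arrays and weights below are invented purely for illustration:

```python
import numpy as np
from collections import Counter

# Hard class predictions from three hypothetical models for five samples.
preds = np.array([
    [0, 1, 1, 2, 0],   # model A (assume this is the strongest single model)
    [0, 1, 2, 2, 0],   # model B
    [1, 1, 1, 2, 2],   # model C
])
weights = [2, 1, 1]    # give the better model a larger vote

def weighted_vote(labels, weights):
    # Tally each label, counting a model's weight instead of 1.
    tally = Counter()
    for label, w in zip(labels, weights):
        tally[label] += w
    return int(tally.most_common(1)[0][0])

ensemble = [weighted_vote(preds[:, i], weights) for i in range(preds.shape[1])]
print(ensemble)  # [0, 1, 1, 2, 0]
```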
Averaging
- Averaging works well for a wide range of problems (both classification and regression) and metrics (AUC, squared error or logarithmic loss).
- Averaging simply means taking the mean of the individual model predictions; this is sometimes called “bagging submissions”.
- Averaging predictions often reduces overfitting, creating a smoother separation between classes.
- Rank Averaging:
- Not all predictors are perfectly calibrated: they may be over- or under-confident when predicting low or high probabilities, or their predictions may cluster around a certain range. Therefore, first turn the predictions into ranks, then average these ranks (a small sketch follows this list).
- Historical Ranks: Ranking requires a whole test set, which is a problem for a single new test sample. Store the old test-set predictions together with their ranks. When you predict a new test sample like “0.35000110”, find the closest old prediction and take its historical rank (in this case rank “3” for “0.35000111”).
- Use cases: 1) Bag of Words Meets Bags of Popcorn – movie sentiment analysis 2) Acquire Valued Shoppers Challenge
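A hedged sketch of rank averaging with scipy; the prediction arrays are made up, and normalizing ranks to [0, 1] is one common choice, not the only one:

```python
import numpy as np
from scipy.stats import rankdata

# Probability predictions from two hypothetical, differently calibrated models.
pred_a = np.array([0.57, 0.04, 0.96, 0.99])
pred_b = np.array([0.35, 0.10, 0.60, 0.92])

# Turn each model's predictions into ranks, normalize to [0, 1], then average.
ranks = [rankdata(p) for p in (pred_a, pred_b)]
rank_avg = np.mean([(r - 1) / (len(r) - 1) for r in ranks], axis=0)
print(rank_avg)  # approx. [0.33, 0.0, 0.67, 1.0]
```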
Stacked Generalization & Blending
Netflix
- Feature-Weighted Linear Stacking
- Combining Predictions for Accurate Recommender Systems
- SVD, Neighborhood Based Approaches, Restricted Boltzmann Machine, Asymmetric Factor Model and Global Effects.
- The BigChaos Solution to the Netflix Prize
- very complicated – grand prize
Stacked Generalization
The basic idea behind stacked generalization is to use a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error. A simple 2-fold stacking recipe:
- Split the train set in 2 parts: train_a and train_b
- Fit a first-stage model on train_a and create predictions for train_b
- Fit the same model on train_b and create predictions for train_a
- Finally fit the model on the entire train set and create predictions for the test set.
- Now train a second-stage stacker model on the probabilities from the first-stage model(s).
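A minimal sketch of these steps with scikit-learn, using a single base model for brevity (in practice you would stack several); all model and dataset choices here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Split the train set in two parts: train_a and train_b.
X_a, X_b, y_a, y_b = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

base = RandomForestClassifier(n_estimators=100, random_state=0)

# Out-of-fold predictions: fit on one half, predict the other.
p_b = base.fit(X_a, y_a).predict_proba(X_b)[:, 1]
p_a = base.fit(X_b, y_b).predict_proba(X_a)[:, 1]

# Refit the base model on the entire train set to create test-set predictions.
p_test = base.fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Train a second-stage stacker on the first-stage probabilities.
stack_X = np.concatenate([p_a, p_b]).reshape(-1, 1)
stack_y = np.concatenate([y_a, y_b])
stacker = LogisticRegression().fit(stack_X, stack_y)
print(stacker.score(p_test.reshape(-1, 1), y_test))
```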
Blending
With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set. The stacker model then trains on this holdout set only.
- Pros:
- It is simpler than stacking.
- It wards against an information leak: The generalizers and stackers use different data.
- You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.
- Cons:
- You use less data overall
- The final model may overfit to the holdout set.
- Your CV is more solid with stacking (calculated over more folds) than using a single small holdout set.
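A hedged sketch of blending with scikit-learn; the base models and the 10% holdout size are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Carve a 10% holdout ("blend") set out of the train set.
X_fit, X_blend, y_fit, y_blend = train_test_split(
    X_train, y_train, test_size=0.1, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    ExtraTreesClassifier(n_estimators=100, random_state=0),
]

blend_features, test_features = [], []
for model in base_models:
    model.fit(X_fit, y_fit)  # the generalizers never see the holdout set
    blend_features.append(model.predict_proba(X_blend)[:, 1])
    test_features.append(model.predict_proba(X_test)[:, 1])

# The stacker trains on the holdout predictions only.
stacker = LogisticRegression().fit(np.column_stack(blend_features), y_blend)
print(stacker.score(np.column_stack(test_features), y_test))
```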
Stacking Methods
- Stacking with logistic regression
- Kaggle use: “Papirusy z Edhellond”
- Kaggle use: KDD-cup 2014
- Stacking with non-linear algorithms
- Popular non-linear algorithms for stacking are GBM, KNN, NN, RF and ET.
- Kaggle use: TUT Headpose Estimation Challenge
- Feature weighted linear stacking
- Quadratic linear stacking of models (a small sketch follows at the end of this section)
- Stacking classifiers with regressors and vice versa
- For instance, one may try a base model with quantile regression on a binary classification problem.
- Stacking unsupervised learned features
- K-Means clustering
- t-SNE
- Online Stacking
- A more concrete example of (semi-) online stacking is with ad click prediction. Models trained on recent data perform better there.
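To make the quadratic linear stacking bullet above concrete, here is a hedged sketch: the base-model predictions are augmented with their pairwise products as extra meta-features and fed to a linear stacker. The toy data and names are invented for illustration:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def quadratic_stack_features(preds):
    """preds: (n_samples, n_models) out-of-fold base-model predictions.
    Returns the predictions plus all pairwise products as meta-features."""
    cols = [preds]
    for i, j in combinations(range(preds.shape[1]), 2):
        cols.append((preds[:, i] * preds[:, j]).reshape(-1, 1))
    return np.hstack(cols)

# Toy out-of-fold predictions from three hypothetical base models.
rng = np.random.default_rng(0)
oof_preds = rng.random((100, 3))
y = (oof_preds.mean(axis=1) > 0.5).astype(int)  # synthetic target, illustration only

stacker = LogisticRegression().fit(quadratic_stack_features(oof_preds), y)
print(stacker.coef_)
```

A non-linear stacker such as a GBM would simply replace the LogisticRegression in the last line.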
Everything is a hyper-parameter
When doing stacking/blending/meta-modeling it is healthy to think of every action as a hyper-parameter for the stacker model.
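One hedged way to act on this with scikit-learn is to put even preprocessing decisions into the stacker's search space; here, whether to standardize the stage-one predictions becomes just another grid-searched parameter (all names and values below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for a matrix of stage-one model predictions.
X_meta, y = make_classification(n_samples=500, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("stacker", LogisticRegression())])
param_grid = {
    "scaler": [StandardScaler(), "passthrough"],  # standardize or not: a hyper-parameter
    "stacker__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_meta, y)
print(search.best_params_)
```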
Model Selection
You can further optimize scores by combining multiple ensembled models.
- There is the ad-hoc approach: Use averaging, voting or rank averaging on manually-selected well-performing ensembles.
- Greedy forward model selection (Caruana et al.). Start with a base ensemble of 3 or so good models. Add the model that increases the train set score the most. By allowing put-back of models, a single model may be picked multiple times (weighting). A small sketch follows this list.
- Genetic model selection uses genetic algorithms and CV-scores as the fitness function.
- I use a fully random method inspired by Caruana’s method: create 100 or so ensembles from randomly selected models (without replacement), then pick the highest-scoring ensemble.
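A hedged sketch of the greedy forward selection described above, scoring candidate ensembles by AUC on a validation set; the toy predictions, the metric and the step count are all assumptions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def greedy_forward_selection(preds, y_val, init=3, n_steps=10):
    """preds: dict of model name -> validation-set predictions.
    Greedily add (with put-back) the model that most improves the ensemble's AUC."""
    ranked = sorted(preds, key=lambda m: roc_auc_score(y_val, preds[m]), reverse=True)
    ensemble = list(ranked[:init])  # start with the best few single models
    for _ in range(n_steps):
        best_score, best_model = -np.inf, None
        for name, p in preds.items():
            candidate = np.mean([preds[m] for m in ensemble] + [p], axis=0)
            score = roc_auc_score(y_val, candidate)
            if score > best_score:
                best_score, best_model = score, name
        ensemble.append(best_model)  # put-back allowed: a model may be picked repeatedly
    return ensemble

# Toy usage with random predictions, purely to show the interface.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)
preds = {f"model_{i}": rng.random(200) for i in range(5)}
print(greedy_forward_selection(preds, y_val, n_steps=5))
```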
Automation
- Wrappers were written to make classifiers like VW, Sofia-ML, RGF, MLP and XGBoost play nicely with the scikit-learn API. (Probably that is the inspiration for DataRobot?)
- Models visualized as a network can be trained using back-propagation: the stacker models then learn which base models reduce the error the most.
- Contextual bandit optimization seems like a good alternative to fully random grid search: we want our algorithm to start exploiting good parameters and models, and to remember that the random SVM it picked last time ran out of memory.
- Automated large ensembles ward against overfitting and add a form of regularization, without requiring much tuning or selection. In principle, stacking could be used by lay-people.
- Another interesting article: http://mlwave.com/human-ensemble-learning/