EazyPredict - Running and comparing multiple ML models at once
Welcome to the world of ‘EazyPredict’, a Python module that aims to make trying out multiple prediction algorithms as simple and efficient as possible. The module was heavily influenced by the ‘LazyPredict’ module. I developed this module to address a few shortcomings I identified in LazyPredict.
Why EazyPredict?
Some of its key features are:
- It uses a limited set of prediction algorithms (10) to keep memory usage low and avoid potential issues on platforms such as Kaggle.
- You can pass in a custom list of prediction algorithms (as demonstrated in the example provided) to run comparisons with estimators of your own choosing.
- The models can optionally be saved to an output folder and are returned as a dictionary, which makes it easy to add custom hyperparameters afterwards.
- The top N models can be combined into an ensemble using a voting classifier.
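To make the "returned as a dictionary" point concrete, here is a minimal sketch that uses plain scikit-learn and joblib rather than EazyPredict itself. The `model_list` dictionary, the synthetic data, and the `.joblib` file names are all assumptions made for illustration; they only mimic the shape of what the module is described as returning.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic data, used only so the sketch is self-contained
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical stand-in for a dictionary of fitted models keyed by name
model_list = {
    "GaussianNB": GaussianNB().fit(X, y),
    "RidgeClassifier": RidgeClassifier().fit(X, y),
}

# Pick one model by name and re-fit a copy with a custom hyperparameter
tuned_ridge = clone(model_list["RidgeClassifier"]).set_params(alpha=10.0).fit(X, y)

# Save every model to an output folder, then load one back
with tempfile.TemporaryDirectory() as out_dir:
    for name, model in model_list.items():
        joblib.dump(model, str(Path(out_dir) / f"{name}.joblib"))
    reloaded = joblib.load(str(Path(out_dir) / "GaussianNB.joblib"))
    same_predictions = bool(
        (reloaded.predict(X) == model_list["GaussianNB"].predict(X)).all()
    )
```

Because the models are ordinary scikit-learn estimators, `clone` and `set_params` work on any entry of such a dictionary.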
Using it for classification
Let’s try it on this introductory problem on Kaggle.
As described on Kaggle:
“This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works. The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.”
First, we need to load the dataset:
import pandas as pd
df = pd.read_csv("data/train.csv")
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Before using EazyPredict, we need to preprocess the dataset. This includes the following steps:
- Filling null values
- Encoding categorical data
- Scaling the dataset
- Splitting the training and testing data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, RobustScaler

# Fills null values
df["Age"] = df["Age"].bfill()
df["Cabin"] = df["Cabin"].fillna("No Room")
df["Embarked"] = df["Embarked"].fillna("S")
# Encodes categorical data
ord_enc = OrdinalEncoder()
df["Sex_code"] = ord_enc.fit_transform(df[["Sex"]])
df["Cabin_code"] = ord_enc.fit_transform(df[["Cabin"]])
df["Embarked_code"] = ord_enc.fit_transform(df[["Embarked"]])
# Selects features for X and labels for y
# (feature names taken from the columns of the scaled table shown after this code)
X_feat = [
    "Pclass", "Age", "SibSp", "Parch", "Fare",
    "Sex_code", "Cabin_code", "Embarked_code",
]
y_feat = ["Survived"]
X = df[X_feat]
y = df[y_feat]
# Scales the features
scaler = RobustScaler()
X_norm = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
# Splits into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.33, random_state=42
)
 | Pclass | Age | SibSp | Parch | Fare | Sex_code | Cabin_code | Embarked_code |
0 | 0.0 | -0.388889 | 1.0 | 0.0 | -0.312011 | 0.0 | 0.0 | 0.0 |
1 | -2.0 | 0.500000 | 1.0 | 0.0 | 2.461242 | -1.0 | -65.0 | -2.0 |
2 | 0.0 | -0.166667 | 0.0 | 0.0 | -0.282777 | -1.0 | 0.0 | 0.0 |
3 | -2.0 | 0.333333 | 1.0 | 0.0 | 1.673732 | -1.0 | -91.0 | 0.0 |
4 | 0.0 | 0.333333 | 0.0 | 0.0 | -0.277363 | 0.0 | 0.0 | 0.0 |
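The scaled values above are centered around zero because `RobustScaler`, with its default settings, subtracts each column's median and divides by its interquartile range. A small sketch on made-up numbers (not the Titanic data) shows the same arithmetic done by hand:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One made-up feature column with an obvious outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: (x - median) / (75th percentile - 25th percentile)
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

# Same computation through scikit-learn
scaled = RobustScaler().fit_transform(values)

print(scaled.ravel())
```

Since the median and quartiles barely move when an outlier is present, the bulk of the data stays in a small range while the outlier remains visibly large, which is the usual reason to prefer this scaler for columns like Fare.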
Now we can use the EazyPredict module to quickly get the predictions of the top classification algorithms.
clf = EazyClassifier()
model_list, prediction_list, model_results = clf.fit(X_train, X_test,
y_train, y_test)
100%|██████████| 10/10 [00:00<00:00, 10.09it/s]
 | Accuracy | F1 score | ROC AUC score |
GaussianNB | 0.803390 | 0.803637 | 0.797619 |
MLPClassifier | 0.803390 | 0.800228 | 0.784524 |
RandomForestClassifier | 0.800000 | 0.798956 | 0.788214 |
LGBMClassifier | 0.800000 | 0.798244 | 0.785595 |
RidgeClassifier | 0.796610 | 0.794629 | 0.781429 |
XGBClassifier | 0.779661 | 0.779203 | 0.769762 |
DecisionTreeClassifier | 0.779661 | 0.778869 | 0.768452 |
KNeighborsClassifier | 0.769492 | 0.766785 | 0.752024 |
SVC | 0.688136 | 0.662186 | 0.640238 |
SGDClassifier | 0.681356 | 0.669167 | 0.647619 |
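For readers curious about what a comparison like this involves, the following is a rough sketch in plain scikit-learn. It is not EazyPredict's actual implementation: the synthetic dataset, the three candidate models, the weighted F1 average, and computing ROC AUC from hard predictions are all assumptions chosen to produce a table of the same shape.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Hypothetical candidate list; any scikit-learn classifiers would do
candidates = {
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

fitted = {}  # fitted models keyed by name
rows = {}    # one row of metrics per model
for name, model in candidates.items():
    fitted[name] = model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rows[name] = {
        "Accuracy": accuracy_score(y_test, preds),
        "F1 score": f1_score(y_test, preds, average="weighted"),
        "ROC AUC score": roc_auc_score(y_test, preds),
    }

# Rank the models by accuracy, best first
results = pd.DataFrame.from_dict(rows, orient="index").sort_values(
    "Accuracy", ascending=False
)
print(results)
```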
After this, you can pick any model from the returned dictionary and tune its hyperparameters.
import numpy as np
from sklearn.model_selection import GridSearchCV

gaussian_clf = model_list["GaussianNB"]
# Searches 100 log-spaced values of var_smoothing between 1 and 1e-9
params_NB = {"var_smoothing": np.logspace(0, -9, num=100)}
gs_NB = GridSearchCV(
    estimator=gaussian_clf, param_grid=params_NB, verbose=1, scoring="accuracy"
)
gs_NB.fit(X_train, y_train.values.ravel())
gs_NB.best_params_
Fitting 5 folds for each of 100 candidates, totalling 500 fits
{'var_smoothing': 8.111308307896873e-06}
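Once a search has finished, the refitted best model is available for evaluation on held-out data. Here is a self-contained sketch of that follow-up step, assuming synthetic data and a smaller grid than the one above so that it runs quickly; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data, used only so the sketch runs on its own
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

search = GridSearchCV(
    estimator=GaussianNB(),
    param_grid={"var_smoothing": np.logspace(0, -9, num=20)},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

# best_estimator_ is refitted on the full training set by default (refit=True)
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, round(search.best_score_, 3), round(test_accuracy, 3))
```

Comparing `best_score_` (the mean cross-validated accuracy) with the test accuracy gives a quick sense of whether the tuned model generalizes.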
Using it for regression
It can be used for regression in pretty much the same way as above. You just need to import the EazyRegressor estimator.
More details can be found here.
Creating an ensemble model
This is arguably the most useful feature of the library, since an ensemble can often produce a strong model with little effort spent on hyperparameter tuning.
All you need to do is pass the model dictionary and the results table from the previous “fit” step to the next one.
clf = EazyClassifier()
model_list, prediction_list, model_results = clf.fit(X_train, X_test, y_train, y_test)
ensemble_reg, ensemble_results = clf.fitVotingEnsemble(model_list, model_results)
100%|██████████| 10/10 [00:01<00:00, 6.68it/s]
 | Models | Accuracy | F1 score | ROC AUC score |
0 | GaussianNB LGBMClassifier RidgeClassifier MLPC... | 0.816949 | 0.758929 | 0.799881 |
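The general idea of a top-N voting ensemble can be sketched with scikit-learn's `VotingClassifier`. This is a sketch under stated assumptions, not the internals of `fitVotingEnsemble`: the synthetic data, the four candidate models, ranking by test accuracy, and the choice of `top_n = 3` are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Hypothetical candidate models
candidates = {
    "GaussianNB": GaussianNB(),
    "RidgeClassifier": RidgeClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

# Score each candidate individually
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Keep the names of the top N models by accuracy
top_n = 3
top_names = sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hard voting: each selected model casts one vote per sample
ensemble = VotingClassifier(
    estimators=[(name, candidates[name]) for name in top_names],
    voting="hard",
)
ensemble.fit(X_train, y_train)
ensemble_accuracy = accuracy_score(y_test, ensemble.predict(X_test))
print(top_names, round(ensemble_accuracy, 3))
```

Hard voting is used here because some classifiers, such as `RidgeClassifier`, do not expose `predict_proba`, which soft voting would require.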
In conclusion, ‘EazyPredict’ is an efficient and user-friendly Python module that makes trying out multiple prediction algorithms a breeze. Its memory-efficient design and customizable options make it a valuable tool for any data scientist or machine learning enthusiast. I hope you enjoy using ‘EazyPredict’ as much as I enjoyed creating it.