EazyPredict ML module

codevardhan | Feb 2, 2023

EazyPredict - Running and comparing multiple ML models at once

Welcome to the world of ‘EazyPredict’, a Python module that aims to make trying out multiple prediction algorithms as simple and efficient as possible. The module was heavily influenced by the ‘LazyPredict’ module. I developed this module to address a few shortcomings I identified in LazyPredict.

Why EazyPredict?

Its key features are as follows:

  • The ‘EazyPredict’ module utilizes a limited number of prediction algorithms (10) in order to minimize memory usage and prevent potential issues on platforms such as Kaggle.

  • Users have the option to input a custom list of prediction algorithms (as demonstrated in the example provided) in order to perform personalized comparisons with estimators of their choosing.

  • The models can be saved to an output folder at the user’s discretion and are returned as a dictionary, allowing for easy addition of custom hyperparameters.

  • The top N models can be selected to create an ensemble using a voting classifier.
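
To make the idea concrete, here is a minimal sketch of the kind of loop such a module automates, written in plain scikit-learn. It is not EazyPredict's actual implementation; the synthetic data, the three estimators and names such as `candidates` and `fitted` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# A hypothetical custom list of estimators to compare.
candidates = {
    "GaussianNB": GaussianNB(),
    "RidgeClassifier": RidgeClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

fitted = {}  # fitted models keyed by name, kept for later tuning
rows = []
for name, model in candidates.items():
    model.fit(X_train, y_train)
    fitted[name] = model
    accuracy = accuracy_score(y_test, model.predict(X_test))
    rows.append({"Model": name, "Accuracy": accuracy})

# One row per model, best accuracy first.
results = (
    pd.DataFrame(rows).set_index("Model").sort_values("Accuracy", ascending=False)
)
print(results)
```

Returning the fitted models as a dictionary is what makes it easy to pick one by name afterwards and tune it further.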

Using it for classification

Let’s try it on this introductory problem on Kaggle.

As written on Kaggle:

“This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works. The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.”

First, we need to load the dataset:

import pandas as pd

df = pd.read_csv("data/train.csv")
df.head()

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Before using eazypredict, we need to preprocess the dataset. This includes the following steps:

  • Filling null values
  • Encoding categorical data
  • Scaling the dataset
  • Splitting the training and testing data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, RobustScaler

# Fills null values (assignment form avoids the deprecated method= and inplace= usage)
df["Age"] = df["Age"].bfill()
df["Cabin"] = df["Cabin"].fillna("No Room")
df["Embarked"] = df["Embarked"].fillna("S")

# Encodes categorical data
ord_enc = OrdinalEncoder()

df["Sex_code"] = ord_enc.fit_transform(df[["Sex"]])
df["Cabin_code"] = ord_enc.fit_transform(df[["Cabin"]])
df["Embarked_code"] = ord_enc.fit_transform(df[["Embarked"]])

# Selects features for X and labels for y
X_feat = [
    "Pclass",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Sex_code",
    "Cabin_code",
    "Embarked_code",
]
y_feat = ["Survived"]

X = df[X_feat]
y = df[y_feat]

# Scaling the features
scaler = RobustScaler()
X_norm = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.33, random_state=42
)
X_norm.head()

   Pclass  Age        SibSp  Parch  Fare       Sex_code  Cabin_code  Embarked_code
0  0.0     -0.388889  1.0    0.0    -0.312011  0.0       0.0         0.0
1  -2.0    0.500000   1.0    0.0    2.461242   -1.0      -65.0       -2.0
2  0.0     -0.166667  0.0    0.0    -0.282777  -1.0      0.0         0.0
3  -2.0    0.333333   1.0    0.0    1.673732   -1.0      -91.0       0.0
4  0.0     0.333333   0.0    0.0    -0.277363  0.0       0.0         0.0
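
One caveat about the preprocessing above: the scaler is fitted on the whole dataset before the split, so statistics from the test rows influence the training features. A common alternative, sketched below, is to split first and fit the scaler on the training rows only. The toy frame and the names `X_demo` and `y_demo` are made up for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Toy numeric frame standing in for the Titanic features (made-up values).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    {"Age": rng.normal(30, 12, 100), "Fare": rng.exponential(30, 100)}
)
y_demo = pd.Series(rng.integers(0, 2, 100), name="Survived")

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

scaler = RobustScaler()
# Learn median and IQR from the training rows only...
X_tr_scaled = pd.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# ...and reuse those statistics to transform the test rows.
X_te_scaled = pd.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)
```

On a dataset as small as Titanic the difference in scores is usually minor, but the split-first habit keeps the test-set estimate honest.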

Now we can use the eazypredict module to quickly get the predictions of the top classification algorithms.

clf = EazyClassifier()
model_list, prediction_list, model_results = clf.fit(X_train, X_test, 
                                                     y_train, y_test)

model_results

100%|██████████| 10/10 [00:00<00:00, 10.09it/s]

                        Accuracy  f1 score  ROC AUC score
GaussianNB              0.803390  0.803637  0.797619
MLPClassifier           0.803390  0.800228  0.784524
RandomForestClassifier  0.800000  0.798956  0.788214
LGBMClassifier          0.800000  0.798244  0.785595
RidgeClassifier         0.796610  0.794629  0.781429
XGBClassifier           0.779661  0.779203  0.769762
DecisionTreeClassifier  0.779661  0.778869  0.768452
KNeighborsClassifier    0.769492  0.766785  0.752024
SVC                     0.688136  0.662186  0.640238
SGDClassifier           0.681356  0.669167  0.647619
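
For readers unfamiliar with the three columns, the sketch below computes the corresponding scikit-learn metrics on a tiny made-up example. How EazyPredict computes them internally is not stated here, so the weighted F1 averaging and the use of hard class labels for ROC AUC are assumptions.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up labels and predictions, purely for illustration.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Fraction of predictions that match the true label.
accuracy = accuracy_score(y_true, y_pred)
# Per-class F1 averaged by class support (assumption: weighted averaging).
f1 = f1_score(y_true, y_pred, average="weighted")
# Area under the ROC curve (assumption: computed from hard class labels).
roc_auc = roc_auc_score(y_true, y_pred)

print(f"Accuracy={accuracy:.4f}  F1={f1:.4f}  ROC AUC={roc_auc:.4f}")
```

When a model exposes probabilities or decision scores, passing those to `roc_auc_score` instead of hard labels generally gives a more informative ROC AUC.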

After this, you can select any model from the returned dictionary and tune its hyperparameters.

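import numpy as np
from sklearn.model_selection import GridSearchCV

# Pick one of the fitted models returned by EazyClassifier
gaussian_clf = model_list["GaussianNB"]

# Search 100 log-spaced var_smoothing values between 1 and 1e-9
params_NB = {"var_smoothing": np.logspace(0, -9, num=100)}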
gs_NB = GridSearchCV(
    estimator=gaussian_clf, param_grid=params_NB, verbose=1, scoring="accuracy"
)

gs_NB.fit(X_train, y_train.values.ravel())

gs_NB.best_params_

Fitting 5 folds for each of 100 candidates, totalling 500 fits

{'var_smoothing': 8.111308307896873e-06}
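
Once the search has finished, the refitted best model can be checked on the held-out split. The sketch below is self-contained and uses synthetic data plus a smaller grid than the one above to stay fast; the names `search`, `best_model` and `test_accuracy` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data stands in for the Titanic features (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# 10 log-spaced candidates from 1 down to 1e-9.
param_grid = {"var_smoothing": np.logspace(0, -9, num=10)}
search = GridSearchCV(GaussianNB(), param_grid=param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on all of X_train.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(test_accuracy, 3))
```

Comparing the cross-validated score with the held-out score is a quick sanity check that the tuned setting generalises.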

Using it for regression

Regression works in much the same way as above. You only need to import the EazyRegressor estimator instead.

More details can be found here.
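
For intuition, the same compare-many-models pattern applied to regression might look like the sketch below in plain scikit-learn. It does not use EazyRegressor and makes no claim about its output columns; the synthetic data, the estimators and the R2/RMSE metrics are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# A hypothetical set of regressors to compare.
regressors = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

rows = []
for name, model in regressors.items():
    predictions = model.fit(X_train, y_train).predict(X_test)
    rows.append(
        {
            "Model": name,
            "R2": r2_score(y_test, predictions),
            "RMSE": mean_squared_error(y_test, predictions) ** 0.5,
        }
    )

# One row per regressor, highest R2 first.
results = pd.DataFrame(rows).set_index("Model").sort_values("R2", ascending=False)
print(results)
```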

Creating an ensemble model

This is arguably the most useful feature of the library: an ensemble can often produce a strong model with minimal effort spent on hyperparameter tuning.

All you need to do is to pass the results and the model names from the previous “fit” step to the next one.

clf = EazyClassifier()

model_list, prediction_list, model_results = clf.fit(X_train, X_test, y_train, y_test)

ensemble_clf, ensemble_results = clf.fitVotingEnsemble(model_list, model_results)
ensemble_results

100%|██████████| 10/10 [00:01<00:00, 6.68it/s]

   Models                                             Accuracy  F1 score  ROC AUC score
0  GaussianNB LGBMClassifier RidgeClassifier MLPC...  0.816949  0.758929  0.799881
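
For reference, a voting ensemble of the top N models can be sketched with scikit-learn's VotingClassifier. This is an illustration under assumptions (synthetic data, hard voting, ranking by test accuracy), not necessarily what fitVotingEnsemble does internally; names such as `top_n` and `top_names` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a real dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

candidates = {
    "GaussianNB": GaussianNB(),
    "RidgeClassifier": RidgeClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Score every candidate individually.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Keep the names of the N best-scoring models.
top_n = 3
top_names = sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hard voting takes the majority class, so it also works with models
# that lack predict_proba (such as RidgeClassifier).
ensemble = VotingClassifier(
    estimators=[(name, candidates[name]) for name in top_names], voting="hard"
)
ensemble.fit(X_train, y_train)  # VotingClassifier refits clones of each estimator
ensemble_accuracy = accuracy_score(y_test, ensemble.predict(X_test))

print(top_names, round(ensemble_accuracy, 3))
```

Ranking on a separate validation split or with cross-validation, rather than on the test set as done here for brevity, avoids an optimistic bias in the final score.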

Conclusion

In conclusion, ‘EazyPredict’ is an efficient and user-friendly Python module that makes trying out multiple prediction algorithms a breeze. Its memory-efficient design and customizable options make it a valuable tool for any data scientist or machine learning enthusiast. I hope you enjoy using ‘EazyPredict’ as much as I enjoyed creating it.

Check out the entire project on Github or PyPI.
