PyCaret 101: ML Workflow Automation

Introduction to PyCaret

Shalini
5 min read · Mar 26, 2022

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive. It is also deployment-ready: every step performed in an ML experiment is captured in a reproducible pipeline that is ready for production. A pipeline can be saved in a binary file format that is transferable across environments.

PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, and many more.

Overview

PyCaret is a modular library, with each module representing a machine learning use case. The following modules are currently available.

  1. Supervised ML
    - Classification
    - Regression
  2. Unsupervised ML
    - Clustering
    - Anomaly Detection
    - Natural Language Processing
    - Association Rules Mining
  3. Time Series (beta)
  4. Datasets

In this blog, I’ll demonstrate the use of PyCaret with binary classification on the Default of Credit Card Clients dataset.

Getting Started

PyCaret can be installed with Python’s pip package manager.

pip install pycaret
#for full version installation
pip install pycaret[full]

Let’s import PyCaret first.

import pycaret

PyCaret ships with several pre-loaded datasets that we can use.

# loading the dataset
from pycaret.datasets import get_data
data = get_data('credit')

Now depending upon the type of experiment we are working on, we can import the relevant modules.

#for Regression
from pycaret.regression import *
#for Classification
from pycaret.classification import *
#for Clustering
from pycaret.clustering import *
#for Anomaly Detection
from pycaret.anomaly import *
#for NLP
from pycaret.nlp import *
#for association rule mining
from pycaret.arules import *

Now the next step is to initialise the setup. It’s mandatory to do so before running any machine learning experiment.

model = setup(data = dataframe_name, target = 'target_variable')

So for our model, we will have the following code for the setup.

# init setup
from pycaret.classification import *
s = setup(data = data, target = 'default', session_id=123)

Compare Models

This function compares all the models available in PyCaret’s library for the given problem type. Every model is trained using its default hyperparameters, and performance metrics are evaluated using cross-validation.

The output of the function is a table showing the average score of all models across the folds. The number of folds can be defined using the fold parameter within the compare_models function; by default, it is set to 10. The table is sorted (highest to lowest) by the metric of choice, which can be defined using the sort parameter. By default, the table is sorted by Accuracy for classification experiments and R2 for regression experiments. Certain models are excluded from the comparison because of their longer run-time. To bypass this exclusion, the turbo parameter can be set to False.
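For example, these parameters can be combined in a single call. A minimal sketch, with illustrative values:

# 5-fold CV, sorted by AUC, with slower models included
best_model = compare_models(fold = 5, sort = 'AUC', turbo = False)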

To select the top n models, pass the n_select parameter to the compare_models function. We can also sort the results by the metric of our choice.

best_model = compare_models(n_select = n, sort = 'AUC')

For now, we will go ahead with the default parameters and see the performance of different classification models on our dataset.

#get the model performances
best_model = compare_models()

Create and Tune Model

As the name suggests, the create_model function trains and evaluates a model using cross-validation, which can be set with the fold parameter. The output is a scoring grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC by fold.

For demonstration purposes, we will just go ahead and create a Random Forest classifier model.

#create model
rf = create_model('rf')
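The default 10-fold cross-validation can be overridden here as well. A minimal sketch, assuming we want 5 folds:

# create a random forest classifier evaluated with 5-fold CV
rf = create_model('rf', fold = 5)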

When a model is created using the create_model function, it is trained with the default hyperparameters. To tune the hyperparameters, the tune_model function is used. This function automatically tunes the hyperparameters of a model using random grid search over a pre-defined search space. We can also supply a custom search grid by passing the custom_grid parameter to the tune_model function.

#tune model
tuned_rf = tune_model(rf)
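A minimal sketch of tuning with a custom grid; the parameter values below are illustrative, not recommendations:

# define a custom search grid for the random forest (illustrative values)
params = {'n_estimators': [100, 200, 300],
          'max_depth': [5, 10, 15]}
tuned_rf = tune_model(rf, custom_grid = params)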

Plot Model

PyCaret also has a plot_model function that can be used to analyze performance across different aspects such as the ROC curve (AUC), confusion matrix, decision boundary, etc. It takes a trained model object as an argument and returns the relevant plot.

# plot the ROC curves (AUC)
plot_model(best_model, plot = 'auc')
# plot the precision-recall curve
plot_model(best_model, plot = 'pr')
# plot feature importance
plot_model(best_model, plot = 'feature')
# plot the confusion matrix
plot_model(best_model, plot = 'confusion_matrix')

Finalize and Save Pipeline

Let’s now finalize the best model, i.e. train it on the entire dataset including the test set, and then save the pipeline as a pickle file.

# finalize the model
final_best = finalize_model(best_model)
# save model to disk
save_model(final_best, 'Binary Classification Model PyCaret')

The save_model function will save the entire pipeline (including the model) as a pickle file on your local disk. By default, it saves the file in the same folder as your notebook or script, but you can pass a complete path as well if you like:

save_model(final_best, 'path/Binary-Classification-Model-PyCaret')
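To reuse the saved pipeline later, it can be loaded back with the load_model function. A minimal sketch:

# load the saved pipeline from disk (pass the name without the .pkl extension)
loaded_model = load_model('Binary Classification Model PyCaret')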

Thus we come to the end of this introductory blog on PyCaret. We have covered the entire machine learning pipeline: data ingestion, pre-processing, model training, hyperparameter tuning, analysing plots, and saving the model for later use.

Hope it helps!
