Reference Guide for Writing Productized Code Using scikit-learn Pipelines

A friendly guide to scikit-learn pipelines

raghava
3 min read · Sep 20, 2020

Introduction:

Okay! Who wouldn’t want to build a production baseline model, iterate on top of it easily, and deploy that same code? Everyone, right?

So in this blog post, I am going to share my experience of using scikit-learn Pipelines to build productized code, so that you don’t end up spending as much time on it as I did :)

How will it be useful? I cover the whole iterative process of building pipelines, so you get the big picture and can start coding instead of spending more time on research.

Basics:

First, let's get some concepts clear.

All functionality in scikit-learn is grouped into two categories, and they are designed so beautifully to share the same interface: Transformers and Estimators.

Transformers:

Transformers are used for data preparation and implement the fit and transform methods.

All data preprocessing functionality in scikit-learn follows this architecture.

Ex: for standardization, StandardScaler’s fit method learns the mean and standard deviation, which are then applied to whatever data is passed to the transform method.
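As a minimal sketch of that contract (the numbers are made up):

```python
# fit learns statistics from the training data; transform applies them
# to any data it is given — including data it has never seen.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                  # learns mean_ and scale_
X_scaled = scaler.transform(X_new)   # standardizes using the learned stats
print(scaler.mean_, X_scaled)
```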

If for any use case the built-in scikit-learn transformers are not sufficient, we can build a custom transformer by extending the base classes.
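For illustration, here is a hypothetical custom transformer (ColumnClipper is my own made-up example, not part of scikit-learn) built on the BaseEstimator and TransformerMixin base classes:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: learn per-column percentile bounds
    in fit, then clip values to those bounds in transform."""

    def __init__(self, lower=5, upper=95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # learned state is stored with a trailing underscore, by convention
        self.lo_ = np.percentile(X, self.lower, axis=0)
        self.hi_ = np.percentile(X, self.upper, axis=0)
        return self  # returning self lets it chain inside a Pipeline

    def transform(self, X):
        return np.clip(X, self.lo_, self.hi_)
```

Extending TransformerMixin also gives you fit_transform for free.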

Estimators:

Estimators are used for modelling and implement the fit and predict methods.

All modelling algorithms in scikit-learn follow this architecture.

Ex: LogisticRegression’s fit method finds the coefficients of the regression formula, and predict makes predictions based on the stored coefficients.
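The same idea on the estimator side, with a tiny made-up dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)             # learns coef_ and intercept_
preds = clf.predict(X)    # predictions from the stored coefficients
print(clf.coef_, preds)
```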

If you want to understand how these are constructed, pick your favourite transformer/estimator and look at its source code in scikit-learn.

Now, any typical machine learning cycle involves calling transformers on different columns and combining the results to get the feature vector ready, then passing it on to an estimator for training and making predictions. Pipelines come in handy to make this process easy.
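A sketch of that full cycle in one Pipeline (the column names and data are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [22, 35, 58, 44],
                   "city": ["NY", "SF", "NY", "LA"],
                   "churned": [0, 0, 1, 1]})

# per-column transformers, combined into one feature vector
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# the combined features feed straight into the estimator
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression())])
model.fit(df[["age", "city"]], df["churned"])
print(model.predict(df[["age", "city"]]))
```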

Refer here for a nice introduction.

From here on, this post contains questions and references to quickly get you thinking about the different possibilities.

FeatureUnion vs ColumnTransformer

Reference link

So to sum it up: ColumnTransformer accepts column names directly, but with FeatureUnion we have to route each branch through a custom transformer that does the column filtering internally.
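A side-by-side sketch of the two approaches (ColumnSelector is a hypothetical helper I wrote for this, not a scikit-learn class):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper that filters DataFrame columns internally."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.columns]

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# ColumnTransformer: column names go in directly
ct = ColumnTransformer([("scale_a", StandardScaler(), ["a"])])

# FeatureUnion: the filtering happens inside the branch
fu = FeatureUnion([("scale_a", make_pipeline(ColumnSelector(["a"]),
                                             StandardScaler()))])

print(ct.fit_transform(df))
print(fu.fit_transform(df))  # same result either way
```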

FunctionTransformer vs custom transformers

Well, if you have already defined functions for cleaning the data and just want to use them in a pipeline, you can wrap them with FunctionTransformer (refer here) and pass your functions into the respective pipeline.

And if you are only performing a transform, with no parameters being persisted while transforming, it is better to go with FunctionTransformer.
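For example, a stateless cleaning function wrapped with FunctionTransformer (log1p_clean is just an invented stand-in for an "already defined" function):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def log1p_clean(X):
    # a pre-existing cleaning step: compress skewed positive values
    return np.log1p(X)

log_tf = FunctionTransformer(log1p_clean)  # nothing is learned in fit
X = np.array([[0.0], [np.e - 1]])
print(log_tf.fit_transform(X))
```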

Practical Tips:

  1. ColumnTransformer — has a remainder parameter which decides what to do with the remaining columns (passthrough, drop, etc.).
  2. ColumnTransformer — the columns argument accepts a single column name (string) or a list of columns, which passes 1-D or 2-D data respectively to the transformer, so choose the appropriate format based on what the transformer expects.
  3. The order of the transformers decides the order of the features in the final feature vector, so be careful.
  4. The output of a transformer is always a 2-D matrix, so we can always return a DataFrame at the end of each transform.
  5. If multiple transformations are needed for some column, chain the transformers in a Pipeline and pass that Pipeline to the ColumnTransformer.
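Tips 1, 2 and 5 can be sketched together (the DataFrame is made up; CountVectorizer is used for the string-column case because it expects the 1-D input that a single column name provides):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [20.0, None, 40.0],
                   "text": ["red car", "blue bike", "red bike"],
                   "id": [1, 2, 3]})

# Tip 5: impute then scale the same column via a nested Pipeline
num_pipe = Pipeline([("impute", SimpleImputer(strategy="mean")),
                     ("scale", StandardScaler())])

ct = ColumnTransformer(
    [("num", num_pipe, ["age"]),          # Tip 2: list -> 2-D input
     ("txt", CountVectorizer(), "text")], # Tip 2: string -> 1-D input
    remainder="drop",                     # Tip 1: drop the leftover "id"
)
out = ct.fit_transform(df)
print(out.shape)  # 1 scaled column + 4 bag-of-words columns
```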

Refer here for a detailed description of some of the tips mentioned.

I hope this serves as a reference guide while using scikit-learn Pipelines. Thanks!!


raghava

NLP developer at a startup who loves to work with text data