Pandas and scikit-learn: secret best friends

Nov. 18, 2017 at 1:50 p.m. in R-M130

The vast majority (all?) of the examples in the scikit-learn documentation use numpy arrays as the dataset container. When your dataset is entirely numerical and your transformations are fairly simple, this works well. However, when your dataset is more complex (a mix of dates, categorical and numerical variables, which is where pandas shines) and your training pipelines are longer, it would be very nice to keep your data in pandas DataFrames and use their power to write most of the steps of your training pipeline. It turns out that scikit-learn supports this out of the box.
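
As a rough illustration of the idea (not the example used in the talk), the sketch below fits a scikit-learn pipeline directly on a pandas DataFrame with one numerical and one categorical column. The data and the column names `age`, `plan` and `churned` are made up for illustration, and `ColumnTransformer` is assumed to be available, which requires a scikit-learn release newer than the ones that existed when this talk was given.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset mixing a numerical and a categorical column.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63],
    "plan": ["basic", "pro", "pro", "basic", "basic", "pro"],
    "churned": [0, 0, 1, 0, 1, 1],
})
X = df[["age", "plan"]]  # stays a DataFrame all the way into fit()
y = df["churned"]

# Columns are selected by name, which only works with DataFrame input.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
```

Selecting columns by name keeps the pipeline readable and avoids tracking positional indices once the data has been converted to an array.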

In this talk, we'll show how easy this is through an interactive example. We'll show how keeping your dataset as a pandas DataFrame until the end lets you prototype transformations using pandas, and then drop those transformations into your model pipeline with minimal work. By the end of the talk, people already using scikit-learn will be able to integrate this technique into their machine learning pipelines to be more productive when working with complex datasets, and will also have an online example to refer to if needed.
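
One way to picture "prototype in pandas, then drop into the pipeline" is the minimal sketch below. It is an assumption about how this could be done, not the talk's own example: a plain function written against a DataFrame is wrapped in scikit-learn's `FunctionTransformer` and used as a pipeline step. The function `add_date_features` and the columns `signup_date`, `visits` and `spend` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_date_features(df):
    """Transformation prototyped with pandas; returns a new DataFrame."""
    signup = pd.to_datetime(df["signup_date"])
    out = pd.DataFrame(index=df.index)
    out["signup_month"] = signup.dt.month
    out["signup_dayofweek"] = signup.dt.dayofweek
    out["visits"] = df["visits"]
    return out


# Hypothetical toy dataset with a date column stored as strings.
df = pd.DataFrame({
    "signup_date": ["2017-01-02", "2017-03-15", "2017-07-04", "2017-11-18"],
    "visits": [3, 10, 7, 12],
    "spend": [20.0, 75.0, 50.0, 90.0],
})
X = df[["signup_date", "visits"]]
y = df["spend"]

# The same function can be called directly while exploring the data...
features = add_date_features(X)

# ...and reused unchanged as a pipeline step. validate=False keeps the
# input as a DataFrame instead of converting it to a numpy array.
pipeline = Pipeline([
    ("date_features", FunctionTransformer(add_date_features, validate=False)),
    ("regressor", LinearRegression()),
])
pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

Because the transformation lives inside the pipeline, it is applied identically at training and prediction time.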

This will be the first time this talk is presented.


Speaker

Christian Hudon

Christian has worked on machine learning and embedded systems for most of his professional life (although not necessarily together). He loves making things that people use and love to use.