Université du Québec à Montréal, Montreal, Quebec, Canada
November 18-21, 2017
At Shopify we have over 3000 Python batch ETL jobs. These jobs depend upon each other’s output, forming a directed acyclic graph that, when visualized, is indistinguishable from the hairballs that my cat pukes up.
These jobs are created by a team of over 100 analysts and engineers who deploy, on average, 15 changes to production per day. With so many people and such a rapid pace of change, understanding how a dataset is constructed, debugging relationships, tracing the flow of data, or even just asking how prevalent a feature or type of relationship is has become a daunting task that requires tracing through 20k lines of YAML schedule files and 50k lines of Python code.
To make asking questions about these jobs tractable, we’ve created a series of CLI tools that, when combined with Unix tools, make answering questions about our schedule possible.
I’ll cover how we flatten that graph into a series of tables that we can output using a CLI tool and then how one can use grep, awk, sort, join, and column to answer some real questions that we had about our schedule.
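To give a flavour of what flattening a graph into a table can look like, here is a minimal sketch. It is not our actual tooling: the `SCHEDULE` dictionary, the job names, and the `flatten_edges` and `write_table` helpers are all hypothetical, and a real schedule would be loaded from the YAML files rather than written inline. It emits one tab-separated row per dependency edge, which is the shape that line-oriented Unix tools work well with.

```python
import sys

# Hypothetical schedule: job name -> list of upstream jobs whose output it reads.
SCHEDULE = {
    "raw_orders": [],
    "raw_shops": [],
    "orders_clean": ["raw_orders"],
    "shop_orders": ["orders_clean", "raw_shops"],
}


def flatten_edges(schedule):
    """Return sorted (job, upstream) rows, one per dependency edge."""
    return sorted(
        (job, upstream)
        for job, upstreams in schedule.items()
        for upstream in upstreams
    )


def write_table(rows, out=sys.stdout):
    """Write rows as tab-separated lines so grep, awk, sort and join can read them."""
    for row in rows:
        out.write("\t".join(row) + "\n")


if __name__ == "__main__":
    write_table(flatten_edges(SCHEDULE))
```

Saved under a hypothetical name such as `edges.py`, the output can be piped through `cut -f2 | sort | uniq -c | sort -rn` to rank upstream jobs by how many other jobs read from them.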
I lead one of our data engineering teams, which maintains our PySpark-based ETL framework at Shopify in Ottawa, Canada. Since joining Shopify over a year ago and pivoting from primarily working in Java to Python, I’ve taken an interest in developing a minimalist programming experience: moving from using Eclipse as an editor to vi/emacs, from relying on my editor to tell me about a software project to using Unix tools instead, and trying as much as possible to interact only with the CLI when developing.