Workshop Day 1

Creating reproducible data science workflows with DVC and Metaflow

The problem

You’re a student about to finish a cool data science project on identifying multimodal hate speech in internet memes. (Yes, there is such a project, and you can even win a money prize. More info here.)

But when you look at the project folder, it looks like a… ehm… a mess? Data files are all over the place, scripts are uncommented and, most importantly, there is no easy way to reproduce your findings.

Figure 1. Obi-Wan Kenobi is good at answering existential questions

Or maybe you’re working in a team with other students on a group project? How do you make your collaboration smoother, more transparent and more effective? And even if you’re an experienced data scientist who knows how to prepare a Makefile, for example, it is always good to catch up with recent developments that can automate the boring stuff for you.

Now, let’s get back to the solutions. No need to panic, captain! There is help! The world of open-source software is full of elixirs to restore your health bar: Data Version Control (DVC), the recently released Metaflow from Netflix, Kedro, Airflow from Airbnb, and others.

Workshop description

In this workshop we will create a simple workflow that can be easily modified, reproduced and shared with your coworkers. For that we will use two of the tools mentioned above: DVC and Metaflow. Our case study will be about predicting ambulance calls made in The Hague over the last three years with the time series forecasting library fbprophet.

Figure 2. A happy, responsible data scientist who has managed the complexity of a project on making a city greener (image adapted from https://docs.metaflow.org/introduction/why-metaflow)

Prerequisites

New to data science? No worries! We won’t dive too deep into the details of how everything works. Instead we will use the “black box” mindset familiar from functional programming: a function has an input and an output. That’s it :-)
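To make the “black box” idea concrete, here is a minimal sketch in plain Python. It is not the workshop's actual code: the function names and the sample numbers are hypothetical, and a naive moving average stands in for a real forecasting library such as fbprophet. Each step only knows its input and its output, which is what lets you chain steps into a workflow.

```python
from typing import List


def load_calls() -> List[int]:
    """Hypothetical stand-in for loading daily ambulance call counts."""
    # Made-up numbers, for illustration only (one invalid negative value).
    return [12, 15, -1, 11, 14, 13, 16, 13]


def clean(calls: List[int]) -> List[int]:
    """Drop impossible (negative) counts."""
    return [c for c in calls if c >= 0]


def forecast_next(calls: List[int], window: int = 3) -> float:
    """Naive forecast: the mean of the last `window` observations."""
    recent = calls[-window:]
    return sum(recent) / len(recent)


def run_pipeline() -> float:
    """Chain the black boxes: the output of one step is the input of the next."""
    return forecast_next(clean(load_calls()))


if __name__ == "__main__":
    print(run_pipeline())
```

Because every step is an ordinary function, any one of them can be swapped out (for example, replacing `forecast_next` with a real model) without touching the others.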

We would like to ask you to do a single thing: install the Anaconda distribution from here, or make sure that your existing installation works and is up to date. The rest will be done during the workshop.

The workshop materials will be stored here. Keep an eye on it: we will fill it in closer to the date of the workshop.

Date and Time

Tuesday August 18, 2020
16.00 – 17.00

Presenter

Mikhail Sirenko

EPA graduate and researcher at the faculty of Technology, Policy and Management