Hiring Data Scientists: A Classification Problem

Ethan Rosenthal
Published in Making Dia
Oct 28, 2016

It is easy as an interviewee to bash the entire interviewing process: it takes forever, it’s painful, and it can be demoralizing. It is just as easy for interviewers to lament the difficulty of finding and assessing good talent. Being in New York City, this reminds me of the contention between drivers and pedestrians. We can switch between either role but will continually complain about the opposing one!

With this in mind, we have strived to make the data science interviewing process at Dia&Co respectful of the candidate while still serving its purpose of identifying people who match our needs. A key part of this process is the Data Challenge.

Take-home tests, data challenges, data assessments, or whatever else you want to call them have become fashionable as a step in data science interviews. As the Data team at Dia&Co has experience with many other companies’ interview processes, we have built up a wealth of opinions on what works and what does not for the Data Challenge.

What the Challenge Is

From our side, the purpose of the Data Challenge is:

To assess how well the candidate translates business problems into machine learning solutions, and to get an understanding of their coding ability and style.

However, the challenge also gives the candidate insights into our business and some problems that we may be thinking about. After all, interviewing goes both ways, especially right now when the data science market is so hot.

The challenge itself takes the form of some data and a business problem presented to the candidate. The data is fake, but it is modeled on realistic data that a company would have. By contrast, some companies send out data with anonymized features (e.g. “feature_1”, “feature_2”), which feels largely academic and does not test domain or business thinking. The candidate is then asked to write a script to solve the problem and include a brief writeup of their approach.

What the Challenge is Not

We spent considerable time thinking about what the challenge should not do, having been burned by poor processes at other companies in the past. Primarily, the challenge serves as the minimum bar that we set for moving candidates past this stage of the interview. Because of this, we do not give extra consideration for going above and beyond the requirements of the test: we do not wish to favor candidates with infinite free time. Nor do we wish to assess feature engineering prowess or how accurate one can make their machine learning model, because we do not want candidates to spend all day on the challenge. While these skills can be important to the job, they easily lead to scope creep in the data challenge, and we have other skills that we would rather prioritize.

All of this is an attempt to be respectful of the candidate’s time. Consequently, we expect the challenge to take less than 3 hours to complete. Many companies claim an upper bound on the time that they expect candidates to spend on their take-home tests, but rarely have we found cases where this is actually true. In fact, I once completed half of a test while taking twice the stated amount of time, and the company moved me on to the next stage! In this case, the test length should have been cut in half (at least).

Classifying

Once the challenge is completed and returned by the candidate, we have a strict rubric for grading. The rubric gives approximately equal weighting to coding, machine learning, and business know-how. The process is then somewhat similar to a classification problem. The scores on each section of the rubric are our features, we add them up as our model, and then a threshold is used to determine Pass vs. Not Pass. Of course, this threshold is a key parameter for trading off between precision and recall. We do not want to waste time on false positive candidates, nor do we want to miss out on false negatives. We even employ an “ensemble model” and have multiple people grade in order to minimize bias and variance.
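To make the analogy concrete, here is a minimal sketch of the rubric-as-classifier idea in Python. Everything specific in it is an assumption made for illustration, not our actual rubric: the section names, the 0 to 5 score range, the threshold of 9, and the choice to average the graders' totals as the "ensemble".

```python
from statistics import mean

# Hypothetical rubric sections, each scored 0-5 (assumed scale).
# Equal weighting is approximated by summing the sections unweighted.
SECTIONS = ("coding", "machine_learning", "business")
PASS_THRESHOLD = 9  # assumed cutoff out of a maximum of 15


def total_score(scores):
    """The 'model': add up the per-section rubric scores (the features)."""
    return sum(scores[section] for section in SECTIONS)


def classify(grader_scores, threshold=PASS_THRESHOLD):
    """Average the totals from several graders, then apply the threshold."""
    ensemble_total = mean(total_score(s) for s in grader_scores)
    return "Pass" if ensemble_total >= threshold else "Not Pass"


# Three hypothetical graders scoring the same submission.
graders = [
    {"coding": 4, "machine_learning": 3, "business": 3},
    {"coding": 3, "machine_learning": 3, "business": 4},
    {"coding": 4, "machine_learning": 2, "business": 3},
]
print(classify(graders))
```

Raising `threshold` lets fewer false positives through at the cost of more false negatives, and lowering it does the reverse, which is the precision versus recall tradeoff described above.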

Maybe the entire process can be thought of as an operations research problem: maximizing the number of hires who will be successful on the job, subject to constraints on money, time, and everything else. Our data challenge classification problem serves as an input into the global optimization problem, and synthesizing machine learning and operations is something we are very excited about here at Dia&Co.

If you would like to work on this synthesis, then please check out our careers page because we’re hiring! And let me know what you think of the challenge :)

Data Scientist at Dia&Co. Formerly at Birchbox (modeling, data and otherwise) and Columbia (physics PhD).