ML/Data Science article 1#

Why you should contribute to open-source as a data scientist#

Publisher: Medium
Publishing Date: Jun 11, 2020

Image: "contribute". Courtesy: Markus Spiske.

I have had the privilege of using Scikit-learn for my data science projects and it was one of the first tools that I learned in data science. It is very powerful because it allows the user to conduct predictive analytics using various mathematical models ranging from older models such as Linear Regression to newer ones such as Neural Networks.

It was therefore a great honor to participate in the 2020 Scikit-learn online sprint organized by Reshama Shaikh (https://twitter.com/reshamas) of Data Umbrella, and learn how to contribute to open-source projects. It was such a wonderful experience that I was inspired to write this article, hoping that you also get inspired to contribute to open-source.


You should contribute to open-source to improve your understanding of mathematical modelling.#

This was probably the strongest reason for me. I wanted to improve my understanding of Scikit-learn from a developer's perspective, not only a user's. Furthermore, I am currently learning applied statistics and I wanted to connect the dots between what I am learning in class and how those concepts are reflected in this Python package. I have used the preprocessing module in the past, but actually getting to contribute to its documentation made me relearn scaling techniques that I had previously been exposed to. Also, ‘looking under the hood’ helped me appreciate the rationale behind the code.
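To make the scaling techniques mentioned above a little more concrete, here is a minimal plain-Python sketch of two common ones: standardization (zero mean, unit variance) and min-max scaling. The function names `standardize` and `min_max_scale` and the sample data are my own, made up for illustration; this is not Scikit-learn's code. In Scikit-learn itself, the preprocessing module offers `StandardScaler` and `MinMaxScaler` for these two ideas.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores).

    Hypothetical helper for illustration, not scikit-learn's API.
    """
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    if std == 0:
        # A constant feature has no spread; returning zeros is a choice
        # made for this sketch.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so they span the range [low, high].

    Hypothetical helper for illustration, not scikit-learn's API.
    """
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # Constant feature: map everything to the lower bound (sketch choice).
        return [low for _ in values]
    return [low + (v - smallest) * (high - low) / (largest - smallest)
            for v in values]


# Made-up sample data, purely for demonstration.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

z_scores = standardize(heights_cm)
unit_range = min_max_scale(heights_cm)

print(z_scores)
print(unit_range)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Standardization centres the data and measures each value in standard deviations from the mean, while min-max scaling keeps the shape of the distribution and only squeezes it into a fixed range.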

Contributing to open-source improves your development collaboration skills.#

For 4 hours, we collaborated virtually with a pair programming partner. A lot of data science work can be done in isolation, but having a fellow collaborator share the problems they are facing helps you avoid the same mistakes. You can also help unblock each other by focusing on solving the same problem. For instance, thanks to my programming partner, I was able to figure out which parts of the Microsoft Visual Studio Build Tools I should install, which saved time (you need a C++ compiler to build scikit-learn from source in your local environment). It also felt good for both of us, who had never collaborated on a project on GitHub before, to have our pull requests merged into the main repository.

You should contribute to open-source because you get the rare opportunity to improve a tool that you already use.#

One of the cool things about open-source development is that it allows you to be both developer and user at the same time. This means that as your knowledge of the tool grows, you are able to:

  1. define your user pain points

  2. come up with an idea that can solve those pain points

  3. go back to the main repository to check for an issue raised by other users who have the same pain point

  4. code the solution and submit a pull request for the core development team to review

  5. feel awesome when your pull request is merged to the main repository.

Video tutorials were really helpful in guiding me through setting up the development environment and making a pull request.

Finally, you should contribute to open source because you get to interact with awesome people.#

The open source community is a very welcoming space where you can ask ‘stupid’ questions and receive help. Beginners in the data space, even after learning languages such as Python and R, may feel intimidated about contributing to a hackathon or, in this case, a Scikit-learn sprint, where you have only 4 hours to come up with a solution and submit a pull request. I would like to reassure you, dear reader, that you will have a community of mentors and representatives from the core development team, as well as your pair programming partner, to assist you in finding relevant issues, changing the code, running the tests, creating a pull request and having that pull request reviewed. I was glad to see that all the issues that my pair programming partner and I contributed to received encouragement and helpful critiques of our work.


I hope that this article was helpful and that it gave you the nudge to look for opportunities to contribute to open-source projects. I want to stress that I got to know of this event through a local data science classroom that I am a part of, Master Python 4 DS, and I am very grateful for this opportunity. AI Kenya and Nairobi Women in Machine Learning and Data Science are also great communities that anyone new to data science and living in Kenya can check out.