I have used various machine learning algorithms in several projects at university, and I have attended some inspiring lectures where industrial companies presented how they use machine learning, data mining, etc. in their work. I mostly use Python, and I have previously used libraries such as sklearn. My problem is that I have great difficulty understanding the role of built-in algorithms versus building them completely from scratch with pure coding and math, i.e. using theoretical machine learning knowledge to do the work yourself. I understand that doing everything yourself is constrained by time/money/resources, and that it often makes no sense to reinvent something that has already been heavily optimized by others.
I keep feeling that using sklearn's built-in random forest classifier, or xgboost in Python, is kind of cheating. All I am doing is preparing the data, cleaning it into the right formats, and maybe doing some feature engineering, initial plotting, and statistical analysis. Once that is done, I simply feed the data into a pre-made algorithm that does everything behind the scenes and spits out predictions (see the sketch below). I feel that I am not really doing anything, that I am not applying the knowledge gained in the exploratory analysis, and that none of the patterns I found in the data are being used. Still, I hear from big companies that they use xgboost and sklearn, and I can see them actively being used in Kaggle competitions.
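To make it concrete, here is a minimal sketch of the workflow I mean, using a toy dataset as a stand-in for my own cleaned data (the dataset and parameter choices are just placeholders). All "my" work happens before the last few lines; the pre-built classifier does the rest:

```python
# A minimal sketch of the workflow described above: my own effort goes into
# preparing X and y, and the pre-built classifier handles the modelling.
from sklearn.datasets import load_iris        # stand-in for my own cleaned dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# In a real project this step is the cleaning / feature engineering I describe;
# here a toy dataset takes its place.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pre-made algorithm: three lines replace all of the underlying math.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, predictions))
```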
Almost every website I find only provides examples using these built-in libraries and does not go into any deeper math or statistics at all. I really enjoy working with machine learning, but I have a strong feeling that I am completely missing the "professional" way of doing things. I know there are many books on theoretical machine learning, but almost everyone online still seems to just use pre-made algorithms. I have been struggling with this for about a year now, and the validity of these pre-made algorithms in serious industrial/business/academic use is still not clear to me.
EDIT: To be more specific, my question is: how are these libraries/tools viewed in a professional/industrial/academic context compared to actually building a model yourself? Are they just a "quick and easy" way for students and amateurs to start learning machine learning and data mining, or are they in fact more powerful than I realize, and not merely an alternative but a viable solution for professionals?
The motivation behind the single question above can be elaborated through the questions I keep asking myself, the very questions that started this confusion: Is it cheating to use these models? In which situations would you use a pre-built library, and when should you avoid it? How do I combine the knowledge gained from the data analysis I did before modelling with these pre-built classifiers?