A Comprehensive Roadmap for An Aspiring Data Scientist in 2021

Neelanjan Manna
11 min read · Dec 23, 2020

“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” -Josh Wills, Director of Data Engineering at Slack

We stand in the midst of a deluge of data today. From the smartphone in your palm to the smart refrigerator in your home, it’s everywhere. Over 2.5 quintillion bytes of data are generated every day, a figure expected to rise to 463 exabytes per day by 2025. Even though the systems that generate these vast volumes of data expire over time, the data doesn’t. And that’s how we arrive at data science, the art of discovering the insights and knowledge concealed in data.

The Essentials

So, you’re finally all buckled up for this phenomenal journey that will take you through vastly different worlds of science and engineering. But don’t worry: all the prerequisites you need are the will and the patience to learn new things.

A little mathematical foundation in linear algebra, statistics, and calculus will save you valuable time along the way, but even if you don’t have any background in them yet, you can still start down this road and pick them up as and when required.

1. Python Programming Language

Photo by Christina Morillo from Pexels

The first step towards becoming a good data scientist is to have a fair amount of knowledge of, and experience with, the programming language you’ll be using for data science. Although many languages can be used for data science, such as Java, JavaScript, Julia, Scala, SQL, and more, these days most practitioners use either Python or R.

That is because a wealth of data science packages is readily available for both of these languages. Additionally, Python, being a general-purpose language with a simple syntax, offers a lucrative starting point for beginners. The rest of this article follows the Python route to learning data science, assuming that will be your choice as well.

There’s no need to master all of Python, given that it is a general-purpose language suited to many things beyond data science. Focus instead on Python’s built-in data structures, such as lists, dictionaries, tuples, and strings and their various methods, along with the concepts, such as classes, that let you write object-oriented code. Working through general coding exercises that cover these topics is essential to becoming comfortable with the language.
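
As a rough sketch of the kind of practice that pays off, here is a minimal example touching the built-in data structures and a simple class; the names and values are made up purely for illustration:

```python
# Built-in data structures worth getting comfortable with
prices = [19.99, 5.49, 3.75]               # list: ordered and mutable
prices.append(12.00)

inventory = {"apples": 30, "pears": 12}    # dictionary: key-value pairs
inventory["plums"] = 7

point = (4, 2)                             # tuple: ordered and immutable

name = "data science"                      # string and a couple of its methods
print(name.title(), name.split())

# A minimal class to practice object-oriented Python
class Item:
    def __init__(self, name, price):
        self.name = name
        self.price = price

    def discounted(self, rate):
        return self.price * (1 - rate)

item = Item("notebook", 4.50)
print(item.discounted(0.1))                # 4.05
```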

2. NumPy

Photo by Henry & Co. on Unsplash

By now, you should have a fair understanding of how Python works and how simple its syntax is. But all this simplicity comes at a cost: the standard Python interpreter is itself a large body of C code doing the heavy lifting behind the scenes, and the dynamic, interpreted nature that makes Python so flexible also makes it one of the slowest programming languages compared to C/C++, Java, etc.

In data science, you’ll be dealing with the array data structure far more than any other. The Python list, although flexible, is too slow to process the humongous amounts of data handled in machine learning. That’s where NumPy comes to the rescue: NumPy is a Python package dedicated to making array creation and manipulation fast and convenient.

This is one of those fundamental packages you’d use in practically every machine learning project, for obvious reasons: it accelerates array operations by running compiled C code under the hood, it stores only homogeneous data types, unlike the Python list, and it applies operations to whole arrays at once, which further speeds up processing.

You’re expected to have a decent knowledge of the basic concepts, such as NumPy array creation and manipulation and the different array operations: reshaping, broadcasting, adding or removing dimensions, and so on. While you’re at it, you’ll also pick up the various array methods defined in NumPy, such as calculating the mean, the standard deviation, and the maximum and minimum elements.
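
As a minimal sketch of these ideas, the snippet below creates an array, reshapes it, broadcasts an operation, and computes a few summary statistics:

```python
import numpy as np

a = np.arange(12)                  # 1-D array of the numbers 0..11
m = a.reshape(3, 4)                # reshape into a 3x4 matrix
col = np.array([1, 2, 3]).reshape(3, 1)

# Broadcasting: the (3, 1) column is stretched across the (3, 4) matrix
shifted = m + col

# Adding and removing dimensions
expanded = m[np.newaxis, :, :]     # shape becomes (1, 3, 4)
squeezed = expanded.squeeze()      # back to shape (3, 4)

# Common array methods
print(m.mean(), m.std(), m.max(), m.min())
print(m.mean(axis=0))              # column-wise means
```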

3. Data Visualization

Photo by Luke Chesser on Unsplash

When working with volumes and volumes of data, it’s more often than not difficult to make sense of it just by looking at it directly. That’s why we opt for data visualization: a visually appealing, graphical representation of the data and its various aspects.

Data visualization in Python is mainly performed using two packages, Matplotlib and Seaborn. Both of these libraries offer a plethora of charts and graphs that can be used to understand and convey various aspects of the data, such as histograms, scatter plots, box plots, violin plots, categorical plots, and so on.
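
As a small, hedged example, the sketch below draws three of these plots with Seaborn on top of Matplotlib, using the “tips” example dataset that Seaborn can load for you (it is fetched on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")    # Seaborn's bundled example dataset

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(data=tips, x="total_bill", ax=axes[0])              # histogram
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[1])  # scatter plot
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[2])      # box plot
plt.tight_layout()
plt.show()
```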

You’d explore these plots by experimenting with them and understanding where each one applies. Once satisfied, you’re all set to dive deep into the world of statistics.

4. Statistics and Probability

Photo by Edge2Edge Media on Unsplash

Karl Pearson, widely regarded as the father of modern statistics, once said:

“Statistics is the grammar of Science”

Particularly in data science, statistics holds paramount relevance because it is the key to understanding and unlocking the secrets held within the data. Although the field is vast, you’d want to begin by understanding the implications of fundamental concepts such as mean, standard deviation, variance, covariance, and correlation.

Next up, you’d learn about the different probability distributions that govern how data following them behaves, the most important of which is the Gaussian distribution, also known as the normal distribution. You’d also learn about the usefulness of the standard normal distribution and how the central limit theorem can be put to work.

Another aspect of statistics and probability that you’d be using a lot in data science is inferential statistics, which is primarily used to draw inferences from observed data. This includes concepts such as hypothesis testing, p-values, confidence intervals, margins of error, and so on. Although much of the statistical machinery required for data science comes pre-packaged in the libraries we use these days, it’s still important to learn these concepts to understand data science at a fundamental level.
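
As a minimal sketch of these ideas in code, the snippet below simulates a sample, computes a few descriptive statistics, runs a one-sample t-test, and builds a 95% confidence interval for the mean; the numbers are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=50)   # simulated measurements

# Descriptive statistics
print(sample.mean(), sample.std(ddof=1), np.var(sample, ddof=1))

# Hypothesis test: is the population mean different from 105?
t_stat, p_value = stats.ttest_1samp(sample, popmean=105)
print(t_stat, p_value)

# 95% confidence interval for the mean
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print(ci)
```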

The concepts mentioned here cover only a small portion of the statistics that goes into data science, but working through them gives you a fundamental understanding of how statistics is used in the field. Later on, you’ll meet statistics again when dealing with the specifics of the machine learning algorithms.

5. Pandas and Exploratory Data Analysis

Photo by Lukas from Pexels

In machine learning, you’d mostly be dealing with either structured or unstructured data. Structured data follows a tabular representation, so it is important to be able to create and manipulate tabular data. The pandas library is built for exactly this purpose: it provides the “data frame” object, which can be easily visualized and understood.

Pandas is by far the most widely used tool among data scientists because it is the standard choice for exploratory data analysis and data cleaning, two essential parts of almost every data science project. There are many fundamental data frame operations in pandas that need to be mastered.

Exploratory data analysis is about picking out relevant observations from the data by different means of analysis, including statistical and graphical methods. You’re expected to learn how to extract these observations from any given dataset. A good practice is to write your observations down as comments so you can refer back to them later and understand the course of decisions made.
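
As a hedged, minimal sketch of what this looks like in practice, the snippet below builds a tiny made-up data frame, inspects it, fills a missing value, and answers a simple exploratory question:

```python
import pandas as pd

# A tiny, made-up dataset purely for illustration
df = pd.DataFrame({
    "city":  ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
    "sales": [250, 300, None, 410, 380],
    "units": [10, 12, 9, 15, 14],
})

print(df.head())        # first few rows
df.info()               # column types and missing values
print(df.describe())    # summary statistics

# Basic cleaning: fill the missing sales figure with the column mean
df["sales"] = df["sales"].fillna(df["sales"].mean())

# A typical exploratory question: average sales per city
print(df.groupby("city")["sales"].mean())
```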

6. Machine Learning Algorithms

Photo by Hassan Pasha on Unsplash

Finally, we’re all set to start with the machine learning algorithms themselves. Mastering these algorithms at a fundamental level, including their mathematical underpinnings, is an outright necessity for becoming a good data scientist. Although ready-made implementations are available in libraries and packages such as scikit-learn, it’s still of paramount importance to understand the algorithms fully.

Alongside these algorithms, you’d also learn the different techniques employed to make learning more effective and increase model performance: handling outliers, scaling the dataset, tuning hyperparameters (the parameters of a model that cannot be learned during the training phase), the “curse of dimensionality”, calculating feature importance, and statistical techniques for feature selection.

On a high level, machine learning algorithms can be classified into two categories: supervised learning algorithms and unsupervised learning algorithms. Supervised learning algorithms learn a decision mapping from the independent variables (also known as features) to the dependent variable (also known as the target). These are further classified into regression algorithms, where the target variable is continuous, and classification algorithms, where the target variable is categorical.

There are many fundamental supervised learning algorithms you’d be learning, such as linear regression, k-nearest neighbors, support vector machines, linear discriminant analysis, logistic regression, naive Bayes, and so on. Once done with these, you’d learn about tree-based algorithms such as decision trees.
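
As a minimal sketch of how little code it takes to fit such algorithms with scikit-learn, here are a logistic regression and a decision tree trained on the bundled iris dataset (proper evaluation on held-out data is covered in section 7):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # scikit-learn's bundled iris dataset

# Two classic supervised classifiers mentioned above
log_reg = LogisticRegression(max_iter=1000).fit(X, y)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

print(log_reg.predict(X[:3]))       # predicted classes for the first 3 rows
print(tree.predict(X[:3]))
```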

A category of tree-based algorithms is ensemble algorithms, which use multiple decision trees (each called a weak learner) to solve the same problem and then combine their results to produce the final outcome. Ensemble algorithms are broadly categorized into bagging, boosting, and stacking. Several popular algorithms, such as random forest, gradient boosted decision trees, adaptive boosting, light gradient boosting (LightGBM), and extreme gradient boosting (XGBoost), are studied under ensemble algorithms.

Unsupervised learning algorithms are those that don’t have any specific target variable onto which we map our independent variables. They can be grouped into two categories, namely clustering algorithms and association algorithms. Some of the algorithms you’d learn here are k-means clustering, DBSCAN clustering, principal component analysis for dimensionality reduction, and so on. Unsupervised learning has several applications, such as anomaly detection, clustering, and dimensionality reduction, to name a few.
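
As a similar sketch on the unsupervised side, here are k-means clustering and a PCA-based dimensionality reduction applied to the same iris features, with the labels deliberately ignored:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # ignore the labels: unsupervised setting

# k-means clustering into three groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])

# Principal component analysis: reduce 4 features down to 2
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                   # (150, 2)
```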

7. Evaluation of Machine Learning Algorithms

Photo by Magda Ehlers from Pexels

After preparing the data and training the machine learning model, evaluating the trained model is an absolute necessity in order to ascertain its performance and reliability. It should be noted that the methods used for evaluation depend on the use case, i.e. different metrics are used for different purposes. As an example, accuracy might not be a good metric for evaluating applications of AI in medicine; instead, we tend to check the sensitivity and specificity of the model to judge its performance.

A common practice to ensure a fair evaluation of the machine learning models is to divide the dataset into train and test splits so that the model can be trained on the train set while the test set can be used for validation. Failing to do so will result in what is known as “data leakage”, where the data which is to be used for the evaluation of the model has already been used to train the model. This causes the model to produce over-optimistic evaluation results, which don’t depict its actual performance.

Another way to validate the credibility of the model is to cross-validate it over the entire dataset. In K-fold cross-validation, we divide the entire dataset into K subsets and then iterate over them so that each subset in turn acts as the test set while the remaining subsets combined act as the train set. The metrics from each of the trained models are then averaged to report a final score.
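
A minimal sketch of both ideas with scikit-learn, again using the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold-out split: train on one part, evaluate on data the model has never seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation: every sample is used for testing exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```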

There are also a number of metrics that you’d learn, such as accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (ROC AUC). You’d learn about the confusion matrix, the evaluation of models trained on class-imbalanced datasets, and how to use dummy classifiers and regressors as evaluation baselines.
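
As a hedged sketch of computing several of these metrics at once, here is a scaled logistic regression evaluated on a held-out split of scikit-learn’s breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scaling the features helps logistic regression converge
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))
```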

8. Deep Learning Algorithms

Photo by Luis Cortés on Unsplash

Deep learning is a subset of machine learning in which artificial neural networks, inspired by the behavior of neurons in the human brain, are used to learn very complex decision functions. A neural network can be described as a collection of interconnected “neurons” arranged in layers, where an individual neuron is nothing but a mathematical function.

The main prerequisite for using deep learning is the availability of a large amount of data. Deep learning can be used for supervised as well as unsupervised learning, and it finds applications in a number of places such as virtual assistants, self-driving cars, and face recognition, to name a few.

Some of the most popular libraries and frameworks for implementing neural networks are Keras, TensorFlow, and PyTorch, and they let you build deep learning models in very few lines of code. Apart from these, you’d certainly be learning three types of artificial neural networks: multilayer perceptrons, convolutional neural networks, and recurrent neural networks.

These three types of neural networks form the basis of more complex architectures, so they need to be studied at a fundamental level. You’d also learn about the different activation functions, the different ways of tuning a network’s hyperparameters, how the different optimizers work, how to use memory-efficient mini-batch training to handle huge datasets, and many more techniques for implementing deep learning.
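
As a minimal sketch of what this looks like with Keras (bundled with TensorFlow), here is a small multilayer perceptron trained on the MNIST digit images using mini-batches; the layer sizes and epoch count are illustrative choices, not a prescription:

```python
import tensorflow as tf

# Load the MNIST handwritten digits and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small multilayer perceptron
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Mini-batch training keeps memory usage manageable on large datasets
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```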

9. Deployment

Photo by Scott Graham on Unsplash

One of the most underrated aspects of machine learning is perhaps the deployment of the trained models. Deployment of machine learning models refers to the process of interfacing and integrating a trained machine learning model in order to expose it to the end-user or application as a service.

The term also extends to incremental learning, where a machine learning model keeps updating its knowledge base with the input data it receives. This allows the creation of models that can be trained on unbounded data streams and are far more scalable, since they actively adapt to the data as and when it arrives.

There are multiple cloud platforms where you can deploy machine learning models. Some of the most popular IaaS options include Amazon AWS, Microsoft Azure, and Google Cloud Platform, while PaaS services such as Heroku and PythonAnywhere also enable the deployment of machine learning models.
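
As a hedged, minimal sketch of exposing a model as a service, here is a tiny Flask app that loads a previously saved scikit-learn model and serves predictions over HTTP; the file name model.pkl and the /predict route are illustrative assumptions, not a prescribed setup:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumes a trained model was saved earlier, e.g. with pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A POST request carrying a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]} to the /predict endpoint would then return the model’s prediction; the same idea carries over to the cloud platforms mentioned above.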

Conclusion

Everything you read so far is only a very small fraction of all that is there to know in data science, yet these topics constitute the very foundation of it. That being said, data science is a domain that is still being explored and researched very actively, and there are so many questions in data science and AI which are yet to be answered.

