Don’t Start Your Data Science Journey Without These 5 Must-Do Steps — A Spotify Data Scientist’s Full Guide
Are you just starting your journey in data science? Maybe you’ve been staring at that data science degree or boot camp for way too long, and now you don’t know where to start?
Maybe you’ve already started your data science journey, and now you’re overwhelmed and all over the place?
Four years ago, I was you — except I jumped blindly into a data science degree with zero coding skills. Spoiler: I struggled. A lot.
Fast-forward to today: I’m a Data Scientist at Spotify and I’m here to give you the heads-up I wish someone had given me 4 years ago. If you’re about to take the next step towards your dream degree, then this article could be a game-changer for you.
Trust me, you don’t want to dive into this unprepared. I thought getting into NYU meant I’d slide through the year. I mean, I knew the climb would be bumpy, but I wasn’t prepared for the freaking mountains I ended up facing.
Another spoiler: I survived. But it was a painful journey full of tears and binge eating. So I’m here to spare you from this.
In this article, I’ll unpack the 5 most crucial moves I wish I’d made before starting my data science degree at NYU. These are 5 steps that I instead ended up learning all at once during my degree.
Learning all these skills at the same time is extremely overwhelming: there is so much to process in so little time.
Nobody should be sleepwalking into such a challenging project without some solid prep.
This article is my letter to past-me — and to you.
Ready? Let’s get straight to it. You’ll thank me later!
But first, let me tell you how I got there in the first place (or you can skip to the next part, I won’t be upset).
Once upon a time, a princess was awoken. It wasn’t true love’s kiss that smacked me at 5 AM that day. Try NYU’s marketing team, calling from god knows where to promote their data science program.
I picked up the call and listened. It was the first time I gave a marketing call the benefit of the doubt.
Long story short, they were good, because six months later, I was already roaming the streets of NYC.
I remember feeling like I was about to conquer the world. But at that time, I had no idea that I was actually about to experience the biggest face slap of my life. It was such a hard slap it lasted a good 10 months.
This was four years ago, so in between I gained enough perspective and experience as a Data Scientist in Tech to tell you exactly how to set yourself up for success.
If you want to hear more about the rollercoaster journey that led me to Spotify, then be sure to check out the article below too.
I recommend following these five steps in the order laid out below.
#1. Avoid Future Headaches — Master Linear Algebra & Statistics Basics
If these words don’t ring a bell for you, then you shouldn’t be jumping into data science training just yet.
Picture this: For a whole year, I was building ML models, but it was only a year later that I realized I was just rehashing code like a robot. I wasn’t connecting these new concepts with the ones I had seen in Linear Algebra and Statistics. This ultimately slowed my progress.
If you don't master Linear Algebra & Stats Basics, you will never:
- Efficiently process and accurately interpret large datasets.
- Grasp the foundational principles behind most ML algorithms.
- Learn to validate and draw meaningful conclusions from your data.
- Be considered a true Data Scientist, especially in the world of Tech firms.
Without these two, you will be sailing aimlessly in the sea of ML.
Being a Data Scientist isn’t just about importing algorithms from libraries and letting the magic operate. It’s about understanding first what it is that you’re actually doing with these algorithms.
Why is Linear Algebra so important?
- Vectors and Matrices: In data science, especially in ML, data is often represented as vectors and matrices. For instance, a dataset with n users and m variables can be represented as an n x m matrix.
- Transformations: Techniques like Principal Component Analysis (PCA) for dimensionality reduction are rooted in linear algebra concepts of eigenvalues and orthogonality. These are essential because they allow you to transform data into a more manageable or interpretable form.
- Machine Learning Models: ML heavily relies on linear algebra. For example, the weights of neural networks can be represented as matrices, and their operations involve a lot of matrix multiplications.
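To make that concrete, here’s a minimal NumPy sketch (the dataset and numbers are invented for illustration) showing a tiny “users × variables” matrix, and PCA boiled down to an eigendecomposition:

```python
import numpy as np

# Toy dataset: 5 users (rows) x 3 variables (columns) -> a 5 x 3 matrix
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 1.4],
    [4.0, 7.9, 2.1],
    [5.0, 10.1, 2.4],
])

# Center the data, then compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# PCA boils down to the eigendecomposition of that covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Project the data onto the top principal component (largest eigenvalue)
top_component = eigenvectors[:, np.argmax(eigenvalues)]
X_reduced = X_centered @ top_component  # matrix-vector multiplication

print(X_reduced)  # each user is now summarized by a single number
```

Every step here is a linear algebra operation: centering, covariance, eigenvectors, and a matrix-vector product. That’s why you can’t skip this stuff.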
Why are Statistics & Probabilistic Theory so important?
- Descriptive Statistics: Before diving deep into complex models, it’s crucial to understand the basic properties of data, such as mean, median, variance, and standard deviation.
- Inference: Making predictions or understanding patterns isn’t enough. We also need to evaluate how reliable our predictions or results are. Statistical inference helps estimate population parameters and test hypotheses. This allows us to understand the significance of our findings, like we do for A/B tests.
- Probabilistic Theory: The foundation of many ML algorithms is probability theory. Concepts like conditional probability and Bayes’ theorem are crucial must-knows for algorithms like Naive Bayes, Bayesian networks, and many others.
- Distribution Theory: Understanding different probability distributions like normal, binomial, and Poisson helps to make assumptions about data or algorithms. A lot of ML models rely on the assumption that the data follows a specific type of distribution, so if you don’t know much about probability distributions, how can you expect to figure out which algorithm to use?
- Sampling and Estimation: Data scientists almost always work with samples of data rather than entire populations, for many different reasons. Statistics gives you the tools to understand the relationship between samples and populations, to make sure you’re able to generalize from your findings.
- Model Evaluation: Techniques like chi-squared test, t-test, ANOVA, etc., are used to compare and evaluate different models. We use them a lot when doing A/B tests, which rely mainly on hypothesis testing.
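To see inference in action, here’s a small SciPy sketch of a two-sample t-test, the kind of test behind a simple A/B comparison. The groups and numbers are simulated, not real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical A/B test: some metric measured for two user groups
group_a = rng.normal(loc=0.50, scale=0.10, size=200)  # control
group_b = rng.normal(loc=0.53, scale=0.10, size=200)  # variant

# Two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:  # conventional significance threshold
    print("Reject the null hypothesis: the groups likely differ.")
else:
    print("Fail to reject the null: no significant difference detected.")
```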
In the end, you need to be able to answer questions like:
- What’s a p-value?
- What’s overfitting?
- What’s linear independence?
- What’s a true positive rate? A false positive rate?
- What’s statistical significance, and how do you verify it?
- What are the different statistical tests, and how do they work?
And the list goes on. These are also questions that often come up in job interviews, so you’d better get started as early as you can!
Building and optimizing models, as well as interpreting data results and predictions, requires understanding what the algorithm is doing in the first place. You won’t go far without diving into those maths concepts first.
#2. Speak the Language of Computers — Understand Basic Algorithmic Frameworks & Data Structures
Before joining NYU, I spent 1–2 months getting my hands dirty with coding. The very first class I attended was already asking us to visualize data on a map using Python libraries.
If you can barely print “Hello World”, you should run back to study the basics of algorithms, because learning to code is like learning a new language. It takes time.
And just as no one glues random words together and magically gets correct sentences, the same goes for algorithms.
Why is it so important?
Being a Data Scientist requires extracting value from huge amounts of data. No Excel sheet will survive the weight of terabytes of data, so we have no other choice but to learn complex languages that computers can understand.
And before diving into these languages, you first need to understand their underlying structure.
It’s like learning Japanese when English is your primary language. The intuition and structure of your sentences completely shift. If you don’t know that the verb goes at the end of the sentence instead of the middle, you won’t be able to form correct sentences. So get your algorithmic grammar straight.
To do so, learn how algorithms are constructed and the logic behind the architecture. How do you translate your idea into algorithmic words? How do you speak the language of computers before trying to teach them stuff?
How do you learn that?
Let’s break it down into steps you can follow:
- Practice Basic Programming Concepts: Make sure you’re comfortable with loops, conditionals, and basic data types. They’re like the nouns, verbs, and adjectives of this new language.
- Dive into Data Structures: Just as sentences are made up of words, algorithms are constructed using data structures. Learn about arrays, lists, dictionaries, trees, and graphs. Think of them as your algorithmic vocabulary.
- Understand Algorithm Design: Delve into sorting algorithms, search algorithms, and basic optimization techniques. These are the fundamental “phrases” you’ll use frequently.
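To give you a taste of this “grammar”, here’s a classic example: binary search in Python. It combines a loop, conditionals, and a design idea (halving the search space at every step) in a dozen lines:

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:                      # loop
        mid = (low + high) // 2
        if sorted_items[mid] == target:     # conditional
            return mid
        elif sorted_items[mid] < target:
            low = mid + 1                   # search the right half
        else:
            high = mid - 1                  # search the left half
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))   # 3
print(binary_search([1, 3, 5, 7, 9, 11], 4))   # -1
```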
When it comes to data structures, I’d suggest focusing on the following ones, as they happen to be the ones Data Scientists use the most (you’ll find a short code sketch right after the list):
- Strings: Think of strings as chains of characters, like sentences or words. In coding, “apple” is a string made of characters. And just as you combine words to create sentences, you can combine strings to create messages.
- Lists: Now, imagine you have a shopping list: milk, bread, eggs. This is a list! Lists are versatile and can store items. You can add to it, remove from it, and even sort it. It’s like having a playlist and being able to shuffle songs, add a new one, or remove the ones you don’t like.
- Tuples: Think of tuples like fixed lists. You’ve got your favorite all-time top 3 movies listed. That list is probably not going to change, right? Tuples are like that — once you create one, you can’t modify it.
- Dictionaries: Picture a dictionary as a container where you store information in pairs — a ‘key’ and its ‘value’. For instance, if ‘name’ is the key, ‘John’ might be its value.
- DataFrames: Imagine organizing a big school reunion. You’ll want a table with names, contact details, dietary preferences, and more. Data frames are like those tables — structured grids of data. They help organize a large amount of information clearly.
- Classes: Here’s where things get a bit abstract and where I struggled the most. Consider classes as blueprints. If you were building houses, the blueprint provides the design: number of rooms, size of the kitchen, etc. But you can use that single blueprint to build many houses. Similarly, in coding, a class is a blueprint for creating objects (a particular data structure). It defines properties (like color or size) and methods (functions related to that class) that can operate on the data.
Other data structures to explore: sets, trees, and graphs.
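Here’s what those core structures look like in Python. All the values are made up; the point is the syntax and behavior:

```python
import pandas as pd

# Strings: chains of characters you can combine
greeting = "Hello, " + "world"

# Lists: mutable collections you can grow, shrink, and sort
shopping = ["milk", "bread", "eggs"]
shopping.append("apples")
shopping.sort()

# Tuples: fixed lists that can't be modified after creation
top_movies = ("The Matrix", "Spirited Away", "Amélie")

# Dictionaries: key-value pairs
user = {"name": "John", "city": "NYC"}

# DataFrames: structured grids of data (via pandas)
reunion = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "diet": ["vegan", "none"],
})

# Classes: blueprints for creating objects
class House:
    def __init__(self, rooms, color):
        self.rooms = rooms          # properties
        self.color = color

    def describe(self):             # a method operating on the data
        return f"A {self.color} house with {self.rooms} rooms"

print(House(rooms=4, color="blue").describe())
```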
How do you practice your skills?
Begin by diving into coding platforms like LeetCode, Codewars, or HackerRank, where you can get your hands dirty with different algorithmic challenges.
These platforms offer problems ranging from beginner to expert level, so you can develop your skills as you progress.
Remember, the goal isn’t to become the next top software engineer; we’re doing data science here. So don’t feel pressured to delve too deep into algorithms.
Your primary focus should be on mastering the basics and, more crucially, becoming proficient in manipulating data structures. The more you play with them, the more comfortable you’ll get.
#3. Go Beyond Importing ML Algorithms — Understand their Structure, You’ll Be Unstoppable
Picture this: In my first semester, I was tuning hyperparameters but without really understanding what a hyperparameter even meant in the scope of that specific ML algorithm.
When I discovered machine learning algorithms, I realized they exist in all shapes and forms. This meant I needed to understand how each algorithm worked, when to use it, and what assumptions needed to be validated before using it.
The only problem is that I came to this realization a bit too late in my learning journey. In the meantime, I spent a long time pulling my hair out trying to make sense of all that jargon. I didn’t know how to properly approach machine learning, but now I do, so here are my two cents.
To start, you first need to understand the structure that comes into play when building an ML model. It usually goes like this:
- Checking the Data Distribution: Think of this as looking at a mixed bowl of fruit and figuring out how many of each fruit type there are. It’s crucial because if your data is skewed towards one type (say, too many apples and not enough oranges), your model might become really good at recognizing apples but not so much the others. By checking the distribution, you can make sure your model has a balanced “diet”, and avoid ending up with a model biased toward the majority class.
- Preparing the Data: Think of this as tidying up your room so you know where everything is. Just like some toys need batteries to work, some ML models need the data in a specific format. This might include one-hot encoding, scaling, or normalizing data columns. Simply put, it’s about making the data neat for the model.
- Splitting the Data: Imagine splitting a deck of cards for a game. We separate our data into training, validation, and test sets. This way, we teach our model with some data and test it with unseen data to see how well it’s learned.
- Training the Model: This is the teaching phase. We feed our training data into the model so it can learn patterns. If necessary, we might adjust the model to make it fit the data better.
- Testing the Model: After training, we see how our model performs on the test data — like a quiz after a lesson.
- Tuning the Hyperparameters: Imagine you have a toy car that you can customize. The size of the wheels, the color, or the type of engine you choose for the car are like hyperparameters. You decide and set them. The toy car will then run based on how you’ve set it up. There are tools like cross-validation and grid search to help you find the best settings. To properly tune these, you’ll have to understand how the algorithm works, and this means making a stop at our BFF’s place: Maths.
- Choosing the Right Metric: This is about grading your model. Depending on the objective of your project, you’ll use different ‘scorecards’ or metrics. Whether it’s accuracy, recall, or others, know which one aligns with your goals.
Make sure to check for biases and trade-offs. Just as you balance study and playtime, in ML you often need to strike a balance, like choosing between a super-accurate but slow model and a faster but simpler one.
Keep in mind that each of these steps has its own nuances and details. The more you work with ML models, the more you’ll understand the importance of each!
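To tie the steps together, here’s a minimal scikit-learn sketch using one of the library’s built-in toy datasets, so everything below runs as-is. A real project would obviously involve more care at each step:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

# Load a small built-in dataset and split it (train vs. unseen test data);
# stratify=y keeps the class distribution balanced across the splits
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Prepare the data (scaling) and define the model in one pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
])

# Tune a hyperparameter with cross-validation + grid search
search = GridSearchCV(pipeline, {"model__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Evaluate on the held-out test set with metrics that match your goal
y_pred = search.predict(X_test)
print("Best C:", search.best_params_["model__C"])
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
```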
#4. Tame the Python Beast and its Libraries
When I started ML, there was so much I didn’t know about coding. I didn’t know that data sometimes needs reformatting, how to import unusual file types, how to convert data between types, and more.
It took me a while to digest all that jargon, and by then I was already piling up other kinds of struggles. So now that you’ve got the basics of computer language down, the next step is to learn how to apply them!
Here are the most common code operations you will use when handling data. Make sure to know them well! (There’s a pandas sketch right after the list.)
1. Data Input/Output
Read and write data — reading a .csv or .sql file, and conversely writing a dataframe to a .csv file.
2. Column and Row Operations
Handling columns — renaming them, selecting and indexing columns or rows, creating new ones, modifying elements within the column, and changing their format.
Formatting your dataframe or columns — resetting index, grouping data.
3. Data Shaping and Reshaping
Changing the shape of DataFrames — with join, merge, and concatenate, pivot, and melt.
4. Missing Data Handling
Identifying missing values, and knowing which technique to apply to deal with them, depending on the research project.
5. Data Filtering and Sorting
Filtering Data — Selecting subsets of rows based on some criteria.
Sorting Data — Arranging data in ascending or descending order based on one or more columns.
6. Data Summarization and Statistics
Aggregating Data — Summarizing data with aggregation functions like sum, average, count, etc.
Descriptive Statistics — Quick statistics like mean, median, mode, standard deviation, etc.
7. String and Data Type Operations
String Manipulation — Handling and cleaning string data, using regular expressions, splitting strings, or converting cases.
Type Conversion — Converting data types, like from string to integer or from float to date.
8. Advanced Operations
Conditional Operations — Applying functions or making changes based on certain conditions.
Setting and Resetting Multi-level Index — Useful for time series or hierarchical data.
9. Custom Functions
Crafting your own code shortcuts to manipulate data and automate things.
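Here’s a quick pandas sketch touching several of these operations at once. The table, values, and file names are invented for illustration:

```python
import numpy as np
import pandas as pd

# In a real project you'd start from a file, e.g. df = pd.read_csv("users.csv")
df = pd.DataFrame({
    "user": ["ana", "ben", "cleo", "dan"],
    "country": ["SE", "US", "SE", "US"],
    "minutes_played": [120, np.nan, 45, 300],
})

# Column operations: rename, create new columns
df = df.rename(columns={"user": "username"})
df["hours_played"] = df["minutes_played"] / 60

# Missing data: identify, then pick a strategy (here, fill with the median)
print(df["minutes_played"].isna().sum())
df["minutes_played"] = df["minutes_played"].fillna(df["minutes_played"].median())

# Filtering and sorting
heavy_users = df[df["minutes_played"] > 60].sort_values(
    "minutes_played", ascending=False
)

# Summarization: group-wise aggregation
per_country = df.groupby("country")["minutes_played"].agg(["mean", "count"])
print(per_country)

# Writing results back out
heavy_users.to_csv("heavy_users.csv", index=False)
```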
Finally, when handling data and doing ML, you’ll always find yourself dealing with libraries.
Imagine you’re baking a cake. Instead of making everything from scratch, you get a cake mix from the store. This mix has many of the ingredients you need, all pre-packaged in one box, saving you time and effort.
A Python library is like that cake mix for programming. It’s a collection of pre-written code that helps you do tasks faster and more easily. So naturally, you’ll have to cozy up with libraries and get to know them really well.
It’ll be like expanding your circle of friends.
Here are your top 6 pals:
1. Numpy: Your math buddy.
2. Pandas: The data organizer.
3. Matplotlib & Seaborn: The artsy twins for visualizing data.
4. Sklearn: Your go-to for machine learning tools.
5. Statsmodels: Your statistical consultant.
Once you become more proficient with ML, you might want to get familiar with these other libraries too:
1. TensorFlow & PyTorch: The dynamic duo for deep learning.
2. Beautiful Soup & Scrapy: Your web scraping experts.
3. NLTK & SpaCy: Your linguistic experts for text analysis and NLP.
Each library specializes in a field, so you don’t need to master them all; just knowing they exist will come in handy when the time comes.
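To see a few of these pals working together, here’s a tiny sketch (simulated data, illustrative only) that fits the same line with Sklearn and Statsmodels, then draws it with Matplotlib:

```python
import numpy as np                      # your math buddy
import pandas as pd                     # the data organizer
import matplotlib.pyplot as plt         # one of the artsy twins
from sklearn.linear_model import LinearRegression  # the ML toolbox
import statsmodels.api as sm            # your statistical consultant

# Simulated data: y = 2x + 1 plus some noise
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + np.random.default_rng(0).normal(scale=1.0, size=50)
df = pd.DataFrame({"x": x, "y": y})

# Same line, three friends: sklearn fits it, statsmodels explains it,
# matplotlib draws it
model = LinearRegression().fit(df[["x"]], df["y"])
ols = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()
print(model.coef_, ols.pvalues)

plt.scatter(df["x"], df["y"])
plt.plot(df["x"], model.predict(df[["x"]]), color="red")
plt.show()
```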
#5. Make Friends with SQL
This one sits high on the list. I use SQL almost every day in my life as a Data Scientist at Spotify. It’s no piece of cake, but I can navigate my way through it nicely now. That wasn’t always the case.
When I first discovered SQL, my brain went into overheat mode. At that time, I was also learning how to code on Spark, doing cloud computing, and advanced Machine Learning. So understanding a whole new coding paradigm was too much to ask of my brain. It’s like learning Swedish and Japanese at the same time.
By the time I’d developed an intuition for SQL, which has a completely different syntax and approach from Python, my course was already over.
If I had taken the time to get familiar with SQL before starting my data journey, I could have better connected the dots when I was in class. It would have also saved me lots of unnecessary stress.
Why is it so important to learn SQL early on?
SQL and Python are the dynamic duo you need to master in Data Science. We’re not talking about a “professional proficiency” type of level. No, we want to go full-on native speaker mode here. If you can’t properly translate your ideas into SQL and Python, then be sure they will never come to life.
Not only that, your thinking itself will be limited: you can’t come up with creative ways to address a problem if you’ve never been exposed to the full extent of the language in the first place.
A great philosopher named Ludwig Wittgenstein once said:
“The limits of my language mean the limits of my world”
The structure of language provides both the limits and the framework for our thought, meaning that we can’t conceive something for which we have no words or language. This goes for programming too.
Remember, these languages are anything but intuitive: they follow computer-level intuition, not human intuition. Otherwise, we’d be using plain English to speak to machines instead of their twisted alien lingo. Probably another of their evil plots to take over the world.
How to learn SQL and what to focus on?
- Introduction to SQL: Understand that SQL (Structured Query Language) is used to manage and query data in relational databases.
- Basic Queries: Start with the SELECT statement: SELECT column_name FROM table_name
- Filtering Data: Use the WHERE clause to filter specific results: SELECT column_name FROM table_name WHERE condition
- Sorting Results: Arrange your data with the ORDER BY clause: SELECT column_name FROM table_name ORDER BY another_column_name DESC/ASC
- Joining Tables: Understand JOIN operations to combine tables based on related columns. Familiarize yourself with INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.
- Datetime Functions: Learn functions and operations related to date and time.
→ Extracting components: YEAR(), MONTH(), DAY(), etc.
→ Date arithmetic, formatting, and interval calculations.
- Aggregation: Use functions like COUNT(), SUM(), AVG(), MAX(), and MIN() to perform calculations on data.
- Grouping Data: Combine the GROUP BY clause with aggregate functions for group-wise calculations.
- CTEs (Common Table Expressions): Simplify complex queries by breaking them into reusable blocks: WITH cte_name AS (SELECT …) SELECT … FROM cte_name
- Window Functions: Master advanced calculations over a set of table rows relative to the current row.
→ Familiarize yourself with functions like ROW_NUMBER(), LEAD(), LAG(), and RANK()
→ Explore PARTITION BY to segment your data within your window calculations
→ Understand running totals, e.g. SUM(column_name) OVER (ORDER BY another_column)
- Querying Across Partitions: Master the techniques to fetch data from multiple datetime partitions, e.g. in BigQuery (where * is a placeholder for the datetime suffix):
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS partition_date
FROM `data.partition_*`
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(DATE '2023-09-09', INTERVAL 1 DAY)) AND '20230909'
and more!
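You don’t even need a database server to start practicing: Python ships with the sqlite3 module, which runs real SQL locally. Here’s a minimal sketch (the table and values are invented) exercising a CTE, aggregation, and sorting:

```python
import sqlite3

# In-memory toy database, just to practice query patterns
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plays (username TEXT, country TEXT, minutes INTEGER);
    INSERT INTO plays VALUES
        ('ana', 'SE', 120), ('ben', 'US', 80),
        ('cleo', 'SE', 45), ('dan', 'US', 300);
""")

# A CTE plus GROUP BY aggregation, as described above
query = """
    WITH per_country AS (
        SELECT country, SUM(minutes) AS total_minutes, COUNT(*) AS n_users
        FROM plays
        GROUP BY country
    )
    SELECT country, total_minutes, n_users
    FROM per_country
    ORDER BY total_minutes DESC;
"""
for row in conn.execute(query):
    print(row)   # ('US', 380, 2) then ('SE', 165, 2)
```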
Where to practice your new skills?
Many coding platforms offer SQL challenges for all levels. Some of these include SQLZoo, LeetCode, HackerRank, Mode Analytics, and more.
Make sure to brush up on your SQL skills before starting your data science training so that you don’t end up too overwhelmed when having to juggle it with Machine Learning (and other paradigms)!
Recap — Why following each of these steps before jumping into data science is CRUCIAL
1. Master Linear Algebra & Statistics Basics
Without a solid understanding of Linear Algebra and Statistics:
→ You can’t efficiently process or interpret large datasets.
→ Grasping foundational ML algorithms becomes challenging.
→ Drawing meaningful conclusions and understanding the validation of your data is almost impossible.
→ You risk becoming just a code rehasher, not truly understanding the foundational principles you’re applying.
2. Learn Algorithmic Framework
Without a solid understanding of algorithmic frameworks:
→ You will struggle to extract value from massive datasets.
→ Translating your ideas into algorithmic terms will be challenging.
3. Go Beyond Algorithms, Understand Their Structure
Machine learning algorithms vary greatly in structure and application.
→ Understanding when and how to use each algorithm is vital.
→ Grasping the structure of building an ML model will help you build the foundations of an efficient model.
4. Be Proficient with Python and Its Libraries
Python and its libraries are essential tools in the Data Scientist’s toolkit, so get familiar with them early on, well before you need to master them.
→ Libraries simplify tasks by providing pre-written, optimized code.
→ They expedite tasks that would otherwise be time-consuming to code from scratch.
5. Get Friendly with SQL
If Python rules the data world, be sure that SQL shares the crown.
→ Being fluent in SQL and Python enables you to translate and implement ideas effectively.
→ Understanding SQL early on expands your thinking process, allowing you to be more creative with solving problems.
Remember Wittgenstein: Your language’s limits are your world’s limits.
I struggled a lot in my first year of doing data science, so I’ve learned my lessons. If you diligently follow these steps, I guarantee you won’t have to shed too many tears. Good luck!