Hey mate,
The most important book in my Data Science career so far is an unsung gem – “Think Like a Data Scientist” by Brian Godsey. It isn't a heavy maths or coding guide; instead, it shifts the focus from pure theory to creating tangible, real-world value using data science.
In a previous video, I summarised 6 takeaways, but you asked for more, so we will dig deeper into the book here.
WHAT IS DATA SCIENCE?
It’s weird to ask that in a newsletter about *checks notes* data science, but clarifying this is the first step.
Simply put, it is the application of statistical knowledge and coding knowledge to a problem within a domain in order to add value.
The adding-value part is important, and the book emphasises it, because as early-stage data scientists our focus is always "Damn, do I know enough code?" or "Maybe I should revise that one ultra-specific statistical concept in case I need it". But as the author emphasises, knowing how to use a hammer (code + maths) is not the same as knowing how to build a chair (using the hammer to produce real-world value).
Now that we know what it is, let’s see how we can effectively conduct data science projects to maximise value.
IF YOU FAIL TO PLAN, YOU PLAN TO FAIL
Jumping into a data science project without a clear end in sight is like setting off on a road trip without a map. You’ll meander aimlessly, taking random tangents until you have completely lost sight of what you were initially looking to achieve. So you must ask your ‘customer’ for their desired end result to serve as your destination. If you are working on a personal project, you are your own customer, so what are you trying to achieve?
To keep your project on the straight and narrow (well, as much as a data science project allows), ensure you have:
Defined the Problem: What is the primary concern or challenge? If considering a new store location, do we need to consider foot traffic, financial demographics, or age groups of an area?
Determined the Deliverable: Is it a dashboard? A written report? Or a concise strategy for the next phase of a project?
Researched Previous Work: Check whether similar projects have been undertaken so you don't repeat their mistakes.
Asking a non-technical audience what they want delivered is a … tricky endeavour at times. It tends to lead to outrageous requests, short timelines, or general vagueness.
The best way to suggest deliverables is to guess what the customer wants and use this to narrow down to what they actually want.
You: "Wow, it sounds like you want a report that focuses on the distribution of competitor stores around the city"
Them: "No, actually, a dashboard with a heatmap that focuses on the income of 25–35-year-olds in the city."
Boom, you have avoided a 15-minute roundabout conversation.
GOAL SETTING
Following on from this, we will be able to set goals for the project. Godsey's golden formula for choosing which goals to set is brilliant. We need to maximise the efficiency of our goals, and efficiency is:
Efficiency = (Value x Possibility) / Effort
The dream is to set a goal that has a high value, minimal effort, and is highly possible. But realistically, it will be about prioritising high-value objectives and trying to balance the effort and possibility to prevent chasing things that will never happen.
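To make the trade-off concrete, here is a minimal sketch of ranking candidate goals by efficiency. It assumes possibility is a 0-to-1 chance of success that raises the score (matching the "highly possible" dream goal above), so it multiplies value while effort divides. The `Goal` class, the scales, and the example numbers are all made up for illustration and are not from the book.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    name: str
    value: float        # estimated benefit, e.g. on a 1-10 scale (assumed)
    effort: float       # estimated cost, e.g. in person-days (assumed)
    possibility: float  # estimated chance of success, between 0 and 1 (assumed)


def efficiency(goal: Goal) -> float:
    """Higher value and possibility raise the score; higher effort lowers it."""
    if goal.effort <= 0:
        raise ValueError("effort must be positive")
    return goal.value * goal.possibility / goal.effort


def rank_goals(goals: list[Goal]) -> list[Goal]:
    """Return the goals sorted from most to least efficient."""
    return sorted(goals, key=efficiency, reverse=True)


# Made-up candidate goals for the store-location example
candidates = [
    Goal("Heatmap of income by district", value=8, effort=5, possibility=0.9),
    Goal("Predict footfall for every street", value=10, effort=40, possibility=0.3),
    Goal("Summary table of competitor stores", value=5, effort=2, possibility=0.95),
]

for goal in rank_goals(candidates):
    print(f"{goal.name}: {efficiency(goal):.2f}")
```

Notice that the highest-value goal (the footfall prediction) lands last here, because its effort is large and its chance of success is low. The numbers are only rough estimates, so treat the ranking as a conversation starter rather than a verdict.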
INTERROGATING THE DATA
Once we've set our goals, it's time to dive into the data.
Probing the Data:
To extract meaningful insights from the data, ensure you have:
1. Good questions to ask the data – this is where technical knowledge and understanding business problems come in.
2. Relevant Data – what is already available? What can you get? Can you improve the dataset through feature engineering?
3. Insightful analysis – this is a combination of your technical knowledge and the soft skills to present this well.
A ‘good data science question’ avoids two common mistakes:
Expecting the data to answer questions that it cannot.
Asking questions of the data that do not solve the problem.
You need a plan of attack to decide how to answer the questions you have formulated in order to maximise your efficiency. And one of the best ways to develop this plan is simply by checking if anyone has done this project before you.
By exploring blogs, articles, and open-source projects related to your topic, you can gain invaluable insights. This doesn’t mean you’re copying – it’s about learning. Not only can you understand potential challenges you might face, but leveraging similar existing solutions can give you a head start and save crucial time.
ANTICIPATING OBSTACLES
Every project comes with its fair share of obstacles. Here are the most common ones to keep an eye out for:
Data Access and Extraction: Ensure the data you need is accessible and easy to extract. It sounds basic, but you’d be surprised how often this becomes a hiccup.
Knowledge Gaps: Delve deep into understanding your data. There might be hidden aspects or nuances that are crucial for your analysis.
Volume Concerns: It's not just about having data; it's about having the right amount of data. While a massive dataset might seem like a goldmine, remember that it can also lead to longer processing times. Conversely, too little data might not give you the depth of insight you're aiming for.
Data Integrity: Missing values can throw a wrench in your analysis. And if you're looking at merging datasets, ensure they'll integrate seamlessly while preserving referential integrity.
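As a rough illustration of the data integrity checks above, the sketch below counts missing values per column and finds "orphan" rows whose key has no match in the table they are meant to join to. The table names, columns, and helper functions are hypothetical. In practice you would probably do this with a dataframe library; plain Python just keeps the example self-contained.

```python
def count_missing(rows, columns):
    """Count rows where each column is absent, None, or an empty string."""
    return {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }


def find_orphans(child_rows, parent_rows, key):
    """Return child rows whose key has no matching row in the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row.get(key) not in parent_keys]


# Made-up tables: stores, and sales that should each point at a store
stores = [
    {"store_id": 1, "city": "Leeds"},
    {"store_id": 2, "city": None},
]
sales = [
    {"sale_id": 10, "store_id": 1, "amount": 25.0},
    {"sale_id": 11, "store_id": 3, "amount": 40.0},  # no store 3 exists
    {"sale_id": 12, "store_id": 2, "amount": None},
]

print(count_missing(stores, ["store_id", "city"]))
print(count_missing(sales, ["store_id", "amount"]))
print(find_orphans(sales, stores, "store_id"))
```

Running checks like these before merging tells you up front how many rows would be dropped or left unmatched, instead of discovering it halfway through the analysis.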
These are just a few of the gems Godsey gave us. If you enjoyed this entry and want more from the book, feel free to like and comment on this post on Substack.
I just made my Instagram and Twitter. Feel free to message me there or follow me to get closer to the journey.
I'm really taking away the lesson that knowing how to use the tools doesn't mean I'm good at the job. I must understand how to set clear goals and align with business objectives to deliver an excellent solution.