Importance of Structured Query Language (SQL) in your Data Science career roadmap

SQL Basics, Practice Exercises, Real-world industry use case scenarios, Data Analysis & SQL with Python example

Ajay Arunachalam
10 min read · Apr 15, 2022

Dear Friends,

This time, I thought of writing something that could really help readers in the data science community get started, by sharing from my own experiences. The idea had been on my mind for quite a long time and was pending only due to my schedule; finally, I managed to put things together in decent shape for you to follow along 🙂 Through this blog post, I would like to take the opportunity to share my industry work experience, which could help any budding aspirant out there get a panoramic view of real-world problems & business scenarios. First, I will walk you through some basics of Structured Query Language (SQL), followed by some interesting practice questions, and then discuss different use cases from the telecom and retail store domains. We will also see how to run SQL queries using Python. Finally, we will end the tutorial with an open question for the readers to ponder. All the code from this blog post can be found here. Also, do check out my other blogs/contributions from my Github Repo

Please Note:- For integrity, data privacy & data security reasons, everything shared here is a heavily masked & modified version of my work portfolio. The structure, however, closely follows what was used in practice, so it can be directly interpreted & followed by the readers as demonstrated along this tutorial.

Revisiting SQL

Like any aspiring data scientist out there, when I began my journey my day-to-day tasks implicitly involved extracting raw data from different data sources. Let us understand one thing very clearly: before we talk about Artificial Intelligence or in fact any Machine Learning stuff, data extraction, preprocessing & data analysis play a very important role in the Data Science workflow. In fact, they consume most of the time within the entire pipeline. Exploratory Data Analysis (EDA) is the initial, and an important, phase of the workflow. It gives a first look at the data, helps generate relevant hypotheses, and guides the next steps. However, the EDA process can be a hassle most of the time, and fetching/preparing data from heterogeneous or homogeneous data sources is the first key step in the pipeline. We will start by revisiting the most common basic SQL commands, i.e., selecting data from two tables, joining two tables, etc. The cheatsheet provided below illustrates these in detail.

Concept by Steve Stedman; Copyright & Adapted from https://stevestedman.com/

SQL brain-teaser examples

In this section, we will see some examples that would be a good starting point to understand how SQL can be used to extract insights.

Suppose we have the following schema with two tables, ‘Ads’ and ‘Events’, with ‘ad_id’ & ‘event_id’ as their respective keys.

  • Ads(ad_id, campaign_id, status: active or inactive)
  • Events(event_id, ad_id, source, event_type: impression, click, conversion, date, hour)
Image by author: Tables with their fields

Write SQL queries to extract the following information (one possible answer is sketched after each question):-

1) The number of active ads.

2) All active campaigns. Note:- A campaign is active, if there’s at least one active ad.

3) Number of active campaigns.
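
For reference, here is one possible set of answers to questions 1–3, a sketch assuming standard SQL and that the ‘status’ column stores the literal string ‘active’:

```sql
-- 1) The number of active ads
SELECT COUNT(*) AS active_ads
FROM Ads
WHERE status = 'active';

-- 2) All active campaigns (a campaign is active if it has at least one active ad)
SELECT DISTINCT campaign_id
FROM Ads
WHERE status = 'active';

-- 3) The number of active campaigns
SELECT COUNT(DISTINCT campaign_id) AS active_campaigns
FROM Ads
WHERE status = 'active';
```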

4) The number of events for each ad, broken down by event type.

Image by Author: Example Query Output
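
One possible answer, a simple grouped count:

```sql
-- 4) Events per ad, broken down by event type
SELECT ad_id, event_type, COUNT(*) AS event_count
FROM Events
GROUP BY ad_id, event_type
ORDER BY ad_id, event_type;
```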

5) The number of events over the last week for each active ad, broken down by event type & date (most recent first).

Image by Author: Example Query Output
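
A possible sketch, assuming ‘date’ is a DATE column; the interval syntax below is PostgreSQL-style and varies by SQL dialect:

```sql
-- 5) Last week's events per active ad, by event type & date (most recent first)
SELECT e.ad_id, e.event_type, e.date, COUNT(*) AS event_count
FROM Events e
JOIN Ads a ON a.ad_id = e.ad_id
WHERE a.status = 'active'
  AND e.date >= CURRENT_DATE - INTERVAL '7 days'  -- adjust for your dialect
GROUP BY e.ad_id, e.event_type, e.date
ORDER BY e.date DESC;
```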

6) The number of events per campaign — by event type.

Image by Author: Example Query Output
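
One way to do it, joining Events to Ads to reach the campaign:

```sql
-- 6) Events per campaign, by event type
SELECT a.campaign_id, e.event_type, COUNT(*) AS event_count
FROM Events e
JOIN Ads a ON a.ad_id = e.ad_id
GROUP BY a.campaign_id, e.event_type;
```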

7) The number of events over the last week for each campaign and event type, broken down by date (most recent first).

Image by Author: Example Query Result
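
A sketch combining the two previous ideas (again, PostgreSQL-style date arithmetic):

```sql
-- 7) Last week's events per campaign and event type, by date (most recent first)
SELECT a.campaign_id, e.event_type, e.date, COUNT(*) AS event_count
FROM Events e
JOIN Ads a ON a.ad_id = e.ad_id
WHERE e.date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY a.campaign_id, e.event_type, e.date
ORDER BY e.date DESC;
```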

8) CTR (click-through rate) for each ‘Ad’. Note:- CTR = number of clicks / number of impressions.

Image by Author: Example Query Result
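
One possible answer using conditional aggregation; NULLIF guards against division by zero:

```sql
-- 8) CTR = number of clicks / number of impressions, per ad
SELECT ad_id,
       SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) * 1.0
       / NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0) AS ctr
FROM Events
GROUP BY ad_id;
```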

9) CVR (conversion rate) for each ‘Ad’. Note:- CVR = number of conversions / number of clicks.

Image by Author: Example Query Output
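
Analogously, one possible sketch for CVR:

```sql
-- 9) CVR = number of conversions / number of clicks, per ad
SELECT ad_id,
       SUM(CASE WHEN event_type = 'conversion' THEN 1 ELSE 0 END) * 1.0
       / NULLIF(SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END), 0) AS cvr
FROM Events
GROUP BY ad_id;
```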

10) CTR and CVR for each ‘Ad’ broken down by day and hour (most recent first).

Image by Author: Example Query Output
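
A sketch computing both metrics in one pass:

```sql
-- 10) CTR and CVR per ad, by day and hour (most recent first)
SELECT ad_id, date, hour,
       SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) * 1.0
       / NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0) AS ctr,
       SUM(CASE WHEN event_type = 'conversion' THEN 1 ELSE 0 END) * 1.0
       / NULLIF(SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END), 0) AS cvr
FROM Events
GROUP BY ad_id, date, hour
ORDER BY date DESC, hour DESC;
```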

Industry Use case — Telecom Data

Here, let us consider big data from the telecom sector’s perspective. The different datasets include information such as “CDR (Call Detail Records)”, “User Profile”, “Usage history”, “Billing Details”, “Network Logs”, “Cell site information”, “Price Plan Details”, “Service Order”, etc. Say, given the tables & their schemas, you had to analyze the data to extract the information narrated in the scenarios below. Note:- All the actual schema names are masked as ‘xxxx’. For Prepaid/Postpaid customers, just replace the tag ‘PREP’ with ‘POST’ accordingly; the rest of the structure remains the same.

Usecase-1:- Get the browsing history of the telecom Prepaid or Postpaid mobile users. Group them into different categories for the period June–Dec 2021.

https://gist.github.com/ajayarunachalam/d37d7595ea4751b6bb4a95f7126210c0

Usecase-2:- Get the AVERAGE REVENUE PER USER (ARPU) for the Prepaid or Postpaid Mobile users from their billing history for the period of March 2021. We calculate ARPU as the sum of cur_rc_amt + discount_amt + cur_uc_amt + cur_oc_amt. The average ARPU is then the total ARPU divided by the number of active subscribers.
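
To make the formula concrete, here is a hypothetical sketch; ‘xxxx_billing’, ‘subscriber_id’, ‘bill_month’ and ‘subscriber_status’ are all placeholder names, since the actual schema is masked:

```sql
-- Average ARPU for March 2021 = total ARPU / number of active subscribers
SELECT SUM(cur_rc_amt + discount_amt + cur_uc_amt + cur_oc_amt)
       / COUNT(DISTINCT subscriber_id) AS avg_arpu   -- placeholder key column
FROM xxxx_billing                                    -- masked table name
WHERE bill_month = '2021-03'                         -- placeholder period column
  AND subscriber_status = 'ACTIVE';                  -- placeholder status flag
```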

Usecase-3:- Get the information of customers who accessed the competitor’s broadband helpline number by calling their call center, for the period of Mar–June 2021. Also, get the information of customers who accessed the broadband competitor’s information through their websites for the same period.

Usecase-4:- Calculate the avg. Quality Of Experience (QOE) estimator for the broadband users for the period of March 2021.

Usecase-5:- Get all the ACTIVE Mobile SUBSCRIBERS (Prepaid/Postpaid) until June 2021.

Usecase-6:- Get the Churned Customer Report with detailed information, categorized by the reasons for churning. Say, let us get the report for the period Dec 2020 – June 2021, and also for any particular querying month.

Usecase-7:- Get the call quality information for the churned customers (Prepaid/Postpaid) for the month of June 2021.

Usecase-8:- Get the usage history of the customers, and the customers with the highest usage by state/region/district, etc., for the period June 2021.

Usecase-9:- Get the billing details of the subscribers (Prepaid/Postpaid) for the month of June 2021.

Usecase-10:- Analyze the data for the following objectives, along with their geographical history (latitude/longitude), for the month of June 2021:
a) Misbehavior analysis
b) Fault complaint analysis

Industry Use case — Retail Store Data

Let us consider a retail store example. We are provided with the following data, extracted from a data warehouse. Suppose we have the following schema with two tables, ‘dim_customer’ and ‘tr_product’, joined on the ‘customer’ field.

  • dim_customer(customer, email)
  • tr_product(customer, product)
Image by author: Tables with their fields

Imagine you had to analyze the data to answer the following business questions (possible answers are sketched after the questions).

a) How many customers have bought, and how many have never bought anything?

b) Most purchased item?
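
Given the schema above, sketches of both answers might look like this (‘LIMIT 1’ is MySQL/PostgreSQL syntax; use ‘TOP 1’ on SQL Server):

```sql
-- a) Customers who have bought vs. customers who never bought
SELECT COUNT(DISTINCT p.customer)                              AS have_bought,
       COUNT(DISTINCT c.customer) - COUNT(DISTINCT p.customer) AS never_bought
FROM dim_customer c
LEFT JOIN tr_product p ON p.customer = c.customer;

-- b) The most purchased item
SELECT product, COUNT(*) AS purchases
FROM tr_product
GROUP BY product
ORDER BY purchases DESC
LIMIT 1;
```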

SQL through Python — Example

In this section, we will see how one can extract information using SQL from within a Python environment. SQLAlchemy is a Python SQL toolkit that gives application developers the full power and flexibility of SQL from Python. Let us see an example of how one can create a SQL table from dummy data generated within the Python environment. The snippet below demonstrates this scenario.

We start by importing the ‘pandas’ & ‘random’ Python libraries, to create fake data and build a dataframe, followed by the ‘Py_Sql_Alchemy_Class’ Python module, so that we can use its ‘DB_Table_Ops’ class. The operations in the if block only need to be done once, so we have a setup boolean that we can toggle. We also instantiate the ‘DB_Table_Ops’ class as ‘dbto’. The data dictionary ‘d’ is made up of convenient fake data that is put into a pandas dataframe. Then we create a table schema, make sure any old version of ‘table_one’ is dropped from our SQL database, create the table, recheck that it exists, and finally push the dataframe into the table.
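
The original ‘Py_Sql_Alchemy_Class’ module lives in the linked repo; below is a minimal, self-contained approximation of the same workflow using plain SQLAlchemy and pandas (the connection string and column names are placeholders, not the exact data from the original gist):

```python
import random

import pandas as pd
from sqlalchemy import create_engine, inspect, text

engine = create_engine("sqlite:///demo.db")  # placeholder connection string

setup = True  # the setup block only needs to run once
if setup:
    # Convenient fake data, put into a pandas dataframe
    d = {"id": list(range(10)),
         "value": [random.randint(0, 100) for _ in range(10)]}
    df = pd.DataFrame(d)

    # Make sure any old version of 'table_one' is dropped
    with engine.begin() as conn:
        conn.execute(text("DROP TABLE IF EXISTS table_one"))

    # Create the table and push the dataframe into it
    df.to_sql("table_one", engine, index=False)

    # Recheck that the table now exists
    assert inspect(engine).has_table("table_one")
```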

Now let us run a couple of simple test queries for the following scenarios (a sketch follows the expected outputs):-

1. Select data from ‘table_one’; the expected output is shown below.
Query result output: Image by Author

2. Select data from a query against ‘table_one’; the expected output is shown below.

Query result output: Image by Author
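
A sketch of both test queries using pandas (the column names follow the placeholder fake data above):

```python
# 1. Select all data from 'table_one'
print(pd.read_sql("SELECT * FROM table_one", engine))

# 2. Select data from a query against 'table_one'
print(pd.read_sql("SELECT id, value FROM table_one WHERE value > 50", engine))
```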

Data Analysis with SQL — Example

Here, let us consider the following scenario. You are given data that provides information about an examination. The gender of each candidate taking the exam is provided, along with whether the candidate answered a question related to “Football” correctly, and the candidate’s total final score. Suppose we have the following schema with one table, ‘Complaint’:

  • Complaint(Gender (M/F), Football_Score_Question (1 = Correct / 0 = Incorrect), Total_Examination_Score)

The example snippet of the ‘Complaint’ table is as follows:-

Image by Author: Sample data

Write a SQL query to check whether the football question shows any gender bias.

I leave this analysis to the readers to give it a thorough thought, and decode it.
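
If you want a nudge to get started, one possible first step is to compare the correct-answer rate and overall ability by gender; the statistical interpretation of whether this constitutes bias is the part left to you:

```sql
-- Correct-answer rate on the football question vs. average total score, by gender
SELECT Gender,
       AVG(Football_Score_Question * 1.0) AS football_correct_rate,
       AVG(Total_Examination_Score)       AS avg_total_score,
       COUNT(*)                           AS candidates
FROM Complaint
GROUP BY Gender;
```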

Conclusion

SQL is the most widely used language for working with databases, and a must-have skill for data science. It is used to communicate with a database: we use SQL statements to create datasets, gain critical insights from data, create reports & dashboards from the extracted/processed data, etc. Through this blog post, we revised the basics of SQL, went through SQL technical exercises, worked hands-on through examples from different real-world industry use cases, saw how to use SQL through Python, and glimpsed how data analysis extracts insights from raw data. Finally, I am looking forward to your answers to the open question in the comments section.

If you liked this blog post, please encourage me to publish more content by showing your support & love with a clap 👏

Keep Learning... Cheers :) All the code from this tutorial can be found here.

Let’s get connected

You can reach me at ajay.arunachalam08@gmail.com; Contact via Linkedin; Check out my other works from my Github Repo

About Author

I am a Data Science Manager holding a Scrum Master Certification, as well as an AWS Certified Machine Learning Specialist & AWS Certified Cloud Solution Architect. I have worked in the Telecom, Retail, Banking, and Healthcare sectors. From my experience working on real-world business problems, I fully acknowledge that finding good representations is the key to designing systems that can solve interesting, challenging real-world problems, go beyond human-level intelligence, and ultimately explain complicated data that we don’t understand. To achieve this, I envision learning algorithms that can learn feature representations from both unlabeled and labeled data, be guided with and/or without human interaction, and operate at different levels of abstraction, in order to bridge the gap between low-level data and high-level abstract concepts. I also truly believe that transparency in AI systems is the need of the hour. With this in mind, I have always strived to democratize AI, and I am inclined towards building Interpretable/Explainable Models. My interests are in building real-time AI-driven solutions, Machine Learning/Deep Learning at scale, productionalizing Explainable Models, Deep Reinforcement Learning, Computer Vision, and Natural Language Processing, specifically learning good representations.


