This post is about create a SQLite3 schema to store the Spotify data set in at least 3rd normal form (3NF), populate the tables and then using an example SQL query to find the names of all playlists that contain instrumentals.

The detailed source codes to is included in this notebook.

You can also check the Spotify SQL folder in my Github Repo ZhiQiu976/source-codes-tech-posts to access the normalized Spotify database.

Database Normalization

Database normalization is performed for two main reasons - reduce redundancy and prevent inconsistencies on insert/update/delete.

Note: A fully normalized database is in domain-key normal form (DK/NF) if every constraint is a logical consequence of the definition of the candidate key and domains. However, most practical normalization procedures go through a series of steps known as first, second and third normal forms, and ignore potential modification anomalies that may remain.

Outcome:

Every table should not have any
- duplication
- dependencies that are not key or domain constraints

First Normal Form (1NF)

Table has a primary key (unique, non-null column that identifies each row)**
No repeating groups of columns
Each cell contains a single value

Clearly there is no unique primary key in the original dataframe and this is perhaps because the original dataframe is a denormalized version and not in a SQLite3 schema (so no concept of primary key in this original dataset). Therefore, we skip requirement 1 at this stage.

For requirement 2 and 3, we can see from the original data description) and the above example record that there is no repeating groups o columns and each cell contains a single value.

Second Normal Form (2NF)

All columns in each row depend fully on candidate keys
- Identify candidate Primary Key for each row
- If there is a composite PK, see if other columns have partial dependencies

To achieve 2NF, we would breake the original dataset into three sub-datasets:

df_song = df[['track_id', 'track_name', 'track_artist', 'track_popularity', 
              'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 
              'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 
              'duration_ms']]
df_album = df[['track_album_id', 'track_album_name', 'track_album_release_date']]
df_playlist = df[['playlist_id', 'playlist_name', 'playlist_genre', 'playlist_subgenre']]

Third Normal Form (3NF)

No transitive dependencies between non-candidate columns

In order to achieve 3NF, we further split and organize the above three datasets into 6 datafrmes:

df_song: track_id, track_name, track_artist, track_popularity and the other song characteristics
df_album: track_album_id, track_album_name track_album_release_date
df_playlist_new: playlist_id, playlist_name, playlist_genre
df_playlist_subgenre: playlist_genre, playlist_subgenre
df_song_album: track_id and track_album_id
df_song_playlist: track_id and playlist_id

df_song = df_song.drop_duplicates(subset=['track_id'])
df_album = df_album.drop_duplicates(subset=['track_album_id'])
df_playlist = df_playlist.drop_duplicates(subset=['playlist_id'])

df_playlist_new = df_playlist[['playlist_id', 'playlist_name', 'playlist_genre']]
df_playlist_subgenre = df_playlist[['playlist_genre', 'playlist_subgenre']].drop_duplicates(
    subset=['playlist_genre'])
    
df_song_album = df[['track_id', 'track_album_id']].drop_duplicates(subset=['track_id'])
df_song_playlist = df[['track_id', 'playlist_id']].drop_duplicates(subset=['track_id'])

SQLite3 schema

Now that we have done the normalizetion part, we could directly construct a SQL database for the above dataframes:

import sqlite3

# Create database
con = sqlite3.connect('spotify.db')

# Create tables in the database
df_song.to_sql(name='song', con=con, index = False)
df_song_album.to_sql(name='song_album', con=con, index = False)
df_album.to_sql(name='album', con=con, index = False)
df_song_playlist.to_sql(name='song_playlist', con=con, index = False)
df_playlist_new.to_sql(name='playlist_new', con=con, index = False)
df_playlist_subgenre.to_sql(name='playlist_subgenre', con=con, index = False)

The details of this schema are shown as below, the sql queries used to construct this database could be found in the last column:

Example SQL query

Now we would use a SQL query to find the names of all playlists that contain instrumentals:

%%sql 
SELECT COUNT(*) AS num_song, playlist_name
FROM song
INNER JOIN song_playlist
ON song_playlist.track_id = song.track_id
INNER JOIN playlist_new
ON playlist_new.playlist_id = song_playlist.playlist_id
WHERE instrumentalness > 0.5 
GROUP BY playlist_name
ORDER BY num_song DESC

Here are 10 example playlists that contain instrumentals, ordered by the number of songs in each playlist:

The above is just a simple example query and we can certainly use many other queries to explore and take further use of this Spotify database freely via SQL. ✨

SQLite3 Schema for Spotify Data Set

Normalization and Example SQL Query