This post is about doing data analysis and visualization for the 2017 Doctorate Recipients Dataset and constructing a dashboard with Streamlit for results demonstration. Note that dashboard built with Streamlit originally can only be accessed locally, but we can use Heroku to built a small app and make this dashboard online and public!

My final dashboard can be accessed via this link (it may take some time to load in 😛) and the source codes could be find on my Streamlit-Dashboard branch at GitHub.


The Dashboard

This is the dashboard I built ✨ (Use the chevron-right badge to display the sidebar.)

Details about how to play around with it and how to deploy a dashboard are included respectively in the Visual section and Deployment section below.


Data

The 2017 Doctorate Recipients Dataset is quite a rich dateset and contains a lot of information. My analysis mainly focus on information about recipents and institutions. The tables I used are Table 1, Table 3, Table 5 and Table 6.

Firstly I did some pre-processing for the excel data to make them more suitable for plotting and visualization (details included in the pre-processing notebook).

For the first three tables oprations are quite simple, mainly about formatting.

# table 1
df = pd.read_excel('sed17-sr-tab001.xlsx', skiprows=3)
df.iloc[18,2] = 0
df.to_excel('df.xlsx')

# table 5
df2 = pd.read_excel('sed17-sr-tab005.xlsx', skiprows=3)
df2 = df2.iloc[:-2,:]
df2 = df2.set_index('Rank')
df2.to_excel('df2.xlsx')

# table 3
df3 = pd.read_excel('sed17-sr-tab003.xlsx', skiprows=3)
df3 = df3.set_index('Rank')
df3.to_excel('df3.xlsx')

As for the last Table, it is kind of a nested shape, so I transformed its structure a little bit.

Before

image3

After

image3


# table 6
df4 = pd.read_excel("sed17-sr-tab006.xlsx", skiprows=3).loc[1:,].reset_index(drop=True)

def func(i, x):
    '''
    Helper function for data transformation.
    '''
    
    temp = df4.iloc[:, [0, 2*i+1, 2*i+2]]
    temp.columns = ['State or location', 'Male', 'Female']
    temp = temp.melt(id_vars = 'State or location', var_name = 'sex')
    temp["field"] = x
    return temp

df4_new = pd.concat([func(i, x) for i, x in enumerate(df4.columns[1::2])]).reset_index(drop=True)
df4_new = df4_new.loc[np.logical_and(df4_new["value"] != 0, df4_new["value"] != 'D'),].reset_index(drop=True)
df4_new.to_excel('df4_new.xlsx')


Visual

Dashboard Link

In general the design of my dashboard is quite flexible.

In general there are three main sections/categories: Recipients Info, Institutions Info - Ranking, Institutions Info - Disciplinary, which could be chosen from the scroll-down button in the side bar.

session = st.sidebar.selectbox("Category", ["Recipients Info", "Institutions Info - Ranking", "Institutions Info - Disciplinary"])


image3


In the Recipients Info category, there are two types of animated graphs for two main variables: Number of Doctorate Recipents and Percentage Change from Previous Year.

image3


image3

The animated labels are achieved by the following codes (templates can de found from the Altair Library):

# Transparent selectors across the chart. This is what tells us
# the x-value of the cursor
selectors = alt.Chart(source).mark_point().encode(
    x='year',
    opacity=alt.value(0),
).add_selection(
    nearest
)

# Draw points on the line, and highlight based on selection
points = line.mark_point().encode(
    opacity=alt.condition(nearest, alt.value(1), alt.value(0))
)

# Draw text labels near the points, and highlight based on selection
text = line.mark_text(align='left', dx=5, dy=-5).encode(
    text=alt.condition(nearest, 'y:Q', alt.value(' '))
)

# Draw a rule at the location of the selection
rules = alt.Chart(source).mark_rule(color='gray').encode(
    x='year',
).transform_filter(
    nearest
)

# Put the five layers into a chart and bind the data
output = alt.layer(
    line, selectors, points, rules, text
).properties(
    width=600, height=300
)

For either plot we can use the slider in the side bar to choose the year range we want to look at:

st.sidebar.subheader("Recipients Info")
year_range =  st.sidebar.slider('Select a range of years',
    min(df.Year+1), max(df.Year), (1990, 2005))


image4


In the Institutions Info - Ranking category, we can checking top-50 rankings about number of Doctorate recipients for either State or University and use the slider on the left to choose our desired range of ranking:

image4


#sidebar
st.sidebar.subheader("Institutions Info - Ranking")
rank_range =  st.sidebar.slider('Select the ranking you want to see',
                                min(df3.Rank+1), max(df3.Rank), (1, 5))

if mode == 'State':
    rank = [i in range(rank_range[0],rank_range[1]+1) for i in df2.Rank]
    filtered = df2[rank]
    st.table(filtered)


elif mode == 'University':
    rank = [i in range(rank_range[0],rank_range[1]+1) for i in df3.Rank]
    filtered = df3[rank]
    st.table(filtered)


In the Institutions Info - Disciplinary category, we can checking doctorate recipients’ female-male distribution in different states with respect to different fields/disciplines:

image3


image3


graph = alt.Chart(tmp).mark_line(point=True).encode(
    x = 'State or location', 
    y = 'value',
    color='sex:N').interactive()
st.altair_chart(graph, use_container_width=True)


Deployment

The deployment procedure requires opening an account on Heroku, creating a corresponding Github repository, adding some configuration files and a final building. Detailed tutorials could be found on Heroku’s official website and the post attatched How to Build a Streamlit App in Python.