Data Preparation
bar_chart_race exposes two functions, prepare_wide_data and prepare_long_data, to transform pandas DataFrames to the correct form.
Wide data
To show how the prepare_wide_data function works, we'll read in the last three rows from the covid19_tutorial dataset.
import bar_chart_race as bcr
df = bcr.load_dataset('covid19_tutorial').tail(3)
df
date | Belgium | China | France | Germany | Iran | Italy | Netherlands | Spain | USA | United Kingdom |
---|---|---|---|---|---|---|---|---|---|---|
2020-04-10 | 3019 | 3340 | 13215 | 2767 | 4232 | 18849 | 2520 | 16081 | 18595 | 8974 |
2020-04-11 | 3346 | 3343 | 13851 | 2894 | 4357 | 19468 | 2653 | 16606 | 20471 | 9892 |
2020-04-12 | 3600 | 3343 | 14412 | 3022 | 4474 | 19899 | 2747 | 17209 | 22032 | 10629 |
This format is sometimes called 'wide' data because every column holds the same kind of measurement (here, deaths), with one column per country. Each new country would add another column to the DataFrame, making it wider. This is the format that the bar_chart_race function requires.
The prepare_wide_data function is what bar_chart_race calls internally, so you don't need to call it directly. However, it is available so that you can view and understand how the data gets prepared. To transition the bars smoothly from one time period to the next, both the length and the position of each bar are changed linearly. Two DataFrames of the same shape are returned: one for the values and the other for the ranks.
df_values, df_ranks = bcr.prepare_wide_data(df, steps_per_period=4,
                                            orientation='h', sort='desc')
Below, we have the df_values DataFrame containing the length of each bar for each frame. Each period except the last now spans four rows.
date | Belgium | China | France | Germany | Iran | Italy | Netherlands | Spain | USA | United Kingdom |
---|---|---|---|---|---|---|---|---|---|---|
2020-04-10 | 3019.00 | 3340.00 | 13215.00 | 2767.00 | 4232.00 | 18849.00 | 2520.00 | 16081.00 | 18595.00 | 8974.00 |
2020-04-10 | 3100.75 | 3340.75 | 13374.00 | 2798.75 | 4263.25 | 19003.75 | 2553.25 | 16212.25 | 19064.00 | 9203.50 |
2020-04-10 | 3182.50 | 3341.50 | 13533.00 | 2830.50 | 4294.50 | 19158.50 | 2586.50 | 16343.50 | 19533.00 | 9433.00 |
2020-04-10 | 3264.25 | 3342.25 | 13692.00 | 2862.25 | 4325.75 | 19313.25 | 2619.75 | 16474.75 | 20002.00 | 9662.50 |
2020-04-11 | 3346.00 | 3343.00 | 13851.00 | 2894.00 | 4357.00 | 19468.00 | 2653.00 | 16606.00 | 20471.00 | 9892.00 |
2020-04-11 | 3409.50 | 3343.00 | 13991.25 | 2926.00 | 4386.25 | 19575.75 | 2676.50 | 16756.75 | 20861.25 | 10076.25 |
2020-04-11 | 3473.00 | 3343.00 | 14131.50 | 2958.00 | 4415.50 | 19683.50 | 2700.00 | 16907.50 | 21251.50 | 10260.50 |
2020-04-11 | 3536.50 | 3343.00 | 14271.75 | 2990.00 | 4444.75 | 19791.25 | 2723.50 | 17058.25 | 21641.75 | 10444.75 |
2020-04-12 | 3600.00 | 3343.00 | 14412.00 | 3022.00 | 4474.00 | 19899.00 | 2747.00 | 17209.00 | 22032.00 | 10629.00 |
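The interpolation can be reproduced by hand for a single column. The following is a minimal plain-Python sketch, not the library's implementation; the helper name interpolate_periods is hypothetical. It uses the three Belgium values from the original data with four steps per period.

```python
def interpolate_periods(values, steps_per_period):
    """Linearly interpolate between consecutive period values.

    Hypothetical helper for illustration only. Each period contributes
    steps_per_period frames; the final period contributes one frame.
    """
    frames = []
    for start, end in zip(values, values[1:]):
        for step in range(steps_per_period):
            frames.append(start + (end - start) * step / steps_per_period)
    frames.append(float(values[-1]))
    return frames


# Belgium on 2020-04-10, 2020-04-11 and 2020-04-12 (from the original data)
belgium = [3019, 3346, 3600]
frames = interpolate_periods(belgium, steps_per_period=4)
print(frames)
```

The nine numbers printed should line up with the Belgium column of df_values: four frames for each of the first two dates and a single frame for the last.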
The df_ranks DataFrame contains the numerical ranking of each country and is used for the position of the bar along the y-axis (or x-axis when vertical). Notice that two pairs of bars switch places: Belgium with China, and Italy with USA.
date | Belgium | China | France | Germany | Iran | Italy | Netherlands | Spain | USA | United Kingdom |
---|---|---|---|---|---|---|---|---|---|---|
2020-04-10 | 3.00 | 4.00 | 7.0 | 2.0 | 5.0 | 10.00 | 1.0 | 8.0 | 9.00 | 6.0 |
2020-04-10 | 3.25 | 3.75 | 7.0 | 2.0 | 5.0 | 9.75 | 1.0 | 8.0 | 9.25 | 6.0 |
2020-04-10 | 3.50 | 3.50 | 7.0 | 2.0 | 5.0 | 9.50 | 1.0 | 8.0 | 9.50 | 6.0 |
2020-04-10 | 3.75 | 3.25 | 7.0 | 2.0 | 5.0 | 9.25 | 1.0 | 8.0 | 9.75 | 6.0 |
2020-04-11 | 4.00 | 3.00 | 7.0 | 2.0 | 5.0 | 9.00 | 1.0 | 8.0 | 10.00 | 6.0 |
2020-04-11 | 4.00 | 3.00 | 7.0 | 2.0 | 5.0 | 9.00 | 1.0 | 8.0 | 10.00 | 6.0 |
2020-04-11 | 4.00 | 3.00 | 7.0 | 2.0 | 5.0 | 9.00 | 1.0 | 8.0 | 10.00 | 6.0 |
2020-04-11 | 4.00 | 3.00 | 7.0 | 2.0 | 5.0 | 9.00 | 1.0 | 8.0 | 10.00 | 6.0 |
2020-04-12 | 4.00 | 3.00 | 7.0 | 2.0 | 5.0 | 9.00 | 1.0 | 8.0 | 10.00 | 6.0 |
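The fractional ranks in this table are consistent with ranking each original period first and then interpolating the ranks, rather than ranking the interpolated values. Here is a minimal plain-Python sketch under that assumption. The helper names are hypothetical, and ties are broken by insertion order, which may differ from what the library does. Only the four lowest countries are included, so their ranks (1 to 4) match the table.

```python
def rank_ascending(row):
    """Rank values so the smallest gets 1 and the largest gets len(row)."""
    ordered = sorted(row, key=row.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


def interpolate(start, end, steps_per_period):
    """Linear steps from start to end, both endpoints included."""
    return [start + (end - start) * step / steps_per_period
            for step in range(steps_per_period + 1)]


# Values for 2020-04-10 and 2020-04-11, copied from the original data
apr_10 = {'Belgium': 3019, 'China': 3340, 'Germany': 2767, 'Netherlands': 2520}
apr_11 = {'Belgium': 3346, 'China': 3343, 'Germany': 2894, 'Netherlands': 2653}

ranks_10 = rank_ascending(apr_10)
ranks_11 = rank_ascending(apr_11)

# Interpolate each country's rank between the two periods
rank_paths = {name: interpolate(ranks_10[name], ranks_11[name], 4)
              for name in apr_10}
print(rank_paths['Belgium'])
print(rank_paths['China'])
```

Belgium climbs from rank 3 to rank 4 while China falls from 4 to 3, and the two paths cross at 3.5, which is the frame where the bars pass each other.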
Don't use before animation
There is no need to call this function before making the animation if you already have wide data. Pass your original data to the bar_chart_race function.
Long data
'Long' data is a format in which all values of the same kind are stored in a single column. Take a look at the baseball data below, which contains the cumulative number of home runs that each of the top 20 home run hitters had accumulated by year.
df_baseball = bcr.load_dataset('baseball')
df_baseball
| | name | year | hr |
---|---|---|---|
0 | Hank Aaron | 0 | 0 |
1 | Barry Bonds | 0 | 0 |
2 | Jimmie Foxx | 0 | 0 |
3 | Ken Griffey | 0 | 0 |
4 | Reggie Jackson | 0 | 0 |
... | ... | ... | ... |
424 | Jim Thome | 18 | 541 |
425 | Jim Thome | 19 | 564 |
426 | Jim Thome | 20 | 589 |
427 | Jim Thome | 21 | 604 |
428 | Jim Thome | 22 | 612 |
Name, year, and home runs are each in a single column, in contrast to the wide data, where every column held the same type of data. Long data must be converted to wide data by pivoting the categorical column and placing the period in the index. The prepare_long_data function provides this functionality. It simply uses the pandas pivot_table method to pivot (and potentially aggregate) the data before passing it to prepare_wide_data. The same two DataFrames are returned.
df_values, df_ranks = bcr.prepare_long_data(df_baseball, index='year', columns='name',
                                            values='hr', steps_per_period=5)
df_values.head(16)
The linearly interpolated values for the first three seasons of each player:
year | Albert Pujols | Alex Rodriguez | Babe Ruth | Barry Bonds | ... | Reggie Jackson | Sammy Sosa | Willie Mays | Willie McCovey |
---|---|---|---|---|---|---|---|---|---|
0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
0.0 | 7.4 | 0.0 | 0.0 | 3.2 | ... | 0.2 | 0.8 | 4.0 | 2.6 |
0.0 | 14.8 | 0.0 | 0.0 | 6.4 | ... | 0.4 | 1.6 | 8.0 | 5.2 |
0.0 | 22.2 | 0.0 | 0.0 | 9.6 | ... | 0.6 | 2.4 | 12.0 | 7.8 |
0.0 | 29.6 | 0.0 | 0.0 | 12.8 | ... | 0.8 | 3.2 | 16.0 | 10.4 |
1.0 | 37.0 | 0.0 | 0.0 | 16.0 | ... | 1.0 | 4.0 | 20.0 | 13.0 |
1.0 | 43.8 | 1.0 | 0.8 | 21.0 | ... | 6.8 | 7.0 | 20.8 | 15.6 |
1.0 | 50.6 | 2.0 | 1.6 | 26.0 | ... | 12.6 | 10.0 | 21.6 | 18.2 |
1.0 | 57.4 | 3.0 | 2.4 | 31.0 | ... | 18.4 | 13.0 | 22.4 | 20.8 |
1.0 | 64.2 | 4.0 | 3.2 | 36.0 | ... | 24.2 | 16.0 | 23.2 | 23.4 |
2.0 | 71.0 | 5.0 | 4.0 | 41.0 | ... | 30.0 | 19.0 | 24.0 | 26.0 |
2.0 | 79.6 | 12.2 | 4.6 | 45.8 | ... | 39.4 | 21.0 | 32.2 | 29.6 |
2.0 | 88.2 | 19.4 | 5.2 | 50.6 | ... | 48.8 | 23.0 | 40.4 | 33.2 |
2.0 | 96.8 | 26.6 | 5.8 | 55.4 | ... | 58.2 | 25.0 | 48.6 | 36.8 |
2.0 | 105.4 | 33.8 | 6.4 | 60.2 | ... | 67.6 | 27.0 | 56.8 | 40.4 |
3.0 | 114.0 | 41.0 | 7.0 | 65.0 | ... | 77.0 | 29.0 | 65.0 | 44.0 |
The rankings change substantially during this time period.
df_ranks.head(16)
year | Albert Pujols | Alex Rodriguez | Babe Ruth | Barry Bonds | ... | Reggie Jackson | Sammy Sosa | Willie Mays | Willie McCovey |
---|---|---|---|---|---|---|---|---|---|
0.0 | 20.0 | 19.0 | 18.0 | 17.0 | ... | 4.0 | 3.0 | 2.0 | 1.0 |
0.0 | 19.8 | 16.0 | 15.0 | 17.0 | ... | 4.2 | 4.8 | 5.2 | 3.4 |
0.0 | 19.6 | 13.0 | 12.0 | 17.0 | ... | 4.4 | 6.6 | 8.4 | 5.8 |
0.0 | 19.4 | 10.0 | 9.0 | 17.0 | ... | 4.6 | 8.4 | 11.6 | 8.2 |
0.0 | 19.2 | 7.0 | 6.0 | 17.0 | ... | 4.8 | 10.2 | 14.8 | 10.6 |
1.0 | 19.0 | 4.0 | 3.0 | 17.0 | ... | 5.0 | 12.0 | 18.0 | 13.0 |
1.0 | 19.2 | 4.2 | 3.2 | 17.0 | ... | 6.6 | 11.2 | 16.6 | 12.8 |
1.0 | 19.4 | 4.4 | 3.4 | 17.0 | ... | 8.2 | 10.4 | 15.2 | 12.6 |
1.0 | 19.6 | 4.6 | 3.6 | 17.0 | ... | 9.8 | 9.6 | 13.8 | 12.4 |
1.0 | 19.8 | 4.8 | 3.8 | 17.0 | ... | 11.4 | 8.8 | 12.4 | 12.2 |
2.0 | 20.0 | 5.0 | 4.0 | 17.0 | ... | 13.0 | 8.0 | 11.0 | 12.0 |
2.0 | 20.0 | 5.6 | 3.6 | 16.6 | ... | 13.8 | 7.8 | 11.6 | 11.4 |
2.0 | 20.0 | 6.2 | 3.2 | 16.2 | ... | 14.6 | 7.6 | 12.2 | 10.8 |
2.0 | 20.0 | 6.8 | 2.8 | 15.8 | ... | 15.4 | 7.4 | 12.8 | 10.2 |
2.0 | 20.0 | 7.4 | 2.4 | 15.4 | ... | 16.2 | 7.2 | 13.4 | 9.6 |
3.0 | 20.0 | 8.0 | 2.0 | 15.0 | ... | 17.0 | 7.0 | 14.0 | 9.0 |
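The pivot step that turns long data into wide data can be pictured without pandas. The sketch below is a plain-Python stand-in for the reshaping done with index='year', columns='name', values='hr'. The pivot helper is hypothetical and, unlike pivot_table, performs no aggregation: a repeated year/name pair simply overwrites the earlier value. The records are the season 1 and season 2 totals for two players, read from the df_values table above.

```python
# Long format: one record per (name, year) pair
long_rows = [
    {'name': 'Albert Pujols', 'year': 1, 'hr': 37},
    {'name': 'Barry Bonds', 'year': 1, 'hr': 16},
    {'name': 'Albert Pujols', 'year': 2, 'hr': 71},
    {'name': 'Barry Bonds', 'year': 2, 'hr': 41},
]


def pivot(rows, index, columns, values):
    """Reshape long records into {index value: {column value: value}}.

    Hypothetical helper for illustration; no aggregation is performed.
    """
    wide = {}
    for row in rows:
        wide.setdefault(row[index], {})[row[columns]] = row[values]
    return wide


# Wide format: one entry per year, one key per player
wide = pivot(long_rows, index='year', columns='name', values='hr')
for year, row in wide.items():
    print(year, row)
```

Each year becomes one row and each player becomes one column, which is the shape that prepare_wide_data expects.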
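Usage before animation
If you wish to use this function before an animation, set steps_per_period to 1. Because bar_chart_race calls prepare_wide_data itself, data that has already been interpolated would otherwise be interpolated a second time.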
df_values, df_ranks = bcr.prepare_long_data(df_baseball, index='year', columns='name',
                                            values='hr', steps_per_period=1,
                                            orientation='h', sort='desc')

def period_summary(values, ranks):
    # Text for each period: the current leader and the margin over second place
    top2 = values.nlargest(2)
    leader = top2.index[0]
    lead = top2.iloc[0] - top2.iloc[1]
    s = f'{leader} by {lead:.0f}'
    return {'s': s, 'x': .95, 'y': .07, 'ha': 'right', 'size': 8}

# period_summary_func is added here so that the function defined above is used
bcr.bar_chart_race(df_values, period_length=1000,
                   fixed_max=True, fixed_order=True, n_bars=10,
                   figsize=(5, 3), period_fmt='Season {x:,.0f}',
                   period_summary_func=period_summary,
                   title='Top 10 Home Run Hitters by Season Played')