Data Preparation

bar_chart_race exposes two functions, prepare_wide_data and prepare_long_data, to transform pandas DataFrames into the correct form.

Wide data

To show how the prepare_wide_data function works, we'll read in the last three rows from the covid19_tutorial dataset.

import bar_chart_race as bcr

df = bcr.load_dataset('covid19_tutorial').tail(3)
df

	Belgium	China	France	Germany	Iran	Italy	Netherlands	Spain	USA	United Kingdom
date
2020-04-10	3019	3340	13215	2767	4232	18849	2520	16081	18595	8974
2020-04-11	3346	3343	13851	2894	4357	19468	2653	16606	20471	9892
2020-04-12	3600	3343	14412	3022	4474	19899	2747	17209	22032	10629

This format is sometimes known as 'wide' data: every column holds the same kind of measurement (here, deaths), with one column per country. Each new country adds another column to the DataFrame, making it wider. This is the format that the bar_chart_race function requires.

The prepare_wide_data function is what bar_chart_race calls internally, so you don't need to call it directly. However, it is available so that you can view and understand how the data gets prepared. To transition the bars smoothly from one time period to the next, both the length and the position of each bar are changed linearly. Two DataFrames of the same shape are returned: one for the values and the other for the ranks.

df_values, df_ranks = bcr.prepare_wide_data(df, steps_per_period=4, 
                                            orientation='h', sort='desc')

Below, we have the df_values DataFrame containing the length of each bar for each frame. Each period except the last now spans four rows.

	Belgium	China	France	Germany	Iran	Italy	Netherlands	Spain	USA	United Kingdom
date
2020-04-10	3019.00	3340.00	13215.00	2767.00	4232.00	18849.00	2520.00	16081.00	18595.00	8974.00
2020-04-10	3100.75	3340.75	13374.00	2798.75	4263.25	19003.75	2553.25	16212.25	19064.00	9203.50
2020-04-10	3182.50	3341.50	13533.00	2830.50	4294.50	19158.50	2586.50	16343.50	19533.00	9433.00
2020-04-10	3264.25	3342.25	13692.00	2862.25	4325.75	19313.25	2619.75	16474.75	20002.00	9662.50
2020-04-11	3346.00	3343.00	13851.00	2894.00	4357.00	19468.00	2653.00	16606.00	20471.00	9892.00
2020-04-11	3409.50	3343.00	13991.25	2926.00	4386.25	19575.75	2676.50	16756.75	20861.25	10076.25
2020-04-11	3473.00	3343.00	14131.50	2958.00	4415.50	19683.50	2700.00	16907.50	21251.50	10260.50
2020-04-11	3536.50	3343.00	14271.75	2990.00	4444.75	19791.25	2723.50	17058.25	21641.75	10444.75
2020-04-12	3600.00	3343.00	14412.00	3022.00	4474.00	19899.00	2747.00	17209.00	22032.00	10629.00
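
The interpolated values above can be approximated with plain pandas. The following is a minimal sketch, not necessarily how prepare_wide_data is implemented: it assumes evenly spaced periods, uses only the Belgium and China columns from the table, and the helper name interpolate_wide is hypothetical.

```python
import pandas as pd

def interpolate_wide(df, steps_per_period):
    """Hypothetical helper: insert evenly spaced rows between periods."""
    # Move the original rows to positions 0, steps, 2 * steps, ...
    # (the date index is dropped to keep the sketch short)
    spaced = df.reset_index(drop=True)
    spaced.index = spaced.index * steps_per_period
    # Create the empty in-between rows, then fill them linearly
    full_index = list(range(spaced.index[-1] + 1))
    return spaced.reindex(full_index).interpolate()

wide = pd.DataFrame(
    {'Belgium': [3019, 3346, 3600], 'China': [3340, 3343, 3343]},
    index=pd.to_datetime(['2020-04-10', '2020-04-11', '2020-04-12']),
)
values = interpolate_wide(wide, steps_per_period=4)
print(values)
```

With three periods and four steps per period, the result has nine rows: four for each of the first two periods plus the final row.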

The df_ranks DataFrame contains the numerical ranking of each country and is used for the position of the bar along the y-axis (or x-axis when vertical). Notice that two pairs of bars switch places: Belgium with China, and Italy with USA.

	Belgium	China	France	Germany	Iran	Italy	Netherlands	Spain	USA	United Kingdom
date
2020-04-10	3.00	4.00	7.0	2.0	5.0	10.00	1.0	8.0	9.00	6.0
2020-04-10	3.25	3.75	7.0	2.0	5.0	9.75	1.0	8.0	9.25	6.0
2020-04-10	3.50	3.50	7.0	2.0	5.0	9.50	1.0	8.0	9.50	6.0
2020-04-10	3.75	3.25	7.0	2.0	5.0	9.25	1.0	8.0	9.75	6.0
2020-04-11	4.00	3.00	7.0	2.0	5.0	9.00	1.0	8.0	10.00	6.0
2020-04-11	4.00	3.00	7.0	2.0	5.0	9.00	1.0	8.0	10.00	6.0
2020-04-11	4.00	3.00	7.0	2.0	5.0	9.00	1.0	8.0	10.00	6.0
2020-04-11	4.00	3.00	7.0	2.0	5.0	9.00	1.0	8.0	10.00	6.0
2020-04-12	4.00	3.00	7.0	2.0	5.0	9.00	1.0	8.0	10.00	6.0
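
The ranks above are consistent with ranking each original row and then interpolating the ranks linearly, just like the values. Here is a minimal sketch under that assumption, using the first two rows of the covid19_tutorial sample shown earlier. It is an illustration in plain pandas, not necessarily the library's own code.

```python
import pandas as pd

countries = ['Belgium', 'China', 'France', 'Germany', 'Iran', 'Italy',
             'Netherlands', 'Spain', 'USA', 'United Kingdom']
wide = pd.DataFrame(
    [[3019, 3340, 13215, 2767, 4232, 18849, 2520, 16081, 18595, 8974],
     [3346, 3343, 13851, 2894, 4357, 19468, 2653, 16606, 20471, 9892]],
    columns=countries,
)

steps_per_period = 4
# Rank within each row: the largest value gets the highest rank (10)
ranks = wide.rank(axis=1)
# Spread the two ranked rows apart and fill the gap linearly
ranks.index = ranks.index * steps_per_period
ranks = ranks.reindex(list(range(steps_per_period + 1))).interpolate()
print(ranks[['Belgium', 'China', 'Italy', 'USA']])
```

At the midpoint, Belgium and China share rank 3.5, which is the frame where the two bars cross.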

Don't use before animation

There is no need to use this function before making the animation if you already have wide data. Pass your original data directly to the bar_chart_race function.

Long data

'Long' data is a format in which all values of the same kind are stored in a single column. Take a look at the baseball data below, which contains the cumulative number of home runs hit by each of the top 20 home run hitters, by year.

df_baseball = bcr.load_dataset('baseball')
df_baseball

	name	year	hr
0	Hank Aaron	0	0
1	Barry Bonds	0	0
2	Jimmie Foxx	0	0
3	Ken Griffey	0	0
4	Reggie Jackson	0	0
...	...	...	...
424	Jim Thome	18	541
425	Jim Thome	19	564
426	Jim Thome	20	589
427	Jim Thome	21	604
428	Jim Thome	22	612

Name, year, and home runs are each in a single column, in contrast with the wide data, where every column held the same type of data. Long data must be converted to wide data by pivoting the categorical column and placing the period in the index. The prepare_long_data function provides this functionality. It simply uses the pandas pivot_table method to pivot (and potentially aggregate) the data before passing it to prepare_wide_data. The same two DataFrames are returned.

df_values, df_ranks = bcr.prepare_long_data(df_baseball, index='year', columns='name',
                                            values='hr', steps_per_period=5)
df_values.head(16)

The linearly interpolated values for the first three seasons of each player:

name	Albert Pujols	Alex Rodriguez	Babe Ruth	Barry Bonds	...	Reggie Jackson	Sammy Sosa	Willie Mays	Willie McCovey
year
0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0
0.0	7.4	0.0	0.0	3.2	...	0.2	0.8	4.0	2.6
0.0	14.8	0.0	0.0	6.4	...	0.4	1.6	8.0	5.2
0.0	22.2	0.0	0.0	9.6	...	0.6	2.4	12.0	7.8
0.0	29.6	0.0	0.0	12.8	...	0.8	3.2	16.0	10.4
1.0	37.0	0.0	0.0	16.0	...	1.0	4.0	20.0	13.0
1.0	43.8	1.0	0.8	21.0	...	6.8	7.0	20.8	15.6
1.0	50.6	2.0	1.6	26.0	...	12.6	10.0	21.6	18.2
1.0	57.4	3.0	2.4	31.0	...	18.4	13.0	22.4	20.8
1.0	64.2	4.0	3.2	36.0	...	24.2	16.0	23.2	23.4
2.0	71.0	5.0	4.0	41.0	...	30.0	19.0	24.0	26.0
2.0	79.6	12.2	4.6	45.8	...	39.4	21.0	32.2	29.6
2.0	88.2	19.4	5.2	50.6	...	48.8	23.0	40.4	33.2
2.0	96.8	26.6	5.8	55.4	...	58.2	25.0	48.6	36.8
2.0	105.4	33.8	6.4	60.2	...	67.6	27.0	56.8	40.4
3.0	114.0	41.0	7.0	65.0	...	77.0	29.0	65.0	44.0

The rankings change substantially during this time period.

df_ranks.head(16)

name	Albert Pujols	Alex Rodriguez	Babe Ruth	Barry Bonds	...	Reggie Jackson	Sammy Sosa	Willie Mays	Willie McCovey
year
0.0	20.0	19.0	18.0	17.0	...	4.0	3.0	2.0	1.0
0.0	19.8	16.0	15.0	17.0	...	4.2	4.8	5.2	3.4
0.0	19.6	13.0	12.0	17.0	...	4.4	6.6	8.4	5.8
0.0	19.4	10.0	9.0	17.0	...	4.6	8.4	11.6	8.2
0.0	19.2	7.0	6.0	17.0	...	4.8	10.2	14.8	10.6
1.0	19.0	4.0	3.0	17.0	...	5.0	12.0	18.0	13.0
1.0	19.2	4.2	3.2	17.0	...	6.6	11.2	16.6	12.8
1.0	19.4	4.4	3.4	17.0	...	8.2	10.4	15.2	12.6
1.0	19.6	4.6	3.6	17.0	...	9.8	9.6	13.8	12.4
1.0	19.8	4.8	3.8	17.0	...	11.4	8.8	12.4	12.2
2.0	20.0	5.0	4.0	17.0	...	13.0	8.0	11.0	12.0
2.0	20.0	5.6	3.6	16.6	...	13.8	7.8	11.6	11.4
2.0	20.0	6.2	3.2	16.2	...	14.6	7.6	12.2	10.8
2.0	20.0	6.8	2.8	15.8	...	15.4	7.4	12.8	10.2
2.0	20.0	7.4	2.4	15.4	...	16.2	7.2	13.4	9.6
3.0	20.0	8.0	2.0	15.0	...	17.0	7.0	14.0	9.0
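
The pivot step described above can be sketched with pandas alone. The example below is an illustration, not the library's code: it builds a small long DataFrame from three players whose season totals appear in the integer-year rows of the df_values table, then calls pivot_table with the same index, columns, and values arguments. The aggfunc='sum' choice is an assumption made for this sketch.

```python
import pandas as pd

# Season totals taken from the integer-year rows of the df_values table
long_df = pd.DataFrame({
    'name': ['Albert Pujols'] * 4 + ['Barry Bonds'] * 4 + ['Willie Mays'] * 4,
    'year': [0, 1, 2, 3] * 3,
    'hr': [0, 37, 71, 114, 0, 16, 41, 65, 0, 20, 24, 65],
})

# One row per year, one column per player
wide = long_df.pivot_table(index='year', columns='name',
                           values='hr', aggfunc='sum')
print(wide)
```

The result has the wide shape that prepare_wide_data expects: the period in the index and one column per category.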

Usage before animation

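If you wish to use this function before an animation, set steps_per_period to 1. The returned df_values then holds one row per period, with no interpolated rows, and bar_chart_race can perform the interpolation itself.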

df_values, df_ranks = bcr.prepare_long_data(df_baseball, index='year', columns='name',
                                            values='hr', steps_per_period=1,
                                            orientation='h', sort='desc')

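def period_summary(values, ranks):
    # values and ranks hold the current period's data, indexed by player name
    top2 = values.nlargest(2)
    leader = top2.index[0]
    lead = top2.iloc[0] - top2.iloc[1]
    s = f'{leader} by {lead:.0f}'
    # The returned dict describes the summary text and where to draw it
    return {'s': s, 'x': .95, 'y': .07, 'ha': 'right', 'size': 8}

# Pass the summary function through period_summary_func so it is used
bcr.bar_chart_race(df_values, period_length=1000,
                   fixed_max=True, fixed_order=True, n_bars=10,
                   figsize=(5, 3), period_fmt='Season {x:,.0f}',
                   period_summary_func=period_summary,
                   title='Top 10 Home Run Hitters by Season Played')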
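
To see what a summary function like period_summary produces, it can be called directly on a single row of values. The sketch below repeats the function and applies it to a hand-built Series holding only the season-3 values of the eight players visible in the truncated df_values table, so the result reflects that subset rather than all 20 players. The ranks argument is unused by this function, so None is passed.

```python
import pandas as pd

def period_summary(values, ranks):
    top2 = values.nlargest(2)
    leader = top2.index[0]
    lead = top2.iloc[0] - top2.iloc[1]
    s = f'{leader} by {lead:.0f}'
    return {'s': s, 'x': .95, 'y': .07, 'ha': 'right', 'size': 8}

# Season-3 values for the players shown in the truncated table only
season_3 = pd.Series({
    'Albert Pujols': 114.0, 'Alex Rodriguez': 41.0, 'Babe Ruth': 7.0,
    'Barry Bonds': 65.0, 'Reggie Jackson': 77.0, 'Sammy Sosa': 29.0,
    'Willie Mays': 65.0, 'Willie McCovey': 44.0,
})
summary = period_summary(season_3, ranks=None)
print(summary['s'])
```

The 's' entry is the text itself, while the remaining keys control its placement and appearance.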