Handling large data sets can be a daunting task. When facing thousands, tens of thousands, or even hundreds of thousands of data points, you must decide how best to represent the data in a concise, easily interpretable form.

The ZingChart team faced this exact challenge after collecting information on every Olympic athlete who has competed in the Summer and Winter Olympic Games since their inception. Read on to see how we used our new boxplot module to hurdle over this obstacle.

Boxplot Overview

The boxplot is also known as:

  • Box diagram

  • Box-and-whisker plot

  • Box graph

[Image: boxplot]

In the world of data visualization, the boxplot is a relatively new type of graph. It is useful for condensing large sets of data into an elegant, simple chart. The box segment spans from the first quartile (25th percentile or Q1) to the third quartile (75th percentile or Q3), with the median (50th percentile or Q2) marked inside it. The box therefore covers the midspread of the data set, that is, the middle 50% of the values. This is fairly standard across boxplot implementations.
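
To make the three box values concrete, here is a minimal JavaScript sketch. It is not the module's own implementation: it assumes linear interpolation between neighboring values, which is only one of several quartile conventions, and the function names and sample ages are made up for illustration.

```javascript
// Quantile by linear interpolation between the two nearest ranks.
// Assumption: this is one common convention; other tools may differ slightly.
function quantile(sortedValues, p) {
  const position = (sortedValues.length - 1) * p;
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  const weight = position - lower;
  return sortedValues[lower] * (1 - weight) + sortedValues[upper] * weight;
}

// Returns the three values that define the box segment.
function boxSummary(values) {
  const sorted = [...values].sort((a, b) => a - b); // numeric sort, copy first
  return {
    q1: quantile(sorted, 0.25),
    median: quantile(sorted, 0.5),
    q3: quantile(sorted, 0.75),
  };
}

// Hypothetical ages, for illustration only.
const ages = [19, 21, 22, 23, 24, 26, 27, 30, 34];
const summary = boxSummary(ages);
console.log(summary); // { q1: 22, median: 24, q3: 27 }
```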

What the whiskers of a boxplot represent varies more from one implementation to the next. In charts that do not plot outlier values, the whiskers extend to the minimum and maximum values of the data set.

Charts that choose to include outliers may have whiskers that extend to the 1st and 99th percentiles, to the 5th and 95th percentiles, to the 10th and 90th percentiles, or to the lowest and highest values within the lower and upper fences, which are determined by the interquartile range (IQR). The IQR is the 75th percentile minus the 25th percentile, or Q3 - Q1. Each fence sits 1.5 times the IQR beyond the box: the lower fence is Q1 - 1.5 * (Q3 - Q1) and the upper fence is Q3 + 1.5 * (Q3 - Q1).

In other words, the fences are boundaries that extend out 1.5 times the box width beyond both ends of the box. The whiskers are then drawn to the minimum and maximum points that fall within these boundaries, and any points beyond them are treated as outliers. When used in this way, the boxplot may be called a Tukey boxplot, in reference to the creator of the boxplot chart, John Tukey.
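
The Tukey rule can be sketched in a few lines of JavaScript. This is an illustration under stated assumptions rather than the code behind our module: quartiles use linear interpolation, the fences are placed 1.5 times (Q3 - Q1) beyond the quartiles, and the function names and sample values are hypothetical.

```javascript
// Quantile by linear interpolation (an assumed convention for this sketch).
function quantile(sortedValues, p) {
  const position = (sortedValues.length - 1) * p;
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  const weight = position - lower;
  return sortedValues[lower] * (1 - weight) + sortedValues[upper] * weight;
}

// Computes Tukey-style whiskers and outliers for one group of values.
function tukeyBox(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const median = quantile(sorted, 0.5);
  const q3 = quantile(sorted, 0.75);
  const reach = 1.5 * (q3 - q1);          // 1.5 times the box width
  const lowerFence = q1 - reach;
  const upperFence = q3 + reach;
  // Whiskers stop at the most extreme points still inside the fences.
  const inside = sorted.filter((v) => v >= lowerFence && v <= upperFence);
  return {
    box: [inside[0], q1, median, q3, inside[inside.length - 1]],
    outliers: sorted.filter((v) => v < lowerFence || v > upperFence),
  };
}

// Hypothetical ages, for illustration only.
const result = tukeyBox([19, 21, 22, 23, 24, 26, 27, 30, 52]);
console.log(result.box);      // [19, 22, 24, 27, 30]
console.log(result.outliers); // [52]
```

With these sample values the fences fall at 14.5 and 34.5, so the upper whisker stops at 30 and the value 52 is reported as an outlier instead of stretching the whisker.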

[Image: boxplot diagram]

If you want to learn more about reading boxplots, there is a great tutorial for understanding boxplots at Khan Academy.

Our Big Data Set

Using Scrapy, an open source web scraping framework written in Python, we collected our data from Sports Reference's Olympic athlete information pages. Thanks to the uniformity of the data on that site, collecting the data was a breeze. We gathered information on every Olympic athlete who has competed since the inception of the modern Games under the auspices of the International Olympic Committee, going back to the 1896 Summer and 1924 Winter Games.

When all was said and done, we had collected an enormous amount of data. Our initial file amounted to nearly 40 megabytes of pure, unadulterated data. For each athlete, we collected the following:

  • Full name

  • Date of birth

  • Gender

  • Height

  • Weight

  • Country

  • Games they competed in

  • The age at which they competed

  • The sport played in the games

As is usually the case with large data sets, we had to do quite a bit of data massaging to get the pertinent data into the form expected by the ZingChart boxplot module. (Check out our previous post on data massaging if you want to read more on that topic!)
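
As a rough illustration of the kind of massaging involved, the sketch below groups athlete records by games, sport, and gender and collects an age for each entry. It is not our actual pipeline: the record shape and field names are assumptions, the records are made up, and age is approximated as games year minus birth year, ignoring the month and day.

```javascript
// Hypothetical records; field names and values are invented for this sketch.
const athletes = [
  { dateOfBirth: "1986-03-14", gender: "M", sport: "Athletics", games: "2008 Summer" },
  { dateOfBirth: "1982-11-02", gender: "M", sport: "Athletics", games: "2008 Summer" },
  { dateOfBirth: "1990-07-09", gender: "F", sport: "Swimming", games: "2008 Summer" },
];

// Groups approximate ages under a "games|sport|gender" key.
function groupAges(records) {
  const groups = new Map();
  for (const record of records) {
    const gamesYear = parseInt(record.games, 10); // "2008 Summer" -> 2008
    const birthYear = parseInt(record.dateOfBirth.slice(0, 4), 10);
    const age = gamesYear - birthYear; // approximation: ignores month and day
    const key = `${record.games}|${record.sport}|${record.gender}`;
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(age);
  }
  return groups;
}

const agesByGroup = groupAges(athletes);
console.log(agesByGroup.get("2008 Summer|Athletics|M")); // [22, 26]
```

Each array of ages can then be reduced to the five values a boxplot needs for that games, sport, and gender combination.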

Using the date of birth, gender, games competed in, and the athlete’s sport, we’ve created this boxplot chart showing the age distribution of Olympic athletes by sport and gender for each year of the Olympic Games:

Can you tell which sports have a greater disparity in age? We see a lot of value in the data set that we collected, and will likely use other portions of the set in the future.

What data have you used a boxplot to visualize? We’d love to hear about some other real-world uses for boxplot charts.

Boxplots in ZingChart

We’re excited about the introduction of the boxplot module into our library, and look forward to adapting the module to accommodate the needs of our users. To get started with boxplots, check out how a boxplot’s values are defined:

"series": [
    {
        "data-box": [ [<Lower Whisker>, <Q1>, <Median>, <Q3>, <Upper Whisker>] ],
        "data-outlier": [ [<Index>, <Outlier Value>] ]
    }
]
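
Here is the same structure with the placeholders filled in, written as a JavaScript object. Treat it as a sketch: the numbers are made up, the `type` value and the exact attribute spellings (`data-box`, `data-outlier`) should be checked against the current ZingChart documentation, and we assume the outlier index is the zero-based position of the box it belongs to.

```javascript
// Illustrative chart configuration with two boxes; all values are invented.
const chartConfig = {
  type: "boxplot", // assumed chart type name
  series: [
    {
      // One inner array per box:
      // [lower whisker, Q1, median, Q3, upper whisker]
      "data-box": [
        [19, 22, 24, 27, 30],
        [18, 21, 23, 25, 29],
      ],
      // One inner array per outlier: [box index, outlier value]
      "data-outlier": [
        [0, 52],
        [1, 41],
      ],
    },
  ],
};

// Basic sanity check: every box needs five values in ascending order.
const boxesAreValid = chartConfig.series[0]["data-box"].every(
  (box) => box.length === 5 && box.every((v, i) => i === 0 || v >= box[i - 1])
);
console.log(boxesAreValid); // true
```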

Styling of the individual elements of the boxplot is handled within the appropriate object in the chart JSON's options object:

| **Object** | **Description** |
| --- | --- |
| box | Accepts styling attributes to style the box section of the boxplot. |
| outlier | Accepts styling attributes to style the outlier markers. |
| line-median-level | Accepts styling attributes to style the median (Q2) line object. |
| line-min-level | Accepts styling attributes to style the line at the minimum value. |
| line-min-connector | Accepts styling attributes to style the whisker that connects the boxplot and the minimum value line object. |
| line-max-level | Accepts styling attributes to style the line at the maximum value. |
| line-max-connector | Accepts styling attributes to style the whisker that connects the boxplot and the maximum value line object. |
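
A styled chart might then look something like the sketch below. The object names come from the list above; the styling attribute names (`background-color`, `border-color`, `line-color`, `line-width`), the colors, the `type` value, and the placement of `options` at the chart level are assumptions for illustration, so check them against the ZingChart documentation before relying on them.

```javascript
// Illustrative only: attribute names and colors are assumptions.
const styledChart = {
  type: "boxplot", // assumed chart type name
  options: {
    box: {
      "background-color": "#cfe8f3",
      "border-color": "#1f6f8b",
    },
    "line-median-level": { "line-color": "#1f6f8b", "line-width": 2 },
    "line-min-level": { "line-color": "#555555" },
    "line-min-connector": { "line-color": "#999999" },
    "line-max-level": { "line-color": "#555555" },
    "line-max-connector": { "line-color": "#999999" },
  },
  series: [
    {
      "data-box": [[19, 22, 24, 27, 30]], // made-up values
    },
  ],
};

// List which boxplot elements this configuration styles.
const styledElements = Object.keys(styledChart.options);
console.log(styledElements.length); // 6
```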

Summary

There are a number of improvements in the works for the boxplot module, but the module is available now for you to get started with! To get your own copy, head on over to our try page or visit our CDN.

Also, please leave us a comment below to share your boxplot ideas. What sort of data are you using for boxplots? Which additional features would you like to see added?