We’ve talked a lot about data here on the ZingChart blog - data types, dataviz, data culture. But today we’re going to cover the minutiae of data analysis.
Enter Visual Insights: A Practical Guide to Making Sense of Data, an MIT Press publication authored by Katy Börner and David E. Polley. This book is chock-full of:
Helpful self-assessment quizzes
This book is designed to guide readers through the process of understanding their dataset and the best type of analysis to visualize the data.
While the topic can get quite detailed, there are some key takeaways that anyone interested in dataviz can use on a regular basis.
Levels of Data Analysis
The first step in proper data analysis is determining into which level of analysis your dataset falls:
Could you analyze your data on a whiteboard or a piece of paper? If you answered yes, then you’re working at the individual or micro level.
We love the micro level here at ZingChart, especially since our new whiteboard walls allow for plenty of fun visualizations.
On the other end of the spectrum, you’ll find the global or macro level of analysis. Datasets at this level are incredibly large and often require the use of a supercomputer to perform computations.
Finally, there’s the local or meso level of analysis. Meso level datasets are too complex to analyze by hand and require the use of a computer, but not to worry - you won’t need a supercomputer like Titan!
Meso Data Analysis Example
In light of the recent listeria outbreak, we chose a dataset from the CDC that details United States foodborne disease outbreaks from 1998 to 2012. Coming in at 16,575 records, this is the perfect size to illustrate meso-level data analysis!
Types of Data Analysis
Now that we’ve determined the level of our dataset, we can delve into the type of analysis we’d like to perform. As Börner and Polley point out, analysis type is not dependent upon analysis level.
That is to say, you can apply any type of data analysis to any level of project. There are four types of data analysis:
Temporal (When)
Geospatial (Where)
Topical (What)
Network and Tree (With Whom)
Temporal Analysis (When)
Temporal data analysis addresses the question of “when” by helping the user identify time-based information, such as:
Latency to peak times
“When” questions are answered by analysis of time-series data. Time-series data falls into one of two categories:
Discrete data, which can take only a countable set of distinct values
Examples: A coin has only two sides; a switch can only be on or off.
Discrete data can be simply described as “a count of things”.
Continuous data, which can be measured on a scale
Examples: Physical measurements such as volume and temperature
Continuous data can take any value within a range
So, let’s ask our “when” question. Below, we show the number of outbreaks per year by genus name in a discrete time-series line chart.
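If you want to see how the numbers behind a chart like this could be prepared, here is a minimal Python sketch. It assumes each record boils down to a (year, genus) pair; the sample rows are made up for illustration and are not taken from the CDC dataset, and the function name `outbreaks_per_year` is our own.

```python
from collections import Counter

# Made-up sample rows standing in for the real dataset: one (year, genus) pair per outbreak.
records = [
    (1998, "Salmonella"),
    (1998, "Salmonella"),
    (1998, "Listeria"),
    (1999, "Salmonella"),
    (1999, "Norovirus"),
    (1999, "Norovirus"),
]


def outbreaks_per_year(rows):
    """Return one list of yearly counts per genus, plus the sorted list of years."""
    counts = Counter(rows)  # counts each (year, genus) pair
    years = sorted({year for year, _ in rows})
    genera = sorted({genus for _, genus in rows})
    # A Counter returns 0 for missing keys, so years without outbreaks become 0.
    series = {genus: [counts[(year, genus)] for year in years] for genus in genera}
    return series, years


series, years = outbreaks_per_year(records)
for genus, values in series.items():
    print(genus, dict(zip(years, values)))
```

Each genus ends up as one series of whole-number counts, which is what makes this a discrete time series: an outbreak count can be 2 or 3, but never 2.5.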
Geospatial Analysis (Where)
As you might expect, geospatial analysis uses location information to identify position or movement over geographic space. This type of analysis is commonly accomplished with thematic maps by overlaying data on a geospatial substrate. Thematic maps include:
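There is often confusion as to the difference between a choropleth and an isopleth. One way to remember this is: if your data is grouped to a predefined area like a city district, state, or county, then you should be using a choropleth. An isopleth, by contrast, draws contour lines through points of equal value, so its regions follow the data rather than administrative boundaries.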
Below is a choropleth showing total outbreaks by state over the 14-year period our dataset covers.
We’ve also blogged a tutorial about making choropleth maps in the past, if you want to read the details of how to make this type of dataviz.
Geospatial data relies heavily on color, a topic we covered extensively in a previous post.
In the map above, we’ve used a sequential color scheme in which more saturated, intense hues indicate a higher number of outbreaks.
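A sequential scheme like this can be sketched in a few lines of Python. This is only an illustration of the idea, not how the map above was built: the function name `sequential_color`, the two endpoint colors, and the state totals are all assumptions chosen for the example.

```python
def sequential_color(value, vmin, vmax, light=(255, 245, 235), dark=(127, 39, 4)):
    """Map a value to a hex color by linear interpolation from light to dark."""
    if vmax == vmin:
        t = 0.0  # avoid dividing by zero when every value is the same
    else:
        t = (value - vmin) / (vmax - vmin)
    t = max(0.0, min(1.0, t))  # clamp values that fall outside the range
    rgb = [round(lo + t * (hi - lo)) for lo, hi in zip(light, dark)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical totals keyed by placeholder region codes.
totals = {"AA": 10, "BB": 55, "CC": 100}
low, high = min(totals.values()), max(totals.values())
colors = {region: sequential_color(n, low, high) for region, n in totals.items()}
print(colors)
```

The region with the fewest outbreaks gets the lightest color and the region with the most gets the darkest, with everything else falling in between.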
Topical Analysis (What)
According to Börner and Polley, topical analysis is the process of “extracting a set of unique words [...] and their frequencies to determine the topic coverage of a body of text”. In other words, what is the body of text about?
Your “WHAT” question is unlikely to be answered by a random collection of text. Instead, you want to analyze a text corpus: a large, structured set of related texts.
To answer the “WHAT” question, we used a wordcloud to identify the most common vehicles of foodborne disease during the covered time period. Börner and Polley mention the importance of n-grams in topical analysis. An n-gram is “a sub-sequence of n items from a given sequence of text or speech.”
In an n-gram, you may encounter stopwords such as “the” and “and.”
In a ZingChart word cloud, you can handle these stopwords by listing them in the options object, within an array in the “ignore” attribute.
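To make n-grams and stopwords a little more concrete, here is a rough Python sketch of counting n-gram frequencies while skipping stopwords. It is independent of ZingChart; the stopword set, the sample phrases, and the function name `ngram_counts` are assumptions made up for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword set; real lists are much longer.
STOPWORDS = {"the", "and", "of", "in", "with"}


def ngram_counts(texts, n=1, stopwords=STOPWORDS):
    """Count n-grams (runs of n consecutive words) after dropping stopwords."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        tokens = [w for w in words if w not in stopwords]
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Made-up phrases standing in for the text being analyzed.
vehicles = ["Chicken salad", "Salad with chicken and rice", "Ground beef"]

print(ngram_counts(vehicles).most_common(2))       # single words (1-grams)
print(ngram_counts(vehicles, n=2).most_common(2))  # word pairs (2-grams)
```

One design choice to be aware of: because stopwords are removed before the n-grams are formed, words that were separated by a stopword end up next to each other. If that is not what you want, build the n-grams first and filter afterwards.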
Network and Tree Analysis (With Whom)
Network diagrams and treemaps show hierarchical connections. This makes them effective for telling “with whom” stories. They can range from simple mind maps on a napkin to dense visualizations that require zooming and panning. This chart is best [viewed fullscreen in a new window](/assets/zing-content/uploads/2015/04/CDCtreemap.html).
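A treemap needs its data arranged as a hierarchy before anything can be drawn. The Python sketch below shows one way that nesting might be done; it is not how our treemap was built. The two-level parent/child structure, the sample rows, and the function names `build_tree` and `branch_totals` are all assumptions for illustration.

```python
def build_tree(rows):
    """Nest (parent, child, count) rows into {parent: {child: total}}."""
    tree = {}
    for parent, child, count in rows:
        branch = tree.setdefault(parent, {})
        branch[child] = branch.get(child, 0) + count
    return tree


def branch_totals(tree):
    """Sum each parent's children; a treemap sizes the parent box by this total."""
    return {parent: sum(children.values()) for parent, children in tree.items()}


# Made-up rows: (parent category, child category, number of outbreaks).
rows = [
    ("Bacteria", "Salmonella", 3),
    ("Bacteria", "Listeria", 1),
    ("Bacteria", "Salmonella", 2),
    ("Virus", "Norovirus", 4),
]

tree = build_tree(rows)
print(tree)
print(branch_totals(tree))
```

Repeated parent and child pairs are summed, so each leaf holds a single total that can drive the size of its rectangle.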
Dataviz starts and ends with questions.
What questions do we have?
What questions did we answer?
More importantly, what questions did we discover?
By using the appropriate type of analysis for our data, we can not only find answers, but maybe even more important questions. In the analysis we performed for this project, a few new questions emerged:
Is the number of outbreaks in a state directly linked to its population?
What factors contribute to Florida’s much higher outbreak rate?
What has happened since the time period covered by our dataset?
Setting about making charts without understanding your own questions means you will spend a great deal of time creating a final product with questionable results. As dataviz expert Kaiser Fung explains,
“Done right, visualizations are more impactful. However, done wrong, visualizations can make data even more confusing!”
An additional takeaway we gathered from reviewing Visual Insights: A Practical Guide to Making Sense of Data was related to data massaging. It has inspired plenty of ideas for future posts on working with data.
What would you like to see related to this topic on our blog? Let us know which questions we can answer for you in the comments below.