How can descriptive statistics be misleading?

Mean and variance: Potentially misleading KPIs

“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.” - Aaron Levenstein

Given the enormous amounts of data available today, statistics have become indispensable. Statistical calculations are used in practically every imaginable area. On the one hand, they are used to calculate key economic figures such as gross domestic product (GDP) or inflation. On the other hand, statistics play an equally important role in the private sector, where the possible uses differ from industry to industry. In retail and online trading, for example, the average value of a customer's shopping cart can be measured. This key figure can then be used to check whether campaigns or a different placement of items increase the shopping cart value and thus lead to more sales.

In call centers, the average processing time of a telephone call can be calculated. If an employee takes significantly longer than average to handle a call, this can indicate that the employee needs additional training. Department-wide training can even reduce the average processing time overall.

Statistics can also be relevant for describing your own offering. The service level is a widely used indicator of the quality of service offered. In the case of OCR software, the product is often described in terms of extraction rates and sensitivity, which are also determined using statistical calculations.

As the quote from Aaron Levenstein makes clear, statistics can look convincing and provide certain indications, but what matters most is the underlying data and how the figures were calculated. Neither a graphical representation nor key figures such as measures of location or dispersion should be viewed in isolation. As Francis Anscombe showed in 1973, different data sets can have the same mean and the same variance and yet look completely different when plotted. Before going into this further, I will briefly describe the most important terms.

What are mean, standard deviation, and variance?

The mean of a data set is simply the average of its values. For example, the data set 5, 10 and 15 has a mean of 10.

The standard deviation is a measure of the spread of the values around the mean. For a data set with many values, it shows how closely the data cluster around the mean: a small standard deviation means most values lie near the mean, a large one means they are widely scattered. The distribution of the data points can be drawn as a curve whose shape depends on the nature of the data. For a normal distribution, the curve is bell-shaped. For example, for the height of footballers, the mean could be 1.80 m and the standard deviation (σ) 0.10 m. Few footballers are taller than 2.00 m or shorter than 1.60 m, while many are between 1.70 m and 1.90 m. Assuming a normal distribution, about 68% of footballers would be within one standard deviation of the mean, that is, between 1.70 m and 1.90 m tall. About 95% would be within two standard deviations, between 1.60 m and 2.00 m. The remaining 5% or so would be taller than 2.00 m or shorter than 1.60 m.
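The 68% and 95% figures can be checked with a few lines of Python. This is only a sketch under the same assumption as in the text, namely that heights follow a normal distribution with mean 1.80 m and standard deviation 0.10 m; the helper name share_between is made up for this illustration.

```python
from statistics import NormalDist

# Assumed model from the example: heights ~ Normal(mean 1.80 m, sd 0.10 m).
heights = NormalDist(mu=1.80, sigma=0.10)


def share_between(dist, low, high):
    """Fraction of the distribution that lies between low and high."""
    return dist.cdf(high) - dist.cdf(low)


within_one_sigma = share_between(heights, 1.70, 1.90)
within_two_sigma = share_between(heights, 1.60, 2.00)

print(f"within one standard deviation:  {within_one_sigma:.1%}")
print(f"within two standard deviations: {within_two_sigma:.1%}")
```

The two shares come out at roughly 68% and 95%. Real height data is only approximately normal, so observed shares will differ somewhat from these values.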

While the standard deviation shows how the values are distributed around the mean, the variance is simply the square of the standard deviation. It is therefore also a measure of dispersion that describes how the observed values spread around the expected value. Squaring also squares the unit, so in our example the unit would no longer be meters (m) but square meters (m²), which is not very meaningful for a height.
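As a small illustration, the three measures can be computed for the data set 5, 10 and 15 with Python's standard statistics module. The text does not say whether the population or the sample convention is meant, so the sketch below shows both; which one applies depends on whether the data is the whole population or a sample from it.

```python
import statistics

data = [5, 10, 15]

mean = statistics.mean(data)

# Population convention: divide the squared deviations by n.
pop_variance = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# Sample convention: divide the squared deviations by n - 1.
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

print("mean:", mean)
print("population variance / std:", round(pop_variance, 2), round(pop_std, 2))
print("sample variance / std:", sample_variance, sample_std)
```

In both conventions the variance is the square of the corresponding standard deviation. For this data set the sample variance is 25 and the sample standard deviation is 5.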

Summary statistics don't tell the full story

These three parameters make it possible to describe a large, complex data set relatively well with only a few key figures. But there is a risk in relying only on these summary statistics and ignoring the overall distribution. Calculating them is useful, but it should be only one part of the actual data analysis. The following examples show why.

As indicated earlier, Anscombe's quartet demonstrates this problem. It shows how four data sets can look totally different when plotted despite having identical means and variances. The summary statistics of the four data sets are as follows:

  • The mean value of x has the value 9 for all four data sets
  • The mean value of y has the value 7.5 for all four data sets
  • The variance of x has the value 11 for all four data sets
  • The variance of y has the value 4.12 for all four data sets
  • The correlation between x and y is 0.816 in all four data sets
  • The equation for a linear regression is y = 0.5x + 3 for all data sets
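The figures in the list above can be reproduced with a short script. The sketch below hard-codes the quartet's values as they are commonly published and computes the statistics by hand using the sample convention (dividing by n - 1). The names QUARTET and summarize are chosen for this illustration only.

```python
from math import sqrt

# Anscombe's quartet; data sets I to III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample variances, correlation and least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places, the four printed rows are nearly indistinguishable. The values agree only after rounding; at higher precision the data sets differ slightly.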

Looking at these values, you could intuitively conclude that the data sets are very similar, if not identical, and that they would therefore also look very similar when plotted. Once you actually plot them, however, it quickly becomes clear that the similarity is far smaller than expected.

The relationships between the individual data points only become clear in the visualization. The first data set appears to have a linear relationship with some scatter. Data set two clearly has a relationship, but it is not linear. Data set three seems to have an almost perfect linear relationship with only minimal residuals, except for a single outlier that lies far off the line. In the last data set, there appears to be no relationship between x and y, and here, too, an outlier can be observed.
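To see these shapes without a plotting library, a crude text scatter plot is enough. The following is a minimal sketch using only the Python standard library; the function ascii_scatter, the grid size and the axis ranges are assumptions made for this illustration, and a real analysis would normally use a proper plotting tool.

```python
# Anscombe's quartet; data sets I to III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ascii_scatter(x, y, x_range=(0, 20), y_range=(0, 14), width=40, height=12):
    """Return a text scatter plot; '*' marks a cell with at least one point."""
    grid = [[" "] * width for _ in range(height)]
    for a, b in zip(x, y):
        col = int((a - x_range[0]) / (x_range[1] - x_range[0]) * (width - 1))
        row = int((b - y_range[0]) / (y_range[1] - y_range[0]) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so that y grows upward
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


for name, (x, y) in QUARTET.items():
    print(f"Data set {name}")
    print(ascii_scatter(x, y))
    print()
```

Even at this coarse resolution, the curve in data set II, the single outlier in data set III and the vertical stack of points in data set IV are visible, although the summary statistics are the same. Points that fall into the same grid cell are drawn only once, so the plot shows the shape of the data and not exact counts.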

An even more extreme example is the "Datasaurus Dozen", in which all data sets again have the same mean, the same variance and the same correlation coefficient.

Here, too, the intuitive conclusion would be that the data sets must be very similar, if not identical. When plotted, however, they look nothing alike: some take the shape of a dinosaur or a star.

In conclusion, it is important to visualize data sets and not just analyze summary descriptive statistics. Appearances can be deceptive, and relying on key figures alone may lead to bad decisions.