Why do outliers appear in statistics?


Outliers are data values in your sample that are conspicuously higher or lower than the other values and do not seem to fit with them.

For example, if you record the age of architecture students at the time they complete their bachelor's degree, you might get the following values, sorted in ascending order:

19, 20, 20, 20, 21, 21, 22, 22, 22, 22, 23, 23, 23, 24, 24, 25, 25, 72

The last value is obviously out of line: either a senior citizen has completed the regular degree program, or there is a typing error. You have an outlier. Regardless of the reason for this extreme value, it influences many statistical measures.

For instance, the mean, i.e. the average graduation age, is about 22.1 years without the outlier and about 24.9 years when the older graduate is included. The median, that is, the middle graduation age when the values are sorted by size, is 22 in both cases. In contrast to the mean, the median is robust to outliers.
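The comparison of mean and median can be reproduced with a short Python sketch. It uses only the standard library's statistics module; the variable names are illustrative, and treating 72 as the outlier to drop is an assumption taken from the example above.

```python
from statistics import mean, median

# Graduation ages from the example, sorted by size.
ages = [19, 20, 20, 20, 21, 21, 22, 22, 22, 22,
        23, 23, 23, 24, 24, 25, 25, 72]

# Assumption for this sketch: 72 is the value regarded as the outlier.
ages_without_outlier = [age for age in ages if age != 72]

mean_without = mean(ages_without_outlier)
mean_with = mean(ages)
median_without = median(ages_without_outlier)
median_with = median(ages)

print(f"mean without outlier:   {mean_without:.1f}")
print(f"mean with outlier:      {mean_with:.1f}")
print(f"median without outlier: {median_without}")
print(f"median with outlier:    {median_with}")
```

A single extreme value shifts the mean by almost three years, while the median stays where it was.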

Graphical analysis to reveal outliers

The first graphic shows a simple dot plot: it can be clearly seen that one value is completely out of line. The second graphic is a box plot, in which the high graduation age is directly visible as an outlier.
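Box plots commonly mark a value as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. The text does not say which rule its box plot uses, so the following is only a sketch under that assumption; the function name iqr_outliers and the factor k are illustrative, and the quartiles depend slightly on the method chosen by statistics.quantiles.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR.

    Assumption: the common box plot convention with k = 1.5.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

ages = [19, 20, 20, 20, 21, 21, 22, 22, 22, 22,
        23, 23, 23, 24, 24, 25, 25, 72]

outliers = iqr_outliers(ages)
print(outliers)
```

For the graduation ages, only the 72-year-old graduate falls outside the fences.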

Treatment of outliers

If you have identified one or more observations as outliers, you have to consider how these extreme values came about and then decide what should be done with them.

If there is an error in data collection or entry, you can try to correct it. If this is not possible, you should exclude the observation from further analysis. Exclusion is also advisable if the observed unit was mistakenly included in the survey.

Suppose you conducted the survey above to investigate the age at which architecture students enter the job market. Then the 72-year-old graduate does not belong in your analysis and you should exclude him from further investigation.

Some statistical software packages also offer the option of excluding the outermost portion of your data from the calculations as a general rule. However, valuable information can also be lost in the process.
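One common form of such a general exclusion is the trimmed mean, which drops a fixed share of the smallest and largest values before averaging. The sketch below is an illustration only: the function trimmed_mean and the proportion of 10% per end are hypothetical choices, not settings taken from any particular statistics package.

```python
from statistics import mean

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values at each end.

    `proportion` is a hypothetical parameter for this sketch, e.g. 0.1
    removes the lowest 10% and the highest 10% (rounded down to whole values).
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)  # number of values dropped per end
    return mean(ordered[cut:len(ordered) - cut])

ages = [19, 20, 20, 20, 21, 21, 22, 22, 22, 22,
        23, 23, 23, 24, 24, 25, 25, 72]

result = trimmed_mean(ages, 0.1)
print(result)
```

With these ages, trimming 10% per end removes not only the 72 but also the legitimate value 19, which shows how blanket trimming can discard valid information.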