The thirst for knowledge has always been a fundamental driving force for me. Throughout the course of my life I have learned numerous things from numerous fields. However, as my knowledge grew, so did an unshakeable feeling that I in fact know nothing. A large part of my studies has therefore been dedicated to a field that tries to find certainty in such an uncertain world. At the very least, it attempts to measure the amount of uncertainty. That is statistics and more specifically probability theory. Once again, the deeper I dug into this field, the more I questioned it.
For illustrative purposes, let us consider the age-old question “Will it rain tomorrow?” In statistics we refer to this as a binary classification problem. Without any knowledge one can of course simply guess “yes” or “no” and still be correct approximately 50% of the time. Of course this was not satisfactory. The human thirst for knowledge stepped in and some of the early thinkers realized that there is a correlation between today’s weather and tomorrow’s weather. Predicting that tomorrow’s weather will be similar to today’s is a very intuitive concept that improved the accuracy of prediction to about 65%. Once again, the human thirst for knowledge demanded more accuracy. As the field progressed, more complex models were suggested. The demand for more accurate answers to such questions caused enormous growth in the field of probability theory. Questions were answered, but were the answers accurate?
Let us now assume that we have a fully qualified statistician with full knowledge of all possible methodologies. Such an expert can then choose the appropriate algorithm for the analysis and create a model that will predict with up to about 90% accuracy. Theoretically.
It has been said that statisticians live in asymptopia. This, in my opinion, is where the first flaw in statistics arises. Most of these statistical models make assumptions about properties of both the features and the outcome. These assumptions can consist of accepting some functional form of the relationship between input and output, accepting some distribution of the error or randomness involved, or accepting that some factors are negligible. More often, the assumptions include all of the above. Some of them might be relatively accurate if we were to work with an infinite amount of data, but asymptopia is fictional. In reality, these assumptions are at most 90% accurate. One could argue that these models, together with their assumptions, had to be proven statistically significant with rigorous testing. Yes, but statistical significance only requires 90% confidence.
Furthermore, this ideal model is of course entirely useless unless it can be applied to data. A statistician is then confronted with another age-old problem, and a term dreaded by most statisticians: data cleaning. In practice, the data we obtain is unprocessed and cannot be used directly for analysis. We have missing values that have to be disregarded. Outlying values are also often ignored. Removing these values often means simply throwing away up to 5% of the obtained data. Furthermore, when measuring the values of the features, we find that there is some inevitable amount of measurement error. Out of the remaining 95% of the data, we are in fact only 95% sure of the accuracy of the measured values.
Finally, this is of course all based on the assumption that the person performing the analysis is all-knowing in the field of statistics. Highly unlikely. In Stellenbosch, we find that roughly 60% of courses include a one-year module in statistics. This grants degree holders in different professions sufficient knowledge of the field to believe that they are qualified, even though they on average obtain a 70% passing grade for the one-year module. This exacerbates all of the aforementioned problems. Such a person will apply an incorrect model to the data, which will only theoretically be 80% accurate in prediction. He will not have sufficient knowledge to perform the analysis manually and will make use of software, while probably applying the model incorrectly and interpreting the results incorrectly, causing a 10% reduction in the effectiveness of the model.
Summarizing, we find that an underqualified statistician uses a model that we are only 90% sure works, with a theoretical 80% prediction accuracy, based on assumptions that are only 90% accurate, and applies the software incorrectly, resulting in a model that is only 90% effective. He will do so using 95% of the obtained data, while being only 95% sure of the accuracy of the measurements. Taking all of this into consideration and multiplying the factors together, we end up with a prediction accuracy of just over 52%. Slightly more accurate than guessing.
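For the sceptical reader, the arithmetic behind that figure can be sketched in a few lines of Python. The sketch assumes, rather generously, that the six factors are independent and can simply be multiplied together. The labels are my own shorthand for this illustration; the numbers are the ones quoted above.

```python
from math import prod

# Factors taken from the summary above. The labels are shorthand for this
# sketch, and treating the factors as independent is an assumption.
factors = {
    "confidence that the model works": 0.90,
    "theoretical prediction accuracy": 0.80,
    "accuracy of the assumptions": 0.90,
    "effectiveness after software misuse": 0.90,
    "share of data kept after cleaning": 0.95,
    "confidence in the measurements": 0.95,
}

# Multiply all factors to get the overall prediction accuracy.
overall = prod(factors.values())

print(f"Overall prediction accuracy: {overall:.1%}")  # prints 52.6%
```

The product comes out at about 0.526, which is where the figure of just over 52% comes from.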
In conclusion, quoting a personal role model:
“90% of statistics are made up on the spot” – Homer Simpson
We realize that statistics must be deemed irrelevant. We in fact know nothing.