Machine Learning vs. Statistics
What is “Machine Learning”?
Machine learning algorithms find patterns in data that would be impractical or impossible for a human to observe. Once these patterns have been identified, machine learning tools can forecast future observations based on the rules they have found. Simple machine learning models can be based on probability, while the most advanced machine learning algorithms can leverage artificial intelligence to magnify the predictive power of statistical modeling.
How does Amazon know to suggest toys and accessories for pet owners? How did Target learn a teenager was pregnant before her family did? Machine learning can piece together various scraps of information, including purchase history and shopping habits, to build a profile of each customer. With this data, retailers can predict what customers are most likely to purchase. Google Maps leverages machine learning to determine the fastest route from Point A to Point B when traffic flares up in certain areas. This is possible because data is aggregated from other drivers who are already experiencing the congestion. Google can then suggest that drivers on that route take detours to avoid clumps of cars traveling below the speed limit.
What is Statistics?
Statistics (or statistical analysis) is core to every machine learning algorithm. It is the study of methods for collecting, interpreting, and presenting empirical data. Probability plays a key part in statistics, as do variation (expected deviation from the mean) and error (the difference between observed and predicted values). Gaining information from data is the key purpose of statistics, and by applying statistical analysis to problems, we can increase our understanding of populations and the datasets that represent them.
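To make the terms “variation” and “error” concrete, here is a minimal Python sketch using only the standard library. The observed and predicted values are invented for illustration, and the sample standard deviation is used as one common measure of variation.

```python
import statistics

# Hypothetical observed values and the values some model predicted for them.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

# Variation: how far observations typically fall from their mean.
mean = statistics.mean(observed)
spread = statistics.stdev(observed)  # sample standard deviation

# Error: observed value minus predicted value, one per observation.
errors = [o - p for o, p in zip(observed, predicted)]
mean_absolute_error = statistics.mean(abs(e) for e in errors)

print(f"mean={mean}, spread={spread:.3f}, errors={errors}")
```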
Statistical history is key to any prescription or diagnosis in medical facilities. If X number of patients exhibiting similar symptoms had Y disease, then a medical professional can have confidence in a treatment path or suggest more research to determine a more likely diagnosis. Our elections are also hotbeds of statistical discourse. Informal polls allow candidates to adjust their tactics or double down on issues that are doing well with their prospective constituents. Politicians can also use statistics to determine whether long-standing legislation is fair to the governed or to suggest that new laws be implemented to improve daily life.
Misconceptions About Machine Learning
Often, the terms “Machine Learning” and “Artificial Intelligence” are used interchangeably when non-experts discuss computers solving complex problems. This trend has been magnified by companies that seek a marketing advantage by claiming to use these technologies without understanding the differences themselves. While AI can be used in machine learning applications, not every model deployed by a data scientist involves training neural networks inspired by the firing synapses of the human brain. Machine learning solutions can be as simple as training a computer to use probabilities to determine the next value in a sequence of outcomes.
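As a rough sketch of that simplest case, the Python below counts how often each value follows another in a sequence and predicts the most frequent follower. The function names and the weather sequence are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def train_next_value_model(sequence):
    """Count how often each value is followed by each other value."""
    transitions = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        transitions[current][following] += 1
    return transitions

def predict_next(transitions, current):
    """Return the most frequent follower of `current` and its estimated probability."""
    followers = transitions.get(current)
    if not followers:
        return None, 0.0  # value never seen during training
    value, count = followers.most_common(1)[0]
    return value, count / sum(followers.values())

# Made-up sequence of daily outcomes.
weather = ["sun", "sun", "rain", "sun", "sun", "sun", "rain", "rain", "sun"]
model = train_next_value_model(weather)
prediction, probability = predict_next(model, "sun")
print(prediction, probability)
```

No neural network is involved here: the “training” is just counting, and the prediction is the follower with the highest observed frequency.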
Machine learning can be used anywhere!
Machine learning is not always the most appropriate tool for the job, and not every problem can be solved with it. With small datasets, it may be a better (and more cost-effective) idea to have a human who is familiar with the problem look for trends in the data manually. It may be fun to train a model, but one must also worry about model generalization. If your machine learning model only generates insights on the one dataset you are looking at, it may not be telling you what you think it’s telling you.
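One common way to check generalization is to hold some data back and compare the model's error on data it has seen against data it has not. The sketch below uses a deliberately naive “model” that memorizes its training rows. The data and function names are invented for illustration.

```python
# Hypothetical data: y is roughly 2 * x with a little noise.
training_rows = [(0, 0.1), (2, 3.9), (4, 8.2), (6, 11.8), (8, 16.1)]
held_out_rows = [(1, 2.1), (3, 5.8), (5, 10.1), (7, 14.2), (9, 17.9)]

def train_memorizer(rows):
    """Build a 'model' that memorizes the training rows exactly."""
    lookup = dict(rows)
    fallback = sum(lookup.values()) / len(lookup)  # average guess for unseen inputs
    return lambda x: lookup.get(x, fallback)

def mean_absolute_error(model, rows):
    """Average absolute gap between the model's prediction and the true value."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

model = train_memorizer(training_rows)
training_error = mean_absolute_error(model, training_rows)  # 0.0: every row is memorized
held_out_error = mean_absolute_error(model, held_out_rows)  # much larger on unseen rows

print(f"training error: {training_error}, held-out error: {held_out_error:.3f}")
```

A perfect score on the training rows says nothing about the held-out rows, which is exactly the gap a generalization check is meant to expose.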
Computers can actually “learn.”
The label “Machine Learning” is somewhat of a misnomer. Machines and computers do not actually “learn.” An argument can be made that AI systems can develop the ability to learn through repeated failure and adjustment, but often when we talk about machine learning, we really mean machine training. Models developed by data scientists and machine learning engineers most often do not become better at solving problems in the future by building intuition about the nature of the problem. Instead, their parameters are fixed during development, and the finished model simply checks certain conditions and returns a result.
Misconceptions About Statistics
Small sample sizes can generate insights in the same ways that large samples can. What is the threshold between “small” datasets and “large” datasets anyway? A small sample for Amazon could be 10,000 transactions (out of 26.5 million transactions per day), while that same number of samples could make up a much larger proportion of the books checked out per year at the San Diego Library. One example of a statistical test for small datasets is Fisher’s exact test. This method of testing whether two categorical variables are independent is preferred when there are fewer than 1,000 observations.
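As a rough illustration of the test mentioned above, here is a minimal Python sketch of a two-sided Fisher’s exact test for a 2x2 table, using only the standard library. The function name and the example counts are made up for illustration; for real work, a vetted implementation such as scipy.stats.fisher_exact is a safer choice.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell, margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every possible table that is at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )

# Hypothetical counts: 8 observations split across two groups and two outcomes.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"p-value: {p_value:.4f}")
```

Because the test enumerates every table with the same row and column totals, it stays valid with very few observations, where approximations such as the chi-squared test can be unreliable.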
Visual representations of data are enough to determine significant differences.
Flashy presentations include visually striking plots and graphs for a reason: they are easily interpretable at a quick glance. However, simply showing a large amount of white space between two bars in a bar chart is not enough to prove that a difference is significant. Without proper scales of reference, differences in color or shape are not enough to properly demonstrate the gravity of results. The pie chart is often singled out as one of the worst ways to display information simply because an observer has no quantitative way to differentiate slices beyond “This looks bigger,” or “Wow! I can’t even see that!”
P-values are crucial for every analysis!
While p-values can be a good way to lend credibility to claims represented by visualizations, they are not the only important outcome of a statistical analysis. A p-value indicates how incompatible the data is with the assumed model. It is only used to either reject or fail to reject the null hypothesis with the given information. Simply demonstrating a small p-value says nothing about the strength of a relationship.
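The last point can be sketched with a simple two-sample z-test, assuming a known standard deviation shared by both groups. The function name, effect sizes, and sample sizes below are hypothetical and picked only to show the contrast.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_p_value(mean_difference, std_dev, n_per_group):
    """Two-sided p-value for a difference in means, assuming a known, shared std_dev."""
    standard_error = std_dev * sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A tiny difference (0.02 standard deviations) measured on a huge sample...
tiny_effect_big_sample = two_sample_z_p_value(0.02, 1.0, 100_000)
# ...versus a large difference (0.8 standard deviations) measured on a small sample.
large_effect_small_sample = two_sample_z_p_value(0.8, 1.0, 10)

print(f"tiny effect, n=100,000 per group: p = {tiny_effect_big_sample:.2e}")
print(f"large effect, n=10 per group:     p = {large_effect_small_sample:.3f}")
```

Under these assumptions the negligible difference produces the far smaller p-value, purely because of sample size, which is why an effect size should be reported alongside the p-value.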
Machine learning and statistics are intrinsically linked. However, just as every square is a rectangle but not every rectangle is a square, machine learning is always based on statistics, but statistics is not always machine learning. Combining these tools in their base forms can generate in-depth insights from pools of data. Still, a machine learning or statistical model is only as good as the practitioner who deployed it. It is important to keep in mind that a comprehensive understanding of, and familiarity with, the data is key to choosing the most appropriate tools for the problem.