Data analyst, machine learning/artificial intelligence engineer, statistician: do titles like these sound impressive? Be careful not to be fooled! Lured by high salaries, many data scammers hide among these professionals, and they damage the reputation of the honest, law-abiding ones.
Data scammers are very good at hiding in plain sight. You may not even realize they exist, and they may be hiding in your own company. Fortunately, if you know what clues to look for, they are easy to recognize. The first clue is that they fail to understand that analytics and statistics are two distinct disciplines.
Statisticians are trained to infer what lies beyond the data, while analysts are trained to explore what is in the data set. In other words, analysts draw conclusions about what is contained in the data, while statisticians draw conclusions about what is not. Analysts help you ask good questions (hypothesis generation), and statisticians help you get good answers (hypothesis testing).
There are also some magical “hybrids” who wear both hats… but they do not play both roles at the same time. Why? A core principle of data science is that if you are dealing with uncertainty, you cannot use the same data points for both hypothesis generation and hypothesis testing. When data is limited, uncertainty forces you to choose between statistics and analytics.
Without statistics, you have no way of knowing whether the opinion you have just formed holds water. Without analytics, you are groping forward in the dark, with almost no chance of getting a grip on the unknown.
This is a difficult choice! Do you open your eyes and accept inspiration (analytics), giving up the satisfaction of knowing whether your new discovery holds up? Or do you break into a cold sweat and pray that the question you chose to ask (thought up alone in a broom closet, without any data) is worthy of the rigorous answer (statistics) you are about to get?
Peddlers of hindsight
The scammer’s way out of this dilemma is to ignore it: find a potato chip that looks like Elvis, and then pretend to be surprised by it. (The logic of statistical hypothesis testing boils down to one question: does our data surprise us enough to change our minds? If we have already seen the data, how can we be surprised by it?)
In your opinion, do the clouds and potato chips in the picture look like a rabbit or Elvis? Or do they look like a certain president?
The scammers find a pattern, get inspired by it, and then test the same data for the same pattern, so that they can attach a legitimate-looking p-value or two to their theory. In doing so, they are deceiving you (and perhaps themselves as well). Such a p-value means nothing unless you commit to the hypothesis before looking at the data.
The scammers imitate the actions of analysts and statisticians without understanding the reasons behind them. This gives the entire data science field a bad reputation.
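To see why a p-value computed after the fact carries no information, here is a minimal Python simulation. It is an illustrative sketch rather than anything from the original argument: the function names (two_sided_p, one_study, significance_rates), the 300 noise patterns, the 20 data points per pattern, and the simple z-test with a known variance of 1 are all assumptions chosen for the example. Every pattern is pure noise, so any "discovery" is spurious.

```python
import math
import random


def two_sided_p(values):
    """Two-sided z-test p-value for 'the true mean is 0' (assumes variance 1)."""
    z = sum(values) / math.sqrt(len(values))
    return math.erfc(abs(z) / math.sqrt(2))


def one_study(rng, n_patterns=300, n_points=20):
    """Search pure noise for the most striking pattern, then test it twice."""
    # Every candidate pattern is pure noise: there is nothing real to find.
    seen = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_patterns)]
    # Hypothesis generation: pick the pattern that looks most surprising.
    best = min(range(n_patterns), key=lambda i: two_sided_p(seen[i]))
    # Hindsight: test the chosen pattern on the very data that suggested it.
    hindsight_p = two_sided_p(seen[best])
    # Foresight: commit to the pattern first, then test it on fresh data.
    fresh = [rng.gauss(0, 1) for _ in range(n_points)]
    foresight_p = two_sided_p(fresh)
    return hindsight_p, foresight_p


def significance_rates(n_studies=100, alpha=0.05, seed=0):
    """Fraction of studies declared 'significant' under each approach."""
    rng = random.Random(seed)
    results = [one_study(rng) for _ in range(n_studies)]
    hindsight_rate = sum(h < alpha for h, _ in results) / n_studies
    foresight_rate = sum(f < alpha for _, f in results) / n_studies
    return hindsight_rate, foresight_rate


if __name__ == "__main__":
    hindsight_rate, foresight_rate = significance_rates()
    print(f"'significant' when tested on the same data: {hindsight_rate:.0%}")
    print(f"'significant' when tested on fresh data:    {foresight_rate:.0%}")
```

Under these assumptions, the hindsight test should come out "significant" in nearly every study even though nothing real exists, while the foresight test should do so only about 5% of the time, which is exactly what the significance level promises.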
True statisticians always proceed with caution
Because statisticians have an almost mystical reputation for rigorous reasoning, the amount of snake oil in data science has hit a record high. This kind of scam is hard to detect, especially when the unsuspecting victim thinks it is all about equations and data. A data set is a data set, right? Wrong. It depends on how you use it.
These scammers all bear the marks of a counterfeit, and you need only one clue to see their true colors: scammers offer nothing but hindsight, using mathematics to rediscover what they already know is in the data, whereas the tests a statistician provides offer foresight.
Unlike a scammer, a good analyst is a model of open-mindedness, always pairing inspiring insights with reminders that an observed phenomenon may have many different explanations, while a good statistician considers carefully before making a call.
Analysts bring inspiration
Analysts are not on the hook for everything, as long as their conclusions stay within what the data contains. If they want to make claims about something they have not seen, that is a different job. They should take off the analyst’s hat and put on the statistician’s helmet. After all, whatever your official job title is, there is no rule that says you cannot practice both trades. You can if you want to, but do not confuse them.
How do scammers test hypotheses
Being good at statistics does not mean being good at analytics, and vice versa. If someone tells you otherwise, be skeptical. If this person tells you that you can make statistical inferences from data you have already explored, be doubly skeptical. He is probably a scammer.
Hiding behind fancy explanations
If you observe data scammers in the wild, you will find that they like to make up fancy stories to “explain” the observed data. The more academic the story sounds, the better, and never mind that it was merely (over)fitted to the data after the fact.
What the scammer says is pure nonsense. No amount of equations or rhetoric can make up for the fact that they have no evidence that they know what they are talking about beyond the scope of the data. Do not be fooled by their fancy explanations. For it to count as statistical inference, they would have had to commit to their call before seeing the data.
This is the equivalent of showing off their “psychic” powers by first peeking at your cards and then predicting what you hold… whatever it is you hold. Brace yourself for their lecture on how your facial expression gave your hand away. This is hindsight bias, and it can be seen everywhere in the data science field.
The analyst says, “The card you just played is the queen of diamonds.” The statistician says, “Before the game started, I wrote my hypothesis on this piece of paper. Let us play, observe some data, and see whether my hypothesis is right.” The scammer says, “I knew you were going to play the queen of diamonds all along, because…”
Machine learning says, “I will always make my call in advance and then see how well I did. Then I repeat, over and over. I may adjust my responses to arrive at a strategy that works. But I will use an algorithm to carry out this process, because tracking it all by hand is really annoying.”
Stop scammers from entering your life
When you do not have much data, you have to choose between statistics and analytics. Fortunately, if you have plenty of data, you have a wonderful opportunity to use both analytics and statistics without cheating. You also have the perfect protection against scammers. It is called “data splitting”, and I believe it is the most powerful idea in data science.
To protect yourself from scammers, all you have to do is keep some test data out of reach of their prying eyes, and then treat everything else as analytics (do not take it seriously). When you are presented with a theory that you are in danger of accepting, use it to make the call, and then open your secret test data to see whether the theory is nonsense.
This is a huge cultural shift from what people were used to in the era of “small data”, when you had to explain how you know what you know in order to convince people, however flimsily, that you might indeed know something.
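As a rough illustration of data splitting, the following Python sketch shuffles a data set once and sets part of it aside. The function name split_data, the 20% test fraction, and the fixed seed are assumptions made for this example, not a prescription; in practice the locked test set would also be stored where explorers cannot reach it.

```python
import random


def split_data(rows, test_fraction=0.2, seed=42):
    """Shuffle once, then set aside a test set that nobody explores."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    locked_test = shuffled[:n_test]   # opened only once, to test a committed theory
    exploration = shuffled[n_test:]   # free for analytics and inspiration
    return exploration, locked_test


if __name__ == "__main__":
    rows = list(range(100))  # stand-in for real records
    exploration, locked_test = split_data(rows)
    print(len(exploration), "rows for exploration")
    print(len(locked_test), "rows locked away for testing")
```

The fixed seed makes the split reproducible, so the same rows stay locked away no matter how many times the exploration set is reloaded.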
The same principle applies to machine learning/artificial intelligence
Some scammers posing as machine learning/artificial intelligence experts are easy to spot. You can catch them the same way you would catch any other bad engineer: the “solutions” they try to build repeatedly fail to deliver. (An early warning sign is a lack of experience with industry-standard programming languages and libraries.)
But what about the people who have built systems that seem to work? How do you know if something is suspicious? The same principle applies here! The scammer is the sinister one who shows you how good the model is using the same data that was used to build it. If you build an extremely complex machine learning system, how do you know whether it works? You cannot know unless you can show that it handles new data it has never seen before.
When there is enough data to split, you do not need to point to the elegance of your formulas to justify a project (an old-fashioned habit that can still be seen everywhere, not just in science).
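The gap between performance on familiar data and performance on new data can be made concrete with a deliberately useless model. The sketch below is hypothetical: MemorizingModel, accuracy, and demo are names invented for this example, and the labels are coin flips, so there is nothing real to learn.

```python
import random


class MemorizingModel:
    """A deliberately useless 'model' that only memorizes its training data."""

    def fit(self, inputs, labels):
        self.memory = dict(zip(inputs, labels))
        return self

    def predict(self, x):
        # Inputs it has never seen get a fixed guess of 0.
        return self.memory.get(x, 0)


def accuracy(model, inputs, labels):
    """Fraction of inputs for which the prediction matches the label."""
    hits = sum(model.predict(x) == y for x, y in zip(inputs, labels))
    return hits / len(labels)


def demo(n=1000, seed=0):
    """Return (accuracy on training data, accuracy on new data)."""
    rng = random.Random(seed)
    train_x = list(range(n))         # unique IDs used for building the model
    new_x = list(range(n, 2 * n))    # IDs the model has never seen
    # Labels are coin flips, so no model can genuinely beat chance.
    train_y = [rng.randint(0, 1) for _ in train_x]
    new_y = [rng.randint(0, 1) for _ in new_x]
    model = MemorizingModel().fit(train_x, train_y)
    return accuracy(model, train_x, train_y), accuracy(model, new_x, new_y)


if __name__ == "__main__":
    train_acc, new_acc = demo()
    print(f"accuracy on the data used to build the model: {train_acc:.0%}")
    print(f"accuracy on data never seen before:           {new_acc:.0%}")
```

On the data it was built from, this model is perfect by construction. On new data it should hover around chance, which is the only number that says anything about whether the system works.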
Do statistical work or maintain a humble attitude
To paraphrase a witty quote from the economist Paul Samuelson: The liar successfully predicted nine of the last five recessions.
The author has no patience with data scammers. So you think you “understand” something about potato chips that look like Elvis? No one cares how well your opinion fits the original chips. However elaborate the explanation, the author is not impressed. Show that the theory/model works (and keeps working) on a whole batch of new chips it has never seen before. That is the real test of an opinion.
Advice for data science professionals
Data science professionals, if you want to be taken seriously by those who understand the humor here, please stop hiding behind fancy equations to prop up your personal biases. Show us what you have got. If you want those who “get it” to see your theory/model as more than inspirational poetry, then have the courage to put on the grand show of how well it works on a brand-new data set!
Advice for leaders
Leaders should refuse to take any data “insights” seriously until those insights have been tested on new data. Do not want to put in the effort? Stick to analytics, but do not rely on these insights: they are flimsy, and their trustworthiness has not been checked.
In addition, when a company has a large amount of data, there is no downside to making data splitting a core part of its data science culture, and even to enforcing it at the infrastructure level by controlling access to the test data earmarked for statistics. This is a good way to nip snake oil in the bud!
When the data set is too small to split, only a data scammer will try to follow inspiration rigorously, using mathematics to rediscover phenomena already known to be in the data and declaring the “surprising” findings statistically significant. That is peddling hindsight, and it is what sets scammers apart from open-minded analysts and careful statisticians.
When there is enough data, make a habit of splitting it, and do your analytics and your statistics on different subsets of the original pile of data. That way you get the best of both worlds without being deceived!