No, AI is not Better Than Humans at Image Recognition
It's just hype and flimsy research
Nature News recently reported that AI now beats humans at image classification and a couple of other basic tasks. This should blow your mind. Just think of self-driving cars that can perfectly identify pedestrians, obstacles, road signs, and traffic markings. Think of AI radiologists that can nearly flawlessly identify tumors and strokes. And think of the quality of life improvements for the visually impaired with access to sophisticated image-to-text systems. This would be one of the most exciting technological achievements of the 21st century if it weren’t complete BS.
Let’s put modern computer vision to the slightest of tests. Having dabbled in computer vision research, I figured I could fool the Google Cloud Vision API in about five attempts. Google has been a leader in AI for decades, so its API should be a fairly good representation of the current state of the art. I stopped after three attempts: Google misclassified every single image, all of which my three-year-old identified without trouble. Not an impressive performance so far. So why is Nature, the most prestigious scientific publication in the world, saying that AI beats humans at image classification?

The Experiment That Started It All
The Nature News article is actually referencing the AI Index Report 2024 by the Stanford HAI Institute, a report whose ostensible aim is to “provide unbiased, rigorously vetted, broadly sourced data in order for policymakers, researchers, executives, journalists, and the general public to develop a more thorough and nuanced understanding of the complex field of AI”. It says that “AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding” and that “AI has surpassed human baselines on a handful of benchmarks, such as image classification in 2015”. So what was this computer vision singularity event back in 2015 where AI supposedly outpaced humans at image classification?
To answer that, I first have to tell you about ImageNet. In 2012, the first annual ImageNet Large Scale Visual Recognition Challenge was held, a competition where researchers submitted their best models to compete for the highest accuracy score on the ImageNet dataset. The dataset has 1.2 million images scraped from the photography-focused social network Flickr. Researchers collected images by searching Flickr with terms like “can opener” or “swing” and then having workers hired through Amazon Mechanical Turk verify that each image actually contained the searched-for object. Each image has exactly one out of 1,000 possible labels. Teams train their algorithms on all 1.2 million images and test them on a separate set of 100,000 images, submitting five guesses per image; if any one of the guesses matches the label, the image counts as correct. In essence, each image in the test set is a multiple-choice question with 1,000 possible answers, but you get to check five boxes.
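The "five guesses" scoring rule above is what the field calls top-5 error, and it is worth seeing how forgiving it is. Here is a minimal Python sketch of the metric; the labels and guesses are made up for illustration, not taken from the actual dataset:

```python
# Sketch of ImageNet-style top-5 scoring: an image counts as correct
# if the true label appears anywhere among the model's five guesses.
# The labels and guesses below are hypothetical, for illustration only.

def top5_error(true_labels, guesses):
    """Fraction of images where none of the five guesses matches the label."""
    misses = sum(
        1 for label, top5 in zip(true_labels, guesses)
        if label not in top5
    )
    return misses / len(true_labels)

true_labels = ["king crab", "salt shaker", "can opener"]
guesses = [
    ["strawberry", "lobster", "crab apple", "coral", "sea anemone"],  # miss
    ["hairspray", "salt shaker", "thimble", "bottle", "cup"],         # hit
    ["can opener", "corkscrew", "screwdriver", "hammer", "wrench"],   # hit
]

print(top5_error(true_labels, guesses))  # 1 miss out of 3
```

Note that a model gets full credit for the second image even though its first guess was "hairspray", which is exactly the kind of confident, wildly wrong answer discussed below.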
In 2014, Andrej Karpathy, then a Stanford graduate student, decided to compare himself against the latest winning algorithm. He painstakingly labeled 1500 images according to the rules of the competition and achieved an error rate of 5.1%, beating that year’s winner, GoogLeNet, by 1.7 percentage points. This brings us to the “singularity event”: the following year, Microsoft developed a model that narrowly beat Karpathy’s result with a 4.94% error rate. This result was widely reported in both the mainstream and tech media as the advent of computers surpassing humans in image recognition. And so it would seem. A human was pitted against an AI algorithm and the algorithm won. Right?
Human Errors Are Not Really Errors At All
When you take a closer look at the results, it turns out that humans and computers make different kinds of mistakes on the challenge. When the computer fails, it fails miserably. It thinks that a figurine is a coffee pot, that a king crab is a strawberry, and that a salt shaker is hairspray. Even seven years later, Google researchers noted that 40% of mistakes made by state-of-the-art algorithms were “major mistakes” that “most humans would likely not make”.
Meanwhile, humans make different types of mistakes. They can’t keep all 1000 classes in their head, so they forget or don’t know that a given class is a possible answer. Some images contain multiple objects, none of which is more salient than the others, and only one of which is the correct answer (to be fair, the computer makes those mistakes as well). And the most common mistakes humans make are caused by what are called fine-grained classes.
Fine-grained classes are classes like “Indigo Bunting” or “Bedlington Terrier” as opposed to more coarse-grained classes like “bird” or “dog”. The label set contains a mix of coarse-grained and fine-grained classes, including over 120 distinct breeds of dogs. I’ll admit that AI might be better than the typical human at these fine-grained labeling tasks simply because the typical human can’t name 120 different dog breeds. Humans can still easily recognize images of dogs and even describe them in great detail, whether or not they are familiar with the breed name. Meanwhile, the computer will get the breed right some of the time, and some of the time think it’s an image of a flatweave rug. A subsequent study found that once you add multiple labels and divide the test set into “organics” and “objects”, effectively separating the fine-grained classes from the coarse-grained classes, the median human accuracy is 99% while the best AI model achieved 95%.
All of this is to say: when you take a closer look at the errors, it’s not clear that AI beat humans in image recognition based on the ImageNet data. Humans can easily recognize what the images depict, even if they sometimes struggle with fine-grained ontological classification. The AI, meanwhile, does alright for itself but is way, way off the mark when it’s wrong.

The Results Don’t Replicate
Even if AI had truly outperformed humans on the ImageNet challenge, the more egregious problem is that the result doesn’t replicate. In 2019, a group of Berkeley researchers decided to replicate the AI portion of the experiment. They collected a new dataset of the same size using the same methodology as ImageNet: scraping Flickr using keywords, verifying with Amazon Mechanical Turk workers, using the ImageNet classes, and even limiting their collection to the same time period as when the original ImageNet data was collected. Then they took previously published AI models, trained and tested them on the new data, and compared the results to how the models performed on the original data. Every one of the models had at least an 11 percentage point drop in accuracy. So even when you recreate the experiment using data that is as close to the original as possible, the AI results no longer hold. These models are not so much good at recognizing images as they are good at recognizing the specific images in the ImageNet dataset alone.
AI Needs a Higher Scientific Standard
To be fair, the field of computer vision has continued to improve on Microsoft’s 2015 result and has continued to refine the experiment by adding multiple labels to images, using expert labelers to comb out wrong labels, and testing more humans (Shankar et al. had a whopping 5 human participants). Not only do trained humans continue to outperform AI models on the ImageNet challenge; the field has also failed to address the biggest shortcoming of using the ImageNet challenge to evaluate human vs AI performance: you can’t draw a sweeping conclusion about the relative abilities of AI and humans based on a single experimental setup.
This limited experiment contains images that are nothing like the broader spectrum of images humans encounter in their daily lives. We don’t experience the world in a series of well-lit and well-composed photographs like the ones you find on Flickr. Real-world images show objects from lots of weird angles, often poorly lit, and with all kinds of occlusions. And we certainly don’t divide the world into 1000 seemingly arbitrary classes. This experimental setup is so limited that the only conclusion you could draw from its results is “Huh, interesting. We should investigate further.”
Other fields of study, such as the medical sciences, have developed rigorous testing methodologies such as randomized double-blind placebo-controlled trials, meta-analyses combining multiple studies to achieve greater statistical significance, and systematic reviews to assess the totality of evidence to answer hard questions. “We tried this drug on this one guy and it seems to work” would not exactly fly in pharmacology.
Meanwhile, the field of AI is still using Karpathy’s ImageNet results as a human baseline a decade later. Not that there was anything wrong with Karpathy’s experiment; it’s an interesting result. The problem is that the field is still using it as a baseline, which is a symptom of a systemic lack of empirical rigor in AI. The field too often focuses on squeezing a higher accuracy out of a benchmark rather than actually improving our collective understanding of computer vision algorithms or the specific contours of AI image recognition capability.
The Bottom Line
So no, AI does not beat humans at image recognition. The scientific evidence crumbles under scrutiny. There’s no new tech that leverages superhuman image recognition. Google doesn’t recognize an upside-down bus, self-driving cars are terrible, and your radiologist remains very much human. And still, this claim continues to be repeated, uncritically, to this day in the media.
To be fair, the claim shows up in the media because of reports like the Stanford AI Index. And you could argue the report merely says AI beats “a human baseline” on a “benchmark”, which is technically true. But without any additional context, the entirely predictable result was that even the scientific media saw only “AI beats humans”.
Researchers in this incredibly well-funded field could easily devise and carry out a series of rigorous experiments to shed clear light on the question of AI vs human image recognition. The reason they haven’t is probably because they know the answer. AI has made a ton of progress but is far, far from human performance. So why can’t they just say that in their reports?

The AI hype cycle is real. Looking forward to getting actual life-changing products into the market.