Are AI Models Smart Enough Yet to Measure Ad Performance?
SMU Cox Marketing Professor Gijs Overgoor and coauthors study how capable large language models (LLMs) are at identifying what drives performance in video ad creative. While LLMs show promise for scaling advertising research, fine-tuning with domain-specific training data is essential.
Spending on TV and video ads is big business, and marketers eager to find lucrative use cases want to know how to use AI effectively. In new research, Overgoor and coauthors test how well LLMs, the visible workhorses of AI, can identify what drives performance in video ad creative. They find that while these models show promise for scaling advertising research, fine-tuning with domain-specific training data is essential: off-the-shelf models carry biases that can lead to incorrect conclusions and potentially wasted advertising spending.
The study was motivated by two challenges in advertising research. First, traditional methods that rely on research assistants to code TV ads don't scale well; they are limited to hundreds rather than thousands of ads, and they are costly and noisy. Second, AI's capability for large-scale content analysis remains untested for video advertising. Overgoor notes, "We have a measurement problem and reams of TV advertising data that we would like to study. On the other hand, we have AI that we are not really sure how good it would be at this very task."
Testing the LLMs
Researchers Overgoor, Knight, and Bart ran a series of tests to discover the limitations and potential of LLMs for this task. They evaluated Alibaba's VideoLLaMA 2, considered a top performer, as their open-source baseline model. They found that it performed reasonably well at the surface level but suffered from limitations that make it unsuitable for advertising applications. Notably, the model exhibited significant "yes bias," a known tendency of LLMs to affirm most yes/no questions regardless of accuracy. This bias led to widespread misclassification of information, so off-the-shelf models proved unreliable for downstream advertising decisions that require precise content understanding and accurate classification.
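The yes-bias problem described above can be probed with a simple audit: pose a balanced set of yes/no content questions about an ad, half of which are known to be false, and measure how far the model's "yes" rate drifts from the ground truth. Below is a minimal sketch of such an audit, not the authors' procedure; `ask_model` is a hypothetical wrapper around whatever video LLM is being evaluated.

```python
from typing import Callable, Dict, List

def yes_bias_audit(
    ask_model: Callable[[str, str], str],   # hypothetical wrapper: (ad_id, question) -> "yes"/"no"
    labeled_questions: List[Dict],          # each item: {"ad": ..., "question": ..., "truth": "yes"/"no"}
) -> Dict[str, float]:
    """Compare the model's 'yes' rate against the ground-truth 'yes' rate.

    A large positive gap suggests the affirmation ("yes") bias discussed
    in the article; it is a bias signal, not an accuracy measure by itself.
    """
    model_yes = truth_yes = correct = 0
    for item in labeled_questions:
        answer = ask_model(item["ad"], item["question"]).strip().lower()
        model_yes += answer.startswith("yes")
        truth_yes += item["truth"] == "yes"
        correct += answer.startswith("yes") == (item["truth"] == "yes")
    n = len(labeled_questions)
    return {
        "model_yes_rate": model_yes / n,
        "truth_yes_rate": truth_yes / n,
        "accuracy": correct / n,
        "yes_gap": (model_yes - truth_yes) / n,
    }
```

If the model answers "yes" far more often than the ground truth warrants, its content codes will overstate how many ads contain a given feature, which is exactly the kind of misclassification that can mislead downstream spending decisions.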
To test how much further LLMs could go, the team developed a fine-tuned model called AdNotator, with impressive results. "We found that this AI model combined with the human information used for training, actually outperforms the humans that we trained it on." They trained it on human-coded responses; when benchmarked against ads coded by at least three humans, the model outperformed individual human coders. "[By] using this broad base of knowledge that AIs are generally working with and inserting training-specific or ad-specific information, it actually takes out a lot of the noise and a lot of the yeses from the AI,” Overgoor explains. Their model, which sets a benchmark for future work, reduced the noise from the human coders as well.
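One way to operationalize that comparison is to treat the majority vote of several human coders as the reference label, then score the fine-tuned model and each individual coder against it. The sketch below is illustrative only; the column names and toy data are assumptions, not the paper's.

```python
import pandas as pd

# Illustrative data: each row is one (ad, attribute) judgment by three human
# coders plus the fine-tuned model. Real data would come from the coding task.
df = pd.DataFrame({
    "coder_1": [1, 0, 1, 1, 0, 1],
    "coder_2": [1, 0, 0, 1, 0, 1],
    "coder_3": [1, 1, 0, 1, 0, 0],
    "model":   [1, 0, 0, 1, 0, 1],
})

# The majority vote of the three human coders serves as the reference label.
reference = (df[["coder_1", "coder_2", "coder_3"]].sum(axis=1) >= 2).astype(int)

# Agreement with the reference, for the model and for each individual coder.
scores = {col: (df[col] == reference).mean() for col in df.columns}
print(scores)  # a model can beat individual coders while matching their consensus
```

This framing explains how a model trained on human codes can "outperform the humans that we trained it on": individual coders are noisy, while the consensus is more stable, and the model can track that consensus more consistently than any single coder does.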
To compare against other LLMs, the team also tested Google's Gemini Flash. It performed better than the baseline model but was still not as good as human coders. "We found surprisingly that Gemini still says yes to a lot of things,” says Overgoor. “This is a tendency that a lot of LLMs have—this sycophancy that they're almost trained to please the human."
What people like
In another step, the researchers looked at 2,000 ads, using data from iSpot.tv, to learn what drives ad quality and purchase intention. Action-oriented elements, such as encouraging an immediate purchase or a store visit, were associated with lower ratings. On the flip side, informational content such as product, brand, and price details was viewed positively, as were sensory appeal factors such as visual aesthetics, attractiveness, and sex appeal. Interestingly, sheer emotion or positivity on its own didn't strongly predict success.
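A back-of-the-envelope version of this kind of driver analysis regresses an outcome such as purchase intention on the coded content attributes. The sketch below assumes a hypothetical per-ad file and illustrative column names; it is not the study's specification.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-ad dataset: one row per ad with coded content attributes
# (0/1 or scaled) and an outcome such as a purchase-intention score.
ads = pd.read_csv("coded_ads.csv")  # illustrative file name

features = ["action_call", "informational", "sensory_appeal", "emotional"]
X = sm.add_constant(ads[features])
y = ads["purchase_intent"]

# OLS coefficients indicate the direction each attribute moves the outcome,
# e.g. negative for action-oriented elements, positive for informational content.
model = sm.OLS(y, X).fit()
print(model.summary())
```

The reliability of any such regression hinges on the quality of the content codes themselves, which is why the paper's measurement question, whether the LLM coded the ads correctly, comes first.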
According to Overgoor, your best-performing ad, or the ad that people most like to watch, may not translate into revenue generation downstream. "We always think Super Bowl ads—how cool and funny they are—but do those ads ultimately drive the intended result?” he ponders. TV ads can have negative ROIs for many brands, he says, so better targeting is important for effectiveness. What gets people to the store? “Some features might help people stay engaged and not turn off the TV,” he adds, “however, action elements that people don't necessarily like do get people to the store.”
Conclusion
A few takeaways about the use of AI, along with insights from the research, follow:
- Off-the-shelf AI models can be utilized but require careful vetting.
- Different metrics matter: ad likability does not equal revenue generation.
- Follow-up research shows ads people dislike may still drive store visits.
- Researchers and advertisers should validate AI outputs before making decisions.
In closing, Overgoor says, "Even if you use the most recent version of OpenAI, Claude or Gemini, it's still worthwhile to run a few extra tests to make sure that you're not misaligned and introducing bias in your advertising measurement."
The paper “(Mis)Measuring the Drivers of Ad Performance” by Gijs Overgoor, Southern Methodist University’s Cox School of Business, Samsun Knight, University of Toronto, and Yakov Bart, Northeastern University, is a working paper under review.
Written by Jennifer Warren,