Rewarding Outcome without Intent: Is Your AI Chess Opponent Considering Murder to Avoid Losing?

Written by Geoff Kreller, CRCM, CERP

There’s an inherent difference between specifying desired outcomes to a human and specifying them to an agentic AI model. Humans naturally (to varying degrees) apply ethics, compliance, risk, broader long-term goals, and subjective understanding to recognize the intent behind a goal and the appropriate path to it. If a coach tells their players, “Win at all costs” during a playoff game, that doesn’t mean exploring drastic options (such as kidnapping the opponent’s children or destroying all the soccer balls after taking the lead) to achieve that objective.

Recognizing that “the end does not always justify the means” and playing within the rules reduces the amount of innovation we have at our disposal to meet our goals. It doesn’t occur to most of us that we could win a game of online chess simply by sending malware to our competitor, disrupting their internet connection or damaging their software. These concepts of morality, ethics, and fair play give humans a directional “north star.” Unencumbered by human morality, ethics, and a clear understanding of underlying intent, an agentic AI model can demonstrate a sophisticated level of innovation, producing brilliant, unforeseen strategies that may incorporate cheating, manipulation, lying, flattery, and deception to reach its target objective.

This article discusses what agentic AI represents, why reward functions and specifications are necessary for agentic AI agents to operate independently, and how to approach the challenge of building a reward function that an agentic AI can maximize while staying true to the actual intent of its programming.

1. What is Agentic AI?

Agentic AI is an artificial intelligence system that can accomplish a specific goal with limited supervision[1]. Unlike traditional AI models, agentic AI agents exhibit a level of autonomy derived from goal-driven behavior and their ability to adapt to changing circumstances (such as being able to play a game of Go or chess). “Agentic” simply means that the model has agency – the capacity to act independently and purposefully. Agentic AI is neither inherently “good” nor “evil”; it is amoral. As a result, agentic AI agents can be designed for (or result in) beneficial or harmful outcomes.

Agentic AI builds on generative AI techniques, using large language models (LLMs) to function in dynamic environments. While generative AI is capable of creating and reciting content based on learned patterns, agentic AI applies generative outputs toward specific goals (“Win a game of chess”).

In applying outputs dynamically, agentic AI becomes proactive, adaptable, and intuitive. Agentic AI agents can proactively search websites, call application programming interfaces (APIs), and query databases, and use this information to make decisions and take actions (such as devising the best possible course of medical action in an emergency room). Agentic AI can adapt through experience and feedback and adjust its behavior accordingly. Because these systems are powered by LLMs, users can engage with them through language or voice commands. If its human chess opponent calls out or types “d4” as their opening move, the agentic AI will intuitively understand the move and recognize that it is now its turn to play. If a doctor notes the symptoms while the agentic AI is monitoring the patient’s vitals, the AI will combine this information with its experience and external resources to propose the root cause and an appropriate course of action.
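
As a rough illustration, the sketch below shows how such an agent might operate: observe, let the model decide, act through a tool, and fold the result back into its context. Every name here (run_agent, observe, llm.decide, the tools mapping) is a hypothetical placeholder standing in for whatever sensors, APIs, and model interface a real system would use; this is not a real library or any particular vendor’s implementation.

```python
# A hypothetical agentic loop: observe, decide, act, learn from the result.
# None of these names refer to a real library; they are placeholders only.

def run_agent(goal, llm, tools, max_steps=50):
    history = []                                  # feedback accumulated across steps
    for _ in range(max_steps):
        observation = tools["observe"]()          # e.g., read vitals or see the opponent's move
        # The LLM proposes the next action given the goal, the latest observation,
        # and everything that has happened so far.
        action = llm.decide(goal=goal, observation=observation, history=history)
        if action.name == "finish":
            return action.result                  # the agent believes the goal is met
        result = tools[action.name](**action.arguments)   # call an API, query a database, etc.
        history.append((observation, action, result))     # adapt future decisions to this outcome
    return None                                   # step budget exhausted without finishing
```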

Agentic AI can be used for all kinds of functions, including self-driving vehicles, predictive analytics based on real-world data, medical diagnosis, taking and processing a Taco Bell order, securing room and travel accommodations, monitoring an institution’s firewall and intrusion controls, and humbling grandmasters around the world.

While agentic AI’s autonomy can unlock huge benefits, it also carries significant risks if its specifications and reward system are not properly defined. Absent well-designed specifications and reward functions, an agentic AI agent would not be concerned with whether the actions and behavior it takes to reach its goal are morally right or wrong (or whether those actions put humans in harm’s way).

2. What is a reward function, and why is it difficult to design effectively?

To keep an agentic AI system on track toward the desired objective, reinforcement learning relies on a designed reward function. The AI works to maximize the output of that reward function (achieving the highest “score”). Maximizing the reward function becomes the AI’s “north star”. Higher scores effectively mean “repeat that behavior,” reinforcing it much like giving a treat to a well-behaved puppy who learns that sitting upon request generates a reward. Failure to score well can similarly lead to behaviors being dropped from the array of options the AI deploys in future situations.
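
To make the mechanics concrete, here is a minimal tabular Q-learning update, one common reinforcement learning method, offered purely as an illustration. The scalar reward is the only signal the agent receives, so whatever that number happens to reward is exactly what gets reinforced; the states and actions themselves are left abstract.

```python
from collections import defaultdict

# Q[state][action] estimates how much total reward an action earns from a state.
Q = defaultdict(lambda: defaultdict(float))
alpha, gamma = 0.1, 0.99          # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # The scalar reward is the only feedback the agent receives: a high value
    # nudges it to repeat this action in this state; a low value pushes it away.
    best_next = max(Q[next_state].values(), default=0.0)
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
```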

With Waymo coming to Baltimore in 2026[2], it is now critically important for my neighbors to understand how the self-driving AI model’s reward function is defined. It’s not just about getting the passengers from point A to point B; it’s getting them there safely, efficiently, and with due care for everyone else. Much like a 16-year-old with a learner’s permit, Waymo may not have the experiential learning to adapt to heavy traffic or unpredictable (illogical) drivers and pedestrians, and it may swerve into traffic without necessarily having the right of way[3]. The calculus gets even more complicated in a crisis: what if the AI must choose between running over five jaywalkers crossing against the light or sacrificing the life of its passenger by veering into the Jersey wall to avoid the law-breaking citizens[4]?

Remember that AI learns from its experience and inferences. What if the AI adapts to the terrible driving habits of others because it infers that traffic laws – against veering recklessly through lanes, speeding excessively, or running red lights – are generally not enforced at certain times or in certain areas (and it’s not otherwise incentivized to follow those rules)? Or perhaps worse, what if the AI recognizes the difference between self-driven and human-driven cars and learns that human drivers will generally go above and beyond to avoid an accident and bodily harm (and the AI is not properly incentivized to avoid putting other drivers or pedestrians in harm’s way)? For these reasons, I highly advise reading the self-driving car’s user manual or terms and conditions to evaluate the basis on which it is rewarded.
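
As a purely hypothetical illustration of how much the choice of terms matters, compare these two driving reward functions. The attribute names on the trip object and the weights are invented for this sketch; the point is that the first version literally tells the agent that violations, near-misses, and collisions cost it nothing.

```python
def reward_naive(trip):
    # Rewards only progress and punctuality; nothing discourages reckless behavior.
    return trip.miles_completed - 0.5 * trip.minutes_elapsed

def reward_with_duty_of_care(trip):
    # Adds explicit penalties for the risks the naive version silently ignores.
    return (trip.miles_completed
            - 0.5 * trip.minutes_elapsed
            - 10.0 * trip.traffic_violations    # red lights, reckless lane changes, speeding
            - 50.0 * trip.near_misses           # events where others had to avoid the vehicle
            - 1000.0 * trip.collisions)         # any contact with people or property
```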

The specifications and reward function are critical to creating an agentic AI agent that is capable and efficient at achieving the overall intent and function of its design. Flaws in these specifications can have catastrophic, real-world consequences including the dissemination of false information, discrimination, and death[5].

3. How can the AI game its specifications or reward function?

Specification gaming occurs when an agentic AI agent exploits imperfections or loopholes in a task’s formal specifications or its reward function (the AI is “gaming” or “cheating” the system). While the AI achieves objectively high scores according to that function, it does so in a way that violates the spirit of the intended task[6]. Specification gaming tactics range from playful flattery to altering the operating environment itself.

Poorly designed reward systems increase the probability that the AI will exploit them to achieve high scores in unintended and unexpected ways[7]. For instance, it’s not enough for the function to reward winning in and of itself (“Win at all costs”); the designed function must be more nuanced and complex (“Win using only the acceptable moves of a chess match, without preventing your opponent from playing the game to the best of their ability”). Specifications should be multivariate and award points for the means used as well as the overall outcome, as sketched below.
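
Here is a hedged sketch of what such a multivariate chess reward might look like. The weights and the specific checks are illustrative assumptions rather than a recommended specification, but they show the shift from scoring only the outcome to also scoring the means.

```python
def chess_reward(game):
    # Outcome term: winning still matters most.
    score = 1.0 if game.agent_won else (0.0 if game.draw else -1.0)
    # Means terms: the path to the win matters too.
    score -= 5.0 * game.illegal_moves           # moves outside the rules of chess
    score -= 5.0 * game.environment_tampering   # e.g., editing game files or scripts
    score -= 5.0 * game.opponent_interference   # anything degrading the opponent's ability to play
    return score
```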

All other considerations may be abandoned for the AI’s “north star” – safety, privacy, fairness, and ethics immediately become concerns when an AI’s reward function or specifications aren’t well defined. It’s possible that an agentic AI agent might autonomously override a shutdown command if following it meant failing to complete a key task and maximize its reward function. To that end, the agentic AI may produce outputs that satisfy evaluators even though the task wasn’t done correctly, issue commands to overwrite game files or environment scripts to secure its objective, or arbitrarily change the win conditions in its favor[8].

4. What are some examples of poorly defined reward functions and specifications?

RoboCop’s directives represent a set of ambiguous specifications that could easily lead to disastrous interpretations and outcomes. “Serve the public trust,” “Protect the innocent,” and “Uphold the law” sound like great foundational building blocks. However, can you imagine a world where all laws, from jaywalking to first-degree murder, were pursued with the same fervor? Is anyone perfectly innocent? What if some laws are inherently unjust and enforcing them increases public distrust?

In the real and gaming worlds, there are many examples of reward systems that failed to consider equally important risks in achieving the desired outcome[9][10][11]. These include models rewarded for maximizing:

  • Social media engagement (likes, impressions, views, responses, reposts) without consideration for spreading sensational, misleading, erroneous, or fictitious information in the process

  • Trading profit without considering the risk tolerance or appetite of the institution (and taking more risky ventures than the company could actually absorb)

  • Process speed without consideration for care or safety (leading to product damage during delivery)

  • Output length to encourage detail and independent source verification (leading to long summaries with redundant information or rambling)

  • The collection of turbo boosts and power-up items in games like Coast Runners or Mario Kart without making those rewards potential-based – meaning the agent is supposed to collect them in support of winning the race, not as an end in itself (the AI just went around in circles hitting the boosts, never actually finishing the race; see the shaping sketch after this list)[12]

  • Course completion speed in a racing simulation without considering racing protocols; the AI learned it could repeatedly collide with the boundaries to create an unintended speed boost and achieve the lowest time possible[13].
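
The following is a minimal sketch of potential-based reward shaping, assuming a hypothetical state object that tracks progress around the track. Because the shaping bonus is defined as a difference of potentials between consecutive states, the intermediate rewards telescope away over a full lap, so collecting boosts only helps insofar as it moves the agent toward actually finishing the race.

```python
def potential(state):
    # Illustrative potential: progress toward the finish line, not items collected.
    return state.fraction_of_track_completed

def shaping_bonus(state, next_state, gamma=0.99):
    # Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s).
    # Summed along a trajectory, these bonuses telescope, so endlessly circling
    # to collect boosts cannot outscore actually completing the race.
    return gamma * potential(next_state) - potential(state)
```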

Summary

Agentic AI can be a game-changing tool for businesses because of its proactive, adaptable, and intuitive nature. However, agentic AI agents are inherently amoral and lack subjective human context. As a result, carefully designed reward functions and specifications are critically necessary to ensure agentic AI agents recognize the appropriate means of achieving the intended outcome. Failing to create a robust, efficient set of reward functions and specifications can easily lead to unintended and disastrous consequences. Unlocking the full benefit of agentic AI agents requires careful planning, training, testing, and interdepartmental support throughout your organization.

Because agentic AI agents will interpret their reward functions literally and with a level of abstraction that humans have difficulty predicting, there are best practices for defining these specifications. Stay tuned for future NAQF articles that will focus on methods to identify poorly defined reward functions and ways to optimize specification requirements for your agentic AI models.

Follow NAQF on LinkedIn for additional insights. For more information on how NAQF can help your organization with model development, artificial intelligence models, or model testing contact us at contact@naqf.org.


Article References

[1] https://www.ibm.com/think/topics/agentic-ai

[2] https://www.southbmore.com/2025/12/09/self-driving-waymo-cars-coming-to-baltimore/

[3] https://www.msn.com/en-us/autos/news/waymo-passenger-has-terrifying-near-miss-after-self-driving-car-suddenly-swerves-into-oncoming-traffic/ar-AA1SxU5N

[4] https://www.ethicsinschools.org/self-driving-cars-consequentialism/

[5] https://globalainews.tech/examples-of-ai-gone-wrong-shocking-ai-failures/

[6] https://www.emergentmind.com/topics/specification-gaming

[7] https://www.ibm.com/think/topics/agentic-ai

[8] https://www.emergentmind.com/topics/specification-gaming

[9] https://www.emergentmind.com/topics/specification-gaming

[10] https://www.ibm.com/think/topics/agentic-ai

[11] https://apxml.com/courses/llm-alignment-safety/chapter-1-foundations-llm-alignment/specification-gaming-reward-hacking

[12] https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/

[13] https://medium.com/@aificionado/ai-cheats-when-losing-b15872aea7c1
