Setting Up AI Models for Success

Written by Geoff Kreller, CRCM, CERP

Artificial Intelligence (AI) can be a game-changing tool for digesting large data sets at lightning speed and then making predictions, offering recommendations, and automating tasks. For this reason, AI has the potential to transform and simplify processes and functions across finance and banking. From fraud detection to customer service support to underwriting to budgeting projections, well-trained models can vastly accelerate the collection and research of data, enabling humans to focus on analyzing and weighing the model’s conclusions and outcomes.

The foundation to unlocking these benefits is the development of a well-trained and well-governed model. For a model to be successful, an organization must maintain diligence throughout the entire model lifecycle. In many cases, companies and employees have become wary of, or even afraid of, AI models simply because the strategy, governance, and implementation around a model failed spectacularly.

There’s also an inherent difference in specifying desired outcomes to a human versus an agentic AI model. Humans naturally (in varying degrees) apply ethics, compliance, risk, broader long-term goals, and subjective understanding to recognize the intent and appropriate path to a goal. It doesn’t occur to most of us that we could win a game of online chess simply by sending malware to our competitor, disrupting their internet connection or damaging their software. Concepts of morality, ethics and fair play give humans a directional “north star.” Unencumbered by human morality, ethics, and a clear understanding of underlying intent, agentic AI models can demonstrate a sophisticated level of innovation that includes brilliant, unforeseen strategies which may incorporate cheating, manipulation, lying, flattery, and deception to reach their target objectives.

To these ends, this article discusses the governance required to maximize the benefits of deploying an AI model, while also communicating to your employees how they remain pivotal and relevant within your organization.

1. Model Risk Governance

Even before the emergence of AI, many organizations developed robust model risk management (MRM) teams. Because AI models can exponentially accelerate decisions and recommendations, they can also quickly exacerbate risks related to credit quality, liquidity, and an institution’s operations and reputation. For this reason, it’s crucial that your organization has an enterprise-wide MRM team that covers risk, compliance, legal, strategy, information technology, information security, engineering, and key department leaders.

An institution’s MRM policy should clearly articulate the roles and responsibilities of model management:

  • Subject Matter Experts (SMEs) => involved in the model’s training, including initial evaluations on achieving outcomes in the intended manner.

  • Quality Assurance (QA) => evaluates whether the model is operating as intended (or within the stipulated error tolerance).

  • Department Leaders => responsible and accountable for models, including their risks, controls, incidents and issues.

  • Executive and Senior Leadership => provides strategy and approvals for live implementation and use of models (including shutting down or suspending models as appropriate).

  • Risk and Compliance => Evaluates whether the control environment satisfies all applicable risk obligations, and whether the model’s actions and outcomes are appropriate, auditable and explainable. Risk and Compliance evaluates whether the model usage remains within the risk tolerance/appetite approved by the institution. Risk and Compliance also provides a credible challenge to the model through the use of alternative data, third party data, or challenger models.

  • Internal Audit => Evaluates the effectiveness of both the first line (SMEs, QAs, Department Leaders) and the second line (Risk and Compliance) in managing the institution’s most critical risks. This includes whether the first and second lines have appropriately identified the models that the institution uses, and prioritized their review in a way that matches the magnitude of risk inherent in their use.

From ideation, your MRM team should be aware of the desire to create a model (along with its intended purpose) and maintain a full list of models currently in use by the institution. For each model, the MRM team should have a centralized management tool that outlines each model’s:

  • Purpose and scope

  • Status (Ideation, In Testing, Implemented, Suspended, End-of-Life)

  • Risk level

  • Types of risk involved

  • Implementation Date

  • Last Update

  • Next Review/Refresh Date

  • End Date
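As a sketch, one entry in such an inventory tool might be represented like this; the field names, statuses, and helper method are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class ModelStatus(Enum):
    IDEATION = "Ideation"
    IN_TESTING = "In Testing"
    IMPLEMENTED = "Implemented"
    SUSPENDED = "Suspended"
    END_OF_LIFE = "End-of-Life"

@dataclass
class ModelRecord:
    name: str
    purpose: str
    status: ModelStatus
    risk_level: str              # e.g. "Low", "Medium", "High"
    risk_types: list             # e.g. ["operational", "reputational"]
    implementation_date: date = None
    last_update: date = None
    next_review_date: date = None
    end_date: date = None

    def is_due_for_review(self, today: date) -> bool:
        """Flag models whose scheduled review/refresh date has passed."""
        return self.next_review_date is not None and today >= self.next_review_date

record = ModelRecord(
    name="Customer Service Chatbot",
    purpose="Answer common customer questions",
    status=ModelStatus.IN_TESTING,
    risk_level="Medium",
    risk_types=["operational", "reputational"],
    next_review_date=date(2025, 6, 1),
)
print(record.is_due_for_review(date(2025, 7, 1)))  # True
```

A structured record like this lets the MRM team query the whole inventory, for example to list every implemented model that is past its review date.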

The MRM team should be aware of the institution’s overall risk appetite and its acceptable risk tolerance during the implementation of a new model. An institution may have an overall risk tolerance of 1% for underwriting errors leading to credit exceptions and temporarily tolerate errors above that threshold during a model’s implementation period. Understanding an institution’s risk appetite and tolerance will inform each individual model’s acceptable tolerance for error and the level of monitoring necessary to support that tolerance level.

A failure to set the “tone-from-the-top” can create unrealistic expectations, sound strategies that are poorly executed, vague strategies, and employees feeling lost about how they fit into the organization’s future plans. From an organizational perspective, it’s important to promote the efficiencies that AI may bring, such as “we are excited to begin development on our new Chatbot, which we anticipate will reduce existing customer service contacts by 75%”. However, if you’re in customer service and executive leadership hasn’t considered or discussed how that will impact the organizational chart, it’s only natural for those individuals to be concerned about their livelihoods and reticent to train your new model effectively.

In addition to considering how each model will benefit the organization, institutions should evaluate how employees can continue to be vital to its success. The goal may be to reduce the customer service contact rate by 75%, but the organization may be expecting a 100% increase in applications over the next year, and the customer service group is already at maximum call or email capacity. The organization may expect the Chatbot to bend the contact curve backward by answering very common questions, though the organization expects its existing employees to remain and spend more time addressing more complex questions. Moreover, some SMEs may be asked to perpetually monitor the model’s performance and step in when the model starts to drift from its core data and understanding.

In parallel to training the new model to take on the routine work, the organization should leverage the projected organization and resourcing plan to train and upskill its existing workforce to meet the company’s future expected requirements. Whether that involves credibly challenging outcomes, understanding the model’s behavior, or taking on more complex inquiries that remain, preparing employees for their revised role will give them a better sense of security and resilience during this transition.

2. Documenting the Model

Model documentation is crucial in highlighting how each model is expected to be trained, monitored, implemented and periodically refreshed. The model’s parameter document should include the model’s:

  • Owner (Department and Person)

  • Approver (Person), which includes approving refreshes, model suspensions and terminations for cause

  • Purpose and objective

  • Risk level and type

  • Inputs, data and variables used in training and development

    • Data type and structure

    • Dataset size

    • Dependence on third party data sources

    • How the company has obtained consent to use this data

  • Computational resources required

  • Training parameters and hyperparameters

  • Model/Task type

  • How the model is expected to use the inputs, data, and variables provided

    • Algorithms used

    • Whether the model is allowed to make its own learning inferences and/or act on those patterns

  • Expected outputs

  • Priority on intent and interpretability

  • Key performance and risk indicators

  • Whether a human interprets those outputs in making a decision or final response

  • Confidence level and tolerance for error

  • Refresh periods

  • Monitoring controls and expectations

  • Periodic credible challenge/testing

The problem statement (what are we trying to solve) and objectives should be clear by reading the model’s process document. A clearly articulated and defined statement becomes the model’s cornerstone and will inform decisions on the data and algorithms required to meet that objective. Continuing from the customer service example above, a specific and measurable objective might be “reduce customer support demand by 40% (contact to application ratio) by leveraging an AI chatbot to answer common questions asked by our customers”.

Understanding the objective helps translate that problem statement into a solvable AI task[1]. Common AI tasks include:

  • Classification: Ability to predict a categorical output, such as recognizing a fraudulent transaction, identifying unusual activity, categorizing and trending complaints, or determining whether an incoming email is spam.

  • Regression: Ability to predict/forecast the value of a continuous numerical output (inflation, asset values, housing prices)

  • Natural Language Processing (NLP): Tasks involving human language which may include sentiment analysis, text summarization, transcription, language translation, or chatbot development

  • Computer Vision: Tasks involving images or videos, such as object detection, image classification, or facial recognition (including multi-factor authentication)

  • Clustering: Grouping similar data points together without prior knowledge of the groups, which is useful for market segmentation, customer persona development, or anomaly detection

  • Recommendation Systems: Tasks that suggest items or content to users based on preferences and past behavior (purchase recommendations, financial product advertisements)

Choosing the appropriate algorithm and architecture based on the type of problem statement and AI task at hand is an important model consideration[2]. Some examples include:

  • Traditional Machine Learning Algorithms: Great for structured data, smaller data sets, and when there is a high priority on being able to interpret the process and outcome (such as lending and hiring decisions)

    • Linear Regression: Predicts a continuous numerical output by assuming a linear relationship between the data and the target variable; for binary classification, the closely related logistic regression technique is typically used instead

    • Decision Trees: Non-linear models that make decisions based on a series of logical rules (such as “if/then,” “if and only if,” “and,” “not,” and “true/false”)

      • Random Forests: Combines multiple decision trees to improve accuracy and reduce overfitting.

  • Deep Learning Architectures: Effective for complex, unstructured data such as images, text, audio and video, and often require large data sets and IT resources to develop.

    • Artificial Neural Networks (ANNs): The foundational structure of deep learning, consisting of interconnected layers of neurons. Suitable for various tasks, from classification to regression.

    • Transformer Models: Enhanced Natural Language Processing (NLP) models that leverage mechanisms to weigh the importance of different parts of the input data, making them highly effective for tasks like language translation, text generation, and sentiment analysis.
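To make the if/then logic a decision tree encodes concrete, here is a toy hand-written rule set for a credit decision; the thresholds, field names, and outcomes are purely illustrative, not real underwriting policy (in practice a trained tree learns such splits from data rather than having them hand-coded):

```python
def credit_decision(credit_score: int, dti: float, ltv: float) -> str:
    """Toy decision-tree-style rules for a credit decision.

    Thresholds are illustrative only. A trained decision tree would
    learn comparable split points from historical labeled data.
    """
    if credit_score < 620:
        return "decline"        # first split: minimum credit score
    if dti > 0.43:
        return "decline"        # second split: debt-to-income cap
    if ltv > 0.95:
        return "refer"          # edge case: route to a human underwriter
    return "approve"

print(credit_decision(credit_score=700, dti=0.30, ltv=0.80))  # approve
```

Because each decision traces a readable path of rules, tree-based models score well on the interpretability priority the text highlights for lending and hiring decisions.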

Setting parameters and hyperparameters before training begins is important to keep the model from being too rigid (overfitting the relationship between the data and the target variable), too vague (underfitting), or missing important generalizations because of a myopic focus on minute details.

Parameters are the internal settings or weights that the model adjusts during training to fit the data. These adjustments aim to minimize the difference between the model's predictions and the actual outcomes. Parameters help the model achieve the expected outcome using existing training data, and to subsequently use that alignment to achieve the correct outcome when presented with new data to interpret.

Hyperparameters are configuration settings external to the model, and include factors such as the learning rate, batch size, epoch passes, and neural complexity (layers and volume) for neural networks such as ANNs[3]:

  • Learning Rate: This controls the step size at which the model's weights are updated during optimization. A high learning rate can overshoot the optimal solution, while a very low learning rate can make training slow and potentially get stuck in minute details.

  • Batch Size: The number of training examples utilized in one iteration. Larger batch sizes can lead to faster training but might require more memory and can sometimes converge to less optimal solutions. Smaller batch sizes introduce more noise but can help the model escape focus of minute details and generalize better.

  • Epochs: One epoch represents one full pass through the entire training dataset. The number of epochs determines how many times the model will see the entire dataset. Too few epochs can lead to underfitting, while too many can result in overfitting.

  • Number of Layers and Neurons (for Neural Networks): These define the complexity of the neural network architecture. More layers and neurons can capture more intricate patterns but increase computational cost and the risk of overfitting.
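To make the parameter/hyperparameter distinction concrete, here is a minimal gradient-descent sketch in pure Python; the data set and values are illustrative. The weight `w` is a parameter the model learns, while `learning_rate` and `epochs` are hyperparameters chosen before training:

```python
# Toy data with a true relationship of y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def train(learning_rate: float, epochs: int) -> float:
    w = 0.0  # parameter: adjusted by the model during training
    for _ in range(epochs):
        # gradient of mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad  # step size set by the learning rate
    return w

print(round(train(learning_rate=0.05, epochs=200), 3))  # 2.0
```

With `learning_rate=0.05` the weight converges to the true slope of 2. Raising the rate too high on this same data makes each update overshoot and diverge, while a very low rate needs far more epochs, which is exactly the trade-off described above.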

Part of the value of an experienced MRM team is the ability to collaborate, manage, and balance countervailing risks that result from models. Key performance indicators, monitoring, and periodic testing help to identify the model’s existing levels of risk, whether those risks are within the company’s risk appetite, and whether the model is performing as expected (providing the expected value to the institution). While MRM is a foundational element enabling success, ensuring the model’s data quality is equally critical.

3. Ensuring Quality Data

Everyone has heard the phrase “bad data in, bad information out”. For AI, bad data could be biased, skewed, incomprehensible, conflicting or incomplete, and that can lead to recommendations and outcomes that are far off the mark. Ensuring that quality data is used in training, testing, and periodic refreshes will mitigate the possibility of erroneous outcomes.

Basic data cleaning involves:

  • Identifying missing data values

  • Labelling/Annotating data columns

  • Identifying and considering outliers and edge cases (capping/flooring)

  • Removing duplicate entries

  • Formatting inconsistencies

  • Scaling data to a common range

  • Quantifying text and strings (qualitative data)

  • Anonymizing the data (protecting identities and sensitive information)
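Several of the cleaning steps above can be sketched on a toy data set; the field names and values are illustrative:

```python
# Toy loan records with a duplicate and a missing value
raw = [
    {"id": 1, "income": 50000, "score": 700},
    {"id": 1, "income": 50000, "score": 700},   # duplicate entry
    {"id": 2, "income": None,  "score": 640},   # missing value
    {"id": 3, "income": 90000, "score": 760},
]

# 1. Remove duplicate entries (keyed on id here).
seen, deduped = set(), []
for row in raw:
    if row["id"] not in seen:
        seen.add(row["id"])
        deduped.append(row)

# 2. Set aside rows with missing values for review.
complete = [r for r in deduped if all(v is not None for v in r.values())]

# 3. Scale income to a common 0-1 range (min-max scaling).
incomes = [r["income"] for r in complete]
lo, hi = min(incomes), max(incomes)
for r in complete:
    r["income_scaled"] = (r["income"] - lo) / (hi - lo)

print([r["income_scaled"] for r in complete])  # [0.0, 1.0]
```

In production these steps are typically handled by a data-preparation library rather than hand-rolled loops, but the logic, and the decisions about what to drop, impute, or flag, is the same.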

It’s crucial to remember that a model only knows what the institution tells it. Any assumptions, subjective or historical context, or gaps in its knowledge won’t be recognized by the AI. AI will take everything literally at face value and will treat all data the same (unless you tell it not to). From a privacy and fair lending perspective, that’s an important aspect to consider when your institution is considering the use of sensitive data during model training.

Employees may already know when policies and procedures have changed historically, but the model won’t. Using a credit model as an example, you may need to create quantitative data fields for loan cohorts by credit policy iteration or account for temporary changes made during COVID for business continuity purposes.

Employees may understand policy or procedural nuances that aren’t specifically identified in those documents. Gaps like these can lead to chatbots giving wrong or highly inappropriate advice to customers, as happened to Air Canada, New York City, the National Eating Disorder Association[4], and OpenAI[5] (ChatGPT). In each of these cases, the AI’s misunderstanding led to significant operational, reputational, and personal risk.

Data quality should be credibly challenged by SMEs and Risk and Compliance. Underrepresentation of diversity (race, ethnicity, gender, age, etc.) can lead to algorithmic bias toward those groups. Missing data can have catastrophic consequences, including cases where model training for self-driving car or plane mechanisms did not fully address situational awareness[6][7]. Initial model validation and testing should not be taken lightly; it may take several iterations before an institution’s data quality provides an appropriate and adequate foundation for a model’s success.

It’s also important at this early stage to consider, manage, and mitigate potential biases that may appear in your data set:

  • Confirmation Bias: Does your data selection confirm your pre-existing beliefs or hypotheses?

  • Availability Bias: Did your organization opt to include free data sets and exclude others that were behind paywalls?

  • Algorithmic Bias: Does your data contain unfair, discriminatory elements because of incomplete or skewed data?

  • Cognitive Bias: Is the data creating systematic errors in reasoning due to subjective perceptions of reality or limited information about context?

  • Exclusion Bias: Has important data been left out or not considered?

  • Sampling/Selection Bias: Was the sample selected in a way that fails to represent the whole population?

  • Implicit Bias: Does the developer have unconscious bias about the subject matter?

  • Measurement Bias: Are the measures consistent, accurate and standardized across the entire data set, or do they contain subjective elements (such as grades within a GPA, where some teachers are harsher or some classes are easier)?

Both the Dutch parliament and Amazon discovered (after implementation) that models can display algorithmic bias[8], especially when a model’s intent and outcomes aren’t monitored, or when the information used to train the model underrepresents a particular race, ethnicity, or gender.

An experienced, independent, and diverse MRM team is capable of credibly challenging the completeness of data sets and highlighting potential biases that may exist, especially if they are aware of the data’s historical context. Whether your car has only been tested on empty roads in the daytime or your institution hasn’t considered your prior lending or hiring data from a demographic standpoint, it’s imperative that your oversight team is able to identify potential flaws, gaps and weaknesses in the data prior to training.

4. Model Training: Loss functions, reward functions, and specifications

Your available data should be split into a training set, a validation set, and a distinct testing set. The model learns from the training set, validates its understanding of the relationships on the second set, and then has its responses to fresh data evaluated for accuracy. If the model performs well in training but fails in validation or testing, that may indicate the model is overfitting the relationship between the variables and the expected outcome (the line fits the training data but can’t reasonably be applied to any other data due to its rigidity).
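A minimal split might look like the following; the 70/15/15 proportions are a common convention, not a requirement:

```python
import random

# Shuffle once with a fixed seed so the split is reproducible,
# then carve out 70/15/15 train/validation/test sets.
records = list(range(100))      # stand-in for 100 labeled examples
rng = random.Random(42)
rng.shuffle(records)

n = len(records)
n_train = 70 * n // 100         # integer arithmetic avoids float rounding surprises
n_val = 15 * n // 100
train = records[:n_train]
validation = records[n_train:n_train + n_val]
test = records[n_train + n_val:]

print(len(train), len(validation), len(test))  # 70 15 15
```

Shuffling before splitting matters: if the data is ordered (say, by application date), an unshuffled split would train on one era and test on another, confounding the evaluation.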

Supervised learning is the most common AI training, where the model learns from labeled historical data, meaning that for each input, the correct output is already known. For instance, training data on credit decisions made on the basis of credit score, debt-to-income ratio, and a loan-to-value ratio would also contain the associated outcome (approval/denial).

During training, the model’s loss function measures the difference between the model’s predictions and the actual target values, and the model adjusts its parameters to minimize this loss, effectively improving its accuracy. However, “overfitting” is a real concern. If a model is too closely aligned to the training data (creating a “perfect” line of best fit for the training data), it may have difficulty achieving the appropriate outcome when presented with new data. If variables key to the outcome are missing (in the case of a credit policy, the maximum home value or number of delinquencies, for instance), the model will also fail to reach the correct outcome.

Unsupervised learning is conducted on unlabeled data (data that does not have an associated known outcome), and can be useful in discovering hidden patterns, inferences, or relationships within the data. This type of learning is useful as an exploratory analysis; however, the observations, insights and findings should be vetted thoroughly by your MRM team.

Reinforcement learning is important when you’re trying to give AI agency – the ability to make decisions autonomously. Agentic AI agents learn through making decisions by interacting with a test environment.

Specifications and reward functions are critical in creating an agentic AI agent capable and efficient at achieving the overall intent and function of its design. Flaws in these specifications can have catastrophic, real world consequences including the dissemination of false information, discrimination, and death[9].

5. Pre-live testing for accuracy and bias

Once the training period is complete, the validation period commences. For supervised learning, the validation data contains the expected outcome, and the model applies its understanding to that data to determine if it reaches the same outcome. Upon successful validation, the model is tested on a data set where the outcome is not provided, and the model must apply its understanding to that new data. When the model has seen the validation or testing data (especially when the test fails), it is a best practice to consider that data part of the training set and to provide a brand new testing data set when the model is ready to be tested again.

Reinforcement learning through specifications, requirements, and reward functions encourages the model to pursue optimal responses and dismiss other options with a lower likelihood of reaching the target outcome.

There are several common metrics used for testing the effectiveness of classification and regression models. When models show weaknesses in these metrics, improvement in the model is often required. Improvement and refinement can be achieved by tuning the model’s parameters, leveraging a more appropriate algorithm, adding more training data, or considering cross-validation to prevent overfitting.

Classification:

  • Accuracy – The correct outcomes divided by the total number of cases.

  • Precision – The proportion of true positive predictions among all positive predictions (high precision means few false positives)

  • Recall (Sensitivity) – The proportion of true positive predictions among all actual positive cases (high recall means few false negatives)

  • F-1 score – Harmonic mean that balances precision and recall

Institutions may also create a “Confusion Matrix”, a two-by-two table that summarizes performance by showing the number of true positives, true negatives, false positives, and false negatives.

In many models, there is often a trade-off at some level between higher precision (limiting false positives) and higher recall (limiting false negatives). Within fraud detection, an institution ideally wants to limit blocks on legitimate transactions (false positives); however, it must consider the percentage of fraud transactions that fail to be flagged (false negatives). Similarly, a credit model needs to balance incorrect approvals (false negatives that are now credit exceptions) and incorrect declines (false positives that lead to missed opportunities for the institution). The F-1 score is often used to express the balance between these two measurements.
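All four classification metrics fall directly out of the confusion-matrix counts; the fraud-detection numbers below are illustrative:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of everything flagged positive, how much was right
    recall = tp / (tp + fn)      # of all actual positives, how much was caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative fraud-detection results: 80 frauds caught, 20 legitimate
# transactions blocked, 10 frauds missed, 890 legitimate passes.
m = classification_metrics(tp=80, fp=20, fn=10, tn=890)
print(round(m["precision"], 2), round(m["recall"], 2))  # 0.8 0.89
```

Note how accuracy alone (97% here) can look reassuring even while one fraud in nine slips through, which is why precision, recall, and F-1 are reported alongside it.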

Regression:

Regression metrics focus on the difference between the predicted values (the “line of best fit”) and the actual data:

  • Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values. It measures the average magnitude of errors without considering their direction.

  • Mean Squared Error (MSE): The average of the squared differences between predicted and actual values. It penalizes larger errors more heavily than MAE.

  • Root Mean Squared Error (RMSE): The square root of MSE. It is often preferred over MSE because it is in the same units as the target variable, making it more interpretable.

  • R-squared (Coefficient of Determination): Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared value indicates a better fit of the model to the data.
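The four regression metrics can be computed in a few lines; the house-price values below are illustrative:

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE, and R-squared for a set of predictions."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = math.sqrt(mse)                          # back in the target's units
    mean_a = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot                       # share of variance explained
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Illustrative house prices (in thousands)
m = regression_metrics(actual=[200, 300, 400], predicted=[210, 290, 410])
print(m["mae"], m["rmse"])  # 10.0 10.0
```

Because every error here has the same magnitude, MAE and RMSE happen to agree; if one prediction were off by far more than the others, RMSE would exceed MAE, reflecting its heavier penalty on large errors.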

Testing Deep Learning Architectures such as ANNs and Transformer Models

In addition to the metrics above, testing deep learning architectures must also be qualitative in nature. Testing should consider ways in which third parties may attempt to manipulate or break the model, whether it’s ordering 20,000 McDonald’s nuggets, introducing derogatory language, initiating an injection attack[10], or asking it questions beyond the limits of its programming.

In these architectures, asking the same prompts in different ways (covering regional variances in parlance and context) is important. The model will likely be ineffective if the customer has to enter the prompt (What is the capital of Kentucky?) exactly as the model was trained to get an accurate response (Frankfort). If the model can’t also answer (Is Louisville the capital of Kentucky?) with “No, Frankfort is the capital of Kentucky”, that will significantly limit the benefits of your ANN or transformer model.

Testing the limits of the architecture is also important. Ensure that any question it’s not supposed to answer returns the programmed follow-up. Review whether it is susceptible to following instructions embedded in text, such that third parties could inject malicious commands through the model’s inputs.

6. Perpetual Monitoring and Refreshing

There was a recent case where Sports Illustrated had to part ways with a content company that used AI to write articles; neither the content company nor Sports Illustrated reviewed the articles prior to publishing[11]. Had there been consistent monitoring of the content or its third party, Sports Illustrated likely would have acted much earlier, mitigating significant damage to its operations and reputational brand.

Organizations should consistently vet and edit AI-generated materials for inaccuracies, whether it has generated articles, images, policies, procedures, or blog posts. While AI-generated material can accelerate the development process, those outputs should be considered a rough draft requiring human review and refinement.

All models should be periodically monitored to ensure that their recommendations and outputs stay on target with the model’s purpose, intent, limitations, and overall risk appetite. Complaint trends and themes can be used to detect areas where your model is falling short of the mark. Key performance and risk indicators can quickly surface anomalies in the model’s operation, such as unforeseen spikes in approvals, denials, or fraud blocks.

In addition, models built on historical data can quickly become stagnant and degrade over time. Model drift (or model decay) refers to the degradation of machine learning model performance due to changes in data or in the relationships between input and output variables. Model drift can negatively impact model performance, resulting in faulty decision-making and bad predictions.

To detect and mitigate drift, organizations should monitor and manage performance on their data and artificial intelligence (AI) platform. If not properly monitored and refreshed over time, even the most well-trained, unbiased AI model can “drift” from its original parameters and produce unwanted results when deployed. As a result, drift detection is a crucial component of strong AI governance[12].
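As a minimal sketch of drift monitoring, an institution might compare a live metric against its validation-time baseline; the metric, rates, and tolerance band below are illustrative:

```python
def drift_alert(baseline_rate: float, live_rate: float,
                tolerance: float = 0.05) -> bool:
    """Flag when a live metric moves outside the approved tolerance band."""
    return abs(live_rate - baseline_rate) > tolerance

baseline = 0.62          # approval rate observed during validation
recent = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # latest decisions (1 = approved)
live = sum(recent) / len(recent)           # 0.30

print(drift_alert(baseline, live))  # True: the 0.32 gap exceeds the 0.05 band
```

Production drift detection usually goes further, comparing full input and output distributions with statistical tests rather than a single rate, but the governance pattern is the same: a baseline, a tolerance tied to the approved risk appetite, and an alert that triggers review or retraining.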

Summary

Much like it takes a village to raise a child, it takes an enterprise-wide effort to create and sustain a model. Successful models are built on a platform of sound governance and data awareness, and strengthened through validation, monitoring, testing, and periodic refreshes.

Executive leaders should proactively recognize and communicate how they expect AI models to change the dynamics, functions, roles, and organizational chart of the company. In conjunction with training these models, organizations should invest the time and energy into preparing employees for how their roles will change within the company after implementation.

The price of model success is eternal vigilance for its entire lifecycle from ideation to archival. Without human guidance, context, and support, models simply become directionless tools that benefit no one.

 

Follow NAQF on LinkedIn for additional insights. For more information on how NAQF can help your organization with model development, artificial intelligence models, or model testing contact us at contact@naqf.org.


Article References

[1] https://www.scrapeless.com/en/blog/how-to-train-an-ai-model

[2] https://www.scrapeless.com/en/blog/how-to-train-an-ai-model

[3] https://www.scrapeless.com/en/blog/how-to-train-an-ai-model

[4] https://www.livescience.com/technology/artificial-intelligence/32-times-artificial-intelligence-got-it-catastrophically-wrong

[5] https://www.cio.com/article/190888/5-famous-analytics-and-ai-disasters.html

[6] https://www.livescience.com/technology/artificial-intelligence/32-times-artificial-intelligence-got-it-catastrophically-wrong

[7] https://industrywired.com/artificial-intelligence/ai-gone-wrong-top-5-disasters-in-recent-history-10527075

[8] https://www.livescience.com/technology/artificial-intelligence/32-times-artificial-intelligence-got-it-catastrophically-wrong

[9] https://globalainews.tech/examples-of-ai-gone-wrong-shocking-ai-failures/

[10] https://www.forbes.com/sites/bernardmarr/2026/01/28/when-ai-agents-turn-against-you-the-prompt-injection-threat-every-business-leader-must-understand/

[11] https://subtratech.com/how-sports-illustrateds-ai-gamble-backfired-sparked-outrage-and-sent-executives-packing/

[12] https://www.ibm.com/think/topics/model-drift
