Why AI Auditing Is No Longer Optional

The EU AI Act, effective for high-risk AI systems from August 2026, mandates conformity assessments before deployment and ongoing monitoring documentation afterward. The Federal Reserve's SR 11-7 guidance on model risk management applies explicitly to AI models in financial services. EEOC guidance requires employers using AI in hiring decisions to conduct adverse impact analysis. Across sectors, the regulatory direction is the same: document what your AI systems are doing and prove they are doing it safely, fairly, and as intended.

Beyond regulation, the operational case for AI auditing is compelling. Enterprises that conduct regular AI audits detect model drift 4.7 times faster than those relying on incident reports. They identify fairness problems before they escalate into regulatory action. And they build the institutional knowledge of their AI portfolio that is required to govern it effectively.

The challenge is that most enterprises have not built audit capability. They have built model development capability and model deployment capability. The audit function — independent evaluation against defined standards — has been assumed rather than designed. This guide addresses that gap.

Regulatory Requirement

Under the EU AI Act, providers of high-risk AI systems must maintain technical documentation sufficient for competent authorities to assess the system's conformity with the Act. This documentation must be kept for 10 years after the system is placed on the market. Retroactive documentation is not accepted. The audit trail must exist from deployment.

Defining Audit Scope

The first decision in any AI audit program is what gets audited. Not all AI systems warrant the same audit intensity. A risk-based approach focuses audit resources where the exposure is highest.

Audit scope determination flows from your AI system inventory and risk classification. Systems classified as high-risk require comprehensive audits on a defined schedule. Medium-risk systems require periodic sampling audits. Low-risk systems require only automated monitoring with periodic review.

The factors that determine audit priority are the same factors that determine risk classification: the consequences of error, the affected population, the reversibility of decisions, the degree of human oversight in the existing process, and the maturity of the system. For most enterprises, the initial audit program should target the 20% of systems that represent 80% of the risk exposure.
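
To make this concrete, audit priority can be computed as a weighted score over exactly these factors. A minimal sketch in Python; the factor scales and weights are illustrative assumptions to be calibrated against your own risk taxonomy:

```python
from dataclasses import dataclass

@dataclass
class AISystem:
    name: str
    consequence_of_error: int  # 1 (minor) to 5 (material individual harm)
    affected_population: int   # 1 (internal only) to 5 (broad public)
    irreversibility: int       # 1 (easily reversed) to 5 (irreversible)
    automation: int            # 1 (human decides) to 5 (no human oversight)
    immaturity: int            # 1 (long stable) to 5 (new or recently changed)

# Hypothetical weights -- calibrate to your own risk taxonomy.
WEIGHTS = {
    "consequence_of_error": 0.30,
    "affected_population": 0.20,
    "irreversibility": 0.20,
    "automation": 0.20,
    "immaturity": 0.10,
}

def audit_priority(system: AISystem) -> float:
    """Weighted risk score on a 1-5 scale; higher scores are audited first."""
    return sum(getattr(system, factor) * w for factor, w in WEIGHTS.items())

systems = [AISystem("credit-scoring", 5, 5, 4, 4, 3), AISystem("email-routing", 1, 2, 1, 5, 2)]
# Rank the portfolio and target the top slice first (the 20% carrying most of the risk).
for s in sorted(systems, key=audit_priority, reverse=True):
    print(f"{s.name}: {audit_priority(s):.2f}")
```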

A critical prerequisite: you cannot audit what you cannot inventory. Before building an audit program, complete a comprehensive AI system inventory. Include systems built internally, systems purchased from vendors, systems built by business units without central oversight, and models embedded in commercial software you deploy. Most enterprises discover substantially more AI systems than they thought they had.

The Six Dimensions of an AI Audit

A comprehensive AI audit evaluates a system across six dimensions. Each dimension requires distinct evaluation methodology and produces distinct evidence for the audit record.

Performance Integrity
Does the model perform as specified? Evaluate accuracy, precision, recall, and calibration against the documented performance requirements. Test on current production data, not the holdout set from original validation (a sketch of this check follows the six dimensions).
Fairness and Bias
Does the model produce systematically different outcomes for protected groups? Evaluate against fairness criteria appropriate to the decision context. Include intersectional analysis across combined protected attributes.
Explainability
Can decisions be explained to the level required by the decision stakes? Evaluate whether explanations are accurate, sufficient for affected parties, and compliant with the regulatory requirements for the use case.
Security and Robustness
Is the model robust to adversarial inputs, distribution shift, and data quality degradation? Test model behavior under edge cases and adversarial conditions, not just typical inputs.
Governance Compliance
Does the system comply with the organization's AI governance policies? Evaluate documentation completeness, accountability assignments, change management records, and incident history.
Data Governance
Are training and inference data managed in compliance with applicable requirements? Evaluate data provenance, consent records, retention compliance, and privacy controls on both training data and production inputs.
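
To illustrate the first dimension, here is a minimal performance integrity check using scikit-learn. The thresholds are hypothetical stand-ins for the system's documented acceptance criteria, and the inputs are assumed to be a labeled sample of recent production traffic:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, brier_score_loss

# Hypothetical documented requirements; a real audit uses the system's own criteria.
REQUIREMENTS = {"accuracy": 0.90, "precision": 0.85, "recall": 0.80, "brier": 0.10}

def performance_integrity(y_true, y_pred, y_prob) -> dict:
    """Compare metrics on a current production sample against documented requirements."""
    observed = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "brier": brier_score_loss(y_true, y_prob),  # calibration measure: lower is better
    }
    return {
        m: {
            "observed": round(v, 3),
            "required": REQUIREMENTS[m],
            "pass": v <= REQUIREMENTS[m] if m == "brier" else v >= REQUIREMENTS[m],
        }
        for m, v in observed.items()
    }

# y_true, y_pred, y_prob come from a labeled sample of recent production traffic,
# not the holdout set used in the original validation.
```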

The Audit Methodology

AI audits follow a six-phase methodology. Each phase produces documented outputs that together constitute the audit record. The audit record must be complete enough that a different auditor reviewing the same documentation would reach the same conclusions.

01
Pre-Audit Preparation
1 to 2 Weeks
Collect all existing documentation for the system: model card, technical specification, training data documentation, validation reports, incident history, governance approvals, and any prior audit findings. Identify gaps in documentation before audit work begins. Establish the evaluation criteria against which the system will be assessed. Agree on the audit scope with the model owner and governance committee. The pre-audit preparation phase often reveals that documentation gaps are themselves findings.
02
Evidence Collection
2 to 3 Weeks
Gather the evidence base for each audit dimension. This includes technical evidence (model outputs on test datasets, monitoring logs, performance reports), process evidence (governance approvals, change records, incident reports), and documentary evidence (policies, procedures, training materials). Evidence collection must be systematic and documented. What evidence was sought, what was obtained, and what was unavailable must all be recorded.
  • Request production inference logs for the audit period
  • Obtain training data documentation and provenance records
  • Interview model owner, data scientists, and operational stakeholders
  • Review all governance artifacts from the model's lifetime in production
03
Technical Evaluation
2 to 3 Weeks
Run the quantitative evaluations for each audit dimension. Performance testing on current production data. Fairness evaluation across protected groups with intersectional analysis. Explainability assessment against the required explanation level for the decision stake. Robustness testing under distribution shift and adversarial conditions. Security evaluation of the model serving infrastructure. Data governance review of training and inference data pipelines. Each evaluation must produce quantitative results documented in the audit record.
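As one example of the fairness portion of this phase, here is a minimal intersectional adverse impact check in pandas. The column names are hypothetical, and the 0.8 threshold follows the four-fifths convention from US adverse impact analysis; the right fairness criterion always depends on the decision context:

```python
import pandas as pd

def intersectional_impact(df: pd.DataFrame, attrs: list[str], outcome: str = "selected") -> pd.DataFrame:
    """Selection rate for each intersectional group, relative to the best-off group.

    An impact ratio below 0.8 flags potential adverse impact (four-fifths rule).
    Small groups produce noisy rates, so group sizes are reported alongside ratios.
    """
    rates = (
        df.groupby(attrs)[outcome]
        .agg(selection_rate="mean", n="size")
        .reset_index()
    )
    rates["impact_ratio"] = rates["selection_rate"] / rates["selection_rate"].max()
    rates["flag"] = rates["impact_ratio"] < 0.8
    return rates.sort_values("impact_ratio")

# Each row of df is one decision, with protected-attribute columns and a binary outcome:
# report = intersectional_impact(decisions, attrs=["gender", "age_band"])
```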
04
Finding Classification
1 Week
Classify each finding by severity using the finding severity framework (detailed later in this guide). Prioritize findings for remediation. Draft the preliminary finding list and validate findings with the model owner before finalizing. The model owner has the opportunity to provide context that may affect finding severity, but does not have the authority to remove findings from the audit record.
05
Reporting
1 Week
Produce the audit report. The report must include: executive summary, audit scope and methodology, summary of findings by severity, detailed finding descriptions with evidence references, remediation recommendations with prioritization, and the model owner's formal response. The audit report is a governance artifact and must be retained for the period required by applicable regulation.
06
Remediation Tracking
Ongoing
Track remediation of critical and high-severity findings. Critical findings require a remediation plan within 5 business days of report finalization and completion within 30 days or interim risk controls until completion. High-severity findings require remediation within 90 days. Remediation status is reviewed at the AI governance committee's next scheduled meeting following the report. Findings that are not remediated on schedule escalate to the Chief AI Officer.
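
The deadline arithmetic above is simple enough to encode directly in remediation tracking tooling. A minimal sketch, assuming weekends-only business-day handling (a real implementation should also account for holidays):

```python
from datetime import date, timedelta

def add_business_days(start: date, days: int) -> date:
    """Advance by N business days, skipping weekends (holidays ignored for simplicity)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday-Friday
            days -= 1
    return current

def remediation_deadlines(severity: str, report_date: date) -> dict:
    """Due dates per the severity framework; low-severity items ride the release cycle."""
    if severity == "critical":
        return {"plan_due": add_business_days(report_date, 5),
                "completion_due": report_date + timedelta(days=30)}
    if severity == "high":
        return {"plan_due": report_date + timedelta(days=30),
                "completion_due": report_date + timedelta(days=90)}
    if severity == "medium":
        return {"plan_due": report_date + timedelta(days=90),
                "completion_due": report_date + timedelta(days=180)}
    return {"plan_due": None, "completion_due": None}

def escalates_to_cao(deadlines: dict, today: date, remediated: bool) -> bool:
    """Findings not remediated on schedule escalate to the Chief AI Officer."""
    due = deadlines["completion_due"]
    return due is not None and not remediated and today > due
```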

Need an Independent AI Audit?

Our practitioners conduct AI audits that satisfy regulatory requirements and produce findings that boards and regulators take seriously. Start with a scoping conversation.

Schedule an Audit Scoping Call →

The Finding Severity Framework

AI audit findings must be classified by severity to prioritize remediation and communicate risk to governance bodies. This four-tier framework is aligned with regulatory expectations and governance best practice.

CRITICAL
  Definition: Active harm occurring or imminent. Regulatory violation confirmed. Fundamental fairness failure affecting a protected class. Immediate suspension of model operation may be required.
  Remediation timeline: 5 days for plan; 30 days for remediation or interim controls.
  Escalation: Board notification required. Regulatory notification may be required.

HIGH
  Definition: Significant risk of harm or regulatory violation. Fairness metrics materially outside defined bounds. Material performance degradation. Documentation insufficient for regulatory review.
  Remediation timeline: 30-day remediation plan; 90-day completion.
  Escalation: Chief AI Officer notification. Governance committee review within 2 weeks.

MEDIUM
  Definition: Risk present but not immediately materializing. Process gaps that could become high-severity under adverse conditions. Documentation incomplete but governance controls functioning.
  Remediation timeline: 90-day remediation plan; 180-day completion.
  Escalation: Governance committee tracking. Monthly progress reporting.

LOW
  Definition: Best practice not followed. Minor documentation gaps. Efficiency or quality improvements available that do not constitute risk exposures.
  Remediation timeline: Next release cycle.
  Escalation: Model owner tracking. Annual audit follow-up.

Documentation Standards That Hold Up

AI audit documentation must satisfy two audiences: internal governance leadership who need to manage risk, and regulators who may review documentation during examination. The documentation standards must be calibrated to the more demanding of the two, which in regulated industries is almost always the regulator.

The audit record for each system must contain the elements below (a minimal schema sketch follows the list):

  • System identification: System name, version, deployment environment, purpose, affected population, and decision type.
  • Audit scope and limitations: What was evaluated, what evaluation methodology was used, and what limitations applied to the evaluation (data availability, sample sizes, access restrictions).
  • Evidence catalog: Index of all evidence collected, with source, date, and how each piece of evidence was used in the evaluation.
  • Quantitative results: All quantitative evaluation results in tabular form, with comparison to defined acceptance criteria.
  • Finding register: All findings with severity classification, evidence basis, remediation recommendation, model owner response, and remediation status.
  • Auditor attestation: Auditor identity (internal or external), independence declaration, and attestation that the evaluation was conducted in accordance with the defined methodology.
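
These requirements map naturally onto a structured record. A minimal sketch as Python dataclasses; the field names are illustrative, not a mandated schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceItem:
    source: str
    date_collected: str        # ISO 8601 date
    used_for: str              # audit dimension the evidence supports
    status: str                # "obtained" or "unavailable" -- both must be recorded

@dataclass
class Finding:
    description: str
    severity: str              # critical / high / medium / low
    evidence_refs: list[int]   # indexes into the evidence catalog
    recommendation: str
    owner_response: str = ""
    remediation_status: str = "open"

@dataclass
class AuditRecord:
    system_name: str
    version: str
    deployment_environment: str
    purpose: str
    affected_population: str
    decision_type: str
    scope_and_limitations: str
    evidence_catalog: list[EvidenceItem] = field(default_factory=list)
    quantitative_results: dict = field(default_factory=dict)  # metric -> observed vs. required
    finding_register: list[Finding] = field(default_factory=list)
    auditor: str = ""          # identity, internal or external
    independence_declaration: str = ""
    methodology_attestation: str = ""
```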

A critical test for documentation adequacy: could a regulator who had never seen this system reconstruct what it does, who it affects, how its risks were assessed, and what was done about identified problems? If not, the documentation is insufficient.


AI Governance Handbook

Complete AI audit templates, finding classification frameworks, and documentation checklists aligned with EU AI Act and sector-specific regulatory requirements.

Download Free →

Setting the Right Audit Cadence

Risk-based audit cadence means high-risk systems are audited more frequently than low-risk systems. The baseline cadence for most enterprise AI governance programs:

High-risk systems (EU AI Act high-risk classification, or systems where errors cause material individual harm): comprehensive audit annually, with continuous automated monitoring and quarterly metric review in between. Any significant change to model architecture, training data, or deployment context triggers an out-of-cycle audit.

Medium-risk systems: comprehensive audit every two years, with annual sampling review. Material incidents trigger out-of-cycle audit.

Low-risk systems: audit incorporated into annual governance review. Automated monitoring provides the primary assurance.

The cadence must also account for drift that no change trigger catches. A model that appears stable on its original training distribution may be drifting silently. Any model that has not been re-evaluated in 24 months should be considered due for comprehensive review regardless of its risk classification, because the deployment environment changes even when the model does not.
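
A minimal sketch of this cadence logic, covering the baseline intervals, out-of-cycle triggers, and the 24-month backstop:

```python
from datetime import date, timedelta

# Baseline comprehensive-review intervals in days; low-risk systems are folded
# into the annual governance review.
INTERVALS = {"high": 365, "medium": 730, "low": 365}
BACKSTOP = timedelta(days=730)  # no model goes more than 24 months unreviewed

def next_review_due(risk_tier: str, last_review: date, out_of_cycle_trigger: bool = False) -> date:
    """Next comprehensive review date under the risk-based cadence.

    out_of_cycle_trigger: a significant change to architecture, training data,
    or deployment context (high-risk) or a material incident (medium-risk).
    """
    if out_of_cycle_trigger:
        return date.today()
    scheduled = last_review + timedelta(days=INTERVALS[risk_tier])
    return min(scheduled, last_review + BACKSTOP)
```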

Internal Versus External Auditors

The independence question is central to AI audit credibility. An audit conducted entirely by the team that built and operates the model does not produce independent assurance. It produces self-assessment. Self-assessment has value, but it is not an audit and should not be represented as one.

The practical resolution for most enterprises is a hybrid model. Internal audit teams, working with AI governance expertise, conduct the routine annual audit cycle. Independent external auditors are brought in for: initial program establishment, high-stakes systems where regulatory credibility is paramount, any system that is the subject of regulatory inquiry, and periodic validation that the internal audit methodology is sound.

External auditors provide independence and regulatory credibility that internal teams cannot self-certify. They also bring pattern recognition from auditing AI programs across multiple organizations that is not available to internal teams working within a single enterprise.

For the governance framework into which auditing fits, see our enterprise AI governance framework guide. For how auditing connects to bias management, see our AI bias and fairness guide. For the responsible AI operating model context, see our responsible AI practical guide.

Building the Audit Function

Standing up an enterprise AI audit capability requires three things: methodology, tooling, and people. Enterprises that approach this in the wrong sequence spend 18 months building the wrong thing.

Start with methodology. Define what audits evaluate, how they evaluate it, what evidence is required, and how findings are classified. The methodology document should be reviewed by legal and compliance before any audit work begins, because it defines the standard against which systems will be assessed and shapes the legal defensibility of the audit record.

Then address tooling. Audit tooling includes bias evaluation libraries, model interpretability tools, performance monitoring infrastructure, and documentation management systems. Select tooling that integrates with your existing MLOps environment rather than creating parallel data pipelines that are expensive to maintain.

Then hire or develop the people. An AI audit function needs data scientists who understand model evaluation methodology, governance professionals who understand regulatory requirements, and an audit lead with enough organizational authority to report findings without fear of reprisal from business units whose models receive critical findings. That last requirement is often the hardest to satisfy.

To explore how our advisors can help establish or strengthen your AI audit capability, visit our AI Governance service page or start with our free AI assessment.

Build AI Audit Capability That Satisfies Regulators

Our senior advisors help enterprises establish AI audit programs, conduct independent audits, and build the documentation infrastructure that satisfies regulatory review.