
Evaluating AI Vendors: 20 Questions That Reveal the Truth

There's a chasm between vendor demos and production reality. In demos, vendors show you their best case: cherry-picked data, ideal scenarios, polished interfaces. In production, your data is messier, your use case doesn't match their template, performance degrades, and governance becomes an afterthought.

This is why 73 percent of enterprises report a significant performance gap between vendor promises and post-signature results. The evaluation process fails because it's designed to sell, not to inform. Reference calls go to vendors' best customers. Proof of concepts use unrealistic data. Nobody asks hard questions about governance, liability, or exit costs.

This guide gives you 20 questions that cut through vendor narratives and reveal operational truth. These are the questions we ask in every vendor assessment we conduct. They're organized by category: performance, governance, commercial, and support. Ask these questions, listen carefully to the answers, and score responses against a rubric that rewards transparency and penalizes vagueness.

73% of enterprises report a performance gap versus what was promised
2.4x higher ROI from structured evaluation

Why Most Vendor Evaluations Fail: Three Structural Problems

Before we get to the questions, understand why traditional evaluation processes deliver poor signals. Fix these structural problems first.

Problem 1: Reference Checks on Their Best Customers

Vendors always refer you to their happiest customers. These are the accounts with the simplest use cases, the most resources, and the best outcomes. They are not representative. A reference call with a vendor's best customer tells you what the vendor can do when conditions are perfect. It doesn't tell you what happens when your use case is complex or your data is messy.

Insist on a representative sample of customers, not just successes. Ask the vendor to provide a cross-section: small customers and large, simple implementations and complex, users who renewed and users who didn't. This is rarely offered, and the resistance itself is a signal.

Problem 2: Proof of Concept Data That Isn't Yours

The typical PoC uses vendor-provided sample data that's clean, labeled, and dimensioned exactly for the vendor's strengths. Your real data is messier, higher-dimensional, and often contradicts vendor assumptions. A meaningful PoC uses your actual data, even if it's anonymized or held in escrow.

Demand a PoC that uses your real data, your actual use case, and your typical performance metrics. This costs the vendor more engineering effort. The refusal to do this is a red flag.

Problem 3: Nobody Asks Governance Questions

Traditional evaluations focus on performance: latency, accuracy, throughput. These matter. But governance matters more to enterprise risk. Who can access your data? How is it used? What happens if there's a breach? What's the liability structure? These questions rarely get asked because they're not exciting and they often expose vendor weaknesses.

We'll address this below with dedicated governance questions. The fact that nobody usually asks them is precisely why they're so important.

Performance and Production Questions (Q1-Q6)

Start with performance, but ask questions that reveal production constraints, not just benchmark results.

Question 1
What is the 95th percentile latency at your 99th percentile load, using realistic data dimensionality?
Vendors quote average latency under ideal conditions. You need tail latency (95th and 99th percentile) under real-world load, measured against data with dimensionality similar to yours. The phrase "realistic data dimensionality" is deliberately open-ended: a serious vendor will ask what your data actually looks like, at which point you share representative samples.
Red flags: They quote only average latency, or latency at low load. They don't want to test on your actual data. They claim all latencies are identical (this is statistically implausible).
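
As a reference point for what a concrete answer looks like, here is a minimal sketch of how tail latency is computed from raw per-request latency logs. It assumes you have a load test run at your expected peak traffic; the file name and variable names are illustrative, not taken from any vendor's tooling.

```python
import numpy as np

# Hypothetical per-request latencies (ms) collected during a load test
# run at your expected 99th-percentile traffic level; one value per request.
latencies_ms = np.loadtxt("load_test_latencies.csv")

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")

# A vendor quoting only the mean hides the tail: the mean can look fine
# while p99 is several times higher.
print(f"mean={latencies_ms.mean():.0f}ms")
```

A vendor who can hand you numbers like these, measured on data shaped like yours, has answered the question. A vendor who can only quote the mean has not.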
Question 2
When accuracy declines (and it always does), what's your remediation process and typical timeline?
Production models degrade. Data drift happens. Distribution shift is real. Vendors often gloss over this. You need to understand: Do they detect degradation automatically or does the customer discover it? What's the process to retrain? Do you have to engage professional services? How long does retrain typically take?
Red flags: "Accuracy never degrades" is false. "We provide tools to detect it" without specifying how often they're actually used in production. Retraining requires a 6+ month engagement with their professional services team.
Question 3
Do you guarantee inference SLAs? If yes, what's the penalty for missing them, and has any customer ever actually received it?
SLA commitments reveal how confident vendors are. If they won't guarantee SLAs, that's a signal. If they offer SLAs but the penalty is negligible compared to your switching cost, the SLA is worthless. The fact that customers rarely receive SLA credits in practice (62% of enterprises report never receiving one) means SLAs are designed as liability protection, not as guarantees.
Red flags: "We offer SLAs" but the penalty is 5% of monthly fees (too low to matter). SLA exclusions for "force majeure" that are vaguely defined. Customers report never receiving SLA credits despite outages.
Question 4
How do you handle data imbalance, and what happens when one class represents less than 5% of your training set?
Real-world data is imbalanced. Fraud datasets are 99.9% non-fraud; churn datasets are 95% retained customers. Vendors' algorithms often perform poorly on minority classes, and their solutions vary wildly: some have principled approaches, others have hacks. This question reveals technical depth.
Red flags: "We oversample the minority class" (this causes overfitting). "We use synthetic data" (synthetic data can introduce bias). No clear methodology for handling imbalance.
Question 5
What is your typical model training time for a model with 1 million training examples and 500 features, and what factors cause it to vary?
Training time directly impacts your iteration speed. It also reveals computational efficiency. Vendors with sophisticated optimization can train in hours; vendors with brute-force approaches take days. Variance factors (data type, feature sparsity, hardware) reveal whether they understand the problem domain.
Red flags: "It depends" without any quantitative guidance. Training time is weeks for a problem this size. They don't mention hardware tier or scaling options.
Question 6
If we need to serve our model on our own infrastructure (not yours), what formats do you support, and do they require your runtime?
Independence requires the ability to deploy models outside the vendor's infrastructure. Open standards like ONNX require only standard runtimes. Proprietary formats require the vendor's runtime, creating infrastructure lock-in. This question reveals portability.
Red flags: "You can only run models on our infrastructure." "We support ONNX export but models are significantly less accurate" (this signals poor architecture). Exports are slow or require expensive professional services.

Security and Governance Questions (Q7-Q12)

Governance is where vendors often stumble because they prioritize velocity over protection. These questions expose gaps.

Question 7
Do you retain any rights to our training data, and do you use it to improve your models or train other customers' models?
Data ownership is foundational. If vendors retain any rights to your data or use it for their model improvements, you have a liability. Your data is your competitive advantage. The answer must be an unambiguous "no." Any equivocation is a rejection criterion.
Red flags: "We use anonymized data" (anonymization is reversible). "Only for improving your model" (this still implies your data leaves your control). Contractual language that's evasive about ownership.
Question 8
Can we audit your data handling and security practices, and how often do you conduct independent security audits (SOC 2, ISO 27001)?
Most enterprises never ask this question, which means most vendors never feel pressure to be compliant. If they've done security audits, they'll have certificates and will be eager to show them. If they haven't, they're either young (acceptable) or hiding something (red flag).
Red flags: "We're secure but no formal audit." "Audits are expensive and we're a startup." Refusal to provide audit reports. SOC 2 reports with significant findings that remain unaddressed.
Question 9
What is your incident response process, what's your notification timeline for breaches, and have you ever had a security incident affecting customer data?
Transparency here reveals maturity. Vendors who've had incidents and have learned from them are more trustworthy than vendors who claim perfect security. The notification timeline is crucial: 72 hours is standard (regulatory minimum), 24 hours is best practice.
Red flags: "We've never had an incident" (statistically implausible for large platforms). Notification timeline of more than 72 hours. Vague incident response process.
Question 10
How do you handle model explainability and bias auditing? What tools do you provide, and what's the customer's responsibility?
Explainability and bias auditing are increasingly required by regulation (the EU AI Act and similar emerging frameworks). Vendors have different philosophies: some provide tooling, some expect customers to bring third-party tools. Understand the split of responsibility and whether the vendor's approach integrates with your compliance framework.
Red flags: "Explainability is not our responsibility." No bias auditing tools. Black-box models with no interpretability option.
Question 11
What happens to our data if we terminate the contract? What's your data deletion timeline, and do you provide certification of deletion?
Data deletion is a contract term that's often overlooked. Some vendors claim data is deleted immediately; others want 30 to 90 days. You need contractual certainty and ideally certification that deletion actually happened. This matters for compliance (GDPR, etc.).
Red flags: "Data deletion is part of a 90-day transition process." No certification of deletion offered. Data deleted but backups retained "for disaster recovery" indefinitely.
Question 12
If we discover a vulnerability in your system, what's your vulnerability disclosure process and your typical time to patch?
Responsible disclosure reveals maturity. Vendors with robust disclosure processes patch faster and communicate better. Vendors without a process often ignore reports or retaliate against researchers. This is a proxy for engineering maturity.
Red flags: "We don't have a formal disclosure process." Time to patch is measured in months. They've never patched a disclosed vulnerability.

Commercial and Contract Questions (Q13-Q16)

Commercial terms determine your financial exposure and optionality. Don't skip these because they feel dry.

Question 13
What is your total cost of ownership model, including infrastructure, support, training, and professional services?
Vendors quote product cost, not total cost. Infrastructure costs (compute, data storage) can exceed product cost. Support tiers, training, and professional services add more. Demand a full cost model for your anticipated usage. Ask for comparison to similar competitors so you can benchmark.
Red flags: They quote only product cost and deflect on TCO. Infrastructure costs are "variable and depend on your usage" (true but unhelpful; ask for estimates). Professional services are mandatory for anything beyond basic use.
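
A hedged sketch of what a full cost model looks like over a three-year horizon follows; every figure is a placeholder to be replaced with the vendor's actual quotes and your own infrastructure estimates.

```python
# Illustrative three-year TCO model; all figures are placeholders, not quotes.
years = 3
costs = {
    "product license (per year)":      250_000,
    "compute + storage (per year)":    180_000,
    "support tier (per year)":          60_000,
    "training (one-off)":               40_000,
    "professional services (one-off)": 120_000,
}

recurring = sum(v for k, v in costs.items() if "per year" in k)
one_off = sum(v for k, v in costs.items() if "one-off" in k)
tco = recurring * years + one_off

print(f"3-year TCO: ${tco:,.0f}")
print(f"Share that is not the product license: "
      f"{1 - costs['product license (per year)'] * years / tco:.0%}")
```

Even with placeholder numbers, the point stands: the product license is often only about half of what you will actually pay. Demand the vendor fill in this model for your anticipated usage.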
Question 14
Can we export our training data and fine-tuned models at will, in open formats, at no additional cost?
This is the litmus test for lock-in. If vendors require paid professional services to export data or models, or if export is only available in proprietary formats, they've chosen lock-in over customer flexibility. The answer must be an unambiguous "yes, at no additional cost."
Red flags: "Data export requires engagement with professional services." "Export in open formats costs extra." "Exports take weeks to arrange." Models only available in vendor-proprietary format.
Question 15
What is your historical pattern of price increases, and how are they applied to existing contracts at renewal?
Vendors often lock in low prices, then increase aggressively at renewal. Understand the historical pattern. Does the vendor increase by 10-20% at renewal? Is there a lock-in period? Can you negotiate increases, or are they automatic? This directly impacts your TCO.
Red flags: "Average increase of 15% at renewal" without ability to negotiate. Multi-year contracts with auto-renewal at higher tiers. Historical data shows vendors increasing 30%+ after lock-in period ends.
Question 16
What are the termination provisions? Can we terminate for convenience, and what's the notice period and associated costs?
Termination flexibility protects you if the vendor underperforms. Ideally, you can terminate for convenience with 30 days' notice and no penalty. More restrictive terms signal lock-in. Multi-year, non-cancellable contracts are increasingly rare but still exist. Avoid them.
Red flags: "No termination for convenience." Multi-year contract with cancellation penalties. "You can terminate but only after 12 months."

Support and Long-Term Engagement Questions (Q17-Q20)

Question 17
What is your SLA for support response time by severity level, and what does "resolution" actually mean?
Support SLAs often hide behind vague language. "Resolution" might mean "we opened a ticket" not "we fixed your problem." Understand the severity levels (P1: system down vs. P4: feature request) and the commitment at each level. P1 should be 1-hour response and 4-hour resolution commitment.
Red flags: No SLAs offered, or "best effort." Resolution means "we respond to your ticket." P1 response time is more than 4 hours. Support is only available during business hours.
Question 18
How is your product roadmap prioritized? Do enterprise customers influence roadmap decisions, and how?
Roadmap influence varies. Some vendors say "customer input is considered but we have our own vision." Others put enterprise customers on advisory boards with meaningful influence. Understand where you fall. If you have a critical requirement that's not on the roadmap, can you influence it, or are you stuck waiting for the vendor's priorities?
Red flags: "Roadmap is set by leadership and customer input is not considered." Your critical requirements have been requested by other customers but remain un-prioritized for years. The roadmap is opaque.
Question 19
How do you communicate product changes and breaking changes? What's the deprecation policy for APIs and features?
Communication and deprecation policy determine operational friction. Vendors that communicate changes 3+ months in advance minimize disruption; vendors that surprise customers with breaking changes create crises. The deprecation policy should guarantee 24 months of notice for major breaking changes.
Red flags: "We communicate changes in the release notes when they ship." No deprecation period for breaking changes. No communication channel besides product announcements.
Question 20
If your company fails or is acquired, what happens to our account, our data, and our models?
Vendors rarely discuss this, but it's your ultimate tail risk. Demand contractual answers: Do we have source code escrow for custom models? How do we access our data if the vendor shuts down? What's the timeline? This should be written into the contract.
Red flags: "We're not planning to fail or be acquired." No source code escrow offered. Data lock-in means if they fail, your data is inaccessible.

Scoring Vendor Responses: The Rubric

Asking these questions is half the work. Scoring responses separates vendors who care about transparency from vendors who hide behind jargon and evasion.

Score | Response type | How to recognize it
10/10 | Specific, quantified, backed by data | "95th percentile latency is 120ms at our 99th percentile load. We tested this with anonymized customer data similar to your use case. Here's the report."
8/10 | Specific, mostly quantified, some hedging | "Latency is typically 100-150ms at load. It depends on data dimensionality, which varies by customer."
6/10 | Qualitative, defensible, but vague | "Latency is generally low. We've been happy with performance in production."
4/10 | Evasive, redirects to other topics | "Different customers have different needs. Let me tell you about our feature set instead."
2/10 | Contradictory or defensive | "Nobody else asks these questions. Our customers are happy." (deflects rather than answering the specific question)
0/10 | Refusal to answer or false claims | "We've never had a security incident" or "We can't share that information for competitive reasons."

Use this rubric to score each answer. Average scores across all 20 questions. Vendors scoring 8+ are transparent and confident. Vendors scoring 5-7 are hedging or hiding something. Vendors scoring below 5 should be rejected, regardless of features.
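
If you prefer to tally this in code rather than a spreadsheet, a minimal sketch looks like the following; the vendor names and scores are placeholders, not real evaluation data.

```python
# Minimal sketch of applying the rubric: one 0-10 score per question,
# one list of 20 scores per vendor. Values below are placeholders.
scores = {
    "Vendor A": [9, 8, 7, 8, 9, 8, 10, 9, 8, 7, 8, 9, 8, 9, 7, 8, 8, 7, 8, 9],
    "Vendor B": [6, 5, 4, 6, 5, 6, 4, 5, 6, 5, 4, 6, 5, 4, 6, 5, 5, 6, 4, 5],
}

def verdict(avg: float) -> str:
    if avg >= 8:
        return "transparent and confident"
    if avg >= 5:
        return "hedging or hiding something"
    return "reject regardless of features"

for vendor, s in scores.items():
    assert len(s) == 20, "one score per question"
    avg = sum(s) / len(s)
    print(f"{vendor}: {avg:.1f}/10 ({verdict(avg)})")
```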
