June 22, 2026 • 9 min read

Questions to Ask Before Hiring an AI SaaS Dev Partner

For: A Series A B2B SaaS founder who has approved budget to build a core AI feature but has no internal ML engineers, is now interviewing development partners, and cannot tell from portfolio decks alone whether a vendor has actually shipped and maintained a production AI feature inside a multi-tenant SaaS product — versus fine-tuned a model on a side project and wrapped it in a demo.

If you have time for only one question when you interview an AI SaaS development partner, make it this: "Tell me about a time tenant A's data contaminated tenant B's model behavior in production. What did you change in the pipeline?" A vendor who has actually shipped and maintained multi-tenant AI will answer with a specific incident, a code change, and a monitoring rule they added. A vendor who has only built polished single-tenant demos will look confused or pivot to talking about model accuracy. That single exchange tells you more than any portfolio deck.

Below is the interview script we'd run if we were on your side of the table. Fifteen questions, organized by what they actually reveal. Use them to evaluate an AI product development firm before you sign an SOW.

Multi-tenancy and data isolation

1. How did you handle tenant data isolation in your last AI feature — at the storage layer, the embedding layer, and the inference layer?

Why it matters: Most teams understand row-level isolation in Postgres. Far fewer have thought through what happens when you build a shared vector index, or when a fine-tuned model has absorbed patterns from one customer's data that leak into another's suggestions.

Good answer sounds like: Namespaced vector collections per tenant (or per-tenant Pinecone/Weaviate indexes), tenant ID enforced at the retrieval layer before the prompt is assembled, separate embedding pipelines for customers on enterprise SKUs, and an explicit policy on whether tenant data is ever used for shared model improvement.

Red flag: "We just pass the tenant ID in the prompt context." That's not isolation, that's hope.
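To make the difference concrete, here is a minimal sketch of tenant scoping enforced at the retrieval layer, before any prompt is assembled. It uses a toy in-memory index rather than a real vector database; `TenantScopedIndex`, `Chunk`, and `build_prompt` are hypothetical names for illustration, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str
    embedding: tuple[float, ...]


def dot(a, b):
    """Dot product used as a toy similarity score."""
    return sum(x * y for x, y in zip(a, b))


class TenantScopedIndex:
    """Toy in-memory vector index with one namespace per tenant (hypothetical)."""

    def __init__(self):
        self._namespaces: dict[str, list[Chunk]] = {}

    def add(self, chunk: Chunk) -> None:
        self._namespaces.setdefault(chunk.tenant_id, []).append(chunk)

    def search(self, tenant_id: str, query: tuple[float, ...], k: int = 3) -> list[Chunk]:
        # Only the caller's namespace is scanned; other tenants' chunks are never candidates.
        candidates = self._namespaces.get(tenant_id, [])
        return sorted(candidates, key=lambda c: dot(c.embedding, query), reverse=True)[:k]


def build_prompt(index: TenantScopedIndex, tenant_id: str, question: str, query_embedding) -> str:
    # tenant_id comes from the authenticated session, not from user-supplied text.
    chunks = index.search(tenant_id, query_embedding)
    context = "\n".join(c.text for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


index = TenantScopedIndex()
index.add(Chunk("tenant_a", "Tenant A invoice field: po_number", (1.0, 0.0)))
index.add(Chunk("tenant_b", "Tenant B invoice field: order_ref", (1.0, 0.0)))
prompt_b = build_prompt(index, "tenant_b", "Which field holds the order id?", (1.0, 0.0))
```

The point of the sketch is where the filter lives: the model never sees tenant A's chunk when serving tenant B, so there is nothing for a prompt instruction to "remember" to exclude.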

2. Have you ever had a model behave differently for one tenant because of another tenant's data? What did you change?

Why it matters: This is the question from the lede. It's diagnostic. Anyone who has run a production multi-tenant AI system has hit this — through shared retrieval indexes, through feedback loops, through cold-start defaults learned on early customers.

Good answer: A specific incident. "We saw tenant B's autocomplete suggesting field names from tenant A's schema. Root cause was a shared embedding cache keyed only on input hash. We added tenant ID to the cache key and added a regression test."

Red flag: "We've never seen that." Either they haven't shipped multi-tenant AI, or they aren't measuring.
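The fix in that example answer can be sketched in a few lines. This is an illustration under assumptions, not the code from any real incident: `EmbeddingCache`, `cache_key`, and `fake_embed` are hypothetical, and the stand-in embedding function returns tenant-specific values so a cross-tenant cache hit would be visible.

```python
import hashlib


def cache_key(tenant_id: str, text: str) -> str:
    """Key on tenant ID plus input hash, not on the input hash alone."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{tenant_id}:{digest}"


class EmbeddingCache:
    """Minimal in-memory cache in front of an embedding function (hypothetical)."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}

    def get(self, tenant_id: str, text: str):
        key = cache_key(tenant_id, text)
        if key not in self._store:
            self._store[key] = self._embed_fn(tenant_id, text)
        return self._store[key]


def fake_embed(tenant_id: str, text: str):
    # Stand-in for a real embedding call. The result depends on the tenant
    # so that a shared cache entry would show up as a wrong value.
    return (tenant_id, len(text))


def test_cache_is_tenant_scoped():
    # Regression test: identical input from two tenants must not share an entry.
    cache = EmbeddingCache(fake_embed)
    a = cache.get("tenant_a", "customer_name")
    b = cache.get("tenant_b", "customer_name")
    assert a != b
    assert cache_key("tenant_a", "customer_name") != cache_key("tenant_b", "customer_name")
```

A vendor who has lived through this kind of incident will usually be able to show you the equivalent of that last function in their own test suite.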

3. How do you handle a tenant who wants their data excluded from any shared learning?

Why it matters: Enterprise buyers will ask this in security review. If your vendor hasn't thought about it, your sales cycle gets longer.

Good answer: Per-tenant flags that route around shared fine-tuning, documented in the architecture, and surfaced in the admin UI. Bonus if they mention DPA language.

Red flag: Treating it as a future problem.
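One way a per-tenant exclusion flag might look in the training data path is sketched below. The names (`TenantSettings`, `allow_shared_learning`, `select_shared_training_examples`) are hypothetical, and the sketch assumes an opt-in default, which is a policy choice you would confirm with your own legal and security reviewers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantSettings:
    tenant_id: str
    # Assumption: tenants are excluded from shared learning unless they opt in.
    allow_shared_learning: bool = False


@dataclass(frozen=True)
class Example:
    tenant_id: str
    prompt: str
    completion: str


def select_shared_training_examples(examples, settings_by_tenant):
    """Keep only examples from tenants that explicitly allow shared learning."""
    selected = []
    for ex in examples:
        settings = settings_by_tenant.get(ex.tenant_id)
        # Tenants with no settings record are treated as excluded.
        if settings is not None and settings.allow_shared_learning:
            selected.append(ex)
    return selected


settings = {
    "tenant_a": TenantSettings("tenant_a", allow_shared_learning=True),
    "tenant_b": TenantSettings("tenant_b"),
}
examples = [
    Example("tenant_a", "categorize: office chairs", "furniture"),
    Example("tenant_b", "categorize: aspirin", "pharmacy"),
    Example("tenant_c", "categorize: diesel", "fuel"),
]
shared = select_shared_training_examples(examples, settings)
```

Here only tenant A's example reaches the shared dataset; tenant B opted out by default and tenant C has no record at all.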

Production reality, not demos

4. Walk me through the last AI feature you shipped that's still in production. Who owns it now?

Why it matters: Shipping is one skill. Keeping an AI feature accurate for six months across hundreds of tenants is a different one.

Good answer: Names a real product, describes the feature, can tell you what changed between v1 and what's running now, and explains the handoff (or ongoing support contract).

Red flag: The case study is a hackathon-style POC, or "the client took it in-house and we don't have visibility."

5. What's your evaluation harness? How do you know the model is still good a month after launch?

Why it matters: Model quality degrades after launch. Providers update models, user behavior shifts, and prompts drift as they get edited. Without an eval harness, you're guessing.

Good answer: A golden set of inputs with expected outputs, scheduled regression runs, LLM-as-judge for subjective outputs with human spot-checks, and dashboards on hallucination rate, latency, and refusal rate per tenant cohort.

Red flag: "We test it before deploying." That's not evaluation, that's QA.

6. How do you handle model version upgrades — moving from one foundation model to the next?

Why it matters: The model you launch on will be deprecated. The vendor needs a plan that doesn't involve a panicked rewrite.

Good answer: Abstraction layer over model providers, prompt versioning checked into git, A/B harness to compare old vs new model on the eval set before cutting traffic, ability to roll back per tenant.

Red flag: Hard-coded model names scattered across the codebase.
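A thin abstraction layer with per-tenant rollback can be sketched as follows. `ModelRouter` and the model names are hypothetical, and the providers are plain callables standing in for real provider SDK clients.

```python
class ModelRouter:
    """Routes each tenant to a model; the default can move while individual tenants stay pinned."""

    def __init__(self, providers, default_model):
        self._providers = dict(providers)
        self._check(default_model)
        self._default = default_model
        self._pins = {}

    def _check(self, model):
        if model not in self._providers:
            raise KeyError(f"unknown model: {model}")

    def set_default(self, model):
        self._check(model)
        self._default = model

    def pin(self, tenant_id, model):
        # Per-tenant override, e.g. to roll one tenant back after a bad upgrade.
        self._check(model)
        self._pins[tenant_id] = model

    def model_for(self, tenant_id):
        return self._pins.get(tenant_id, self._default)

    def complete(self, tenant_id, prompt):
        return self._providers[self.model_for(tenant_id)](prompt)


router = ModelRouter(
    {"model-v1": lambda p: f"v1:{p}", "model-v2": lambda p: f"v2:{p}"},
    default_model="model-v1",
)
router.set_default("model-v2")      # cut traffic over to the new model
router.pin("tenant_b", "model-v1")  # roll a single tenant back
```

Because model names live in one place, an upgrade becomes a configuration change backed by the eval set instead of a search through the codebase.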

Cost, latency, and the boring engineering

7. What's your approach to controlling inference cost as we add tenants?

Why it matters: AI features have variable unit economics. A vendor who hasn't thought about this will quietly destroy your gross margin.

Good answer: Caching strategy (semantic and exact), routing cheap queries to smaller models, batching where latency allows, per-tenant usage caps, and a clear telemetry story so you can attribute cost back to specific customers.

Red flag: "We'll use GPT-4 for everything."
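Cost attribution, model routing, and usage caps fit together roughly as in the sketch below. The prices are made up for illustration, the word-count routing rule is deliberately naive, and `CostMeter`, `choose_model`, and `BudgetExceeded` are hypothetical names. Amounts are tracked in integer micro-dollars to avoid floating-point drift.

```python
from collections import defaultdict

# Assumption: illustrative prices in micro-USD per 1,000 tokens, not real provider pricing.
PRICE_MICRO_USD_PER_1K_TOKENS = {"small": 500, "large": 10_000}


class BudgetExceeded(Exception):
    """Raised when a charge would push a tenant past its usage cap."""


def choose_model(prompt: str) -> str:
    # Naive routing rule for the sketch: short prompts go to the cheaper model.
    return "small" if len(prompt.split()) <= 20 else "large"


class CostMeter:
    """Attributes inference spend to tenants and enforces a per-tenant cap."""

    def __init__(self, cap_micro_usd: int):
        self._cap = cap_micro_usd
        self.spend = defaultdict(int)

    def charge(self, tenant_id: str, model: str, tokens: int) -> int:
        cost = tokens * PRICE_MICRO_USD_PER_1K_TOKENS[model] // 1000
        if self.spend[tenant_id] + cost > self._cap:
            raise BudgetExceeded(tenant_id)
        self.spend[tenant_id] += cost
        return cost


meter = CostMeter(cap_micro_usd=20_000)
meter.charge("tenant_a", choose_model("short question"), tokens=2_000)
meter.charge("tenant_b", "large", tokens=1_500)
```

With spend recorded per tenant, you can answer the margin question directly: which customers cost more to serve than they pay.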

8. How do you handle latency for AI features that block a user action?

Why it matters: A 4-second autocomplete is unusable. Streaming, optimistic UI, and fallback strategies are real engineering work.

Good answer: Streaming-first UX, fallback to smaller models on timeout, prewarmed embeddings, p95 latency targets in the SLA.
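The "fallback to smaller models on timeout" part of that answer can be sketched with the standard library's `asyncio.wait_for`. The two model functions are stand-ins that simulate a slow and a fast model; the function names and the timeout value are assumptions for illustration.

```python
import asyncio


async def complete_with_fallback(primary, fallback, prompt: str, timeout_s: float) -> str:
    """Try the primary model; if it misses the latency budget, use the fallback."""
    try:
        return await asyncio.wait_for(primary(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the primary call before raising, so it does not keep running.
        return await fallback(prompt)


async def slow_large_model(prompt: str) -> str:
    await asyncio.sleep(0.5)  # simulated slow response
    return f"large: {prompt}"


async def fast_small_model(prompt: str) -> str:
    return f"small: {prompt}"


def demo() -> str:
    return asyncio.run(
        complete_with_fallback(slow_large_model, fast_small_model, "hello", timeout_s=0.05)
    )
```

In this sketch the large model takes longer than the 50 ms budget, so the caller gets the small model's answer instead of waiting. A production version would also stream tokens and record how often the fallback fires.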

9. How do you handle prompt injection and jailbreak attempts in a multi-tenant context?

Why it matters: One tenant's malicious user shouldn't be able to extract another tenant's system prompts or trigger actions on their behalf.

Good answer: Input sanitization, output validation against a schema, structured outputs over free text where possible, separating tool-use authority from user input, and a logged incident from a real attempt.

Red flag: Treating safety as "the model handles it."
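Two of those controls, output validation against a schema and separating tool-use authority from user input, might be sketched like this. The action names, roles, and `parse_model_action` are hypothetical; the key assumption is that permissions and tenant identity come from the authenticated session, never from model output.

```python
import json

# Assumption: which actions each role may trigger, defined server-side.
ALLOWED_ACTIONS = {
    "viewer": {"search"},
    "admin": {"search", "delete_record"},
}


def parse_model_action(raw_output: str, user_role: str, session_tenant_id: str) -> dict:
    """Validate a model's structured output before anything is executed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc

    # Schema check: exactly two keys with the expected types.
    if not isinstance(data, dict) or set(data) != {"action", "args"}:
        raise ValueError("model output does not match the expected schema")
    if not isinstance(data["action"], str) or not isinstance(data["args"], dict):
        raise ValueError("model output has wrong field types")

    # Authority check: the role comes from the session, not from the prompt or the model.
    if data["action"] not in ALLOWED_ACTIONS.get(user_role, set()):
        raise PermissionError(f"role {user_role!r} may not call {data['action']!r}")

    # Any tenant_id the model supplies is dropped; the session's tenant is used instead.
    args = {k: v for k, v in data["args"].items() if k != "tenant_id"}
    return {"action": data["action"], "args": args, "tenant_id": session_tenant_id}


# A response influenced by an injected instruction tries to target another tenant.
injected = '{"action": "search", "args": {"query": "invoices", "tenant_id": "tenant_a"}}'
safe_call = parse_model_action(injected, user_role="viewer", session_tenant_id="tenant_b")
```

Even if an injected instruction convinces the model to name another tenant or a destructive action, the call that actually runs is constrained by what the session is allowed to do.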

The team and the contract

10. Who specifically will be on this engagement? Can I interview them?

Why it matters: Agency sales decks rarely match the team that does the work.

Good answer: Named engineers, their GitHub or prior work, and a yes to the interview.

11. What happens when the lead engineer leaves the agency mid-project?

Good answer: Documented architecture decisions, prompts in version control, runbooks, a named backup. The bus factor question is fair and tells you about their internal discipline.

12. Who owns the model weights, the prompts, the eval data, and the training data we generate?

Why it matters: If the answer isn't "you do," walk away. Unclear ownership of these artifacts is a frequent failure point in AI SaaS vendor due diligence.

13. How do you handle the line between "feature works" and "feature is trusted by users"?

Why it matters: Users abandon AI features they can't verify. A vendor who has shipped real products will talk about confidence scoring, citations, undo, and the UX around uncertainty.

Good answer: Surfacing source documents, confidence indicators, easy correction loops that feed back into the eval set.

14. Show me a postmortem from an AI feature you shipped that didn't work.

Why it matters: Everyone has failures. The good ones write them down.

Red flag: "We haven't had any."

15. What does your team think LLMs are bad at, and how do you steer clients away from those use cases?

Why it matters: A partner who can't say no will build you something that embarrasses you in six months.

Good answer: Specific examples — exact numerical reasoning, hard recall over large structured datasets, anything requiring real-time freshness without a retrieval layer, regulated decisions that need an audit trail the model can't provide.

How CodeNicely can help

If your situation looks like the founder profile at the top of this post (Series A, approved budget, no internal ML team, picking between vendor decks that all look similar), the most useful reference is our work on GimBooks. It's a YC-backed accounting SaaS with thousands of small-business tenants. The AI features there had to work across wildly different bookkeeping styles, languages, and document formats, with strict per-tenant data isolation and audit requirements. The constraints this post describes (multi-tenancy, heterogeneous accounts, post-ship accuracy pressure) are the ones we solved there.

For healthcare-adjacent or regulated workloads, HealthPotli is a closer match: AI-driven drug interaction checks where a wrong answer has real consequences, and where the eval harness matters more than the model choice. Both engagements involved long-running ownership, not ship-and-leave. You can see our broader approach on the AI Studio page.

We're happy to walk through any of the 15 questions above on a call and show you the actual artifacts — eval dashboards, prompt repos, isolation architecture diagrams — from those builds.

Frequently Asked Questions

How do I evaluate an AI product development firm if I don't have an ML background myself?

Bring a technical advisor for the interview, even a fractional one. Then focus on the operational questions in this list — multi-tenancy, evals, rollback, cost — rather than model architecture. A vendor who can explain those clearly to a non-ML founder is a better fit than one who shows off math.

What's the single biggest red flag when choosing an AI software development partner?

Inability to produce a postmortem or a specific production incident. Anyone who has actually shipped and maintained AI in a multi-tenant SaaS has war stories. A polished pitch with zero scars usually means the work was a prototype, not a product.

Should I hire a generalist dev shop that added an AI practice, or an AI-first firm?

Neither label matters. What matters is whether they've shipped a multi-tenant SaaS feature that stayed accurate after launch. Ask for the live URL and the eval harness. The org chart is irrelevant.

How long does it take to build an AI feature into an existing SaaS, and what should I budget?

This depends entirely on the feature scope, data readiness, and your existing architecture. We don't quote ranges in public because the honest answer requires looking at your codebase and data. Contact CodeNicely for a personalized assessment.

Who should own the prompts and eval data my vendor produces?

You should, contractually and in practice. Prompts, eval sets, fine-tuning datasets, and model artifacts should live in repositories you control. If a vendor pushes back on this, treat it as a sign that they plan to lock you in.
