AI Projects Need Better Requirements, Not Better Models
Most AI project failures aren't about the model. They're about unclear requirements. Here's how to write requirements for AI features that actually work.
The model is fine. The requirements aren’t.
Every week I see the same pattern. A team ships an AI feature, and it disappoints. Not because the model is bad, but because nobody agreed on what “good” meant before they started building.
The team spent three months prompt-engineering and fine-tuning. They benchmarked against GPT-4, Claude, and Llama. They optimized latency. They tweaked temperature settings. The model performed well on their test set.
Then it went to real users, and the feedback was: “This isn’t useful.”
The model did exactly what it was asked to do. The problem was that nobody clearly defined what it should do in the context of the actual product.
AI features still need requirements
There’s a strange belief in some teams that AI features are somehow exempt from normal product planning. “We’ll just throw the data at the model and see what it does.” That’s experimentation, not product development.
AI features need the same things every other feature needs:
- What’s the input? Not “user data.” Specifically what data, in what format, from where?
- What’s the expected output? A summary? A classification? A recommendation? How long, how structured, how confident?
- What does “good enough” mean? 80% accuracy? 95%? Does it depend on the use case?
- What’s the error tolerance? What happens when the model is wrong? Is a wrong answer worse than no answer?
- What’s the fallback? When the AI can’t produce a useful result, what does the user see?
Most teams I talk to can answer maybe one or two of these clearly. The rest is vibes.
Real example: AI-powered summarization
A team built an AI feature that summarized customer support tickets for managers. The model worked great technically. But the requirements were vague: “Summarize the ticket.”
After launch, the complaints rolled in:
- Some managers wanted a one-sentence summary. Others wanted three paragraphs.
- The summary didn’t include the customer’s sentiment, which managers considered essential.
- When the model couldn’t determine the issue (ambiguous tickets), it still produced a confident-sounding summary that was misleading.
- Nobody had defined what should happen for tickets in languages the model wasn’t trained on.
Every one of these problems could have been caught with proper requirements. “Summarize the ticket” is not a requirement. It’s a direction.
Better requirements would look like:
- Output: 2 to 3 sentence summary including the core issue, customer sentiment (positive/negative/neutral), and urgency level
- If confidence is below 70%, display: “Unable to summarize. Please review the full ticket.”
- Supported languages: English, Spanish, French. For other languages, show the original text without summary.
- Maximum response time: 3 seconds per ticket
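To make these requirements testable, the fallback and language rules can be expressed directly in code. This is a minimal sketch, not the team's actual implementation; the `Summary` fields and the `render_summary` helper are illustrative names, and the confidence score is assumed to come from the model or an auxiliary scorer.

```python
from dataclasses import dataclass

SUPPORTED_LANGUAGES = {"en", "es", "fr"}
CONFIDENCE_THRESHOLD = 0.70
FALLBACK_MESSAGE = "Unable to summarize. Please review the full ticket."

@dataclass
class Summary:
    text: str          # 2-3 sentences covering the core issue
    sentiment: str     # "positive" | "negative" | "neutral"
    urgency: str       # e.g. "low" | "medium" | "high"
    confidence: float  # model's confidence estimate, 0.0-1.0

def render_summary(ticket_text: str, language: str, summary: Summary) -> str:
    # Unsupported language: show the original text, no summary.
    if language not in SUPPORTED_LANGUAGES:
        return ticket_text
    # Low confidence: show the fallback message instead of a plausible guess.
    if summary.confidence < CONFIDENCE_THRESHOLD:
        return FALLBACK_MESSAGE
    return f"{summary.text} (sentiment: {summary.sentiment}, urgency: {summary.urgency})"
```

Notice that every vague complaint from the launch maps to one explicit branch here. That's the point of the requirement: each rule is now a line someone can review and a case someone can test.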
The “good enough” conversation
This is the hardest part for teams working on AI features, and it’s the most important.
Traditional features are binary. The button either works or it doesn’t. AI features exist on a spectrum. The model will sometimes be brilliant and sometimes be wrong, and you need to decide where on that spectrum is acceptable.
Have this conversation early:
- For a medical application, 95% accuracy might be dangerously low.
- For a social media content suggestion, 70% accuracy might be perfectly fine.
- For a code review tool, a false positive (flagging good code) might be acceptable, but a false negative (missing a bug) might not be.
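One way to make this conversation stick is to write the agreed thresholds down as data, not prose. The sketch below is hypothetical; the use-case names and numbers are placeholders for whatever your team actually agrees on. The key idea is encoding asymmetric error costs, such as a code review tool that tolerates false positives but not false negatives, as separate precision and recall floors.

```python
# Agreed "good enough" floors per use case: (minimum precision, minimum recall).
QUALITY_THRESHOLDS = {
    "medical_triage":     (0.99, 0.99),  # both error types are dangerous
    "content_suggestion": (0.70, 0.50),  # wrong suggestions are cheap
    "code_review":        (0.60, 0.95),  # false negatives (missed bugs) hurt most
}

def meets_threshold(use_case: str, precision: float, recall: float) -> bool:
    """Return True if measured quality clears the agreed floor for this use case."""
    min_precision, min_recall = QUALITY_THRESHOLDS[use_case]
    return precision >= min_precision and recall >= min_recall
```

A model that scores 90% precision but 80% recall would pass for content suggestions and fail for code review, which is exactly the distinction "general accuracy" hides.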
If you don’t define these thresholds before building, the team will optimize for the wrong thing. Usually they optimize for general accuracy when what matters is accuracy on specific, high-stakes cases.
Writing requirements for AI: a template
For every AI feature, document these six things:
- Input specification: What data goes in, including format, size limits, and edge cases
- Output specification: What comes out, including format, length, and structure
- Quality threshold: What accuracy or quality level is acceptable, measured how
- Error handling: What happens when the model fails, is uncertain, or produces harmful output
- Fallback behavior: What the user sees when AI can’t help
- Evaluation method: How you’ll measure whether the feature is working after launch
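If it helps to make the template concrete, the six items can live as a lightweight spec object the whole team reviews together. This is one possible shape, assuming a Python codebase; the field values below are illustrative, filled in for the ticket-summarization example, not prescriptive defaults.

```python
from dataclasses import dataclass

@dataclass
class AIFeatureSpec:
    input_spec: str
    output_spec: str
    quality_threshold: str
    error_handling: str
    fallback_behavior: str
    evaluation_method: str

ticket_summary_spec = AIFeatureSpec(
    input_spec="Raw ticket text, UTF-8, up to 10,000 characters",
    output_spec="2-3 sentence summary plus sentiment and urgency, as JSON",
    quality_threshold="90% of sampled summaries rated accurate by support leads",
    error_handling="Confidence below 70%: show fallback message, log ticket ID",
    fallback_behavior="'Unable to summarize. Please review the full ticket.'",
    evaluation_method="Weekly sample of 50 tickets scored by two reviewers",
)
```

Whether it's a dataclass, a wiki page, or a shared doc matters less than the constraint: all six fields must be filled in before building starts.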
This isn’t bureaucracy. It’s the difference between an AI feature that ships and one that ships and actually helps people.
The team alignment problem
Here’s what often happens without clear requirements: the ML engineer optimizes for model performance metrics. The product manager evaluates based on user satisfaction. The designer assesses the output formatting. Each person has a different definition of success, and nobody realizes it until late in the process.
Written requirements align everyone. The ML engineer knows what “good enough” means numerically. The designer knows the output constraints. The PM knows what to test with users. One document, shared understanding.
Better requirements first, better models second
If your AI feature isn’t working, resist the urge to immediately swap models or add more training data. First, check whether your requirements are clear enough to build against. Most of the time, they aren’t.
If you’re finding it hard to articulate requirements for AI features (and it is genuinely hard), tools like Projan can help. It’s built for exactly this kind of structured requirements gathering, walking teams through the right questions in a conversation and producing clear specs as output. Give it a try if your team is building AI features and struggling to align on what “done” looks like.
The models are getting better every month. Your requirements won’t improve unless you put in the work to write them.
Stop writing PRDs from scratch
Try Projan free for 14 days. Beta users get 50% off for life.
Start Free Trial