Why AI Agents Are Powerful but Not Consistent
LLMs are not as good as some claim, nor as bad as others suggest. They’re inconsistent, by design.
Key takeaways
- LLMs (Large Language Models) are non‑deterministic. The same prompt can return different results each time it’s run. That’s both a feature (creativity) and a risk (inconsistency) of LLMs.
- Capability ≠ consistency. Models can sometimes complete complex tasks that traditional software can’t, but not reliably enough to meet the expectations of users accustomed to traditional, deterministic software.
- Bridge the gap with design and process. Users should treat outputs as drafts and not expect them to be perfect on the first attempt. Builders should communicate the reliability of their tools honestly and engineer for variability (e.g. by adding guardrails, retrieval, memory, retries and evaluations).
Capability ≠ Consistency
AI tools powered by large language models (LLMs) have made huge improvements in what they can attempt. It’s now very common to hear claims that AI agents can handle tasks that take humans hours or even days. For example, a recent study found that today’s top models can autonomously complete tasks that take a skilled human about an hour, but only with roughly a 50% success rate. Based on this, some people predict that AI systems could manage projects lasting weeks or even a month by the end of this decade. This is exciting, especially for builders of AI tools like me, but there’s something important for builders and users alike to note: large language models are not deterministic. In other words, an AI system might give you a correct solution to a very difficult problem or a stunning piece of creative output one moment, then disappoint you by failing at a very similar task the next. This gap between what AI can do and what it can do repeatedly and accurately is at the center of a disconnect between what model providers are saying and what users should realistically expect.
The core issue is that LLMs don’t behave like deterministic software. When you enter =SUM() in a spreadsheet, you expect the same result every time. But if you ask an AI agent to generate a report or solve a problem, its output will vary from one attempt to the next, even with the same input data and the same prompt. The agent might produce a brilliant solution in one instance and a completely nonsensical or incorrect one in the next. Consistency is not guaranteed; in fact, inconsistency is expected.
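To make that concrete, here is a minimal sketch, assuming the OpenAI Python client and a placeholder model name and prompt (any hosted chat model behaves similarly): sending the identical prompt twice will often return two noticeably different answers.

```python
# Minimal sketch: the same prompt, sent twice, often returns different text.
# Assumes the OpenAI Python client and an API key in the environment;
# the model name and prompt below are placeholders.
from openai import OpenAI

client = OpenAI()
prompt = "Summarize the key risks of launching a product two weeks early."

def ask(temperature: float = 0.7) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # lower values reduce, but don't eliminate, variation
    )
    return response.choices[0].message.content

first, second = ask(), ask()
print(first == second)  # frequently False, even though the input never changed
```

A spreadsheet’s =SUM() would never behave this way; with LLMs, this variability is the default.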
Why LLMs Feel Inconsistent
The reason for this inconsistency lies in how LLMs work. Unlike a hand-coded program that follows a fixed set of instructions, a large language model generates text by predicting the most likely next token based on its training data and context. This process is non-deterministic by design, which means the model can produce different outputs even when given the exact same input. In fact, this probabilistic approach is what gives LLMs their creativity and flexibility. They can come up with new sentences, code solutions, or marketing copy that wasn’t explicitly programmed and doesn’t exist in any previous text, by working in a way that is very different from how traditional software operates. The flip side is that this stochastic behavior leads to inconsistent results. An LLM’s unpredictable nature is a classic “double-edged sword” - it enables diverse, creative outputs, but it can also cause frustrating inconsistency when a user expects reliable, consistent answers.
- Randomness and Temperature: AI models often inject randomness to avoid repetitive, robotic-sounding answers. Higher “temperature” settings make outputs more varied and creative, while lower settings make them more predictable. But even with temperature set to zero (for maximum determinism), factors in the model or its serving environment can still result in different answers on different runs.
- Lack of True Understanding or Verification: An AI agent doesn’t know when it’s right or wrong; it just generates what seems plausible based on its training data. That means it might sometimes “hallucinate” facts or answers that sound confident but are incorrect. Traditional software would simply fail or throw an error if it came across a case it hadn’t been programmed for. LLMs will almost always produce an answer, but that answer might be wrong, and the model won’t reliably notice its own mistakes.
- Sensitivity to Wording and Context: LLMs can be oddly sensitive to how a question is phrased and to what has previously been written in a conversation thread. A small change in a user’s prompt can change the output quite a lot. If a user doesn’t prompt the AI in just the right way, they may get a much worse result than they otherwise would - hence the emergence of “prompt engineering”. This feels strange to users who are used to software that behaves the same way no matter how a request is worded.
- Maintaining Coherence Over Long Tasks: When an AI tries to perform complex, multi-step projects, it often struggles to stay on track. Current models have a limited “planning horizon”. They tend to get distracted, forget details, or lose sight of the original objective as the task grows in length. For example, an agent might start a report with a strong introduction and outline, but by the time it’s working on the detailed analysis, it may start to contradict itself or repeat earlier points. This is partly because LLMs have a fixed context window (they can only process a certain amount of information at once) and partly because they don’t have built-in long-term memory of past interactions, unless developers add external memory tools (see my post on memory, and the sketch after this list).
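As a minimal sketch of one such external-memory approach (rolling summarisation): once a conversation grows past a budget, older turns are compressed into a short summary so the original objective isn’t pushed out of the context window. The helper names, model name and thresholds here are placeholders, and real agent frameworks do this in far more sophisticated ways.

```python
# Minimal sketch of rolling-summary memory: older turns are compressed into a
# short summary so the original objective stays inside the context window.
# Assumes the OpenAI Python client; model name and thresholds are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"   # placeholder
KEEP_RECENT = 6         # number of recent messages kept verbatim

def compress_history(messages: list[dict]) -> list[dict]:
    """Summarize everything except the most recent turns into one system note."""
    if len(messages) <= KEEP_RECENT:
        return messages
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = client.chat.completions.create(
        model=MODEL,
        messages=older + [{"role": "user",
                           "content": "Summarize the conversation so far, "
                                      "including the original objective, in five bullet points."}],
    ).choices[0].message.content
    return [{"role": "system", "content": f"Summary of earlier conversation:\n{summary}"}] + recent

def chat(messages: list[dict], user_input: str) -> tuple[list[dict], str]:
    """Send one user turn, compressing old history first; returns updated history and reply."""
    messages = compress_history(messages) + [{"role": "user", "content": user_input}]
    reply = client.chat.completions.create(model=MODEL, messages=messages).choices[0].message.content
    return messages + [{"role": "assistant", "content": reply}], reply
```

Even a crude summary like this keeps the original objective visible to the model for far longer than raw history would.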
All of these differences mean that an AI agent isn’t like classic software; it’s more like a very knowledgeable but somewhat unpredictable person. Sometimes it will exceed expectations, solving a complex problem more quickly than a human could. Other times, it will get things totally wrong and make up answers, even on something it seemed to handle well previously. For users who have only previously used traditional software, this behavior can be confusing or even annoying. We’re used to software being consistent: if it worked once, it should work again on the same task with the same inputs. LLMs break this model. As a result, when an AI system outputs something incorrect or inconsistent, users might assume the whole system is flawed or the AI “lied.” But actually, this variability is a key part of how these models operate today.
What This Means For Users
Treat AI tools as fast, knowledgeable assistants, but don’t expect them to be perfect.
- Start with structure. Give clear goals, constraints, examples and tone. Provide source text or data where possible.
- Work in short loops. Don’t try to “one-shot” everything; work in incremental steps with a clear goal for each.
- Use checklists. For recurring tasks (ad copy, briefs, reports), define a checklist or set of rules the model must follow (an illustrative template follows this list).
- Check everything. Treat model outputs as drafts. Edit anything that doesn’t seem right or sound the way it should.
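As an example of the checklist idea above, a reusable prompt can spell out the rules a draft must satisfy before you review it. This is only an illustrative sketch; the task, rules and wording are placeholder assumptions to adapt to your own workflow.

```python
# Illustrative checklist prompt for a recurring task (here: ad copy).
# The rules and the brief are placeholders, not a recommended standard.
CHECKLIST = """
Before returning the ad copy, make sure it satisfies every rule:
1. Headline is 8 words or fewer.
2. Mentions the product name exactly once.
3. Includes one clear call to action.
4. Makes no claims about pricing or availability.
5. Tone: friendly, no exclamation marks.
"""

def build_prompt(brief: str) -> str:
    """Combine the task brief with the fixed checklist of rules."""
    return f"Write ad copy for the following brief.\n\nBrief:\n{brief}\n\n{CHECKLIST}"

print(build_prompt("Launch email for our new project-management template pack."))
```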
What This Means For AI Builders
Design for model variance and set realistic expectations.
Communicate clearly
- Position AI-based tools as tools that accelerate work, not as guaranteed automation of everything.
- Explain typical successful uses and known failure modes. Offer prompting tips inside the product.
Engineer for reliability
- Retrieval‑augmented generation (RAG): Ground outputs in your proprietary data to reduce hallucinations.
- Memory and state: Save facts and previous outcomes to improve performance over time.
- Constraint templates: Use schemas, forms and structured output validation to add a layer of consistency.
- Decomposition & tools: Break tasks into steps and call external tools where necessary (search, calculators, APIs), don’t rely purely on the model.
- Guardrails & evals: Add safety/quality checks and run continuous evaluations on tasks.
A Balanced View
Generative AI today has a huge amount of potential (and real use cases), but users need to understand its inherent unpredictability. We have models that can compose emails, write code, answer questions, and perform tasks that used to require a human with specialised training. This is a massive leap from what we could do in the 2010s, and the reason so many people are excited about integrating LLMs into their products and workflows. But these models also come with a kind of uncertainty that is new to software products and computer-based workflows. Rather than deny or downplay that fact, it’s better for everyone to acknowledge it and adapt to this new paradigm. In practical terms, that means users approaching AI outputs with a more critical eye (and being willing to edit or build on them), and developers of AI systems being upfront about the reliability of their tools and building in safeguards.
Consistency is improving, slowly but surely. Trends suggest that what an AI can do with high reliability is expanding (from a few minutes’ worth of human work to an hour, and beyond). New techniques in AI training and architecture are being explored to make models more stable and less prone to strange mistakes. And as mentioned, clever engineering (like giving an agent memory or breaking big tasks into smaller ones) can boost effective reliability even with today’s models. We should celebrate how far AI has come, but also be realistic about where it falls short. By aligning our expectations with this reality, we can avoid overhyping and under-delivering on one side, and being put off by a genuinely valuable technology on the other.
Further reading
- METR, Measuring AI ability to complete long tasks
- Greg Asquith, How Memory Makes Agentic AI Smarter