For teams building with large language models, model selection shapes nearly every dimension of the product experience: quality, latency, cost, and tone. The industry is moving quickly: model release cycles are shortening and the configuration surface area continues to grow. As a result, AI teams need low-cost, repeatable ways to evaluate new models against their specific use cases.
At Asana, AI Teammates is our agentic AI system designed to help teams both organize work and move it forward. Our agents don’t just live in a one-on-one chat; they collaborate directly within Asana’s shared spaces with the whole team. To deliver a product that our customers can rely on, we need to select the right models with the right configurations.
This led us to develop a set of benchmarks tailored to collaborative work management, designed to make model evaluation both efficient and high signal. In this post, we’ll share how we built these benchmarks from real customer feedback, how we run evaluations across our system, and how the results inform our model selection decisions.
Before we cover how we pick models, we’ll cover everything that makes this decision difficult and led us to develop our own Asana-specific benchmarks.
Different models with different reasoning capabilities offer different tradeoffs between latency (both time to first token and tokens per second), cost per token (input, output, cache read, and cache write), average tokens consumed (whether in thinking or output), and finally response quality. This means we can’t conclude much from a release page. We need to measure and compare each of these dimensions for the use cases we care about. A model that excels at synthesizing a status report from dozens of tasks might be overkill (and too slow) for a simple knowledge look-up query.
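To make these dimensions concrete, here is a minimal sketch of how a per-request cost can be derived from measured token counts and published per-token prices. All pricing and measurement numbers are hypothetical, and the field names are our own illustration, not Asana's actual instrumentation.

```python
from dataclasses import dataclass

@dataclass
class ModelRun:
    """One measured execution of a model on a benchmark scenario."""
    time_to_first_token_s: float
    tokens_per_second: float
    input_tokens: int          # include cache reads here if billed at the same rate
    output_tokens: int         # includes thinking tokens for reasoning models
    quality_score: float       # 0-1, from deterministic or LLM-graded checks

def request_cost(run: ModelRun, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Blend input and output pricing into a per-request dollar cost."""
    return (run.input_tokens * usd_per_m_input +
            run.output_tokens * usd_per_m_output) / 1_000_000

# Illustrative numbers only: a long-context call with a moderate response.
run = ModelRun(time_to_first_token_s=0.8, tokens_per_second=55,
               input_tokens=12_000, output_tokens=2_500, quality_score=0.92)
print(f"${request_cost(run, usd_per_m_input=3.0, usd_per_m_output=15.0):.4f}")
```

Latency and quality would be tracked alongside cost in the same record, since a recommendation only makes sense across all three dimensions at once.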
A single AI Teammate execution can involve up to seven distinct types of LLM calls. Each call has its own prompt and job to be done, and weighs the latency-cost-quality tradeoffs differently. Therefore, we evaluate models separately for each execution type.
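One way to picture per-call-type model selection is a simple routing table. This is a hypothetical sketch: the call-type names, model identifiers, and parameters are illustrative, not Asana's actual configuration.

```python
# Each LLM call type weighs latency, cost, and quality differently,
# so each gets its own model choice. All names here are invented.
MODEL_CONFIG = {
    "planning":       {"model": "large-reasoning-model", "max_output_tokens": 4096},
    "tool_selection": {"model": "fast-small-model",      "max_output_tokens": 512},
    "status_summary": {"model": "mid-tier-model",        "max_output_tokens": 2048},
}

def model_for(call_type: str) -> dict:
    """Look up the configuration for a call type, with a safe default."""
    return MODEL_CONFIG.get(call_type, {"model": "mid-tier-model", "max_output_tokens": 1024})

print(model_for("planning")["model"])  # large-reasoning-model
```

A structure like this is what makes it cheap to swap a candidate model into one call type at a time during evaluation, without touching the others.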
The AI community has built many insightful benchmarks to evaluate LLMs, but none of them capture what it means to be effective at collaborative work management within a product like Asana. Understanding organizational context, following nuanced multi-step instructions (potentially from multiple individuals), reasoning about task dependencies and timelines, knowing when to ask versus when to act (and doing all of this efficiently): these are the capabilities that determine whether an AI agent is useful to a real team.
Some kinds of mistakes a model makes can be mitigated through our system prompts or tool guidance, but the optimizations we made for previous models may no longer suit the latest ones. For example, some newer models have been too eager to do work outside the desired scope and needed additional prompting to stay within what the user prescribed. Switching between providers tends to bring a different set of prompting needs: OpenAI and Anthropic, for instance, seem to provide slightly different default guidance on dealing with conflicts between system and user prompts. This means that in order to fairly evaluate models, we need to attempt to fix mistakes through prompts or context before holding them against the model. This is especially true when evaluating a model from a provider other than the one we’ve historically used (and whose needs our prompts and context have been optimized to fit). Overall, evaluating models needs to happen in tandem with prompt updates to achieve the best performance from each model.
To build meaningful benchmarks, we first had to deeply understand our customers and their use cases. We've been developing AI Teammates with a set of beta customers, and this partnership has been the foundation of everything we've built.
For every AI Teammate execution, users can submit written feedback about what it did well or what it could have done better. As a team, we read every single piece of feedback that comes in. AI Studio helps make this manageable by writing summaries and categorizing feedback into themes for high-level monitoring. This feedback identifies the exact use cases where customers expect AI Teammates to excel and the moments where they fall short.
We had a robust user research program during our beta that involved regular user testing sessions and interviews and produced meaningful insights. These conversations help us understand general pain points, where customers are getting the most value, and what problems they're hoping to solve with AI Teammates. Feedback at this level shapes the categories of capability we need to optimize for, while execution-level feedback sharpens the specific test cases.
Finally, we are our own biggest users. AI Teammates have caught on internally at Asana across a wide range of use cases: security reviews, bug triaging, backlog grooming, status reporting, and more. A steady stream of feedback and ideas comes from our own teams every day, helping us identify rough edges and use cases we should optimize for.
With all of this input, we developed four general categories that we evaluate our system against end to end.
We hold our models to an extremely high bar when it comes to doing what they're told, accurately reporting what they did, and not doing things they weren't asked to do. This is key to building user trust that allows them to delegate more complex, ambiguous, and important use cases to AI Teammates.
Our benchmarks here throw the most difficult set of scenarios at the model: subtly conflicting instructions, conditional branching, ambiguous choices, and tasks that require the model to understand the spirit of an instruction rather than just its literal text. We make sure that AI Teammates prioritize task-specific instructions over our general guidance on how to approach their work.
AI Teammates need to gracefully walk the line between making reasonable assumptions to fill in gaps and going too far by guessing at information that could lead to undesired work. This is one of the hardest capabilities to get right. Lean too far toward caution and the Teammate asks too many questions, creating friction. Lean too far toward autonomy and it produces work the user didn't want.
Our benchmarks ensure that AI Teammates surface their assumptions and limitations early in their planning, make good assumptions to fill in genuinely missing details, but don't blindly guess at or fabricate unsupported information. The ideal behavior mirrors what a thoughtful new team member would do: "Here's my plan based on what I know. I'm assuming X and Y. Let me know if that's off." And as AI Teammates build memories from past executions, they require less and less user guidance over time.
A core job in Asana is breaking down complex goals into clear projects and, from a large or messy project, bringing clarity through status updates and active management. This requires reasoning about the structure of work itself: tasks, subtasks, dependencies, timelines, assignments, and how they all fit together.
Our benchmarks ensure that AI Teammates can appropriately create and reason about tasks, dependencies, dates, and project structure. We test everything from "create a project plan for this product launch" to "what's blocking progress in this project and what should we prioritize next." Getting this right requires not just understanding the mechanics of project management but also applying judgment about what level of granularity is appropriate and what information is most actionable.
The complexity of an organization scales super-linearly with its size. Cross-team dependencies, overlapping roadmaps, and shared resources create a volume of information that's difficult for any one person to fully ingest. This creates one of the most valuable opportunities for AI Teammates.
Our benchmarks here ask the AI Teammate to play the role of a team lead and reason about all the ways a specific platform team may need to support every other product team's roadmap. For me, as a technical lead on our team, this type of analysis used to represent a full day's worth of manual effort: reading through every team's roadmap, cross-referencing dependencies, and identifying conflicts. An AI Teammate performs this analysis in under 15 minutes with a higher degree of thoroughness than I had been able to achieve manually. These benchmarks test whether the model can maintain coherence and analytical rigor across large volumes of interconnected organizational data.
With our benchmarks defined, we needed an efficient and reliable way to test and grade models across all of them.
Our first line of evaluation is automated code-run evals. These create a sandbox environment, set up a specific scenario, call one of the specific components of the AI Teammates system, and test the output. We use a combination of deterministic checks (did it create the right number of tasks? did it set the correct due date?) and LLM-graded evaluations for non-deterministic results (is this status update accurate and well-written?). These evals are fast to run across different models and produce summaries of cost, latency, and correctness that give us a quick read on viability. These cover a majority of use cases and completely handle the simpler LLM interactions within AI Teammates.
Next, we run our end-to-end benchmarks against the full AI Teammate system. Each benchmark gets run multiple times. AI helps grade each output, but we verify every result individually to build an intuition for not just whether the model made mistakes, but why. Asana's AI Teammate execution history makes this easy directly from our app by showing each action and the model’s reasoning together. When we identify mistakes that could be mitigated with prompt or context changes, we make those adjustments and rerun the evaluations. As models get better, we’re constantly making both our benchmarks harder and our agent harness more capable and efficient, which means we’re also rerunning the latest version of our benchmarks against previous models.
From there, we synthesize all of these results into summaries of the tradeoffs between quality, cost, and latency for each model configuration. These roll-ups allow us to make data-backed recommendations about which model to test in production for specific call types within the system. The goal is always the same: the best possible experience for users, informed by concrete data rather than intuition about which model "feels" better.
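A roll-up of this kind can be as simple as averaging each dimension over repeated runs per model. The sketch below uses invented model names and illustrative numbers; a real summary would also carry variance, sample counts, and per-call-type breakdowns.

```python
from statistics import mean

# Hypothetical repeated benchmark runs for two candidate models.
runs = [
    {"model": "model-a", "quality": 0.94, "cost_usd": 0.071, "latency_s": 9.2},
    {"model": "model-a", "quality": 0.90, "cost_usd": 0.069, "latency_s": 8.8},
    {"model": "model-b", "quality": 0.88, "cost_usd": 0.021, "latency_s": 4.1},
    {"model": "model-b", "quality": 0.86, "cost_usd": 0.023, "latency_s": 4.5},
]

def summarize(runs: list) -> dict:
    """Group runs by model and average each tradeoff dimension."""
    grouped = {}
    for r in runs:
        bucket = grouped.setdefault(r["model"],
                                    {"quality": [], "cost_usd": [], "latency_s": []})
        for key in ("quality", "cost_usd", "latency_s"):
            bucket[key].append(r[key])
    return {m: {k: round(mean(v), 3) for k, v in dims.items()}
            for m, dims in grouped.items()}

for model, dims in summarize(runs).items():
    print(model, dims)
```

In this toy data, "model-b" is roughly a third the cost and half the latency at a modest quality discount, which is exactly the kind of tradeoff the roll-ups are meant to surface per call type.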
The final stage of evaluation happens when we start trying out the best configurations in our real workflows internally at Asana. Our internal tooling allows us to specify exact models with specific parameter configurations, which means we can compare models side by side in our real day-to-day work. Engineers, PMs, and other teams across Asana use AI Teammates with the candidate model as part of their normal workflow, giving us signals on real-world performance that benchmarks alone can't capture.
This dogfooding phase surfaces bugs, prompt iterations, and edge cases that only emerge in the messy reality of actual work. When something looks off, we can drill into the execution history, understand the root cause, and determine whether it's a model limitation, a prompt gap, or a system-level issue. Once we're confident in the results, we roll the improvement out to customers.
The most important insight from this work is that model quality is not a single dimension. Choosing well means deeply knowing your customers, their needs, and their use cases; only then can we pick the most efficient model to serve those needs.
We've also learned that benchmarks are a living artifact. As models improve, as our product evolves, and as customers bring us new use cases, the benchmarks need to evolve too. The process we've built around gathering customer feedback, developing targeted evaluations, and validating in both internal betas and production is designed to be iterative and repeatable.
Model providers’ release schedules are accelerating, and their range of products and configurations keeps growing. Rather than passing that burden along to customers, Asana provides an opinionated (but configurable) approach to model selection, configuration, and optimization. This frees our customers to focus on what they do best, knowing AI Teammates will always be using the best and latest models.
This article was written by Nathan RainaDeGraaf, Software Engineer at Asana. Nathan is the technical lead of the AI Teammates team, where he works to build and scale Asana's collaborative agentic AI product.