Causal Dynamics Lab outperforms Anthropic & OpenAI in multiple coding tests

AI coding tools are now producing code faster than teams can check what it will do in real use. Today, Causal Dynamics Lab (CDL) announced new research explaining why this happens, along with a new product called Cielara Code. This product achieved the highest accuracy in code localization among AI coding tools, outperforming both Claude Code (Opus-4.6) and OpenAI Codex (GPT-5.4) across three independent tests.

CDL studied how coding agents operate by tracking their actions across thousands of coding sessions. They found that 56.8% of agents' actions involved reading files and 24.2% involved using grep, while less than 1% were actual code edits. The problem was not that agents couldn't write code; they had difficulty finding the correct code to edit. The situation worsened with more complex tasks: when a correct fix involved more than six files, the agents' ability to locate everything the fix required (their recall) dropped significantly, and failed attempts consumed about four times as much compute as successful ones.

The 2025 DORA report showed the use of AI coding tools led to a 7.2% drop in deployment stability. AWS CTO Werner Vogels called this problem “dynamic verification debt.” A well-known issue with Claude Code (GitHub issue #42796) illustrates the same problem on a larger scale: current agents treat code as flat text without showing how files connect, how functions call each other, or how changes affect the overall system.

How Cielara Code works

Cielara Code uses a model to represent a customer’s production environment in a 6-layer causal graph. This graph includes information on what the code does, why it was created, who owns it, its limitations, where it runs, and what happens at runtime. If there is a failure, it can be linked back to the specific code change, the developer who approved it, and the reason for that change. Before an agent begins to explore, Cielara Code builds a Code Dependency Causal Graph. This graph tracks four types of relationships, allowing the agent to navigate the structure rather than just look through files one by one.
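To make "navigating the structure" concrete, here is a minimal sketch of walking a typed dependency graph instead of scanning files one by one. It is an illustration only, not Cielara Code's implementation: the article does not name the four relationship types, so the edge labels (`calls`, `imports`, `mentions`), the file paths, and the `reachable` helper are all hypothetical.

```python
from collections import deque

# Hypothetical toy graph: (source file, relationship type) -> target files.
# The edge types and file names are assumptions made for illustration.
EDGES = {
    ("api/handler.py", "calls"): ["billing/invoice.py"],
    ("billing/invoice.py", "imports"): ["billing/tax.py"],
    ("billing/invoice.py", "calls"): ["db/orders.py"],
    ("docs/readme.md", "mentions"): ["api/handler.py"],
}


def reachable(start, edge_types, max_hops=2):
    """Breadth-first walk over the chosen edge types.

    Returns the visited files in visit order, starting with `start`,
    and never follows more than `max_hops` edges from it.
    """
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for edge_type in edge_types:
            for target in EDGES.get((node, edge_type), []):
                if target not in seen:
                    seen.append(target)
                    queue.append((target, hops + 1))
    return seen


# Files an agent might inspect when asked to change the handler.
candidates = reachable("api/handler.py", ["calls", "imports"])
```

In this toy graph the walk visits four files and never opens `docs/readme.md`, which is the point of following relationships rather than reading every file.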

Benchmark results

Across three independent benchmarks, Cielara Code beat both Claude Code (Opus-4.6) and OpenAI Codex (GPT-5.4) at the hardest part of agent work: finding the right place to make a change. Overall localization accuracy hit 0.774, versus 0.738 for Claude Code and 0.707 for Codex. On MULocBench (1,033 issues across 46 repositories), Cielara reached 0.752 recall@5 versus 0.727 for Claude Code, and cut mean task time from 141.84 to 128.62 seconds. The result: fewer wrong-file edits, fewer failed runs, and 30 to 40 percent lower compute cost per task.
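For readers unfamiliar with the metric, recall@5 is commonly computed as the share of files that truly need changing that appear among a tool's top five suggestions. The sketch below assumes that common definition; MULocBench's exact scoring may differ, and the function names and sample file lists are hypothetical. It also works out the relative drop in mean task time from the two figures quoted above.

```python
def recall_at_k(ranked_files, relevant_files, k=5):
    """Fraction of the relevant files that appear in the top-k ranked list."""
    relevant = set(relevant_files)
    if not relevant:
        raise ValueError("relevant_files must not be empty")
    hits = relevant.intersection(ranked_files[:k])
    return len(hits) / len(relevant)


def relative_reduction(before, after):
    """Relative decrease from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# Hypothetical example: two files need changing, one is ranked in the top five.
ranked = ["a.py", "b.py", "c.py", "d.py", "e.py", "f.py"]
needed = ["b.py", "f.py"]
score = recall_at_k(ranked, needed)  # 0.5

# Mean task time figures quoted in the article (seconds).
time_saving = relative_reduction(141.84, 128.62)  # about 0.093, i.e. roughly 9%
```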

REASONARA: causal memory at enterprise scale

Cielara Code makes this practical through REASONARA, a graph-structured causal memory layer that stores 125M+ tokens of effective context but retrieves only what matters for each query. A typical lookup uses 1,000–2,500 tokens, compared with 23,000–115,000 for full-context approaches, a reduction of up to 98%. On independent benchmarks, REASONARA scores 94% on UltraDomain, 92% on LoCoMo, 73% on LoCoMo-plus, and 87.4% on LongMemEval, and runs 5–8× faster than Codex high-reasoning mode. The roadmap targets a one-billion-token context window.
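The token figures can be checked with simple arithmetic. The article does not say which ends of the two ranges are being compared, so the sketch below tries several pairings; the `token_reduction` helper is only an illustration.

```python
def token_reduction(retrieved_tokens, full_context_tokens):
    """Fraction of tokens saved by retrieving a subset instead of the full context."""
    return 1 - retrieved_tokens / full_context_tokens


# Pairings of the ranges quoted in the article (1,000-2,500 vs 23,000-115,000).
low_vs_low = token_reduction(1_000, 23_000)      # about 0.957
high_vs_high = token_reduction(2_500, 115_000)   # about 0.978
worst_case = token_reduction(2_500, 23_000)      # about 0.891
best_case = token_reduction(1_000, 115_000)      # about 0.991
```

Comparing the upper ends of both ranges gives about 97.8%, which is consistent with the quoted "up to 98%"; the other pairings span roughly 89% to 99%.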

Cielara Code is a safety layer for AI coding agents. It aims to make their output safer rather than replace them. Currently, 11 Fortune 100 companies and more than 40 Fortune 500 companies use Cielara Code on their codebases.

The team

The team's background is closely matched to the problem it is addressing. CEO Hasibul Haque led platform engineering at Uber during its rapid growth. CTO Ryan Turner was a Staff Engineer at Uber and helped maintain the SPIRE Project within the Cloud Native Computing Foundation (CNCF). R&D is led by Dr. Xuchao Zhang, who worked at Microsoft Research, and Dr. Liang Zhao from Emory University, who has 200+ publications and is ranked among the top 2% of scientists by Stanford University. CDL has a formal research partnership with Emory's AI Lab.

What’s next

The Production World Model serves as a foundation. Cielara Code and REASONARA are the first products to use this foundation. In the future, Causal Dynamics Lab will fully simulate the effects of changes in code, infrastructure, policy, and operation. This will create a permanent reasoning layer in the enterprise system that any AI agent can access before making changes that affect production.
