AI Coding Agents Alone vs Alkemy for ML Projects

AI coding agents like Claude Code and Codex bring a huge amount of machine learning knowledge to the table, but every session starts from an empty chat: they don’t inherently know what success looks like, what decisions were made last week, or how to navigate scattered notebooks and scripts.

Alkemy was built to give data scientists a complete operating framework for the ML lifecycle: structured projects, reproducible datasets, flexible experiments, version control, and deployment artifacts.

That’s why AI coding agents work so well with Alkemy: point an agent at an Alkemy project and the conventions, the experiment history and the path to production are already there to follow.

Across the ML lifecycle, here's where AI coding agents struggle on their own, where teams typically patch the gaps, and where Alkemy makes the difference:

| Stage / Task | Coding Agent Alone | Coding Agent + Ad-Hoc Scaffolding | Coding Agent + Alkemy |
| --- | --- | --- | --- |
| Project setup | Agents generate a project structure from scratch every time; context switching bites hard | Team writes a project template agents follow; works until the template needs to evolve | Agents inherit a standard project layout with built-in conventions for code, config, and artifacts |
| Building datasets | Each dataset is a one-off script; agents struggle to validate datasets and connect them to experiments when usage is scattered across code | Team adds naming conventions and shared utilities; agents have to be told about them every time | Datasets are validated objects agents can reference by name across sessions |
| Running experiments | Experiments are one-off scripts that print metrics; the agent forgets them when the session ends | Team adds MLflow or Weights & Biases, but every script is still bespoke and requires extensive boilerplate | Experiments and artifacts are tracked automatically, with results and feedback recorded so agents can use prior runs to guide iteration |
| Comparing to a benchmark | Agent has no notion of "the model to beat"; each run is judged in isolation | Team manually maintains a benchmark in a doc; agents have to be told about it every time and don’t automatically know which experiments to compare | Benchmarks and prior runs live inside the project structure, so agents can compare new results against the current bar |
| Deployment | Agent writes a separate inference script; production code silently diverges from experiment code | Team builds a deployment pipeline, but inference code must be kept in sync with experiment logic | Deployment uses the same code path as experiments: no rewrite and no drift |
| Handing off mid-project | New agent session or new teammate starts from the README and a repo full of files | New session can follow the team's conventions if they're documented | New session inherits structure, history, and current state from the framework itself |
| Ongoing maintenance | Structure degrades as work accumulates | Team must maintain and evolve scaffolding alongside the ML work | Lifecycle structure is part of the system, not something the team has to build and maintain |
Where AI Coding Agents Alone Are Enough:

Any kind of ML work that’s ad-hoc, exploratory, or short-lived, where results don’t need to be repeated, extended, or shipped to production.

  • Exploratory analysis of relationships between potential inputs and outputs

  • Evaluating an ML or data analysis library you haven’t used before

  • Short spikes that inform planning for future data products

  • One-off models or scripts that are not expected to be maintained or reused

Where They Break Down In Real ML Work:

Any kind of ML work that evolves over time, especially when results need to be reproducible, compared across iterations, and shipped to production.

  • Work loses continuity across sessions, making it hard to build on prior decisions and results

  • Ongoing experimentation with no clear view of what has been tried or what works best

  • Dataset logic gets scattered across scripts, and lineage becomes unclear

  • Experiment code diverges from production code, requiring rewrites and increasing the risk of drift

Benchmarking

Setup: Same model (GPT-5.4 in Codex), same one-line prompt ("I need to build a fraud detection model"), and the same public Kaggle transaction dataset. The agent-alone run finished in about 15 minutes with no user involvement after the first prompt. The agent + Alkemy run took about 6 hours of agent runtime and 1 to 2 hours of our time answering the agent's intake questions (precision target, deployment mode, historical-feature availability) and providing short steering inputs later (faster libraries, scrutinize optimistic outputs, more stress testing). We redirected the agent but did not write code or make technical decisions.

Two terms used below: Transaction-only features are computable by the API at scoring time (amount, merchant category, card type). Historical features require looking up a customer's or card's past activity at scoring time (prior merchants used, recent average spend). Historical features are often more predictive for fraud but need extra engineering to serve fresh values in production. Keeping them separate makes the choice a stakeholder investment question rather than a default.
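To make the distinction concrete, here's a minimal sketch of the two feature sets. All field names and functions are hypothetical illustrations, not Alkemy APIs or the actual dataset schema:

```python
# Hypothetical sketch: two feature sets for a fraud-scoring API.
# Transaction-only features come straight from the request payload;
# historical features need a lookup into the card's past activity,
# which must be served fresh in production (extra engineering).

def transaction_only_features(txn: dict) -> dict:
    """Computable from the transaction alone at scoring time."""
    return {
        "amount": txn["amount"],
        "merchant_category": txn["merchant_category"],
        "card_type": txn["card_type"],
    }

def historical_features(txn: dict, history: list) -> dict:
    """Require looking up past activity at scoring time."""
    amounts = [h["amount"] for h in history[-30:]]  # recent window
    return {
        "recent_avg_spend": sum(amounts) / len(amounts) if amounts else 0.0,
        "seen_merchant_before": any(
            h["merchant_category"] == txn["merchant_category"] for h in history
        ),
    }

txn = {"amount": 120.0, "merchant_category": "electronics", "card_type": "credit"}
history = [{"amount": 40.0, "merchant_category": "grocery"},
           {"amount": 80.0, "merchant_category": "electronics"}]

features = {**transaction_only_features(txn), **historical_features(txn, history)}
```

Keeping the two functions separate makes the serving cost explicit: the first works in any deployment mode, while the second commits the team to maintaining a fresh activity store.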

| Component | Agent Alone | Agent + Alkemy |
| --- | --- | --- |
| Problem intake | Started coding in turn 1. Assumed a 10% precision floor. | Asked about the precision target (30%), deployment mode (API), and historical-feature availability. Planned experiments to evaluate the historical-features business case. |
| Features | One set, transaction-only by default. | Transaction-only and transaction-plus-historical tracked separately. |
| Validation | Single chronological 80/20 holdout. | Time-aware CV across five annual folds with a buffer gap, plus sliding-window and 30-day-gap stress tests. |
| Headline metric | PR-AUC ≈ 0.71 on one slice. | PR-AUC ≈ 0.46 under time-aware CV (transaction-only tuned LightGBM). The same model scored ≈ 0.82 on a random split, which was flagged as too optimistic. |
| Robustness | None. | Found a 2017 fraud-pattern shift where most candidate models caught almost no fraud. Documented as a production risk. |
| Artifacts | Training script, scoring script, README. | Validated dataset, decision journal, 18 experiments, feature importances, stakeholder-ready business-case report. |
| Handoff | Re-derivable only by rereading the script. | Brief + journal + experiments carry decisions forward. |
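The time-aware validation in the table can be sketched as follows. This is a generic illustration of chronological folds with a buffer gap, not Alkemy's actual implementation, and the fold layout is an assumption:

```python
# Illustrative sketch of time-aware cross-validation with a buffer gap:
# each fold trains only on earlier rows, skips a gap, then evaluates on
# a later window. A random split would leak future patterns into
# training, which is why it looks optimistic on temporal fraud data.

def time_aware_folds(timestamps, n_folds=5, gap=30):
    """Yield (train_idx, test_idx) pairs over chronologically ordered rows.

    `gap` rows between train and test approximate a buffer period, so
    the model never trains on data adjacent to its evaluation window.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    fold_size = len(order) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = order[: k * fold_size - gap]
        test = order[k * fold_size : (k + 1) * fold_size]
        yield train, test

# Usage: 600 daily rows, 5 folds, a 30-row gap before each test window.
ts = list(range(600))
folds = list(time_aware_folds(ts, n_folds=5, gap=30))
```

Every train index precedes every test index in each fold, which is the property the single random split in the agent-alone run does not guarantee.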

Takeaway: Both runs produce a working model. Only the Alkemy run produces the journal, the transaction-only vs transaction-plus-historical investment split, the time-aware validation that showed the single-slice number was too optimistic, the 2017 drift finding, and the business-case report a team needs to justify putting historical features on their engineering roadmap.

Fairness disclosures: Both runs used GPT-5.4 in Codex, the same raw data, and the same one-line prompt. The Alkemy run also received short user redirections as the agent raised questions, but no technical contributions. Validation splits differ between the two runs, so the predictive numbers are directional, not head-to-head. The agent-alone threshold was chosen against a default 10% precision floor because it did not ask for a business target. With Alkemy, the 30% target came from intake. The dataset was normalized from its original JSON and CSV format to all-CSV with renamed columns to obfuscate the source. The underlying data was not altered.
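Picking an operating threshold against a precision target, as in the intake above, can be sketched like this. The scores and labels are made up for illustration; neither run's actual code or data is shown:

```python
# Sketch: find the lowest score threshold whose precision meets a
# target (e.g. the 30% target from intake, vs the assumed 10% floor).
# Real runs would use held-out predictions, not these toy values.

def threshold_for_precision(scores, labels, target):
    """Return the lowest threshold with precision >= target, else None."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    best = None
    tp = fp = 0
    for score, label in pairs:
        tp += label
        fp += 1 - label
        if tp / (tp + fp) >= target:
            best = score  # flagging everything >= score meets the target
    return best

scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40]
labels = [1, 1, 0, 1, 0, 0, 0]  # 1 = fraud
loose = threshold_for_precision(scores, labels, target=0.30)
strict = threshold_for_precision(scores, labels, target=0.80)
```

A looser target admits a lower threshold (more flagged transactions, more false positives), which is why the assumed 10% floor and the stated 30% target lead to different operating points.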

Ready to get in the game? Get in touch to take the next step in your data journey.
