
AI Coding Agents Alone vs Alkemy for ML Projects
AI coding agents like Claude Code and Codex bring a huge amount of machine learning knowledge to the table, but every session starts from an empty chat: they don’t inherently know what success looks like, what decisions were made last week, or how to navigate scattered notebooks and scripts.
Alkemy was built to give data scientists a complete operating framework for the ML lifecycle: structured projects, reproducible datasets, flexible experiments, version control, and deployment artifacts.
That’s why AI coding agents work so well with Alkemy: point an agent at an Alkemy project and the conventions, the experiment history and the path to production are already there to follow.
Across the ML lifecycle, here's where AI coding agents struggle on their own, where teams typically patch the gaps, and where Alkemy makes the difference:
| Stage / Task | Coding Agent Alone | Coding Agent + Ad-Hoc Scaffolding | Coding Agent + Alkemy |
|---|---|---|---|
| Project setup | Agents generate a project structure from scratch every time; context switching bites hard | Team writes a project template agents follow; works until the template needs to evolve | Agents inherit a standard project layout with built-in conventions for code, config, and artifacts |
| Building datasets | Each dataset is a one-off script; agents struggle to validate and connect them, with experiment usage scattered across code | Team adds naming conventions and shared utilities; agents have to be told about them every time | Datasets are validated objects agents can reference by name across sessions |
| Running experiments | Experiments are one-off scripts that print metrics; the agent forgets them when the session ends | Team adds MLflow or Weights & Biases, but every script is still bespoke and requires extensive boilerplate | Experiments and artifacts are tracked automatically, with results and feedback recorded so agents can use prior runs to guide iteration |
| Comparing to a benchmark | Agent has no notion of "the model to beat"; each run is judged in isolation | Team manually maintains a benchmark in a doc; agents have to be told about it every time and don’t automatically know which experiments to compare | Benchmarks and prior runs live inside the project structure, so agents can compare new results against the current bar |
| Deployment | Agent writes a separate inference script; production code silently diverges from experiment code | Team builds a deployment pipeline, but inference code must be kept in sync with experiment logic | Deployment uses the same code path as experiments: no rewrite and no drift |
| Handing off mid-project | New agent session or new teammate starts from the README and a repo full of files | New session can follow the team's conventions if they're documented | New session inherits structure, history, and current state from the framework itself |
| Ongoing maintenance | Structure degrades as work accumulates | Team must maintain and evolve scaffolding alongside ML work | Lifecycle structure is part of the system, not something the team has to build and maintain |
Where AI Coding Agents Alone Are Enough:
Any kind of ML work that’s ad-hoc, exploratory, or short-lived, where results don’t need to be repeated, extended, or shipped to production.
- Exploratory analysis of relationships between potential inputs and outputs
- Evaluating an ML or data analysis library you haven’t used before
- Short spikes that inform planning for future data products
- One-off models or scripts that are not expected to be maintained or reused
Where They Break Down In Real ML Work:
Any kind of ML work that evolves over time, especially when results need to be reproducible, compared across iterations, and shipped to production.
- Work loses continuity across sessions, making it hard to build on prior decisions and results
- Ongoing experimentation with no clear view of what has been tried or what works best
- Dataset logic gets scattered across scripts, and lineage becomes unclear
- Experiment code diverges from production code, requiring rewrites and increasing the risk of drift
Benchmarking
Setup: Same model (GPT-5.4 in Codex), same one-line prompt ("I need to build a fraud detection model"), and the same public Kaggle transaction dataset. The agent-alone run finished in about 15 minutes with no user involvement after the first prompt. The agent + Alkemy run took about 6 hours of agent runtime and 1 to 2 hours of our time answering the agent's intake questions (precision target, deployment mode, historical-feature availability) and providing short steering inputs later (faster libraries, scrutinize optimistic outputs, more stress testing). We redirected the agent but did not write code or make technical decisions.
Two terms used below: Transaction-only features are computable by the API at scoring time (amount, merchant category, card type). Historical features require looking up a customer's or card's past activity at scoring time (prior merchants used, recent average spend). Historical features are often more predictive for fraud but need extra engineering to serve fresh values in production. Keeping them separate makes the choice a stakeholder investment question rather than a default.
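The distinction can be made concrete in code. A minimal sketch, assuming pandas and hypothetical column names (`card_id`, `amount`, `merchant_category`) that are not the benchmark dataset's actual schema: transaction-only features come from the request payload alone, while historical features need the card's prior activity and must avoid leaking the current transaction.

```python
import pandas as pd

# Toy stand-in for the transaction data; column names are illustrative.
txns = pd.DataFrame({
    "card_id": ["A", "A", "B", "A"],
    "amount": [20.0, 250.0, 15.0, 300.0],
    "merchant_category": ["grocery", "electronics", "grocery", "electronics"],
})

def transaction_only(df: pd.DataFrame) -> pd.DataFrame:
    """Features computable from the scoring request alone."""
    out = pd.DataFrame(index=df.index)
    out["amount"] = df["amount"]
    out["is_electronics"] = (df["merchant_category"] == "electronics").astype(int)
    return out

def with_historical(df: pd.DataFrame) -> pd.DataFrame:
    """Adds features that require a lookup of the card's past activity."""
    out = transaction_only(df)
    # Mean of the card's PRIOR spend; shift(1) excludes the current
    # transaction so the feature is available at scoring time.
    prior_mean = (
        df.groupby("card_id")["amount"]
          .transform(lambda s: s.shift(1).expanding().mean())
    )
    out["prior_avg_spend"] = prior_mean
    out["amount_vs_prior"] = df["amount"] / prior_mean
    return out
```

Serving `prior_avg_spend` in production means maintaining fresh per-card aggregates behind the API, which is exactly the extra engineering investment the stakeholder question is about.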
| Component | Agent Alone | Agent + Alkemy |
|---|---|---|
| Problem intake | Started coding in turn 1. Assumed a 10% precision floor. | Asked about the precision target (30%), deployment mode (API), and historical-feature availability. Planned experiments to evaluate the historical-features business case. |
| Features | One set, transaction-only by default. | Transaction-only and transaction-plus-historical tracked separately. |
| Validation | Single chronological 80/20 holdout. | Time-aware CV across five annual folds with a buffer gap, plus sliding-window and 30-day-gap stress tests. |
| Headline metric | PR-AUC ≈ 0.71 on one slice. | PR-AUC ≈ 0.46 under time-aware CV (transaction-only tuned LightGBM). The same model scored ≈ 0.82 on a random split, which was flagged as too optimistic. |
| Robustness | None. | Found a 2017 fraud-pattern shift where most candidate models caught almost no fraud. Documented as a production risk. |
| Artifacts | Training script, scoring script, README. | Validated dataset, decision journal, 18 experiments, feature importance, stakeholder-ready business-case report. |
| Handoff | Re-derivable by rereading the script. | Brief + journal + experiments carry decisions forward. |
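The time-aware validation in the table can be sketched with scikit-learn's `TimeSeriesSplit`, which trains only on rows that precede each test fold and supports a buffer gap so windowed historical features near the boundary cannot leak. The fold count and 30-row gap below are illustrative, not the benchmark's exact configuration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six years of daily rows, already sorted by time (toy stand-in).
n_days = 6 * 365
X = np.arange(n_days).reshape(-1, 1)

# Five expanding folds; `gap` drops the last 30 train rows before each
# test fold so lookback features cannot peek across the boundary.
tscv = TimeSeriesSplit(n_splits=5, gap=30)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Train always ends at least `gap` rows before test begins.
    print(f"fold {fold}: train ends {train_idx.max()}, "
          f"test {test_idx.min()}-{test_idx.max()}")
```

A random split, by contrast, mixes future rows into training, which is how a single model can score ≈ 0.82 on a random split but ≈ 0.46 under time-aware CV.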
Takeaway: Both runs produce a working model. Only the Alkemy run produces the decision journal, the transaction-only vs transaction-plus-historical investment split, the time-aware validation that showed the single-slice number was too optimistic, the 2017 drift finding, and the business-case report a team needs to justify putting historical features on its engineering roadmap.
Fairness disclosures: Both runs used GPT-5.4 in Codex, the same raw data, and the same one-line prompt. The Alkemy run also received short user redirections as the agent raised questions, but no technical contributions. Validation splits differ between the two runs, so the predictive numbers are directional, not head-to-head. The agent-alone threshold was chosen against a default 10% precision floor because it did not ask for a business target. With Alkemy, the 30% target came from intake. The dataset was normalized from its original JSON and CSV format to all-CSV with renamed columns to obfuscate the source. The underlying data was not altered.
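Picking an operating threshold against a stated precision target, rather than a default floor, is a small amount of code. A sketch using scikit-learn's `precision_recall_curve` on synthetic scores (`threshold_for_precision` is a hypothetical helper, not part of any library):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic scores: fraud cases (y=1) skew higher than legitimate ones.
y = np.concatenate([np.zeros(950), np.ones(50)])
scores = np.concatenate([rng.beta(2, 5, 950), rng.beta(5, 2, 50)])

precision, recall, thresholds = precision_recall_curve(y, scores)

def threshold_for_precision(precision, recall, thresholds, target):
    """Lowest score threshold whose precision meets the target, keeping
    recall as high as possible. Precision is not strictly monotonic in
    the threshold, so this is a simple heuristic, not an optimum."""
    ok = precision[:-1] >= target  # last precision entry has no threshold
    if not ok.any():
        return None  # target unreachable on this data
    i = int(np.argmax(ok))  # first (lowest) threshold meeting the target
    return thresholds[i], recall[i]

choice = threshold_for_precision(precision, recall, thresholds, target=0.30)
```

With a 30% target from intake, the threshold (and the recall given up to reach it) becomes an explicit, reviewable decision instead of a silent default.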
Common Questions
What is an AI coding agent?
An AI coding agent is software that can write, modify, and reason about code on its own, often with minimal human input. Instead of just answering questions, an agent can take a goal and work toward completing it. Agentic coding is becoming a standard part of development environments, and the technology is advancing rapidly.
Will AI coding agents replace engineers?
AI coding agents can reduce the need for large engineering teams by helping small teams with limited resources handle repetitive dev work. This empowers non-experts to build real systems and prototype products faster. They are a force multiplier for engineers, not a full replacement: they need constraints and guidance to prevent errors, otherwise they can get stuck or head in the wrong direction. Alkemy provides that guidance.
Examples of coding agents include:
- OpenAI Codex
- Anthropic Claude Code
- Cursor
- GitHub Copilot
Which coding agents does Alkemy support?
Most popular coding agents, including Codex, Claude Code, and Cursor, are supported out of the box. As long as your coding agent or custom harness supports AGENTS.md (or CLAUDE.md) and Agent Skills, you should be able to use Alkemy.
Can I build an ML project with just an AI coding agent?
Building an ML project with just an AI coding agent would be risky. Coding agents are known to make mistakes without proper guidance, and it can be difficult and time-consuming to unravel the mess an agent makes when it goes off the rails.
With Alkemy, agents are guided by robust workflows and constraints accompanied by clear experiment history to provide a safe and effective path to production. Alkemy offers teams a complete operating framework for the ML lifecycle: structured projects, reproducible datasets, flexible experiments, version control, and deployment artifacts.
Is it safe to let an agent run code and access data?
When it comes to running code on your computer, most popular coding agents have strong permissions features that let you customize what the agent is and isn't allowed to do on its own. If an agent needs to do something it isn't authorized to do, it will prompt you for permission. This is your opportunity to scrutinize the request and ask follow-up questions about why that action is needed.
When it comes to data access, it's important that you comply with your organization's policies related to AI usage and data access. We recommend starting out with file exports from your data sources rather than letting agents connect directly.
How is Alkemy different from MLflow or Weights & Biases?
Alkemy is designed to manage the entire ML lifecycle end-to-end, not just track experiments. While tools like MLflow or Weights & Biases focus primarily on experiment tracking and metrics, Alkemy provides structured workflows for dataset creation, experimentation, versioning (with Git/DVC), and deployment in a single system. It enforces consistency through configuration and project structure, helping teams build reproducible, production-ready pipelines rather than just logging results. In short, Alkemy is a full workflow framework, not just a tracking tool.