AI Coding Agents Alone vs Alkemy for ML Projects

AI coding agents like Claude Code and Codex bring a huge amount of machine learning knowledge to the table, but every session starts from an empty chat: they don’t inherently know what success looks like, what decisions were made last week, or how to navigate scattered notebooks and scripts.

Alkemy was built to give data scientists a complete operating framework for the ML lifecycle: structured projects, reproducible datasets, flexible experiments, version control, and deployment artifacts.

That’s why AI coding agents work so well with Alkemy: point an agent at an Alkemy project and the conventions, the experiment history and the path to production are already there to follow.

Across the ML lifecycle, here's where AI coding agents struggle on their own, where teams typically patch the gaps, and where Alkemy makes the difference:

| Stage / Task | Coding Agent Alone | Coding Agent + Ad-Hoc Scaffolding | Coding Agent + Alkemy |
| --- | --- | --- | --- |
| Project setup | Agents generate a project structure from scratch every time; context switching bites hard | Team writes a project template agents follow; works until the template needs to evolve | Agents inherit a standard project layout with built-in conventions for code, config, and artifacts |
| Building datasets | Each dataset is a one-off script; agents struggle to validate datasets and connect them to experiments when usage is scattered across code | Team adds naming conventions and shared utilities; agents have to be told about them every time | Datasets are validated objects agents can reference by name across sessions |
| Running experiments | Experiments are one-off scripts that print metrics; the agent forgets them when the session ends | Team adds MLflow or Weights & Biases, but every script is still bespoke and requires extensive boilerplate | Experiments and artifacts are tracked automatically, with results and feedback recorded so agents can use prior runs to guide iteration |
| Comparing to a benchmark | Agent has no notion of "the model to beat"; each run is judged in isolation | Team manually maintains a benchmark in a doc; agents have to be told about it every time and don’t automatically know which experiments to compare | Benchmarks and prior runs live inside the project structure, so agents can compare new results against the current bar |
| Deployment | Agent writes a separate inference script; production code silently diverges from experiment code | Team builds a deployment pipeline, but inference code must be kept in sync with experiment logic | Deployment uses the same code path as experiments: no rewrite and no drift |
| Handing off mid-project | New agent session or new teammate starts from the README and a repo full of files | New session can follow the team's conventions if they're documented | New session inherits structure, history, and current state from the framework itself |
| Ongoing maintenance | Structure degrades as work accumulates | Team must maintain and evolve scaffolding alongside the ML work | Lifecycle structure is part of the system, not something the team has to build and maintain |
Where AI Coding Agents Alone Are Enough:

Any kind of ML work that’s ad-hoc, exploratory, or short-lived, where results don’t need to be repeated, extended, or shipped to production.

  • Exploratory analysis of relationships between potential inputs and outputs

  • Evaluating an ML or data analysis library you haven’t used before

  • Short spikes that inform planning for future data products

  • One-off models or scripts that are not expected to be maintained or reused

Where They Break Down In Real ML Work:

Any kind of ML work that evolves over time, especially when results need to be reproducible, compared across iterations, and shipped to production.

  • Work loses continuity across sessions, making it hard to build on prior decisions and results

  • Ongoing experimentation with no clear view of what has been tried or what works best

  • Dataset logic gets scattered across scripts, and lineage becomes unclear

  • Experiment code diverges from production code, requiring rewrites and increasing the risk of drift

Benchmarking

Setup: Same model (GPT-5.4 in Codex), same one-line prompt ("I need to build a fraud detection model"), and the same public Kaggle transaction dataset. The agent-alone run finished in about 15 minutes with no user involvement after the first prompt. The agent + Alkemy run took about 6 hours of agent runtime and 1 to 2 hours of our time answering the agent's intake questions (precision target, deployment mode, historical-feature availability) and providing short steering inputs later (faster libraries, scrutinize optimistic outputs, more stress testing). We redirected the agent but did not write code or make technical decisions.

Two terms used below: Transaction-only features are computable by the API at scoring time (amount, merchant category, card type). Historical features require looking up a customer's or card's past activity at scoring time (prior merchants used, recent average spend). Historical features are often more predictive for fraud but need extra engineering to serve fresh values in production. Keeping them separate makes the choice a stakeholder investment question rather than a default.
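To make the distinction concrete, here's a minimal sketch of the two feature sets. All field names and functions are hypothetical illustrations, not Alkemy APIs or the actual dataset schema:

```python
# Hypothetical sketch: two feature sets for a fraud-scoring API.
# Transaction-only features come straight from the request payload;
# historical features need a lookup into the card's past activity,
# which must be served fresh in production (extra engineering).

def transaction_only_features(txn: dict) -> dict:
    """Computable from the transaction alone at scoring time."""
    return {
        "amount": txn["amount"],
        "merchant_category": txn["merchant_category"],
        "card_type": txn["card_type"],
    }

def historical_features(txn: dict, history: list) -> dict:
    """Require looking up past activity at scoring time."""
    amounts = [h["amount"] for h in history[-30:]]  # recent window
    return {
        "recent_avg_spend": sum(amounts) / len(amounts) if amounts else 0.0,
        "seen_merchant_before": any(
            h["merchant_category"] == txn["merchant_category"] for h in history
        ),
    }

txn = {"amount": 120.0, "merchant_category": "electronics", "card_type": "credit"}
history = [{"amount": 40.0, "merchant_category": "grocery"},
           {"amount": 80.0, "merchant_category": "electronics"}]

features = {**transaction_only_features(txn), **historical_features(txn, history)}
```

Keeping the two functions separate makes the serving cost explicit: the first works in any deployment mode, while the second commits the team to maintaining a fresh activity store.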

| Component | Agent Alone | Agent + Alkemy |
| --- | --- | --- |
| Problem intake | Started coding in turn 1. Assumed a 10% precision floor. | Asked about the precision target (30%), deployment mode (API), and historical-feature availability. Planned experiments to evaluate the historical-features business case. |
| Features | One set, transaction-only by default. | Transaction-only and transaction-plus-historical tracked separately. |
| Validation | Single chronological 80/20 holdout. | Time-aware CV across five annual folds with a buffer gap, plus sliding-window and 30-day-gap stress tests. |
| Headline metric | PR-AUC ≈ 0.71 on one slice. | PR-AUC ≈ 0.46 under time-aware CV (transaction-only tuned LightGBM). The same model scored ≈ 0.82 on a random split, which was flagged as too optimistic. |
| Robustness | None. | Found a 2017 fraud-pattern shift where most candidate models caught almost no fraud. Documented as a production risk. |
| Artifacts | Training script, scoring script, README. | Validated dataset, decision journal, 18 experiments, feature importances, stakeholder-ready business-case report. |
| Handoff | Re-derivable only by rereading the script. | Brief + journal + experiments carry decisions forward. |
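The time-aware validation in the table can be sketched as follows. This is a generic illustration of chronological folds with a buffer gap, not Alkemy's actual implementation, and the fold layout is an assumption:

```python
# Illustrative sketch of time-aware cross-validation with a buffer gap:
# each fold trains only on earlier rows, skips a gap, then evaluates on
# a later window. A random split would leak future patterns into
# training, which is why it looks optimistic on temporal fraud data.

def time_aware_folds(timestamps, n_folds=5, gap=30):
    """Yield (train_idx, test_idx) pairs over chronologically ordered rows.

    `gap` rows between train and test approximate a buffer period, so
    the model never trains on data adjacent to its evaluation window.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    fold_size = len(order) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = order[: k * fold_size - gap]
        test = order[k * fold_size : (k + 1) * fold_size]
        yield train, test

# Usage: 600 daily rows, 5 folds, a 30-row gap before each test window.
ts = list(range(600))
folds = list(time_aware_folds(ts, n_folds=5, gap=30))
```

Every train index precedes every test index in each fold, which is the property the single random split in the agent-alone run does not guarantee.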

Takeaway: Both runs produce a working model. Only the Alkemy run produces the journal, the transaction-only vs transaction-plus-historical investment split, the time-aware validation that showed the single-slice number was too optimistic, the 2017 drift finding, and the business-case report a team needs to justify putting historical features on their engineering roadmap.

Fairness disclosures: Both runs used GPT-5.4 in Codex, the same raw data, and the same one-line prompt. The Alkemy run also received short user redirections as the agent raised questions, but no technical contributions. Validation splits differ between the two runs, so the predictive numbers are directional, not head-to-head. The agent-alone threshold was chosen against a default 10% precision floor because it did not ask for a business target. With Alkemy, the 30% target came from intake. The dataset was normalized from its original JSON and CSV format to all-CSV with renamed columns to obfuscate the source. The underlying data was not altered.
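Picking an operating threshold against a precision target, as in the intake above, can be sketched like this. The scores and labels are made up for illustration; neither run's actual code or data is shown:

```python
# Sketch: find the lowest score threshold whose precision meets a
# target (e.g. the 30% target from intake, vs the assumed 10% floor).
# Real runs would use held-out predictions, not these toy values.

def threshold_for_precision(scores, labels, target):
    """Return the lowest threshold with precision >= target, else None."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    best = None
    tp = fp = 0
    for score, label in pairs:
        tp += label
        fp += 1 - label
        if tp / (tp + fp) >= target:
            best = score  # flagging everything >= score meets the target
    return best

scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40]
labels = [1, 1, 0, 1, 0, 0, 0]  # 1 = fraud
loose = threshold_for_precision(scores, labels, target=0.30)
strict = threshold_for_precision(scores, labels, target=0.80)
```

A looser target admits a lower threshold (more flagged transactions, more false positives), which is why the assumed 10% floor and the stated 30% target lead to different operating points.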

Ready to get in the game? Get in touch to take the next step in your data journey.
