When an MLE Agent Beats Humans, What Does That Actually Mean?

by

Doris Xin

TL;DR

Disarray is an autonomous ML engineering agent that can take a high-level task and independently plan, run, and refine end-to-end ML workflows. In Kaggle competitions, Disarray won 25 medals across diverse domains (vision, NLP, tabular data), placed in the top 10 in seven competitions, and even outperformed all human teams in one competition, all within 24 hours on a single GPU. While autonomous agents are becoming remarkably good at modeling, human MLEs remain essential for business alignment, infrastructure, governance, and real-world accountability. The goal of MLE agents is human augmentation, not replacement.

The concept of fully autonomous agents performing complex tasks is rapidly gaining traction, particularly in machine learning. In particular, the AI research community has been paying close attention to projects like Andrej Karpathy's autoresearch.

For those not familiar with autoresearch, Karpathy’s project automates the ML research trial-and-error cycle. An AI agent is given a small LLM training setup and works overnight to repeatedly modify the train.py code, run a 5-minute experiment, check for improvement, and either keep or discard the change. The human simply provides instructions in program.md, allowing the agent to autonomously refine the model architecture, optimizer, and training loop.
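The core of this keep-or-discard cycle is simple enough to sketch. Below is a toy Python version; `propose_change` and `run_experiment` are hypothetical stand-ins for the LLM's edit to train.py and the 5-minute training run, not Karpathy's actual code:

```python
import random

def propose_change(config):
    """Randomly perturb one hyperparameter (stand-in for an LLM edit to train.py)."""
    new = dict(config)
    key = random.choice(list(new))
    new[key] *= random.choice([0.5, 2.0])
    return new

def run_experiment(config):
    """Toy stand-in for a 5-minute training run: returns a loss, lower is better."""
    return (config["lr"] - 0.01) ** 2 + (config["batch"] - 64) ** 2 * 1e-6

def autoresearch_loop(config, n_trials=50, seed=0):
    """Repeatedly propose a change, evaluate it, and keep it only if it improves."""
    random.seed(seed)
    best_score = run_experiment(config)
    for _ in range(n_trials):
        candidate = propose_change(config)
        score = run_experiment(candidate)
        if score < best_score:  # keep only changes that improve the metric
            config, best_score = candidate, score
    return config, best_score
```

The loop is greedy hill climbing: it never accepts a regression, which is what makes the overnight run safe but also what limits exploration.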

autoresearch is elegant in its simplicity. However, if we try to expand the use case from ML research to a broader set of ML tasks, this approach has three major constraints:

  • Heavy manual scaffolding. It requires a human to meticulously define the experimental workflow in a program.md. To do this well, you need to already know the problem deeply enough to specify the search strategy, architecture and evaluation criteria. The agent isn't an autonomous researcher so much as a high-speed script executor.

  • It doesn't generalize. A program.md written for training language models doesn't transfer to tabular classification or image segmentation. Each new domain requires a new program, written by someone with domain expertise.

  • The specificity itself may hurt performance. Recent work evaluating repository-level context files for coding agents found that detailed instruction files akin to program.md in autoresearch consistently increase inference costs by 20–23% without improving outcomes. Furthermore, highly constrained agents may struggle to identify novel solutions or adapt to unexpected data quirks compared to systems with greater freedom in exploration.

Introducing the Disarray MLE Agent

We took a different approach with Disarray. Our team is developing autonomous machine learning engineering agents, and, rather than requiring the user to write a detailed plan upfront, our MLE agent operates autonomously from problem formulation through model evaluation. You give it a task description and your data ecosystem. It figures out the rest.

The benefits of this approach are twofold. First, our agent can initiate, plan, and execute complex machine learning workflows from a high-level goal, entirely on its own. It autonomously determines necessary steps, including data collection, feature engineering, hyperparameter tuning and performance evaluation. Second, Disarray is not confined to a single, pre-defined task but can apply learned meta-strategies to entirely new problem types, reflecting a more adaptable, intelligence-based approach.

Validation and Performance

To rigorously test the capabilities of this autonomous agent, we deployed it in the highly competitive environment of Kaggle data science competitions. A typical competition runs for a few months. In our evaluation, we gave the agent a modest budget of 24 hours on an instance with a single NVIDIA A100 GPU, to stress-test what it's able to accomplish in a compressed time frame. Here are some examples of Disarray's solutions compared with the top human teams.

Google Research: Identify Contrails to Reduce Global Warming

Disarray achieved gold-medal performance in this 954-team competition. The task was to identify contrail pixels in GOES-16 satellite infrared imagery, a semantic segmentation problem steeped in domain-specific knowledge.

The winning human teams built deep, multi-stage pipelines: U-Net decoders with heavy pretrained encoders (MaxViT, EfficientNet v2, ResNeSt), false-color preprocessing derived from meteorological science for highlighting ice crystals, temporal fusion strategies to handle inter-frame spatial shifts, soft labels to manage noisy multi-annotator disagreements, pseudo-labeling across temporal frames to multiply training data, and ensembles of 10–18 models. 

Disarray used a U-Net with a ResNeSt encoder and the same false-color preprocessing. It applied focal loss combined with Dice loss to handle extreme class imbalance (contrails occupy ~1–2% of pixels), trained at 384×384 resolution with augmentation, and used 4-rotation test-time augmentation at inference. Both the top human solutions and Disarray's solution shared the domain-specific fundamentals that made this competition solvable at all: the false-color preprocessing, the right loss function for extreme imbalance, and the U-Net architecture. Disarray skipped multi-stage resolution escalation and 10+ model ensembles to stay within its time and resource limits.
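The focal-plus-Dice combination is the key to the class-imbalance problem: focal loss down-weights the overwhelming majority of easy background pixels, while Dice loss directly optimizes overlap with the rare contrail pixels. A minimal NumPy sketch of the idea (the weighting and `gamma` here are illustrative defaults, not Disarray's actual settings):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Per-pixel focal loss: down-weights easy, confidently classified pixels."""
    p = np.clip(probs, eps, 1 - eps)
    pt = np.where(targets == 1, p, 1 - p)   # probability assigned to the true class
    return np.mean(-((1 - pt) ** gamma) * np.log(pt))

def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss: 1 minus the overlap ratio, robust to extreme imbalance."""
    inter = np.sum(probs * targets)
    return 1 - (2 * inter + eps) / (np.sum(probs) + np.sum(targets) + eps)

def combined_loss(probs, targets, w_focal=0.5):
    """Weighted sum of the two losses (weight is an illustrative assumption)."""
    return w_focal * focal_loss(probs, targets) + (1 - w_focal) * dice_loss(probs, targets)
```

In a real segmentation pipeline these would be computed on logits with an autograd framework; the sketch only shows why the combination rewards both per-pixel confidence and mask overlap.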

Cassava Leaf Disease Classification

Disarray reached bronze in 13 hours in this 3,900-team image classification competition. The task was to classify cassava plant images into five disease categories despite noisy labels in the training data.

The top human solutions were distinguished by how they handled noise: BiTempered logistic loss (designed specifically for noisy labels), label smoothing, cross-validation noise filtering, ensembles of architecturally diverse models (CNNs and Vision Transformers), and importing external data from a prior Kaggle cassava competition to rebalance the class distribution.

Disarray arrived at the same core recipe. It used BiTempered loss with label smoothing and built a four-model ensemble spanning EfficientNet, ResNeSt, and ViT. The agent discovered empirically that homogeneous ensembles fail on this task: after trying 14+ combinations of EfficientNet variants at different resolutions, it only broke through after switching to architecturally diverse models. This is the same insight the competition winners arrived at. Notably, Disarray also sourced external data from the prior cassava competition, a capability that most MLE agents lack, since they typically operate only on the data provided to them upfront.
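Two of the shared ingredients are easy to sketch: label smoothing, which softens one-hot targets so the model is never forced to fit a possibly wrong label exactly, and probability averaging across architecturally diverse models. This NumPy sketch shows the mechanics only (the `eps` value and uniform ensemble weights are illustrative assumptions; BiTempered loss is omitted here):

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """Replace one-hot targets with softened targets to absorb label noise."""
    one_hot = np.eye(n_classes)[labels]
    return one_hot * (1 - eps) + eps / n_classes

def ensemble_probs(per_model_probs, weights=None):
    """Average class probabilities across models (e.g. CNNs and ViTs)."""
    probs = np.stack(per_model_probs)  # shape: (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.full(len(per_model_probs), 1.0 / len(per_model_probs))
    return np.tensordot(weights, probs, axes=1)  # shape: (n_samples, n_classes)
```

The averaging only helps when the models make different mistakes, which is exactly why the homogeneous EfficientNet-only ensembles plateaued.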

U.S. Patent Phrase to Phrase Matching

Disarray landed in the top 3% of this 1,889-team competition. The task was to predict semantic similarity between pairs of patent phrases given a technical domain context in the form of a CPC classification code.

The winning human teams converged on DeBERTa-v3-large as the universal foundation, with the patent classification context included alongside the phrase pair in the model input and learned attention pooling instead of the standard [CLS] token. The target is discrete (similarity scores of 0.0, 0.25, 0.5, 0.75, 1.0) but treated as regression with MSE loss. What separated the very top from the rest was ensemble diversity: the best solutions combined 4–6 architecturally different models (DeBERTa, RoBERTa, ELECTRA, and the domain-specific PatentSBerta pretrained on patent text) with meta-learners like CatBoost optimizing per-model weights.

Disarray made several consequential modeling decisions on its own: it chose DeBERTa-v3-large, included CPC context in the input (without which models cap around 0.84), used learned attention pooling to weigh specific technical terms more heavily, treated the discrete targets as regression rather than classification, and implemented GroupKFold by anchor to prevent data leakage. It also caught and fixed a subtle mixed-precision bug where training used autocast, but inference didn't, causing a CUDA dtype mismatch that would have silently degraded predictions. Where it diverges from the human winners is in ensemble breadth. Disarray used a single architecture with multi-seed averaging rather than combining 4–6 diverse models, a reasonable tradeoff given the agent's time and compute constraints.
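Learned attention pooling replaces the single [CLS] vector with a weighted sum over all token states, letting the model emphasize the technical terms that matter. A minimal NumPy sketch of the forward pass (in a real model the scoring vector `w` is a trained parameter, not random as it is here):

```python
import numpy as np

def attention_pool(hidden_states, w):
    """Pool token embeddings with learned attention scores instead of [CLS].

    hidden_states: (seq_len, dim) final-layer token states
    w:             (dim,) learned scoring vector
    """
    scores = hidden_states @ w                    # one relevance score per token
    scores -= scores.max()                        # shift for numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over tokens
    return attn @ hidden_states                   # weighted sum, shape (dim,)
```

The pooled vector then feeds a small regression head trained with MSE against the discrete similarity targets, with `w` learned jointly with the encoder.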

Tabular Playground Series: May 2022

Disarray ranked second out of 1,115 teams on the competition leaderboard. The task was binary classification over simulated manufacturing features.

The top human solutions converged on the same recipe: decomposing a high-cardinality categorical feature (a 10-character string) into individual character features, engineering ternary interaction features from the numeric columns, and building a dual-input Keras neural network with Mish activation that processed the two feature groups through separate branches before merging. Gradient boosting hit a hard ceiling around 0.994–0.996 AUC regardless of feature engineering. The final push to 0.998+ required neural networks.

Disarray started with gradient boosting, hit the same 0.996 ceiling, and pivoted to a dual-input neural network with feature decomposition and interaction features similar to top human solutions. It also discovered through failed experiments that over-engineering features (expanding to 95 with polynomial interactions) actively degraded performance.
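The two feature-engineering steps are straightforward to sketch: split the 10-character string into one ordinal feature per position, and form three-way products of numeric columns. This is an illustrative sketch (the ordinal encoding and the product form of the interactions are assumptions; the exact interactions used by the top solutions may differ):

```python
from itertools import combinations

def decompose_string_feature(s):
    """Split a fixed-length categorical string into per-position ordinal features."""
    return [ord(c) - ord('A') for c in s]

def ternary_interactions(row):
    """Three-way products of numeric features, one per column triple."""
    return [row[i] * row[j] * row[k]
            for i, j, k in combinations(range(len(row)), 3)]
```

Note how quickly ternary interactions grow: n columns yield C(n, 3) features, which is exactly how Disarray's failed 95-feature experiment over-expanded the input space.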

Overall, Disarray participated in Kaggle competitions across image classification, NLP, tabular regression, and object detection. It achieved at least a bronze medal in 25 competitions. It landed in the top ten on the leaderboard in seven competitions and achieved better-than-human performance in one competition. These results validate Disarray's capacity to function as a truly autonomous MLE partner. It represents a major step toward developing AI systems that can independently contribute to the machine learning discovery process.

Can agents replace human MLEs?

Despite these impressive achievements, agents are not yet ready to replace human MLEs for two reasons:

1. The Necessity of Human-in-the-Loop for Production Success

While Disarray can autonomously optimize for metrics like AUC, F1-score or RMSE, achieving high performance during offline evaluation is only one (and often the simplest) component of a successful production deployment. A human MLE must remain in the loop because there are critical factors besides model performance that determine success in a real-world system.

First, models operate within complex social and business ecosystems. An offline evaluation cannot capture the nuances of user behavior, legal compliance, ethical considerations or the alignment of the model's output with the company's long-term strategic goals. A human MLE provides this essential contextual layer, ensuring the model's actions are appropriate, responsible, and aligned with organizational values.

And then there’s the need for accountability and governance. When a deployed model makes an error, a human must be accountable, especially when the error involves a significant financial, medical or social impact. An AI agent cannot take responsibility for a system failure or an unintended consequence. The human MLE is the necessary locus of accountability, overseeing the system's performance, maintenance and ethical operation. This concept is explored more fully in our research paper, "Whither AutoML? Understanding the Role of Automation in Machine Learning Workflows", which discusses the necessity of human oversight for accountable AI systems.

2. The Job Function of an MLE Extends Far Beyond Modeling

The second, equally important reason is that being excellent at modeling is not synonymous with being a good MLE. The core job function of an MLE is a multifaceted role that encompasses the entire Machine Learning lifecycle, not just the model building and training phase.

An MLE is responsible for defining data schemas, managing feature stores, architecting the data pipelines that feed the model and ensuring the infrastructure can handle real-time inference at scale with low latency and high availability. 

Also, the transition from a trained model artifact to a production service, a discipline known as MLOps (Machine Learning Operations), requires expertise in CI/CD pipelines, containerization, orchestration, canary deployments, rollback strategies and shadow mode testing. These engineering tasks are outside the scope of even the most advanced automated modeling engines.

What’s more, a model's performance inevitably degrades over time. An MLE implements monitoring systems to detect this decay, manages alert systems, designs A/B tests to validate model versions and architects the feedback loops necessary for continuous improvement. The MLE must manage the ongoing, dynamic reality of a production service.
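One common building block for this kind of monitoring is a drift statistic over the model's input or score distribution, such as the population stability index. A minimal sketch (the decile binning and the conventional PSI > 0.2 alert threshold are standard rules of thumb, not something Disarray or this article prescribes):

```python
import numpy as np

def population_stability_index(expected, observed, bins=10):
    """PSI between a training-time reference distribution and live production data.

    Bins are deciles of the reference; a common rule of thumb flags
    PSI > 0.2 as significant drift worth investigating.
    """
    # interior decile edges of the reference distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e = np.bincount(np.searchsorted(edges, expected), minlength=bins) / len(expected)
    o = np.bincount(np.searchsorted(edges, observed), minlength=bins) / len(observed)
    e, o = np.clip(e, 1e-6, None), np.clip(o, 1e-6, None)  # avoid log(0)
    return float(np.sum((o - e) * np.log(o / e)))
```

In production this would run on a schedule against a rolling window of live traffic, feeding the alerting and retraining loops described above.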

In short, while Disarray's Kaggle performance represents a monumental leap in the automated modeling component of the ML lifecycle, a vast and complex landscape of engineering, strategy, ethics and operations remains the exclusive domain of the human MLE.

In a future article, we will elaborate on Disarray's engineering capabilities, what it contributes to other parts of the ML lifecycle, and how we are developing more holistic, production-centric evaluation frameworks that move beyond simple offline metrics to assess its true utility.
