What I Learned at ICML

At this year’s International Conference on Machine Learning (ICML 2025) in Vancouver, one theme stood out: how do we take advances in machine learning and make them useful, trustworthy, and adaptable in real-world settings?
From evaluating large language models (LLMs) on business-critical tasks to exploring the future of AI alignment and copyright law, here are some of the most compelling workshops and tutorials I attended.
Improving LLM Benchmarks: Making AI Work for Real-World Needs (Jonathan Siddharth)

As frontier models grow in capability, the data required to test them meaningfully becomes harder to generate. This session focused on the need to evolve beyond synthetic benchmarks and evaluate models based on how well they solve real-world problems.
Jonathan proposed incorporating real user feedback and domain-specific scenarios into benchmark design, ensuring that language models are tested on practical reasoning and applied conceptual understanding. The goal is to reduce the gap between academic evaluation and industry relevance by creating benchmarks that measure utility, not just performance on abstract tasks.
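To make that concrete, here is a toy sketch of what a utility-oriented benchmark item and scorer might look like. The schema, fields, and substring-based scoring are my own illustrative assumptions, not anything proposed in the session:

```python
from dataclasses import dataclass, field

# Hypothetical schema for a "real-world" benchmark item: the task comes
# from an actual user workflow, and scoring checks domain-specific
# acceptance criteria rather than string-matching one gold answer.
@dataclass
class BenchmarkItem:
    scenario: str                      # real user-submitted task description
    domain: str                        # e.g. "legal", "finance", "support"
    acceptance_criteria: list[str] = field(default_factory=list)

def score_response(item: BenchmarkItem, response: str) -> float:
    """Fraction of domain-specific criteria the response satisfies.

    A real harness would use human raters or an LLM judge per criterion;
    substring matching here just keeps the sketch self-contained.
    """
    met = sum(1 for c in item.acceptance_criteria if c.lower() in response.lower())
    return met / max(len(item.acceptance_criteria), 1)

item = BenchmarkItem(
    scenario="Draft a refund policy summary for a SaaS customer on the Pro plan.",
    domain="support",
    acceptance_criteria=["refund window", "pro plan", "contact"],
)
print(score_response(item, "Pro plan users have a 30-day refund window; contact support."))
```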
Foundation Models for Automated Trading (Hudson River Trading)

Hudson River Trading shared how deep learning models operate at the core of their global trading systems. These models ingest terabytes of noisy, high-frequency financial data across all asset classes and must make accurate predictions under adversarial conditions, regime shifts, and ultra-low latency requirements.
The session covered the construction of foundation models trained on datasets equivalent to trillions of tokens, and emphasized the modeling, engineering, and regularization challenges involved in supporting real-time decision-making, robust liquidity provision, and price discovery in dynamic markets.
Calibration and Bias in Algorithms, Data, and Models (Mark Tygert)

This tutorial addressed how to rigorously measure calibration, fairness, and reliability in machine learning models, particularly across subpopulations. Rather than comparing average outcomes alone, it advocated comparing individual outcomes by conditioning on confounding covariates such as age or income.
Mark Tygert presented parameter-free graphical techniques and scalar summary statistics that avoid misleading adjustments and subjective thresholds (e.g., how close is “close” when matching individuals). These methods apply to both observational studies and randomized controlled trials, and are especially effective for assessing the outputs of machine-learned prediction systems.
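As a rough, self-contained illustration of the cumulative-differences idea behind these graphical methods, here is a simplified sketch. The synthetic data and the exact form of the statistic are my own assumptions, not Tygert’s precise formulation:

```python
import numpy as np

# Sort by predicted score, then accumulate the gap between observed
# outcomes and predictions. A calibrated model's curve hovers near zero;
# steep sustained slopes flag miscalibrated score ranges, with no need
# to choose bin widths or "closeness" thresholds.
def cumulative_calibration(scores: np.ndarray, outcomes: np.ndarray):
    order = np.argsort(scores)
    gaps = outcomes[order] - scores[order]
    curve = np.cumsum(gaps) / len(scores)
    kuiper = curve.max() - curve.min()   # scalar summary statistic
    return curve, kuiper

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, 10_000)
outcomes = (rng.uniform(0, 1, 10_000) < scores).astype(float)  # calibrated by construction
_, stat = cumulative_calibration(scores, outcomes)
print(f"Kuiper-style statistic: {stat:.4f}")  # near 0 for a calibrated model
```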
Generative AI Meets Reinforcement Learning (Amy Zhang & Benjamin Eysenbach)
This tutorial explored the conceptual bridge between generative modeling and reinforcement learning (RL), arguing that generative models can be interpreted as RL agents and vice versa. It discussed how RL frameworks can guide generative model training, how generative AI can inspire new RL algorithms, and how agent-environment interactions, including tool use and human collaboration, can redefine generative objectives.
Future directions include using reinforcement learning to help generative systems construct their own knowledge, pushing the boundaries of autonomy and generalization.
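As a toy illustration of the “generative model as RL agent” framing, here is a minimal sketch in which sampling from a categorical model is treated as a policy rollout and a scalar reward on the finished sample drives a REINFORCE update. The task and hyperparameters are invented for the example; RLHF-style fine-tuning follows the same template with a learned reward model and a transformer policy:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)                      # "model": one categorical over 4 tokens

def reward(seq):                          # assumed task: prefer token 2
    return float(np.mean(np.array(seq) == 2))

for step in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    seq = rng.choice(4, size=8, p=probs)  # sample a "generation" (trajectory)
    r = reward(seq)
    # REINFORCE: raise the log-prob of sampled tokens in proportion to reward
    grad = np.bincount(seq, minlength=4) / len(seq) - probs
    logits += 0.5 * r * grad

print(np.round(np.exp(logits) / np.exp(logits).sum(), 3))  # mass concentrates on token 2
```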
Mechanistic Interpretability for Language Models (Ziyu Yao & Daking Rai)

“Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. However, there has not been work that comprehensively reviews these insights and challenges, particularly as a guide for newcomers to this field.” — Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
This tutorial provided a structured roadmap into MI, especially for transformer-based language models.
The session covered foundational techniques, recent discoveries, and the challenges of scaling MI to modern foundation models.
It was especially valuable for newcomers, offering a beginner-friendly curriculum to help researchers apply MI to practical model debugging, transparency, and interpretability tasks.
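For a taste of the kind of technique covered, here is a minimal “logit lens” probe that decodes each layer’s residual stream through the model’s unembedding to see where a prediction emerges. This is a generic sketch of a standard MI method, not the tutorial’s own code; the prompt and the Hugging Face GPT-2 checkpoint are my choices:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

ids = tok("The Eiffel Tower is in the city of", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# Decode the last position of every layer's hidden state through the
# final layer norm and unembedding, revealing the model's "best guess"
# at each depth of the computation.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, tok.decode(logits.argmax(-1)))
```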
Graph Foundation Models: Thoughts and Results (Michael Galkin, Research Scientist, Google Research, and Pramod Doguparty, Software Engineer, Google Ads)

Treating relational tables as interconnected graphs, and building on advances in graph learning, makes it possible to train foundation models that generalize to arbitrary tables, features, and tasks.
This session introduced Graph Foundation Models (GFMs): general-purpose models trained to learn transferable representations across diverse graph structures and tasks. GFMs aim to replace traditional task-specific graph learning approaches by enabling broad generalization.


The talk discussed successful applications in link prediction and node classification, while also acknowledging key challenges such as feature heterogeneity and task diversity. It also examined how LLMs could benefit from incorporating graph-structured information to enhance reasoning capabilities.
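For intuition about the underlying task family, here is a minimal two-layer GCN-style forward pass for node classification. This is a generic sketch with random weights on a four-node toy graph, not Google’s GFM architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1, 1, 0, 0],          # adjacency with self-loops
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt  # symmetric normalization

X = rng.normal(size=(4, 8))          # node features
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 3))

H = np.maximum(A_hat @ X @ W1, 0)    # layer 1: aggregate neighbors + ReLU
logits = A_hat @ H @ W2              # layer 2: per-node class scores
print(logits.argmax(1))              # predicted class per node
```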

AI’s Models of the World, and Ours (Jon Kleinberg)
Jon Kleinberg explored the tension between an AI system’s internal model of the world and a human user’s understanding. When generative systems are evaluated only on external outputs, mismatches in implicit representation can lead to systemic failure, for example in collaborative tasks like navigation or gameplay.
The talk used case studies to show how such mismatches manifest, and discussed theoretical results indicating that successful generation does not guarantee accurate world modeling. Understanding both the explicit behavior and the implicit assumptions of generative systems is crucial to safe deployment.
Looking into his research afterwards, I found his talk with the Schwartz Reisman Institute particularly interesting.
Batch Normalization: Accelerating Deep Network Training (Sergey Ioffe & Christian Szegedy)

This session revisited the landmark technique of Batch Normalization, which addresses the problem of internal covariate shift: the changing distribution of layer inputs during training. By normalizing activations within each mini-batch, BatchNorm enables higher learning rates, faster convergence, and more stable training dynamics.
The method also acts as an implicit regularizer and has been shown to outperform Dropout in certain settings. Applied to state-of-the-art image classification tasks, BatchNorm enabled significant performance gains and training efficiency, including breakthroughs on ImageNet.
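For reference, here is a minimal NumPy sketch of the training-time forward pass (at inference, running averages replace the per-batch statistics):

```python
import numpy as np

# Normalize each feature over the mini-batch, then restore expressiveness
# with the learned scale (gamma) and shift (beta) parameters.
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                # per-feature batch mean
    var = x.var(axis=0)                  # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))    # a mini-batch of activations
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```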
Adaptive Alignment: Designing AI for a Changing World (Frauke Kreuter)
As AI systems increasingly influence institutions and society, the alignment problem, ensuring that models reflect human values, becomes both more urgent and more complex. Frauke Kreuter proposed leveraging underused datasets from decades of public surveys and international value studies to anchor alignment in empirical social norms.
She emphasized the importance of designing adaptive alignment strategies that can respond to cultural and temporal change. The talk highlighted pitfalls such as framing effects and unrepresentative sampling, and called for long-term collaboration between machine learning researchers and social scientists to build robust human feedback loops.
What to Optimize For, From Robot Arms to Frontier AI (Anca Dragan)
Anca Dragan posed a foundational question: not how to optimize, but what to optimize. From robotic arms to self-driving cars to foundation models like Gemini, the session emphasized that reward specification, defining what we want the system to do, is often the hardest and most consequential design choice.
She discussed her lab’s work on reward learning and value alignment, and the risks of unintended side effects when optimization targets are misaligned with human intent. The talk urged AI developers to think deeply about objective design, especially in systems that interact closely with people.
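A tiny, invented example of why this matters: two plausible reward functions for the same task can rank the same behaviors in opposite orders.

```python
# Toy illustration (my example, not from the talk): for a cleaning robot,
# a measurable proxy and the intended objective disagree about which
# policy is "best", which is exactly where unintended side effects enter.
policies = {
    "clean once, then idle":    {"dirt_collected": 10, "room_clean_hours": 8},
    "dump and re-collect dirt": {"dirt_collected": 50, "room_clean_hours": 1},
}

proxy = lambda m: m["dirt_collected"]          # easy to measure, easy to game
intended = lambda m: m["room_clean_hours"]     # what we actually want

print("proxy picks:   ", max(policies, key=lambda p: proxy(policies[p])))
print("intended picks:", max(policies, key=lambda p: intended(policies[p])))
```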
Closing the Loop: Machine Learning for Optimization and Discovery (Andreas Krause)
In domains like scientific discovery, where experimentation is expensive and uncertainty is high, data efficiency is paramount. Andreas Krause presented methods for intelligent exploration using Bayesian optimization, active learning, and meta-learned generative priors.
He discussed how to guide search in high-dimensional spaces, steer foundation models at test time to reduce epistemic uncertainty, and adapt insights from simulation to real-world environments. These techniques are enabling closed-loop systems that accelerate breakthroughs in protein engineering, robotics, and beyond.
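To illustrate the flavor of these methods, here is a minimal Bayesian optimization loop with a from-scratch Gaussian process and expected improvement. It is a generic sketch of the technique with an invented one-dimensional objective, not Krause’s specific algorithms:

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls=0.2):                             # GP kernel
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def expensive_experiment(x):                       # stand-in black-box objective
    return -np.sin(3 * x) - x ** 2 + 0.7 * x

X = np.array([0.0, 1.5]); y = expensive_experiment(X)
grid = np.linspace(-1.0, 2.0, 400)

for _ in range(8):
    K = rbf(X, X) + 1e-6 * np.eye(len(X))
    Ks = rbf(grid, X)
    mu = Ks @ np.linalg.solve(K, y)                # GP posterior mean
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    sd = np.sqrt(np.maximum(var, 1e-12))
    imp = mu - y.max()
    z = imp / sd
    ei = imp * norm.cdf(z) + sd * norm.pdf(z)      # expected improvement
    x_next = grid[np.argmax(ei)]                   # most promising experiment
    X = np.append(X, x_next); y = np.append(y, expensive_experiment(x_next))

print(f"best x: {X[y.argmax()]:.3f}, best value: {y.max():.3f}")
```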
He also has a related talk on YouTube, hosted by the Institute for Experiential AI.
Generative AI’s Collision with Copyright Law (Pamela Samuelson)

With over 40 lawsuits pending, copyright has become a central issue for generative AI development. Pamela Samuelson discussed legal frameworks across the U.S., EU, Japan, and Canada, and questioned whether fair use, text and data mining exceptions, or collective licensing regimes will govern future model training.
The talk emphasized the importance of legal literacy in the ML community, as developers and researchers will need to participate in shaping the norms and laws that define acceptable data use. The goal is to balance innovation, research, and respect for creators’ rights.
Final Thoughts
ICML 2025 made clear that the future of machine learning is about more than scaling models; it’s about scaling responsibility, generalization, and societal impact. Whether it’s building trustworthy benchmarks, designing adaptive alignment loops, or navigating the legal future of AI, the conversation is shifting from what models can do to what they should do, and how we know they’re doing it right.
It was also my first time travelling solo to Vancouver, and it was so great to catch up with friends in the area and learn from people around the world who are also passionate about AI & ML.
