Research

Below is a showcase of papers which have used PySR to discover or rediscover a symbolic model. These are sorted by the date of release, with most recent papers at the top.

If you have used PySR in your research, please submit a pull request to add your paper to this file.

Discovering data-driven microbial growth models with symbolic regression

T. Anthony Sun ¹, Dovydas Kičiatovas ¹, Inga-Katariina Aapalampi ², Teemu Kuosmanen ¹, Teppo Hiltunen ², Ville Mustonen ¹

¹Department of Organismal and Evolutionary Biology & Department of Computer Science, University of Helsinki, ²Department of Biology, University of Turku

Abstract: 1. Connecting mathematical models with empirically measured microbial growth has remained challenging, as numerous competing models based on different theoretical approaches can fit observations. Therefore, we develop a method to automatically propose growth models from microbial data alone. We validate this approach using an available dataset of E. coli grown on known resources, and study 14 species across various concentrations of a rich medium. 2. The inherently interpretable approach of symbolic regression infers explicit dynamical models directly from growth data. Using symbolic regression natively does not favour biologically interpretable models, but we find cumulative population gain to be a more informative machine learning feature than population size. 3. Random Forest machine learning allows us to relate this finding to the approximation of a constant-rate per capita resource consumption. This suggests that the area under the growth curve (AUC) measured in routine experiments provides information on the effective resource dynamics governing microbial growth. Finally, we use theoretical insights to inform the symbolic regression algorithm and favour biologically interpretable models. 4. Overall, we found that balancing between data fit, parsimony and biological relevance favoured both the simplest, linear approximation and models based on Monod dynamics, with either one or two underlying resources. Therefore, our approach to read growth laws off of microbial batch cultures provides insights on data-driven modelling.

Distilling human mobility models with symbolic regression

Hao Guo ¹, Weiyu Zhang ¹, Junjie Yang ¹, Yuanqiao Hou ¹, Lei Dong ¹, Yu Liu ¹

¹Peking University

Abstract: Human mobility is a fundamental aspect of social behavior, with broad applications in transportation, urban planning, and epidemic modeling. Represented by the gravity model and the radiation model, established analytical models for mobility phenomena are often discovered by analogy to physical processes. Such discoveries can be challenging and rely on intuition, while the potential of emerging social observation data in model discovery is largely unexploited. Here, we propose a systematic approach that leverages symbolic regression to automatically discover interpretable models from human mobility data. Our approach finds several well-known formulas, such as the distance decay effect and classical gravity models, as well as previously unknown ones, such as an exponential-power-law decay that can be explained by the maximum entropy principle. By relaxing the constraints on the complexity of model expressions, we further show how key variables of human mobility are progressively incorporated into the model, making this framework a powerful tool for revealing the underlying mathematical structures of complex social phenomena directly from observational data.

An Engineering Model for Static Yawed Wind Turbines Based on Actuator Line Simulations and Symbolic Regression

Haoyuan Sun ¹, Andrea Sciacchitano ¹, Wei Yu ¹

¹Faculty of Aerospace Engineering, Delft University of Technology

Abstract: Yaw engineering models are commonly used as add-ons to the industrial Blade Element Momentum (BEM) framework to improve load and power predictions by accounting for the skewed wake effect. However, existing yaw engineering models show noticeable limitations in accurately predicting the induced velocity distribution across the blade span. In this study, we employ a genetic symbolic regression approach to develop a new set of yaw engineering models for both the normal and tangential induced velocities of a static yawed wind turbine. The model regression is performed using simulation data from Reynolds-Averaged Navier-Stokes (RANS) simulations with an actuator line model (ALM) of the NREL 5 MW wind turbine, covering a range of yaw angles ( $γ$ ) and thrust coefficients ( $C_{T}$ ) over which the skewed wake effect is dominant. The regressed models are selected based on an optimal trade-off between accuracy and complexity, with complexity constrained to remain comparable to Branlard's yaw engineering model. The selected models are subsequently verified using three unseen cases that span different operating conditions and wind turbine models. Verification is performed through a series of evaluations, including generalization performance tests, implementation within the BEM framework to assess their aerodynamic performances, and quantitative errors and loading analyses. The results demonstrate that the proposed models improve both the amplitude accuracy and azimuthal phase of induced velocities compared to the existing models of Coleman and Branlard, enabling them to accurately capture the phase of the peak aerodynamic forces across each annulus and to predict the non-restoring yaw moment occurring in the inboard region of the turbine, which other models fail to reproduce.

Discovering parametrizations of implied volatility with symbolic regression

Martin Keller-Ressel ¹,², Hannes Nikulski ¹

¹Department of Mathematics, TU Dresden, ²ScaDS.AI (Center for Scalable Data Analytics and Artificial Intelligence) Dresden/Leipzig

Abstract: We investigate the data-driven discovery of parametric representations for implied volatility slices. Using symbolic regression, we search for simple analytic formulas that approximate the total implied variance as a function of log-moneyness and maturity. Our approach generates candidate parametrizations directly from market data without imposing a predefined functional form. We compare the resulting formulas with the widely used SVI parametrization in terms of accuracy and simplicity. Numerical experiments indicate that symbolic regression can identify compact parametrizations with competitive fitting performance.

Discovering mathematical concepts through a multi-agent system

Daattavya Aggarwal ¹, Oisin Kim ¹, Carl Henrik Ek ¹, Challenger Mishra ¹

¹Department of Computer Science & Technology, University of Cambridge

Abstract: Mathematical concepts emerge through an interplay of processes, including experimentation, efforts at proof, and counterexamples. In this paper, we present a new multi-agent model for computational mathematical discovery based on this observation. Our system, conceived with research in mind, poses its own conjectures and then attempts to prove them, making decisions informed by this feedback and an evolving data distribution. Inspired by the history of Euler's conjecture for polyhedra and an open challenge in the literature, we benchmark with the task of autonomously recovering the concept of homology from polyhedral data and knowledge of linear algebra. Our system completes this learning problem. Most importantly, the experiments are ablations, statistically testing the value of the complete dynamic and controlling for experimental setup. They support our main claim: that the optimisation of the right combination of local processes can lead to surprisingly well-aligned notions of mathematical interestingness.

Learning Microstructure in Active Matter

Writu Dasgupta ¹, Suvendu Mandal ¹, Aritra K. Mukhopadhyay ¹, Benno Liebchen ¹

¹Institute for Condensed Matter Physics, Technische Universität Darmstadt

Abstract: Understanding microstructure in terms of closed-form expressions is an open challenge in nonequilibrium statistical physics. We propose a simple and generic method that combines particle-resolved simulations, deep neural networks, and symbolic regression to predict the pair-correlation function of passive and active particles. Our analytical closed-form results closely agree with Brownian dynamics simulations, even at relatively large packing fractions and for strong activity. The proposed method is broadly applicable, computationally efficient, and can be used to enhance the predictive power of nonequilibrium continuum theories and for designing pattern formation.

Tantalizing Evidence of Reionization Relics in the eBOSS DR16 Lyα Forest Correlations: a Preference for Early Reionization

Yifan Zheng ¹, Paulo Montero-Camacho ², Zheng Cai ¹,², Yi Mao ¹

¹Tsinghua University, ²Peng Cheng Laboratory

Abstract: Cosmic reionization of HI leaves enduring relics in the post-reionization intergalactic medium, potentially influencing the Lyman-α (Lyα) forest down to redshifts as low as z≈2, which is the so-called "memory of reionization" effect. Here, we re-analyze the baryonic acoustic oscillation (BAO) measurements from Lyα absorption and quasar correlations using data from the extended Baryonic Oscillation Spectroscopic Survey (eBOSS) Data Release 16 (DR16), incorporating for the first time the memory of reionization in the Lyα forest. Three distinct scenarios of reionization timeline are considered in our analyses. We find that the recovered BAO parameters (α∥, α⊥) remain consistent with the original eBOSS DR16 analysis. However, models incorporating reionization relics provide a better fit to the data, with a tantalizing preference for early reionization, consistent with recent findings from the James Webb Space Telescope. Furthermore, the inclusion of reionization relics significantly impacts the non-BAO parameters. For instance, we report deviations of up to 3σ in the Lyα redshift-space distortion parameter and ∼7σ in the linear Lyα bias for the late reionization scenario. Our findings suggest that the eBOSS Lyα data is more accurately described by models that incorporate a broadband enhancement to the Lyα forest power spectrum, highlighting the importance of accounting for reionization relics in cosmological analyses.

Symbolic Regression for State Estimation of Lithium-ion Battery

Shubham Sambhaji Patil ¹, Anubhav Kamal ¹, Sagar Bharathraj ¹, Ankur Deshwal ¹, Shashishekar P. Adiga ¹

¹Samsung Semiconductor India Research

Abstract: Modeling lithium-ion batteries has been a challenging problem. One of the critical tasks among many is state estimation, as it enables researchers to design better battery management systems (BMS). Understanding important battery parameters allows researchers to monitor battery health, predict performance, and optimize battery operation. Traditionally, mathematical models using partial differential equations (PDEs) such as the pseudo two-dimensional model (P2D) have been widely used to estimate physical quantities within the battery. However, deployment of P2D for real-time prediction is limited by the high computational cost, instability of numerical techniques, and the requirement of specialized software. Recent studies have successfully applied various machine learning algorithms achieving high predictive accuracy in many cases. These algorithms, however, suffer from limitations on generalizability and high computation requirements, which limit their deployment. We investigate the applicability of symbolic regression (SR), a branch of symbolic AI techniques, to the problem. The results demonstrate equivalent accuracy with P2D while offering orders of magnitude faster execution. As this study uses simulated P2D data, the findings should be interpreted as a proof-of-concept indicating that symbolic regression can yield interpretable, computationally efficient surrogates with promising BMS relevance.

Symbolic regression analysis of dynamical dark energy with DESI-DR2 and SN data

Agripino Sousa-Neto ¹, Carlos Bengaly ¹, Javier E. Gonzalez ², Jailson Alcaniz ¹

¹Observatório Nacional, ²Universidade Federal de Sergipe

Abstract: Recent measurements of Baryon Acoustic Oscillations (BAO) from the Dark Energy Spectroscopic Survey (DESI DR2), combined with data from the cosmic microwave background (CMB) and Type Ia supernovae (SNe), challenge the $Λ$ -Cold Dark Matter ( $Λ$ CDM) paradigm. They indicate a potential evolution in the dark energy equation of state (EoS), $w (z)$ , as suggested by analyses that employ parametric models. In this paper, we use a model-independent approach known as high performance symbolic regression (PySR) to reconstruct $w (z)$ directly from observational data, allowing us to bypass prior assumptions about the underlying cosmological model. Our findings confirm that the DESI DR2 data alone agree with the $Λ$ CDM model ( $w (z) = - 1$ ) at the redshift range considered. Additionally, when combining DESI data with existing compilations of SN distance measurements, such as Pantheon+ and DESY5, we observe no deviation from the $Λ$ CDM model within $3 σ$ (C.L.) for the interval of values of present-day matter density parameter $Ω_{m}$ and the sound horizon at the drag epoch $r_{d}$ currently constrained by observational data. Therefore, similarly to the DESI DR1 case, these results suggest that it is still premature to claim statistically significant evidence for a dynamical EoS or deviations from the $Λ$ CDM model based on the current DESI data in combination with supernova measurements.

Machine learning framework to predict product distribution of lignocellulosic biomass pyrolysis

Leonardo Voltolini ¹, Fernando Arrais Romero Dias Lima ¹,³, Carine Menezes Rebello ³, Ivaldo Itabaiana Jr. ¹, Idelfonso B.R. Nogueira ³, Argimiro Resende Secchi ¹,², Maurício B. de Souza Jr. ¹,²

¹School of Chemistry, EPQB, Universidade Federal do Rio de Janeiro (UFRJ), ²Chemical Engineering Program, PEQ/COPPE, Universidade Federal do Rio de Janeiro (UFRJ), ³Chemical Engineering Department, Norwegian University of Science and Technology (NTNU)

Abstract: Machine learning methods have become a trend to model distinct chemical processes, as an alternative to complex first-principles models. Given the complexity of biomass pyrolysis mechanisms, these methods offer a promising approach but often face challenges regarding data scarcity and lack of interpretability. This study aims to develop an interpretable framework for modeling biomass pyrolysis using data from fixed-bed lignocellulosic biomass pyrolysis experiments. A mass change basis was proposed to construct machine learning models, including artificial neural network (ANN) and symbolic regression (SR) models. Feature importance was assessed using Shapley Additive Explanations (SHAP) and compared to Partial Least Squares (PLS) regression, with PLS consistently identifying the best features for symbolic regression. Both ANN and SR models showed similar accuracy, achieving coefficient of determination (R $^{2}$ ) greater than 0.85 across all phase products in the testing set. Additionally, an uncertainty assessment of SR parameters was conducted to improve model robustness ensuring prediction stability. SR models exhibited superior generalization capacity during extrapolation tests, achieving R $^{2}$ values above 0.9 for char and gas phases. For oil values exceeding 10 grams, the SR models struggled with generalization. Overall, the proposed framework provides a valuable tool for interpreting and modeling pyrolysis process data, enabling its use in the decision-making process.

Data-driven skin friction estimation for UAV wings in subsonic flows

Christos Pliakos ¹, Giorgos Efrem ¹, Dimitrios Terzis ¹, Pericles Panagiotou ¹

¹Laboratory of Fluid Mechanics and Turbomachinery, Department of Mechanical Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

Abstract: Accurate estimation of the skin friction coefficient ( $C_{f}$ ) is essential for estimating the wall shear stresses ( $τ_{w}$ ) and ultimately the first-layer cell height ( $y$ ) in wall-resolved RANS simulations of wings, where turbulence models are used, demanding a specific grid resolution near walls (primarily the $y^{+}_{\mathrm{target}}$ ). Conventional flat-plate correlations often fail to account for the three-dimensional nature of real wing flows, introducing uncertainties in $C_{f}$ predictions and leading to multiple CFD analyses and mesh refinements to meet the targets. In this work, we propose a machine-learning-based approach exploring symbolic regression to derive a model that correlates wing-specific parameters (e.g., Reynolds number, angle of attack, thickness-to-chord ratio, wing sweep angle) with $C_{f}$ at the Mean Aerodynamic Chord (MAC). Data are acquired from an in-house database of over 5,000 RANS simulations for UAV wings operating in the low subsonic regime, covering a wide design space, all conducted following best-practice CFD guidelines to ensure high fidelity. These analyses are performed at various flow conditions covering Reynolds numbers from 10⁵ to 10⁷ and include the complete drag polar for each wing. The proposed correlation provides improved agreement with CFD data and enables more accurate $y^{+}$ estimations. Validation on different wing geometries, including the ONERA M6 and in-house UAV wings, confirmed the robustness of the model, which improves boundary-layer resolution with only a marginal (~2%) increase in total mesh size, while achieving an R² of 0.68 with negligible computational inference cost. This explicit, data-driven equation offers an efficient method for streamlining mesh generation in aerodynamic simulations.

Symbolic regression for precision LHC physics

Manuel Morales-Alvarado ¹, Daniel Conde ², Josh Bendavid ³, Veronica Sanz ², Maria Ubiali ⁴

¹Istituto Nazionale di Fisica Nucleare, ²Universidad de Valencia, ³Massachusetts Institute of Technology, ⁴University of Cambridge

Abstract: We study the potential of symbolic regression (SR) to derive compact and precise analytic expressions that can improve the accuracy and simplicity of phenomenological analyses at the Large Hadron Collider (LHC). As a benchmark, we apply SR to equation recovery in quantum electrodynamics (QED), where established analytical results from quantum field theory provide a reliable framework for evaluation. This benchmark serves to validate the performance and reliability of SR before extending its application to structure functions in the Drell-Yan process mediated by virtual photons, which lack analytic representations from first principles. By combining the simplicity of analytic expressions with the predictive power of machine learning techniques, SR offers a useful tool for facilitating phenomenological analyses in high energy physics.

Angular Coefficients from Interpretable Machine Learning with Symbolic Regression

Josh Bendavid ¹, Daniel Conde ², Manuel Morales-Alvarado ³, Veronica Sanz ², Maria Ubiali ⁴

¹CERN, European Organization for Nuclear Research, Geneva, ²Universidad de Valencia, ³Istituto Nazionale di Fisica Nucleare, ⁴University of Cambridge

Abstract: We explore the use of symbolic regression to derive compact analytical expressions for angular observables relevant to electroweak boson production at the Large Hadron Collider (LHC). Focusing on the angular coefficients that govern the decay distributions of W and Z bosons, we investigate whether symbolic models can well approximate these quantities, typically computed via computationally costly numerical procedures, with high fidelity and interpretability. Using the PySR package, we first validate the approach in controlled settings, namely in angular distributions in lepton-lepton collisions in QED and in leading-order Drell-Yan production at the LHC. We then apply symbolic regression to extract closed-form expressions for the angular coefficients as functions of transverse momentum, rapidity, and invariant mass, using next-to-leading order simulations of Drell-Yan events. Our results demonstrate that symbolic regression can produce accurate and generalisable expressions that match Monte Carlo predictions within uncertainties, while preserving interpretability and providing insight into the kinematic dependence of angular observables.

Analytical formulae for design of one-dimensional sonic crystals with smooth geometry based on symbolic regression

Viktor Hruška ¹, Aneta Furmanová ¹, Michal Bednařík ¹

¹Czech Technical University in Prague, Faculty of Electrical Engineering

Abstract: Even though locally periodic structures have been studied for more than three decades, the known analytical expressions relating the waveguide geometry and the acoustic transmission are limited to a few special cases. Having access to a numerical model is a great opportunity for data-driven discovery. Our choice of cubic splines to parametrize the waveguide unit cell geometry offers enough variability for waveguide design. Using the Webster equation for the unit cell and Floquet–Bloch theory for periodic structures, a dataset of numerical solutions was prepared. Employing the methods of physics-informed machine learning, we have extracted analytical formulae relating the waveguide geometry and the corresponding dispersion relation or directly the bandgap widths. The results contribute to the overall readability of the system and enable a deeper understanding of the underlying principles. Specifically, they allow for assessing the influence of the waveguide geometry, offering a more efficient alternative to computationally demanding numerical optimization.

SymbolFit: Automatic Parametric Modeling with Symbolic Regression

Ho Fung Tsoi ¹, Dylan Rankin ¹, Cecile Caillol ², Miles Cranmer ³, Sridhara Dasu ⁴, Javier Duarte ⁵, Philip Harris ⁶,⁷, Elliot Lipeles ¹, Vladimir Loncar ⁶,⁸

¹University of Pennsylvania, ²European Organization for Nuclear Research (CERN), ³University of Cambridge, ⁴University of Wisconsin-Madison, ⁵University of California San Diego, ⁶Massachusetts Institute of Technology, ⁷Institute for Artificial Intelligence and Fundamental Interactions, ⁸Institute of Physics Belgrade

Abstract: We introduce SymbolFit, a framework that automates parametric modeling by using symbolic regression to perform a machine-search for functions that fit the data, while simultaneously providing uncertainty estimates in a single run. Traditionally, constructing a parametric model to accurately describe binned data has been a manual and iterative process, requiring an adequate functional form to be determined before the fit can be performed. The main challenge arises when the appropriate functional forms cannot be derived from first principles, especially when there is no underlying true closed-form function for the distribution. In this work, we address this problem by utilizing symbolic regression, a machine learning technique that explores a vast space of candidate functions without needing a predefined functional form, treating the functional form itself as a trainable parameter. Our approach is demonstrated in data analysis applications in high-energy physics experiments at the CERN Large Hadron Collider (LHC). We demonstrate its effectiveness and efficiency using five real proton-proton collision datasets from new physics searches at the LHC, namely the background modeling in resonance searches for high-mass dijet, trijet, paired-dijet, diphoton, and dimuon events. We also validate the framework using several toy datasets with one and more variables.

The automated discovery of kinetic rate models – methodological frameworks

Miguel Ángel de Carvalho Servia ¹, Ilya Orson Sandoval ¹, King Kuok (Mimi) Hii ¹, Klaus Hellgardt ¹, Dongda Zhang ², Ehecatl Antonio del Rio Chanona ¹

¹Imperial College London, ²University of Manchester

Abstract: The industrialization of catalytic processes requires reliable kinetic models for their design, optimization and control. Mechanistic models require significant domain knowledge, while data-driven and hybrid models lack interpretability. Automated knowledge discovery methods, such as ALAMO (Automated Learning of Algebraic Models for Optimization), SINDy (Sparse Identification of Nonlinear Dynamics), and genetic programming, have gained popularity but suffer from limitations such as needing model structure assumptions, exhibiting poor scalability, and displaying sensitivity to noise. To overcome these challenges, we propose two methodological frameworks, ADoK-S and ADoK-W (Automated Discovery of Kinetic rate models using a Strong/Weak formulation of symbolic regression), for the automated generation of catalytic kinetic models using a robust criterion for model selection. We leverage genetic programming for model generation and a sequential optimization routine for model refinement. The frameworks are tested against three case studies of increasing complexity, demonstrating their ability to retrieve the underlying kinetic rate model with limited noisy data from the catalytic systems, showcasing their potential for chemical reaction engineering applications.

Individual chaotic behaviour of the S-stars in the Galactic centre

Sam J. Beckers ¹, Colin M. Poppelaars ¹, Veronica S. Ulibarrena ¹, Tjarda N. Boekholt ², Simon F. Portegies Zwart ¹

¹Leiden Observatory, Leiden University, ²NASA Ames Research Center

Abstract: Located at the core of the Galactic centre, the S-star cluster serves as a remarkable illustration of chaos in dynamical systems. The long-term chaotic behaviour of this system can be studied with gravitational $N$ -body simulations. By applying a small perturbation to the initial position of star S5, we can compare the evolution of this system to its unperturbed evolution. This results in two solutions that diverge exponentially, defined by the separation in position space $δ_{r}$ , with an average Lyapunov timescale of $\sim$420 yr, corresponding to the largest positive Lyapunov exponent. Even though the general trend of the chaotic evolution is governed in part by the supermassive black hole Sagittarius $A^{*}$ (Sgr $A^{*}$ ), individual differences between the stars can be noted in the behaviour of their phase-space curves. We present an analysis of the individual behaviour of the stars in this Newtonian chaotic dynamical system. The individuality of their behaviour is evident from offsets in the position space separation curves of the S-stars and the black hole. We propose that the offsets originate from the initial orbital elements of the S-stars, where Sgr $A^{*}$ is considered in one of the focal points of the Keplerian orbits. Methods were considered to find a relation between these elements and the separation in position space. Symbolic regression provides the clearest diagnostics for finding an interpretable expression for the problem. Our symbolic regression model indicates that $⟨ δ_{r} ⟩ \propto e^{2.3}$ , implying that the time-averaged individual separation in position space increases rapidly with the initial eccentricity of the S-stars.
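A scaling such as $⟨ δ_{r} ⟩ \propto e^{2.3}$ is a power law, so its exponent appears as the slope of a straight line in log-log space. The sketch below is only an illustration of that idea and is not the paper's method: it uses ordinary least squares rather than symbolic regression, and the data are synthetic. The function name `fit_power_law`, the prefactor 3.0 and the sample eccentricities are all assumptions made for the example.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = a * x**k in log-log space; returns (a, k)."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line log(y) = log(a) + k * log(x).
    s_xx = sum((u - mean_x) ** 2 for u in log_x)
    s_xy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    k = s_xy / s_xx
    a = math.exp(mean_y - k * mean_x)
    return a, k


# Synthetic, noise-free data (hypothetical values, not taken from the paper).
eccentricity = [0.1 * i for i in range(1, 10)]
separation = [3.0 * e ** 2.3 for e in eccentricity]

prefactor, exponent = fit_power_law(eccentricity, separation)
print(f"fitted exponent: {exponent:.3f}, fitted prefactor: {prefactor:.3f}")
```

With real, noisy data the recovered exponent would carry an uncertainty, and a symbolic regression search would additionally have to select the power-law form itself rather than assume it.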

Discovering interpretable models of scientific image data with deep learning

Christopher J. Soelistyo ¹, Alan R. Lowe ¹,²

¹The Alan Turing Institute, ²University College London

Abstract: How can we find interpretable, domain-appropriate models of natural phenomena given some complex, raw data such as images? Can we use such models to derive scientific insight from the data? In this paper, we propose some methods for achieving this. In particular, we implement disentangled representation learning, sparse deep neural network training and symbolic regression, and assess their usefulness in forming interpretable models of complex image data. We demonstrate their relevance to the field of bioimaging using a well-studied test problem of classifying cell states in microscopy data. We find that such methods can produce highly parsimonious models that achieve ~98% of the accuracy of black-box benchmark models, with a tiny fraction of the complexity. We explore the utility of such interpretable models in producing scientific explanations of the underlying biological phenomenon.

Discovery of a Planar Black Hole Mass Scaling Relation for Spiral Galaxies

Benjamin L. Davis ¹, Zehao Jin ¹

¹Center for Astrophysics and Space Science, New York University Abu Dhabi

Abstract: Supermassive black holes (SMBHs) are tiny in comparison to the galaxies they inhabit, yet they manage to influence and coevolve along with their hosts. Evidence of this mutual development is observed in the structure and dynamics of galaxies and their correlations with black hole mass ( $M_{∙}$ ). For our study, we focus on relative parameters that are unique to only disk galaxies. As such, we quantify the structure of spiral galaxies via their logarithmic spiral-arm pitch angles ( $ϕ$ ) and their dynamics through the maximum rotational velocities of their galactic disks ( $v_{\max}$ ). In the past, we have studied black hole mass scaling relations between $M_{∙}$ and $ϕ$ or $v_{\max}$ , separately. Now, we combine the three parameters into a trivariate $M_{∙}$ -- $ϕ$ -- $v_{\max}$ relationship that yields best-in-class accuracy in prediction of black hole masses in spiral galaxies. Because most black hole mass scaling relations have been created from samples of the largest SMBHs within the most massive galaxies, they lack certainty when extrapolated to low-mass spiral galaxies. Thus, it is difficult to confidently use existing scaling relations when trying to identify galaxies that might harbor the elusive class of intermediate-mass black holes (IMBHs). Therefore, we offer our novel relationship as an ideal predictor to search for IMBHs and probe the low-mass end of the black hole mass function by utilizing spiral galaxies. Already with rotational velocities widely available for a large population of galaxies and pitch angles readily measurable from uncalibrated images, we expect that the $M_{∙}$ -- $ϕ$ -- $v_{\max}$ fundamental plane will be a useful tool for estimating black hole masses, even at high redshifts.

Interpretable machine learning methods applied to jet background subtraction in heavy-ion collisions

Tanner Mengel ¹, Patrick Steffanic ¹, Charles Hughes ¹,², Antonio Carlos Oliveira da Silva ¹,², Christine Nattrass ¹

¹University of Tennessee, Knoxville, ²Iowa State University of Science and Technology

Abstract: Jet measurements in heavy ion collisions can provide constraints on the properties of the quark gluon plasma, but the kinematic reach is limited by a large, fluctuating background. We present a novel application of symbolic regression to extract a functional representation of a deep neural network trained to subtract background from jets in heavy ion collisions. We show that the deep neural network is approximately the same as a method using the particle multiplicity in a jet. This demonstrates that interpretable machine learning methods can provide insight into underlying physical processes.

Data-Driven Equation Discovery of a Cloud Cover Parameterization

Arthur Grundner ¹,², Tom Beucler ³, Pierre Gentine ²,³, Veronika Eyring ¹,⁴

¹Institut für Physik der Atmosphäre, Deutsches Zentrum für Luft- und Raumfahrt, ²Center for Learning the Earth with Artificial Intelligence And Physics, Columbia University, ³Institute of Earth Surface Dynamics, University of Lausanne, ⁴Institute of Environmental Physics, University of Bremen

Abstract: A promising method for improving the representation of clouds in climate models, and hence climate projections, is to develop machine learning-based parameterizations using output from global storm-resolving models. While neural networks can achieve state-of-the-art performance, they are typically climate model-specific, require post-hoc tools for interpretation, and struggle to predict outside of their training distribution. To avoid these limitations, we combine symbolic regression, sequential feature selection, and physical constraints in a hierarchical modeling framework. This framework allows us to discover new equations diagnosing cloud cover from coarse-grained variables of global storm-resolving model simulations. These analytical equations are interpretable by construction and easily transferable to other grids or climate models. Our best equation balances performance and complexity, achieving a performance comparable to that of neural networks ( $R^{2} = 0.94$ ) while remaining simple (with only 13 trainable parameters). It reproduces cloud cover distributions more accurately than the Xu-Randall scheme across all cloud regimes (Hellinger distances $< 0.09$ ), and matches neural networks in condensate-rich regimes. When applied and fine-tuned to the ERA5 reanalysis, the equation exhibits superior transferability to new data compared to all other optimal cloud cover schemes. Our findings demonstrate the effectiveness of symbolic regression in discovering interpretable, physically-consistent, and nonlinear equations to parameterize cloud cover.

Electron Transfer Rules of Minerals under Pressure informed by Machine Learning

Yanzhang Li ¹, Hongyu Wang ², Yan Li ¹, Xiangzhi Bai ², Anhuai Lu ¹

¹Peking University, ²Beihang University

Abstract: Electron transfer is the most elementary process in nature, but the existing electron transfer rules are seldom applied to high-pressure situations, such as in the deep Earth. Here we show a deep learning model to obtain the electronegativity of 96 elements under arbitrary pressure, and a regressed unified formula to quantify its relationship with pressure and electronic configuration. The relative work function of minerals is further predicted by electronegativity, presenting a decreasing trend with pressure because of pressure-induced electron delocalization. Using the work function as the case study of electronegativity, it reveals that the driving force behind directional electron transfer results from the enlarged work function difference between compounds with pressure. This well explains the deep high-conductivity anomalies, and helps discover the redox reactivity between widespread Fe(II)-bearing minerals and water during ongoing subduction. Our results give an insight into the fundamental physicochemical properties of elements and their compounds under pressure.

The SZ flux-mass (Y-M) relation at low halo masses: improvements with symbolic regression and strong constraints on baryonic feedback

Digvijay Wadekar ¹, Leander Thiele ², J. Colin Hill ³, Shivam Pandey ⁴, Francisco Villaescusa-Navarro ⁵, David N. Spergel ⁵, Miles Cranmer ², Daisuke Nagai ⁶, Daniel Anglés-Alcázar ⁷, Shirley Ho ⁵, Lars Hernquist ⁸

¹Institute for Advanced Study, ²Princeton University, ³Columbia University, ⁴University of Pennsylvania, ⁵Flatiron Institute, ⁶Yale University, ⁷University of Connecticut, ⁸Harvard University

Abstract: Ionized gas in the halo circumgalactic medium leaves an imprint on the cosmic microwave background via the thermal Sunyaev-Zeldovich (tSZ) effect. Feedback from active galactic nuclei (AGN) and supernovae can affect the measurements of the integrated tSZ flux of halos ( $Y_{S Z}$ ) and cause its relation with the halo mass ( $Y_{S Z} - M$ ) to deviate from the self-similar power-law prediction of the virial theorem. We perform a comprehensive study of such deviations using CAMELS, a suite of hydrodynamic simulations with extensive variations in feedback prescriptions. We use a combination of two machine learning tools (random forest and symbolic regression) to search for analogues of the $Y - M$ relation which are more robust to feedback processes for low masses ( $M \leq 10^{14} M_{⊙} / h$ ); we find that simply replacing $Y \to Y (1 + M_{*} / M_{gas})$ in the relation makes it remarkably self-similar. This could serve as a robust multiwavelength mass proxy for low-mass clusters and galaxy groups. Our methodology can also be generally useful to improve the domain of validity of other astrophysical scaling relations. We also forecast that measurements of the $Y - M$ relation could provide percent-level constraints on certain combinations of feedback parameters and/or rule out a major part of the parameter space of supernova and AGN feedback models used in current state-of-the-art hydrodynamic simulations. Our results can be useful for using upcoming SZ surveys (e.g. SO, CMB-S4) and galaxy surveys (e.g. DESI and Rubin) to constrain the nature of baryonic feedback. Finally, we find that an alternative relation, $Y - M_{*}$ , provides information on feedback complementary to that from $Y - M$ .

Machine Learning the Gravity Equation for International Trade

Sergiy Verstyuk ¹, Michael R. Douglas ¹

¹Harvard University

Abstract: Machine learning (ML) is becoming more and more important throughout the mathematical and theoretical sciences. In this work we apply modern ML methods to gravity models of pairwise interactions in international economics. We explain the formulation of graphical neural networks (GNNs), models for graph-structured data that respect the properties of exchangeability and locality. GNNs are a natural and theoretically appealing class of models for international trade, which we demonstrate empirically by fitting them to a large panel of annual-frequency country-level data. We then use a symbolic regression algorithm to turn our fits into interpretable models with performance comparable to state of the art hand-crafted models motivated by economic theory. The resulting symbolic models contain objects resembling market access functions, which were developed in modern structural literature, but in our analysis arise ab initio without being explicitly postulated. Along the way, we also produce several model-consistent and model-agnostic ML-based measures of bilateral trade accessibility.

Rediscovering orbital mechanics with machine learning

Pablo Lemos ^1,2, Niall Jeffrey ^3,2, Miles Cranmer ⁴, Shirley Ho ^4,5,6,7, Peter Battaglia ⁸

¹University of Sussex, ²University College London, ³ENS, ⁴Princeton University, ⁵Flatiron Institute, ⁶Carnegie Mellon University, ⁷New York University, ⁸DeepMind

Abstract: We present an approach for using machine learning to automatically discover the governing equations and hidden properties of real physical systems from observations. We train a "graph neural network" to simulate the dynamics of our solar system's Sun, planets, and large moons from 30 years of trajectory data. We then use symbolic regression to discover an analytical expression for the force law implicitly learned by the neural network, which our results showed is equivalent to Newton's law of gravitation. The key assumptions that were required were translational and rotational equivariance, and Newton's second and third laws of motion. Our approach correctly discovered the form of the symbolic force law. Furthermore, our approach did not require any assumptions about the masses of planets and moons or physical constants. They, too, were accurately inferred through our methods. Though, of course, the classical law of gravitation has been known since Isaac Newton, our result serves as a validation that our method can discover unknown laws and hidden properties from observed data. More broadly this work represents a key step toward realizing the potential of machine learning for accelerating scientific discovery.
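As a rough illustration of the last step described in this abstract (reading a force law off data), the sketch below recovers the exponent of a power-law force from synthetic samples. It is not the paper's method: a log-log least-squares fit stands in for symbolic regression, the data are generated directly from Newton's law rather than from a trained graph neural network, and the masses, separations, and helper names (`newtonian_force`, `fit_power_law`) are assumptions made up for this example.

```python
import numpy as np

G = 6.674e-11  # gravitational constant in SI units


def newtonian_force(m1, m2, r):
    """Magnitude of the Newtonian gravitational force between two point masses."""
    return G * m1 * m2 / r**2


def fit_power_law(r, f):
    """Fit f = c * r**p by linear least squares in log-log space; return (c, p)."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    p, log_c = np.polyfit(np.log(r), np.log(f), deg=1)
    return np.exp(log_c), p


# Synthetic, noise-free data: hypothetical masses (kg) and separations (m).
rng = np.random.default_rng(0)
m1, m2 = 2.0e30, 6.0e24
r = rng.uniform(1.0e10, 1.0e12, size=200)
f = newtonian_force(m1, m2, r)

coefficient, exponent = fit_power_law(r, f)
print(f"fitted exponent: {exponent:.3f}")
```

A power-law fit only tunes two numbers inside a fixed functional form. Symbolic regression instead searches over the functional forms themselves, which is why it can be used when the shape of the law is not known in advance.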

(Thesis) On Neural Differential Equations - Section 6.1

Patrick Kidger ¹

¹University of Oxford

Abstract: The conjoining of dynamical systems and deep learning has become a topic of great interest. In particular, neural differential equations (NDEs) demonstrate that neural networks and differential equations are two sides of the same coin. Traditional parameterised differential equations are a special case. Many popular neural network architectures, such as residual networks and recurrent networks, are discretisations. NDEs are suitable for tackling generative problems, dynamical systems, and time series (particularly in physics, finance, ...) and are thus of interest to both modern machine learning and traditional mathematical modelling. NDEs offer high-capacity function approximation, strong priors on model space, the ability to handle irregular data, memory efficiency, and a wealth of available theory on both sides. This doctoral thesis provides an in-depth survey of the field. Topics include: neural ordinary differential equations (e.g. for hybrid neural/mechanistic modelling of physical systems); neural controlled differential equations (e.g. for learning functions of irregular time series); and neural stochastic differential equations (e.g. to produce generative models capable of representing complex stochastic dynamics, or sampling from complex high-dimensional distributions). Further topics include: numerical methods for NDEs (e.g. reversible differential equations solvers, backpropagation through differential equations, Brownian reconstruction); symbolic regression for dynamical systems (e.g. via regularised evolution); and deep implicit models (e.g. deep equilibrium models, differentiable optimisation). We anticipate this thesis will be of interest to anyone interested in the marriage of deep learning with dynamical systems, and hope it will provide a useful reference for the current state of the art.

Augmenting astrophysical scaling relations with machine learning: application to reducing the SZ flux-mass scatter

Digvijay Wadekar ¹, Leander Thiele ², Francisco Villaescusa-Navarro ³, J. Colin Hill ⁴, Miles Cranmer ², David N. Spergel ³, Nicholas Battaglia ⁵, Daniel Anglés-Alcázar ⁶, Lars Hernquist ⁷, Shirley Ho ³

¹Institute for Advanced Study, ²Princeton University, ³Flatiron Institute, ⁴Columbia University, ⁵Cornell University, ⁶University of Connecticut, ⁷Harvard University

Abstract: Complex systems (stars, supernovae, galaxies, and clusters) often exhibit low scatter relations between observable properties (e.g., luminosity, velocity dispersion, oscillation period, temperature). These scaling relations can illuminate the underlying physics and can provide observational tools for estimating masses and distances. Machine learning can provide a fast and systematic way to search for new scaling relations (or for simple extensions to existing relations) in abstract high-dimensional parameter spaces. We use a machine learning tool called symbolic regression (SR), which models the patterns in a given dataset in the form of analytic equations. We focus on the Sunyaev-Zeldovich flux-cluster mass relation ( $Y_{S Z} - M$ ), the scatter in which affects inference of cosmological parameters from cluster abundance data. Using SR on the data from the IllustrisTNG hydrodynamical simulation, we find a new proxy for cluster mass which combines $Y_{S Z}$ and concentration of ionized gas ( $c_{gas}$ ): $M \propto Y_{conc}^{3 / 5} \equiv Y_{S Z}^{3 / 5} (1 - A c_{gas})$ . $Y_{conc}$ reduces the scatter in the predicted $M$ by ~20-30% for large clusters ( $M > 10^{14} M_{⊙} / h$ ) at both high and low redshifts, as compared to using just $Y_{S Z}$ . We show that the dependence on $c_{gas}$ is linked to cores of clusters exhibiting larger scatter than their outskirts. Finally, we test $Y_{conc}$ on clusters from simulations of the CAMELS project and show that $Y_{conc}$ is robust against variations in cosmology, astrophysics, subgrid physics, and cosmic variance. Our results and methodology can be useful for accurate multiwavelength cluster mass estimation from current and upcoming CMB and X-ray surveys like ACT, SO, SPT, eROSITA and CMB-S4.
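The definition of the new proxy in this abstract can be unpacked in one step. Raising both sides of the identity to the power $5/3$ gives the proxy explicitly, assuming $1 - A c_{gas} > 0$ so that the fractional power is well defined; $A$ is the fitted constant named in the abstract, and its value is not given in this summary:

$$
Y_{conc}^{3/5} \equiv Y_{S Z}^{3/5} \, (1 - A c_{gas})
\quad \Longrightarrow \quad
Y_{conc} = Y_{S Z} \, (1 - A c_{gas})^{5/3}
$$

In the limit $A c_{gas} \to 0$ this reduces to $Y_{conc} = Y_{S Z}$, so the proxy falls back to the standard $M \propto Y_{S Z}^{3/5}$ scaling.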

Modeling the galaxy-halo connection with machine learning

Ana Maria Delgado ¹, Digvijay Wadekar ^2,3, Boryana Hadzhiyska ¹, Sownak Bose ^1,7, Lars Hernquist ¹, Shirley Ho ^2,4,5,6

¹Center for Astrophysics | Harvard & Smithsonian, ²New York University, ³Institute for Advanced Study, ⁴Flatiron Institute, ⁵Princeton University, ⁶Carnegie Mellon University, ⁷Durham University

Abstract: To extract information from the clustering of galaxies on non-linear scales, we need to model the connection between galaxies and halos accurately and in a flexible manner. Standard halo occupation distribution (HOD) models make the assumption that the galaxy occupation in a halo is a function of only its mass; in reality, however, the occupation can depend on various other parameters including halo concentration, assembly history, environment, spin, etc. Using the IllustrisTNG hydrodynamic simulation as our target, we show that machine learning tools can be used to capture this high-dimensional dependence and provide more accurate galaxy occupation models. Specifically, we use a random forest regressor to identify which secondary halo parameters best model the galaxy-halo connection and symbolic regression to augment the standard HOD model with simple equations capturing the dependence on those parameters, namely the local environmental overdensity and shear, at the location of a halo. This not only provides insights into the galaxy-formation relationship but, more importantly, improves the clustering statistics of the modeled galaxies significantly. Our approach demonstrates that machine learning tools can help us better understand and model the galaxy-halo connection, and are therefore useful for galaxy formation and cosmology studies from upcoming galaxy surveys.

Back to the Formula -- LHC Edition

Anja Butter ¹, Tilman Plehn ¹, Nathalie Soybelman ¹, Johann Brehmer ²

¹Institut für Theoretische Physik, Universität Heidelberg, ²Center for Data Science, New York University

Abstract: While neural networks offer an attractive way to numerically encode functions, actual formulas remain the language of theoretical particle physics. We show how symbolic regression trained on matrix-element information provides, for instance, optimal LHC observables in an easily interpretable form. We introduce the method using the effect of a dimension-6 coefficient on associated ZH production. We then validate it for the known case of CP-violation in weak-boson-fusion Higgs production, including detector effects.

Finding universal relations in subhalo properties with artificial intelligence

Helen Shao ¹, Francisco Villaescusa-Navarro ^1,2, Shy Genel ^2,3, David N. Spergel ^2,1, Daniel Anglés-Alcázar ^4,2, Lars Hernquist ⁵, Romeel Davé ^6,7,8, Desika Narayanan ^9,10, Gabriella Contardo ², Mark Vogelsberger ¹¹

¹Princeton University, ²Flatiron Institute, ³Columbia University, ⁴University of Connecticut, ⁵Center for Astrophysics | Harvard & Smithsonian, ⁶University of Edinburgh, ⁷University of the Western Cape, ⁸South African Astronomical Observatories, ⁹University of Florida, ¹⁰University of Florida Informatics Institute, ¹¹MIT

Abstract: We use a generic formalism designed to search for relations in high-dimensional spaces to determine if the total mass of a subhalo can be predicted from other internal properties such as velocity dispersion, radius, or star-formation rate. We train neural networks using data from the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project and show that the model can predict the total mass of a subhalo with high accuracy: more than 99% of the subhalos have a predicted mass within 0.2 dex of their true value. The networks exhibit surprising extrapolation properties, being able to accurately predict the total mass of any type of subhalo containing any kind of galaxy at any redshift from simulations with different cosmologies, astrophysics models, subgrid physics, volumes, and resolutions, indicating that the network may have found a universal relation. We then use different methods to find equations that approximate the relation found by the networks and derive new analytic expressions that predict the total mass of a subhalo from its radius, velocity dispersion, and maximum circular velocity. We show that in some regimes, the analytic expressions are more accurate than the neural networks. We interpret the relation found by the neural network and approximated by the analytic equation as being connected to the virial theorem.

Disentangling a deep learned volume formula

Jessica Craven ¹, Vishnu Jejjala ¹, Arjun Kar ²

¹University of the Witwatersrand, ²University of British Columbia

Abstract: We present a simple phenomenological formula which approximates the hyperbolic volume of a knot using only a single evaluation of its Jones polynomial at a root of unity. The average error is just 2.86% on the first 1.7 million knots, which represents a large improvement over previous formulas of this kind. To find the approximation formula, we use layer-wise relevance propagation to reverse engineer a black box neural network which achieves a similar average error for the same approximation task when trained on 10% of the total dataset. The particular roots of unity which appear in our analysis cannot be written as $e^{2 \pi i / (k + 2)}$ with integer $k$; therefore, the relevant Jones polynomial evaluations are not given by unknot-normalized expectation values of Wilson loop operators in conventional SU(2) Chern-Simons theory with level $k$. Instead, they correspond to an analytic continuation of such expectation values to fractional level. We briefly review the continuation procedure and comment on the presence of certain Lefschetz thimbles, to which our approximation formula is sensitive, in the analytically continued Chern-Simons integration cycle.

Modeling assembly bias with machine learning and symbolic regression

Digvijay Wadekar ¹, Francisco Villaescusa-Navarro ^2,3, Shirley Ho ^2,3,4, Laurence Perreault-Levasseur ^3,5,6

¹New York University, ²Princeton University, ³Flatiron Institute, ⁴Carnegie Mellon University, ⁵Université de Montréal, ⁶Mila

Abstract: Upcoming 21cm surveys will map the spatial distribution of cosmic neutral hydrogen (HI) over unprecedented volumes. Mock catalogues are needed to fully exploit the potential of these surveys. Standard techniques employed to create these mock catalogs, like Halo Occupation Distribution (HOD), rely on assumptions such as the baryonic properties of dark matter halos only depend on their masses. In this work, we use the state-of-the-art magneto-hydrodynamic simulation IllustrisTNG to show that the HI content of halos exhibits a strong dependence on their local environment. We then use machine learning techniques to show that this effect can be 1) modeled by these algorithms and 2) parametrized in the form of novel analytic equations. We provide physical explanations for this environmental effect and show that ignoring it leads to underprediction of the real-space 21-cm power spectrum at k≳0.05 h/Mpc by ≳10%, which is larger than the expected precision from upcoming surveys on such large scales. Our methodology of combining numerical simulations with machine learning techniques is general, and opens a new direction at modeling and parametrizing the complex physics of assembly bias needed to generate accurate mocks for galaxy and line intensity mapping surveys.
