Learning Parameterized Expressions

Note: Parametric expressions are currently considered experimental and may change in the future.

Parameterized expressions in SymbolicRegression.jl allow you to discover symbolic expressions that contain optimizable parameters. This is particularly useful when you have data that follows different patterns based on some categorical variable, or when you want to learn an expression with constants that should be optimized during the search.

In this tutorial, we'll generate synthetic data with class-dependent parameters and use symbolic regression to discover the parameterized expressions.

The Problem

Let's create a synthetic dataset where the underlying function changes based on a class label:

\[
y = \begin{cases}
2\cos(x_2 + 0.1) + x_1^2 - 3.2 & \text{[class 1]} \\
2\cos(x_2 + 1.5) + x_1^2 - 0.5 & \text{[class 2]}
\end{cases}
\]

We will need to simultaneously learn the symbolic expression and per-class parameters!
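
Equivalently, both classes share a single functional form with two per-class parameters \(p_1\) and \(p_2\) (here \(p_1 \in \{0.1, 1.5\}\) and \(p_2 \in \{3.2, 0.5\}\)):

\[
y = 2\cos(x_2 + p_1[\mathrm{class}]) + x_1^2 - p_2[\mathrm{class}]
\]

This is exactly the structure a ParametricExpression represents: one shared expression tree, plus per-class parameter values.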

using SymbolicRegression
using Random: MersenneTwister
using Zygote
using MLJBase: machine, fit!, predict, report
using Test

Now, we generate synthetic data with these two different classes.

Note that the class feature receives special treatment from the SRRegressor, which interprets it as a categorical variable rather than a continuous one:

X = let rng = MersenneTwister(0), n = 30
    (; x1=randn(rng, n), x2=randn(rng, n), class=rand(rng, 1:2, n))
end
(x1 = [-0.7587307822993239, 0.03249717326229417, 0.04868971510118324, 0.426553609186312, -0.6455341387712752, 0.16047914065004126, -1.174243139542269, -0.8590577126607076, 0.8166486156723918, -1.3610991864137623  …  -1.4738536651017287, -2.3922557197621166, -1.004193438685409, -0.15671091559738035, -0.40851221296291546, -0.02542189385385727, -1.507668525428601, -0.14180202282588494, -1.1251781589718115, -1.6634915201296359],
 x2 = [0.9225430229784556, -0.3431658216750271, 2.1494079717206267, -0.3614340720711092, -0.17101357024903552, 1.4044957064877326, -1.14561382828759, 1.199962308278444, -0.974684107278398, 0.517299511539031  …  0.15367872415413888, 0.1617663833102913, 1.6815067955745, 0.08878524390300123, 2.997069453443845, 1.3160268541638762, 0.10493050742285542, 0.17279914285302175, 0.39613226871496143, 1.2560661845396246],
 class = [1, 1, 2, 2, 2, 2, 1, 2, 2, 1  …  1, 2, 2, 1, 1, 2, 1, 1, 1, 2],)

Now, we generate target values using the true model that has class-dependent parameters:

y = let P1 = [0.1, 1.5], P2 = [3.2, 0.5]
    [2 * cos(x2 + P1[class]) + x1^2 - P2[class] for (x1, x2, class) in zip(X.x1, X.x2, X.class)]
end
30-element Vector{Float64}:
 -1.5819329379648106
 -1.257782764921467
 -2.2452471837347545
  0.5197422215634926
  0.3956348400637564
 -2.4182943336237583
 -0.8184112159299595
 -1.5701319117010097
  1.8972460993013578
  0.28348012694263547
  ⋮
  5.041198154342794
 -1.4900026104671513
 -1.210975832899709
 -5.031135783951781
 -2.394293547335875
  1.0312146396920676
 -1.2538511817635862
 -0.1751135457735593
  0.4140028695829825
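
Since X.class[1] == 1 (see the output above), the first target should follow the class-1 form with parameters 0.1 and 3.2. As an optional, quick sanity check (Test was imported above):

@test y[1] ≈ 2 * cos(X.x2[1] + 0.1) + X.x1[1]^2 - 3.2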

Setting up the Search

We'll configure the symbolic regression search to:

  • Use parameterized expressions with up to 2 parameters
  • Use Zygote.jl for automatic differentiation during parameter optimization (important for parametric expressions, since the parameter optimization is higher-dimensional)

model = SRRegressor(;
    niterations=100,
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
    populations=30,
    expression_type=ParametricExpression,
    expression_options=(; max_parameters=2),
    autodiff_backend=:Zygote,
);

Now, let's set up the machine and fit it:

mach = machine(model, X, y)
untrained Machine; caches model-specific representations of data
  model: SRRegressor(defaults = nothing, …)
  args: 
    1:	Source @658 ⏎ ScientificTypesBase.Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{ScientificTypesBase.Count}}}
    2:	Source @554 ⏎ AbstractVector{ScientificTypesBase.Continuous}

At this point, you would run:

fit!(mach)

You can extract the best expression and parameters with:

report(mach).equations[end]
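
If you have fitted the machine, you can also check the discovered model against the training data. A minimal sketch (the tolerance is an arbitrary choice, assuming the search converged):

ŷ = predict(mach, X)  # predictions use the learned per-class parameters
@test sum(abs2, ŷ .- y) / length(y) < 1e-2  # hypothetical tolerance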

Key Takeaways

  1. ParametricExpressions allow us to discover symbolic expressions with optimizable parameters
  2. The parameters can capture class-dependent variations in the underlying model

This approach is particularly useful when you suspect your data follows a common functional form, but with parameters that vary across different conditions or classes!
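
To make the class-dependence concrete, here is a hypothetical sketch (again assuming a fitted machine): two inputs that are identical except for their class label receive different predictions, because the learned parameters are looked up per class.

Xnew = (; x1=[0.0, 0.0], x2=[0.0, 0.0], class=[1, 2])
predict(mach, Xnew)  # two different values: one per class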

