SymbolicRegression.jl searches for symbolic expressions which optimize a particular objective.
Latest release | Documentation | Forums | Paper |
---|---|---|---|
Build status | Coverage |
---|---|
Check out PySR for a Python frontend. Cite this software
Contents:
- Quickstart
- Constructing expressions
- Exporting to SymbolicUtils.jl
- Contributors ✨
- Code structure
- Search options
Quickstart
Install in Julia with:
using Pkg
Pkg.add("SymbolicRegression")
MLJ Interface
The easiest way to use SymbolicRegression.jl is with MLJ. Let's see an example:
import SymbolicRegression: SRRegressor
import MLJ: machine, fit!, predict, report
# Dataset with two named features:
X = (a = rand(500), b = rand(500))
# and one target:
y = @. 2 * cos(X.a * 23.5) - X.b ^ 2
# with some noise:
y = y .+ randn(500) .* 1e-3
model = SRRegressor(
niterations=50,
binary_operators=[+, -, *],
unary_operators=[cos],
)
Now, let's create and train this model on our data:
mach = machine(model, X, y)
fit!(mach)
You will notice that expressions are printed using the column names of our table. If, instead of a table-like object, a simple array is passed (e.g., X=randn(100, 2)
), x1, ..., xn
will be used for variable names.
Let's look at the expressions discovered:
report(mach)
Finally, we can make predictions with the expressions on new data:
predict(mach, X)
This will make predictions using the expression selected by model.selection_method
, which by default is a mix of accuracy and complexity.
You can override this selection and select an equation from the Pareto front manually with:
predict(mach, (data=X, idx=2))
where here we choose to evaluate the second equation.
For fitting multiple outputs, one can use MultitargetSRRegressor
(and pass an array of indices to idx
in predict
for selecting specific equations). For a full list of options available to each regressor, see the API page.
Low-Level Interface
The heart of SymbolicRegression.jl is the equation_search
function. This takes a 2D array and attempts to model a 1D array using analytic functional forms. Note: unlike the MLJ interface, this assumes column-major input of shape [features, rows].
import SymbolicRegression: Options, equation_search
X = randn(2, 100)
y = 2 * cos.(X[2, :]) + X[1, :] .^ 2 .- 2
options = Options(
binary_operators=[+, *, /, -],
unary_operators=[cos, exp],
populations=20
)
hall_of_fame = equation_search(
X, y, niterations=40, options=options,
parallelism=:multithreading
)
You can view the resultant equations in the dominating Pareto front (best expression seen at each complexity) with:
import SymbolicRegression: calculate_pareto_frontier
dominating = calculate_pareto_frontier(hall_of_fame)
This is a vector of PopMember
type - which contains the expression along with the score. We can get the expressions with:
trees = [member.tree for member in dominating]
Each of these equations is an Expression{T}
type for some constant type T
(like Float32
).
These expression objects are callable – you can simply pass in data:
tree = trees[end]
output = tree(X)
Constructing expressions
Expressions are represented under-the-hood as the Node
type which is developed in the DynamicExpressions.jl package. The Expression
type wraps this and includes metadata about operators and variable names.
You can manipulate and construct expressions directly. For example:
using SymbolicRegression: Options, Expression, Node
options = Options(;
binary_operators=[+, -, *, /], unary_operators=[cos, exp, sin]
)
operators = options.operators
variable_names = ["x1", "x2", "x3"]
x1, x2, x3 = [Expression(Node(Float64; feature=i); operators, variable_names) for i=1:3]
tree = cos(x1 - 3.2 * x2) - x1 * x1
This tree has Float64
constants, so the type of the entire tree will be promoted to Node{Float64}
.
We can convert all constants (recursively) to Float32
:
float32_tree = convert(Expression{Float32}, tree)
We can then evaluate this tree on a dataset:
X = rand(Float32, 3, 100)
tree(X)
This callable format is the easy-to-use version which will automatically set all values to NaN if there were any Inf or NaN during evaluation. You can call the raw evaluation method with eval_tree_array
:
output, did_succeed = eval_tree_array(tree, X)
where did_succeed
explicitly declares whether the evaluation was successful.
Exporting to SymbolicUtils.jl
We can view the equations in the dominating Pareto frontier with:
dominating = calculate_pareto_frontier(hall_of_fame)
We can convert the best equation to SymbolicUtils.jl with the following function:
import SymbolicRegression: node_to_symbolic
eqn = node_to_symbolic(dominating[end].tree)
println(simplify(eqn*5 + 3))
We can also print out the full pareto frontier like so:
import SymbolicRegression: compute_complexity, string_tree
println("Complexity\tMSE\tEquation")
for member in dominating
complexity = compute_complexity(member, options)
loss = member.loss
string = string_tree(member.tree, options)
println("$(complexity)\t$(loss)\t$(string)")
end
Contributors ✨
We are eager to welcome new contributors! If you have an idea for a new feature, don't hesitate to share it on the issues page or forums.
Mark Kittisopikul 💻 💡 🚇 📦 📣 👀 🔧 ⚠️ |
T Coxon 🐛 💻 🔌 💡 🚇 🚧 👀 🔧 ⚠️ 📓 |
Dhananjay Ashok 💻 🌍 💡 🚧 ⚠️ |
Johan Blåbäck 🐛 💻 💡 🚧 📣 👀 ⚠️ 📓 |
JuliusMartensen 🐛 💻 📖 🔌 💡 🚇 🚧 📦 📣 👀 🔧 📓 |
ngam 💻 🚇 📦 👀 🔧 ⚠️ |
Kaze Wong 🐛 💻 💡 🚇 🚧 📣 👀 🔬 📓 |
Christopher Rackauckas 🐛 💻 🔌 💡 🚇 📣 👀 🔬 🔧 ⚠️ 📓 |
Patrick Kidger 🐛 💻 📖 🔌 💡 🚧 📣 👀 🔬 🔧 ⚠️ 📓 |
Okon Samuel 🐛 💻 📖 🚧 💡 🚇 👀 ⚠️ 📓 |
William Booth-Clibborn 💻 🌍 📖 📓 🚧 👀 🔧 ⚠️ |
Pablo Lemos 🐛 💡 📣 👀 🔬 📓 |
Jerry Ling 🐛 💻 📖 🌍 💡 📣 👀 📓 |
Charles Fox 🐛 💻 💡 🚧 📣 👀 🔬 📓 |
Johann Brehmer 💻 📖 💡 📣 👀 🔬 ⚠️ 📓 |
Marius Millea 💻 💡 📣 👀 📓 |
Coba 🐛 💻 💡 👀 📓 |
Pietro Monticone 🐛 📖 💡 |
Mateusz Kubica 📖 💡 |
Jay Wadekar 🐛 💡 📣 🔬 |
Anthony Blaom, PhD 🚇 💡 👀 |
Jgmedina95 🐛 💡 👀 |
Michael Abbott 💻 💡 👀 🔧 |
Oscar Smith 💻 💡 |
Eric Hanson 💡 📣 📓 |
Henrique Becker 💻 💡 👀 |
qwertyjl 🐛 📖 💡 📓 |
Rik Huijzer 💡 🚇 |
Hongyu Wang 💡 📣 🔬 |
Saurav Maheshkar 🔧 |
Code structure
SymbolicRegression.jl is organized roughly as follows. Rounded rectangles indicate objects, and rectangles indicate functions.
(if you can't see this diagram being rendered, try pasting it into mermaid-js.github.io/mermaid-live-editor)
The HallOfFame
objects store the expressions with the lowest loss seen at each complexity.
The dependency structure of the code itself is as follows:
Bash command to generate dependency structure from src
directory (requires vim-stream
):
echo 'stateDiagram-v2'
IFS=$'\n'
for f in *.jl; do
for line in $(cat $f | grep -e 'import \.\.' -e 'import \.' -e 'using \.' -e 'using \.\.'); do
echo $(echo $line | vims -s 'dwf:d$' -t '%s/^\.*//g' '%s/Module//g') $(basename "$f" .jl);
done;
done | vims -l 'f a--> ' | sort
Search options
See https://ai.damtp.cam.ac.uk/symbolicregression/stable/api/#Options