# User Guide

## Overview

Below is a high-level overview of LeanDojo. A Lean project is a Git repo
consisting of multiple source files (`*.lean`

). Each file has
multiple theorems (and their proofs). Given a Lean repo, LeanDojo first traces it.
The results are traced repo, files, and theorems with rich syntactic/semantic information not readily available in source code but useful
for downstream tasks, including but not limited to tactic states, tactics, and premises in tactics.
Once we have traced repo/files/theorems, constructing datasets is simply a matter of post-processing.

Once a theorem is traced, LeanDojo enables us to interact with it by replacing the original, human-written proof with
a special `repl`

tactic. Unlike regular tactics transforming the goal in predefined ways,
the `repl`

tactic reads tactics from an external prover, executes them in Lean, and reports the results back.
We name the resulting artifact Dojo (道場, the Japanese word for a gym for practicing martial arts).
Any external prover (potentially based on machine learning) can interact with Dojo to practice the art of theorem proving.

Next, we explain each concept in more detail.

## Lean Repos and Files

We use lean-liquid in Lean 3 as an example
of Lean repos. It is a formalization of Peter Scholze’s Liquid Tensor Experiments.
Every Lean 3 repo has a config file `leanpkg.toml`

at its root. [1] For example,

```
[package]
name = "lean-liquid"
version = "0.1"
lean_version = "leanprover-community/lean:3.48.0"
path = "src"
[dependencies]
mathlib = {git = "https://github.com/leanprover-community/mathlib", rev = "5947fb69cc1fdfebaba1e1b1f0a04f26f0f612bf"}
```

In the config file, `lean_version`

specifies the version of Lean this repo requires. `path`

specifies
the directory for Lean source files. And `[dependencies]`

tells us
the repo depends on a specific commit of mathlib.
The repo can be compiled by running `leanpkg build`

, which performs two steps under the hood:

Run

`lean configure`

to pull all dependencies (only mathlib here) into the`_target/deps/`

directory.Run

`lean --make src`

to compile the Lean source files (`src`

is configurable in`leanpkg.toml`

). Think of it as Lean’s analogy of GNU Make, which is smart enough to figure out dependencies.

For each `*.lean`

file, the compilation process checks all proofs in it
and produces a `*.olean`

file (analogous to compiling `*.cpp`

to `*.o`

).
You may find more information about the toolchain of Lean 3 here
and related LeanDojo implementation in lean_dojo.data_extraction.lean.

Lean 4 uses a different build system ([lake](https://github.com/leanprover/lake)) with different configuration files and directory structures. However, the high-level idea is the same.

## Theorems and Proofs

Theorems can be proved in Lean in multiple styles (term-style, tactic-style, mixed-style, etc.). We focus on tactic-style proofs. A proof is a sequence of tactics. Think of a tactic as a proof step. Each tactic transforms current goals into (hopefully simpler) sub-goals. The proof is complete when all remaining goals become trivial. Below is an example of interacting with a proof assistant to prove the theorem about summing \(n\) positive integers.

Initially, we know \(n \in \mathbb{N}\) and want to prove \(1 + 2 + \dots + n = \frac{n (n + 1)}{2}\).
We enter the tactic `induction n`

to reason by induction on the integer \(n\), which leads to two sub-goals,
one corresponding to \(n = 0\) and the other corresponding to \(n \gt 0\). Then we enter the tactic
`reflexivity`

to prove the trivial case (\(n = 0\)). Finally, we enter the tactic `subst; reflexivity`

to finish the proof. This process generates a proof tree whose nodes are tactic states (states for short) and whose edges
are tactics. Each state has a goal (e.g., \(1 + 2 + \dots + n = \frac{n (n + 1)}{2}\)) and a local context consisting of
hypotheses local to the goal (e.g., \(n \in \mathbb{N}\)).

Theorems/proofs do not exist in isolation. When proving a theorem, we almost inevitably rely on existing theorems and definitions. Consider a simple example we’ve seen in Getting Started:

```
open nat (add_assoc add_comm)
theorem hello_world (a b c : ℕ) : a + b + c = a + c + b :=
begin
rw [add_assoc, add_comm b, ←add_assoc]
end
```

The tactic `rw [add_assoc, add_comm b, ←add_assoc]`

uses the premise `add_assoc`

,
which is an existing theorem stating that the addition between natural numbers is associative
(Similarly, `add_comm`

is about commutativity). Premise selection, i.e. generating the
premises `add_assoc`

and `add_comm`

, is a major challenge in theorem proving (for
both humans and machines). This is because the space of potentially useful premises is huge.
In principle, it includes all existing math when we attempt to prove a new math theorem. In practice,
there can be hundreds of thousands of potential premises, which cannot fit into the input window of
any Transformer-based large language model. Therefore, it is difficult for these models to perform
premise selection effectively when generating tactics.

We have presented a vastly simplified view of theorems and proofs in Lean. For more detail, Theorem Proving in Lean (3, 4)
is the definitive source. You may also refer to Sec. 3 of the LeanDojo paper.
In LeanDojo, theorems are implemented by the `lean_dojo.data_extraction.lean.Theorem`

class.

## Traced Repos

Conceptually, a traced repo is simply a collection of traced files. These files form a directed acyclic graph (DAG) by importing each other. They include not only files from the target repo but also dependencies that are actually used by the target repo. Taking lean-liquid in Lean 3 as an example, it depends on mathlib and Lean’s standard library. The corresponding traced repo includes all traced files from lean-liquid, as well as some files from mathlib and Lean’s standard library that are actually used by lean-liquid. After tracing finishes, these files are stored on the disk in the following directories:

```
lean-liquid
├─src
└─_target
└─deps
├─lean
│ └─library
└─ mathlib
```

We call `lean-liquid`

the **root directory** of the traced repo.
Lean’s standard library is in `lean-liquid/_target/deps/lean/library`

.
Source files of lean-liquid itself are in `lean-liquid/src`

.
Mathlib and other dependencies (if any) are in `lean-liquid/_target/deps`

.

In Lean 4, dependencies are stored in `lake-packages`

instead of `_target/deps`

.

In LeanDojo, traced repos are implemented by the `lean_dojo.data_extraction.traced_data.TracedRepo`

class.

## Traced Files

Tracing a `*.lean`

file generates the following files:

`*.olean`

: Lean’s compiled object file. We’re not concerned with them.`*.dep_paths`

: Paths of dependencies imported by the`*.lean`

file. We generate them using`lean --deps`

(Lean 3) or ExtractData.lean (Lean 4).`*.ast.json`

: AST of the`*.lean`

file annotated with semantic information such as tactic states and name resolutions. We generate them using`lean --ast --tsast --tspp`

(Lean 3) or ExtractData.lean (Lean 4).`*.trace.xml`

: Syntactic and semantic information extracted from Lean. They are generated by post-processing`*.dep_paths`

and`*.ast.json`

files to organize the information in a nice way.

In LeanDojo, tracing is done by running build_lean3_repo.py
with our modified Lean 3 or
build_lean4_repo.py with
ExtractData.lean
Traced files are implemented by the `lean_dojo.data_extraction.traced_data.TracedFile`

class.

## Traced Theorems and Tactics

Traced theorems and tactics provide easy access to various information, such as the human-written
proof of a theorem, the number of tactics in a theorem, the premises used in a theorem/tactic,
whether the theorem has a tactic-style proof, etc. Please refer to `lean_dojo.data_extraction.traced_data.TracedTheorem`

and `lean_dojo.data_extraction.traced_data.TracedTactic`

for details.

## Constructing Benchmark Datasets Using LeanDojo

Traced repos/files/theorems provide all information we need, and we can construct concrete machine learning datasets with some additional post-processing. See our examples of constructing datasets from Lean 3’s mathlib and Lean 4’s mathlib4.

## Interacting with Lean

LeanDojo enables interacting with Lean programmatically using tactics. Please see the demos (Lean 3 version, Lean 4 version) and dojo.py.

## Caching

Tracing large repos such as mathlib can take a long time (~1 hour if using 32 CPUs) and at least 32 GB memory.
Therefore, we trace the repo only once and cache the traced repo for fast access in the future.
The default cache directory is `~/.cache/lean_dojo`

, which can be configured through the `CACHE_DIR`

environment variable.
Traced repos in the cache are read-only, which prevents accidental modifications.
You need `chmod`

if you want to, e.g., clean the cache. However, **refrain from manually modifying the cache while LeanDojo is running.**
It may lead to unpredictable behaviors, e.g., LeanDojo may attempt to refill the cache.

Traced repos in the cache are portable across machines. We host a number of them on AWS S3,
and LeanDojo will automatically download them if they are not in the local cache. To disable this behavior and
build all repos locally, set the `DISABLE_REMOTE_CACHE`

environment variable to any value.

The caching mechanism in LeanDojo is implemented in cache.py.

## Environment Variables

LeanDojo’s behavior can be configured through the following environment variables:

`CACHE_DIR`

: Cache directory (see Caching). Default to`~/.cache/lean_dojo`

.`DISABLE_REMOTE_CACHE`

: Whether to disable remote caching and build all repos locally. Not set by default.`TMP_DIR`

: Temporary directory used by LeanDojo for storing intermediate files. Default to the systems’ global temporary directory.`NUM_PROCS`

: Number of parallel processes for data extraction. Default to 32 or the number of CPUs (whichever is smaller).`TACTIC_TIMEOUT`

: Maximum time (in milliseconds) before interrupting a tactic when interacting with Lean (only applicable to Lean 3). Default to 5000.`TACTIC_CPU_LIMIT`

: Number of CPUs for executing tactics when interacting with Lean. Default to 1.`TACTIC_MEMORY_LIMIT`

: Maximum memory when interacting with Lean. Default to 16 GB.`GITHUB_ACCESS_TOKEN`

: GitHub personal access token for using the GitHub API. They are optional. If provided, they can increase the API rate limit.`LOAD_USED_PACKAGES_ONLY`

: Setting it to any value will cause LeanDojo to load only the dependency files that are actually used by the target repo. Otherwise, for Lean 4, it will load all files in the dependency repos. Not set by default.`VERBOSE`

or`DEBUG`

: Setting either of them to any value will cause LeanDojo to print debug information. Not set by default.

LeanDojo supports python-dotenv. You can use it to manage environment variables in a `.env`

file.

## Questions and Bugs

For general questions and discussions, please use GitHub Discussions.

To report a potential bug, please open an issue. In the issue, please include your OS information, the exact steps to reproduce the error, and complete logs in debug mode (setting the environment variable

`VERBOSE`

to 1). The more details you provide, the better we will be able to help you.