This module define classes for repos, files, and theorems in Lean. Objects of these classes contain only surface information, without extracting any trace.

lean_dojo.data_extraction.lean.GITHUB_ACCESS_TOKEN = None

GiHub personal access token is optional. If provided, it can increase the rate limit for GitHub API calls.

lean_dojo.data_extraction.lean.LEAN4_NIGHTLY_REPO = Repository(full_name="leanprover/lean4-nightly")

The GitHub Repo for Lean 4 nightly releases.

lean_dojo.data_extraction.lean.LEAN4_REPO = Repository(full_name="leanprover/lean4")

The GitHub Repo for Lean 4 itself.

class lean_dojo.data_extraction.lean.LeanFile(root_dir: Path, path: Path)[source]

Bases: object

A Lean source file (*.lean).

__getitem__(key) str[source]

Return a code segment given its start/end positions.

This enables lean_file[start:end].


key (slice) – A slice of two Pos objects for the start/end of the code segment.

property abs_path: Path

Absolute path of a LeanFile object.

E.g., /home/kaiyu/traced_lean-example/lean-example/src/example.lean

code: List[str]

Raw source code as a list of lines.

convert_pos(byte_idx: int) Pos[source]

Convert a byte index (String.Pos in Lean 4) to a Pos object.

property end_pos: Pos

Return the end position of a source file.


zero_indexed (bool, optional) – Whether to use 0-index instead of 1-index. Defaults to False.


A Pos object representing the end of this file.

Return type:


endwith_newline: bool

Whether the last line ends with a newline.

get_line(line_nb: int) str[source]

Return a given line of the source file.


line_nb (int) – Line number (1-indexed).

num_bytes: List[int]

The number of UTF-8 bytes of each line, including newlines.

num_columns(line_nb: int) int[source]

Number of columns in a source file.

property num_lines: int

Number of lines in a source file.

offset(pos: Pos, delta: int) Pos[source]

Off set a position by a given number.

path: Path

Relative path w.r.t. root_dir

E.g., lean-example/src/example.lean

root_dir: Path

Root directory of the traced repo this LeanFile object belongs to.

root_dir must be an absolute path, e.g., /home/kaiyu/traced_lean-example/lean-example

property start_pos: Pos

Return the start position of a source file.


A Pos object representing the start of this file.

Return type:


class lean_dojo.data_extraction.lean.LeanGitRepo(url: str, commit: str)[source]

Bases: object

Git repo of a Lean project.

clone_and_checkout() None[source]

Clone the repo to the current working directory and checkout a specific commit.

commit: str

The repo’s commit hash.

You can also use tags such as v3.5.0. They will be converted to commit hashes.

property commit_url: str
exists() bool[source]
classmethod from_path(path: Path) LeanGitRepo[source]

Construct a LeanGitRepo object from the path to a local Git repo.

get_config(filename: str, num_retries: int = 2) Dict[str, Any][source]

Return the repo’s files.

get_dependencies(path: str | Path | None = None) Dict[str, LeanGitRepo][source]

Return the dependencies required by the target repo.


path (Union[str, Path, None], optional) – Root directory of the repo if it is on the disk.


A dictionary mapping the name of each dependency to its LeanGitRepo object.

Return type:

Dict[str, LeanGitRepo]

get_license() str | None[source]

Return the content of the LICENSE file.

property is_lean4: bool
lean_version: str

Required Lean version.

property name: str
repo: Repository

A github.Repository object.

show() None[source]

Show the repo in the default browser.

url: str

The repo’s Github URL.

Note that we only support Github as of now.

class lean_dojo.data_extraction.lean.Pos(line_nb: int, column_nb: int)[source]

Bases: object

Position in source files.

We use 1-index to keep it consistent with code editors such as Visual Studio Code.

column_nb: int

Column number

classmethod from_str(s: str) Pos[source]

Construct a Pos object from its string representation, e.g., "(323, 1109)".

line_nb: int

Line number

class lean_dojo.data_extraction.lean.RepoInfoCache(tag2commit: ~typing.Dict[~typing.Tuple[str, str], str] = <factory>, lean_version: ~typing.Dict[~typing.Tuple[str, str], str] = <factory>)[source]

Bases: object

To minize the number of network requests, we cache and re-use the info of all repos, assuming it does not change during the execution of LeanDojo.

lean_version: Dict[Tuple[str, str], str]
tag2commit: Dict[Tuple[str, str], str]
class lean_dojo.data_extraction.lean.Theorem(repo: LeanGitRepo, file_path: Path, full_name: str)[source]

Bases: object

Theorem in Lean.

Theorems are named constants of type Prop. They are typically defined using the keywords theorem or lemma, but it’s possible to use other keywords such as def or instance

file_path: Path

Lean source file the theorem comes from.

full_name: str

Fully qualified name of the theorem.

repo: LeanGitRepo

Lean repo the theorem comes from.

property uhash: str

Unique hash of the theorem.

property uid: str

Unique identifier of the theorem.

lean_dojo.data_extraction.lean.cleanse_string(s: str | Path) str[source]

Replace : and / with _ in a string.

lean_dojo.data_extraction.lean.get_latest_commit(url: str) str[source]

Get the hash of the latest commit of the Git repo at url.

lean_dojo.data_extraction.lean.get_lean4_commit_from_config(config_dict: Dict[str, Any]) str[source]

Return the required Lean commit given a lean-toolchain config.

lean_dojo.data_extraction.lean.get_lean4_version_from_config(toolchain: str) str[source]

Return the required Lean version given a lean-toolchain config.

lean_dojo.data_extraction.lean.is_supported_version(v) bool[source]

Check if v is at least v4.3.0-rc2.

lean_dojo.data_extraction.lean.normalize_url(url: str) str[source]
lean_dojo.data_extraction.lean.url_to_repo(url: str, num_retries: int = 2) Repository[source]