lean_dojo.data_extraction.traced_data

This module defines traced repos/files/theorems.

class lean_dojo.data_extraction.traced_data.Comment(start: Pos, end: Pos, text: str)[source]

Bases: object

A comment in a Lean file.

end: Pos

classmethod from_xml(tree: Element) → Comment[source]

start: Pos

text: str

to_xml(parent: Element) → None[source]

class lean_dojo.data_extraction.traced_data.TracedFile(root_dir: Path, repo: LeanGitRepo, lean_file: LeanFile, ast: FileNode, comments: List[Comment], traced_repo: TracedRepo | None = None)[source]

Bases: object

A traced file is a Lean source file annotated with syntactic/semantic information such as tactic states, Lean expressions, and abstract syntax trees (ASTs).

property abs_path: Path: Absolute path of the *.lean file.

ast: FileNode

Abstract syntax tree (AST) of the entire *.lean file.

AST nodes are defined in lean_dojo.data_extraction.ast.

check_sanity() → None[source]

Perform some basic sanity checks.

The function raises exceptions in case of unsuccessful checks.

comments: List[Comment]: All comments in the *.lean file.

classmethod from_traced_file(root_dir: str | Path, json_path: Path, repo: LeanGitRepo) → TracedFile[source]

Construct a TracedFile object by parsing a *.ast.json file produced by lean --ast --tsast --tspp (Lean 3) or ExtractData.lean (Lean 4).

Parameters:

root_dir (Union[str, Path]) – Root directory of the traced repo.
json_path (Path) – Path of the *.ast.json file relative to root_dir.

classmethod from_xml(root_dir: str | Path, path: str | Path, repo: LeanGitRepo) → TracedFile[source]

Load a TracedFile object from its *.trace.xml file.

Parameters:

root_dir (Union[str, Path]) – Root directory of the traced repo.
path (Union[str, Path]) – Path of the *.trace.xml file relative to root_dir.
repo (LeanGitRepo) – The repo to which the traced file belongs.

get_direct_dependencies(repo: LeanGitRepo) → List[Tuple[str, Path]][source]: Return the names and paths of all modules imported by the current *.lean file.

get_premise_definitions() → List[Dict[str, Any]][source]

Return all theorems and definitions defined in the current file that can be potentially used as premises, including the premises in the theorem statement and premises in the tactics used to prove the theorem.

Returns:: _description_
Return type:: List[Dict[str, Any]]

get_traced_theorem(thm_or_name: Theorem | str) → TracedTheorem | None[source]: Return a TracedTheorem object given an Theorem object or its fully-qualified name.

get_traced_theorems() → List[TracedTheorem][source]: Return a list of traced theorem in this traced file.

property has_prelude: bool

Check whether the file starts with :code:prelude.

:code:prelude instructs Lean NOT to include its built-in library automatically.

lean_file: LeanFile: Lean source file of this traced file.

property path: Path: Path of the *.lean file relative to the root directory.

repo: LeanGitRepo: The Lean repo this traced file belongs to.

root_dir: Path: Root directory (in absolute path) of the corresponding traced repo.

to_xml() → str[source]: Serialize a TracedFile object to XML.

traced_repo: TracedRepo | None = None

The traced repo this traced file belongs to.

Note that traced_repo will become None after the traced file is serialized/deserialized on its own.

traverse_preorder(callback, node_cls: type | None = None)[source]

Traverse the AST in preorder.

Parameters:

callback (function) – Callback function for visiting AST nodes.
node_cls (Optional[type], optional) – Restrict the application of callback to only nodes of type node_cls. Defaults to None, which means applying callback to all.

class lean_dojo.data_extraction.traced_data.TracedRepo(repo: LeanGitRepo, dependencies: Dict[str, LeanGitRepo], root_dir: Path, traced_files: List[TracedFile], traced_files_graph: DiGraph | None)[source]

Bases: object

A traced repo is a Lean repo of traced files and additional information, such as other repos it depends on, as well as the dependency graph between files.

check_sanity() → None[source]

Perform some basic sanity checks.

The function raises exceptions in case of unsuccessful checks.

dependencies: Dict[str, LeanGitRepo]: Dictionary mapping the name of each dependency to a LeanGitRepo object.

classmethod from_traced_files(root_dir: str | Path, build_deps: bool = True) → TracedRepo[source]

Construct a TracedRepo object by parsing *.ast.json and *.path files: produced by lean --ast --tsast --tspp (Lean 3) or ExtractData.lean (Lean 4).

Parameters:

root_dir (Union[str, Path]) – Root directory of the traced repo.
build_deps (bool, optional) – Whether to build the dependency graph between files.

get_traced_file(path: str | Path) → TracedFile[source]: Return a traced file by its path.

get_traced_theorem(thm: Theorem) → TracedTheorem | None[source]: Return a TracedTheorem object corresponding to thm

get_traced_theorems() → List[TracedTheorem][source]: Return all traced theorems in the repo.

classmethod load_from_disk(root_dir: str | Path, build_deps: bool = True) → TracedRepo[source]: Load a traced repo from *.trace.xml files.

property name: str: Name of the repo.

repo: LeanGitRepo: The corresponding Lean repo.

root_dir: Path: Root directory of the traced repo.

save_to_disk() → None[source]: Save all traced files in the repo to the disk as *.trace.xml files.

show() → None[source]: Show the repo in the default browser.

traced_files: List[TracedFile]: List of traced files in the repo.

traced_files_graph: DiGraph | None

Dependency graph between files in the repo.

The graph is a DAG, and there is an edge from file X to file Y if and only if X imports Y

class lean_dojo.data_extraction.traced_data.TracedTactic(ast: Node, traced_theorem: TracedTheorem | None = None)[source]

Bases: object

A traced tactic is a tactic annotated with additional information including its AST and the states before/after the tactic.

ast: Node: AST of the tactic.

property end: Pos: End position in *.lean file.

get_annotated_tactic() → Tuple[str, List[Dict[str, Any]]][source]

Return the tactic annotated with premise information.

Premises in the tactic are marked by <a> ... </a>. For example, rw [add_comm b] contains a premise add_comm and therefore becomes rw [<a>add_comm</a> b]. In addition, the function returns the provenance (full name, file path, line/column numbers) of all premises.

Returns:: The first return value is the tactic string marked by <a> ... </a>. The second return value is a list of provenances.
Return type:: Tuple[str, List[Dict[str, Any]]]

property start: Pos: Start position in *.lean file.

property state_after: str: Pretty-printed state after applying the tactic.

property state_before: str: Pretty-printed state before applying the tactic.

property tactic: str: The raw tactic string.

to_string() → str[source]

traced_theorem: TracedTheorem | None = None: The traced theorem this tactic belongs to.

class lean_dojo.data_extraction.traced_data.TracedTheorem(root_dir: Path, theorem: Theorem, ast: CommandTheoremNode | LemmaNode | MathlibTacticLemmaNode, comments: List[Comment], traced_file: TracedFile | None = None)[source]

Bases: object

A traced theorem is a theorem with additional information such as the AST.

ast: CommandTheoremNode | LemmaNode | MathlibTacticLemmaNode: AST of the theorem.

comments: List[Comment]: All comments in the theorem/proof.

property end: Pos: End position in *.lean file.

property file_path: Path: The theorem’s file path (relative to the root directory).

get_num_tactics() → int[source]: Return the number of tactics in the proof.

get_premise_full_names() → List[str][source]: Return the fully qualified names of all premises used in the proof.

get_proof_node() → Node[source]: Return the AST of the theorem’s proof.

get_tactic_proof() → str | None[source]: Return the tactic-style proof (if any).

get_theorem_statement() → str[source]: Return the theorem statement.

get_traced_tactics(atomic_only: bool = False) → List[TracedTactic][source]: Return a list of traced tactics in the proof.

has_tactic_proof() → bool[source]: Check if the theorem has a tactic-style proof.

property is_private: bool: Check if the theorem is private.

locate_proof() → Tuple[Pos, Pos][source]: Return the start/end positions of the proof.

property repo: LeanGitRepo: The Lean repo this theorem belongs to.

root_dir: Path: Root directory of the corresponding traced repo.

show() → None[source]: Show the theorem in the default browser.

property start: Pos: Start position in *.lean file.

theorem: Theorem: The corresponding Theorem object.

traced_file: TracedFile | None = None: The traced file this theorem belongs to.

property traced_repo: TracedRepo | None: The traced repo this theorem belongs to.

lean_dojo.data_extraction.traced_data.get_code_without_comments(lean_file: LeanFile, start: Pos, end: Pos, comments: List[Comment]) → str[source]

Return the code in lean_file from start to end with comments removed.

Parameters:

lean_file (LeanFile) – The lean source file.
start (Pos) – The start position.
end (Pos) – The end position.
comments (List[Comment]) – A list of Comment objects.

Returns:

Human-written code with comments removed.

Return type:

str