Generating an AST

To generate an AST, pass a language's Tree-sitter node-types.json and LanguageFn to the generate function of the auto_lsp_codegen crate.

cargo add auto_lsp_codegen

Note

Although auto_lsp_codegen is a standalone crate, the generated code depends on the main auto_lsp crate.

Usage

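The auto_lsp_codegen crate exposes a single generate function, which takes:

  • the contents of a Tree-sitter node-types.json,
  • the language, obtained from its LanguageFn,
  • an optional HashMap<&str, &str> of token names (see Custom Tokens),

and returns a TokenStream.

How you choose to use the TokenStream is up to you.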

The most common setup is to call it from a build.rs script and write the generated code to a Rust file.

Note, however, that the output can be quite large: Python's AST, for example, results in about 11,000 lines of code.

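use auto_lsp_codegen::generate;
use std::{fs, path::PathBuf};

fn main() {
    // Only regenerate when AST_GEN is set to something other than "0".
    if std::env::var("AST_GEN").unwrap_or_else(|_| "0".to_string()) == "0" {
        return;
    }

    let output_path = PathBuf::from("./src/generated.rs");

    // Render the TokenStream to a string and write it to the output file.
    fs::write(
        output_path,
        generate(
            tree_sitter_python::NODE_TYPES,
            &tree_sitter_python::LANGUAGE.into(),
            None, // no custom token names
        )
        .to_string(),
    )
    .expect("failed to write src/generated.rs");
}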

You can also invoke it from your own CLI or tool if needed.

How Codegen Works

The generated code structure depends on the Tree-sitter grammar.

Structs for Rules

Each node type described in node-types.json becomes a dedicated Rust struct. For example, given this rule from the Python grammar:

function_definition: $ => seq(
  optional('async'),
  'def',
  field('name', $.identifier),
  field('type_parameters', optional($.type_parameter)),
  field('parameters', $.parameters),
  optional(
    seq(
      '->',
      field('return_type', $.type),
    ),
  ),
  ':',
  field('body', $._suite),
),

The generated struct would look like this:

#[derive(Debug, Clone, PartialEq)]
pub struct FunctionDefinition {
    pub name: std::sync::Arc<Identifier>,
    pub body: std::sync::Arc<Block>,
    pub type_parameters: Option<std::sync::Arc<TypeParameter>>,
    pub parameters: std::sync::Arc<Parameters>,
    pub return_type: Option<std::sync::Arc<Type>>,
    /* ... */
}

Field Matching

To match fields, codegen uses the field_id() method from the Tree-sitter cursor.

From the above example, the generated builder might look like this:

builder.builder(db, &node, Some(id), |b| {
  b.on_field_id::<Identifier, 19u16>(&mut name)?
    .on_field_id::<Block, 6u16>(&mut body)?
    .on_field_id::<TypeParameter, 31u16>(&mut type_parameters)?
    .on_field_id::<Parameters, 23u16>(&mut parameters)?
    .on_field_id::<Type, 24u16>(&mut return_type)
});

Each u16 represents the unique field ID assigned by the Tree-sitter language parser.
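To make the role of these IDs concrete, here is a minimal, std-only sketch of dispatching on a numeric field ID. It is not the code auto_lsp generates: FunctionDefinitionFields and assign_field are hypothetical names, the slots hold plain strings instead of AST nodes, and the IDs are simply copied from the builder snippet above, so they are only valid for the grammar version that produced them.

```rust
/// Hypothetical slots for the fields of `function_definition`.
/// Each slot stores the text of the child node, for illustration only.
#[derive(Debug, Default, PartialEq)]
pub struct FunctionDefinitionFields {
    pub name: Option<String>,
    pub body: Option<String>,
    pub type_parameters: Option<String>,
    pub parameters: Option<String>,
    pub return_type: Option<String>,
}

/// Stores `text` in the slot that matches `field_id`.
/// Returns `false` when the ID does not belong to this struct.
/// The IDs mirror the builder snippet above; they are assumptions
/// tied to one specific version of the Python grammar.
pub fn assign_field(
    fields: &mut FunctionDefinitionFields,
    field_id: u16,
    text: &str,
) -> bool {
    let slot = match field_id {
        19 => &mut fields.name,
        6 => &mut fields.body,
        31 => &mut fields.type_parameters,
        23 => &mut fields.parameters,
        24 => &mut fields.return_type,
        _ => return false,
    };
    *slot = Some(text.to_string());
    true
}
```

Matching on the numeric ID avoids comparing field names as strings for every child visited, which is presumably why the IDs are baked into the generated code as constants.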

Handling Children

If a node has no named fields, a children enum is generated to represent all possible variants.

  • If the children are unnamed, a generic "Operator_" enum is generated.
  • If the children are named, the enum name is the concatenation of all possible child node types, joined with underscores and sanitized into Rust-friendly names.

For example, given the rule:

_statement: $ => choice(
  $._simple_statement,
  $._compound_statement,
),

The generated enum would look like this:

pub enum SimpleStatement_CompoundStatement {
    SimpleStatement(SimpleStatement),
    CompoundStatement(CompoundStatement),
}
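The naming scheme can be approximated with a few lines of std-only Rust. This is a sketch of the idea, not the crate's actual sanitizer: pascal_case and children_enum_name are hypothetical helpers, and the real implementation may treat edge cases (digits, keywords, tokens) differently.

```rust
/// Converts a Tree-sitter node type such as `_simple_statement`
/// into a PascalCase Rust identifier such as `SimpleStatement`.
/// Leading and repeated underscores are dropped.
pub fn pascal_case(node_type: &str) -> String {
    node_type
        .split('_')
        .filter(|part| !part.is_empty())
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                // Uppercase the first character, keep the rest unchanged.
                Some(first) => first.to_uppercase().chain(chars).collect::<String>(),
                None => String::new(),
            }
        })
        .collect()
}

/// Joins the sanitized names of all child node types with underscores.
pub fn children_enum_name(child_types: &[&str]) -> String {
    child_types
        .iter()
        .map(|t| pascal_case(t))
        .collect::<Vec<_>>()
        .join("_")
}
```

Calling children_enum_name(&["_simple_statement", "_compound_statement"]) yields SimpleStatement_CompoundStatement, matching the enum above. It also shows how quickly the name grows with the number of child types.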

Note

If the generated enum name becomes too long, consider using a Tree-sitter supertype to group nodes together.

The kind_id() method is used to determine child kinds during traversal.

The AstNode::contains method relies on this to check whether a node kind belongs to a specific struct or enum variant.
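As a rough illustration of a kind-based membership check, the sketch below uses a simplified stand-in trait. ContainsKind, its signature, and the numeric kind IDs are all assumptions made for this example. The real AstNode::contains may take different arguments, and real kind IDs come from the Tree-sitter language.

```rust
// Hypothetical kind IDs; real values are assigned by the Tree-sitter language.
pub const CLASS_DEFINITION: u16 = 1;
pub const DECORATED_DEFINITION: u16 = 2;
pub const WITH_STATEMENT: u16 = 3;
pub const IDENTIFIER: u16 = 4;

/// Simplified stand-in for the `contains` check described above.
pub trait ContainsKind {
    fn contains(kind_id: u16) -> bool;
}

pub struct Identifier;
pub enum CompoundStatement {}

impl ContainsKind for Identifier {
    // A struct accepts exactly one node kind.
    fn contains(kind_id: u16) -> bool {
        kind_id == IDENTIFIER
    }
}

impl ContainsKind for CompoundStatement {
    // An enum accepts the kind of any of its variants.
    fn contains(kind_id: u16) -> bool {
        matches!(
            kind_id,
            CLASS_DEFINITION | DECORATED_DEFINITION | WITH_STATEMENT
        )
    }
}
```

With a check of this shape, a traversal can ask each candidate type whether it accepts the current child before trying to build it.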

Vec and Option Fields

In the grammar, repeat and repeat1 generate a Vec field, while optional(...) generates an Option<T> field.
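The hand-written sketch below shows what consuming both field shapes could look like. ImportStatement, its field names, and describe are hypothetical and not taken from any generated output. It also assumes that repeated fields wrap their items in Arc the same way the single-valued fields of FunctionDefinition do.

```rust
use std::sync::Arc;

/// Stand-in for a generated leaf node; holds its source text.
#[derive(Debug, Clone, PartialEq)]
pub struct Identifier(pub String);

/// Hypothetical node with one field of each shape.
#[derive(Debug, Clone, PartialEq)]
pub struct ImportStatement {
    /// From `repeat1(...)`: one or more names, stored as a `Vec`.
    pub names: Vec<Arc<Identifier>>,
    /// From `optional(...)`: `None` when the source omits it.
    pub alias: Option<Arc<Identifier>>,
}

/// Lists the imported names, appending the alias when present.
pub fn describe(import: &ImportStatement) -> String {
    let names: Vec<&str> = import.names.iter().map(|n| n.0.as_str()).collect();
    match &import.alias {
        Some(alias) => format!("{} as {}", names.join(", "), alias.0),
        None => names.join(", "),
    }
}
```

Because absence is encoded in the type, callers handle a missing optional child with an ordinary match rather than a null check.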

Token Naming

Unnamed tokens are mapped to Rust enums using a built-in token map. For instance:

  { "type": "+", "named": false },
  { "type": "+=", "named": false },
  { "type": ",", "named": false },
  { "type": "-", "named": false },
  { "type": "-=", "named": false },

Generates:

pub enum Token_Plus {}
pub enum Token_PlusEqual {}
pub enum Token_Comma {}
pub enum Token_Minus {}
pub enum Token_MinusEqual {}

Tokens with regular identifiers are converted to PascalCase.

Custom Tokens

If your grammar defines additional unnamed tokens not covered by the default map, you can provide a custom token mapping to generate appropriate Rust enum names.

use auto_lsp_codegen::generate;
use std::collections::HashMap;

let _result = generate(
    tree_sitter_python::NODE_TYPES,
    &tree_sitter_python::LANGUAGE.into(),
    // Maps each unnamed token to the name used for its generated enum.
    Some(HashMap::from([
        ("+", "Plus"),
        ("+=", "PlusEqual"),
        (",", "Comma"),
        ("-", "Minus"),
        ("-=", "MinusEqual"),
    ])),
);

Custom entries are merged into the default map: tokens that are not yet in the default map are added, and tokens that already exist in it are overwritten by the custom name.
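These add-or-overwrite semantics match what HashMap::extend does in the standard library. The sketch below demonstrates them with a hypothetical merge_tokens helper. It is not the crate's internal code, and the token names in the usage example are made up.

```rust
use std::collections::HashMap;

/// Merges `custom` into `defaults`: new keys are inserted and
/// existing keys take the custom value.
pub fn merge_tokens(
    mut defaults: HashMap<&'static str, &'static str>,
    custom: HashMap<&'static str, &'static str>,
) -> HashMap<&'static str, &'static str> {
    // `extend` inserts every (key, value) pair, replacing any existing value.
    defaults.extend(custom);
    defaults
}
```

For example, merging a custom map containing ("-", "Dash") and ("=>", "FatArrow") into defaults of ("+", "Plus") and ("-", "Minus") keeps "+" as Plus, renames "-" to Dash, and adds "=>" as FatArrow.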

Super Types

Tree-sitter supports supertypes, which allow grouping related nodes under a common type.

For example, in the Python grammar:

  {
    "type": "_compound_statement",
    "named": true,
    "subtypes": [
      {
        "type": "class_definition",
        "named": true
      },
      {
        "type": "decorated_definition",
        "named": true
      },
      /* ... */
      {
        "type": "with_statement",
        "named": true
      }
    ]
  },

This becomes a Rust enum:

pub enum CompoundStatement {
    ClassDefinition(ClassDefinition),
    DecoratedDefinition(DecoratedDefinition),
    /* ... */
    WithStatement(WithStatement),
}

Note

Some supertypes contain other supertypes, in which case the generated enum flattens the hierarchy.
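Flattening can be pictured as a recursive expansion of subtypes. The function below is a std-only sketch under stated assumptions: flatten_subtypes is a hypothetical helper, the supertype table is supplied by the caller rather than parsed from node-types.json, and the hierarchy is assumed to be acyclic.

```rust
use std::collections::HashMap;

/// Expands `name` into its concrete subtypes, following nested
/// supertypes recursively. `supertypes` maps each supertype to its
/// direct subtypes; any name not in the map is treated as concrete.
/// Assumes the supertype graph has no cycles.
pub fn flatten_subtypes<'a>(
    name: &'a str,
    supertypes: &HashMap<&'a str, Vec<&'a str>>,
) -> Vec<&'a str> {
    match supertypes.get(name) {
        Some(children) => children
            .iter()
            .flat_map(|&child| flatten_subtypes(child, supertypes))
            .collect(),
        None => vec![name],
    }
}
```

Given a table where a supertype lists two further supertypes, the result contains only the leaf node types, in declaration order. This corresponds to an enum with one variant per concrete node.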