Tree-sitter Internals

How NeoVim’s tree-sitter integration works: the LanguageTree abstraction, query system, language injection, incremental parsing, and the highlighter.

LanguageTree

The central abstraction is LanguageTree (runtime/lua/vim/treesitter/languagetree.lua). It wraps tree-sitter’s parser and manages parse trees, including nested languages (like JavaScript inside HTML).

-- Get the LanguageTree for a buffer
local lt = vim.treesitter.get_parser(0, "lua")

-- Parse and get the tree
local trees = lt:parse()    -- Returns list of trees (one per region)
local root = trees[1]:root()

-- The tree is immutable. After buffer changes, re-parse:
lt:parse()  -- Incremental - only re-parses changed regions

Key LanguageTree concepts:

Concept     What it is
Regions     Portions of the buffer this parser handles
Children    Nested LanguageTrees for injected languages
Valid       Whether the trees are up to date with the buffer (false means a re-parse is needed)
Callbacks   Hooks for parse events (on_bytes, on_changedtree, on_child_added, etc.)
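
As a rough illustration of the Valid concept, the sketch below parses a scratch buffer, edits it, and checks is_valid() before and after. The scratch buffer and its contents are assumptions made for the example. Passing true to is_valid() skips injected children, so only the root tree's state is reported.

```lua
-- Scratch buffer with known Lua content (illustrative only)
local buf = vim.api.nvim_create_buf(false, true)
vim.api.nvim_buf_set_lines(buf, 0, -1, false, { "local x = 1" })

local lt = vim.treesitter.get_parser(buf, "lua")
lt:parse()
local valid_after_parse = lt:is_valid(true) -- true = ignore injected children

-- Editing the buffer invalidates the affected region until the next parse()
vim.api.nvim_buf_set_lines(buf, 0, 1, false, { "local x = 2" })
local valid_after_edit = lt:is_valid(true)

lt:parse()
print(valid_after_parse, valid_after_edit, lt:is_valid(true))
```

The three printed values show the state after the first parse, after the edit, and after the re-parse.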

Incremental Parsing

Tree-sitter parsers are incremental. When you type a character, NeoVim doesn’t re-parse the entire file. Instead:

1. Buffer change notification (on_bytes callback)
2. LanguageTree applies the edit to its existing trees and marks the affected regions as invalid
3. On next parse(), tree-sitter reuses unchanged subtrees and re-parses only around the edit
4. Comparing the old and new trees yields the changed ranges (reported via on_changedtree)
5. Highlighting updates only the changed screen lines

This is why tree-sitter highlighting stays fast even in large files. A keystroke in a 10,000-line file re-parses only the nodes around the edit (typically the enclosing function), not the whole file.

-- You can observe this with callbacks
local parser = vim.treesitter.get_parser(0, "lua")
parser:register_cbs({
  on_bytes = function(_, _, start_row, start_col, _, old_end_row, old_end_col, _, new_end_row, new_end_col, _)
    print(string.format("Changed at %d:%d", start_row, start_col))
  end,
  on_changedtree = function(ranges)
    -- ranges lists the node ranges whose syntax changed in this re-parse
    print("Re-parsed; " .. #ranges .. " changed ranges")
  end,
})

Language Injection

A key feature: one buffer can contain multiple languages. HTML files have CSS and JavaScript. Markdown has code blocks. Tree-sitter handles this through injection queries.

;; runtime/queries/html/injections.scm (simplified)
;; Inject CSS into <style> tags
(style_element
  (raw_text) @injection.content
  (#set! injection.language "css"))

;; Inject JavaScript into <script> tags
(script_element
  (raw_text) @injection.content
  (#set! injection.language "javascript"))

NeoVim’s implementation:

LanguageTree (HTML)
  |
  +-- LanguageTree (CSS)     -- for <style> regions
  |
  +-- LanguageTree (JavaScript)  -- for <script> regions

Each child LanguageTree only parses its regions of the buffer. The parent coordinates which regions belong to which child.

-- Inspect injected languages
local parser = vim.treesitter.get_parser(0)
parser:parse(true)  -- Parse the whole buffer, including injected languages
print("Root language:", parser:lang())

for lang, child in pairs(parser:children()) do
  print("Injected:", lang)
  for _, region in ipairs(child:included_regions()) do
    print("  Region:", vim.inspect(region))
  end
end
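
The same parent/child structure can be observed without an open buffer. The sketch below is a minimal example, assuming the markdown and lua parsers that ship with NeoVim are available; the sample document is made up for illustration. It parses a Markdown string containing one fenced Lua block and collects the language of every tree.

```lua
-- Build a small Markdown document with one fenced Lua block.
-- The fence is assembled with string.rep to avoid a literal fence in this example.
local fence = string.rep("`", 3)
local src = table.concat({
  "# Title",
  "",
  fence .. "lua",
  "local x = 1",
  fence,
  "",
}, "\n")

local parser = vim.treesitter.get_string_parser(src, "markdown")
parser:parse(true) -- true = also parse injected languages

-- Collect the language of every tree, root and injected
local langs = {}
parser:for_each_tree(function(_, ltree)
  langs[ltree:lang()] = true
end)
print(vim.inspect(vim.tbl_keys(langs)))
```

Using for_each_tree() rather than children() walks the whole hierarchy, so languages injected more than one level deep are included as well.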

The Query System

Queries are S-expressions that pattern-match against tree-sitter nodes. The query system (runtime/lua/vim/treesitter/query.lua) compiles, caches, and executes these patterns.

Query Structure

;; A query pattern
(function_declaration           ;; Match this node type
  name: (identifier) @func_name ;; Capture the name child as @func_name
  parameters: (parameters       ;; Navigate into parameters
    (identifier) @param))       ;; Capture each parameter

;; Predicates filter matches (the pattern and its predicate share one outer pair of parentheses)
((string_content) @string.special
  (#lua-match? @string.special "^%w+$"))  ;; Only match alphanumeric strings

;; Directives attach metadata to a match
((comment) @comment
  (#set! priority 200))  ;; Highlighting priority

Built-in Predicates

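Predicate        Purpose
#eq?             Exact string match (against a string or another capture)
#match?          Vim regex match (alias: #vim-match?)
#any-of?         Match against a set of strings
#has-ancestor?   Check whether the node has an ancestor of a given type
#not-eq?         Negated string match (any predicate accepts the not- prefix)
#contains?       Substring check
#lua-match?      Lua pattern match (string.find syntax)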
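
To see a predicate filter matches, here is a minimal sketch that runs a #lua-match? query against a Lua source string. The source text and the pattern "^%l+$" (lowercase letters only) are assumptions chosen for the example.

```lua
-- Two function declarations; only one name is purely lowercase letters
local src = "local function foo() end\nlocal function bar_baz() end\n"

local parser = vim.treesitter.get_string_parser(src, "lua")
local root = parser:parse()[1]:root()

-- The predicate sits inside the same outer parentheses as the pattern
local query = vim.treesitter.query.parse("lua", [[
  ((function_declaration
     name: (identifier) @name)
   (#lua-match? @name "^%l+$"))
]])

-- Collect the text of every @name capture that passes the predicate
local names = {}
for _, node in query:iter_captures(root, src) do
  names[#names + 1] = vim.treesitter.get_node_text(node, src)
end
print(vim.inspect(names))
```

Swapping #lua-match? for #match? would require rewriting the pattern as a Vim regex (for example "^\\l\\+$").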

Query Execution in Lua

local query = vim.treesitter.query.parse("lua", [[
  (function_declaration
    name: (identifier) @name
    body: (block) @body)
]])

local parser = vim.treesitter.get_parser(0, "lua")
local tree = parser:parse()[1]

-- iter_captures: iterate individual captures
for id, node, metadata in query:iter_captures(tree:root(), 0) do
  local capture_name = query.captures[id]
  local row1, col1, row2, col2 = node:range()
  print(capture_name, row1, col1, vim.treesitter.get_node_text(node, 0))
end

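-- iter_matches: iterate full pattern matches
-- (each capture ID maps to a list of nodes; on Neovim 0.10 this requires passing { all = true } as the options argument)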
for pattern, match, metadata in query:iter_matches(tree:root(), 0) do
  for id, nodes in pairs(match) do
    for _, node in ipairs(nodes) do
      print(query.captures[id], vim.treesitter.get_node_text(node, 0))
    end
  end
end

The Highlighter

The highlighter (runtime/lua/vim/treesitter/highlighter.lua) connects tree-sitter queries to NeoVim’s display engine.

How It Works

1. vim.treesitter.start() creates a highlighter for the buffer
2. Highlighter registers a decoration provider with NeoVim
3. When NeoVim redraws a line range, it calls the provider
4. Provider runs highlights.scm query over visible lines only
5. Each @capture maps to a highlight group of the same name (e.g., the capture @function uses the group @function.lua, which falls back to @function)
6. Ephemeral extmarks are placed for each captured range

The key optimization: queries only run on visible lines. Scrolling to a new region triggers a query, but off-screen regions are not processed.
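
Step 1 can be triggered from a FileType autocommand. The following is a minimal sketch, not the only way to enable highlighting; the augroup name ts_highlight_demo is a hypothetical name chosen for the example.

```lua
-- Hypothetical augroup name; clear = true makes the snippet safe to re-run
local group = vim.api.nvim_create_augroup("ts_highlight_demo", { clear = true })

vim.api.nvim_create_autocmd("FileType", {
  group = group,
  pattern = "lua",
  callback = function(args)
    -- pcall guards against a missing parser for this language
    local ok, err = pcall(vim.treesitter.start, args.buf, "lua")
    if not ok then
      vim.notify("tree-sitter highlighting not started: " .. tostring(err), vim.log.levels.WARN)
    end
  end,
})
```

Calling vim.treesitter.stop() on the buffer removes the highlighter again.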

Highlight Groups

Tree-sitter captures map to highlight groups with the @ prefix:

@variable          - Variables
@function          - Function definitions
@function.call     - Function calls
@keyword           - Keywords (if, for, return)
@string            - String literals
@comment           - Comments
@type              - Type names
@operator          - Operators (+, -, =)
@punctuation.bracket - Brackets
@punctuation.delimiter - Commas, semicolons

These groups link to standard highlight groups by default (for example, @function links to Function). Your colorscheme defines what @function looks like. To check what a token resolves to:

" With the cursor on a token, show the highlights applied at that position
:Inspect
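
A colorscheme or user config can define these groups with nvim_set_hl(). The sketch below is illustrative only: the color value is a made-up example, and @function.call.lua is shown as an assumed language-specific override.

```lua
-- Link the generic group, then override a language-specific variant
vim.api.nvim_set_hl(0, "@function", { link = "Function" })
vim.api.nvim_set_hl(0, "@function.call.lua", { fg = "#7aa2f7", bold = true }) -- hypothetical color

-- Read the definitions back
local generic = vim.api.nvim_get_hl(0, { name = "@function" })
local specific = vim.api.nvim_get_hl(0, { name = "@function.call.lua" })
print(vim.inspect(generic), vim.inspect(specific))

-- Capture names under the cursor (empty if the buffer has no active highlighter)
print(vim.inspect(vim.treesitter.get_captures_at_cursor(0)))
```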

Priority

When multiple highlights overlap, priority determines which wins:

;; Higher priority wins
((comment) @comment (#set! priority 200))

Default priorities:

  • Regex syntax highlighting: 50
  • Tree-sitter highlighting: 100
  • Semantic tokens (from LSP): 125
  • Diagnostics: 150
  • User-defined: 200
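
The default values are exposed in Lua, so they can be read rather than hard-coded. The sketch below assumes the table lives at vim.hl.priorities (vim.highlight.priorities on older releases); the scratch buffer and the namespace name priority_demo are hypothetical.

```lua
-- vim.hl is the newer name; fall back to vim.highlight on older releases
local priorities = (vim.hl or vim.highlight).priorities
print(vim.inspect(priorities))

-- Place a highlight at "user" priority so it wins over tree-sitter captures
local buf = vim.api.nvim_create_buf(false, true)
vim.api.nvim_buf_set_lines(buf, 0, -1, false, { "hello world" })
local ns = vim.api.nvim_create_namespace("priority_demo") -- hypothetical name
local id = vim.api.nvim_buf_set_extmark(buf, ns, 0, 0, {
  end_col = 5,
  hl_group = "Search",
  priority = priorities.user,
})

-- Read the extmark back to confirm the priority it was stored with
local mark = vim.api.nvim_buf_get_extmark_by_id(buf, ns, id, { details = true })
print(mark[3].priority)
```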

Tree-sitter vs Regex Highlighting

Aspect               Regex (:syntax)                        Tree-sitter
Accuracy             Heuristic, often wrong                 Structural, parse-correct
Speed (small files)  Slightly faster                        Slightly slower initial parse
Speed (large files)  Can be slow (regex backtracking)       Incremental, stays fast
Nested languages     Fragile, breaks often                  First-class via injection
Extensibility        Vim syntax files (cryptic)             S-expression queries (readable)
Error recovery       None (falls apart on syntax errors)    Built-in (tree-sitter recovers gracefully)

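Tip: You can run both simultaneously if needed. vim.treesitter.start() disables regex highlighting for the buffer by default; setting vim.bo.syntax = "on" afterwards re-enables it. Tree-sitter highlighting then takes priority where it applies, and regex highlighting fills in the gaps. Languages without tree-sitter parsers simply keep regex highlighting.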
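
A minimal sketch of running both, assuming a scratch buffer with Lua content (the buffer is created here only for illustration). It starts tree-sitter highlighting and then turns regex syntax back on for the same buffer.

```lua
-- Scratch buffer used only for this example
local buf = vim.api.nvim_create_buf(false, true)
vim.api.nvim_buf_set_lines(buf, 0, -1, false, { "local x = 1" })

-- start() turns regex syntax off for the buffer by default
local ok = pcall(vim.treesitter.start, buf, "lua")

if ok then
  -- Re-enable regex highlighting alongside tree-sitter
  vim.bo[buf].syntax = "on"
end
print(ok, vim.bo[buf].syntax)
```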

Folding Internals

Tree-sitter folding (runtime/lua/vim/treesitter/_fold.lua) computes fold levels from the parse tree:

-- Enable expression folding and use the tree-sitter fold expression
vim.o.foldmethod = "expr"
vim.o.foldexpr = "v:lua.vim.treesitter.foldexpr()"

-- Internally, this:
-- 1. Runs the folds.scm query against the parse tree
-- 2. Records the lines where each @fold capture starts and ends
-- 3. Returns the fold level based on nesting depth

folds.scm defines what’s foldable:

;; runtime/queries/lua/folds.scm (simplified)
[
  (function_declaration)
  (if_statement)
  (for_statement)
  (while_statement)
  (table_constructor)
] @fold
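
To see which nodes a folds query actually captures, the sketch below runs the installed Lua folds query over a source string and lists the captured ranges. The sample source is made up, and the exact set of foldable node types depends on the folds.scm found on the runtimepath.

```lua
local src = table.concat({
  "local function f()",
  "  if true then",
  "    return 1",
  "  end",
  "end",
  "",
}, "\n")

local parser = vim.treesitter.get_string_parser(src, "lua")
local root = parser:parse()[1]:root()

-- query.get() returns nil when no folds.scm exists for the language
local folds = vim.treesitter.query.get("lua", "folds")
local found = {}
if folds then
  for id, node in folds:iter_captures(root, src) do
    if folds.captures[id] == "fold" then
      local start_row, _, end_row = node:range()
      found[#found + 1] = { type = node:type(), start_row = start_row, end_row = end_row }
    end
  end
end
print(vim.inspect(found))
```

Each entry is a candidate fold; nesting one captured range inside another is what raises the fold level.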

Performance Considerations

Tree-sitter is fast, but there are limits:

  1. Initial parse: The first parse of a large file takes time. Subsequent parses are incremental and fast.
  2. Injection overhead: Each injected language gets its own child LanguageTree, and every injected region is parsed separately. A Markdown file with 50 code blocks has 50 regions to parse, spread across one child per language.
  3. Query complexity: Complex queries with many predicates are slower. Highlight queries run only over the range being redrawn.
  4. Memory: Each parsed tree lives in memory. Very large files (100K+ lines) use significant memory.

Gotcha: If NeoVim feels slow in a specific file, check :InspectTree to see if there are excessive injected languages or an unusually deep parse tree.
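
For a rough number to go with :InspectTree, the sketch below times a full parse and counts trees per language. ts_stats is a hypothetical helper written for this example, not part of NeoVim's API. If the buffer was already parsed, the timing reflects only incremental work.

```lua
-- Hypothetical helper: time a full parse and count trees per language
local function ts_stats(bufnr, lang)
  local ok, parser = pcall(vim.treesitter.get_parser, bufnr, lang)
  if not ok or not parser then
    return nil -- no parser available for this buffer
  end

  local uv = vim.uv or vim.loop -- vim.loop on older releases
  local t0 = uv.hrtime()
  parser:parse(true) -- include injected languages
  local elapsed_ms = (uv.hrtime() - t0) / 1e6

  local trees = {}
  parser:for_each_tree(function(_, ltree)
    local l = ltree:lang()
    trees[l] = (trees[l] or 0) + 1
  end)
  return { elapsed_ms = elapsed_ms, trees = trees }
end

print(vim.inspect(ts_stats(0)))
```

A large tree count for a single injected language is a hint that injection overhead, rather than the root parse, is the cost to look at.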