A revolutionary neural architecture using pure TreeFFN Encoder-Decoder for structured reasoning and parallel sequence processing.
- 99% validation accuracy on ARC Prize 2025 dataset with only 3.16M parameters
- 100% token-level accuracy with complete parallel processing
- 36.2 MB model size - extremely deployment-friendly
- No attention mechanisms - pure TreeFFN-based architecture
- Converged in 1500 steps - highly efficient training
TreeGPT introduces a groundbreaking attention-free architecture that uses only Tree Feed-Forward Networks (TreeFFN) in an encoder-decoder configuration. By eliminating attention mechanisms entirely, we achieve:
- Superior performance on structured reasoning tasks
- Complete parallel processing - no sequential bottlenecks
- Stable convergence - more predictable training dynamics
- Reduced complexity - fewer failure modes during training
Input Sequence β Encoder TreeFFN (LβR) β Decoder TreeFFN (RβL) β Output
- Encoder TreeFFN: Left-to-right processing captures forward dependencies
- Decoder TreeFFN: Right-to-left generation enables reverse reasoning
- Adjacent connections: Simple neighbor-to-neighbor graph structure
- Parallel execution: Both encoder and decoder process entire sequences simultaneously
- Python 3.8+
- PyTorch 2.0+
- torch-scatter
git clone https://github.com/lizixi-0x2f/TreeGPT.git
cd TreeGPT
pip install -r requirements.txtpython train_no_attention.pyThis single command will:
- Automatically detect your device (MPS/CUDA/CPU)
- Run the complete training pipeline
- Save the best model with progressive validation
- Generate training history and results
import torch
from src.TreeGPT import TreeGPT
# Initialize pure TreeFFN model
model = TreeGPT(
vocab_size=17, # ARC task vocabulary
d_model=256, # Hidden dimension
n_layers=2, # Number of encoder-decoder layers
max_seq_len=8192, # Maximum sequence length
tree_iterations=2 # TreeFFN iterations per component
)
# Pure parallel processing
input_ids = torch.randint(0, 17, (1, 100)) # Example input
output = model(input_ids) # Complete sequence processed in one pass
print(f"Output shape: {output.shape}")from src.arc_treegpt import ARCTreeGPT, ARCGridTokenizer
# Initialize ARC-specific model (pure TreeFFN)
model = ARCTreeGPT(vocab_size=17, d_model=256, n_layers=2)
tokenizer = ARCGridTokenizer()
# Process ARC grid with parallel inference
grid = [[1, 2], [3, 4]]
sequence = tokenizer.grid_to_sequence(grid)
input_tensor = torch.tensor([sequence])
# Generate solution (no autoregressive generation needed)
solution_grid = model.generate_arc_solution(input_tensor, tokenizer)
print(f"Generated solution: {solution_grid}"){
"final_step": 1500,
"validation_accuracy": 99.0,
"token_accuracy": 99.99,
"model_parameters": "3.16M",
"model_size": "36.2 MB",
"architecture": "Pure TreeFFN Encoder-Decoder"
}Recent evaluation on 5 random ARC samples:
- Token Accuracy: 100% on all samples
- Full Sequence Accuracy: 100% on all samples
- Perfect Predictions: All tested sequences matched targets exactly
- No factual errors detected in model outputs
| Model | Attention | Parameters | Val Accuracy | Training Steps | Architecture |
|---|---|---|---|---|---|
| TreeGPT (Pure) | β | 3.16M | 99.0% | 1500 | TreeFFN Encoder-Decoder |
| TreeGPT (Hybrid) | β | 4.2M+ | 95.0% | 2400+ | Attention + TreeFFN |
| Standard Transformer | β | 3.5M+ | 85.0% | 3000+ | Pure Attention |
class TreeFFNSeq2SeqBlock(nn.Module):
def __init__(self, d_model, tree_iterations=2):
# Encoder: Left-to-right TreeFFN
self.encoder_tree_ffn = TreeFFN(...)
# Decoder: Right-to-left TreeFFN
self.decoder_tree_ffn = TreeFFN(...)
def forward(self, x):
# 1. Encoder processes LβR with adjacent connections
encoder_h = self.encoder_tree_ffn(x, encoder_edges)
# 2. Decoder processes RβL with adjacent connections
decoder_h = self.decoder_tree_ffn(x + encoder_h, decoder_edges)
return x + decoder_h- Pure TreeFFN Processing: No attention mechanisms whatsoever
- Bidirectional Flow: Encoder (LβR) + Decoder (RβL)
- Adjacent Connections: Simple neighbor-to-neighbor edges
- Parallel Execution: Complete sequences processed simultaneously
- Residual Connections: Stable gradient flow throughout network
Our hypothesis for superior performance without attention:
- Reduced Complexity: Fewer components = fewer failure modes
- Enforced Locality: TreeFFN naturally captures spatial relationships
- Parallel Efficiency: No sequential dependencies or bottlenecks
- Stable Dynamics: More predictable convergence patterns
# Complete experiment replication
python train_no_attention.pyExpected Results:
- Convergence within 1500 steps
- Validation accuracy: ~99%
- Model size: 36.2 MB
- Training time: ~10-20 minutes (depending on hardware)
# Load and test best checkpoint
python manual_eval_check.pyThe model saves complete training history to training_history_treeffn_seq2seq.json:
{
"val_full_accs": [0.0, 0.5, 0.79, 0.95, 0.99, 0.99, 0.99, 0.99],
"val_token_accs": [0.998, 0.9997, 0.9999, 0.99998, 0.99999, ...],
"steps": [300, 600, 900, 1200, 1500, 1800, 2100, 2400]
}TreeGPT/
βββ src/ # Core implementation (UPDATED)
β βββ TreeGPT.py # Pure TreeFFN Encoder-Decoder
β βββ TreeFFN.py # Tree Feed-Forward Network
β βββ arc_treegpt.py # ARC task specialization
βββ experiments/ # Training pipeline (UPDATED)
β βββ full_train_eval.py # Pure parallel training
βββ train_no_attention.py # Single-command training script
βββ manual_eval_check.py # Model verification script
βββ best_treeffn_seq2seq.pth # Best trained model
βββ training_history_treeffn_seq2seq.json # Complete training log
βββ letter_to_clem.md # Technical report
βββ README.md # This file (UPDATED)
Note: All attention-related files have been removed from the architecture.
- Simplified architecture approach
- Preliminary results and metrics
- Easy replication instructions
- Areas for further investigation
- Request for feedback from ARC-AGI-2 committee
- No Sequential Dependencies: Complete parallel processing
- Structural Reasoning: TreeFFN captures spatial relationships naturally
- Simplified Training: No complex scheduling or teacher forcing
- Deployment Ready: Compact 36.2 MB model size
- Soft Iterations: Learnable number of reasoning steps
- Graph Processing: Native support for structured data
- Bidirectional Flow: Forward + backward reasoning in parallel
- Edge Projection: Learned transformations for relationships
We welcome contributions to the pure TreeFFN architecture!
git clone https://github.com/lizixi-0x2f/TreeGPT.git
cd TreeGPT
pip install -e .# Test the architecture
python test_simple_training.py
# Manual evaluation
python manual_eval_check.pyIf you use TreeGPT in your research, please cite:
@article{li2025treegpt,
title={TreeGPT: Pure TreeFFN Encoder-Decoder Architecture for Structured Reasoning},
author={Li, Zixi},
year={2025},
note={Pure TreeFFN implementation - no attention mechanisms}
}- Author: Zixi Li
- Email: lizx93@mail2.sysu.edu.cn
- Institution: Noesis Lab, Sun Yat-sen University
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
β Star this repository if you find the attention-free approach interesting! β
Pure TreeFFN. No Attention. Superior Results.