TreeGPT: Pure TreeFFN Encoder-Decoder Architecture

A revolutionary neural architecture using pure TreeFFN Encoder-Decoder for structured reasoning and parallel sequence processing.

🏆 Key Achievements

99% validation accuracy on ARC Prize 2025 dataset with only 3.16M parameters
100% token-level accuracy with complete parallel processing
36.2 MB model size - extremely deployment-friendly
No attention mechanisms - pure TreeFFN-based architecture
Converged in 1500 steps - highly efficient training

🚀 Revolutionary Architecture: Pure TreeFFN

TreeGPT introduces a groundbreaking attention-free architecture that uses only Tree Feed-Forward Networks (TreeFFN) in an encoder-decoder configuration. By eliminating attention mechanisms entirely, we achieve:

Superior performance on structured reasoning tasks
Complete parallel processing - no sequential bottlenecks
Stable convergence - more predictable training dynamics
Reduced complexity - fewer failure modes during training

Core Innovation: Bidirectional TreeFFN Processing

Input Sequence → Encoder TreeFFN (L→R) → Decoder TreeFFN (R←L) → Output

Encoder TreeFFN: Left-to-right processing captures forward dependencies
Decoder TreeFFN: Right-to-left generation enables reverse reasoning
Adjacent connections: Simple neighbor-to-neighbor graph structure
Parallel execution: Both encoder and decoder process entire sequences simultaneously

🛠️ Installation

Prerequisites

Python 3.8+
PyTorch 2.0+
torch-scatter

Quick Setup

git clone https://github.com/lizixi-0x2f/TreeGPT.git
cd TreeGPT
pip install -r requirements.txt

🎯 Quick Start

One-Command Training

python train_no_attention.py

This single command will:

Automatically detect your device (MPS/CUDA/CPU)
Run the complete training pipeline
Save the best model with progressive validation
Generate training history and results

Basic TreeGPT Usage

import torch
from src.TreeGPT import TreeGPT

# Initialize pure TreeFFN model
model = TreeGPT(
    vocab_size=17,          # ARC task vocabulary
    d_model=256,            # Hidden dimension  
    n_layers=2,             # Number of encoder-decoder layers
    max_seq_len=8192,       # Maximum sequence length
    tree_iterations=2       # TreeFFN iterations per component
)

# Pure parallel processing
input_ids = torch.randint(0, 17, (1, 100))  # Example input
output = model(input_ids)  # Complete sequence processed in one pass
print(f"Output shape: {output.shape}")

ARC Task Processing

from src.arc_treegpt import ARCTreeGPT, ARCGridTokenizer

# Initialize ARC-specific model (pure TreeFFN)
model = ARCTreeGPT(vocab_size=17, d_model=256, n_layers=2)
tokenizer = ARCGridTokenizer()

# Process ARC grid with parallel inference
grid = [[1, 2], [3, 4]]
sequence = tokenizer.grid_to_sequence(grid)
input_tensor = torch.tensor([sequence])

# Generate solution (no autoregressive generation needed)
solution_grid = model.generate_arc_solution(input_tensor, tokenizer)
print(f"Generated solution: {solution_grid}")

📊 Breakthrough Results

Training Performance

{
  "final_step": 1500,
  "validation_accuracy": 99.0,
  "token_accuracy": 99.99,
  "model_parameters": "3.16M", 
  "model_size": "36.2 MB",
  "architecture": "Pure TreeFFN Encoder-Decoder"
}

Manual Evaluation Results

Recent evaluation on 5 random ARC samples:

Token Accuracy: 100% on all samples
Full Sequence Accuracy: 100% on all samples
Perfect Predictions: All tested sequences matched targets exactly
No factual errors detected in model outputs

Architecture Comparison

Model	Attention	Parameters	Val Accuracy	Training Steps	Architecture
TreeGPT (Pure)	❌	3.16M	99.0%	1500	TreeFFN Encoder-Decoder
TreeGPT (Hybrid)	✅	4.2M+	95.0%	2400+	Attention + TreeFFN
Standard Transformer	✅	3.5M+	85.0%	3000+	Pure Attention

🏗️ Pure TreeFFN Architecture

Encoder-Decoder Design

class TreeFFNSeq2SeqBlock(nn.Module):
    def __init__(self, d_model, tree_iterations=2):
        # Encoder: Left-to-right TreeFFN
        self.encoder_tree_ffn = TreeFFN(...)
        
        # Decoder: Right-to-left TreeFFN  
        self.decoder_tree_ffn = TreeFFN(...)
        
    def forward(self, x):
        # 1. Encoder processes L→R with adjacent connections
        encoder_h = self.encoder_tree_ffn(x, encoder_edges)
        
        # 2. Decoder processes R←L with adjacent connections  
        decoder_h = self.decoder_tree_ffn(x + encoder_h, decoder_edges)
        
        return x + decoder_h

Key Components

Pure TreeFFN Processing: No attention mechanisms whatsoever
Bidirectional Flow: Encoder (L→R) + Decoder (R←L)
Adjacent Connections: Simple neighbor-to-neighbor edges
Parallel Execution: Complete sequences processed simultaneously
Residual Connections: Stable gradient flow throughout network

Why No Attention Works Better

Our hypothesis for superior performance without attention:

Reduced Complexity: Fewer components = fewer failure modes
Enforced Locality: TreeFFN naturally captures spatial relationships
Parallel Efficiency: No sequential dependencies or bottlenecks
Stable Dynamics: More predictable convergence patterns

🧪 Experiments & Replication

Single Command Replication

# Complete experiment replication
python train_no_attention.py

Expected Results:

Convergence within 1500 steps
Validation accuracy: ~99%
Model size: 36.2 MB
Training time: ~10-20 minutes (depending on hardware)

Manual Model Verification

# Load and test best checkpoint
python manual_eval_check.py

Training History Analysis

The model saves complete training history to training_history_treeffn_seq2seq.json:

{
  "val_full_accs": [0.0, 0.5, 0.79, 0.95, 0.99, 0.99, 0.99, 0.99],
  "val_token_accs": [0.998, 0.9997, 0.9999, 0.99998, 0.99999, ...],
  "steps": [300, 600, 900, 1200, 1500, 1800, 2100, 2400]
}

📁 Project Structure

TreeGPT/
├── src/                              # Core implementation (UPDATED)
│   ├── TreeGPT.py                   # Pure TreeFFN Encoder-Decoder
│   ├── TreeFFN.py                   # Tree Feed-Forward Network  
│   └── arc_treegpt.py               # ARC task specialization
├── experiments/                      # Training pipeline (UPDATED)
│   └── full_train_eval.py           # Pure parallel training
├── train_no_attention.py            # Single-command training script
├── manual_eval_check.py             # Model verification script
├── best_treeffn_seq2seq.pth         # Best trained model
├── training_history_treeffn_seq2seq.json  # Complete training log
├── letter_to_clem.md                # Technical report
└── README.md                        # This file (UPDATED)

Note: All attention-related files have been removed from the architecture.

🔬 Technical Report

Simplified architecture approach
Preliminary results and metrics
Easy replication instructions
Areas for further investigation
Request for feedback from ARC-AGI-2 committee

🎯 What Makes This Different

Paradigm Shift: Beyond Attention

No Sequential Dependencies: Complete parallel processing
Structural Reasoning: TreeFFN captures spatial relationships naturally
Simplified Training: No complex scheduling or teacher forcing
Deployment Ready: Compact 36.2 MB model size

TreeFFN Advantages

Soft Iterations: Learnable number of reasoning steps
Graph Processing: Native support for structured data
Bidirectional Flow: Forward + backward reasoning in parallel
Edge Projection: Learned transformations for relationships

🤝 Contributing

We welcome contributions to the pure TreeFFN architecture!

Development Setup

git clone https://github.com/lizixi-0x2f/TreeGPT.git  
cd TreeGPT
pip install -e .

Running Tests

# Test the architecture
python test_simple_training.py

# Manual evaluation
python manual_eval_check.py

📄 Citation

If you use TreeGPT in your research, please cite:

@article{li2025treegpt,
  title={TreeGPT: Pure TreeFFN Encoder-Decoder Architecture for Structured Reasoning},
  author={Li, Zixi},
  year={2025},
  note={Pure TreeFFN implementation - no attention mechanisms}
}

📞 Contact

Author: Zixi Li
Email: lizx93@mail2.sysu.edu.cn
Institution: Noesis Lab, Sun Yat-sen University

📜 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

⭐ Star this repository if you find the attention-free approach interesting! ⭐

Pure TreeFFN. No Attention. Superior Results.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
arc-prize-2025		arc-prize-2025
experiments		experiments
figures		figures
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
best_treeffn_seq2seq.pth		best_treeffn_seq2seq.pth
manual_eval_check.py		manual_eval_check.py
requirements.txt		requirements.txt
setup.py		setup.py
train_no_attention.py		train_no_attention.py
training_history_treeffn_seq2seq.json		training_history_treeffn_seq2seq.json

Folders and files

Latest commit

History

Repository files navigation

TreeGPT: Pure TreeFFN Encoder-Decoder Architecture

🏆 Key Achievements

🚀 Revolutionary Architecture: Pure TreeFFN

Core Innovation: Bidirectional TreeFFN Processing

🛠️ Installation

Prerequisites

Quick Setup

🎯 Quick Start

One-Command Training

Basic TreeGPT Usage

ARC Task Processing

📊 Breakthrough Results

Training Performance

Manual Evaluation Results

Architecture Comparison

🏗️ Pure TreeFFN Architecture

Encoder-Decoder Design

Key Components

Why No Attention Works Better

🧪 Experiments & Replication

Single Command Replication

Manual Model Verification

Training History Analysis

📁 Project Structure

🔬 Technical Report

🎯 What Makes This Different

Paradigm Shift: Beyond Attention

TreeFFN Advantages

🤝 Contributing

Development Setup

Running Tests

📄 Citation

📞 Contact

📜 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages