**Status: Incipient Personal Research Project**

This repository hosts experimental code for researching information theory applications in Natural Language Processing (NLP), attempting to bridge the gap between classical statistical methods (Shannon/Markov) and modern neurosymbolic AI.
For now it is still simple scaffolding. The goal is to build a high-performance backend (in Common Lisp) that provides foundational metrics for text analysis. The vision is to use "old-fashioned" (so to speak) classical statistical methods, such as precise entropy, divergence metrics, and several other NLP tools, together with knowledge from causality (SCM and others), graph theory, and formal linguistic theories, as pipes connecting locally running small LLMs (sLLMs).
By integrating these approaches we hope to steer LLM generation, for example through hallucination detection, and to support knowledge extraction from texts.
This is a living document of the research trajectory.
- Shannon Entropy: Measure raw information content of text sequences.
- Markov Entropy: Measure predictability of text given order-$N$ history.
- Kullback-Leibler Divergence: Quantify "surprise" or information loss when approximating one text distribution with another.
- High-Performance Backend: Implement core logic in Common Lisp (SBCL) with an HTTP API.
- Web Interface: Build a modern dashboard using Next.js + ShadcnUI.
- Interactive Playground: Real-time entropy calculation as the user types.
- Visualizations:
  - Heatmaps: Color-code words based on their "surprise" (entropy contribution).
  - Transition Graphs: Visualize Markov chains using generic-graph-adapter/Mermaid.
  - Divergence Plot: Compare two texts side-by-side visually (KL-Divergence).
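
The three core metrics listed above (Shannon entropy, Markov entropy, and KL-Divergence) can be illustrated with a minimal sketch. It is written in Python for brevity and is not the repository's Common Lisp implementation. The function names, the character-level tokenization, and the additive smoothing parameter `alpha` are all assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Order-0 entropy in bits per character, from empirical frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def markov_entropy(text: str, order: int = 1) -> float:
    """Conditional entropy H(next char | previous `order` chars), in bits."""
    if order < 1 or len(text) <= order:
        return 0.0
    # Count every (context, next character) pair observed in the text.
    pair_counts = Counter(
        (text[i - order:i], text[i]) for i in range(order, len(text))
    )
    context_counts = Counter()
    for (context, _), c in pair_counts.items():
        context_counts[context] += c
    total = sum(pair_counts.values())
    return sum(
        (c / total) * math.log2(context_counts[context] / c)
        for (context, _), c in pair_counts.items()
    )


def kl_divergence(text_p: str, text_q: str, alpha: float = 1.0) -> float:
    """D_KL(P || Q) in bits over characters.

    Additive smoothing (`alpha`, an assumption of this sketch) keeps Q
    non-zero for characters that only appear in P.
    """
    alphabet = set(text_p) | set(text_q)
    if not alphabet:
        return 0.0
    counts_p, counts_q = Counter(text_p), Counter(text_q)
    norm_p = len(text_p) + alpha * len(alphabet)
    norm_q = len(text_q) + alpha * len(alphabet)
    total = 0.0
    for ch in alphabet:
        p = (counts_p[ch] + alpha) / norm_p
        q = (counts_q[ch] + alpha) / norm_q
        total += p * math.log2(p / q)
    return total
```

In "BANANA" every character fully determines its successor, so the order-1 Markov entropy is zero even though the order-0 Shannon entropy is not. This is the kind of gap between raw information content and predictability that the metrics are meant to expose.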
Already under active development in another project, and due to be ported here:
- Dynamic N-gram Analysis: Detection of phrase boundaries using entropy spikes (Perplexity).
- Syntactic Entropy: Measuring entropy over Part-of-Speech tags rather than raw tokens.
- Text Classification: Zero-shot classification using compression-based distances (NCD) or KL-Divergence.
- Local Inference: Integrate bindings for local inference (llama.cpp or similar) to run models like Phi-3, Gemma-2B, or TinyLlama locally.
- Neurosymbolic Grounding: Use the calculated Entropy/KL metrics to "vet" or "rank" generations from the sLLM.
- Constrained Decoding: Use grammar-constrained decoding (GBNF) driven by Lisp logic.
- Proactive Semantic Engine: Middleware layer combining classical semantic embeddings with symbolic Lisp logic.
- Macro-Driven Inference: Utilize Lisp's homoiconicity to automatically expand captured knowledge into executable inference rules via macros.
- Deterministic Grounding: Provide sLLMs with rigid tool calls, persistent memory, and boolean logic validation to prevent hallucinations.
- Knowledge Extraction: Automated conversion of unstructured text into symbolic knowledge graphs (S-expressions).
- Prompt Engineering as Code: Adopt DSPy concepts to optimize prompts programmatically.
- Teleprompter Implementation: Build a Lisp-based optimizer that attempts to "compile" vague intents into optimal prompts by measuring metric improvements.
- Fine-tuning: Fine-tune sLLMs on "high-entropy" synthetic data generated to maximize reasoning capabilities.
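
Of the roadmap items above, compression-based distance (NCD) is the easiest to illustrate in a few lines. The sketch below is in Python rather than the project's Common Lisp and makes several assumptions: it uses `zlib` as the compressor, the helper names are hypothetical, and classification is done by picking the label whose single reference text is nearest. That is a simplification, not necessarily the approach this project will take.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(text: str, references: Mapping[str, str]) -> str:
    """Return the label whose reference text has the smallest NCD to `text`."""
    return min(references, key=lambda label: ncd(text, references[label]))
```

On very short inputs the compressor's fixed overhead dominates the distance, so a sketch like this is only meaningful for texts of at least a few hundred bytes.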
- SBCL (Steel Bank Common Lisp)
- Quicklisp
- libev (via Homebrew/apt)
Start the interactive REPL and load the system:

    ;; In SBCL REPL
    (ql:quickload :foundations-core)
    (foundations:start :port 8080)

Entropy:

    curl -X POST http://127.0.0.1:8080/api/science/entropy \
      -H "Content-Type: application/json" \
      -d '{"text": "BANANA", "order": 1}'

KL-Divergence (Information Loss):

    curl -X POST http://127.0.0.1:8080/api/science/divergence \
      -H "Content-Type: application/json" \
      -d '{"text_p": "PROBABILIDADE", "text_q": "POSSIBILIDADE"}'

This project is licensed under a Dual License model:
- Personal & Non-Commercial Use: Licensed under the MIT License. You are free to use, modify, and distribute this software for personal queries, research, or open-source projects.
- Commercial Use: Once production-ready, this project is intended for commercial use in addition to research. For any commercial application, proprietary software integration, or deployed services generating revenue, a separate Commercial License is required.
Please contact the author for commercial licensing inquiries.