A fast, lightweight, offline search engine built using pure Python.
It indexes .txt files, performs keyword-based search, highlights matched lines, and provides document statistics, all without external libraries.
This project mimics the core behavior of a mini Google-style search engine:
- Inverted index
- Term-frequency scoring
- Line highlighting
- Document similarity
- Clean modular architecture
- Multi-word query support
- Frequency-based ranking
- Case-insensitive search
Highlights query words inside documents using this format:
>>>python<<< is a powerful programming language
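A minimal sketch of how this highlighting could work (function name and details are illustrative, not necessarily what highlighter.py actually does):

```python
import re

def highlight(line: str, query_words: list[str]) -> str:
    """Wrap each query word found in a line with >>> <<< markers, case-insensitively."""
    for word in query_words:
        # \b restricts the match to whole words only
        pattern = re.compile(rf"\b({re.escape(word)})\b", re.IGNORECASE)
        line = pattern.sub(r">>>\1<<<", line)
    return line

print(highlight("python is a powerful programming language", ["python"]))
# >>>python<<< is a powerful programming language
```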
- Top 10 most frequent words
- Vocabulary size
- Jaccard similarity between two documents
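All three statistics can be computed with nothing but the standard library; here is a rough sketch (function names are assumptions, not necessarily those in stats.py):

```python
from collections import Counter

def top_words(tokens: list[str], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent words and their counts."""
    return Counter(tokens).most_common(n)

def vocabulary_size(tokens: list[str]) -> int:
    """Number of distinct words in a document."""
    return len(set(tokens))

def jaccard_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Set overlap |A ∩ B| / |A ∪ B| between two documents' vocabularies."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```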
Automatically scans the data/ folder and loads all .txt files.
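The loading step could look roughly like this (a sketch that assumes a simple filename-to-text mapping; file_loader.py may differ in detail):

```python
from pathlib import Path

def load_documents(data_dir: str = "data") -> dict[str, str]:
    """Read every .txt file in data_dir and return {filename: text}."""
    documents = {}
    for path in sorted(Path(data_dir).glob("*.txt")):
        documents[path.name] = path.read_text(encoding="utf-8")
    return documents
```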
Each responsibility is handled by a dedicated module:
- file_loader.py
- text_cleaner.py
- indexer.py
- search_engine.py
- highlighter.py
- stats.py
- main.py
document-search-engine/
│
├── data/
│   ├── sample1.txt
│   ├── sample2.txt
│   ├── sample_notes.txt
│   ├── tech_article1.txt
│   ├── tech_article2.txt
│   └── tech_article3.txt
│
├── src/
│   ├── file_loader.py
│   ├── text_cleaner.py
│   ├── indexer.py
│   ├── search_engine.py
│   ├── highlighter.py
│   ├── stats.py
│   └── main.py
│
└── README.md
Make sure Python is installed.
cd document-search-engine
cd src
python main.py
Input:
python data artificial
Output:
Results:
- tech_article1.txt (score: 12)
- sample1.txt (score: 6)
Extract from sample1.txt:
>>>python<<< is a powerful programming language
Top Words:
python → 3
data → 2
computing → 1
Vocabulary Size: 145
Similarity(sample1.txt vs tech_article1.txt): 0.312
If you want to improve the project later:
- Phrase search: "machine learning"
- Search suggestions: spell correction
- Synonym expansion (thesaurus-based)
- Web UI (Flask)
- PDF parsing
Building this project helped me understand and apply several fundamental software engineering and computer science concepts:
Implemented a fast lookup structure that maps words to documents and positions, similar to how real search engines work.
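A rough sketch of what such an inverted index could look like (the exact structure in indexer.py may differ):

```python
from collections import defaultdict

def build_index(documents: dict[str, list[str]]) -> dict[str, dict[str, list[int]]]:
    """Map each word to {filename: [positions where the word occurs]}."""
    index: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for filename, tokens in documents.items():
        for position, word in enumerate(tokens):
            index[word][filename].append(position)
    # Convert nested defaultdicts to plain dicts before returning
    return {word: dict(postings) for word, postings in index.items()}
```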
Practiced lowercasing, punctuation removal, and tokenization to ensure consistent, accurate search results.
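The cleaning step might be as simple as the following sketch (assumed names; text_cleaner.py may handle punctuation differently):

```python
import string

def tokenize(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, and split on whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

print(tokenize("Python is a powerful, PROGRAMMING language!"))
# ['python', 'is', 'a', 'powerful', 'programming', 'language']
```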
Organized the project into clean, single-purpose components such as indexing, searching, highlighting, and analytics.
Used frequency-based scoring to determine document relevance for any given query, a simplified version of real IR systems.
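In outline, the scoring can sum the term frequencies of every query word per document and rank by the total; this sketch assumes the index shape shown above and may not match search_engine.py exactly:

```python
def search(index: dict[str, dict[str, list[int]]], query: str) -> list[tuple[str, int]]:
    """Score each document by total occurrences of the query words, highest first."""
    scores: dict[str, int] = {}
    for word in query.lower().split():
        for filename, positions in index.get(word, {}).items():
            scores[filename] = scores.get(filename, 0) + len(positions)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```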
Built logic to extract matching lines and highlight keywords for improved readability and usability.
Learned how to implement word frequency counters, vocabulary measurements, and Jaccard similarity across documents.
Designed an intuitive, menu-driven command-line interface that ties all features together.
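The menu loop in main.py might be wired up along these lines (a simplified sketch with placeholder branches, not the actual implementation):

```python
def main() -> None:
    while True:
        print("\n1) Search  2) Document statistics  3) Exit")
        choice = input("Choose an option: ").strip()
        if choice == "1":
            query = input("Enter query: ")
            print(f"(run search and highlighting for: {query})")  # placeholder
        elif choice == "2":
            print("(show top words, vocabulary size, similarity)")  # placeholder
        elif choice == "3":
            break
        else:
            print("Invalid option, please try again.")

if __name__ == "__main__":
    main()
```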
Improved my development workflow by initializing a Git repository, committing changes, and pushing the project to GitHub in a clean, organized format.
Built entirely using pure Python as a learning + portfolio project.
