A lightweight offline search engine built using pure Python. It indexes text documents, performs fast keyword search, highlights matches, and provides document insights like frequency analysis and similarity scores.

SujethaJanet-2004/document-search-engine

📁 Offline Document Search Engine (Python)

A fast, lightweight, offline search engine built using pure Python.
It indexes .txt files, performs keyword-based search, highlights matched lines, and provides document statistics, all without external libraries.

This project implements the core building blocks of a small search engine:

  • Inverted index
  • Term-frequency scoring
  • Line highlighting
  • Document similarity
  • Clean modular architecture

🚀 Features

🔍 1. Fast Keyword Search

  • Multi-word query support
  • Frequency-based ranking
  • Case-insensitive search
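
A minimal sketch of how the three points above could fit together. The names `tokenize` and `search` and the summed-count score are assumptions made for illustration, not the actual contents of search_engine.py:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, documents):
    """Rank documents by how often the query words occur in them.

    `documents` maps a file name to its raw text. The score is the summed
    count of every distinct query word (an assumed scoring rule).
    """
    query_words = set(tokenize(query))
    results = []
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[word] for word in query_words)
        if score > 0:
            results.append((name, score))
    # Highest score first; ties are broken alphabetically for stable output.
    results.sort(key=lambda item: (-item[1], item[0]))
    return results


docs = {
    "a.txt": "Python is great. Python handles data well.",
    "b.txt": "Data pipelines move data around.",
    "c.txt": "Nothing relevant here.",
}
print(search("PYTHON data", docs))  # [('a.txt', 3), ('b.txt', 2)]
```

Lowercasing both the query and the documents is what makes the search case-insensitive; documents with a score of zero are dropped from the results.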

✨ 2. Highlighted Matches

Highlights query words inside documents using this format:

>>>python<<< is a powerful programming language

📊 3. Document Statistics

  • Top 10 most frequent words
  • Vocabulary size
  • Jaccard similarity between two documents
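
The first two statistics can be computed with `collections.Counter` from the standard library. This is a sketch under the assumption that words are lowercase alphanumeric tokens; the function names are hypothetical and may not match stats.py:

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphanumeric words in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def top_words(text, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return word_counts(text).most_common(n)


def vocabulary_size(text):
    """Return the number of distinct words in the text."""
    return len(word_counts(text))


sample = "Python data python computing data PYTHON"
print(top_words(sample))        # [('python', 3), ('data', 2), ('computing', 1)]
print(vocabulary_size(sample))  # 3
```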

📁 4. Automatic File Loading

Automatically scans the data/ folder and loads all .txt files.
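
One way such a loader could be written with `pathlib`. The function name `load_documents`, the UTF-8 encoding, and the non-recursive scan are assumptions, not details confirmed by file_loader.py:

```python
from pathlib import Path


def load_documents(folder="data"):
    """Return {file name: text} for every .txt file directly inside folder."""
    root = Path(folder)
    if not root.is_dir():
        return {}  # a missing folder yields no documents instead of an error
    documents = {}
    for path in sorted(root.glob("*.txt")):
        # errors="replace" keeps loading even if a file has stray bytes.
        documents[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return documents
```

Because the run instructions start main.py from inside src/, the folder argument would probably need to be `../data` there; that path is a guess based on the project layout.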

🧱 5. Clean Modular Code

Each responsibility is handled by a dedicated module:

  • file_loader.py
  • text_cleaner.py
  • indexer.py
  • search_engine.py
  • highlighter.py
  • stats.py
  • main.py

🗂 Project Structure

document-search-engine/
│
├── data/
│   ├── sample1.txt
│   ├── sample2.txt
│   ├── sample_notes.txt
│   ├── tech_article1.txt
│   ├── tech_article2.txt
│   └── tech_article3.txt
│
├── src/
│   ├── file_loader.py
│   ├── text_cleaner.py
│   ├── indexer.py
│   ├── search_engine.py
│   ├── highlighter.py
│   ├── stats.py
│   └── main.py
│
└── README.md

🖥️ How to Run

1️⃣ Install Python 3.8+

Make sure Python 3.8 or newer is installed and available on your PATH.

2️⃣ Open the project folder

cd document-search-engine

3️⃣ Run the main program

cd src
python main.py

🧪 Examples

🔍 Search Example

Input:

python data artificial

Output:

Results:
- tech_article1.txt (score: 12)
- sample1.txt (score: 6)

✨ Highlight Example

Extract from sample1.txt:

>>>python<<< is a powerful programming language

📊 Stats Example

Top Words:
python – 3
data – 2
computing – 1

Vocabulary Size: 145

🔗 Similarity Example

Similarity(sample1.txt vs tech_article1.txt): 0.312

🛠 Future Enhancements (Optional)

If you want to improve the project later:

  • Phrase search: "machine learning"
  • Search suggestions and spell correction
  • Synonym expansion (thesaurus-based)
  • Web UI (Flask)
  • PDF parsing

📘 What I Learned

Building this project helped me understand and apply several fundamental software engineering and computer science concepts:

🔹 Inverted Indexing

Implemented a fast lookup structure that maps words to documents and positions, similar to how real search engines work.
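
A small sketch of such a structure, assuming "positions" means token offsets within each document. The name `build_index` and the nested-dictionary layout are illustrative choices and may differ from indexer.py:

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to {document name: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word][name].append(position)
    return index


docs = {"a.txt": "Python loves data", "b.txt": "Data about data"}
index = build_index(docs)
print(dict(index["data"]))  # {'a.txt': [2], 'b.txt': [0, 2]}
```

Looking up a word is a single dictionary access, so query time depends on the number of query words rather than on the total size of the document collection.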

🔹 Text Processing & Normalization

Practiced lowercasing, punctuation removal, and tokenization to ensure consistent, accurate search results.
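
Those three steps can be sketched with nothing but `str` methods. The function name `clean` and the choice to replace ASCII punctuation with spaces (rather than delete it) are assumptions, not necessarily what text_cleaner.py does:

```python
import string

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean(text):
    """Lowercase, replace punctuation with spaces, then split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


print(clean("Hello, World! It's Python-3."))
# ['hello', 'world', 'it', 's', 'python', '3']
```

Replacing punctuation with spaces keeps hyphenated terms searchable by either half, at the cost of splitting contractions such as "it's" into two tokens.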

🔹 Modular System Design

Organized the project into clean, single-purpose components such as indexing, searching, highlighting, and analytics.

🔹 Ranking & Search Scoring

Used frequency-based scoring to determine document relevance for any given query, a simplified version of what real information retrieval (IR) systems do.

🔹 Highlight Extraction

Built logic to extract matching lines and highlight keywords for improved readability and usability.
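
A sketch of what that logic might look like using the `re` module. The `>>>word<<<` markers follow the format shown in this README; the function names and the whole-word matching rule are assumptions:

```python
import re


def highlight(line, query_words):
    """Wrap each whole-word, case-insensitive match in >>> and <<<."""
    words = [re.escape(word) for word in query_words if word]
    if not words:
        return line
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    # The matched text keeps its original casing inside the markers.
    return pattern.sub(lambda match: ">>>" + match.group(0) + "<<<", line)


def matching_lines(text, query_words):
    """Return only the lines that contain a query word, already highlighted."""
    marked_lines = []
    for line in text.splitlines():
        marked = highlight(line, query_words)
        if marked != line:
            marked_lines.append(marked)
    return marked_lines


print(highlight("python is a powerful programming language", ["Python"]))
# >>>python<<< is a powerful programming language
```

The `\b` word boundaries stop a query for "python" from marking part of a longer word such as "pythonic".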

🔹 Document Analytics

Learned how to implement word frequency counters, vocabulary measurements, and Jaccard similarity across documents.
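
For reference, the Jaccard similarity of two documents is the size of the intersection of their word sets divided by the size of their union. A minimal sketch, where the function name and the tokenization rule are assumptions:

```python
import re


def jaccard(text_a, text_b):
    """Return |A ∩ B| / |A ∪ B| over the two documents' word sets."""
    words_a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    union = words_a | words_b
    if not union:
        return 0.0  # two empty documents: define the similarity as 0
    return len(words_a & words_b) / len(union)


print(jaccard("python data science", "python data engineering"))  # 0.5
```

In this example the documents share 2 of 4 distinct words, so the score is 0.5. Scores range from 0.0 (no shared words) to 1.0 (identical vocabularies), and word frequency plays no role.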

🔹 CLI Interface Engineering

Designed an intuitive, menu-driven command-line interface that ties all features together.
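
A menu loop of this kind could be sketched as follows. The function `run_menu`, the option numbering, and the messages are hypothetical and are not taken from main.py:

```python
def run_menu(actions, read=input, write=print):
    """Show a numbered menu until the user picks the exit option.

    `actions` is a list of (label, callable) pairs. `read` and `write`
    default to input/print and can be swapped out for testing.
    """
    while True:
        for number, (label, _) in enumerate(actions, start=1):
            write(f"{number}. {label}")
        write("0. Exit")
        choice = read("Choose an option: ").strip()
        if choice == "0":
            break
        if choice.isdecimal() and 1 <= int(choice) <= len(actions):
            actions[int(choice) - 1][1]()  # run the selected action
        else:
            write("Invalid choice, try again.")
```

Passing the actions in as data keeps the loop independent of the feature modules: something like `run_menu([("Search", do_search), ("Stats", show_stats)])` would work, where `do_search` and `show_stats` are placeholder callables.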

🔹 Git & Version Control

Improved workflow by initializing a repo, committing changes, and pushing the project to GitHub in a clean, organized format.

👤 Author

Built entirely using pure Python as a learning and portfolio project.

