MarketML is an ML system that builds business personas by aggregating data from multiple sources, analyzing it, and scoring it with predictive models. Designed for digital marketing agencies serving SMBs (0-1Cr revenue), it automates persona generation with 20-30 second latency while maintaining high accuracy.
New to MarketML? Start here:
- Quick Reference - Get started in 2 minutes
- Status Snapshot - Visual project status dashboard
- Executive Summary - For stakeholders and decision-makers
- Getting Up to Speed - Comprehensive project overview (30 min read)
- Quick Start Guide - Step-by-step setup and testing
- Progress Log - Development history and milestones
Current Status: 95% Complete | 71 files | 8,100+ lines | Ready for testing
- Multi-Source Data Aggregation: Scrapes LinkedIn, company websites, and news sources in parallel
- Advanced NLP: SpaCy-based entity extraction with custom models for the Indian market
- Contextual Enrichment: Geographic, industry, and competitive intelligence layering
- ML-Powered Scoring: XGBoost ensemble for business maturity, marketing readiness, and budget capacity
- Template-Based Generation: Produces structured JSON + natural narratives (15-word + full)
- Quality Validation: Multi-level validation with confidence scoring
- Progressive UI Feedback: Real-time status updates during 20-30s generation process
- Async Processing: Celery-based job queue for scalability (500+ personas/day)
- Incremental Learning: Automatic model retraining from user feedback
- Production-Ready: Docker, monitoring, CI/CD, Azure deployment configs
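The parallel scraping listed above can be pictured with a short asyncio sketch. It is an illustration only, not MarketML's actual scraper code: `scrape_all`, the `Scraper` type, and the `max_concurrent` cap are hypothetical names, and a semaphore stands in for whatever rate limiter the project really uses. A failing source is returned as an exception so the remaining sources still contribute to the persona.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Union

# Hypothetical scraper signature: takes a search query, returns raw text.
Scraper = Callable[[str], Awaitable[str]]


async def scrape_all(
    query: str,
    scrapers: Dict[str, Scraper],
    timeout_s: float = 30.0,  # mirrors SCRAPER_TIMEOUT=30 from the env sample
    max_concurrent: int = 3,  # assumed cap; the real limiter may differ
) -> Dict[str, Union[str, Exception]]:
    """Run every scraper concurrently; return text or the error per source."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(name: str, scraper: Scraper):
        async with semaphore:
            try:
                text = await asyncio.wait_for(scraper(query), timeout=timeout_s)
                return name, text
            except Exception as exc:  # includes timeouts; other sources go on
                return name, exc

    pairs = await asyncio.gather(*(run_one(n, s) for n, s in scrapers.items()))
    return dict(pairs)


async def _demo() -> None:
    # Stand-in scrapers so the sketch runs without network access.
    async def fake_website(q: str) -> str:
        await asyncio.sleep(0.01)
        return f"website:{q}"

    async def fake_news(q: str) -> str:
        raise RuntimeError("news source unavailable")

    results = await scrape_all(
        "Noya Furniture", {"website": fake_website, "news": fake_news}
    )
    for source, value in results.items():
        print(source, "->", value)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Returning errors per source, rather than raising on the first one, lets a later validation step lower confidence instead of aborting the whole job.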
┌─────────────┐
│  Check-in   │ ← User provides name + location
│  Counter    │
└──────┬──────┘
       │
       ▼
┌────────────────────────────────────────────────────────┐
│                PERSONA BUILDER PIPELINE                │
│                                                        │
│  1. Scraping (0-10s)                                   │
│     ├── LinkedIn (public profiles)                     │
│     ├── Company Websites                               │
│     ├── News Articles                                  │
│     └── Parallel execution with rate limiting          │
│                                                        │
│  2. Extraction (10-15s)                                │
│     ├── SpaCy NER (persons, orgs, locations)           │
│     ├── Contact info extraction                        │
│     └── Structured data parsing                        │
│                                                        │
│  3. Enrichment (15-20s)                                │
│     ├── Geographic context (affluence, tier, market)   │
│     ├── Industry intelligence                          │
│     ├── Competitive analysis                           │
│     └── 10 temporal attributes                         │
│                                                        │
│  4. Feature Engineering (20-22s)                       │
│     ├── 50+ computed features                          │
│     ├── Digital footprint scoring                      │
│     └── Readiness indicators                           │
│                                                        │
│  5. ML Scoring (22-25s)                                │
│     ├── XGBoost ensemble (60% weight)                  │
│     ├── Random Forest (30%)                            │
│     ├── Linear model (10%)                             │
│     └── Outputs: maturity, readiness, budget, tier     │
│                                                        │
│  6. Generation (25-28s)                                │
│     ├── Template-based narrative creation              │
│     ├── Marketing insights generation                  │
│     ├── Recommendations based on tier                  │
│     └── 15-word summary for UI                         │
│                                                        │
│  7. Validation (28-30s)                                │
│     ├── Completeness checks                            │
│     ├── Consistency validation                         │
│     ├── Confidence calibration                         │
│     └── Quality issue flagging                         │
│                                                        │
└────────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────┐
│  PostgreSQL  │ ← Stores personas, feedback, jobs
│  Redis       │ ← Caching + Celery queue
│  Qdrant      │ ← Vector search for similar personas
└──────────────┘
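Step 5 of the pipeline lists three models with 60/30/10 weights. One plausible reading is a weighted average of the per-model predictions, sketched below. This is an assumption about how the weights are applied, not the project's confirmed scoring code; `ENSEMBLE_WEIGHTS` and `blend_scores` are hypothetical names.

```python
import math
from typing import Dict, Mapping

# Weights taken from the pipeline diagram; the dict keys are illustrative.
ENSEMBLE_WEIGHTS: Dict[str, float] = {
    "xgboost": 0.6,
    "random_forest": 0.3,
    "linear": 0.1,
}


def blend_scores(
    predictions: Mapping[str, float],
    weights: Mapping[str, float] = ENSEMBLE_WEIGHTS,
) -> float:
    """Return the weighted average of per-model predictions."""
    missing = set(weights) - set(predictions)
    if missing:
        raise ValueError(f"missing predictions for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total keeps the result on the input scale even if the
    # weights are later changed and no longer sum exactly to 1.
    weighted = sum(weights[name] * predictions[name] for name in weights)
    return weighted / total_weight


if __name__ == "__main__":
    # Made-up per-model maturity predictions on a 0-100 scale.
    example = {"xgboost": 80.0, "random_forest": 70.0, "linear": 40.0}
    blended = blend_scores(example)
    assert math.isclose(blended, 73.0)
    print(round(blended, 1))
```

The same function could be called once per output (maturity, readiness, budget) if each model predicts all of them.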
Backend:
- Python 3.11
- FastAPI (async web framework)
- Celery (distributed task queue)
- SQLAlchemy (async ORM)
- Redis (caching + queue)
ML/NLP:
- SpaCy (NER)
- Sentence-Transformers (embeddings)
- XGBoost (scoring models)
- Scikit-learn (feature engineering)
Data:
- SQLite (development, upgradeable to PostgreSQL)
- Qdrant (vector database)
- Playwright (web scraping)
- BeautifulSoup4 (HTML parsing)
DevOps:
- Docker + Docker Compose
- GitHub Actions (CI/CD)
- Prometheus + Grafana (monitoring)
- Azure (deployment target)
- Python 3.11+
- Docker & Docker Compose
- 8GB+ RAM
- Git
- Clone repository
git clone https://github.com/dlai-sd/MarketML.git
cd MarketML
- Create environment file
cp .env.example .env
# Edit .env with your configuration
- Start with Docker Compose
docker-compose up -d
- Access the application
- API: http://localhost:8000/v1/docs
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- Create virtual environment
python3.11 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm
playwright install chromium
- Start services
# Terminal 1: Redis
redis-server
# Terminal 2: API
uvicorn app.main:app --reload
# Terminal 3: Celery worker
celery -A app.tasks.celery_app worker --loglevel=info
# Terminal 4: Celery beat (optional, for scheduled tasks)
celery -A app.tasks.celery_app beat --loglevel=info

curl -X POST "http://localhost:8000/v1/personas/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Yogesh Khandge",
    "location": "Pune, Maharashtra",
    "description": "Furniture business owner"
  }'
Response:
{
  "job_id": "uuid-here",
  "status": "pending",
  "progress": 0,
  "current_step": "Queued for processing",
  "created_at": "2025-12-20T10:00:00Z"
}

curl "http://localhost:8000/v1/jobs/{job_id}"

curl "http://localhost:8000/v1/personas/{persona_id}"
Response:
{
  "persona_id": "uuid",
  "confidence_score": 0.87,
  "structured": {
    "name": "Yogesh Khandge",
    "title": "Founder",
    "company": "Noya Furniture",
    "location": {
      "city": "Pune",
      "affluence_score": 7.2,
      "market_context": "Tier-1 metropolitan area..."
    },
    "scores": {
      "maturity": 65,
      "marketing_readiness": 72,
      "budget_capacity": 58,
      "recommended_tier": 2
    }
  },
  "short_narrative": "Furniture entrepreneur in Pune, growing business with digital focus",
  "narrative": "Full narrative...",
  "marketing_insights": [...],
  "recommended_actions": [...]
}

# All tests
pytest
# With coverage
pytest --cov=app --cov-report=html
# Specific test types
pytest -m unit
pytest -m integration
pytest -m model

locust -f tests/load/locustfile.py --users 50 --spawn-rate 5

MarketML/
├── app/
│   ├── main.py         # FastAPI application
│   ├── config.py       # Configuration management
│   ├── api/v1/         # API endpoints
│   ├── core/           # Database, schemas
│   ├── scrapers/       # Web scrapers
│   ├── extractors/     # Entity extraction
│   ├── enrichment/     # Context enrichment
│   ├── features/       # Feature engineering
│   ├── scoring/        # ML models
│   ├── generation/     # Persona generation
│   ├── validation/     # Quality validation
│   └── tasks/          # Celery tasks
├── data/               # Enrichment databases
├── models/             # Trained ML models
├── tests/              # Test suite
├── frontend/           # Test UI (TODO)
├── monitoring/         # Prometheus/Grafana configs
├── docker-compose.yml  # Docker orchestration
├── Dockerfile          # Container definition
├── requirements.txt    # Python dependencies
└── README.md           # This file
(Coming soon - Week 2 of development)
Interactive web interface for:
- Real-time persona generation with progress bar
- Visual display of structured data
- Editing and feedback submission
- A/B testing different generation strategies
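A progress bar like the one planned here could be driven by polling the `GET /v1/jobs/{job_id}` endpoint until the job finishes. The sketch below is one way a client might do that, using only the standard library. The terminal status values `completed` and `failed` are assumptions (the only status shown in this README is `pending`), and `wait_for_job` and `http_fetcher` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional

# Assumed terminal states; adjust to whatever the API actually returns.
TERMINAL_STATES = {"completed", "failed"}

JobStatus = Dict[str, Any]


def http_fetcher(base_url: str, job_id: str) -> Callable[[], JobStatus]:
    """Build a callable that fetches the job status JSON over HTTP."""
    url = f"{base_url}/v1/jobs/{job_id}"

    def fetch() -> JobStatus:
        with urllib.request.urlopen(url, timeout=10) as response:
            return json.load(response)

    return fetch


def wait_for_job(
    fetch_status: Callable[[], JobStatus],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    on_progress: Optional[Callable[[JobStatus], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> JobStatus:
    """Poll until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if on_progress is not None:
            on_progress(status)  # e.g. update a progress bar
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)


if __name__ == "__main__":
    # Requires a running API; "uuid-here" is the placeholder from the README.
    fetch = http_fetcher("http://localhost:8000", "uuid-here")
    final = wait_for_job(
        fetch,
        on_progress=lambda s: print(s.get("progress"), s.get("current_step")),
    )
    print(final.get("status"))
```

Passing the fetcher in as a callable keeps the polling loop testable without a running server.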
GitHub Actions workflow:
- Lint & Format: Black, Flake8, MyPy
- Unit Tests: pytest with coverage
- Integration Tests: End-to-end API tests
- Model Tests: Validate ML model performance
- Build Docker Image: Multi-stage build
- Push to Azure Container Registry
- Deploy to Azure App Service
- Azure subscription
- Azure CLI installed
- Resource group created
# Login to Azure
az login
# Create resources (first time only)
./scripts/azure_setup.sh
# Deploy application
./scripts/azure_deploy.sh
See docs/AZURE_DEPLOYMENT.md for detailed instructions.
Prometheus Metrics:
- personas_generated_total: Total personas generated
- persona_generation_seconds: Generation latency histogram
- scrape_duration_seconds: Scraper performance by source
- avg_confidence_score: Average persona confidence
Grafana Dashboards:
- System Overview: CPU, memory, request rates
- Pipeline Performance: Stage-by-stage latency
- ML Model Metrics: Score distributions, prediction accuracy
- Business Metrics: Conversion rates, user satisfaction
Access Grafana at http://localhost:3000 (admin/admin)
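Stage-by-stage latency, as tracked on the Pipeline Performance dashboard, can be measured with a small timer like the one below. This is a standard-library sketch rather than the project's monitoring code: `StageTimer` and `STAGE_BUDGETS_S` are hypothetical names, the budgets are derived from the time ranges in the pipeline diagram, and a real deployment would presumably export these values through a Prometheus client instead of keeping them in a dict.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator

# Per-stage budgets in seconds, derived from the pipeline diagram ranges
# (e.g. Scraping 0-10s gives 10, Extraction 10-15s gives 5).
STAGE_BUDGETS_S: Dict[str, float] = {
    "scraping": 10.0,
    "extraction": 5.0,
    "enrichment": 5.0,
    "feature_engineering": 2.0,
    "ml_scoring": 3.0,
    "generation": 3.0,
    "validation": 2.0,
}


class StageTimer:
    """Accumulate wall-clock time per named pipeline stage."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures are still timed.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self) -> float:
        return sum(self.durations.values())

    def over_budget(self, budgets: Dict[str, float]) -> Dict[str, float]:
        """Return the stages whose measured time exceeds their budget."""
        return {
            name: spent
            for name, spent in self.durations.items()
            if name in budgets and spent > budgets[name]
        }


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("scraping"):
        time.sleep(0.01)  # stand-in for real work
    with timer.stage("extraction"):
        time.sleep(0.01)
    print(timer.durations)
    print("over budget:", timer.over_budget(STAGE_BUDGETS_S))
```

The injectable `clock` makes the timer easy to test with fixed timestamps.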
Key environment variables:
# Application
APP_ENV=development
APP_DEBUG=True
API_V1_PREFIX=/v1
# Database
DATABASE_URL=sqlite:///./marketml.db
# Redis
REDIS_HOST=redis
REDIS_PORT=6379
# Scraping
SCRAPER_TIMEOUT=30
SCRAPER_RATE_LIMIT=1
# LLM (optional fallback)
DEEPSEEK_API_KEY=your-key-here
# Feature Flags
ENABLE_LLM_FALLBACK=true
ENABLE_CACHING=true
ENABLE_MONITORING=true
- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open Pull Request
This project is licensed under the MIT License - see LICENSE file.
- Test cases: Yogesh Khandge, Noya Furniture, Yashus Digital Marketing
- Target market: Indian SMBs (0-1Cr revenue)
- Deployment region: Azure Central India
For questions or issues:
- Open a GitHub Issue
- Check PROGRESS.md for development updates
- See docs/ for detailed documentation
Built with ❤️ for MarketML - Empowering Digital Marketing with AI