MarketML - AI-Powered Marketing Persona Builder

MarketML is an ML system that builds business personas by aggregating data from multiple sources, analyzing it, and scoring it with predictive models. It is designed for digital marketing agencies serving SMBs (0-1 Cr revenue) and automates persona generation end to end in 20-30 seconds, attaching a confidence score to every persona.

📚 Documentation Hub

New to MarketML? Start here:

Current Status: 95% Complete | 71 files | 8,100+ lines | Ready for testing

🎯 Key Features

  • Multi-Source Data Aggregation: Scrapes LinkedIn, company websites, and news sources in parallel
  • Advanced NLP: SpaCy-based entity extraction with custom models for Indian market
  • Contextual Enrichment: Geographic, industry, and competitive intelligence layering
  • ML-Powered Scoring: Weighted ensemble (XGBoost, Random Forest, linear model) for business maturity, marketing readiness, and budget capacity
  • Template-Based Generation: Produces structured JSON + natural narratives (15-word + full)
  • Quality Validation: Multi-level validation with confidence scoring
  • Progressive UI Feedback: Real-time status updates during 20-30s generation process
  • Async Processing: Celery-based job queue for scalability (500+ personas/day)
  • Incremental Learning: Automatic model retraining from user feedback
  • Production-Ready: Docker, monitoring, CI/CD, Azure deployment configs
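
The parallel aggregation named in the first bullet can be sketched with asyncio. This is a minimal illustration under stated assumptions, not MarketML's scraper code: `fetch_source`, `scrape_all`, the source names, and the concurrency cap are hypothetical, and the network call is simulated with a short sleep. A semaphore caps in-flight requests, which is only a simple stand-in for real per-source rate limiting.

```python
import asyncio

# Hypothetical source names; the real scrapers live under app/scrapers/.
SOURCES = ["linkedin", "website", "news"]


async def fetch_source(source: str, query: str, limiter: asyncio.Semaphore) -> dict:
    """Stand-in for one scraper; the sleep simulates network latency."""
    async with limiter:  # cap how many fetches are in flight at once
        await asyncio.sleep(0.01)
        return {"source": source, "query": query, "ok": True}


async def scrape_all(query: str, max_concurrent: int = 2) -> list[dict]:
    """Run every scraper concurrently and keep only the successful results."""
    limiter = asyncio.Semaphore(max_concurrent)
    tasks = [fetch_source(s, query, limiter) for s in SOURCES]
    # return_exceptions=True returns failures as values, so one failing
    # source does not discard the results from the others.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]


if __name__ == "__main__":
    print(asyncio.run(scrape_all("Noya Furniture Pune")))
```

`asyncio.gather` returns results in the order the tasks were passed, so downstream stages can rely on a stable source order.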

🏗️ Architecture

┌─────────────┐
│  Check-in   │ → User provides name + location
│  Counter    │
└──────┬──────┘
       │
       ▼
┌──────────────────────────────────────────────────────────┐
│           PERSONA BUILDER PIPELINE                       │
│                                                          │
│  1. Scraping (0-10s)                                     │
│     ├── LinkedIn (public profiles)                       │
│     ├── Company Websites                                 │
│     ├── News Articles                                    │
│     └── Parallel execution with rate limiting            │
│                                                          │
│  2. Extraction (10-15s)                                  │
│     ├── SpaCy NER (persons, orgs, locations)             │
│     ├── Contact info extraction                          │
│     └── Structured data parsing                          │
│                                                          │
│  3. Enrichment (15-20s)                                  │
│     ├── Geographic context (affluence, tier, market)    │
│     ├── Industry intelligence                            │
│     ├── Competitive analysis                             │
│     └── 10 temporal attributes                           │
│                                                          │
│  4. Feature Engineering (20-22s)                         │
│     ├── 50+ computed features                            │
│     ├── Digital footprint scoring                        │
│     └── Readiness indicators                             │
│                                                          │
│  5. ML Scoring (22-25s)                                  │
│     ├── XGBoost ensemble (60% weight)                    │
│     ├── Random Forest (30%)                              │
│     ├── Linear model (10%)                               │
│     └── Outputs: maturity, readiness, budget, tier       │
│                                                          │
│  6. Generation (25-28s)                                  │
│     ├── Template-based narrative creation                │
│     ├── Marketing insights generation                    │
│     ├── Recommendations based on tier                    │
│     └── 15-word summary for UI                           │
│                                                          │
│  7. Validation (28-30s)                                  │
│     ├── Completeness checks                              │
│     ├── Consistency validation                           │
│     ├── Confidence calibration                           │
│     └── Quality issue flagging                           │
│                                                          │
└──────────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────┐
│  PostgreSQL  │ → Stores personas, feedback, jobs
│  Redis       │ → Caching + Celery queue
│  Qdrant      │ → Vector search for similar personas
└──────────────┘
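
The ML Scoring stage in the diagram lists three models with 60/30/10 weights. One plausible reading is a weighted average of per-model scores. The sketch below assumes exactly that; `blend_scores`, the model keys, and the sample inputs are illustrative names and values, not taken from the codebase.

```python
# Weights taken from the ML Scoring stage in the architecture diagram.
ENSEMBLE_WEIGHTS = {"xgboost": 0.60, "random_forest": 0.30, "linear": 0.10}


def blend_scores(predictions: dict[str, float]) -> float:
    """Weighted average of per-model scores (each assumed to be on a 0-100 scale)."""
    missing = set(ENSEMBLE_WEIGHTS) - set(predictions)
    if missing:
        raise ValueError(f"missing predictions for: {sorted(missing)}")
    return sum(ENSEMBLE_WEIGHTS[name] * predictions[name] for name in ENSEMBLE_WEIGHTS)


# Hypothetical per-model maturity scores for one business.
maturity = blend_scores({"xgboost": 70.0, "random_forest": 60.0, "linear": 50.0})
print(round(maturity, 1))  # 0.6*70 + 0.3*60 + 0.1*50 = 65.0
```

Because the weights sum to 1.0, the blended score stays on the same scale as the inputs.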

📊 Technology Stack

Backend:

  • Python 3.11
  • FastAPI (async web framework)
  • Celery (distributed task queue)
  • SQLAlchemy (async ORM)
  • Redis (caching + queue)

ML/NLP:

  • SpaCy (NER)
  • Sentence-Transformers (embeddings)
  • XGBoost (scoring models)
  • Scikit-learn (feature engineering)

Data:

  • SQLite (development, upgradeable to PostgreSQL)
  • Qdrant (vector database)
  • Playwright (web scraping)
  • BeautifulSoup4 (HTML parsing)

DevOps:

  • Docker + Docker Compose
  • GitHub Actions (CI/CD)
  • Prometheus + Grafana (monitoring)
  • Azure (deployment target)

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose
  • 8GB+ RAM
  • Git

Local Development

  1. Clone repository
git clone https://github.com/dlai-sd/MarketML.git
cd MarketML
  2. Create environment file
cp .env.example .env
# Edit .env with your configuration
  3. Start with Docker Compose
docker-compose up -d
  4. Access the application (the API examples in this README use http://localhost:8000; Grafana is at http://localhost:3000)

Manual Setup (without Docker)

  1. Create virtual environment
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm
playwright install chromium
  3. Start services
# Terminal 1: Redis
redis-server

# Terminal 2: API
uvicorn app.main:app --reload

# Terminal 3: Celery worker
celery -A app.tasks.celery_app worker --loglevel=info

# Terminal 4: Celery beat (optional, for scheduled tasks)
celery -A app.tasks.celery_app beat --loglevel=info

📖 API Usage

Generate Persona (Async)

curl -X POST "http://localhost:8000/v1/personas/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Yogesh Khandge",
    "location": "Pune, Maharashtra",
    "description": "Furniture business owner"
  }'

Response:

{
  "job_id": "uuid-here",
  "status": "pending",
  "progress": 0,
  "current_step": "Queued for processing",
  "created_at": "2025-12-20T10:00:00Z"
}

Check Job Status

curl "http://localhost:8000/v1/jobs/{job_id}"
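
Since generation is asynchronous, a client typically polls the job endpoint until the job finishes. The following is a minimal polling sketch using only the standard library. The terminal status names (`completed`, `failed`), the polling interval, and the timeout are assumptions; the README only shows the `pending` status, so adjust them to whatever the API actually returns.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000/v1"  # matches the curl examples


def get_job(job_id: str) -> dict:
    """GET /v1/jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = get_job,
    interval: float = 2.0,
    timeout: float = 60.0,
    terminal: tuple[str, ...] = ("completed", "failed"),  # assumed status names
) -> dict:
    """Poll until the job reports a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in terminal:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r} after {timeout}s")
        time.sleep(interval)
```

Passing `fetch` as a parameter keeps the loop testable without a running server: any callable that maps a job ID to a status dict will do.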

Get Generated Persona

curl "http://localhost:8000/v1/personas/{persona_id}"

Response:

{
  "persona_id": "uuid",
  "confidence_score": 0.87,
  "structured": {
    "name": "Yogesh Khandge",
    "title": "Founder",
    "company": "Noya Furniture",
    "location": {
      "city": "Pune",
      "affluence_score": 7.2,
      "market_context": "Tier-1 metropolitan area..."
    },
    "scores": {
      "maturity": 65,
      "marketing_readiness": 72,
      "budget_capacity": 58,
      "recommended_tier": 2
    }
  },
  "short_narrative": "Furniture entrepreneur in Pune, growing business with digital focus",
  "narrative": "Full narrative...",
  "marketing_insights": [...],
  "recommended_actions": [...]
}

🧪 Testing

Run Tests

# All tests
pytest

# With coverage
pytest --cov=app --cov-report=html

# Specific test types
pytest -m unit
pytest -m integration
pytest -m model
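
The `-m` selections above rely on custom markers, which pytest expects to be registered. A possible `conftest.py` sketch is shown below. The repository may register its markers differently (for example in `pytest.ini`), and the marker descriptions here are assumptions.

```python
# conftest.py (sketch): register the markers used by `pytest -m ...`
MARKERS = {
    "unit": "fast tests with no external services",
    "integration": "end-to-end API tests",
    "model": "checks on ML model performance",
}


def pytest_configure(config):
    """pytest hook: declare custom markers so `-m unit` etc. run without warnings."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

A test then opts in with a decorator such as `@pytest.mark.unit`.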

Load Testing

locust -f tests/load/locustfile.py --users 50 --spawn-rate 5

📁 Project Structure

MarketML/
├── app/
│   ├── main.py                    # FastAPI application
│   ├── config.py                  # Configuration management
│   ├── api/v1/                    # API endpoints
│   ├── core/                      # Database, schemas
│   ├── scrapers/                  # Web scrapers
│   ├── extractors/                # Entity extraction
│   ├── enrichment/                # Context enrichment
│   ├── features/                  # Feature engineering
│   ├── scoring/                   # ML models
│   ├── generation/                # Persona generation
│   ├── validation/                # Quality validation
│   └── tasks/                     # Celery tasks
├── data/                          # Enrichment databases
├── models/                        # Trained ML models
├── tests/                         # Test suite
├── frontend/                      # Test UI (TODO)
├── monitoring/                    # Prometheus/Grafana configs
├── docker-compose.yml             # Docker orchestration
├── Dockerfile                     # Container definition
├── requirements.txt               # Python dependencies
└── README.md                      # This file

🎨 Test UI

(Coming soon - Week 2 of development)

Interactive web interface for:

  • Real-time persona generation with progress bar
  • Visual display of structured data
  • Editing and feedback submission
  • A/B testing different generation strategies

πŸ”„ CI/CD Pipeline

GitHub Actions workflow:

  1. Lint & Format: Black, Flake8, MyPy
  2. Unit Tests: pytest with coverage
  3. Integration Tests: End-to-end API tests
  4. Model Tests: Validate ML model performance
  5. Build Docker Image: Multi-stage build
  6. Push to Azure Container Registry
  7. Deploy to Azure App Service

☁️ Azure Deployment

Prerequisites

  • Azure subscription
  • Azure CLI installed
  • Resource group created

Deploy

# Login to Azure
az login

# Create resources (first time only)
./scripts/azure_setup.sh

# Deploy application
./scripts/azure_deploy.sh

See docs/AZURE_DEPLOYMENT.md for detailed instructions.

📈 Monitoring

Prometheus Metrics:

  • personas_generated_total: Total personas generated
  • persona_generation_seconds: Generation latency histogram
  • scrape_duration_seconds: Scraper performance by source
  • avg_confidence_score: Average persona confidence

Grafana Dashboards:

  • System Overview: CPU, memory, request rates
  • Pipeline Performance: Stage-by-stage latency
  • ML Model Metrics: Score distributions, prediction accuracy
  • Business Metrics: Conversion rates, user satisfaction

Access Grafana at http://localhost:3000 (default login: admin/admin)

πŸ”§ Configuration

Key environment variables:

# Application
APP_ENV=development
APP_DEBUG=True
API_V1_PREFIX=/v1

# Database
DATABASE_URL=sqlite:///./marketml.db

# Redis
REDIS_HOST=redis
REDIS_PORT=6379

# Scraping
SCRAPER_TIMEOUT=30
SCRAPER_RATE_LIMIT=1

# LLM (optional fallback)
DEEPSEEK_API_KEY=your-key-here

# Feature Flags
ENABLE_LLM_FALLBACK=true
ENABLE_CACHING=true
ENABLE_MONITORING=true
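
These variables have to be parsed into typed values somewhere in the application. The sketch below shows one way to do that with the standard library. It is not the contents of `app/config.py`; the `Settings` class, `load_settings`, and the set of strings accepted as true are assumptions, and only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'True', 'true', or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    app_env: str
    database_url: str
    redis_host: str
    redis_port: int
    scraper_timeout: int
    enable_llm_fallback: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the README values as defaults."""
    return Settings(
        app_env=env.get("APP_ENV", "development"),
        database_url=env.get("DATABASE_URL", "sqlite:///./marketml.db"),
        redis_host=env.get("REDIS_HOST", "redis"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        scraper_timeout=int(env.get("SCRAPER_TIMEOUT", "30")),
        enable_llm_fallback=_as_bool(env.get("ENABLE_LLM_FALLBACK", "true")),
    )
```

Accepting any mapping rather than reading `os.environ` directly makes the loader easy to exercise in tests, for example `load_settings({"REDIS_PORT": "6380"})`.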

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📝 License

This project is licensed under the MIT License; see the LICENSE file.

🙏 Acknowledgments

  • Test cases: Yogesh Khandge, Noya Furniture, Yashus Digital Marketing
  • Target market: Indian SMBs (0-1Cr revenue)
  • Deployment region: Azure Central India

📞 Support

For questions or issues:

  • Open a GitHub issue
  • Check PROGRESS.md for development updates
  • See docs/ for detailed documentation

Built with ❤️ for MarketML - Empowering Digital Marketing with AI
