This project implements a deep learning pipeline for classifying environmental sounds from the ESC-50 dataset. It features a custom convolutional neural network (AutoCNN) built with PyTorch, using residual blocks for improved learning. The solution includes data augmentation, training on GPU using Modal, and a scalable inference API.
ML/Backend:
- PyTorch
- Torchaudio
- Modal (GPU-accelerated training & inference)
- FastAPI (for serving inference endpoints)
Client (optional visualization layer):
- Next.js
- Tailwind CSS
- TypeScript
- Shadcn UI
- Custom CNN architecture with residual connections
- ESC-50 dataset ingestion and preprocessing
- Data augmentation with Mixup and spectrogram masking
- Fully managed training on GPU using Modal
- Inference API with real-time audio classification
- Intermediate feature map visualization for debugging and interpretability
- Example endpoint for testing predictions from WAV files
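The Mixup and spectrogram-masking augmentations can be sketched in plain PyTorch. This is an illustrative sketch, not the repo's exact implementation; the function names, mask sizes, and the Beta `alpha` value are assumptions:

```python
import torch
import torch.nn.functional as F


def mixup(x, y, num_classes, alpha=0.2):
    """Blend each example (and its one-hot label) with a random partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    y_onehot = F.one_hot(y, num_classes).float()
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed


def mask_spectrogram(spec, max_time=20, max_freq=8):
    """Zero out one random time band and one frequency band (SpecAugment-style).

    Expects a (channels, n_mels, n_frames) spectrogram.
    """
    spec = spec.clone()
    _, n_mels, n_frames = spec.shape
    # Time mask
    t = torch.randint(0, max_time + 1, (1,)).item()
    t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
    spec[..., t0:t0 + t] = 0.0
    # Frequency mask
    f = torch.randint(0, max_freq + 1, (1,)).item()
    f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
    spec[:, f0:f0 + f, :] = 0.0
    return spec
```

Mixup keeps label mass conserved (each mixed label row still sums to 1), which pairs naturally with a soft-target cross-entropy loss.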
```
# Modal API Key
NEXT_PUBLIC_MODAL_API=
```

This project uses the ESC-50 dataset, which contains 50 environmental sound categories. The training pipeline automatically downloads and prepares the dataset during Modal app initialization.
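For context, ESC-50 ships a `meta/esc50.csv` index with `filename`, `fold`, `target`, and `category` columns, and its five predefined folds are conventionally used for cross-validation. A minimal parsing sketch (the helper names here are hypothetical, not the repo's API):

```python
import csv


def load_esc50_meta(csv_path):
    """Parse ESC-50's meta/esc50.csv into (filename, fold, target) records."""
    with open(csv_path, newline="") as f:
        return [
            {"filename": r["filename"], "fold": int(r["fold"]), "target": int(r["target"])}
            for r in csv.DictReader(f)
        ]


def fold_split(records, test_fold):
    """Hold out one of ESC-50's five predefined folds for evaluation."""
    train = [r for r in records if r["fold"] != test_fold]
    test = [r for r in records if r["fold"] == test_fold]
    return train, test
```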
```bash
# Clone the repo
git clone https://github.com/JoelDeonDsouza/Auto_CNN.git
cd auto-cnn

# Install dependencies
pip install -r requirements.txt
```

You can launch a training job on Modal with GPU acceleration:

```bash
modal run train.py
```

The trained model will be saved in a Modal-managed volume.
Deploy the inference API:

```bash
modal deploy main.py
```

Test an inference request locally:

```bash
modal run main.py
```

Put your WAV files in the audio-tests/ directory. An example (chirpingBirds.wav) is included for testing.
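If you want to call the deployed endpoint directly instead, a common pattern for audio inference APIs is to send the WAV bytes base64-encoded in a JSON body. The payload schema and field name below are assumptions, not the deployed API's documented contract:

```python
import base64
import json


def build_payload(wav_path):
    """Read a WAV file and wrap it as a base64-encoded JSON payload.

    The "audio_data" field name is a hypothetical schema, not the
    endpoint's confirmed contract.
    """
    with open(wav_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({"audio_data": audio_b64})
```

The resulting string could then be POSTed to the deployed URL with any HTTP client (e.g. `requests.post(url, data=payload, headers={"Content-Type": "application/json"})`).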
- `AutoCNN`: Custom CNN model with residual blocks
- `ESC50Dataset`: PyTorch Dataset class for ESC-50
- `train.py`: Training loop with augmentation, optimizer, and TensorBoard logging
- `main.py`: FastAPI-powered inference endpoint
- `modal`: Manages GPU workloads and deploys endpoints
The AutoCNN model follows a ResNet-inspired structure with four convolutional stages, followed by global pooling and a linear classifier.
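A minimal sketch of that structure in PyTorch, assuming standard pre-activation-free ResNet blocks; the channel widths and stage strides here are illustrative, not the repo's exact configuration:

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection; a 1x1 conv aligns
    channels/stride when the shortcut shape differs."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))


class TinyAutoCNN(nn.Module):
    """Four residual stages -> global average pooling -> linear classifier."""

    def __init__(self, num_classes=50):
        super().__init__()
        self.stages = nn.Sequential(
            ResidualBlock(1, 16),
            ResidualBlock(16, 32, stride=2),
            ResidualBlock(32, 64, stride=2),
            ResidualBlock(64, 128, stride=2),
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):  # x: (batch, 1, n_mels, n_frames)
        feats = self.stages(x)
        pooled = feats.mean(dim=(2, 3))  # global average pool
        return self.head(pooled)
```

Global average pooling makes the classifier head independent of the input spectrogram's time dimension, so clips of different lengths can share one model.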