
πŸš€ Exciting News! πŸ†

Finished in the Top 5% of the Rainfall Prediction Kaggle competition! 🌧️

I combined CatBoost, feature engineering, oversampling, and Optuna hyperparameter tuning into a powerful pipeline, and even explored embeddings & clustering for deeper insight.

Write-up coming soon – stay tuned! ⭐


πŸ‘‹ Hello, I’m Autumn. I hold a Master’s in Analytics from Georgia Tech, where I developed deep expertise in data science, machine learning, and strategy. I have a strong curiosity for uncovering patterns in complex data, turning insights into action, and communicating results in a way that drives meaningful, high-value impact.


πŸŽ’ Backpack Price Modeling and Prediction with ML

GitHub Repo

This project focuses on predicting product prices in the Backpack Price Prediction Kaggle competition. Rather than applying a basic regression model, the pipeline leverages feature engineering, real-world intuition, and model optimization to improve predictive accuracy in a noisy commercial dataset.


πŸ”Ή Key Highlights:

  • πŸ“Œ Feature Engineering: Constructed product-specific features such as weight-to-compartment interactions, log transformations for skewed fields, and multi-way categorical combinations (e.g., brand + material + size)
  • πŸ“Œ Modeling: Benchmarked XGBoost, LightGBM, and CatBoost, incorporating Optuna for tuning and a stacked ensemble with Ridge regression for final predictions
  • πŸ“Œ Performance Metrics: Evaluated with RMSE on both local validation and Kaggle leaderboard submissions to track generalization
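The product-specific features above can be sketched in pandas. The column names here are illustrative stand-ins, not the competition's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical backpack rows; column names are made up for illustration.
df = pd.DataFrame({
    "brand": ["Acme", "Trailio", "Acme"],
    "material": ["nylon", "canvas", "nylon"],
    "size": ["M", "L", "S"],
    "weight_kg": [1.2, 2.5, 0.9],
    "compartments": [3, 5, 2],
})

# Interaction term: average weight carried per compartment.
df["weight_per_compartment"] = df["weight_kg"] / df["compartments"]

# Log transform to tame right-skewed numeric fields.
df["log_weight"] = np.log1p(df["weight_kg"])

# Multi-way categorical combination (brand + material + size) as one feature.
df["brand_material_size"] = df["brand"] + "_" + df["material"] + "_" + df["size"]
```

Combined categories like `brand_material_size` let tree models split on whole product configurations that the individual columns cannot express.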


πŸ“Š Technologies Used:

  • Python 🐍 (Pandas, NumPy, Scikit-learn)
  • Winning Model: Stacked Ensemble (XGBoost + LightGBM + CatBoost)
  • Feature Engineering & Preprocessing (One-hot encoding, interaction terms, outlier removal)
  • Hyperparameter Tuning (Optuna, Cross-Validation)
  • GitHub for Version Control πŸ› 
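The stacked-ensemble idea can be sketched with scikit-learn's StackingRegressor. In this sketch, sklearn's own gradient-boosted trees and a random forest stand in for XGBoost / LightGBM / CatBoost, and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for the backpack price table.
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Base learners produce out-of-fold predictions; Ridge blends them.
stack = StackingRegressor(
    estimators=[
        ("gbm_slow", GradientBoostingRegressor(learning_rate=0.05, random_state=0)),
        ("gbm_fast", GradientBoostingRegressor(learning_rate=0.2, random_state=1)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=2)),
    ],
    final_estimator=Ridge(alpha=1.0),
)
stack.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, stack.predict(X_test)))
```

A linear meta-learner such as Ridge is a common choice here because it weights the base models without overfitting to their correlated errors.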

πŸ”¬ Innovative Methods Used:
While many tabular models focus solely on boosting performance, this project highlights the value of domain-aware feature construction and rigorous evaluation across multiple modeling pipelines. A shared preprocessing module ensured fairness across models and streamlined experimentation.
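A shared preprocessing module of this kind might look like the following scikit-learn sketch; the column names and pipeline steps are assumptions for illustration, not the project's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One preprocessing definition shared by every model under comparison.
numeric = ["weight_kg", "compartments"]
categorical = ["brand", "material"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

# Each candidate model gets an identical, independent copy of the transformer,
# so no model benefits from a different preprocessing recipe.
shared = clone(preprocess)
X = pd.DataFrame({
    "weight_kg": [1.2, np.nan, 2.5],
    "compartments": [3, 2, 5],
    "brand": ["Acme", "Trailio", np.nan],
    "material": ["nylon", "canvas", "nylon"],
})
Xt = shared.fit_transform(X)
```

`clone` gives each model a fresh, unfitted copy, which keeps the comparison fair while avoiding fold-to-fold state leakage.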

πŸ”— Check out the full write-up in the repository!


πŸ₯ Predicting Cirrhosis Patient Outcomes with Multi-Class Classification

GitHub Repo

This project focuses on predicting patient outcomes in the Cirrhosis Outcome Prediction Kaggle competition. Instead of applying a basic classification model, I utilized feature engineering, domain knowledge, and model optimization techniques to improve multi-class prediction accuracy.

πŸ”Ή Key Highlights:

  • πŸ“Œ Feature Engineering: Created domain-specific features such as the bilirubin-to-albumin ratio, log transformations for skewed fields, and binary indicators for critical clinical thresholds
  • πŸ“Œ Modeling: Compared XGBoost, LightGBM, and CatBoost, fine-tuning hyperparameters and using stacking ensembles for performance gains
  • πŸ“Œ Performance Metrics: Evaluated using multi-class log loss with cross-validation to ensure model generalization
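The cross-validated multi-class log-loss evaluation can be sketched with scikit-learn; here `GradientBoostingClassifier` stands in for XGBoost / LightGBM / CatBoost, and synthetic three-class data stands in for the patient outcomes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data in place of the cirrhosis outcome labels.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)

# Stratified folds preserve the class balance in every split, which matters
# when one outcome class is much rarer than the others.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    GradientBoostingClassifier(random_state=0), X, y,
    cv=cv, scoring="neg_log_loss",
)
mean_log_loss = -scores.mean()  # lower is better
```

Log loss penalizes confident wrong predictions heavily, so it rewards well-calibrated class probabilities rather than just correct argmax labels.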

πŸ“Š Technologies Used:

  • Python 🐍 (Pandas, NumPy, Scikit-learn)
  • Winning Model: XGBoost
  • Feature Engineering & Data Preprocessing (One-hot encoding, ratio calculations, outlier removal)
  • Hyperparameter Tuning (Randomized Search, Stratified K-Fold Validation)
  • GitHub for Version Control πŸ› 

πŸ”¬ Innovative Methods Used:
Many classification models for medical datasets rely on direct correlations or minimal preprocessing. This project takes a more data-driven and clinical approach, engineering features that reflect real-world liver disease progression. This improves both interpretability and predictive power.

πŸ”— Check out the full write-up in the repository!


🏠 A Different Approach to Feature Engineering for Predicting House Prices in Ames

GitHub Repo

This project is an in-depth analysis of the Ames Housing dataset, where I applied machine learning models to predict house sale prices. Instead of merely running standard models, I leveraged feature engineering, domain knowledge, and advanced model comparison techniques to improve prediction accuracy.

πŸ”Ή Key Highlights:

  • πŸ“Œ Feature Engineering: Grouped related features to enhance predictive power
  • πŸ“Œ Modeling: Compared Decision Trees, Random Forests, Gradient Boosting, and Linear Regression
  • πŸ“Œ Performance Metrics: Evaluated RMSE and R² to measure model effectiveness
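Grouping related features can be sketched in pandas. The porch columns below follow the public Ames dataset's naming convention, but the rows are made up:

```python
import pandas as pd

# Toy rows using Ames-style porch column names.
df = pd.DataFrame({
    "OpenPorchSF": [40, 0],
    "EnclosedPorch": [0, 120],
    "3SsnPorch": [0, 0],
    "ScreenPorch": [60, 0],
})

# Collapse the four sparse porch variants into one grouped feature.
porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
df["TotalPorchSF"] = df[porch_cols].sum(axis=1)
df["HasPorch"] = (df["TotalPorchSF"] > 0).astype(int)
```

Each porch column alone is mostly zeros, so a single combined area plus a has-porch flag gives the model a denser, more interpretable signal.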

πŸ“Š Technologies Used:

  • Python 🐍 (Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn)
  • Machine Learning Models (Linear Regression, Gradient Boosting, Random Forest, Decision Trees)
  • Feature Engineering & Data Preprocessing
  • GitHub for Version Control πŸ› 

πŸ”¬ Innovative Methods Used: Most approaches to this dataset focus on either raw correlations or brute-force feature selection. My approach leverages real-estate knowledge to construct meaningful categories (e.g., grouping porch types, analyzing basement features separately), which improved interpretability and, in some cases, prediction accuracy; the write-up explains where the grouping helped and where it did not.

πŸ”— Check out the full write-up in the repository!


🎯 Medley Relay Optimization

GitHub Repo

This project tackles the challenge of optimizing a medley relay lineup, where swimmers often excel in multiple strokes, creating trade-offs in event selection. Instead of guessing or manually shuffling times, I developed an Excel Solver-based optimization model to automatically determine the fastest possible relay combination.
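A Python equivalent of the Excel Solver model can be sketched as an assignment problem: pick one swimmer per stroke so the summed time is minimal. This sketch uses SciPy's Hungarian-algorithm solver with made-up split times:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical 50m split times in seconds: rows = swimmers, cols = strokes
# (back, breast, fly, free). Several swimmers are strong in multiple strokes,
# which is exactly the trade-off the optimizer resolves.
times = np.array([
    [29.1, 34.0, 28.5, 26.2],
    [30.4, 32.1, 29.9, 27.0],
    [28.7, 35.2, 27.8, 26.5],
    [31.0, 33.5, 30.2, 25.9],
])

# One swimmer per stroke, one stroke per swimmer, minimizing total time.
swimmers, strokes = linear_sum_assignment(times)
total_time = times[swimmers, strokes].sum()
```

Excel Solver reaches the same result with binary assignment variables and row/column sum constraints; `linear_sum_assignment` solves that formulation directly.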

πŸ”— Check out the full write-up in the repository!


πŸš€ ML Decoded

GitHub Repo


Machine learning terms with simple, intuitive explanations.

πŸ”— View the repository: ML Decoded on GitHub


πŸš€ Technical Skills

πŸ€– Machine Learning & Predictive Modeling

  • Developing and optimizing models using:
    • Linear Regression, Decision Trees, Random Forests
    • Gradient Boosting (LightGBM, XGBoost, CatBoost)
    • Support Vector Machines (SVM), Neural Networks (TensorFlow, PyTorch)
    • Clustering (K-Means, DBSCAN), Principal Component Analysis (PCA)

🧠 Data Analysis & Feature Engineering

  • Data wrangling, preprocessing, and feature engineering with:
    • Pandas, NumPy, Scikit-learn, Statsmodels
    • Handling missing values, scaling, encoding categorical variables
    • Engineering domain-specific features to enhance model performance

πŸ“Š Data Visualization & Storytelling

  • Communicating insights using:
    • Matplotlib, Seaborn, Plotly, Tableau
    • Creating interactive and high-impact visualizations for stakeholder engagement

πŸ’Ύ Big Data & Scalable Computing

  • Working with large-scale datasets using:
    • Amazon S3, Google BigQuery, Apache Spark, SQL
    • Optimizing storage and query performance for large datasets

πŸ“ˆ Business Intelligence & Data-Driven Strategy

  • Applying data science for:
    • Forecasting, market analysis, and strategic decision-making
    • Business intelligence tools: Power BI, Looker
    • Automating reporting and dashboarding solutions

πŸŽ“ Education

  • Master of Science in Analytics 🐝
    Georgia Institute of Technology 🌐

  • Web Development Professional Certificate πŸ“œ
    University of California, Davis 🌟

  • Bachelor of Science in Business Finance πŸ’Ή
    California State University, Sacramento 🌳


πŸ“š Bookshelf:


🌐 Let's Connect!

