ajharris/README.md

About

I build data science and machine learning systems that move cleanly from raw data to defensible insight, with an emphasis on well-motivated problems, reproducible pipelines, and model interpretability.

My work sits at the intersection of applied ML, scientific computing, and software engineering, often using public or operational data to prototype end-to-end analyses that could realistically run in production.


Current Focus & Active Projects

  • Distributed-ML
    A distributed, dataset-agnostic CT preprocessing pipeline using Dask, designed for large clinical imaging datasets and downstream ML workflows.

  • publicdata_ca
    A reusable data acquisition and normalization framework for Canadian public datasets (StatCan, CMHC, CIHI), supporting rapid ML case studies such as housing affordability indices and hospital utilization analysis.

  • Applied ML Case Studies
    Short, tightly scoped projects demonstrating:

    • Unsupervised learning (Isolation Forests, autoencoders)
    • Feature engineering from messy public datasets
    • Evaluation under limited or noisy ground truth
    • Clear motivation and decision-oriented outputs
  • YesChef GPT
    An AI-powered system that structures generative outputs into machine-readable components (ingredients, preparation steps, pickup notes), emphasizing controllability and downstream usability over novelty.
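
The idea behind the YesChef GPT entry, turning free-form generated text into machine-readable components, can be sketched as a schema plus a validation step. This is only an illustrative sketch, not the project's actual code: the `Recipe` class, its field names, the `parse_recipe` helper, and the assumption that the model is asked to reply in JSON are all hypothetical.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class Recipe:
    """Hypothetical schema; fields mirror the components named above."""
    ingredients: list[str]
    preparation_steps: list[str]
    pickup_notes: list[str] = field(default_factory=list)


def _require_str_list(value, key, allow_empty):
    """Raise ValueError unless value is a list of strings."""
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"'{key}' must be a list of strings")
    if not value and not allow_empty:
        raise ValueError(f"'{key}' must not be empty")
    return value


def parse_recipe(raw: str) -> Recipe:
    """Parse a JSON string (assumed model output) into a validated Recipe."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return Recipe(
        ingredients=_require_str_list(
            data.get("ingredients"), "ingredients", allow_empty=False),
        preparation_steps=_require_str_list(
            data.get("preparation_steps"), "preparation_steps", allow_empty=False),
        pickup_notes=_require_str_list(
            data.get("pickup_notes", []), "pickup_notes", allow_empty=True),
    )
```

Rejecting malformed output at this boundary is what makes the result usable downstream: callers only ever see a `Recipe` whose fields have the expected types.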


Background in Scientific Computing

  • C++ medical image registration using ITK (Insight Toolkit)
  • MATLAB pipelines using Marching Cubes for carotid artery tracing in CT angiography
  • Control systems for LED solar simulators supporting photovoltaic research
  • Formal training in medical physics, with strong grounding in measurement, uncertainty, and validation

Currently Exploring

  • Anomaly detection in healthcare operations
    Early detection of unusual demand or utilization patterns using unsupervised and semi-supervised methods.

  • Public-sector ML pipelines
    Designing reusable ingestion and feature pipelines that make public data viable for rapid experimentation.

  • Evaluation without labels
    Practical techniques for validating unsupervised models when ground truth is incomplete or unavailable.

  • Bridging notebooks to systems
    Turning exploratory analyses into maintainable, testable services without losing scientific intent.
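
One common way to evaluate an anomaly detector without labels is to inject synthetic anomalies into real data and measure how many are recovered. The sketch below illustrates that idea with a deliberately simple detector, the median/MAD modified z-score, standing in for the unsupervised models mentioned above. The function names, the injection magnitude, and the 3.5 threshold (a conventional cut-off for this score) are assumptions chosen for illustration.

```python
import random
import statistics


def robust_z_scores(values):
    """Score each value by its distance from the median, in scaled MAD units."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the series is too flat for this score")
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores
    # when the underlying data are roughly normal.
    return [0.6745 * abs(v - median) / mad for v in values]


def injected_recall(values, n_inject=5, magnitude=6.0, threshold=3.5, seed=0):
    """Inject synthetic spikes and return the fraction the detector flags."""
    rng = random.Random(seed)
    spread = statistics.pstdev(values)
    positions = rng.sample(range(len(values)), n_inject)
    corrupted = list(values)
    for i in positions:
        corrupted[i] += magnitude * spread  # additive spike of known size
    scores = robust_z_scores(corrupted)
    flagged = {i for i, s in enumerate(scores) if s > threshold}
    return len(flagged & set(positions)) / n_inject
```

Sweeping `magnitude` downward shows the smallest spike the detector can reliably recover, which gives a usable sensitivity estimate even when no labelled incidents exist. Recall on injected anomalies says nothing about false positives on the original data, so it complements rather than replaces manual review of flagged points.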


Perspective

I approach data science as an engineering discipline:
start with a clear question, respect the data’s limitations, and build models that can be explained, tested, and trusted.

My goal is to work on problems where statistical thinking, ML techniques, and real-world constraints all matter — especially in healthcare, infrastructure, and public data contexts.

Pinned

  1. housing-affordability (Public)

    Housing Affordability Stress Index

    Jupyter Notebook

  2. Distributed-ML (Public)

    Distributed-ML: CT Preprocessing Pipeline with Dask. This repository implements a distributed, dataset-agnostic CT preprocessing pipeline designed for large clinical imaging datasets such as NLST, C…

    Python

  3. knock-em-dead-resume (Public)

    An AI-powered resume builder based on Martin Yate’s Knock ’Em Dead formula. It finds job ads, extracts key skills, and helps craft tailored, achievement-driven resumes optimized for each role, with…

    Python

  4. Andrew-Harris-Tech/EVXchange (Public)

    EVXchange is a web app that lets electric vehicle (EV) owners find and book nearby charging stations hosted by individuals or businesses. It works like AirBnB, but for EV chargers — people can rent…

    Python 1

  5. BJJ-notebook (Public)

    Python

  6. yes-chef-gpt (Public)

    Python