Here is a summary of my Data Science portfolio. Check out this post for my work and project experience!
Internships
Data Science Intern - Wells Fargo, USA
Jun 2020 - Aug 2020
- Collaborated in a pair to implement industry (NAICS code) classification in Python for over 350,000 business description texts across 1,041 classes.
- Conducted detailed EDA, text cleaning and pre-processing, and experimented with various models such as Naive Bayes, logistic regression, SVM, LightGBM, Random Forest and kNN.
- Built the champion model with 65% CV accuracy (a 20% improvement over the existing model) using BoW and SVM, and constructed an original hierarchical ensemble to speed up inference (30k records in 1 minute).
- Presented the work to about 80 people in a half-hour session and posted a 5-page blog on the company's internal website.
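A minimal sketch of the BoW + SVM approach described above, using scikit-learn; the business descriptions and NAICS codes below are made-up stand-ins for the proprietary data:

```python
# Bag-of-words features feeding a linear SVM, as in the champion model.
# Texts and labels are toy examples, not real Wells Fargo data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "retail sale of groceries and household goods",
    "software development and IT consulting services",
    "commercial construction and building contracting",
    "grocery store selling fresh produce",
    "custom software engineering for enterprises",
    "general contractor for office buildings",
]
labels = ["4451", "5415", "2362", "4451", "5415", "2362"]  # toy NAICS codes

# Unigrams + bigrams give the SVM slightly richer text features.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["retail grocery store"])[0])
```

The real task had 1,041 classes and 350k texts, so the hierarchical ensemble mattered mainly for throughput; the pipeline shape stays the same.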
Data Mining Intern - Bank of Ningbo, China
Jun 2019 - Aug 2019
- Helped build a deposit balance prediction model in SAS for white-collar debit card users.
- Assisted in initiating data analytics and process automation in Python to improve the team's performance and efficiency.
- Collaborated cross-functionally with other teams and departments to provide products and services while performing customer due diligence.
Credit Risk Analytics Intern - China Construction Bank, China
Jun 2018 - Aug 2018
- Helped collect financial statements of local automobile corporations for the year's first risk report.
- Participated in risk analysis and drafted loan investigation and due diligence reports.
- Assisted in loan management and helped evaluate borrowers' credit levels using the bank's internal credit system (mainly focused on financial liquidity).
Course Projects
Kaggle Competition - Airbnb Listing Data Analysis
- Independently conducted data analysis and predictive modelling of Airbnb listing prices in Buenos Aires with Python.
- Attempted various feature engineering techniques (binning, ordinal encoding, one-hot encoding, clustering) and machine learning algorithms, based on detailed data cleaning and EDA.
- Fine-tuned a Random Forest model and a LightGBM model for the final prediction and finished in the top 6% (9/142) on the final leaderboard for categorisation accuracy.
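A sketch of what the encoding plus tree-model pipeline could look like in scikit-learn; the column names and price bands below are invented for illustration, not the actual competition schema:

```python
# One-hot and ordinal encoding feeding a Random Forest, as in the
# Airbnb project. The DataFrame is a tiny synthetic example.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "neighbourhood": ["Palermo", "Recoleta", "Palermo", "Belgrano"],
    "room_type": ["Entire", "Private", "Private", "Entire"],
    "minimum_nights": [2, 1, 3, 2],
    "price_band": ["high", "low", "mid", "high"],  # binned target
})

prep = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["neighbourhood"]),
    ("ordinal", OrdinalEncoder(), ["room_type"]),
], remainder="passthrough")  # numeric columns pass through untouched

model = make_pipeline(prep, RandomForestClassifier(n_estimators=100,
                                                   random_state=0))
model.fit(df.drop(columns="price_band"), df["price_band"])
```

Binning the continuous price into bands is what turns this into the categorisation task scored on the leaderboard.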
Clinical Data Predictive Modeling
- Worked in a team of 4 to analyse a clinical dataset (MIMIC) in Python, predicting patients' discharge location after ICU admission.
- Conducted ETL, EDA, data cleaning, missing data imputation, feature engineering and variable selection, and built various machine learning models on about 60,000 raw records.
- Constructed a visually rich, multi-page dashboard on GCP for data exploration, model evaluation and results demonstration, with interactive prediction powered by a real-time XGBoost model.
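The interactive-prediction idea boils down to fitting a boosted model once and scoring user-entered records on demand. A sketch with synthetic data, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (real MIMIC data cannot be shown):

```python
# Fit once, then score single records as the dashboard form would.
# Features, labels and the decision rule are all synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))              # e.g. age, vitals, lab values
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # 1 = discharged home (toy rule)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

def predict_discharge(features):
    """Score one patient record interactively."""
    proba = model.predict_proba(np.asarray(features).reshape(1, -1))[0, 1]
    return proba

print(round(predict_discharge([1.5, 1.0, 0.0, 0.0, 0.0]), 2))
```

On GCP the same pattern runs behind the dashboard callback: the model stays in memory and each form submission triggers one `predict_proba` call.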
Audio Separation
- Worked in a pair to explore both linear (ICA) and non-linear (CNN, Open-Unmix) techniques for audio separation.
- Prepared the MUSDB18 benchmark dataset with custom pre-processing (chunking, sampling, augmentation, random track mixing and a fixed validation split).
- Selected metrics (SDR, SIR, SAR, ISR, precision and recall) and evaluated the different models' results with SDR.
- Finished a 6-page paper-style report with a detailed literature review and gave a 15-minute presentation to the class.
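For reference, a bare-bones version of the SDR metric used in the final comparison, assuming the estimate is already time-aligned with the reference (the full BSS-eval framework additionally decomposes interference and artifact terms):

```python
# Signal-to-distortion ratio: 10 * log10(||s||^2 / ||s - s_hat||^2).
# Higher is better; a perfect estimate drives the denominator to zero.
import numpy as np

def sdr(reference, estimate, eps=1e-9):
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps  # eps avoids log(0)
    return 10 * np.log10(num / den)

s = np.sin(np.linspace(0, 100, 4410))  # toy "source" signal
print(round(sdr(s, s + 0.01), 1))      # small constant error -> high SDR
```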
Bayesian Algorithm Implementation
- Worked in a pair to implement, in Python, the algorithm from a paper on the Indian Buffet Process prior in an infinite latent feature model.
- Optimised the algorithm through reorganisation, JIT compilation and Cython, and demonstrated its applicability on both simulated and real-world datasets.
- Built a PyPI package for future use and finished a 10-page report with a detailed algorithm explanation and performance evaluation.
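The prior at the heart of that paper can be sampled in a few lines. A compact sketch of drawing a binary feature matrix Z from the Indian Buffet Process: customer i takes each existing dish k with probability m_k / i, then samples Poisson(alpha / i) brand-new dishes:

```python
# Generative sampler for the Indian Buffet Process prior.
# Rows of Z are customers (data points), columns are dishes (latent features).
import numpy as np

def sample_ibp(n_customers, alpha, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    dish_counts = []   # m_k: how many customers took dish k so far
    rows = []
    for i in range(1, n_customers + 1):
        # take existing dish k with probability m_k / i
        row = [int(rng.random() < m / i) for m in dish_counts]
        for k, took in enumerate(row):
            dish_counts[k] += took
        new = rng.poisson(alpha / i)          # new dishes for customer i
        dish_counts.extend([1] * new)
        row.extend([1] * new)
        rows.append(row)
    K = len(dish_counts)
    return np.array([r + [0] * (K - len(r)) for r in rows])

Z = sample_ibp(10, alpha=2.0)
print(Z.shape)  # (10, K) with K random, roughly alpha * H_10 on average
```

The Gibbs sampler in the paper inverts this generative story to infer Z from data; the full implementation (with JIT/Cython speed-ups) lives in the package.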
Predictive Modeling
- Led a team of four to evaluate prices of 18th-century Paris paintings with R.
- Conducted detailed EDA, missing data imputation and feature selection, and attempted multiple ML models such as linear regression, BMA, Ridge/Lasso, CART, boosting and bagging.
- Built a final MARS model with more than 90% accuracy, completed a 19-page final report, and presented the results to 45 people in a 10-minute group presentation with a 5-page slide deck.
Bike Share Prediction
- Worked in a team of four to build a model for Capital Bikeshare that constructs a probability distribution over each user's end docking station.
- Obtained and organised 2013-2017 bike share data from various sources with R, and requested historical weather data via API as a supplementary data source.
- Constructed the probability distribution for 2018 data by computing each end docking station's historical share within a given time window, and used profiling and parallel computation to speed up the model.
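The core of that model is an empirical conditional distribution: for each start station and time window, the share of historical trips ending at each station. A stdlib-only Python sketch (the project itself was in R) with made-up trip records and a hypothetical coarse AM/PM bucket:

```python
# Empirical P(end station | start station, time window) from trip history.
from collections import Counter, defaultdict

trips = [  # (start_station, hour, end_station) -- illustrative data only
    ("31000", 8, "31200"), ("31000", 8, "31200"), ("31000", 9, "31300"),
    ("31000", 17, "31100"), ("31100", 8, "31000"),
]

counts = defaultdict(Counter)
for start, hour, end in trips:
    window = "AM" if hour < 12 else "PM"  # coarse time bucket
    counts[(start, window)][end] += 1

def end_station_distribution(start, window):
    """Normalise historical counts into a probability distribution."""
    c = counts[(start, window)]
    total = sum(c.values())
    return {station: n / total for station, n in c.items()}

print(end_station_distribution("31000", "AM"))
```

Precomputing these tables once is what made profiling and parallelism pay off: scoring a 2018 trip reduces to a dictionary lookup.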
Spatial Data Analysis
- Worked in a team of three to conduct spatial data analysis in R, investigating the competitive relationship between two shop brands, Sheetz and Wawa.
- Used web scraping and API requests to collect data and finished a 7-page analysis with visualisations exploring the geographic overlap between the target brands.
Facial Expression Recognition
- Worked in a group of five to build SVM, CNN and GNN classifiers for facial expression recognition.
- Evaluated model performance on the FER2013 and CK+ datasets and finished an 8-page paper on method construction and model explanation.
R Shiny Dashboard
- Worked individually to implement complex API requests for breaking news headlines and historical articles from over 30,000 news sources.
- Created a well-organised and aesthetically pleasing R Shiny news deck (dashboard) that serves as a central news hub for readers and customers.