Here is a summary of my Data Science portfolio. Check this post for my work and project experience!

Internships

Data Science Intern - Wells Fargo, USA

Jun 2020 - Aug 2020

  • Collaborated in a pair to implement industry (NAICS code) classification in Python for over 350,000 business description texts spanning 1,041 classes.
  • Conducted detailed EDA, text cleaning and pre-processing, and experimented with various models such as Naive Bayes, logistic regression, SVM, LightGBM, Random Forest and kNN.
  • Built the champion model with 65% cross-validation accuracy (a 20% improvement over the existing model) using BoW features and an SVM, and constructed an original hierarchical ensemble to speed up inference (30k records in 1 minute); see the sketch after this list.
  • Presented the work to about 80 people in a half-hour session and posted a 5-page blog on the company's internal website.
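
The internship code is proprietary, so here is only a minimal sketch of the BoW + linear SVM approach behind the champion model, using hypothetical toy data rather than the actual business description corpus:

```python
# Minimal BoW + linear SVM sketch (hypothetical toy data, not the
# proprietary corpus or the actual champion model).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

descriptions = [
    "retail bakery selling bread and pastries",
    "software consulting and custom application development",
    "bakery and coffee shop",
    "enterprise software development services",
]
naics_labels = ["311811", "541511", "311811", "541511"]  # example NAICS codes

# Bag-of-words features feeding a linear SVM classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(descriptions, naics_labels)

print(model.predict(["custom software development firm"]))
```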

Data Mining Intern - Bank of Ningbo, China

Jun 2019 - Aug 2019

  • Helped build a deposit balance prediction model for white-collar debit card users in SAS.
  • Assisted in initiating data analytics and process automation in Python to improve the team's performance and efficiency.
  • Collaborated cross-functionally with other teams and departments to deliver products and services while performing customer due diligence.

Credit Risk Analytics Intern - China Construction Bank, China

Jun 2018 - Aug 2018

  • Helped collect financial statements for local automobile corporations' first-year risk reports.
  • Participated in risk analysis and drafted loan investigation and due diligence reports.
  • Assisted in loan management and helped evaluate borrowers' credit levels using the bank's internal credit system (focused mainly on financial liquidity).


Course Projects

Kaggle Competition - Airbnb Listing Data Analysis

  • Independently conducted data analysis and predictive modelling of Airbnb listing prices in Buenos Aires with Python.
  • Experimented with various feature engineering techniques (binning, ordinal encoding, one-hot encoding, clustering) and machine learning algorithms, building on detailed data cleaning and EDA; a feature-pipeline sketch follows this list.
  • Systematically fine-tuned a Random Forest model and a LightGBM model for the final prediction and placed in the top 6% (9/142) on the final leaderboard for categorisation accuracy.
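
As an illustration of the feature-engineering pipeline described above, a minimal sketch with hypothetical column names (not the actual competition data) could look like this:

```python
# Illustrative feature-engineering + Random Forest pipeline
# (hypothetical toy data, not the Kaggle Airbnb dataset).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "neighbourhood": ["Palermo", "Recoleta", "Palermo", "San Telmo"],
    "room_type": ["Entire home", "Private room", "Private room", "Entire home"],
    "minimum_nights": [2, 1, 3, 5],
    "price_band": ["high", "low", "low", "high"],  # target: price category
})

features = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["neighbourhood"]),
    ("ordinal", OrdinalEncoder(), ["room_type"]),
    ("binning", KBinsDiscretizer(n_bins=2, encode="ordinal"), ["minimum_nights"]),
])

model = make_pipeline(features, RandomForestClassifier(n_estimators=200, random_state=0))
model.fit(df.drop(columns="price_band"), df["price_band"])
print(model.predict(df.drop(columns="price_band")))
```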

Clinical Data Predictive Modeling

  • Worked in a team of four to conduct data analysis in Python on a clinical dataset (MIMIC), predicting patients' discharge location after ICU admission.
  • Conducted ETL, EDA, data cleaning, missing data imputation, feature engineering, variable selection and machine learning model building on about 60,000 raw records.
  • Constructed a visually rich, multi-page dashboard on GCP for data exploration, model evaluation and results demonstration, with interactive prediction powered by a real-time XGBoost model; a minimal sketch of that prediction step follows this list.
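
The MIMIC data cannot be shared, but the interactive prediction step behind the dashboard can be sketched roughly as follows, with made-up features standing in for the real ones:

```python
# Sketch of the real-time scoring step behind the dashboard's interactive
# prediction (hypothetical toy features, not the MIMIC data).
import numpy as np
from xgboost import XGBClassifier

# Toy training data: a few numeric features and a binary discharge label.
X_train = np.array([[65, 3.0, 1], [42, 1.5, 0], [78, 5.2, 1], [55, 2.1, 0]])
y_train = np.array([1, 0, 1, 0])  # e.g. 1 = discharged to a facility, 0 = home

model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X_train, y_train)

# A dashboard callback would collect user input and score it on the fly.
user_input = np.array([[70, 4.0, 1]])
print(model.predict_proba(user_input))
```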

Audio Separation

  • Worked in a pair to explore both linear (ICA) and non-linear (CNN, Open-Unmix) techniques for audio separation; an ICA sketch follows this list.
  • Prepared the MUSDB18 benchmark dataset with custom pre-processing (chunking, sampling, augmentation, random track mixing and a fixed validation split).
  • Selected evaluation metrics (SDR, SIR, SAR, ISR, precision and recall) and compared model results using SDR.
  • Wrote a 6-page paper-style report with a detailed literature review and gave a 15-minute presentation to the class.
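
A minimal version of the linear (ICA) baseline can be sketched with synthetic signals (the real experiments used MUSDB18 audio tracks):

```python
# FastICA sketch of the linear separation baseline
# (synthetic signals, not the MUSDB18 tracks used in the project).
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 440 * t)          # tone-like source
s2 = np.sign(np.sin(2 * np.pi * 3 * t))   # low-frequency square wave
sources = np.c_[s1, s2]

# Mix the two sources into two observed channels.
mixing = np.array([[1.0, 0.5], [0.4, 1.0]])
observed = sources @ mixing.T

# FastICA recovers the independent components up to scale and order.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)
print(recovered.shape)  # (8000, 2)
```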

Bayesian Algorithm Implementation

  • Worked in a pair to implement in Python the algorithm from a paper on the Indian Buffet Process prior in an infinite latent feature model; a small IBP sampling sketch follows this list.
  • Optimised the algorithm through reorganisation, JIT compilation and Cython, and demonstrated its applicability on both simulated and real-world datasets.
  • Built a PyPI package for future use and wrote a 10-page report with a detailed algorithm explanation and performance evaluation.
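
The packaged inference code is longer, but sampling from the Indian Buffet Process prior itself is compact; a standard simulation (not the package's implementation) looks roughly like this:

```python
# Sampling a binary feature matrix Z from an IBP(alpha) prior
# (a standard simulation, not the package's inference code).
import numpy as np

def sample_ibp(n_customers, alpha, rng):
    dish_counts = []   # how many customers have taken each dish so far
    rows = []
    for i in range(1, n_customers + 1):
        row = []
        # Existing dishes are taken with probability m_k / i.
        for k, m_k in enumerate(dish_counts):
            take = int(rng.random() < m_k / i)
            row.append(take)
            dish_counts[k] += take
        # The i-th customer also tries Poisson(alpha / i) new dishes.
        new = rng.poisson(alpha / i)
        row.extend([1] * new)
        dish_counts.extend([1] * new)
        rows.append(row)
    # Pad earlier rows with zeros for dishes introduced later.
    width = len(dish_counts)
    return np.array([r + [0] * (width - len(r)) for r in rows])

Z = sample_ibp(n_customers=10, alpha=2.0, rng=np.random.default_rng(0))
print(Z)
```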

Predictive Modeling

  • Led a team of four to evaluate painting prices in 18th-century Paris with R.
  • Conducted detailed EDA, missing data imputation and feature selection, and tried multiple ML models such as linear regression, BMA, Ridge/Lasso, CART, boosting and bagging.
  • Built a final MARS model with over 90% accuracy, completed a 19-page final report, and presented the results to 45 people in a 10-minute group presentation with 5-page slides.

Bike Share Prediction

  • Worked in a team of four to build a model for Capital Bikeshare that constructs a probability distribution over each user's end docking station.
  • Obtained and organised 2013-2017 bike share data from various sources with R, and requested historical weather data via API as a supplementary data source.
  • Constructed probability distributions for 2018 data by calculating the historical percentage of trips ending at each docking station within a given time window, and used profiling and parallel computation to improve model speed; an illustrative sketch follows this list.
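
The project itself was written in R, but the core idea of turning historical frequencies into a probability distribution can be illustrated in a few lines of pandas (the column names here are hypothetical):

```python
# Pandas sketch of the historical-percentage idea (the project used R;
# the column names here are hypothetical).
import pandas as pd

trips = pd.DataFrame({
    "start_station": ["A", "A", "A", "B", "A", "B"],
    "end_station":   ["X", "Y", "X", "X", "X", "Y"],
    "hour":          [8, 8, 8, 9, 8, 9],
})

# For each (start_station, hour) bucket, the share of historical trips ending
# at each station serves as the predicted probability distribution.
dist = (
    trips.groupby(["start_station", "hour"])["end_station"]
         .value_counts(normalize=True)
         .rename("probability")
         .reset_index()
)
print(dist)
```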

Spatial Data Analysis

  • Worked in a team of three to conduct spatial data analysis in R, investigating the competitive relationship between two convenience store brands, Sheetz and Wawa.
  • Used web scraping and API requests to collect data and completed a 7-page analysis with visualisations exploring the degree of geographic overlap between the target brands.

Facial Expression Recognition

  • Worked in a group of five to build SVM classifier, CNN and GNN models for facial expression recognition.
  • Evaluated model performance on the FER2013 and CK+ datasets and completed an 8-page paper covering method construction and model explanation.

R Shiny Dashboard

  • Worked individually to implement complex API requests for breaking news headlines and historical article access from over 30,000 news sources.
  • Created a well-organised and aesthetically pleasing R Shiny news deck (dashboard) that serves as a central news hub for readers and customers.