Here is a summary of my Data Science portfolio. Check out this post for my work and project experience!
Internships
Data Science Intern - Wells Fargo, USA
Jun 2020 - Aug 2020
- Collaborated in a pair to implement industry (NAICS code) classification in Python for over 350,000 business description texts across 1,041 classes.
- Conducted detailed EDA, text cleaning and pre-processing, and experimented with various models such as Naive Bayes, logistic regression, SVM, LightGBM, Random Forest and kNN.
- Built the champion model with 65% CV accuracy (a 20% improvement over the existing model) using BoW and SVM, and constructed an original hierarchical ensemble to speed up inference (30k records in 1 minute).
- Presented the work to about 80 people in a half-hour session and posted a 5-page blog on the company's internal website.
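A minimal sketch of the BoW + SVM approach described above, using scikit-learn; the business descriptions and NAICS codes below are made-up stand-ins for the proprietary data:

```python
# Bag-of-words features feeding a linear SVM, as in the champion model.
# Texts and labels are toy examples, not real Wells Fargo data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "retail sale of groceries and household goods",
    "software development and IT consulting services",
    "commercial construction and building contracting",
    "grocery store selling fresh produce",
    "custom software engineering for enterprises",
    "general contractor for office buildings",
]
labels = ["4451", "5415", "2362", "4451", "5415", "2362"]  # toy NAICS codes

# Unigrams + bigrams give the SVM slightly richer text features.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["retail grocery store"])[0])
```

The real task had 1,041 classes and 350k texts, so the hierarchical ensemble mattered mainly for throughput; the pipeline shape stays the same.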
Data Mining Intern - Bank of Ningbo, China
Jun 2019 - Aug 2019
- Helped build a deposit balance prediction model in SAS for white-collar debit card users.
- Assisted in initiating data analytics and process automation in Python to improve the team's performance and efficiency.
- Collaborated cross-functionally with other teams and departments to provide products and services while performing customer due diligence.
Credit Risk Analytics Intern - China Construction Bank, China
Jun 2018 - Aug 2018
- Helped collect financial statements of local automobile corporations for the year's first risk report.
- Participated in risk analysis and drafted loan investigation and due diligence reports.
- Assisted in loan management and helped evaluate borrowers' credit levels using the bank's internal credit system (mainly focused on financial liquidity).
Course Projects
Kaggle Competition - Airbnb Listing Data Analysis
- Independently conducted data analysis and predictive modelling of Airbnb listing prices in Buenos Aires with Python.
- Attempted various feature engineering techniques (binning, ordinal encoding, one-hot encoding, clustering) and machine learning algorithms, based on detailed data cleaning and EDA.
- Fine-tuned a Random Forest model and a LightGBM model for the final prediction and finished in the top 6% (9/142) on the final leaderboard for categorisation accuracy.
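A sketch of what the encoding plus tree-model pipeline could look like in scikit-learn; the column names and price bands below are invented for illustration, not the actual competition schema:

```python
# One-hot and ordinal encoding feeding a Random Forest, as in the
# Airbnb project. The DataFrame is a tiny synthetic example.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "neighbourhood": ["Palermo", "Recoleta", "Palermo", "Belgrano"],
    "room_type": ["Entire", "Private", "Private", "Entire"],
    "minimum_nights": [2, 1, 3, 2],
    "price_band": ["high", "low", "mid", "high"],  # binned target
})

prep = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["neighbourhood"]),
    ("ordinal", OrdinalEncoder(), ["room_type"]),
], remainder="passthrough")  # numeric columns pass through untouched

model = make_pipeline(prep, RandomForestClassifier(n_estimators=100,
                                                   random_state=0))
model.fit(df.drop(columns="price_band"), df["price_band"])
```

Binning the continuous price into bands is what turns this into the categorisation task scored on the leaderboard.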
Clinical Data Predictive Modeling
- Worked in a team of 4 to analyse a clinical dataset (MIMIC) in Python, predicting patients' discharge location after ICU admission.
- Conducted ETL, EDA, data cleaning, missing data imputation, feature engineering and variable selection, and built various machine learning models on about 60,000 raw records.
- Constructed a visually rich, multi-page dashboard on GCP for data exploration, model evaluation and results demonstration, with interactive prediction powered by a real-time XGBoost model.
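The interactive-prediction idea boils down to fitting a boosted model once and scoring user-entered records on demand. A sketch with synthetic data, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (real MIMIC data cannot be shown):

```python
# Fit once, then score single records as the dashboard form would.
# Features, labels and the decision rule are all synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))              # e.g. age, vitals, lab values
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # 1 = discharged home (toy rule)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

def predict_discharge(features):
    """Score one patient record interactively."""
    proba = model.predict_proba(np.asarray(features).reshape(1, -1))[0, 1]
    return proba

print(round(predict_discharge([1.5, 1.0, 0.0, 0.0, 0.0]), 2))
```

On GCP the same pattern runs behind the dashboard callback: the model stays in memory and each form submission triggers one `predict_proba` call.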
Audio Separation
- Worked in a pair to explore both linear (ICA) and non-linear (CNN, Open-Unmix) techniques for audio separation.
- Prepared the MUSDB18 benchmark dataset with custom pre-processing (chunking, sampling, augmentation, random track mixing and a fixed validation split).
- Selected metrics (SDR, SIR, SAR, ISR, precision and recall) and evaluated the different models' results with SDR.
- Finished a 6-page paper-style report with a detailed literature review and gave a 15-minute presentation to the class.
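For reference, a bare-bones version of the SDR metric used in the final comparison, assuming the estimate is already time-aligned with the reference (the full BSS-eval framework additionally decomposes interference and artifact terms):

```python
# Signal-to-distortion ratio: 10 * log10(||s||^2 / ||s - s_hat||^2).
# Higher is better; a perfect estimate drives the denominator to zero.
import numpy as np

def sdr(reference, estimate, eps=1e-9):
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps  # eps avoids log(0)
    return 10 * np.log10(num / den)

s = np.sin(np.linspace(0, 100, 4410))  # toy "source" signal
print(round(sdr(s, s + 0.01), 1))      # small constant error -> high SDR
```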
Bayesian Algorithm Implementation
- Worked in a pair to implement, in Python, the algorithm from a paper on the Indian Buffet Process prior in an infinite latent feature model.
- Optimised the algorithm through reorganisation, JIT compilation and Cython, and demonstrated its applicability on both simulated and real-world datasets.
- Built a PyPI package for future use and finished a 10-page report with a detailed algorithm explanation and performance evaluation.
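The prior at the heart of that paper can be sampled in a few lines. A compact sketch of drawing a binary feature matrix Z from the Indian Buffet Process: customer i takes each existing dish k with probability m_k / i, then samples Poisson(alpha / i) brand-new dishes:

```python
# Generative sampler for the Indian Buffet Process prior.
# Rows of Z are customers (data points), columns are dishes (latent features).
import numpy as np

def sample_ibp(n_customers, alpha, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    dish_counts = []   # m_k: how many customers took dish k so far
    rows = []
    for i in range(1, n_customers + 1):
        # take existing dish k with probability m_k / i
        row = [int(rng.random() < m / i) for m in dish_counts]
        for k, took in enumerate(row):
            dish_counts[k] += took
        new = rng.poisson(alpha / i)          # new dishes for customer i
        dish_counts.extend([1] * new)
        row.extend([1] * new)
        rows.append(row)
    K = len(dish_counts)
    return np.array([r + [0] * (K - len(r)) for r in rows])

Z = sample_ibp(10, alpha=2.0)
print(Z.shape)  # (10, K) with K random, roughly alpha * H_10 on average
```

The Gibbs sampler in the paper inverts this generative story to infer Z from data; the full implementation (with JIT/Cython speed-ups) lives in the package.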
Predictive Modeling
- Led a team of four to evaluate prices of 18th-century Paris paintings with R.
- Conducted detailed EDA, missing data imputation and feature selection, and attempted multiple ML models such as linear regression, BMA, Ridge/Lasso, CART, boosting and bagging.
- Built a final MARS model with more than 90% accuracy, completed a 19-page final report, and presented the results to 45 people in a 10-minute group presentation with a 5-page slide deck.
Bike Share Prediction
- Worked in a team of four to build a model for Capital Bikeshare that constructs a probability distribution over each user's end docking station.
- Obtained and organised 2013-2017 bike share data from various sources with R, and requested historical weather data via API as a supplementary data source.
- Constructed the probability distribution for 2018 data by computing each end docking station's historical share within a given time window, and used profiling and parallel computation to speed up the model.
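The core of that model is an empirical conditional distribution: for each start station and time window, the share of historical trips ending at each station. A stdlib-only Python sketch (the project itself was in R) with made-up trip records and a hypothetical coarse AM/PM bucket:

```python
# Empirical P(end station | start station, time window) from trip history.
from collections import Counter, defaultdict

trips = [  # (start_station, hour, end_station) -- illustrative data only
    ("31000", 8, "31200"), ("31000", 8, "31200"), ("31000", 9, "31300"),
    ("31000", 17, "31100"), ("31100", 8, "31000"),
]

counts = defaultdict(Counter)
for start, hour, end in trips:
    window = "AM" if hour < 12 else "PM"  # coarse time bucket
    counts[(start, window)][end] += 1

def end_station_distribution(start, window):
    """Normalise historical counts into a probability distribution."""
    c = counts[(start, window)]
    total = sum(c.values())
    return {station: n / total for station, n in c.items()}

print(end_station_distribution("31000", "AM"))
```

Precomputing these tables once is what made profiling and parallelism pay off: scoring a 2018 trip reduces to a dictionary lookup.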
Spatial Data Analysis
- Worked in a team of three to conduct spatial data analysis in R, investigating the competitive relationship between two shop brands, Sheetz and Wawa.
- Used web scraping and API requests to collect data and finished a 7-page analysis with visualisations exploring the geographic overlap between the target brands.
Facial Expression Recognition
- Worked in a group of five to build SVM, CNN and GNN classifiers for facial expression recognition.
- Evaluated model performance on the FER2013 and CK+ datasets and finished an 8-page paper on method construction and model explanation.
R Shiny Dashboard
- Worked individually to implement complex API requests for breaking news headlines and historical articles from over 30,000 news sources.
- Created a well-organised and aesthetically pleasing R Shiny news deck (dashboard) that serves as a central news hub for readers and customers.