Data Science & Database Administration
As Chief Technology Officer and Lead Developer for E-Laborative Technologies Ltd., I have been tasked with a variety of challenging requirements related to data storage, analysis, management, and administration. I have earned a professional certificate in Data Science from HarvardX, and I hold individual certifications in probability, inference & modeling, linear regression, data wrangling, productivity, R, and machine learning. I also earned a certificate for Epidemiology in Public Health Practice from Johns Hopkins University.
I regularly employ SQL and R for big data analytics, visualization, and storage. Most of my professional work involves proprietary business data, financial and legal documentation, workforce efficiency systems, and product distribution systems. I also enjoy researching public health issues and actively work on several projects related to environmental science, finance, public health, and epidemiology as a private citizen researcher.
The following statistics dashboards demonstrate some of my past work using R and Shiny to conduct exploratory data analysis on various subjects of general public interest. These dashboards draw on curated public repos, APIs, R packages, and other known sources for data acquisition.
Shiny Dashboards & Analysis Tools:
Coronavirus Visualization Dashboard
This dashboard shows Coronavirus stats and X-day interval growth rates in bar, column, and grid charts. It also offers colored choropleth maps of the US by state, by county, and in individual state-level views. This tool incorporates data from the U.S. Census Bureau and COVID-19 stats compiled by the CoronaDataScraper project. (*The CoronaDataScraper data source for this project stopped updating in Nov 2020.)
This investment simulator helps you explore how dollar cost averaging strategies combined with a home purchase could impact your expected net worth after a ten-plus year period.
This visualization tool enables the user to browse and locate weather readings data from over 170 years of observations at thousands of US weather stations operated by the NOAA and made available through the National Climatic Data Center.
This mammal predictor app was built in connection with my HarvardX capstone in Data Science. In this project, I used hierarchical imputation strategies combined with a recursive random forest algorithm, which uses raw data on body mass, length, and other characteristics to predict the order, family, and genus of a mammal with high accuracy.
Costco Drive Time Isochrone
This dashboard plots drive times to Costco locations along the I-5 corridor in Oregon.
- IMHE Data Visualization Dashboard
- US Census Data Visualizations
- CIA World Factbook
- NCDC Visualizations
- Corona Data Scraper
- Li Covid Atlas Project
- EDA - Exploratory data analysis
- PSM - Problem Solving Methodology
- Conceptual Framework - A picture of the problem that includes KDs
- KI - Key Indicator
- KD - Key Determinants
- OI - Outcome Indicators
- Descriptive Epidemiology - Carefully frames statistical assertions in the form of a concise statement
- Person, Place, Time - The key considerations for a descriptive epi statement
- Proximal Determinants - Most causally linked to OI
- Distal Determinants - Less causally linked or non-modifiable determinants
- Ecological Fallacy - When inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong
- Simpson's Paradox - When stratification reverses observed trends
- Leontief Model - An input-output model of how the output of each sector of an economy feeds the others, used to assess the balance of supply and demand
- Law of Large Numbers - As sample size grows, the mean of the sample approaches the true mean
- System of Equations - A group of linear equations containing 2 or more variables with coefficients to be solved or optimized
- Identity Matrix - A square matrix with the same number of rows as matrix A, with 1s on the diagonal from the upper left to lower right and 0s elsewhere, used to solve matrix problems
- De Morgan's Properties - Rules relating set complements, unions, and intersections: the complement of a union is the intersection of the complements, and the complement of an intersection is the union of the complements
- Histogram - A visualization to show the quantity of records within a given range of values within a dataset
- Boxplot - A visualization to show the median, interquartile range, and outliers of a numeric variable, often compared across groups
- Scatterplot - A plot of paired variable points on an X / Y axis
- Venn Diagram - A visualization of the relationship of 2 or more sets resembling overlapping circles
- Heat Map - A visualization indicating quantitative values through use of coloring
- Choropleth - A visualization indicating quantitative values through use of coloring of pre-defined areas on a geographic map
- Isochrone - A visualization indicating the areas on a geographic map that can be reached within a given travel time
- Timeseries - Data which provides observations in successive intervals
- Discrete Variables - Variables that take separate, countable values in a non-continuous fashion, e.g. integers.
- Continuous Variables - Variables that can take any value within a range, e.g. floating point decimals.
- Ordinal Variables - Variables that have a natural or intrinsic order, e.g. A, B, C, D etc.
- Nominal Variables - Variables that have no natural or intrinsic order, e.g. Dog, Cat, Giraffe
- CDF - Cumulative distribution function
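- Bayes' Theorem - A rule for determining conditional probability: P(A|B) = P(B|A) P(A) / P(B)
- Chebyshev's Theorem - A bound on how far a random outcome can fall from the mean: at least 1 - 1/k^2 of any distribution lies within k standard deviations of its mean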
- Prisoner's Dilemma - A paradox in decision analysis where individuals acting in self-interest do not produce the optimal outcome
- Fisher's Principle - An important principle in evolutionary biology which explains why the ratio of the sexes is roughly 1:1 in most species, in terms of an ESS
- ESS - Evolutionarily stable strategy, a refinement of the Nash equilibrium
- Correlation - A measure of how two variables move together
- Covariance - A measure of the joint variability of two random variables
- Bivariate Normal - When every linear combination aX+bY of two random variables X and Y has a normal distribution
- Decision Tree - A method of classification or prediction in which successive nodes split 2 ways based on a series of cutoffs
- RF - Random forests - an ensemble method utilizing decision trees and bagging
- PCA - Principal component analysis - a method of dimensionality reduction that defines an orthogonal coordinate system to describe variance
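- LDA - Linear discriminant analysis - a method to find a linear combination of features that best separates two or more classes
- ANOVA - Analysis of variance - a statistical hypothesis test that compares group means by partitioning variance into between-group and within-group components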
- Imputation - The act of replacing missing data to improve a model's performance
- MAR - Missing at Random - when the chance that a value is missing depends only on other observed columns, not on the missing value itself
- MCAR - Missing Completely at Random - when data is missing randomly with no discernible pattern
- MNAR - Missing Not at Random - when the chance that a value is missing depends on the unobserved value itself, so missingness follows a pattern
- Confusion Matrix - A set of numbers categorizing the outcome of a machine learning model in terms of True Positives, False Positives, True Negatives, and False Negatives.
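
As a small illustration of the last entry, the sketch below tallies a confusion matrix for a binary classifier and derives accuracy from it. It is written in Python purely for illustration (the dashboards above are built in R and Shiny), the function name `confusion_matrix` is a hypothetical helper, and the labels are made-up values rather than output from any of the projects described here.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary classifier (hypothetical helper)."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the true label is also positive
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the true label was positive
            counts["FN" if a == positive else "TN"] += 1
    return {k: counts[k] for k in ("TP", "FP", "TN", "FN")}

# Made-up labels, for illustration only
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
print(cm, accuracy)
```

Other summary metrics, such as sensitivity (TP / (TP + FN)) and specificity (TN / (TN + FP)), can be read off the same four counts.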