A Statistics Companion in R Markdown


The field of statistics requires careful standards and practices when collecting, arranging and classifying pieces of information “datum”, plural: data. These standards help ensure reproducible experiments and analysis to draw a conclusion, answer a question, or make a valid assertion about that information. Statistics uses charts and graphs to communicate that information and display the outcomes in a way that’s easy for a non-scientist to understand.

Statistics is very important to many fields of science from social sciences to finance to computer programming and artificial intelligence. Statistics are the “language of data” - and understanding how to properly quantify and express that information, and visualize those statistics clearly, can help create compelling arguments and persuade the reader from the class room to the board room.

This guide compiles many frequently used formulas, visualizations and theories in this field of study into the R programming language and presents these concepts in a written, textbook-style guide. In this guide, each chapter contains a different general subject matter, and contains a variety of formulas that can be called as a function. These functions may be called in a series, or nested into other formulas, to generate useful calculations and visualizations.

We will also explore some real-world applications of these concepts, such as:

R and R Markdown provide a powerful platform for general statistical programming. Visualization capabilities in R are enhanced through the “Grammar of Graphics” as implemented through the GGPlot2 and GGForce packages. Data manipulation functions like select, mutate, and summarize are added by the dplyr package, enabling more convenient and readable R language constructs, using the pipe %>% operator. This operator can help your write intuitive syntax in R that resembles common structured query languages like SQL.

This document will not provide basic instruction in R. It is assumed that the reader has some familiarity with the R language, syntax patterns, installation, and configuration. If this is your first introduction to R, please check out the appendix to learn more about the basics of this language and installing the RStudio programming environment. The Rmd files (which this document was written in) may be downloaded from github, enabling the reader to change any portion of this document and customize the word problems to cover different applied subjects.

This guide is intended for finance professionals, statisticians, data scientists, math teachers, students*, or anyone else who needs to generate, visualize, apply, or solve finite math problems.

  • Note to students: Many of these formulas can be used to solve probability math challenges; but you should avoid using this guide to answer questions on exams or as a shortcut to doing the hard work of understanding this important subject. The only way to the reach the goal of deep understanding, is by doing the work by hand. So please, do yourself a favor, and learn these concepts on pencil and paper before you use this guide. Your instructors will always want to see your work, and this guide won’t help you with that!

Load packages

This Rmarkdown file will use several packages as defined below. This code will load the packages only of they are not already installed. These packages will be loaded into the local scope via the library functions.

if(!require(pracma)) install.packages("pracma", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(ggforce)) install.packages("ggforce", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(gtools)) install.packages("gtools", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(matlib)) install.packages("matlib", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(MASS)) install.packages("MASS", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(kableExtra)) install.packages("kableExtra", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(RcppAlgos)) install.packages("RcppAlgos", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(latex2exp)) install.packages("latex2exp", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(xfun)) install.packages("xfun", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(readxl)) install.packages("readxl", repos = "http://cran.us.r-project.org", dependencies = TRUE)



Definition of Data

This section provides a basic primer about data. Data is the plural form of datum - which means, a single observation. So data is, by definition, more than one observation. Observations or measurements may be qualitative - those measuring a trait or attribute - or quantitative - those measuring a quantity or value - usually in a concrete, numerical or categorical way.


We will explore how to define variables, which can help with important choices on how to analyze and visualize the data being collected. All variables in a strongly typed or explicit environment must have a data type defined - which could be abstract, like a generic “object” or it may be specific like “integer” or “decimal” or “character”. Every programming language has different rules for how variables are defined, and whether they are strongly typed. R does not require that you define a variable before assigning it’s value. So R is weakly typed. A variable exists in teh type that befits the first value passed in to the variable. A variable may be a numeric value in a given unit (Like centimeters of distance, liters of volume, kilometers per hour of speed, etc.), or a variable may b a categorical value of a given type.

#create some variables of different types
var_num <- 5.01
var_chr <- "hello"
var_vector <- c(1,2,3,4,5)
var_df <- data.frame(greeting="salutations")

#check the types
## [1] "double"
## [1] "character"
## [1] "double"
## [1] "list"

Variables defined in R may be set as various types like you see above.

Unit of a variable

The unit of a variable is the unit of measurement you are using. WHen recording observations, one should try to avoid mixing units, and always label values clearly as to what unit is being used. In this companion,we will generally add _unit to the variable name, so if the unit is centimeters, it will add _cm to the variable name. Below are some unit abbreviations you may see in this companion.

Local Distance _km = kilometers _m = meters _cm = centimeters _mm = millimeters _um = micrometers (microns) _nm = nanometers

Astronomical Distance _pc = parsecs _ly = light years _au = astronomical units

Mass _t = tons _kg = kilograms _g = grams _mg = milligrams

Speed _kph = kilometers per hour _mps = meters per second

Time _hr = hours _min = minutes _sec = seconds

Precision of a Variable

The precision of a variable, is a measure of its rounding. The precision is the number of digits to the right of the decimal place in any numeric value. It’s important because when taking observations of a population, the experiment should use the same amount of precision, and same unit of measure, in taking all measurements.

Format Numbers with Rounding

It’s important to understand the difference between formatting a number through rounding (in which the number of digits to the right of the decimal point is fixed - vs. formatting using significant digits, where the final number of digits displays depends on the position of the 0’s. In the sections below we will explore how values are rounded or formatted to show varying amounts of precision. Remember, when you “round off” you are “losing precision” so - that’s not always a good idea, and should usually happen AFTER all other operations like multiplication, division, averaging etc have occurred.

var_pi <- 3.1415926
stmt <- paste("Pi rounded to 2 places is:", round(pi,2))
## [1] "Pi rounded to 2 places is: 3.14"
pc <- 6.62607015 * 10 ^ -34
stmt <- paste("Plancks constant rounded to 2 places is:", round(pc,2))
## [1] "Plancks constant rounded to 2 places is: 0"
stmt <- paste("Plancks constant rounded to 36 places is:", round(pc,36))
## [1] "Plancks constant rounded to 36 places is: 0.000000000000000000000000000000000663"

Format Numbers for Display using Significant Digits

Significant digits are those numbers that have meaning in a variable - so pretty much any non-zero number. the significant digits counts from the left to right, so a value with the same number of significant You can limit the significant digits of a variable using format.

var_bignum <- 9876543210
say_bignum <- numbers_to_words(var_bignum)
stmt <- paste(say_bignum," formatted to 3 significant digits is:", format(var_bignum,digits=3))
## [1] "nine billion, eight hundred seventy-six million, five hundred forty-three thousand, two hundred ten  formatted to 3 significant digits is: 9876543210"
var_mpp <- 1.917*10^13
say_mpp <- numbers_to_words(var_mpp)
stmt <- paste(say_mpp," formatted to 3 significant digits is:", format(var_mpp,digits=3))
## [1] "nineteen trillion, one hundred seventy billion  formatted to 3 significant digits is: 19170000000000"
var_pi <- 3.1415926
stmt <- paste("Pi formatted to 3 significant digits is:", format(pi,digits=3))
## [1] "Pi formatted to 3 significant digits is: 3.14"
pc <- 6.62607015 * 10 ^ -34
stmt <- paste("Plancks constant formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "Plancks constant formatted to 3 significant digits is: 0.000000000000000000000000000000000663"
pc <- 10000.999
stmt <- paste("10000.999 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "10000.999 formatted to 3 significant digits is: 10001"
pc <- 10000.99
stmt <- paste("10000.99 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "10000.99 formatted to 3 significant digits is: 10001"
pc <- 10000
stmt <- paste("10000 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "10000 formatted to 3 significant digits is: 10000"
pc <- 1000
stmt <- paste("1000 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "1000 formatted to 3 significant digits is: 1000"
pc <- 100
stmt <- paste("100 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "100 formatted to 3 significant digits is: 100"
pc <- 10
stmt <- paste("10 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "10 formatted to 3 significant digits is: 10"
pc <- 10.1
stmt <- paste("10.1 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "10.1 formatted to 3 significant digits is: 10.1"
pc <- 10.12
stmt <- paste("10.12 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "10.12 formatted to 3 significant digits is: 10.1"
pc <- 1
stmt <- paste("1 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "1 formatted to 3 significant digits is: 1"
pc <- 1.2
stmt <- paste("1.2 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "1.2 formatted to 3 significant digits is: 1.2"
pc <- 1.23
stmt <- paste("1.23 formatted to 3 significant digits is:", format(pc,digits=3))
## [1] "1.23 formatted to 3 significant digits is: 1.23"

Application: Round to the Nearest N

A common problem when dealing with data is rounding amounts that are not a decimal. In this function we will divide by the unit we want to round to, then round that product with 0 points of precision, then multiply by that number to get the “nearest number” that is divisible.

#define a list of misc. numbers
num_list <- c(11,23,43,24,32,56,55,76,60,87,98,92)

#round_to will round val to the nearest amount specified
round_to <- function(val,amount){
 rounded_amount <-  round(val / amount,0) * amount

#apply to the list using 10 as the factor to round to
##  [1]  10  20  40  20  30  60  60  80  60  90 100  90
#apply to the list using 5 as the factor to round to
##  [1]  10  25  45  25  30  55  55  75  60  85 100  90

Population & Samples

A group of observations about the same subject is called a population, and in many cases we will examine some elements of a population, but not all, which is referred to as a sample. A sample can give you insight into characteristics of the population, even if it does not contain the entire population.

A sample is fundamental to statistics - So if we are looking to determine water quality in a well - we take a sample (say, a 10 milliliters) and determine how much arsenic and lead is in that volume. Then we can multiple that by the total volume of water we drink each day, and determine approximately how much arsenic or lead we are ingesting. It’s a lot easier to take a sample, than to measure all the water in the well.

Sometimes, its impossible to measure something without taking a sample. For example, we can’t measure the exact number of stars in the galaxy- we can only estimate how many there are based on how many we can see in a given area of the night sky. Samples allow us to extrapolate conclusions, based on the limited information we have available.

Thankfully, we can extrapolate information that is pretty darn accurate - This is due to concepts we will explore later, including the concept of standard deviation, the central limit theorem, the law of large numbers, and other important principles in statistical analysis.

# Set the seed of R's random number generator, which is useful for creating simulations or random objects that can be reproduced.
# Create a matrix object
nCols = 5
nRows = 3
population <- matrix(runif(nCols * nRows), ncol = nCols )

# Print the Population
## Component 1 :
##           [,1]      [,2]      [,3]        [,4]      [,5]
## [1,] 0.1550508 0.7768197 0.2068877 0.193943927 0.8346921
## [2,] 0.9683788 0.4078857 0.1871036 0.434231178 0.8300830
## [3,] 0.4682631 0.5387971 0.7799697 0.002274518 0.9569678

Discrete Variables

The most obvious example of a discrete variable is an integer or whole number.

So what makes a variable “discrete”? A measurement is discrete when it belongs a to well-defined group.

So by this definition - A discrete variable is a variable that “stands alone” - while, by contrast, a continuous variable is one that “bleeds together”.

As an example, an integer must be either 1 or 2 or 3 …, not 1.5 or 1.78 or 3.1415926…

But this is where it gets tricky. Just because a number has a decimal, does not necessarily mean it’s continuous. You could have a set of discrete variables that also contain decimals.

Lets consider movie ratings - if you have a web site where you can rate a movie:

1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 stars

These discrete variables contain decimals - but they are still discrete, because your entire population of answers will always fall into one of these 9 values. Even if you have a million response,s you only have 9 possible ratings.

Continuous Variables

Determining if a variable is continuous or discrete may not always be easy - and it often depends on the natural process that is generating that variable - and also whether that measurement is subject to any rounding.

Lets consider a few more examples:

Measuring the size of a bird’s wing precisely - will produce a “continuous” variable - it could be 13.35 cm, or 14.42 cm, or 15.18 cm, etc.

So a continuous variable like this “could be anything”, within a reasonable expected range. So a bird wing is not going to be 60 feet long, or negative 3 inches. That would be ridiculous. But it could be anything from, lets say, an inch for a tiny Hummingbird to over 6 feet for a Condor. The range of a population of data is equal to th largest / maximum value minus the smallest / minimum value.

So in this sense - a continuous variable could be anything, but within an expected range, with some expected lower and upper boundaries or “bounds”. WHen describing variables, its also important to consider the precision - which formally is the number of decimal places you are measuring to. So a bird wing measured to a tenth of a cm has 1 place of precision; measuring to a hundredth has 2 places, a thousandth has 3, etc.

When making observations, its fairly important that all observations are measured to the same minimum level of precision - and that should be clear to whomever is making the measurement, YOu would not want one of your observers making very precise measurements, and another rounding to the nearest whole number. This would create inconsistent data for your population which could skew or bias the results.

A continuous variable could be a weight, a height, a length, a time span - just about anything that can be measured with a high level of precision and variability.

On the other hand a discrete variable will usually represent a distinct, fixed quantity - a count, a category, a letter of the alphabet, etc. So discrete variables don’t blend in to one another like continuous variables do.

But here’s another complication - if you were to round off a set of continuous measurements to the nearest whole number, that population could then also be considered discrete. So if we decide when we measure birds wings that we don’t need to be that precise - and round off that variable to the nearest cm, then youyour population is always going to be 13, or 14 or 15 cm, etc.

So in that sense, data can be both discrete and continuous - even the same data, depending on how its collected, and the level of precision you are using.

So - how can we scientifically determine this important element? In short - you can’t. Its always a question that is answered by the data itself. However - in creating a data collection system you can certainly add filters to the data to ensure it is discrete (to a certain level of precision) or you can allow that data to be continuous. So in this function we will explore how we could create a function that enforces one of these types.

write_vector <- function(v){
  s <- paste("{",paste(v, collapse = ","),"}")
collate_list <- function(data_point,data_type){
  #we are taking the data point passed in and applying rounding to ensure it falls within an expected discrete population
  new_point <- ifelse(data_type=='discrete',round(data_point),data_point)
data_collection <- c(13.35,14.42,15.18)

discrete_measurements <- sapply(data_collection,collate_list,'discrete')
## [1] 13 14 15
continuous_measurements <- sapply(data_collection,collate_list,'continuous')
## [1] 13.35 14.42 15.18
#write a sentence to summarize the result
result <- paste("discrete_measurements = ",write_vector(discrete_measurements),"\r\n continuous_measurements = ",write_vector(continuous_measurements),sep="") 

discrete_measurements = { 13,14,15 }
continuous_measurements = { 13.35,14.42,15.18 }

Ordinal Variables

Its useful to understand that a variable is discrete or continuous as it can help us decide what to do with that data and how to best display it.

BUt it also is helpful to consider if a discrete variable is Categorical, Ordinal, or Cardinal.

An Ordinal variable is variable in a population that has a natural “order” or position-

Numbers are ordinal since you expect to see them in some kind of order - 1, then 2 then 3 then 4, etc.

Letters like like A, B, C, D is an example of ordinal, discrete variables

Sizes like Small, Medium, Large (Or Tall, Grande, Venti - if you like to go your own way) that’s another example of a ordinal.

Ordinal, Discrete variable can be displayed in Ascending (going from small to large, or start to end) or Descending (large to small, or end to start) order.

Cardinal Variables

An offshoot of this idea is that of Cardinal variables which describe a quantity as opposed to a position. So 1 horse, 2 horses, 3 horses are Cardinal. WHereas saying that this is the First Horse, Second Horse, and Third Horse is Ordinal

In addition, you might say that discrete variables are Interval if the distance between them is fixed. like 100 pounds, 200 pounds, 300 pounds. These could be considered Discrete, Cardinal and Interval.

Categorical Variables

So what about “Categorical” variables? These are variables that are discrete, but generally not subject to an “expected” order.

An example of cardinal variable could be families of insects: Ants, Moths, Bees, Caterpillars, Grasshoppers, Beetles, etc. These are Categorical variables. You don’t necessarily expect to see Ants before Bees in a list, or Moths before Grasshoppers.

You might expect to see these categories ordered by some other “ordinal” measure like the average weight or wingspan; or you might expect to see them in a list that is alphabetized. But in that case - you are actually looking at an ordinal variable that is derived from the categorical variable (the first letter of the category name). So again, a category is not generally ordinal - it has no “intrinsic” order.

However - once again there’s some gray area here. Lets take colors - one might say that red, blue and yellow are categorical and not ordinal since they exist independently of one another. So if someone asked you what are your favorite colors, in order, you could say Blue, then Red then Yellow. But in another sense they ARE ordinal because the colors of the spectrum are determined by the electromagnetic wavelengths of visible light - and we can recall our dear friend Roy G. Biv to remember the “expected order” of colors - (Red, Orange, Yellow, Green, Blue, Indigo, Violet). But once again - the category itself is not ordinal - rather, a property of the color (its wavelength) is ordinal.

So here we can see that it can be complicated to determine if a variable is ordinal or categorical, depending on what the variable describes, and how that relates to natural systems around us, and how we as humans expect things to be positioned in a list.

In this function we will return a statement that guesses whether a list is ordinal, cardinal, or categorical, based on the type of data in the list.

categorize_list <- function(set,measure){
  is_all_numeric <- is.numeric(set)
  is_all_integer <- is.integer(set)
  is_nonnum <- !is.numeric(set)
    y <- 0 
    interval_set <- c()
    precision_set <- c()
    for(x in sort(set)){
      is_integer <- x == round(x)
      precision_set <- c(precision_set,is_integer)
      if(y != 0){
        since_last <- x - y
        interval_set <- c(interval_set,since_last)
      y <- x
    datarelate <- 'uneven'
    if(length(unique(interval_set)) == 1){
      datarelate <- 'even'
    precision <- 'numeric values'
    continuity <- 'continuous'
    if(precision_set == TRUE){
       precision <- 'integer values'
       continuity <- 'discrete'
    if(measure == 'quantity' | measure == 'amount'  | measure == 'value' | measure == 'width' | measure == 'height' | measure == 'length' | measure == 'weight' | measure == 'mass' | measure == 'volume' | measure == 'wavelength' | measure == 'amplitude'){
      msg <- paste('This set contains ',continuity,' ',precision ,', and describes quantity. It has ',datarelate,' intervals. It\'s probably cardinal data.',sep='')
    }else if(measure == 'position' | measure == 'rank' | measure == 'placement' | measure == 'order'){
      msg <- paste('This set contains ',continuity,' ',precision ,', and describes position. It has ',datarelate,' intervals. It\'s probably ordinal data.',sep='')
      msg <- paste('This set contains ',continuity,' ',precision ,', and describes ',measure,'. It has ',datarelate,' intervals. It\'s either ordinal or cardinal data.',sep='')
     msg <- paste('This set contains some non-numeric values describing ',measure,'. It\'s probably categorical data.',sep='')

collected_data1 <- c(1,7,14,23,2,4,51,5)
collected_data2 <- c(1,2,3,4,5,6)
collected_data3 <- c(1.1,1.7,2,2.234,3,3.14159,5,5.1)
collected_data4 <- c('bird','fish','cat','dog')
collected_data5 <- c('red','orange','yellow','green','blue','indigo','violet')

## [1]  1  7 14 23  2  4 51  5
## [1] "This set contains discrete integer values, and describes quantity. It has uneven intervals. It's probably cardinal data."
## [1] 1 2 3 4 5 6
## [1] "This set contains discrete integer values, and describes position. It has even intervals. It's probably ordinal data."
## [1] 1.10000 1.70000 2.00000 2.23400 3.00000 3.14159 5.00000 5.10000
## [1] "This set contains continuous numeric values, and describes quantity. It has uneven intervals. It's probably cardinal data."
## [1] "bird" "fish" "cat"  "dog"
## [1] "This set contains some non-numeric values describing animals. It's probably categorical data."
## [1] "red"    "orange" "yellow" "green"  "blue"   "indigo" "violet"
## [1] "This set contains some non-numeric values describing colors. It's probably categorical data."

This function will provide a rough guess as to how each of these lists would probably be categorized.

We know that collected_data1 contains discrete integer values with uneven intervals - let’s take a look at what that set actually contains again.

Now to get a sense of whats in the vector - we will go ahead and sort the list so we can see what all the values are in order, since the data appears to be ordinal.

When dealing with ordinal data, there are some key commands in base R that are helpful. ONe of these is the “sort” function that will order the data in ascending order by default.

Alternatively you can “negate” the data AND the sort to get a descending order version of the set.

sorted_quantities_asc <- sort(collected_data1)
## [1]  1  2  4  5  7 14 23 51
sorted_quantities_desc <- -sort(-collected_data1)
## [1] 51 23 14  7  5  4  2  1

Defining Data Frames using dplyr & magrittr

Now we will define a data frame. THe data frame is R’s version of a database table. To define a data frame you must set each column name and pass in a vector of values to build the rows of your data frame. A data frame can be a single column or many columns. We will start with a one column data frame.

#create a data frame with one column
df_quantities <- data.frame(quantity=collected_data1)

Data frames are convenient for working with the GGPlot package; they allow you to name your vectors and combine them into tables, then clearly defining the x/y axis and aesthetics of your GGPlot using easy-to-interpret language constructs.

simple_plot <- ggplot(data=df_quantities, aes(x=rank(-quantity),y=quantity)) + 

Here we can visualize the values in teh set, using the quantity as the vertical y axis and the rank of the quantity, as the order along the horizontal x axis. Rank is the ordinal position of this value in the list if the list is sorted from least to most or A to Z. The rank is reversed with a - to place the highest values first resulting in the chart above.

Application: Measuring Birds

Now, lets consider a less abstract scenario; we are asked to compile some statistics about bird specimens in a natural science collection. For each bird, we must measure 5 variables- its’ length in cm, wingspan in mm, wing color, weight in grams, and species common name.

For each type of measurement below we can run the categorize_list function to confirm the characteristics of this vector of observation. These vectors will then be combined into a data frame below.

length_cm <- c(25,20,12,8,30,12,21)
wingspan_cm <- c(38,28,20,4,39,22,36)
weight_g <- c(77,32.2,15,7.5,76.5,16,68)
species_name <- c('American Robin','Baltimore Oriole','House finch','Hummingbird','Scrub jay','Indigo bunting','Starling')
species_color <- c('red','orange','yellow','green','blue','indigo','violet')

## [1] 25 20 12  8 30 12 21
## [1] "This set contains discrete integer values, and describes quantity. It has uneven intervals. It's probably cardinal data."
## [1] 38 28 20  4 39 22 36
## [1] "This set contains discrete integer values, and describes wingspan. It has uneven intervals. It's either ordinal or cardinal data."
## [1] 77.0 32.2 15.0  7.5 76.5 16.0 68.0
## [1] "This set contains continuous numeric values, and describes quantity. It has uneven intervals. It's probably cardinal data."
## [1] "American Robin"   "Baltimore Oriole" "House finch"      "Hummingbird"     
## [5] "Scrub jay"        "Indigo bunting"   "Starling"
## [1] "This set contains some non-numeric values describing species. It's probably categorical data."
df_birds <- data.frame(
  length_cm = length_cm,
  wingspan_cm = wingspan_cm,
  weight_g = weight_g,
  species_name = species_name

simple_plot <- ggplot(data=df_birds, aes(x=weight_g,y=wingspan_cm,label=species_name)) + #geom_smooth(method="lm") +
  geom_point() + 


Preview the Power: Try Out a Generalized Linear Model (GLM)

Now lets jump ahead a bit and get a little taste of how data.frame and principles of statistics can be used for linear modeling in R. Suppose you are given the task now of estimating the weight for a few more bird specimens that are mounted, and cannot be accurately weighed. A generalized linear model (GLM) can use the length and wingspan of these new specimens to predict the unknown weight.

Before we delve into all the theory and math behind this prediction, lets just use some common sense - if we need to predict an unknown weight; we know intuitively that bigger things are usually heavier. And we know that we can measure the length and wingspan. So to test thisnotion that bogger bords are heavier mathmatically, we will use the cor function in R to confirm this phenomenon. The cor function will show us the correlation between multiple variables in our data frame. A correlation of 1 means that they move together perfectly, in a 1:1 manner, or some other ratio.

df_birds_measurements <- df_birds %>% dplyr::select(length_cm,wingspan_cm,weight_g)
##             length_cm wingspan_cm  weight_g
## length_cm   1.0000000   0.9230812 0.9347757
## wingspan_cm 0.9230812   1.0000000 0.9113025
## weight_g    0.9347757   0.9113025 1.0000000

Sure enough, weight, wingspan and length are all highly correlated in these birds. So we can estimate based on the length. But we also have this other metric, the wingspan - longer wings would probably support heavier loads… so which one do we use? In this case, teh answer is - use both. A glm can incorporate both the length and the wingspan.

The glm function will generate a model which will predict a value for an unknown variable based on one or more known variables. Later, we will learn more of the inner workings of the glm to understand how it does this.

You can now use your available bird data to make a prediction about the weight of these additional birds.

model <- glm(weight_g ~ wingspan_cm + length_cm, data=df_birds)
## Call:
## glm(formula = weight_g ~ wingspan_cm + length_cm, data = df_birds)
## Deviance Residuals: 
##       1        2        3        4        5        6        7  
##   9.515  -14.826   -5.755    9.575   -4.172   -6.371   12.034  
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.1140    12.8097  -1.961    0.121
## wingspan_cm   0.8079     1.0657   0.758    0.491
## length_cm     2.4759     1.6902   1.465    0.217
## (Dispersion parameter for gaussian family taken to be 159.4973)
##     Null deviance: 5782.08  on 6  degrees of freedom
## Residual deviance:  637.99  on 4  degrees of freedom
## AIC: 59.452
## Number of Fisher Scoring iterations: 2

the model has now been fitted to the data.

new_wing <-c(24,22,34,14)
new_length <-c(16,14,26,10)
new_species <-c('Nuthatch*','Titmouse*','Dove*','Junco*')

#define new observation
newdata = data.frame(wingspan_cm=new_wing, length_cm=new_length)

Now we will introduce some new birds with one field (weight) missing, and use this data to make a prediction on the missing weight.

#use model to predict value of am
predict(model, newdata, type="response")
##        1        2        3        4 
## 33.89072 27.32307 66.72893 10.95600
predicted_weight <- predict(model, newdata, type="response")

The predicted weights will come back as a list in the same order that rows of data were passed in. We can now recombine the data from the new birds, and the predicted weights.

df_new <- data.frame( 
  length_cm = new_length,
  wingspan_cm = new_wing,
  weight_g = predicted_weight,
  species_name = new_species)
df_birds_combined <- df_birds %>% bind_rows(df_new)
 simple_plot <- ggplot(data=df_birds_combined, aes(x=weight_g,y=wingspan_cm,label=species_name)) + #geom_smooth(method="lm") +
  geom_point() + 
  geom_label(nudge_x=1,hjust=0) + 
  ggtitle('Selected Birds Weight by Wingspan (*=estimates)')

Loading Data: Minnesota Birds

OK - so we have seen how we can load data manually, one value at a time, but clearly this is not very helpful when you have large data sets you are working with, usually in excel or some kind of database format.

Excel files are a common format for data collection and interchange - and R has some excellent capabilities to import excel file types. So we will start there. You must load the readxl package to gain the read_excel function in R. In this example, the data file is in the same directory as my .R file:

tbl_birds <-read_excel("minnesota_birds.xlsx")
df_minnesota_birds <- data.frame(tbl_birds)
##      Name           Scientific.Name    Indicator.or.Other.Notes
##  Length:444         Length:444         Length:444              
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##    Category            State             Kingdom             Phylum         
##  Length:444         Length:444         Length:444         Length:444        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##     Class              Order              Family             Genus          
##  Length:444         Length:444         Length:444         Length:444        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  Conservation.status Length.Min..cm.  Length.Max..cm. Body.Mass.Min..g.
##  Length:444          Min.   :  7.00   Min.   :  8.9   Min.   :   2.0   
##  Class :character    1st Qu.: 13.88   1st Qu.: 16.0   1st Qu.:  18.0   
##  Mode  :character    Median : 20.50   Median : 25.0   Median :  55.0   
##                      Mean   : 28.51   Mean   : 35.4   Mean   : 327.3   
##                      3rd Qu.: 39.00   3rd Qu.: 46.0   3rd Qu.: 331.0   
##                      Max.   :138.00   Max.   :180.0   Max.   :9200.0   
##  Body.Mass.Max..g. Wingspan.Min..cm. Wingspan.Max..cm.
##  Min.   :    3.0   Min.   :  8.00    Min.   : 10.60   
##  1st Qu.:   28.0   1st Qu.: 23.00    1st Qu.: 25.00   
##  Median :   88.5   Median : 38.00    Median : 42.00   
##  Mean   :  665.6   Mean   : 52.67    Mean   : 60.62   
##  3rd Qu.:  589.5   3rd Qu.: 72.75    3rd Qu.: 83.00   
##  Max.   :14300.0   Max.   :240.00    Max.   :300.00   
##                    NA's   :1         NA's   :1

The above code reads in a table from a XLSX file and converts it to a data.frame.

After we do this, we can use dplyr as before to mutate, then select specific fields, then mutate it again, etc. until our data is ready for whatever task we have in store.

In this case I want only the 3 measurements I used in my prior data set. I will discard a bunch of extra fields that are in the excel data. Also recall that our model above had only an average measurement for length, wingspan and weight - not a range of min/max values. So if we want to compare this external data with our own data, we may need to fortify the data with some new columns for averages to make the values line up.

Using dplyr to manipulate data frames

Filter Operator

In this example we will use the pipe operator to pipe data from the left to the right. Each time you pipe data you are saving the step of assigning that operation to a named variable. Piping data is a sort of shorthand way to apply multiple transformations to your data in a series, and just apply the final transformation into a named variable.

df_small_birds <- df_birds_combined %>% filter(weight_g <= 50)
df_large_birds <- df_birds_combined %>% filter(weight_g > 50)

Mutate Operator

In the example above only one pipe operation is performed; so this is the most basic version. You just take your data (df_quantities), pipe it into a function (filter), and assign that to a variable (df_small_birds). It’s works exactly like if you just called a single function and returned the result. Below is an example of using multiple pipe operators on a line - first filtering, then adding a column.

df_small_birds <- df_birds_combined %>% filter(weight_g <= 50) %>% mutate(class='small')
df_large_birds <- df_birds_combined %>% filter(weight_g > 50) %>% mutate(class='large') 

Arrange Operator

In the example below we are filtering to larger birds, adding a column, and arranging the result list by weight.

df_large_birds_byweight <- df_birds_combined %>% filter(weight_g > 50) %>% mutate(class='large')  %>% arrange(weight_g)

End-to-End Data Transformation Pipeline with dplyr

dplyr’s data.frame manipulation features combined with the pipe operator provide a great tool for running these kinds of multi-step transformations - it enables you to pipe data from step to step to step, with no intermediate variables “holding” data - which is great for reproducible science - there’s no extra variable names to potentially mix up, your development environment only retains the starting and ending state, and there’s simply no gap from one operation to the next. You feed in one table, and get out another one, no matter how many steps you do in the process.

This example is intended to be a little complex - because we want to demonstrate how the data can be manipulated multiple times in a single operation. In this case we are selecting some fields, renaming them, averaging 2 fields values row by row, then selecting only the averages we want, with a in a new field order, and a sorted arrangement of rows. That’s 5 different transformations. But the pipe means we don’t need to create 5 new data frames to hold each iteration of change were going to perform. We just run the operations in order, like so:

#assign to df_minnesota_bird_measurements
df_minnesota_bird_measurements <- 
  #using df_minnesota_birds as the source
  df_minnesota_birds %>% 
  #mutate the field names we want to keep, removing dots and upper case letters, etc.
  species_name = Name,
  weight_g_min = Body.Mass.Min..g.,
  weight_g_max = Body.Mass.Max..g.,
  length_cm_min = Length.Min..cm.,
  length_cm_max = Length.Max..cm.,
  wingspan_cm_min = Wingspan.Min..cm.,
  wingspan_cm_max = Wingspan.Max..cm.
 ) %>% 
  #now select only those fields
 ) %>% 
  #group them row wise for the next operations
  dplyr::rowwise() %>% 
  #average the min/max values for each row into a new average column
   weight_g=mean(c(weight_g_min, weight_g_max), na.rm=T),
   length_cm=mean(c(length_cm_min, length_cm_max), na.rm=T),
   wingspan_cm=mean(c(wingspan_cm_min, wingspan_cm_max), na.rm=T)
   ) %>% 
  #select just the averages, in teh same order as our previous table
  ) %>% 
   #reorder the table by wingspan descending

#look at the table now

So now we’ve completed this series of transformations to create a clear chain of events, that takes the data in the external file, and makes it match the format I need for further work.

 simple_plot <- ggplot(data=df_minnesota_bird_measurements, aes(x=log10(weight_g),y=log10(wingspan_cm),label=species_name)) + geom_smooth(method="loess") +
  geom_point() + 
  #geom_text() + 
  ggtitle('Selected Birds Weight by Wingspan')

df_birds_all <- df_birds_combined %>% bind_rows(df_minnesota_bird_measurements %>% dplyr::select(length_cm,wingspan_cm,weight_g,species_name) %>% mutate(species_color = 'Not Available'))

df_birds_all %>% arrange(-wingspan_cm) %>% head(20)
#grab a small subset of the data
df_sample <- df_birds_all %>% filter(wingspan_cm>=18 & wingspan_cm < 24 & weight_g >= 15 & weight_g < 25)

 simple_plot <- ggplot(data=df_sample, aes(x=log10(weight_g),y=log10(wingspan_cm),label=species_name)) + 
  #geom_smooth(method="loess") +
  geom_point() + 
  geom_label(aes(alpha=.5)) + 
  ggtitle('Selected Birds Weight by Wingspan')