QuakeR: Analyzing USGS Earthquake Data in R Markdown

The many public and private agencies that track and analyze physical phenomena on Earth provide a fascinating, never-ending source of data for R projects. In this guide, we will work with one of the richest, most detailed, continually updated, and readily available data sources in the earth sciences. The data set we will explore here is provided by the U.S. Geological Survey - the premier source of data on earthquakes, built from seismographs at a network of monitoring stations all over the world. We will explore how various R code constructs can be used to download and visualize this large earthquake data set.

Introduction & setup

R and R Markdown provide a powerful platform for accessing public data sources and developing reproducible science. Visualization capabilities in R are enhanced through the “Grammar of Graphics” as implemented by the ggplot2 and ggforce packages. Data manipulation functions like select, mutate, and summarize are added by the dplyr package, enabling more convenient and readable R language constructs built around the pipe operator %>%. This operator can help you write intuitive R syntax that reads somewhat like a structured query language such as SQL.

This document will not provide basic instruction in R. It is assumed that the reader has some familiarity with the R language, syntax patterns, installation, and configuration. If this is your first introduction to R, please check out the appendix to learn more about the basics of the language and installing the RStudio programming environment. The Rmd source files for this document may be downloaded from GitHub, enabling the reader to change any portion of this document and customize the examples to cover different applied subjects.

This project is intended for students and instructors in the field of earth sciences, GIS professionals, and anyone who is interested in using the USGS data for analysis projects in R.

Load packages

This R Markdown file uses several packages, as defined below. The code below installs each package only if it is not already installed.

After installation, each package is loaded into the session via require. Note that if a package had to be installed first, you may need to call library explicitly (or re-run the chunk) to load it.

#install required packages
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(rvest)) install.packages("rvest", repos = "http://cran.us.r-project.org")
if(!require(stringr)) install.packages("stringr", repos = "http://cran.us.r-project.org")
if(!require(glue)) install.packages("glue", repos = "http://cran.us.r-project.org")
if(!require(maps)) install.packages("maps", repos = "http://cran.us.r-project.org")
if(!require(gridExtra)) install.packages("gridExtra", repos = "http://cran.us.r-project.org")
if(!require(gganimate)) install.packages("gganimate", repos = "http://cran.us.r-project.org")
if(!require(gifski)) install.packages("gifski", repos = "http://cran.us.r-project.org")
if(!require(geosphere)) install.packages("geosphere", repos = "http://cran.us.r-project.org")
if(!require(ggmap)) install.packages("ggmap", repos = "http://cran.us.r-project.org")

#(the require calls above also load each library into the session)
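The eleven repetitive if(!require(...)) lines above can be collapsed into a short helper. Here is a sketch - the function name load_packages is our own, not from any package - which also calls library after a fresh install so the package ends up loaded either way:

```r
#hypothetical helper: install any missing packages, then load each one
load_packages <- function(pkgs) {
  for (p in pkgs) {
    #require returns FALSE instead of erroring when a package is absent
    if (!require(p, character.only = TRUE, quietly = TRUE)) {
      install.packages(p, repos = "http://cran.us.r-project.org")
      library(p, character.only = TRUE)  #load after a fresh install
    }
  }
  invisible(pkgs)
}

#already-installed packages are simply loaded; missing ones are installed first
load_packages(c("stats", "utils"))
```

The same call works for the full package list used in this guide.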

Extract, Transform, and Load (ETL)

Nearly every data analysis project begins with ETL. This process involves the initial setup of the project, pulling data from a dynamic data source, standardizing the data, and making it usable for your project.

Identify USGS data sources

The USGS.gov web site of the United States Geological Survey (USGS) provides a variety of scientific data sources for public use. These data sources are derived from earth sensing equipment all over the world, including earth-sensing satellites, ground- and sea-based weather stations, and seismic monitoring facilities spanning the globe. In this guide, we will focus on how to retrieve this data and build visualizations from USGS sources.

Download CSV data

The first data source we will tap into is a constantly updated feed available in Comma-Separated Values (CSV) format. Feeds are also available in other formats such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML), but we will use CSV, since it is the simplest to read and parse.

The data provides a list of all earthquakes detected in a specified time range. Results are limited to 20,000 rows per query, so you must restrict the time range of each request to stay under that cap. It’s easy to stitch the separate result sets back together using bind_rows.

The base R read.csv function is the simplest way to get this data. In the following code block we perform three queries on the USGS public earthquake feed, one for each of three consecutive months. Note how the start date of each query is the same as the end date of the previous one.

#download a csv of earthquake data into a data frame called earthquakes_all
#earthquakes_all <- read.csv("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv")
earthquakes_jul <- read.csv("https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&starttime=2022-07-01&endtime=2022-08-01")
earthquakes_aug <- read.csv("https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&starttime=2022-08-01&endtime=2022-09-01")
earthquakes_sep <- read.csv("https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&starttime=2022-09-01&endtime=2022-10-01")

#alternatively, read a previously downloaded csv from the local folder
#earthquakes_all <- read.csv("all_month-8-30.csv")
earthquakes_all <- earthquakes_jul %>% bind_rows(earthquakes_aug) %>% bind_rows(earthquakes_sep)
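If you need more months, constructing the query URLs programmatically keeps the code from growing by one read.csv line per month. A sketch using base sprintf (the variable names here are our own; the endpoint is the same one used above):

```r
#build one USGS query URL per month
starts <- c("2022-07-01", "2022-08-01", "2022-09-01")
ends   <- c("2022-08-01", "2022-09-01", "2022-10-01")
base_url <- "https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&starttime=%s&endtime=%s"
urls <- sprintf(base_url, starts, ends)

#download each month and stitch the results together (uncomment to run)
#monthly <- lapply(urls, read.csv)
#earthquakes_all <- do.call(rbind, monthly)   #or dplyr::bind_rows(monthly)
```

Extending the date range then only requires adding entries to the starts and ends vectors.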

Using read.csv, we have downloaded a list of all earthquakes over the course of several months, so we should have quite a few data points to work with. One might be surprised to see thousands upon thousands of earthquake events in a single month. Considering how infrequently we might feel an actual earthquake, it’s amazing to think that the earth is so seismically active!

We’ll now take a look at the first 10 rows to explore how the data is provided, using the head function, and get a rundown of the data format using the summary function.

#view some attributes of the data set
head(earthquakes_all, 10)
summary(earthquakes_all)
##      time              latitude        longitude          depth       
##  Length:34430       Min.   :-64.66   Min.   :-180.0   Min.   : -3.80  
##  Class :character   1st Qu.: 33.41   1st Qu.:-151.5   1st Qu.:  3.93  
##  Mode  :character   Median : 38.79   Median :-121.6   Median :  9.40  
##                     Mean   : 38.38   Mean   :-110.1   Mean   : 24.64  
##                     3rd Qu.: 52.92   3rd Qu.:-113.0   3rd Qu.: 19.60  
##                     Max.   : 85.43   Max.   : 180.0   Max.   :653.59  
##       mag           magType               nst              gap      
##  Min.   :-1.260   Length:34430       Min.   :  0.00   Min.   : 11   
##  1st Qu.: 0.850   Class :character   1st Qu.:  9.00   1st Qu.: 72   
##  Median : 1.400   Mode  :character   Median : 17.00   Median :106   
##  Mean   : 1.657                      Mean   : 23.73   Mean   :122   
##  3rd Qu.: 2.100                      3rd Qu.: 30.00   3rd Qu.:160   
##  Max.   : 7.600                      Max.   :444.00   Max.   :360   
##  NA's   :1                           NA's   :7888     NA's   :7888  
##       dmin             rms             net                 id           
##  Min.   : 0.000   Min.   :0.0000   Length:34430       Length:34430      
##  1st Qu.: 0.027   1st Qu.:0.1100   Class :character   Class :character  
##  Median : 0.067   Median :0.1800   Mode  :character   Mode  :character  
##  Mean   : 0.818   Mean   :0.2945                                        
##  3rd Qu.: 0.246   3rd Qu.:0.4600                                        
##  Max.   :47.586   Max.   :2.2800                                        
##  NA's   :13263    NA's   :1                                             
##    updated             place               type           horizontalError 
##  Length:34430       Length:34430       Length:34430       Min.   : 0.080  
##  Class :character   Class :character   Class :character   1st Qu.: 0.280  
##  Mode  :character   Mode  :character   Mode  :character   Median : 0.520  
##                                                           Mean   : 2.015  
##                                                           3rd Qu.: 1.460  
##                                                           Max.   :99.000  
##                                                           NA's   :10631   
##    depthError           magError         magNst          status         
##  Min.   :    0.000   Min.   :0.000   Min.   :  0.00   Length:34430      
##  1st Qu.:    0.450   1st Qu.:0.102   1st Qu.:  4.00   Class :character  
##  Median :    0.800   Median :0.155   Median :  8.00   Mode  :character  
##  Mean   :    3.195   Mean   :0.254   Mean   : 15.38                     
##  3rd Qu.:    1.818   3rd Qu.:0.232   3rd Qu.: 16.00                     
##  Max.   :16786.400   Max.   :6.190   Max.   :760.00                     
##  NA's   :1           NA's   :8800    NA's   :7925                       
##  locationSource      magSource        
##  Length:34430       Length:34430      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  

Limit data analysis to earthquakes only & remove error rows

A careful read of the type field reveals that this data set contains some event types that are not earthquakes.

earthquakes_all %>% distinct(type)

For the remainder of the analysis, we will look only at earthquakes and remove explosions, quarry blasts, ice quakes, and other phenomena that are not “strictly earthquakes”.

earthquakes_all <- earthquakes_all %>%
  filter(type == 'earthquake') %>%
  filter(!is.na(mag), !is.na(latitude), !is.na(longitude), !is.na(time))

Examine columns

The earthquake data is very extensive, with 22 columns describing each event - roughly 750,000 total data points across some 34,000 rows.

columns_all <- colnames(earthquakes_all) 
##  [1] "time"            "latitude"        "longitude"       "depth"          
##  [5] "mag"             "magType"         "nst"             "gap"            
##  [9] "dmin"            "rms"             "net"             "id"             
## [13] "updated"         "place"           "type"            "horizontalError"
## [17] "depthError"      "magError"        "magNst"          "status"         
## [21] "locationSource"  "magSource"

The data source has 22 columns, including location, magnitude, and many other measurements for each event. Some of these are fairly obvious - latitude, longitude, magnitude, and time. We can use the glue function from the glue package to interpolate R variables into statements about the data.

#examine the range of 4 important variables
range_lat <- range(earthquakes_all$latitude)
range_lon <- range(earthquakes_all$longitude)
range_mag <- range(earthquakes_all$mag)
range_depth <- range(earthquakes_all$depth)

glue("This data set covers a latitude range from {range_lat[1]} to {range_lat[2]} and a longitude range from {range_lon[1]} to {range_lon[2]}. The events in this data registered between {range_mag[1]} and {range_mag[2]} on the Richter scale, at depths from {range_depth[1]} to {range_depth[2]} km below sea level.")
## This data set covers a latitude range from -64.6571 to 85.4314 and a longitude range from -179.9968 to 179.9992. The events in this data registered between -1.26 and 7.6 on the Richter scale, at depths from -3.8 to 653.587 km below sea level.

Define key constants/conversions

earth_circum_km <- 40075
earth_diam_km <- 12742
meters_per_km <- 1000
meters_per_mile <- 1609.34
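As a quick sanity check on these constants, we can define a small conversion helper (km_to_miles is our own name, not a library function) and confirm that one degree of latitude works out to roughly 111 km:

```r
#constants from the chunk above
earth_circum_km <- 40075
meters_per_km <- 1000
meters_per_mile <- 1609.34

#our own helper: convert kilometers to miles via meters
km_to_miles <- function(km) km * meters_per_km / meters_per_mile

#one degree of latitude spans the same distance everywhere on earth
km_per_degree_lat <- earth_circum_km / 360
round(km_per_degree_lat, 1)               # ~111.3 km
round(km_to_miles(km_per_degree_lat), 1)  # ~69.2 miles
```

We will reuse these constants later when converting the depth field between kilometers and miles.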

As noted above, the values in latitude and longitude indicate that the data covers the majority of the planet Earth, at magnitudes that vary from hardly perceptible events to major earthquakes. These facts raise an important question - what units are depth and magnitude reported in? It is not clear what some of the values in the data set refer to unless you have domain knowledge of USGS terminology and measures.

The USGS site does not provide an obvious link to a metadata file that defines these fields. We will explore how to get the descriptive metadata in the next section.

Geocode terminology

The term Geocode refers to the latitude and longitude pair of coordinates for any given place on earth.

Latitude Latitude can be understood as distance North or South from the equator. Lines of latitude run parallel to the equator, are evenly spaced, and never touch each other. Latitudes south of the equator are given as negative values from 0 to -90. Latitudes north of the equator are given as positive values from 0 to 90.

Longitude Longitude can be understood as distance East or West from the Prime Meridian. Longitude is generally expressed in degrees to the east or west of this line, between 0 and 180. The Prime Meridian was originally set as a line proceeding from the North Pole, through Greenwich, England, to the South Pole. The line opposite the Prime Meridian, also called the antimeridian, runs roughly between the western part of Alaska and the easternmost part of Russia. Lines of longitude run perpendicular to the equator, are evenly spaced at the equator, and converge at the north and south poles. Longitudes west of the Prime Meridian (0 degrees) are given as negative values from 0 to -180. Longitudes east of the Prime Meridian are given as positive values from 0 to 180.
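Because lines of longitude converge at the poles, the east-west distance spanned by one degree of longitude shrinks with latitude, roughly in proportion to the cosine of the latitude. A sketch (the function name is our own, and this treats the earth as a perfect sphere):

```r
earth_circum_km <- 40075

#approximate east-west span of one degree of longitude at a given latitude
km_per_degree_lon <- function(lat_deg) {
  (earth_circum_km / 360) * cos(lat_deg * pi / 180)
}

round(km_per_degree_lon(0), 1)    # ~111.3 km at the equator
round(km_per_degree_lon(60), 1)   # ~55.7 km at 60 degrees north
round(km_per_degree_lon(90), 1)   # 0 km at the pole
```

This matters later when comparing distances between earthquake epicenters at different latitudes.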

A histogram of earthquake event data

Jumping ahead a bit, for the purpose of the following basic visualizations, I’ll reveal that magnitude is in local magnitude (Richter-type) units, abbreviated ML, and depth is in kilometers, abbreviated km.

It is often a good idea to start analysis of a population with a histogram so we can get a rough sense of the distribution of values. There is simply no way to understand the distribution of a population using only the mean, median, and range. A histogram will start to illuminate how earthquake events are distributed by magnitude and depth.

Our first ggplot will keep it simple - we will provide just one metric to analyze and add a title and axis labels. We will add citation/attribution and more exciting customizations like color in later charts.

Visualizing a data set with ggplot2 is usually easiest when you use the dplyr pipe operator* to feed the data set into the ggplot call. The syntax looks like this:

*In later versions of R (4.1 and up), the built-in pipe operator |> can be used without loading dplyr. For this guide, we will treat the pipe as part of the dplyr package and load it accordingly.

[dataset] %>% [ggplot constructor with aesthetic mappings] + [chart type (geom)] + [plot labels]

This is perhaps the simplest way to create a ggplot and output it to the console.

earthquakes_all %>% ggplot(aes(x=mag)) + 
  geom_histogram() + 
  labs(title='Magnitude of all earthquake events',x='Magnitude (ML)',y='Count')

A histogram of events by magnitude reveals the general count of earthquakes at each magnitude. This is one of the simplest visualizations we can perform on the data, but it reveals quite a lot: earthquakes tend to occur mostly around magnitude 1, but there is another mode around 4.25 with an increased incidence of events.

earthquakes_all %>% ggplot(aes(x=depth)) + 
  geom_histogram() + 
  labs(title='Depth of all earthquake events',x='Depth (km)',y='Count')