The many public and private agencies that track and analyze physical phenomena on Earth provide a fascinating, never-ending source of data for R projects. In this guide, we will work with one of the most rich, detailed, continually updated, and readily available data sources in the earth sciences. The data set we will explore is provided by the U.S. Geological Survey - the premier source of data on earthquakes, built from seismographs at a network of monitoring stations all over the world. We will explore how various R constructs can be used to download and visualize this large earthquake data set.
R and R Markdown provide a powerful platform for accessing public data sources and developing reproducible science. Visualization capabilities in R are enhanced through the “Grammar of Graphics” as implemented in the ggplot2 and ggforce packages. Data manipulation functions like select, mutate, and summarize are added by the dplyr package, enabling more convenient and readable R constructs built around the pipe operator %>%. This operator can help you write intuitive R code that reads much like a structured query language such as SQL.
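As a quick illustration of the pipe (a sketch using the built-in mtcars data set rather than the earthquake data), compare a nested function call to its piped, SQL-like equivalent:

```r
library(dplyr)

# nested function calls read inside-out
head(arrange(filter(mtcars, cyl == 6), desc(mpg)), 3)

# the pipe reads top-to-bottom, like SELECT ... WHERE ... ORDER BY ... LIMIT
mtcars %>%
  filter(cyl == 6) %>%    # WHERE cyl = 6
  arrange(desc(mpg)) %>%  # ORDER BY mpg DESC
  head(3)                 # LIMIT 3
```

Both forms return the same rows; the piped version simply states the steps in the order they happen.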
This document will not provide basic instruction in R. It is assumed that the reader has some familiarity with the R language, its syntax patterns, installation, and configuration. If this is your first introduction to R, please check out the appendix to learn more about the basics of the language and installing the RStudio programming environment. The Rmd source files for this document may be downloaded from GitHub, enabling the reader to change any portion of it and customize the word problems to cover different applied subjects.
This project is intended for students and instructors in the field of earth sciences, GIS professionals, and anyone who is interested in using the USGS data for analysis projects in R.
This R Markdown file uses several packages, as defined below. The code will install each package only if it is not already installed.
After installation, the packages are loaded into the local scope via the library function.
#install required packages
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(rvest)) install.packages("rvest", repos = "http://cran.us.r-project.org")
if(!require(stringr)) install.packages("stringr", repos = "http://cran.us.r-project.org")
if(!require(glue)) install.packages("glue", repos = "http://cran.us.r-project.org")
if(!require(maps)) install.packages("maps", repos = "http://cran.us.r-project.org")
if(!require(gridExtra)) install.packages("gridExtra", repos = "http://cran.us.r-project.org")
if(!require(gganimate)) install.packages("gganimate", repos = "http://cran.us.r-project.org")
if(!require(gifski)) install.packages("gifski", repos = "http://cran.us.r-project.org")
if(!require(geosphere)) install.packages("geosphere", repos = "http://cran.us.r-project.org")
if(!require(ggmap)) install.packages("ggmap", repos = "http://cran.us.r-project.org")
#load library into the local scope
library(dplyr)
library(ggplot2)
library(rvest)
library(stringr)
library(glue)
library(maps)
library(gridExtra)
library(gganimate)
library(geosphere)
library(gifski)
library(ggmap)
#prevent scientific notation in printed output
options(scipen=999)
Nearly every R project begins with ETL (Extract, Transform, Load). This process involves the initial setup of the project: pulling data from a dynamic data source, standardizing it, and making it usable for your project.
The USGS.gov web site of the United States Geological Survey (USGS) provides a variety of scientific data sources for public use. These data sources are derived from sensing equipment all over the world, including earth-observing satellites, ground- and sea-based weather stations, and seismic monitoring facilities spanning the globe. In this guide, we will focus on how to retrieve and visualize data from USGS sources.
The first data source we will tap into is a constantly updated feed available in Comma-Separated Values (CSV) format. Feeds are also available in other formats such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML), but we will use CSV, since it is the simplest to read and parse.
The data provides a list of all earthquakes detected in a specified time range. Results are limited to 20,000 rows per query, so you must keep each time range short enough to stay under that limit. It’s easy to stitch the separate data sets back together using bind_rows.
The base R read.csv function is the simplest way to get this data. In the following code block, we perform three queries against the USGS public earthquake feed, one for each of three consecutive months. Note how the start date of each query is the same as the end date of the previous one.
#download a csv of earthquake data into a data frame called earthquakes_all
#earthquakes_all <- read.csv("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv")
earthquakes_jul <- read.csv("https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&starttime=2022-07-01&endtime=2022-08-01")
earthquakes_aug <- read.csv("https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&starttime=2022-08-01&endtime=2022-09-01")
earthquakes_sep <- read.csv("https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&starttime=2022-09-01&endtime=2022-10-01")
#use the csv in the folder
#earthquakes_all <- read.csv("all_month-8-30.csv")
earthquakes_all <- earthquakes_jul %>% bind_rows(earthquakes_aug) %>% bind_rows(earthquakes_sep)
Using read.csv, we have downloaded a list of all earthquakes over the course of several months, so we should have quite a few data points to work with. One might be surprised to see thousands upon thousands of earthquake events in a single month. Considering how infrequently we might feel an actual earthquake, it’s amazing to think that the earth is so seismically active!
We’ll now take a look at the first few rows using the head function, and get a rundown of the data format using the summary function.
#view some attributes of the data set
head(earthquakes_all)
summary(earthquakes_all)
## time latitude longitude depth
## Length:34430 Min. :-64.66 Min. :-180.0 Min. : -3.80
## Class :character 1st Qu.: 33.41 1st Qu.:-151.5 1st Qu.: 3.93
## Mode :character Median : 38.79 Median :-121.6 Median : 9.40
## Mean : 38.38 Mean :-110.1 Mean : 24.64
## 3rd Qu.: 52.92 3rd Qu.:-113.0 3rd Qu.: 19.60
## Max. : 85.43 Max. : 180.0 Max. :653.59
##
## mag magType nst gap
## Min. :-1.260 Length:34430 Min. : 0.00 Min. : 11
## 1st Qu.: 0.850 Class :character 1st Qu.: 9.00 1st Qu.: 72
## Median : 1.400 Mode :character Median : 17.00 Median :106
## Mean : 1.657 Mean : 23.73 Mean :122
## 3rd Qu.: 2.100 3rd Qu.: 30.00 3rd Qu.:160
## Max. : 7.600 Max. :444.00 Max. :360
## NA's :1 NA's :7888 NA's :7888
## dmin rms net id
## Min. : 0.000 Min. :0.0000 Length:34430 Length:34430
## 1st Qu.: 0.027 1st Qu.:0.1100 Class :character Class :character
## Median : 0.067 Median :0.1800 Mode :character Mode :character
## Mean : 0.818 Mean :0.2945
## 3rd Qu.: 0.246 3rd Qu.:0.4600
## Max. :47.586 Max. :2.2800
## NA's :13263 NA's :1
## updated place type horizontalError
## Length:34430 Length:34430 Length:34430 Min. : 0.080
## Class :character Class :character Class :character 1st Qu.: 0.280
## Mode :character Mode :character Mode :character Median : 0.520
## Mean : 2.015
## 3rd Qu.: 1.460
## Max. :99.000
## NA's :10631
## depthError magError magNst status
## Min. : 0.000 Min. :0.000 Min. : 0.00 Length:34430
## 1st Qu.: 0.450 1st Qu.:0.102 1st Qu.: 4.00 Class :character
## Median : 0.800 Median :0.155 Median : 8.00 Mode :character
## Mean : 3.195 Mean :0.254 Mean : 15.38
## 3rd Qu.: 1.818 3rd Qu.:0.232 3rd Qu.: 16.00
## Max. :16786.400 Max. :6.190 Max. :760.00
## NA's :1 NA's :8800 NA's :7925
## locationSource magSource
## Length:34430 Length:34430
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
A careful read of the type field reveals that this data set contains some event types that are not earthquakes.
earthquakes_all %>% distinct(type)
For the remainder of the analysis, we will look only at earthquakes and remove explosions, quarry blasts, ice quakes, and other phenomena that are not “strictly earthquakes”.
earthquakes_all <- earthquakes_all %>%
  filter(type == 'earthquake') %>%
  filter(!is.na(mag)) %>%
  filter(!is.na(latitude)) %>%
  filter(!is.na(longitude)) %>%
  filter(!is.na(time))
The earthquake data is very extensive, with 22 columns describing each event. Across tens of thousands of rows, that adds up to well over 700,000 individual data points.
columns_all <- colnames(earthquakes_all)
columns_all
## [1] "time" "latitude" "longitude" "depth"
## [5] "mag" "magType" "nst" "gap"
## [9] "dmin" "rms" "net" "id"
## [13] "updated" "place" "type" "horizontalError"
## [17] "depthError" "magError" "magNst" "status"
## [21] "locationSource" "magSource"
The data source has 22 columns, including location, magnitude, and many other measurements for each event. Some of these are fairly obvious - latitude, longitude, magnitude, time, and so on. We can use the glue function from the glue package to make statements about the data and drop in R variables.
#examine the range of 4 important variables
range_lat <- range(earthquakes_all$latitude)
range_lon <- range(earthquakes_all$longitude)
range_mag <- range(earthquakes_all$mag)
range_depth <- range(earthquakes_all$depth)
glue("This data set covers a latitude range from {range_lat[1]} to {range_lat[2]} and a longitude range from {range_lon[1]} to {range_lon[2]}. The events in this data registered between {range_mag[1]} and {range_mag[2]} on the Richter scale, at depths from {range_depth[1]} to {range_depth[2]} km below sea level.")
## This data set covers a latitude range from -64.6571 to 85.4314 and a longitude range from -179.9968 to 179.9992. The events in this data registered between -1.26 and 7.6 on the Richter scale, at depths from -3.8 to 653.587 km below sea level.
earth_circum_km <- 40075
earth_diam_km <- 12742
meters_per_km <- 1000
meters_per_mile <- 1609.34
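As a quick sanity check on these constants, here is a sketch of a unit conversion using them (the helper function km_to_miles is ours, defined for illustration, not part of any package):

```r
# unit constants defined above
meters_per_km <- 1000
meters_per_mile <- 1609.34

# illustrative helper: convert kilometers to miles via meters
km_to_miles <- function(km) km * meters_per_km / meters_per_mile

# the deepest event in this data set is roughly 654 km below sea level
km_to_miles(654)   # about 406 miles
```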
As noted above, the latitude and longitude values indicate that the data covers the majority of the planet, at magnitudes that vary from hardly perceptible events to major earthquakes. These facts raise an important question: what units are depth and magnitude measured in? It is not clear what some of the values in the data set refer to unless you have domain knowledge of USGS terminology and measures.
The USGS site does not provide an obvious link to a metadata file that defines these fields. We will explore how to get the descriptive metadata in the next section.
The term geocode refers to the latitude and longitude pair of coordinates for any given place on Earth.
Latitude can be understood as distance north or south from the equator. Lines of latitude run parallel to the equator, are evenly spaced, and never touch each other. Latitudes south of the equator are given as negative values from 0 to -90; latitudes north of the equator are given as positive values from 0 to 90.
Longitude can be understood as distance east or west from the Prime Meridian, generally expressed in degrees between 0 and 180 to the east or west of that line. The Prime Meridian was originally set as a line proceeding from the North Pole, through Greenwich, England, to the South Pole. The line opposite the Prime Meridian, also called the antimeridian, runs roughly between the western part of Alaska and the easternmost part of Russia. Lines of longitude run perpendicular to the equator, are evenly spaced at the equator, and converge at the North and South Poles. Longitudes west of the Prime Meridian (0 degrees) are given as negative values from 0 to -180; longitudes east of it are given as positive values from 0 to 180.
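To make geocodes concrete, here is a small sketch using the geosphere package loaded earlier to compute the great-circle distance between two illustrative coordinate pairs. Note that geosphere expects each point as c(longitude, latitude), the reverse of the conversational "lat/lon" order:

```r
library(geosphere)

# two illustrative geocodes, given as c(longitude, latitude)
greenwich   <- c(0, 51.4779)          # on the Prime Meridian
los_angeles <- c(-118.2437, 34.0522)  # west of it, hence negative longitude

# distHaversine returns meters; divide to get kilometers
distHaversine(greenwich, los_angeles) / 1000
```

The result is on the order of 8,800 km, which matches the intuition that the two points are separated by roughly a quarter of the Earth's circumference.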
Jumping ahead a bit, for the purpose of the following basic visualizations, I’ll reveal that magnitude is given in local magnitude (Richter) units, abbreviated ML, and depth is given in kilometers, abbreviated km.
It is often a good idea to start analysis of a population with a histogram, so we can get a rough sense of the distribution of values. There is simply no way to understand the distribution of a population using only the mean, median, and range. A histogram will start to illuminate how earthquake events are distributed by magnitude and depth.
Our first ggplot will keep it simple: we will provide just one metric to analyze, and add a title, subtitle, and axis labels. We will add citation/attribution and more exciting customization features like colors in later charts.
Visualizing a data set in ggplot2 is usually easiest when you use the dplyr package and the pipe operator*, feeding the data set directly into the ggplot constructor. The syntax looks like this:
*In recent versions of R (4.1 and later), a native pipe operator |> is available that does not require dplyr. For this guide, we will use the dplyr/magrittr pipe %>%.
[dataset] %>% [ggplot constructor with aesthetics parameters] + [ggplot chart type] + [plot labels]
This is perhaps the simplest way to create a ggplot and output it to the console.
earthquakes_all %>% ggplot(aes(x=mag)) +
geom_histogram() +
labs(title='Magnitude of all earthquake events',x='Magnitude (ML)',y='Count')
A histogram of events by magnitude reveals the general count of earthquakes at each magnitude. This is one of the simplest visualizations we can perform on the data, but it reveals quite a lot: earthquakes tend to occur mostly around magnitude 1, but there is another mode around 4.25 with an increased incidence of events.
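To examine that second mode around magnitude 4.25 more closely, one might zoom in on the larger events. This is a sketch building on the earthquakes_all data frame above; the 3.0 cutoff and 0.1 bin width are arbitrary choices for illustration:

```r
# zoom in on larger events to inspect the second mode
earthquakes_all %>%
  filter(mag >= 3) %>%
  ggplot(aes(x=mag)) +
  geom_histogram(binwidth=0.1) +
  labs(title='Magnitude of larger earthquake events (3.0 and above)',
       x='Magnitude (ML)', y='Count')
```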
earthquakes_all %>% ggplot(aes(x=depth)) +
  geom_histogram() +
  labs(title='Depth of all earthquake events',x='Depth (km)',y='Count')