In this project we will examine how to use data provided by public data sources for analysis of home values, rental rates, and movements in home price over a 17 year time period from 2004 to 2020. Analysis of this nature can be used for investment research, home value analysis, appraisal, and comparative value analysis.
This report uses data from current, Zillow Research real estate information database (zhvi), and additional archival data, also from Zillow, which was retrieved from Kaggle. The additional data was uploaded in 2017 and includes some data metrics that do not currently appear on the Zillow Research site. The datasets include median home price estimates and rental estimates for geographical areas between the zip code and state level.
Set the state, metro (MSA) and zip code you are interested in researching. Some charts that deal with more specific subjects or geographic regions will use these as filters to limit the scope of analysis. Not all charts will use all parameters.
explore_statename <- "California"
explore_state <- "CA"
explore_city <- "Santa Rosa"
explore_metro <- "Santa Rosa"
explore_county <- "Sonoma"
explore_zip <- "95409"
explore_multi_metro <- c('Santa Rosa','San Francisco','San Jose')
Load packages used by this analysis project and set the default to standard notation.
# Note: this process could take a couple of minutes
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")
if(!require(ggrepel)) install.packages("ggrepel", repos = "http://cran.us.r-project.org")
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(tidycensus)) install.packages("tidycensus", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(dslabs)) install.packages("dslabs", repos = "http://cran.us.r-project.org")
if(!require(rpart.plot)) install.packages("rpart.plot", repos = "http://cran.us.r-project.org")
if(!require(rgbif)) install.packages("rgbif", repos = "http://cran.us.r-project.org")
if(!require(corrr)) install.packages("corrr", repos = "http://cran.us.r-project.org")
if(!require(splitstackshape)) install.packages("splitstackshape", repos = "http://cran.us.r-project.org")
if(!require(mice)) install.packages("mice", repos = "http://cran.us.r-project.org")
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
if(!require(data.table)) install.packages("data.table", repos = "http://cran.us.r-project.org")
if(!require(lubridate)) install.packages("lubridate", repos = "http://cran.us.r-project.org")
if(!require("rnoaa")) install.packages("rnoaa", repos = "http://cran.us.r-project.org")
library(randomForest)
library(mice)
library(tidyverse)
library(dslabs)
library(caret)
library(dplyr)
library(rpart.plot)
library("rgbif")
library(corrr)
library(splitstackshape)
library(data.table)
library(ggrepel)
library(tidycensus)
library(lubridate)
library('rnoaa')
options(scipen = 999)
In this section we will conduct Extract, Transform & Load (ETL) processes on several data files. The data has been saved as a comma separated values (.csv) file in the /data/ directory relative to this Rmd file.
#set reload to false if you wish to rerun analysis without reloading the files.
reload=TRUE
#set teh working directory where the project is loaded. Data folder should be in this directory
setwd("D:/RProjects/RealEstateProspector/")
#the file name for the county level time series file
sourcefilename <- "./data/County_time_series.csv"
if(reload){
counties <- read.csv(sourcefilename, header=TRUE, fill = TRUE, stringsAsFactors = FALSE, fileEncoding="UTF-8-BOM")
}
totalrows <- nrow(counties)
#the file name for the county level cross walk file
sourcefilename2 <- "./data/CountyCrossWalk_Zillow.csv"
if(reload){
crosswalk <- read.csv(sourcefilename2, header=TRUE, fill = TRUE, stringsAsFactors = FALSE, fileEncoding="UTF-8-BOM")
}
totalrows2 <- nrow(crosswalk)
profiles <- counties %>% left_join(crosswalk, by=c("RegionName" = "FIPS"))
#the file name for the ZHVI source by Metro
sourcefilename3 <- "./data/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv"
if(reload){
zhvi <- read.csv(sourcefilename3, header=TRUE, fill = TRUE, stringsAsFactors = FALSE, fileEncoding="UTF-8-BOM")
}
totalrows3 <- nrow(zhvi)
#the file name for the ZORI source by Metro
sourcefilename4 <- "./data/Metro_ZORI_AllHomesPlusMultifamily_SSA.csv"
if(reload){
zori <- read.csv(sourcefilename4, header=TRUE, fill = TRUE, stringsAsFactors = FALSE, fileEncoding="UTF-8-BOM")
}
totalrows4 <- nrow(zori)
#the file name for the ZHVI source by Zip
sourcefilename5 <- "./data/Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv"
if(reload){
zhvi_zip <- read.csv(sourcefilename5,
header=TRUE,
fill = TRUE,
stringsAsFactors = FALSE,
fileEncoding="UTF-8-BOM")
}
totalrows5 <- nrow(zhvi_zip)
#Join ZHVI and ZORI Metro tables on RegionID
values <- zhvi %>% left_join(zori, by=c("RegionID" = "RegionID"))
A total of 518791 rows are present in the County time series data.
A total of 3144 rows are present in the County Cross walk data.
A total of 914 rows are present in the ZHVI by Metro data.
A total of 107 rows are present in the ZORI by Metro data.
A total of 30230 rows are present in the ZHVI by Zip data.
We will now conduct some initial exploratory data analysis (EDA).
#These calls help with initial EDA - Commented out
#glimpse(zhvi)
#glimpse(zori)
#summary(zhvi)
#summary(zori)
The Data includes many characteristics of real estate data related to home values, inventories, time on market, rent estimates, and other factors.
The Zillow ZHVI and ZORI Data will use end of year values for charts in this section. The ZHVI is a proprietary smoothed estimate of “typical home values” provided by zillow. It is not the exact median or average. Details about how the ZHVI is calculated is available on the Zillow research site: https://www.zillow.com/research/zhvi-user-guide/
Year end home value data is available, grouped by metro.
#explore the ZHVI data on home values at the end of each year BY METRO
#2020 uses the october data that was latest available at time of this analysis
yearend_homevalues <- zhvi %>%
select( RegionName, StateName,
eoy_2004=X2004.12.31,
eoy_2005=X2005.12.31,
eoy_2006=X2006.12.31,
eoy_2007=X2007.12.31,
eoy_2008=X2008.12.31,
eoy_2009=X2009.12.31,
eoy_2010=X2010.12.31,
eoy_2011=X2011.12.31,
eoy_2012=X2012.12.31,
eoy_2013=X2013.12.31,
eoy_2014=X2014.12.31,
eoy_2015=X2015.12.31,
eoy_2016=X2016.12.31,
eoy_2017=X2017.12.31,
eoy_2018=X2018.12.31,
eoy_2019=X2019.12.31,
eoy_2020=X2020.10.31) %>%
pivot_longer(c('eoy_2004',
'eoy_2005',
'eoy_2006',
'eoy_2007',
'eoy_2008',
'eoy_2009',
'eoy_2010',
'eoy_2011',
'eoy_2012',
'eoy_2013',
'eoy_2014',
'eoy_2015',
'eoy_2016',
'eoy_2017',
'eoy_2018',
'eoy_2019',
'eoy_2020'
),
names_to = "year",
values_to = "homevalue") %>%
mutate(varz='home_value')
The year end home data is also available grouped by zip code. This data will be sued for “dril down” into specific cities since thre are generally too many zip codes in a state to be effectively visualized.
#explore the ZHVI data on home values at the end of each year BY ZIP
#2020 uses the october data that was latest available at time of this analysis
yearend_homevalues_byzip <- zhvi_zip %>%
select( RegionID, RegionName, CountyName, City, StateName, Metro,
eoy_2004=X2004.12.31,
eoy_2005=X2005.12.31,
eoy_2006=X2006.12.31,
eoy_2007=X2007.12.31,
eoy_2008=X2008.12.31,
eoy_2009=X2009.12.31,
eoy_2010=X2010.12.31,
eoy_2011=X2011.12.31,
eoy_2012=X2012.12.31,
eoy_2013=X2013.12.31,
eoy_2014=X2014.12.31,
eoy_2015=X2015.12.31,
eoy_2016=X2016.12.31,
eoy_2017=X2017.12.31,
eoy_2018=X2018.12.31,
eoy_2019=X2019.12.31,
eoy_2020=X2020.10.31) %>%
pivot_longer(c('eoy_2004',
'eoy_2005',
'eoy_2006',
'eoy_2007',
'eoy_2008',
'eoy_2009',
'eoy_2010',
'eoy_2011',
'eoy_2012',
'eoy_2013',
'eoy_2014',
'eoy_2015',
'eoy_2016',
'eoy_2017',
'eoy_2018',
'eoy_2019',
'eoy_2020'
),
names_to = "year",
values_to = "homevalue") %>%
mutate(varz='home_value')
Rent data is more limited in the length of time the data was recorded, and is available for a limited number of metros only.
#Explore ZORI values - 2014 to present (Shorter time period)
yearend_rents <- zori %>% select(RegionName,
eoy_2014=X2014.12,
eoy_2015=X2015.12,
eoy_2016=X2016.12,
eoy_2017=X2017.12,
eoy_2018=X2018.12,
eoy_2019=X2019.12,
eoy_2020=X2020.10
) %>%
pivot_longer(c('eoy_2014',
'eoy_2015',
'eoy_2016',
'eoy_2017',
'eoy_2018',
'eoy_2019',
'eoy_2020'
),
names_to = "year",
values_to = "medianrent") %>%
mutate(varz='median_rent')
In this section we will pull in some data from the US Census Bureau’s American Community Survey using the tidycensus package. An API Key acquired from the US Census web site.
If census_api_key(’’) is blank - you will need to go fetch a key of your own, and enter it inside the emoty quotes in this method call.
Unfortunately - Zillow does not use the GEOID assigned by the USCB - so the best we can do is try to use string comparisons and a state abbreviation index to find matching geography data in USCB data sources. Additional work should be done here to create a proper index table, but the county name will be sufficient for rough matching.
The following data set are using the 5-Year ACS which is recommended for use where precision is importrant - the year ACS covers all geographies and provides the largest sample sizes, while the 1 and 3 year ACS are only available for geographies with greater populations.
Many additional subjects are available for analysis as outlined in the link below. This analysis will be expanded to include additional subjects in future versions.
https://www.census.gov/programs-surveys/acs/guidance/subjects.html
#before running this code, you must plug in your own census API KEY:
#census_api_key("")
#For reduction of key usage, this data should be loaded only once per session or as needed
#change this to TRUE to load the census data again.
#This must be set to true before knitting or the knit may fail.
reloadcensus=TRUE
library(datasets)
states_index <- data.frame(STATEABB=as.character(state.abb),
STATENAME=as.character(state.name),
STATEREGION=as.character(state.region))
if(reloadcensus==TRUE){
joinsto <- "county"
population <- get_acs(geography = joinsto, variables = "B01001_001E")
income <- get_acs(geography = joinsto, variables = "B06011_001E")
medianage <- get_acs(geography = joinsto, variables = "B01002_001E")
renters <- get_acs(geography = joinsto, variables = "B25129_038E")
population <- population %>% separate(NAME, c("COUNTY","STATE"),sep=",") %>% mutate(population_est = estimate) %>% select(-estimate)
income <- income %>% separate(NAME, c("COUNTY","STATE"),sep=",") %>% mutate(income_est = estimate) %>% select(-estimate,-moe)
medianage <- medianage %>% separate(NAME, c("COUNTY","STATE"),sep=",") %>% mutate(medianage_est = estimate) %>% select(-estimate,-moe)
renters <- renters %>% separate(NAME, c("COUNTY","STATE"),sep=",") %>% mutate(renters_est = estimate) %>% select(-estimate,-moe)
population$STATE <- trimws(population$STATE)
income$STATE <- trimws(income$STATE)
medianage$STATE <- trimws(medianage$STATE)
renters$STATE <- trimws(renters$STATE)
population <- population %>%
mutate(STATE = as.character(STATE)) %>%
left_join(states_index, by = c("STATE"="STATENAME"))
income <- income %>%
mutate(STATE = as.character(STATE)) %>%
left_join(states_index, by = c("STATE"="STATENAME"))
medianage <- medianage %>%
mutate(STATE = as.character(STATE)) %>%
left_join(states_index, by = c("STATE"="STATENAME"))
renters <- renters %>%
mutate(STATE = as.character(STATE)) %>%
left_join(states_index, by = c("STATE"="STATENAME"))
joinsto <- "zcta"
population_zip <- get_acs(geography = joinsto, variables = "B01001_001E")
income_zip <- get_acs(geography = joinsto, variables = "B06011_001E")
medianage_zip <- get_acs(geography = joinsto, variables = "B01002_001E")
renters_zip <- get_acs(geography = joinsto, variables = "B25129_038E")
population_zip <- population_zip %>% mutate(population_est = estimate) %>% select(-estimate)
income_zip <- income_zip %>% mutate(income_est = estimate) %>% select(-estimate)
medianage_zip <- medianage_zip %>% mutate(medianage_est = estimate) %>% select(-estimate)
renters_zip <- renters_zip %>% mutate(renters_est = estimate) %>% select(-estimate)
}
population %>% slice(1:10)
## # A tibble: 10 x 8
## GEOID COUNTY STATE variable moe population_est STATEABB STATEREGION
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 01001 Autauga Cou~ Alaba~ B01001_0~ NA 55380 AL South
## 2 01003 Baldwin Cou~ Alaba~ B01001_0~ NA 212830 AL South
## 3 01005 Barbour Cou~ Alaba~ B01001_0~ NA 25361 AL South
## 4 01007 Bibb County Alaba~ B01001_0~ NA 22493 AL South
## 5 01009 Blount Coun~ Alaba~ B01001_0~ NA 57681 AL South
## 6 01011 Bullock Cou~ Alaba~ B01001_0~ NA 10248 AL South
## 7 01013 Butler Coun~ Alaba~ B01001_0~ NA 19828 AL South
## 8 01015 Calhoun Cou~ Alaba~ B01001_0~ NA 114618 AL South
## 9 01017 Chambers Co~ Alaba~ B01001_0~ NA 33660 AL South
## 10 01019 Cherokee Co~ Alaba~ B01001_0~ NA 25903 AL South
income %>% slice(1:10)
## # A tibble: 10 x 7
## GEOID COUNTY STATE variable income_est STATEABB STATEREGION
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 01001 Autauga County Alabama B06011_001 29725 AL South
## 2 01003 Baldwin County Alabama B06011_001 29802 AL South
## 3 01005 Barbour County Alabama B06011_001 17963 AL South
## 4 01007 Bibb County Alabama B06011_001 21958 AL South
## 5 01009 Blount County Alabama B06011_001 26976 AL South
## 6 01011 Bullock County Alabama B06011_001 21571 AL South
## 7 01013 Butler County Alabama B06011_001 19685 AL South
## 8 01015 Calhoun County Alabama B06011_001 24069 AL South
## 9 01017 Chambers County Alabama B06011_001 22754 AL South
## 10 01019 Cherokee County Alabama B06011_001 23121 AL South
medianage %>% slice(1:10)
## # A tibble: 10 x 7
## GEOID COUNTY STATE variable medianage_est STATEABB STATEREGION
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 01001 Autauga County Alabama B01002_001 38.2 AL South
## 2 01003 Baldwin County Alabama B01002_001 43 AL South
## 3 01005 Barbour County Alabama B01002_001 40.4 AL South
## 4 01007 Bibb County Alabama B01002_001 40.9 AL South
## 5 01009 Blount County Alabama B01002_001 40.7 AL South
## 6 01011 Bullock County Alabama B01002_001 40.2 AL South
## 7 01013 Butler County Alabama B01002_001 40.8 AL South
## 8 01015 Calhoun County Alabama B01002_001 39.6 AL South
## 9 01017 Chambers County Alabama B01002_001 42 AL South
## 10 01019 Cherokee County Alabama B01002_001 46.5 AL South
renters %>% slice(1:10)
## # A tibble: 10 x 7
## GEOID COUNTY STATE variable renters_est STATEABB STATEREGION
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 01001 Autauga County Alabama B25129_038 5715 AL South
## 2 01003 Baldwin County Alabama B25129_038 20034 AL South
## 3 01005 Barbour County Alabama B25129_038 3654 AL South
## 4 01007 Bibb County Alabama B25129_038 1763 AL South
## 5 01009 Blount County Alabama B25129_038 4424 AL South
## 6 01011 Bullock County Alabama B25129_038 1017 AL South
## 7 01013 Butler County Alabama B25129_038 1955 AL South
## 8 01015 Calhoun County Alabama B25129_038 13351 AL South
## 9 01017 Chambers County Alabama B25129_038 4376 AL South
## 10 01019 Cherokee County Alabama B25129_038 2433 AL South
In this section we will explore elements of data from the Zillow and ACS data sets in various charts. The charts will often examine a subject at increasingly specific levels of geographic hierarchy, starting at a national or state level and drilling down to the metro, city, or zip code.
A box plot of median values by state may reveal which states have the highest median home values, and what the distribution of median values is in each state. The red, blue and green horizontal lines represent the maximium, median, and minimum (respectively) of all observations in the chart. The state you are exploring will be compared to the national median.
#get a list of states
statelist <- yearend_homevalues_byzip %>%
filter(!is.na(homevalue) & !is.na(StateName)) %>%
group_by(StateName) %>%
summarize(minptr = median(homevalue)) %>%
arrange(minptr)
#extract top zips by state
topzips <- yearend_homevalues_byzip %>%
filter(City != 'Hollandale') %>% #the value in this county appears to be an anomaly/outlier
filter(year == 'eoy_2020') %>%
group_by(StateName) %>%
arrange(-homevalue) %>%
slice(1:1)
topzip_instate <- topzips %>% filter(StateName == explore_state)
#extract top zips by state
botzips <- yearend_homevalues_byzip %>%
filter(City != 'Hollandale') %>% #the value in this county appears to be an anomaly/outlier
filter(year == 'eoy_2020') %>%
group_by(StateName) %>%
arrange(homevalue) %>%
slice(1:1)
botzip_instate <- botzips %>% filter(StateName == explore_state)
#determine the highest, lowest and average median value
#these will be added as horizontal lines on the box plot
lowestmedian <- min(statelist$minptr)
meanmedian <- mean(statelist$minptr)
maxmedian <- max(statelist$minptr)
state_regions <- states_index %>% mutate(StateName = STATEABB, StateRegion = STATEREGION) %>% select(StateName,StateRegion)
yearend_homevalues_byzip <- yearend_homevalues_byzip %>% left_join(state_regions, by="StateName")
top_median_comp <- round((topzip_instate$homevalue / meanmedian) * 100,1)
bot_median_comp <- round((botzip_instate$homevalue / meanmedian) * 100,1)
#create some log10 transforms
yearend_homevalues_byzip %>%
filter(City != 'Hollandale') %>%
filter(year == 'eoy_2020') %>%
mutate(homevalue=log10(homevalue)) %>%
select(StateName,homevalue) %>%
gather(varz, length,
homevalue) %>%
ggplot(aes(StateName,
length,
fill = StateName)) +
geom_boxplot() +
geom_text_repel(data=topzips,
aes(x=StateName,
y=log10(homevalue),
label=paste(City,StateName)),color="darkgreen") +
geom_text_repel(data=botzips,
aes(x=StateName,
y=log10(homevalue),
label=paste(City,StateName)),color="red") +
geom_hline(yintercept=log10(lowestmedian),color="red") +
geom_hline(yintercept=log10(meanmedian),color="blue") +
geom_hline(yintercept=log10(maxmedian),color="darkgreen") +
#facet_wrap(~StateRegion, scales = "free", ncol=2) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position="bottom") +
labs(title = "Home Values by Zip Code - Grouped by State",
subtitle = "Top city in each state is noted",
caption = paste("Data source:",sourcefilename)
)
top_comp <- paste("The highest priced zip code in ", topzip_instate$StateName, " is ", topzip_instate$RegionName," in the City of ",topzip_instate$City," in ", topzip_instate$CountyName,". The median home sales price there is ", topzip_instate$homevalue,", which is ", top_median_comp,"% of the national median ", round(meanmedian,0) ,". ")
bot_comp <- paste("The highest priced zip code in ", botzip_instate$StateName, " is ", botzip_instate$RegionName," in the City of ",botzip_instate$City," in ", botzip_instate$CountyName,". The median home sales price there is ", botzip_instate$homevalue,", which is ", bot_median_comp,"% of the national median ", round(meanmedian,0) ,". ")
top_comp
## [1] "The highest priced zip code in CA is 94027 in the City of Atherton in San Mateo County . The median home sales price there is 6541400 , which is 3597.5 % of the national median 181831 . "
bot_comp
## [1] "The highest priced zip code in CA is 92389 in the City of Shoshone in Inyo County . The median home sales price there is 26669 , which is 14.7 % of the national median 181831 . "
The highest priced zip code in CA is 94027 in the City of Atherton in San Mateo County . The median home sales price there is 6541400 , which is 3597.5 % of the national median 181831 .
The highest priced zip code in CA is 92389 in the City of Shoshone in Inyo County . The median home sales price there is 26669 , which is 14.7 % of the national median 181831 .
This is the same home value data as above but zoomed in to a specific state. The locations with the highest home values are stamped for each metro.
selectregion <- 'West'
selectmetro <- explore_metro
year <- 2020
yearmarker <- paste('eoy_',year,sep='')
allvalues <- yearend_homevalues_byzip %>%
filter(StateRegion ==selectregion) %>%
filter(year == yearmarker)%>%
filter(!is.na(homevalue)) %>% pull(homevalue)
minmedian <- min(allvalues)
midmedian <- median(allvalues)
maxmedian <- max(allvalues)
toplocation <- yearend_homevalues_byzip %>%
filter(StateRegion ==selectregion) %>%
filter(year == yearmarker) %>%
group_by(StateName) %>%
arrange(-homevalue) %>%
slice(1:1)
botlocation <- yearend_homevalues_byzip %>%
filter(StateRegion ==selectregion) %>%
filter(year == yearmarker) %>%
group_by(StateName) %>%
arrange(homevalue) %>%
slice(1:1)
yearend_homevalues_byzip %>%
filter(StateRegion ==selectregion) %>%
filter(year == yearmarker)%>%
ggplot(aes(as.character(StateName),
log10(homevalue),
fill = StateName)) +
geom_boxplot() +
geom_text_repel(data=toplocation,color="darkgreen",
aes(x=as.character(StateName),
y=log10(homevalue),
label=paste(City,StateName))) +
geom_text_repel(data=botlocation,color="red",
aes(x=as.character(StateName),
y=log10(homevalue),
label=paste(City,StateName))) +
geom_hline(yintercept=log10(minmedian),color="red") +
geom_hline(yintercept=log10(midmedian),color="blue") +
geom_hline(yintercept=log10(maxmedian),color="darkgreen") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position="none") +
labs(title = paste(year,selectregion,"Range of Median Home Values in Each Zipcode, by State"),
subtitle = "Taller bars indicate larger range of values",
caption = paste("Data source:",sourcefilename3))