Real Estate ProspectR

In this project we will examine how to use data provided by public data sources for analysis of home values, rental rates, and movements in home price over a 17 year time period from 2004 to 2020. Analysis of this nature can be used for investment research, home value analysis, appraisal, and comparative value analysis.

Introduction & Setup

This report uses data from current, Zillow Research real estate information database (zhvi), and additional archival data, also from Zillow, which was retrieved from Kaggle. The additional data was uploaded in 2017 and includes some data metrics that do not currently appear on the Zillow Research site. The datasets include median home price estimates and rental estimates for geographical areas between the zip code and state level.

Set State, Metro, and Zip codes to Explore

Set the state, metro (MSA) and zip code you are interested in researching. Some charts that deal with more specific subjects or geographic regions will use these as filters to limit the scope of analysis. Not all charts will use all parameters.

explore_statename <- "California"
explore_state <- "CA"
explore_city <- "Santa Rosa"
explore_metro <- "Santa Rosa"
explore_county <- "Sonoma"
explore_zip <- "95409"
explore_multi_metro <- c('Santa Rosa','San Francisco','San Jose')

Environment Setup

Load packages used by this analysis project and set the default to standard notation.

# Note: this process could take a couple of minutes
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")
if(!require(ggrepel)) install.packages("ggrepel", repos = "http://cran.us.r-project.org")
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(tidycensus)) install.packages("tidycensus", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(dslabs)) install.packages("dslabs", repos = "http://cran.us.r-project.org")
if(!require(rpart.plot)) install.packages("rpart.plot", repos = "http://cran.us.r-project.org")
if(!require(rgbif)) install.packages("rgbif", repos = "http://cran.us.r-project.org")
if(!require(corrr)) install.packages("corrr", repos = "http://cran.us.r-project.org")
if(!require(splitstackshape)) install.packages("splitstackshape", repos = "http://cran.us.r-project.org")
if(!require(mice)) install.packages("mice", repos = "http://cran.us.r-project.org")
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
if(!require(data.table)) install.packages("data.table", repos = "http://cran.us.r-project.org")
if(!require(lubridate)) install.packages("lubridate", repos = "http://cran.us.r-project.org")
if(!require("rnoaa")) install.packages("rnoaa", repos = "http://cran.us.r-project.org")

library(randomForest)
library(mice)
library(tidyverse)
library(dslabs)
library(caret)
library(dplyr)
library(rpart.plot)
library("rgbif")
library(corrr)
library(splitstackshape)
library(data.table)
library(ggrepel)
library(tidycensus)
library(lubridate)
library('rnoaa')

options(scipen = 999)

Load Data Set

In this section we will conduct Extract, Transform & Load (ETL) processes on several data files. The data has been saved as a comma separated values (.csv) file in the /data/ directory relative to this Rmd file.

#set reload to false if you wish to rerun analysis without reloading the files.
reload=TRUE

#set teh working directory where the project is loaded. Data folder should be in this directory
setwd("D:/RProjects/RealEstateProspector/")

#the file name for the county level time series file
sourcefilename <- "./data/County_time_series.csv"
if(reload){
  counties <- read.csv(sourcefilename, header=TRUE, fill = TRUE, stringsAsFactors = FALSE, fileEncoding="UTF-8-BOM")
}
totalrows <- nrow(counties)

#the file name for the county level cross walk file
sourcefilename2 <- "./data/CountyCrossWalk_Zillow.csv"
if(reload){
  crosswalk <- read.csv(sourcefilename2, header=TRUE, fill = TRUE, stringsAsFactors = FALSE, fileEncoding="UTF-8-BOM")
}
totalrows2 <- nrow(crosswalk)

profiles <- counties %>% left_join(crosswalk, by=c("RegionName" = "FIPS"))

#the file name for the ZHVI source by Metro
sourcefilename3 <- "./data/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv"
if(reload){
  zhvi <- read.csv(sourcefilename3, header=TRUE, fill = TRUE, stringsAsFactors = FALSE, fileEncoding="UTF-8-BOM")
}
totalrows3 <- nrow(zhvi)

#the file name for the ZORI source by Metro
sourcefilename4 <- "./data/Metro_ZORI_AllHomesPlusMultifamily_SSA.csv"
if(reload){
  zori <- read.csv(sourcefilename4, header=TRUE, fill = TRUE, stringsAsFactors = FALSE, fileEncoding="UTF-8-BOM")
}
totalrows4 <- nrow(zori)


#the file name for the ZHVI source by Zip
sourcefilename5 <- "./data/Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv"
if(reload){
  zhvi_zip <- read.csv(sourcefilename5, 
                       header=TRUE, 
                       fill = TRUE, 
                       stringsAsFactors = FALSE, 
                       fileEncoding="UTF-8-BOM")
}
totalrows5 <- nrow(zhvi_zip)

#Join ZHVI and ZORI Metro tables on RegionID
values <- zhvi %>% left_join(zori, by=c("RegionID" = "RegionID"))

A total of 518791 rows are present in the County time series data.
A total of 3144 rows are present in the County Cross walk data.
A total of 914 rows are present in the ZHVI by Metro data.
A total of 107 rows are present in the ZORI by Metro data.
A total of 30230 rows are present in the ZHVI by Zip data.

We will now conduct some initial exploratory data analysis (EDA).

#These calls help with initial EDA - Commented out
#glimpse(zhvi)
#glimpse(zori)
#summary(zhvi)
#summary(zori)

The Data includes many characteristics of real estate data related to home values, inventories, time on market, rent estimates, and other factors.

Data Transformations

The Zillow ZHVI and ZORI Data will use end of year values for charts in this section. The ZHVI is a proprietary smoothed estimate of “typical home values” provided by zillow. It is not the exact median or average. Details about how the ZHVI is calculated is available on the Zillow research site: https://www.zillow.com/research/zhvi-user-guide/

Year End Home Values by Region

Year end home value data is available, grouped by metro.

#explore the ZHVI data on home values at the end of each year BY METRO
#2020 uses the october data that was latest available at time of this analysis
yearend_homevalues <- zhvi %>%
  select( RegionName, StateName,
                eoy_2004=X2004.12.31, 
                eoy_2005=X2005.12.31,
                eoy_2006=X2006.12.31,
                eoy_2007=X2007.12.31,
                eoy_2008=X2008.12.31,
                eoy_2009=X2009.12.31,
                eoy_2010=X2010.12.31,
                eoy_2011=X2011.12.31,
                eoy_2012=X2012.12.31,
                eoy_2013=X2013.12.31,
                eoy_2014=X2014.12.31,
                eoy_2015=X2015.12.31,
                eoy_2016=X2016.12.31,
                eoy_2017=X2017.12.31,
                eoy_2018=X2018.12.31,
                eoy_2019=X2019.12.31,
                eoy_2020=X2020.10.31) %>%
  pivot_longer(c('eoy_2004', 
                 'eoy_2005', 
                 'eoy_2006', 
                 'eoy_2007', 
                 'eoy_2008', 
                 'eoy_2009', 
                 'eoy_2010', 
                 'eoy_2011', 
                 'eoy_2012', 
                 'eoy_2013', 
                 'eoy_2014', 
                 'eoy_2015', 
                 'eoy_2016', 
                 'eoy_2017', 
                 'eoy_2018', 
                 'eoy_2019', 
                 'eoy_2020'
                 ), 
               names_to = "year", 
               values_to = "homevalue") %>% 
 mutate(varz='home_value')

Year End Home Values by Zip Code

The year end home data is also available grouped by zip code. This data will be sued for “dril down” into specific cities since thre are generally too many zip codes in a state to be effectively visualized.

#explore the ZHVI data on home values at the end of each year BY ZIP
#2020 uses the october data that was latest available at time of this analysis

yearend_homevalues_byzip <- zhvi_zip %>%
  select( RegionID, RegionName, CountyName, City, StateName, Metro,
                eoy_2004=X2004.12.31, 
                eoy_2005=X2005.12.31,
                eoy_2006=X2006.12.31,
                eoy_2007=X2007.12.31,
                eoy_2008=X2008.12.31,
                eoy_2009=X2009.12.31,
                eoy_2010=X2010.12.31,
                eoy_2011=X2011.12.31,
                eoy_2012=X2012.12.31,
                eoy_2013=X2013.12.31,
                eoy_2014=X2014.12.31,
                eoy_2015=X2015.12.31,
                eoy_2016=X2016.12.31,
                eoy_2017=X2017.12.31,
                eoy_2018=X2018.12.31,
                eoy_2019=X2019.12.31,
                eoy_2020=X2020.10.31) %>%
  pivot_longer(c('eoy_2004', 
                 'eoy_2005', 
                 'eoy_2006', 
                 'eoy_2007', 
                 'eoy_2008', 
                 'eoy_2009', 
                 'eoy_2010', 
                 'eoy_2011', 
                 'eoy_2012', 
                 'eoy_2013', 
                 'eoy_2014', 
                 'eoy_2015', 
                 'eoy_2016', 
                 'eoy_2017', 
                 'eoy_2018', 
                 'eoy_2019', 
                 'eoy_2020'
                 ), 
               names_to = "year", 
               values_to = "homevalue") %>% 
 mutate(varz='home_value')

Year-End Rent Estimate Data by Region

Rent data is more limited in the length of time the data was recorded, and is available for a limited number of metros only.

#Explore ZORI values - 2014 to present (Shorter time period)
yearend_rents <- zori %>% select(RegionName, 
                eoy_2014=X2014.12,
                eoy_2015=X2015.12,
                eoy_2016=X2016.12,
                eoy_2017=X2017.12,
                eoy_2018=X2018.12,
                eoy_2019=X2019.12,
                eoy_2020=X2020.10
                )  %>%
  pivot_longer(c('eoy_2014', 
                 'eoy_2015', 
                 'eoy_2016', 
                 'eoy_2017', 
                 'eoy_2018', 
                 'eoy_2019', 
                 'eoy_2020'
                 ), 
               names_to = "year", 
               values_to = "medianrent") %>% 
 mutate(varz='median_rent')

Loading the USCB American Community Survey

In this section we will pull in some data from the US Census Bureau’s American Community Survey using the tidycensus package. An API Key acquired from the US Census web site.

If census_api_key(’’) is blank - you will need to go fetch a key of your own, and enter it inside the emoty quotes in this method call.

Unfortunately - Zillow does not use the GEOID assigned by the USCB - so the best we can do is try to use string comparisons and a state abbreviation index to find matching geography data in USCB data sources. Additional work should be done here to create a proper index table, but the county name will be sufficient for rough matching.

The following data set are using the 5-Year ACS which is recommended for use where precision is importrant - the year ACS covers all geographies and provides the largest sample sizes, while the 1 and 3 year ACS are only available for geographies with greater populations.

Many additional subjects are available for analysis as outlined in the link below. This analysis will be expanded to include additional subjects in future versions.

https://www.census.gov/programs-surveys/acs/guidance/subjects.html

#before running this code, you must plug in your own census API KEY:
#census_api_key("")

#For reduction of key usage, this data should be loaded only once per session or as needed 
#change this to TRUE to load the census data again. 
#This must be set to true before knitting or the knit may fail.
reloadcensus=TRUE

library(datasets)
states_index <- data.frame(STATEABB=as.character(state.abb),
                           STATENAME=as.character(state.name),
                           STATEREGION=as.character(state.region))

if(reloadcensus==TRUE){
  
joinsto <- "county"

population <- get_acs(geography = joinsto, variables = "B01001_001E")
income <- get_acs(geography = joinsto, variables = "B06011_001E")
medianage <- get_acs(geography = joinsto, variables = "B01002_001E")
renters <- get_acs(geography = joinsto, variables = "B25129_038E")

population <- population %>% separate(NAME, c("COUNTY","STATE"),sep=",") %>% mutate(population_est = estimate) %>% select(-estimate)
income <- income %>% separate(NAME, c("COUNTY","STATE"),sep=",") %>% mutate(income_est = estimate) %>% select(-estimate,-moe)
medianage <- medianage %>% separate(NAME, c("COUNTY","STATE"),sep=",") %>% mutate(medianage_est = estimate) %>% select(-estimate,-moe)
renters <- renters %>% separate(NAME, c("COUNTY","STATE"),sep=",") %>% mutate(renters_est = estimate) %>% select(-estimate,-moe)

population$STATE <- trimws(population$STATE)
income$STATE <- trimws(income$STATE)
medianage$STATE <- trimws(medianage$STATE)
renters$STATE <- trimws(renters$STATE)

population <- population %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))

income <- income %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))

medianage <- medianage %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))

renters <- renters %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))


joinsto <- "zcta"

population_zip <- get_acs(geography = joinsto, variables = "B01001_001E")
income_zip <- get_acs(geography = joinsto, variables = "B06011_001E")
medianage_zip <- get_acs(geography = joinsto, variables = "B01002_001E")
renters_zip <- get_acs(geography = joinsto, variables = "B25129_038E")

population_zip <- population_zip %>% mutate(population_est = estimate) %>% select(-estimate)
income_zip <- income_zip %>% mutate(income_est = estimate) %>% select(-estimate)
medianage_zip <- medianage_zip %>% mutate(medianage_est = estimate) %>% select(-estimate)
renters_zip <- renters_zip %>% mutate(renters_est = estimate) %>% select(-estimate)


}



population %>% slice(1:10)

## # A tibble: 10 x 8
##    GEOID COUNTY       STATE  variable    moe population_est STATEABB STATEREGION
##    <chr> <chr>        <chr>  <chr>     <dbl>          <dbl> <chr>    <chr>      
##  1 01001 Autauga Cou~ Alaba~ B01001_0~    NA          55380 AL       South      
##  2 01003 Baldwin Cou~ Alaba~ B01001_0~    NA         212830 AL       South      
##  3 01005 Barbour Cou~ Alaba~ B01001_0~    NA          25361 AL       South      
##  4 01007 Bibb County  Alaba~ B01001_0~    NA          22493 AL       South      
##  5 01009 Blount Coun~ Alaba~ B01001_0~    NA          57681 AL       South      
##  6 01011 Bullock Cou~ Alaba~ B01001_0~    NA          10248 AL       South      
##  7 01013 Butler Coun~ Alaba~ B01001_0~    NA          19828 AL       South      
##  8 01015 Calhoun Cou~ Alaba~ B01001_0~    NA         114618 AL       South      
##  9 01017 Chambers Co~ Alaba~ B01001_0~    NA          33660 AL       South      
## 10 01019 Cherokee Co~ Alaba~ B01001_0~    NA          25903 AL       South

income %>% slice(1:10)

## # A tibble: 10 x 7
##    GEOID COUNTY          STATE   variable   income_est STATEABB STATEREGION
##    <chr> <chr>           <chr>   <chr>           <dbl> <chr>    <chr>      
##  1 01001 Autauga County  Alabama B06011_001      29725 AL       South      
##  2 01003 Baldwin County  Alabama B06011_001      29802 AL       South      
##  3 01005 Barbour County  Alabama B06011_001      17963 AL       South      
##  4 01007 Bibb County     Alabama B06011_001      21958 AL       South      
##  5 01009 Blount County   Alabama B06011_001      26976 AL       South      
##  6 01011 Bullock County  Alabama B06011_001      21571 AL       South      
##  7 01013 Butler County   Alabama B06011_001      19685 AL       South      
##  8 01015 Calhoun County  Alabama B06011_001      24069 AL       South      
##  9 01017 Chambers County Alabama B06011_001      22754 AL       South      
## 10 01019 Cherokee County Alabama B06011_001      23121 AL       South

medianage %>% slice(1:10)

## # A tibble: 10 x 7
##    GEOID COUNTY          STATE   variable   medianage_est STATEABB STATEREGION
##    <chr> <chr>           <chr>   <chr>              <dbl> <chr>    <chr>      
##  1 01001 Autauga County  Alabama B01002_001          38.2 AL       South      
##  2 01003 Baldwin County  Alabama B01002_001          43   AL       South      
##  3 01005 Barbour County  Alabama B01002_001          40.4 AL       South      
##  4 01007 Bibb County     Alabama B01002_001          40.9 AL       South      
##  5 01009 Blount County   Alabama B01002_001          40.7 AL       South      
##  6 01011 Bullock County  Alabama B01002_001          40.2 AL       South      
##  7 01013 Butler County   Alabama B01002_001          40.8 AL       South      
##  8 01015 Calhoun County  Alabama B01002_001          39.6 AL       South      
##  9 01017 Chambers County Alabama B01002_001          42   AL       South      
## 10 01019 Cherokee County Alabama B01002_001          46.5 AL       South

renters %>% slice(1:10)

## # A tibble: 10 x 7
##    GEOID COUNTY          STATE   variable   renters_est STATEABB STATEREGION
##    <chr> <chr>           <chr>   <chr>            <dbl> <chr>    <chr>      
##  1 01001 Autauga County  Alabama B25129_038        5715 AL       South      
##  2 01003 Baldwin County  Alabama B25129_038       20034 AL       South      
##  3 01005 Barbour County  Alabama B25129_038        3654 AL       South      
##  4 01007 Bibb County     Alabama B25129_038        1763 AL       South      
##  5 01009 Blount County   Alabama B25129_038        4424 AL       South      
##  6 01011 Bullock County  Alabama B25129_038        1017 AL       South      
##  7 01013 Butler County   Alabama B25129_038        1955 AL       South      
##  8 01015 Calhoun County  Alabama B25129_038       13351 AL       South      
##  9 01017 Chambers County Alabama B25129_038        4376 AL       South      
## 10 01019 Cherokee County Alabama B25129_038        2433 AL       South

Analysis

In this section we will explore elements of data from the Zillow and ACS data sets in various charts. The charts will often examine a subject at increasingly specific levels of geographic hierarchy, starting at a national or state level and drilling down to the metro, city, or zip code.

Boxplot of Median Home Values by State

A box plot of median values by state may reveal which states have the highest median home values, and what the distribution of median values is in each state. The red, blue and green horizontal lines represent the maximium, median, and minimum (respectively) of all observations in the chart. The state you are exploring will be compared to the national median.

#get a list of states
statelist <- yearend_homevalues_byzip %>% 
  filter(!is.na(homevalue) & !is.na(StateName)) %>% 
  group_by(StateName) %>% 
  summarize(minptr = median(homevalue)) %>% 
  arrange(minptr) 
 

#extract top zips by state
topzips <- yearend_homevalues_byzip %>% 
  filter(City != 'Hollandale') %>% #the value in this county appears to be an anomaly/outlier
  filter(year == 'eoy_2020') %>% 
  group_by(StateName) %>% 
  arrange(-homevalue) %>% 
  slice(1:1)

topzip_instate <- topzips %>% filter(StateName == explore_state)

#extract top zips by state
botzips <- yearend_homevalues_byzip %>% 
  filter(City != 'Hollandale') %>% #the value in this county appears to be an anomaly/outlier
  filter(year == 'eoy_2020') %>% 
  group_by(StateName) %>% 
  arrange(homevalue) %>% 
  slice(1:1)

botzip_instate <- botzips %>% filter(StateName == explore_state)

#determine the highest, lowest and average median value
#these will be added as horizontal lines on the box plot
lowestmedian <- min(statelist$minptr)
meanmedian <- mean(statelist$minptr)
maxmedian <- max(statelist$minptr)

state_regions <- states_index %>% mutate(StateName = STATEABB, StateRegion = STATEREGION) %>% select(StateName,StateRegion)

yearend_homevalues_byzip <- yearend_homevalues_byzip %>% left_join(state_regions, by="StateName")

top_median_comp <- round((topzip_instate$homevalue / meanmedian) * 100,1)
bot_median_comp <- round((botzip_instate$homevalue / meanmedian) * 100,1)

#create some log10 transforms
yearend_homevalues_byzip %>% 
  filter(City != 'Hollandale') %>% 
  filter(year == 'eoy_2020') %>% 
  mutate(homevalue=log10(homevalue)) %>%
  select(StateName,homevalue) %>%
  gather(varz, length, 
  homevalue) %>%
  ggplot(aes(StateName, 
             length, 
             fill = StateName)) +
  geom_boxplot() +
  geom_text_repel(data=topzips,
            aes(x=StateName, 
             y=log10(homevalue),
             label=paste(City,StateName)),color="darkgreen") +
  geom_text_repel(data=botzips,
            aes(x=StateName, 
             y=log10(homevalue),
             label=paste(City,StateName)),color="red") +
  geom_hline(yintercept=log10(lowestmedian),color="red") +
  geom_hline(yintercept=log10(meanmedian),color="blue") +
  geom_hline(yintercept=log10(maxmedian),color="darkgreen") +
  #facet_wrap(~StateRegion, scales = "free", ncol=2) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="bottom")  + 
  labs(title = "Home Values by Zip Code - Grouped by State", 
       subtitle = "Top city in each state is noted", 
       caption = paste("Data source:",sourcefilename)
       )

top_comp <- paste("The highest priced zip code in ", topzip_instate$StateName, " is ", topzip_instate$RegionName," in the City of ",topzip_instate$City," in ", topzip_instate$CountyName,". The median home sales price there is ", topzip_instate$homevalue,", which is ", top_median_comp,"% of the national median ", round(meanmedian,0) ,". ")

bot_comp <- paste("The highest priced zip code in ", botzip_instate$StateName, " is ", botzip_instate$RegionName," in the City of ",botzip_instate$City," in ", botzip_instate$CountyName,". The median home sales price there is ", botzip_instate$homevalue,", which is ", bot_median_comp,"% of the national median ", round(meanmedian,0) ,". ")

top_comp

## [1] "The highest priced zip code in  CA  is  94027  in the City of  Atherton  in  San Mateo County . The median home sales price there is  6541400 , which is  3597.5 % of the national median  181831 . "

bot_comp

## [1] "The highest priced zip code in  CA  is  92389  in the City of  Shoshone  in  Inyo County . The median home sales price there is  26669 , which is  14.7 % of the national median  181831 . "

The highest priced zip code in CA is 94027 in the City of Atherton in San Mateo County . The median home sales price there is 6541400 , which is 3597.5 % of the national median 181831 .

The highest priced zip code in CA is 92389 in the City of Shoshone in Inyo County . The median home sales price there is 26669 , which is 14.7 % of the national median 181831 .

Boxplot of Median Home Values by State in a Selected Region

This is the same home value data as above but zoomed in to a specific state. The locations with the highest home values are stamped for each metro.

selectregion <- 'West'
selectmetro <- explore_metro
year <- 2020
  
yearmarker <- paste('eoy_',year,sep='')

allvalues <- yearend_homevalues_byzip %>% 
  filter(StateRegion ==selectregion) %>% 
  filter(year == yearmarker)%>% 
  filter(!is.na(homevalue)) %>% pull(homevalue)

minmedian <- min(allvalues)
midmedian <- median(allvalues)
maxmedian <- max(allvalues)

toplocation <- yearend_homevalues_byzip %>% 
  filter(StateRegion ==selectregion) %>% 
  filter(year == yearmarker) %>% 
  group_by(StateName) %>% 
  arrange(-homevalue) %>% 
 slice(1:1)

botlocation <- yearend_homevalues_byzip %>% 
  filter(StateRegion ==selectregion) %>% 
  filter(year == yearmarker) %>% 
  group_by(StateName) %>% 
  arrange(homevalue) %>% 
 slice(1:1)

yearend_homevalues_byzip %>% 
  filter(StateRegion ==selectregion) %>% 
  filter(year == yearmarker)%>% 
  ggplot(aes(as.character(StateName), 
             log10(homevalue), 
             fill = StateName)) +
  geom_boxplot() +
  geom_text_repel(data=toplocation,color="darkgreen",
            aes(x=as.character(StateName), 
             y=log10(homevalue),
             label=paste(City,StateName))) +
  geom_text_repel(data=botlocation,color="red",
            aes(x=as.character(StateName), 
             y=log10(homevalue),
             label=paste(City,StateName))) +
  geom_hline(yintercept=log10(minmedian),color="red") +
  geom_hline(yintercept=log10(midmedian),color="blue") +
  geom_hline(yintercept=log10(maxmedian),color="darkgreen") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(year,selectregion,"Range of Median Home Values in Each Zipcode, by State"), 
       subtitle = "Taller bars indicate larger range of values", 
       caption = paste("Data source:",sourcefilename3))

Boxplot of Median Home Values by Metro in a Selected State

This is the same home value data as above but zoomed in to a specific state. The locations with the highest home values are stamped for each metro.

selectstate <- explore_state
selectmetro <- explore_metro
year <- 2020
  
yearmarker <- paste('eoy_',year,sep='')

allvalues <- yearend_homevalues_byzip %>% 
  filter(StateName ==selectstate) %>% 
  filter(year == yearmarker)%>% 
  filter(!is.na(homevalue)) %>% pull(homevalue)

minmedian <- min(allvalues)
midmedian <- median(allvalues)
maxmedian <- max(allvalues)

toplocation <- yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>% 
  filter(year == yearmarker) %>% 
  group_by(Metro) %>% 
  arrange(-homevalue) %>% 
 slice(1:1)

botlocation <- yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>% 
  filter(year == yearmarker) %>% 
  group_by(Metro) %>% 
  arrange(homevalue) %>% 
 slice(1:1)

yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>%
  filter(year == yearmarker)%>% 
  ggplot(aes(as.character(Metro), 
             log10(homevalue), 
             fill = Metro)) +
  geom_boxplot() +
  geom_text_repel(data=toplocation,color="darkgreen",
            aes(x=as.character(Metro), 
             y=log10(homevalue),
             label=paste(City,StateName))) +
  geom_text_repel(data=botlocation,color="red",
            aes(x=as.character(Metro), 
             y=log10(homevalue),
             label=paste(City,StateName))) +
  geom_hline(yintercept=log10(minmedian),color="red") +
  geom_hline(yintercept=log10(midmedian),color="blue") +
  geom_hline(yintercept=log10(maxmedian),color="darkgreen") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(year,selectstate,"Range of Median Home Values in Each Zipcode, by Metro"), 
       subtitle = "Taller bars indicate larger range of values", 
       caption = paste("Data source:",sourcefilename3))

selectstate <- explore_state
selectmetro <- explore_metro
year <- 2020
  
yearmarker <- paste('eoy_',year,sep='')

toplocation <- yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>% 
  filter(year == yearmarker) %>% 
  group_by(Metro) %>% 
  arrange(-homevalue) %>% 
  ungroup() %>% 
  slice(1:1)

botlocation <- yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>% 
  filter(year == yearmarker) %>% 
  group_by(Metro) %>% 
  arrange(homevalue) %>% 
  ungroup() %>% 
  slice(1:1)

Boxplot of Median Home Values by County in a Selected State

This is the same home value data as above but grouped by county instead of metro. Most states have more counties than metros - this chart may have an overwhelming number of counties in some cases.

selectstate <- explore_state
selectmetro <- explore_metro
year <- 2020
  
yearmarker <- paste('eoy_',year,sep='')

allvalues <- yearend_homevalues_byzip %>% 
  filter(StateName ==selectstate) %>% 
  filter(year == yearmarker)%>% 
  filter(!is.na(homevalue)) %>% pull(homevalue)

minmedian <- min(allvalues)
midmedian <- median(allvalues)
maxmedian <- max(allvalues)

toplocation <- yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>% 
  filter(year == yearmarker) %>% 
  group_by(CountyName) %>% 
  arrange(-homevalue) %>% 
  ungroup() %>% 
  slice(1:1)

yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>%
  filter(year == yearmarker)%>% 
  ggplot(aes(as.character(CountyName), 
             homevalue, 
             fill = Metro)) +
  geom_boxplot() +
  geom_hline(yintercept=minmedian,color="green") +
  geom_hline(yintercept=midmedian,color="blue") +
  geom_hline(yintercept=maxmedian,color="red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="bottom")  + 
  labs(title = paste(year,selectstate,"Range of Median Home Values"), 
       subtitle = "2004-2020 End of Year Estimates", 
       caption = paste("Data source:",sourcefilename3))

Box Plot of Home Values by Zip codes in a City

This box plot examines the home values in a specified city, by zip code.

selectstate <- explore_state
selectcity <- explore_city

allvalues <- yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>%
  #filter(year == 'eoy_2020') %>%
  filter(City == selectcity) %>%
  filter(!is.na(homevalue)) %>% pull(homevalue)

minmedian <- min(allvalues)
midmedian <- median(allvalues)
maxmedian <- max(allvalues)

yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>%
  #filter(year == 'eoy_2020') %>%
  filter(City == selectcity) %>%
  ggplot(aes(as.character(RegionName), 
             homevalue, 
             fill = as.character(RegionName))) +
  geom_boxplot() +
  geom_hline(yintercept=minmedian,color="green") +
  geom_hline(yintercept=midmedian,color="blue") +
  geom_hline(yintercept=maxmedian,color="red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste("Range of Median Home Values in",selectcity,selectstate), 
       subtitle = "2004-2020 End of Year Estimates - Taller bars indicate greater price changes", 
       caption = paste("Data source:",sourcefilename3))

Boxplot of Rent Estimates by Metro in a Selected State

In the following plot we will examine the range of rental estimates over the past 6 years from 2014 to 2020. Taller boxes means there has been more movement (generally, increases) in rents, whereas very tight boxes means very little movement has been seen in rental prices. This data set has a more limited amount of information, so teh number of metros included may not be complete.

selectstate <- explore_state
selectmetro <- explore_metro
year <- 2020

allvalues <- yearend_rents %>% 
  filter(RegionName %like% paste(", ",selectstate,sep="")) %>% 
  filter(!is.na(medianrent)) %>% pull(medianrent)

minmedian <- min(allvalues)
midmedian <- median(allvalues)
maxmedian <- max(allvalues)


yearend_rents %>%
  filter(RegionName %like%  paste(", ",selectstate,sep="")) %>%
  ggplot(aes(RegionName, 
             medianrent, 
             fill = RegionName)) +
  geom_boxplot() +
  geom_hline(yintercept=minmedian,color="green") +
  geom_hline(yintercept=midmedian,color="blue") +
  geom_hline(yintercept=maxmedian,color="red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste("2014-2020 Range of Median Rents in", selectstate," by Metro"), 
       subtitle = "Taller bars indicate greater changes in median rents over the 6 year period.", 
       caption = paste("Data source:",sourcefilename4))

Median Home Prices over Time

This grouped line plot examines the year end home values for a specified state, with one line for each metro region. Your selected explore region is indicated with a thicker line.

selectstate <- explore_state
selectcity <- explore_city

selectregion <- paste(explore_city,", ",explore_state,sep="")


labels <- yearend_homevalues %>% filter(StateName == selectstate) %>% group_by(RegionName) %>% sample_n(1)

#labels <-yearend_homevalues %>% filter(StateName == state & year == "eoy_2020")
thickline <-yearend_homevalues %>% filter(RegionName==selectregion)

yearend_homevalues %>% filter(StateName == selectstate) %>% ggplot() + 
  geom_line(aes(y=homevalue,x=year, group = RegionName, color=RegionName), size=1,alpha=.6) + 
  geom_line(data = thickline, aes(y=homevalue,x=year, group = RegionName, color=RegionName), size=2) + 
  geom_text(data = labels,aes(y=homevalue,x=year,label=RegionName,color=RegionName),hjust=0) +
  theme(legend.position="bottom") +
    theme(plot.margin = unit(c(1,3,1,1), "cm")) + 
  labs(title = paste(selectstate,"Home Values 2004-2020"), 
       subtitle = paste("2004-2020 End of Year Estimates of Median Home Price by Metro - ",explore_city,"is Highlighted"), 
       caption = paste("Data source:",sourcefilename3))

Year-over-Year Price Movement

This scatter plot indicates the median price at the end of the year, for each zip code, in a given city and state. Larger dots indicate greater year over year growth or decline in median home values. Blue indicates increases and red indicates decreases. Only cities with 2020 median value within the selected range are included. This chart demonstrates that some cities may not have yet recovered their former median values.

selectstate <- explore_state
selectcity <- explore_city

min_value <- 450000
max_value <- 500000


yearend_homevalues_byzip_mod <- yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>%
  mutate(CountyName = str_replace(CountyName,' County',''))  %>% 
  group_by(StateName,CountyName,RegionName,City,year) %>%
  summarize(homevalue = mean(homevalue)) %>%
  mutate(varz='home_value') %>% 
  mutate(yoy_gainloss = ifelse( lag(RegionName,1) == RegionName,homevalue-lag(homevalue,1), 0)) %>% 
  mutate(yoy_gainloss = ifelse( is.na(yoy_gainloss), 0,yoy_gainloss))


city_list <- yearend_homevalues_byzip_mod %>% 
   filter(StateName == selectstate) %>%
  filter(year == "eoy_2020") %>% 
  group_by(StateName,City,year) %>%
  summarize(homevalue = mean(homevalue)) %>%
  filter(homevalue <= max_value) %>% 
  filter(homevalue >= min_value)

county_list <- yearend_homevalues_byzip_mod %>% 
  filter(StateName == selectstate) %>%
  filter(year == "eoy_2020") %>% 
  group_by(StateName,CountyName,year) %>%
  summarize(homevalue = mean(homevalue)) %>%
  filter(homevalue <= max_value) %>% 
  filter(homevalue >= min_value)
  
 
yearend_homevalues_byzip_mod %>%
  filter(StateName == selectstate) %>%
  filter(City %in% city_list$City) %>%
  group_by(City,year) %>%
  summarize(homevalue = mean(homevalue), yoy_gainloss = mean(yoy_gainloss)) %>%
  ggplot(aes(as.character(City), 
             homevalue, 
             color = yoy_gainloss, 
             size = abs(yoy_gainloss))) +
  geom_point() +
  geom_line(size=1) +
  geom_text(aes(label=paste(str_replace(year,"eoy_",""))),size=3,nudge_x=.2, hjust=0) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none") +
scale_colour_gradient2(
  low = "red",
  mid = "blue",
  high = "green",
  midpoint = 0) + 
  labs(title = paste("Changes in Median Home Values in ",selectstate, " by City: Cities where 2020 Median is ",min_value,"to",max_value), 
       subtitle = "2004-2020 End of Year Estimates", 
       caption = paste("Data source:",sourcefilename3))

selectstate <- explore_state
selectcity <- explore_city

yearend_homevalues_byzip_mod <- yearend_homevalues_byzip %>% 
 mutate(varz='home_value') %>% 
  mutate(yoy_gainloss = ifelse( lag(RegionName,1) == RegionName ,homevalue-lag(homevalue,1), 0))
 
yearend_homevalues_byzip_mod %>%
  filter(StateName == selectstate) %>%
  filter(City == selectcity) %>%
  ggplot(aes(as.character(RegionName), 
             homevalue, 
             color = yoy_gainloss, 
             size = abs(yoy_gainloss))) +
  geom_point() +
 geom_line(size=1) +
   geom_text(aes(label=paste(str_replace(year,"eoy_",""))),size=3,nudge_x=.1, hjust=0) +
   facet_wrap(~StateName, ncol=3, scales="free_x") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none") +
scale_colour_gradient2(
  low = "red",
  mid = "blue",
  high = "green",
  midpoint = 0) + 
  labs(title = paste("Changes in Median Home Values in ",selectcity, " by Zipcode"), 
       subtitle = "2004-2020 End of Year Estimates", 
       caption = paste("Data source:",sourcefilename3))

Rental Price to Home Value Correlations

This plot uses a different data set from Zillow. This scatter/segment plot shows changes in median rental price vs home price in each county of the selected state. The lines represent a period from 2010 to 2017. Each dot on the line is one year later. The arrow shows the direction of time. Not all time periods are available for all areas.

#setup dataset for line chart
profiles_orig <- profiles %>% select(
  Date,
  DaysOnZillow_AllHomes,
  MedianListingPricePerSqft_DuplexTriplex,
  MedianListingPricePerSqft_AllHomes,
  PriceToRentRatio_AllHomes,
  MedianRentalPrice_AllHomes,
  MedianRentalPrice_DuplexTriplex,     
  MedianRentalPrice_1Bedroom,
  MedianRentalPrice_2Bedroom,
  MedianRentalPrice_3Bedroom,
  MedianRentalPrice_4Bedroom,
  PctOfHomesIncreasingInValues_AllHomes,
  ZHVI_2bedroom,
  ZHVI_AllHomes,
  Sale_Counts,
  Sale_Prices,
  StateName,
  CountyName,
  MetroName_Zillow,
  CBSAName
) %>% mutate(
   Date=as.Date(Date)
)

profiles_filled <- profiles_orig[rowSums(is.na(profiles_orig)) != ncol(profiles_orig),]

Set a value limit to limit the analysis to locations where the median value is below the specified limit.

# Plot distribution of each predictor stratified by order
#create some log10 transforms
min_value <- 300000
max_value <- 500000

statelookup <- explore_statename

profiles_clean <- profiles_orig %>% 
  mutate(location = paste(CountyName,StateName)) %>% 
  mutate(year = as.numeric(format(Date,'%Y'))) %>%
  filter(StateName == explore_statename) %>%
  filter(CountyName %in% county_list$CountyName)

library(dplyr)
toplocations <- profiles_clean %>% 
  dplyr::filter(year == 2015) %>% 
  dplyr::group_by(year,CountyName,StateName) %>%
  dplyr::summarize(x=mean(MedianRentalPrice_AllHomes,na.rm=T),
            y=mean(ZHVI_AllHomes,na.rm=T)
            ) %>% 
  dplyr::mutate(diff = x/y) %>% 
  dplyr::arrange(-diff) %>% 
  dplyr::slice(1:20)
toplocations

## # A tibble: 3 x 6
## # Groups:   year, CountyName [3]
##    year CountyName StateName      x       y    diff
##   <dbl> <chr>      <chr>      <dbl>   <dbl>   <dbl>
## 1  2015 Sacramento California 1374. 276175  0.00498
## 2  2015 Solano     California 1785. 331767. 0.00538
## 3  2015 Yolo       California 1694. 353067. 0.00480

maxprice <- max(profiles_clean %>% 
   filter(StateName == statelookup) %>% 
   filter(!is.na(ZHVI_AllHomes)) %>% 
   pull(ZHVI_AllHomes))
maxprice

## [1] 478400

minprice <- min(profiles_clean %>% 
   filter(StateName == statelookup) %>% 
   filter(!is.na(ZHVI_AllHomes)) %>% 
   pull(ZHVI_AllHomes))
minprice

## [1] 120500

maxrent <- max(profiles_clean %>% 
   filter(StateName == statelookup) %>% 
   filter(!is.na(MedianRentalPrice_AllHomes)) %>% 
   pull(MedianRentalPrice_AllHomes))
maxrent

## [1] 2200

minrent <- min(profiles_clean %>% 
   filter(StateName == statelookup) %>% 
   filter(!is.na(MedianRentalPrice_AllHomes)) %>% 
   pull(MedianRentalPrice_AllHomes))
minrent

## [1] 1195

  p2010 <- profiles_clean %>% 
  dplyr::filter(year == 2010) %>% 
  dplyr::filter(StateName == statelookup) %>% 
  dplyr::group_by(year,CountyName,StateName) %>%
  dplyr::summarize(x=mean(MedianRentalPrice_AllHomes),
            y=mean(ZHVI_AllHomes)
            )

   p2011 <- profiles_clean %>% 
  filter(year == 2011) %>% 
  filter(StateName == statelookup) %>% 
  group_by(year,CountyName,StateName) %>%
  dplyr::summarize(x=mean(MedianRentalPrice_AllHomes),
            y=mean(ZHVI_AllHomes)
            )

  p2012 <- profiles_clean %>% 
  filter(year == 2012) %>% 
  filter(StateName == statelookup) %>% 
 group_by(year,CountyName,StateName) %>%
  dplyr::summarize(x=mean(MedianRentalPrice_AllHomes),
            y=mean(ZHVI_AllHomes)
            )

   p2013 <- profiles_clean %>% 
  filter(year == 2013) %>% 
  filter(StateName == statelookup) %>% 
 group_by(year,CountyName,StateName) %>%
  dplyr::summarize(x=mean(MedianRentalPrice_AllHomes),
            y=mean(ZHVI_AllHomes)
            )

    p2014 <- profiles_clean %>% 
  filter(year == 2014) %>% 
  filter(StateName == statelookup) %>% 
 group_by(year,CountyName,StateName) %>%
  dplyr::summarize(x=mean(MedianRentalPrice_AllHomes),
            y=mean(ZHVI_AllHomes)
            )

  p2015 <- profiles_clean %>% 
  filter(year == 2015) %>% 
  filter(StateName == statelookup) %>% 
  group_by(year,CountyName,StateName) %>%
  dplyr::summarize(x=mean(MedianRentalPrice_AllHomes),
            y=mean(ZHVI_AllHomes)
            )

   p2016 <- profiles_clean %>% 
  filter(year == 2016) %>% 
  filter(StateName == statelookup) %>% 
 group_by(year,CountyName,StateName) %>%
  dplyr::summarize(x=mean(MedianRentalPrice_AllHomes),
            y=mean(ZHVI_AllHomes)
            )

  p2017 <- profiles_clean %>% 
  filter(year == 2017) %>% 
  filter(StateName == statelookup) %>% 
  group_by(year,CountyName,StateName) %>%
  dplyr::summarize(x=mean(MedianRentalPrice_AllHomes),
            y=mean(ZHVI_AllHomes)
            )

   pBoth11 <- p2010 %>% 
    left_join(p2011, by = c("CountyName"="CountyName","StateName"="StateName")) %>%
    mutate(dtotal = (x.y - x.x)) 
  pBoth12 <- p2011 %>% 
    left_join(p2012, by = c("CountyName"="CountyName","StateName"="StateName")) %>%
    mutate(dtotal = (x.y - x.x)) 
  pBoth13 <- p2012 %>% 
    left_join(p2013, by = c("CountyName"="CountyName","StateName"="StateName")) %>%
    mutate(dtotal = (x.y - x.x)) 
  pBoth14 <- p2013 %>% 
    left_join(p2014, by = c("CountyName"="CountyName","StateName"="StateName")) %>%
    mutate(dtotal = (x.y - x.x)) 
  pBoth15 <- p2014 %>% 
    left_join(p2015, by = c("CountyName"="CountyName","StateName"="StateName")) %>%
    mutate(dtotal = (x.y - x.x)) 
  pBoth16 <- p2015 %>% 
    left_join(p2016, by = c("CountyName"="CountyName","StateName"="StateName")) %>%
    mutate(dtotal = (x.y - x.x)) 
  pBoth17 <- p2016 %>% 
    left_join(p2017, by = c("CountyName"="CountyName","StateName"="StateName")) %>%
    mutate(dtotal = (x.y - x.x)) 
  
   ggplot() +
   geom_text(data = p2011, color="lightblue",
              aes(x=x,
              y=y,
              label=paste(CountyName,year)),
             size=4) +
  geom_text(data = p2013, color="lightblue",
              aes(x=x,
              y=y,
              label=paste(CountyName,year)),
             size=4) +
  geom_text(data = p2015, color="lightblue",
              aes(x=x,
              y=y,
              label=paste(CountyName,year)),
             size=4) +
  geom_text(data = p2017, color="blue",
              aes(x=x, 
              y=y,
              label=paste(CountyName,year)),
             size=4,
            hjust=0) + 
 geom_segment(data = pBoth11,
              aes(x=x.x, 
              y=y.x,
              xend=x.y, 
              yend=y.y,
              color=dtotal),
              
             size=1
             ) + 
geom_segment(data = pBoth12,
              aes(x=x.x, 
              y=y.x,
              xend=x.y, 
              yend=y.y,
              color=dtotal),
              
             size=1
             ) + 
geom_segment(data = pBoth13,
              aes(x=x.x, 
              y=y.x,
              xend=x.y, 
              yend=y.y,
              color=dtotal),
              
             size=1
             ) + 
geom_segment(data = pBoth14,
              aes(x=x.x, 
              y=y.x,
              xend=x.y, 
              yend=y.y,
              color=dtotal),
              
             size=1
             ) + 
 geom_segment(data = pBoth15,
              aes(x=x.x, 
              y=y.x,
              xend=x.y, 
              yend=y.y,
              color=dtotal),
             size=1) + 
      geom_segment(data = pBoth16,
              aes(x=x.x, 
              y=y.x,
              xend=x.y, 
              yend=y.y,
              color=dtotal),
              
             size=1) + 
 geom_segment(data = pBoth17,
              aes(x=x.x, 
              y=y.x,
              xend=x.y, 
              yend=y.y,
              color=dtotal),
              arrow=arrow(),
             size=1) + 
     
 geom_point(data = p2010,
              aes(x=x, 
              y=y),
             size=2) +
 geom_point(data = p2011,
              aes(x=x, 
              y=y),
             size=2) +
geom_point(data = p2012,
              aes(x=x, 
              y=y),
             size=2) +
geom_point(data = p2013,
              aes(x=x, 
              y=y),
             size=2) +
geom_point(data = p2014,
              aes(x=x, 
              y=y),
             size=2) +
 geom_point(data = p2015,
              aes(x=x, 
              y=y),
             size=2) +
geom_point(data = p2016,
              aes(x=x, 
              y=y),
             size=2) +
geom_point(data = p2017,
              aes(x=x, 
              y=y),
             size=2) +
scale_colour_gradient2(
  low = "red",
  mid = "white",
  high = "blue",
  midpoint = 0
 
) +
    coord_cartesian(xlim = c(minrent, maxrent),ylim = c(minprice, maxprice)) +
 labs(title = paste("Median Rental Price vs Home Value Over Time - Counties in ",statelookup,"w/ Med. Val <=",max_value), 
      subtitle = "2010-2017 Value vs Rent Change Dynamics", 
      caption = paste("Data source:",sourcefilename), 
      x="Median Rent Estimate",
      y="Median Home Price")

Generalized Boxplots of Misc. Columns

The following code will generate a generalized set of faceted box plots containing all specified columns in the gather operator.

statelookup <- explore_statename

# Plot distribution of each column
profiles_clean %>% 
  filter(StateName == statelookup) %>% 
  gather(varz, length,
  #ZHVI_AllHomes, 
  PctOfHomesIncreasingInValues_AllHomes,
  MedianListingPricePerSqft_AllHomes,
  #MedianRentalPrice_AllHomes,
  PriceToRentRatio_AllHomes) %>%
  ggplot(aes(paste(CBSAName), 
             length, 
             fill = paste(CBSAName))) +
  geom_boxplot() +
  facet_wrap(~varz, scales = "free", ncol=1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="bottom")  + 
  labs(title = "Distribution of Median Values", subtitle = "Taller bars indicate greater range of values over time", caption = paste("Data source:",sourcefilename))

Demographic Research

Setup Demographic Census data sources

Here we wll set up data sources for demographic research.

selectstate <- explore_state
selectmetro <- explore_metro
year <- 2020
income_to_value <- 3

yearmarker <- paste('eoy_',year,sep='')

allvalues <- yearend_homevalues_byzip %>% 
  filter(StateName ==selectstate) %>% 
  filter(year == yearmarker)%>% 
  filter(!is.na(homevalue)) %>% pull(homevalue)

minmedian <- min(allvalues)
midmedian <- median(allvalues)
maxmedian <- max(allvalues)

toplocation <- yearend_homevalues_byzip %>% 
  filter(StateName == selectstate) %>% 
  filter(year == yearmarker) %>% 
  group_by(CountyName) %>% 
  arrange(-homevalue) %>% 
  ungroup() %>% 
  slice(1:1)

Home Value vs Population

Home values in 2020 vs population estimates are explored below - the outliers are areas which have high values and low populations, or low values and high poulations.

 population %>%
   left_join(states_index, by = c("STATE"="STATENAME"))

## # A tibble: 3,220 x 10
##    GEOID COUNTY    STATE  variable   moe population_est STATEABB.x STATEREGION.x
##    <chr> <chr>     <chr>  <chr>    <dbl>          <dbl> <chr>      <chr>        
##  1 01001 Autauga ~ Alaba~ B01001_~    NA          55380 AL         South        
##  2 01003 Baldwin ~ Alaba~ B01001_~    NA         212830 AL         South        
##  3 01005 Barbour ~ Alaba~ B01001_~    NA          25361 AL         South        
##  4 01007 Bibb Cou~ Alaba~ B01001_~    NA          22493 AL         South        
##  5 01009 Blount C~ Alaba~ B01001_~    NA          57681 AL         South        
##  6 01011 Bullock ~ Alaba~ B01001_~    NA          10248 AL         South        
##  7 01013 Butler C~ Alaba~ B01001_~    NA          19828 AL         South        
##  8 01015 Calhoun ~ Alaba~ B01001_~    NA         114618 AL         South        
##  9 01017 Chambers~ Alaba~ B01001_~    NA          33660 AL         South        
## 10 01019 Cherokee~ Alaba~ B01001_~    NA          25903 AL         South        
## # ... with 3,210 more rows, and 2 more variables: STATEABB.y <chr>,
## #   STATEREGION.y <chr>

#--home values by population at zip code level

# value_pops <- yearend_homevalues_byzip %>%
#   filter(statename ==selectstate) %>%
#   filter(year == yearmarker) %>%
#   #filter(metro %in% c('cincinnati','columbus','cleveland-elyria')) %>%
#   mutate(regionname = as.character(regionname)) %>%
#   left_join(population, by = c("regionname"="geoid")) #%>%
#   #group_by(regionname) %>%
#   #summarize(population_est = sum(population_est))
#
# value_pops %>%
#   ggplot(aes(log10(population_est), log10(homevalue),color=metro)) +
#   geom_point() +
#   theme(legend.position="bottom")  +
#   geom_text(aes(label=paste(regionid,city)))

#--simplify to city level

value_pops <- yearend_homevalues_byzip %>% 
  filter(StateName ==selectstate) %>% 
  filter(year == yearmarker) %>% 
  mutate(RegionName = as.character(RegionName)) %>% 
  left_join(population_zip, by = c("RegionName"="GEOID")) %>% 
  select(City,population_est,homevalue) %>%
  group_by(City) %>%
  summarize(population_est = sum(population_est), homevalue=mean(homevalue))
  
value_pops %>% 
  ggplot(aes(log10(population_est), log10(homevalue))) + 
  geom_point() + 
  theme(legend.position="bottom")  + 
  geom_text(aes(label=paste(City))) + 
  facet_wrap(~floor(log10(population_est)),scales="free",ncol=2) + 
  labs(title = paste("Population vs Median Home Value"), 
       subtitle = "Discover where home values are higher or lower vs cities of similar population", 
       caption = paste("Data source:",sourcefilename3))

Home Value vs Number of Renters

Home values vs number of renters are explored below - the outliers are areas which have high values and low rents, or low values and high rents, etc.

#--simplify to city level

value_renters <- yearend_homevalues_byzip %>% 
  filter(StateName ==selectstate) %>% 
  filter(year == yearmarker) %>% 
  mutate(RegionName = as.character(RegionName)) %>% 
  left_join(renters_zip, by = c("RegionName"="GEOID")) %>% 
  select(City,renters_est,homevalue) %>%
  group_by(City) %>%
  summarize(renters_est = sum(renters_est), homevalue=mean(homevalue))
  
value_renters %>% 
  ggplot(aes(log10(renters_est), log10(homevalue))) + 
  geom_point() + 
  theme(legend.position="bottom")  + 
  geom_text(aes(label=paste(City))) + 
   facet_wrap(~floor(log10(renters_est)),scales="free",ncol=2) + 
  labs(title = paste("Number of Renters vs Median Home Value"), 
       subtitle = "Discover where home values are higher or lower vs cities of similar numbers of renters", 
       caption = paste("Data source:",sourcefilename3))

Home Values vs Median Income

Home values vs income are explored below - only areas which have low home values relative to median incomes are shown. Areas where home values are less than 3 times median income are shown.

#--home values by income at zip code level

value_incomes <- yearend_homevalues_byzip %>% 
  #filter(StateName ==selectstate) %>% 
  filter(year == yearmarker) %>% 
   #filter(Metro %in% c('Cincinnati','Columbus','Cleveland-Elyria')) %>% 
  mutate(RegionName = as.character(RegionName)) %>% 
  left_join(income_zip, by = c("RegionName"="GEOID")) #%>% 
  #group_by(RegionName) %>%
  #summarize(population_est = sum(population_est))

#examine areas where income x 1.2 > home value
value_incomes %>% 
  filter(income_est * income_to_value > homevalue) %>%
  filter(StateName ==selectstate) %>% 
  ggplot(aes(log10(income_est), log10(homevalue), color=Metro)) + 
  geom_point() + 
  geom_text(aes(label=ifelse(income_est * income_to_value > homevalue,paste(RegionID,City,StateName),'')) )  + 
  labs(title = paste("Counties with low home values relative to income"), 
       subtitle = paste("Cities where annual income *",income_to_value," exceeds median home value"), 
       caption = paste("Data source:",sourcefilename3))

joinsto <- "county"

population_county <- get_acs(geography = joinsto, variables = "B01001_001E")

population_county <- population_county %>% 
  separate(NAME, c("COUNTY","STATE"),sep=",") %>% 
  mutate(population_est = estimate) %>% 
  select(-estimate)

population_county$STATE <- trimws(population_county$STATE)
population_county$COUNTYNAME <- str_replace(population_county$COUNTY," County","")

population_county <- population_county %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))


population_county_instate <- population_county %>% filter(STATEABB == explore_state)

#Prop tax details
prop_tax_county <- get_acs(geography = joinsto, variables = "B25103_001E")
prop_tax_county <- prop_tax_county %>% 
  separate(NAME, c("COUNTY","STATE"),sep=",") %>% 
  mutate(tax_est = estimate) %>% 
  select(-estimate)

prop_tax_county$STATE <- trimws(prop_tax_county$STATE)
prop_tax_county$COUNTYNAME <- str_replace(prop_tax_county$COUNTY," County","")

prop_tax_county <- prop_tax_county %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))

prop_tax_county_instate <- prop_tax_county %>% filter(STATEABB == explore_state)


#average age of homes
prop_age_county <- get_acs(geography = joinsto, variables = "B25035_001E")
prop_age_county <- prop_age_county %>% 
  separate(NAME, c("COUNTY","STATE"),sep=",") %>% 
  mutate(age_est = estimate) %>% 
  select(-estimate)

prop_age_county$STATE <- trimws(prop_age_county$STATE)
prop_age_county$COUNTYNAME <- str_replace(prop_age_county$COUNTY," County","")

prop_age_county <- prop_age_county %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))

prop_age_county_instate <- prop_age_county %>% filter(STATEABB == explore_state)


#average 
prop_age_rental_county <- get_acs(geography = joinsto, variables = "B25037_003E")
prop_age_rental_county <- prop_age_rental_county %>% 
  separate(NAME, c("COUNTY","STATE"),sep=",") %>% 
  mutate(age_est = estimate) %>% 
  select(-estimate)

prop_age_rental_county$STATE <- trimws(prop_age_rental_county$STATE)
prop_age_rental_county$COUNTYNAME <- str_replace(prop_age_rental_county$COUNTY," County","")

prop_age_rental_county <- prop_age_rental_county %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))

prop_age_rental_county_instate <- prop_age_rental_county %>% filter(STATEABB == explore_state)


#average 
prop_age_owned_county <- get_acs(geography = joinsto, variables = "B25037_002E")
prop_age_owned_county <- prop_age_owned_county %>% 
  separate(NAME, c("COUNTY","STATE"),sep=",") %>% 
  mutate(age_est = ifelse(estimate == 0,1960,estimate)) %>% 
  select(-estimate)

prop_age_owned_county$STATE <- trimws(prop_age_owned_county$STATE)
prop_age_owned_county$COUNTYNAME <- str_replace(prop_age_owned_county$COUNTY," County","")

prop_age_owned_county <- prop_age_owned_county %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))

prop_age_owned_county_instate <- prop_age_owned_county %>% filter(STATEABB == explore_state)

#pct renter occupied 
prop_units_county <- get_acs(geography = joinsto, variables = "S2503_C01_001E")
prop_units_county <- prop_units_county %>% 
  separate(NAME, c("COUNTY","STATE"),sep=",") %>% 
  mutate(units_est = estimate) %>% 
  select(-estimate)

prop_units_county$STATE <- trimws(prop_units_county$STATE)
prop_units_county$COUNTYNAME <- str_replace(prop_units_county$COUNTY," County","")

prop_units_county <- prop_units_county %>% 
  mutate(STATE = as.character(STATE)) %>%
  left_join(states_index, by = c("STATE"="STATENAME"))

reload=FALSE

prop_units_county_instate <- prop_units_county %>% filter(STATEABB == explore_state)

Sales Counts by County

Locations with highest overall sale counts in 2017 (2017 is the last year for which this data is available)

#--home values by income at zip code level
hot_markets <- profiles_clean %>% 
  filter(!is.na(Sale_Counts)) %>% 
  #filter(year == 2017) %>% 
  filter(StateName == statelookup) %>% 
  select(StateName, CountyName, location, year, Sale_Counts) %>% 
  group_by(CountyName) %>%
  summarize(total_sales = sum(Sale_Counts)) %>%
  arrange(-total_sales)

#examine areas where income x 1.2 > home value
hot_markets %>% 
  ggplot(aes(reorder(CountyName,total_sales), total_sales)) + 
  geom_point() + 
  geom_text_repel(aes(label=total_sales)) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(statelookup,"Sales Counts, Selected Counties"), 
       subtitle = paste(""), 
       caption = paste("Data source:",sourcefilename3))

Fastest Sellers by County

Locations with shortest time on market in 2017 (2017 is the last year for which this data is available)

#--home values by income at zip code level
fast_sellers <- profiles_clean %>% 
  filter(year == 2017) %>% 
  filter(StateName == statelookup) %>% 
  filter(!is.na(DaysOnZillow_AllHomes)) %>% 
  select(StateName, CountyName, location, year, DaysOnZillow_AllHomes) %>% 
  group_by(CountyName) %>%
  summarize(mean_days = mean(DaysOnZillow_AllHomes)) %>%
  arrange(mean_days)


fast_sellers %>% 
  ggplot(aes(reorder(CountyName,mean_days), mean_days)) + 
  geom_point() + 
  geom_text(aes(label=round(mean_days,0)),nudge_y=1) + 
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(statelookup,"Mean time on Market,Selected COunties"), 
       subtitle = paste(""), 
       caption = paste("Data source:",sourcefilename3))

Taxes by County

Locations with shortest time on market in 2017 (2017 is the last year for which this data is available)

#--home values by income at zip code level
prop_tax_county_instate %>% 
  ggplot(aes(reorder(COUNTYNAME,tax_est), tax_est)) + 
  geom_point() + 
  geom_text_repel(aes(label=round(tax_est,0)),nudge_y=150) + 
  geom_errorbar(aes(ymin=tax_est-moe,ymax=tax_est+moe)) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(statelookup," Mean Annual Property Tax Estimate by County"), 
       subtitle = paste(""), 
       caption = paste("Data source:",sourcefilename3))

Median Property Age by County

Locations with shortest time on market in 2017 (2017 is the last year for which this data is available)

#--home values by income at zip code level
prop_age_county_instate %>% 
  ggplot(aes(reorder(COUNTYNAME,age_est), age_est)) + 
  geom_point(color="blue") + 
  geom_errorbar(aes(ymin=age_est-moe,ymax=age_est+moe)) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(statelookup," Median Property Age Estimate by County - All Homes"), 
       subtitle = paste(""), 
       caption = paste("Data source:",sourcefilename3))

#--home values by income at zip code level
prop_age_rental_county_instate %>% 
  ggplot(aes(reorder(COUNTYNAME,age_est), age_est)) + 
  geom_point(color="red") + 
  geom_errorbar(aes(ymin=age_est-moe,ymax=age_est+moe)) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(statelookup," Median Property Age Estimate by County - Renter Occupied Homes"), 
       subtitle = paste(""), 
       caption = paste("Data source:",sourcefilename3))

#--home values by income at zip code level
prop_age_owned_county_instate %>% 
  ggplot(aes(reorder(COUNTYNAME,age_est), age_est)) + 
  geom_point(color="green") + 
  geom_errorbar(aes(ymin=age_est-moe,ymax=age_est+moe)) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(statelookup," Median Property Age Estimate by County - Owner Occupied Homes"), 
       subtitle = paste(""), 
       caption = paste("Data source:",sourcefilename3))

This chart takes the rank of each county in terms of median home age across all homes. The same ranking is applied to all counties in the rental or owner homes only. The chart demonstrates that in counties with the year built after about 1963, usually the age of rental homes tends to be older than owner occupied homes. This gap widens in in the counties that have the newest homes.

#--home values by income at zip code level

prop_age_county_instate$rank <- rank(prop_age_county_instate$age_est, ties.method="first")


prop_ranks <- prop_age_county_instate %>% select(COUNTY,rank)

prop_age_rental_county_instate <- prop_age_rental_county_instate %>% left_join(prop_ranks, by="COUNTY")
prop_age_owned_county_instate <- prop_age_owned_county_instate %>% left_join(prop_ranks, by="COUNTY")

ranks <- prop_age_county_instate$rank
counties <- prop_age_county_instate$COUNTY
  
prop_age_county_instate %>% 
  ggplot() + 
  geom_point(data=prop_age_county_instate, aes(rank(age_est), age_est),color="blue") + 
  geom_smooth(data=prop_age_county_instate, aes(rank(age_est), age_est), method="loess",color="blue",se=F) + 
  geom_point(data=prop_age_rental_county_instate, aes(rank(age_est), age_est),color="red") + 
  geom_smooth(data=prop_age_rental_county_instate, aes(rank(age_est), age_est), method="loess",color="red",se=F) + 
  geom_point(data=prop_age_owned_county_instate, aes(rank(age_est), age_est),color="green") + 
  geom_smooth(data=prop_age_owned_county_instate, aes(rank(age_est), age_est), method="loess",color="green",se=F) +   scale_x_continuous(breaks=ranks,labels=counties) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = paste(statelookup," Median Home age by Tenure"), 
       subtitle = paste("Red = Renter Occupied, Blue = All Homes, Green = Owner Occupied"),
       caption = paste("Data source:",sourcefilename3),
       x="Counties (Oldest to Newest)",
       y="Median Age of Homes")

Population by County

County population statistics

#--home values by income at zip code level
population_county_instate %>% 
  ggplot(aes(reorder(COUNTYNAME,population_est), population_est)) + 
  geom_point() + 
  geom_text_repel(aes(label=round(population_est,0)),nudge_y=150) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(statelookup," Population By County"), 
       subtitle = paste(""), 
       caption = paste("Data source:",sourcefilename3))

#--home values by income at zip code level
prop_units_county_instate %>% 
  ggplot(aes(reorder(COUNTYNAME,units_est), units_est)) + 
  geom_point() + 
  geom_text_repel(aes(label=round(units_est,0)),nudge_y=150) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(statelookup," Housing Units By County"), 
       subtitle = paste(""), 
       caption = paste("Data source:",sourcefilename3))

pop_county <- population_county_instate %>% select(COUNTY,population_est)

prop_units_pop <- prop_units_county_instate %>% left_join(pop_county, by="COUNTY")
prop_units_pop$unitsperperson <- prop_units_pop$units_est/prop_units_pop$population_est
prop_units_pop$rank <- rank(prop_units_pop$unitsperperson)

#--home values by income at zip code level
prop_units_pop %>% 
  ggplot(aes(rank, unitsperperson)) + 
  geom_point() + 
  geom_text_repel(aes(label=paste(COUNTY,round(unitsperperson,2)))) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(legend.position="none")  + 
  labs(title = paste(statelookup," Housing Units Per Person"), 
       subtitle = paste(""), 
       caption = paste("Data source:",sourcefilename3),
       x="Rank of Housing Units per Person",
       y="Number of Housing Units per Person")

Comments

Real estate Data provided by Zillow Research https://www.zillow.com/research/data/

Weather Data provided by NOAA - National Climatic Data Center https://www.ncdc.noaa.gov

Census Data provided by USCB Ammerican Community Survey https://api.census.gov/data/2019/acs/acs5/variables.html

Real Estate ProspectR

Ryan J. Cooper

11/21/2020

Real Estate ProspectR

Introduction & Setup

Set State, Metro, and Zip codes to Explore

Environment Setup

Load Data Set

Data Transformations

Year End Home Values by Region

Year End Home Values by Zip Code

Year-End Rent Estimate Data by Region

Loading the USCB American Community Survey

Analysis

Boxplot of Median Home Values by State

Boxplot of Median Home Values by State in a Selected Region

Boxplot of Median Home Values by Metro in a Selected State

Boxplot of Median Home Values by County in a Selected State

Box Plot of Home Values by Zip codes in a City

Boxplot of Rent Estimates by Metro in a Selected State

Median Home Prices over Time

Year-over-Year Price Movement

Rental Price to Home Value Correlations

Generalized Boxplots of Misc. Columns

Demographic Research

Setup Demographic Census data sources

Home Value vs Population

Home Value vs Number of Renters

Home Values vs Median Income

Sales Counts by County

Fastest Sellers by County

Taxes by County

Median Property Age by County

Population by County

Comments