House Price Prediction, Philadelphia

real estate
code
analysis
Author

Tianxiao & Ling

Published

September 27, 2023

Introduction

In light of recent concerns about the accuracy of Zillow’s housing market predictions, our team has undertaken an effort to improve house price prediction in Philadelphia. Our primary goal is to support more informed decision-making for the City of Philadelphia by leveraging machine learning techniques. To achieve better accuracy and generalizability, we take a comprehensive approach, incorporating indicators from a range of fields so that our predictions align more closely with real-world conditions. This report documents our approach and methods in detail. It emphasizes the importance of understanding the local context and of integrating that knowledge with pre-existing data, with the overarching objective of providing a richer and more precise picture of Philadelphia’s housing landscape for both current decision-making and future planning.

Code
data <- st_read("https://raw.githubusercontent.com/mafichman/musa_5080_2023/main/Midterm/data/2023/studentData.geojson")

variables <- c('objectid',
               'census_tract',
               'parcel_shape',
               'building_code',
               'quality_grade',
               'central_air',
               'depth',
               'exempt_land',
               'exterior_condition',
               'fireplaces',
               'frontage',
               'fuel',
               'garage_spaces',
               'garage_type',
               'general_construction',
               'interior_condition',
               'number_of_bathrooms',
               'number_of_bedrooms',
               'number_of_rooms',
               'number_stories',
               'off_street_open',
               'sale_price',
               'year_built',
               'toPredict')

f_df <- data[, variables]

Data Cleaning

To make the data usable for modeling, we first clean it. This step includes initial data exploration, handling of outliers and N/A values, and joins with other relevant datasets.

Data Exploration

First, we explore the variables that describe the condition of each house. The frequency distribution of each variable roughly follows a normal distribution, which supports the assumptions of the regression model we choose later.

Code
ex_his <- ggplot(f_df, aes(x = exterior_condition)) +
  geom_histogram(binwidth = 1, fill = "grey", color = "black") +
  labs(title = "exterior", x = "Level", y = "Frequency")

in_his <- ggplot(f_df, aes(x = interior_condition)) +
  geom_histogram(binwidth = 1, fill = "grey", color = "black") +
  labs(title = "interior", x = "Level", y = "Frequency")

qua_his <- ggplot(f_df, aes(x = quality_grade)) +
  geom_histogram(stat='count',fill = "grey", color = "black") +
  labs(title = "quality", x = "Level", y = "Frequency")

par_his <- ggplot(f_df, aes(x = parcel_shape)) +
  geom_histogram(stat='count',fill = "grey", color = "black") +
  labs(title = "parcel", x = "Level", y = "Frequency")

nos_his <- ggplot(f_df, aes(x = number_stories)) +
  geom_histogram(stat='count',fill = "grey", color = "black") +
  labs(title = "num_story", x = "Level", y = "Frequency")

nobed_his <- ggplot(f_df, aes(x = number_of_bedrooms)) +
  geom_histogram(stat='count',fill = "grey", color = "black") +
  labs(title = "num_bed", x = "Level", y = "Frequency")

nobath_his <- ggplot(f_df, aes(x = number_of_bathrooms)) +
  geom_histogram(stat='count',fill = "grey", color = "black") +
  labs(title = "num_bath", x = "Level", y = "Frequency")

gencon_his <- ggplot(f_df, aes(x = general_construction)) +
  geom_histogram(stat='count',fill = "grey", color = "black") +
  labs(title = "num_construction", x = "Level", y = "Frequency")

grid.arrange(ex_his, in_his, qua_his, par_his, nos_his,nobed_his, nobath_his,gencon_his, ncol = 4)

When we examine the distribution of sale price and the other variables, we find that the data contain many outliers with much larger values. To reduce the effect of these outliers on model performance, we assume they stem from recording mistakes and cap them at a maximum value. In addition, because some variables are missing large amounts of data (which we define as more than 75% missing), we drop those variables to preserve objectivity as far as possible.

Code
sale_box <- ggplot(f_df, aes(y = sale_price)) +
  geom_boxplot() +
  labs(title = "Boxplot of sale price",
       x = "sale price",
       y = "Value")+
  theme(
    plot.title = element_text(size = 10),  
    axis.title.x = element_text(size = 8),  
    axis.title.y = element_text(size = 8)
  )

ostrt_box <- ggplot(f_df, aes(y = off_street_open)) +
  geom_boxplot() +
  labs(title = "Boxplot of off_street_open",
       x = "offstreet_space",
       y = "Value")+
  theme(
    plot.title = element_text(size = 10),  
    axis.title.x = element_text(size = 8),  
    axis.title.y = element_text(size = 8)
  )

dep_box <- ggplot(f_df, aes(y = depth)) +
  geom_boxplot() +
  labs(title = "Boxplot of depth",
       x = "depth",
       y = "Value")+
  theme(
    plot.title = element_text(size = 10),  
    axis.title.x = element_text(size = 8),  
    axis.title.y = element_text(size = 8)
  )

fr_box <- ggplot(f_df, aes(y = frontage)) +
  geom_boxplot() +
  labs(title = "Boxplot of frontage",
       x = "frontage",
       y = "Value")+
  theme(
    plot.title = element_text(size = 10),  
    axis.title.x = element_text(size = 8),  
    axis.title.y = element_text(size = 8)
  )

grid.arrange(sale_box, ostrt_box, dep_box, fr_box, ncol = 4)

Code
df_pro <- f_df
df_pro$frontage <- pmin(df_pro$frontage, 500)
df_pro$depth <- pmin(df_pro$depth, 600)
df_pro$sale_price <- pmin(df_pro$sale_price,4000000)
df_pro$off_street_open <- pmin(df_pro$off_street_open,10000)
df_pro$number_of_bedrooms <- pmin(df_pro$number_of_bedrooms,10)
df_pro$number_of_bathrooms <- pmin(df_pro$number_of_bathrooms,5)

df_pro <- subset(df_pro,select = -c(number_of_rooms,general_construction,exempt_land,fireplaces))

df_pro$central_air[is.na(df_pro$central_air)] <- 'N'
df_pro$depth[is.na(df_pro$depth)] <- 0
df_pro <- df_pro %>%
  filter(!is.na(exterior_condition))%>%
  filter(!is.na(interior_condition))%>%
  filter(!is.na(number_of_bedrooms))%>%
  filter(!is.na(number_stories))%>%
  filter(!is.na(number_of_bathrooms))%>%
  filter(!is.na(off_street_open))%>%
  filter(!is.na(garage_spaces))

Multi-dimensional data

To develop a richer understanding of the housing market, we acquired data that looks beyond the house itself. We gathered data on external characteristics, including each house’s proximity to public facilities such as commercial corridors, private schools, and green space. We also examined the economic and racial demographics of the geographic area in which each home is located.

Census Data

The first source is the U.S. Census Bureau, specifically the American Community Survey (ACS). We used data from the 2021 five-year ACS, the most recent five-year ACS data available.

Code
tracts21 <-  
  get_acs(geography = "tract",
          variables = c("B01001_001E", # ACS total Pop estimate
              'B02001_002E', # Estimate!!Total:!!White alone
              'B02001_003E', # Estimate!!Total:!!Black or African American alone
              'B02001_005E', # Estimate!!Total:!!Asian alone
              'B19013A_001E',# Median household income White alone
              'B19013B_001E',# Median household income Black alone
              'B19013D_001E',# Median household income Asian alone
              'B19013_001E', # Median household income 
              'B01002_001E', # Median Age by Sex 
              "B25002_001E", # Estimate of total housing units
              "B25002_003E", # Number of vacant housing units
              "B06009_006E", # Total graduate or professional degree
              "B15001_050E", # Female: bachelor's degree
              "B15001_009E", # Male: bachelor's degree
              "B25058_001E", # Median contract rent
              "B06012_002E", # Total below poverty level
              'B25105_001E', # Median monthly housing costs 
              'B25037_001E', # Median year structure built --!!Total
              'B10058_002E', # Total:!!In labor force
              'B10058_007E', # Not in labor force
              'B08141_017E', # Public transportation:!!No vehicle available
              'B08141_018E', # Public transportation:!!1 vehicle available 
              'B09002_002E', # OWN CHILDREN UNDER 18 YEARS married family
              'B09002_015E', # Female householder
              'B14001_004E'), # Enrolled in kindergarten 
          year=2021, state=42,
          county=101, geometry=TRUE, output = "wide") %>% 
  st_transform('ESRI:102728') %>%

  rename(TotalPop = B01001_001E, 
         White = B02001_002E,
         Black = B02001_003E,
         Asian = B02001_005E,
         MedHHInc_White = B19013A_001E,
         MedHHInc_Black = B19013B_001E,
         MedHHInc_Asian = B19013D_001E,
         MedHHInc = B19013_001E,
         Median_Age_bysex = B01002_001E,
         Total_Housing_Units = B25002_001E,
         Vacant_Housing_Units = B25002_003E,
         Total_Graduate_Prof_Degree = B06009_006E,
         FemaleBachelors = B15001_050E, 
         MaleBachelors = B15001_009E,
         MedRent = B25058_001E,
         TotalPoverty = B06012_002E,
         Med_mon_hscst = B25105_001E,
         Med_built_year = B25037_001E,
         Lab_force = B10058_002E,
         no_lav_force = B10058_007E,
         no_pub_trans = B08141_017E,
         one_pub_trans = B08141_018E,
         own_child = B09002_002E,
         own_child_fe = B09002_015E,
         enroll_kinder = B14001_004E)
Getting data from the 2017-2021 5-year ACS
Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
Code
tracts21 <- tracts21[, !grepl("M$", names(tracts21))]

mean_in_w <- mean(tracts21$MedHHInc_White,na.rm = TRUE)
tracts21$MedHHInc_White[is.na(tracts21$MedHHInc_White)] <- mean_in_w

mean_in_b <- mean(tracts21$MedHHInc_Black,na.rm = TRUE)
tracts21$MedHHInc_Black[is.na(tracts21$MedHHInc_Black)] <- mean_in_b

mean_in_a <- mean(tracts21$MedHHInc_Asian,na.rm = TRUE)
tracts21$MedHHInc_Asian[is.na(tracts21$MedHHInc_Asian)] <- mean_in_a

mean_in <- mean(tracts21$MedHHInc,na.rm = TRUE)
tracts21$MedHHInc[is.na(tracts21$MedHHInc)] <- mean_in

med_age <- mean(tracts21$Median_Age_bysex,na.rm = TRUE)
tracts21$Median_Age_bysex[is.na(tracts21$Median_Age_bysex)] <- med_age

med_rent <- mean(tracts21$MedRent,na.rm = TRUE)
tracts21$MedRent[is.na(tracts21$MedRent)] <- med_rent

mean_medmonhh <- mean(tracts21$Med_mon_hscst,na.rm = TRUE)
tracts21$Med_mon_hscst[is.na(tracts21$Med_mon_hscst)] <- mean_medmonhh

med_b_yr <- median(tracts21$Med_built_year,na.rm = TRUE)
tracts21$Med_built_year[is.na(tracts21$Med_built_year)] <- med_b_yr
Code
df_pro <- df_pro %>%
  st_transform('ESRI:102728')
tracts21 <- tracts21 %>%
  st_transform('ESRI:102728')

hh_tract <- st_join(df_pro,tracts21, join = st_within)

hh_tract <- hh_tract %>%
  st_transform('ESRI:102728')

hh_tract <- subset(hh_tract,select = -c(central_air,fuel,garage_type,Asian,MedHHInc_White,MedHHInc_Black,MedHHInc_Asian,FemaleBachelors,MaleBachelors,Med_built_year,one_pub_trans,no_lav_force))

Get Data from Open Data Philly

OpenData Philly is an open data portal, officially sponsored by the City of Philadelphia, that provides a catalog of free data. The data provided by the city and other organizations allows us to collect a wide variety of information we can use to categorize, describe, and profile key geographic characteristics which might impact price. We use the following datasets as sources of potential variables:

  • Commercial Corridors – records the locations of commercial corridors in Philadelphia.

  • Schools – records the locations of schools.

  • City Facilities – records different types of city amenities.

  • Neighborhoods-Philadelphia – planning districts used as a proxy for neighborhoods.

  • Crime – records crime locations in Philadelphia with detailed types.

Code
boundary <- st_read("https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson") %>%
  st_transform('ESRI:102728')

Commercial_Corridors <- st_read("https://opendata.arcgis.com/datasets/f43e5f92d34e41249e7a11f269792d11_0.geojson") %>%
  st_transform('ESRI:102728')

crime <- read.csv('https://phl.carto.com/api/v2/sql?filename=incidents_part1_part2&format=csv&q=SELECT%20*%20,%20ST_Y(the_geom)%20AS%20lat,%20ST_X(the_geom)%20AS%20lng%20FROM%20incidents_part1_part2%20WHERE%20dispatch_date_time%20%3E=%20%272021-01-01%27%20AND%20dispatch_date_time%20%3C%20%272022-01-01%27')
Crime <-
  crime %>%
    filter(text_general_code == "Aggravated Assault Firearm",
           lat > -1) %>%
    dplyr::select(lat, lng) %>%
    na.omit() %>%
    st_as_sf(coords = c("lng", "lat"), crs = "EPSG:4326") %>%
    st_transform('ESRI:102728') %>%
    distinct()

School <- st_read("https://opendata.arcgis.com/datasets/d46a7e59e2c246c891fbee778759717e_0.geojson") %>%
  st_transform('ESRI:102728')

CityFacilities <- st_read("https://opendata.arcgis.com/datasets/b3c133c3b15d4c96bcd4d5cc09f19f4e_0.geojson") %>%
  st_transform('ESRI:102728')

nhoods <- 
  st_read('https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson') %>%
  st_transform('ESRI:102728')%>%
  dplyr::select(name)
Code
hh_tract <- st_join(hh_tract, nhoods, join = st_within)

School: Private

Homes located in proximity to reputable educational institutions, particularly private schools, often enjoy increased desirability and can fetch higher prices in the real estate market. As a result, we assess the presence of private schools within a 0.25-mile radius of each house. This distance allows for convenient commuting for students and parents while providing access to a variety of educational activities and resources.

Code
School <- School %>%
  dplyr::select(OBJECTID,GRADE_LEVEL,TYPE_SPECIFIC)

School_Private <- School %>%
    filter(TYPE_SPECIFIC == "PRIVATE") %>%
    distinct() %>%
    dplyr::select(geometry)
Code
hh_tract$school_1320 <- hh_tract %>% 
    st_buffer(1320) %>% 
    aggregate(mutate(School_Private, counter = 1),., sum) %>%
    pull(counter)
hh_tract$school_1320[is.na(hh_tract$school_1320)] <- 0

Distance to Commercial Corridors

The proximity to commercial corridors can be a significant factor in attracting more residents to move in. This proximity offers residents the convenience of having grocery stores, restaurants, shops, and other amenities nearby. Additionally, it enhances walkability and fosters more economic activity, contributing to a livelier neighborhood. Therefore, we calculate the distance between each house and the nearest commercial corridor. However, there might also be potential risks of noise, traffic, and crimes, which may influence people’s decision-making process.

Code
nearest_dists <- st_distance(hh_tract, Commercial_Corridors)
closest_distances <- apply(nearest_dists, 1, min)
hh_tract <- hh_tract %>%
  mutate(dis_cc = closest_distances) 

Crime: Aggravated Assault within 0.25/0.125 miles

Crime is often a crucial neighborhood indicator. Homebuyers frequently prefer properties situated in neighborhoods with lower crime rates, as it ensures a safer and more desirable living environment. By calculating crime counts within different buffer distances of 0.25 miles and 0.125 miles from each house, we aim to account for the potential influence of varying crime ranges on home values. Then, we utilize a k-nearest neighbor approach to provide an additional perspective on the relationship between crime and home values, which calculates the distance between each house and the nearest crime incidents, considering multiple nearest neighbors (k) ranging from 1 to 5. This helps us understand how different levels of crime proximity might affect home values.

Code
hh_tract$crimes.Buffer_660 <- hh_tract %>% 
    st_buffer(660) %>% 
    aggregate(mutate(Crime, counter = 1),., sum) %>%
    pull(counter)

hh_tract$crimes.Buffer_1320 <- hh_tract %>% 
    st_buffer(1320) %>% 
    aggregate(mutate(Crime, counter = 1),., sum) %>%
    pull(counter)

hh_tract$crimes.Buffer_660[is.na(hh_tract$crimes.Buffer_660)] <- 0 
hh_tract$crimes.Buffer_1320[is.na(hh_tract$crimes.Buffer_1320)] <- 0
Code
## Nearest Neighbor Feature
hh_tract <-
  hh_tract %>% 
    mutate(
      crime_nn1 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(Crime), k = 1),
      
      crime_nn2 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(Crime), k = 2), 
      
      crime_nn3 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(Crime), k = 3), 
      
      crime_nn4 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(Crime), k = 4), 
      
      crime_nn5 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(Crime), k = 5)) 
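The nn_function helper used above is not defined in this report; it comes from the course utilities. A minimal base-R sketch, under the assumption that it returns the average Euclidean distance from each origin point to its k nearest destination points (the name nn_function_sketch is ours, for illustration only):

```r
# Sketch of a nearest-neighbor distance helper: for each row of
# measureFrom, compute the mean distance to its k closest rows of
# measureTo. Both arguments are two-column coordinate matrices.
nn_function_sketch <- function(measureFrom, measureTo, k) {
  apply(measureFrom, 1, function(pt) {
    d <- sqrt((measureTo[, 1] - pt[1])^2 + (measureTo[, 2] - pt[2])^2)
    mean(sort(d)[seq_len(k)])   # average of the k smallest distances
  })
}

# Toy check: one house at the origin, incidents at distances 3, 4, and 5
houses <- matrix(c(0, 0), ncol = 2)
crimes <- matrix(c(3, 0,
                   0, 4,
                   5, 0), ncol = 2, byrow = TRUE)
nn_function_sketch(houses, crimes, k = 2)  # (3 + 4) / 2 = 3.5
```

As k grows, the feature smooths from "distance to the single nearest incident" toward a broader measure of local exposure, which is why the model above tries k from 1 to 5.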

City Facilities: Park and recreation, Library, Museum and zoo

City facilities such as parks and recreation areas, libraries, museums, and zoos are essential amenities that can significantly influence the attractiveness and livability of a neighborhood. Homebuyers often value proximity to these facilities, as they contribute to a higher quality of life and a sense of community. We therefore count the facilities within a 0.25-mile buffer of each house, a reasonable walking distance for residents, and then apply the k-nearest neighbor approach again to assess how proximity to these facilities may affect home values.

Code
CityFacilities <- CityFacilities %>%
dplyr::filter(ASSET_GROUP1 == "A11" | ASSET_GROUP1 == "A18" | ASSET_GROUP1 == "A8") %>%
  select(geometry)%>%
  distinct()

hh_tract$facilities.Buffer <- hh_tract %>% 
    st_buffer(1320) %>% 
    aggregate(mutate(CityFacilities, counter = 1),., sum) %>%
    pull(counter)

hh_tract$facilities.Buffer[is.na(hh_tract$facilities.Buffer)] <- 0
Code
## Nearest Neighbor Feature
hh_tract <-
  hh_tract %>% 
    mutate(
      facilities_nn1 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(CityFacilities), k = 1),
      
      facilities_nn2 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(CityFacilities), k = 2), 
      
      facilities_nn3 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(CityFacilities), k = 3), 
      
      facilities_nn4 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(CityFacilities), k = 4), 
      
      facilities_nn5 = nn_function(st_coordinates(hh_tract), 
                              st_coordinates(CityFacilities), k = 5)) 

Home Price Scatterplots

The four graphs below show four key predictor variables that broadly cover different aspects of a neighborhood relevant to house prices. Interior condition directly reflects the quality of the house and relates strongly to its cost. The number of crimes within 0.25 miles of the house represents the degree of security. The number of amenities such as parks represents access to public services. Last but not least, the number of female householders with their own children in the house’s tract indicates the potential demand for schools in the area.

Code
## Home Features cor
st_drop_geometry(hh_tract) %>% 
  dplyr::select(sale_price, own_child_fe, crimes.Buffer_660, facilities.Buffer,interior_condition) %>%
  filter(sale_price <= 1000000) %>%
  gather(Variable, Value, -sale_price) %>% 
   ggplot(aes(Value, sale_price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 2, scales = "free") +
     labs(title = "4 variables' correlation to sale price") +
     plotTheme()
`geom_smooth()` using formula = 'y ~ x'

Correlation Test from the Correlation Matrix

A correlation matrix compares factors against one another to show how strongly each is related to the others. Each cell indicates whether there is a direct or inverse correlation between two variables, or little relationship at all. We see strong correlations among many factors. From the matrix, the number of schools, the number of crimes within roughly 400 meters, the number of female householders with their own children in the tract, and the level of poverty where the house is located show positive correlations with house price, while median household income, median rent, and median monthly housing cost show negative correlations. These strong correlations serve as an important basis for variable selection in the house price prediction model.

Code
hh_tract_pre <- hh_tract%>%
  dplyr::select(-c(objectid,census_tract,GEOID,Total_Housing_Units))

numericVars <- 
  select_if(st_drop_geometry(hh_tract_pre), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables") 

Statistical table for variable selected

The model used for predicting house price includes the variables in the table below. Specifically, we use internal characteristics (interior condition, number of bathrooms, number of bedrooms), public services (number of schools within 0.25 miles, number of amenities within 0.25 miles, distance to the nearest commercial corridor), and socio-economic spatial structure (number of crimes within 0.125 miles, poverty level, public transit accessibility, and number of female householders with their own children).

Code
hh_model <- subset(hh_tract,select = c('sale_price','interior_condition','number_of_bathrooms','number_of_bedrooms','school_1320','crimes.Buffer_660','own_child_fe','TotalPoverty','facilities.Buffer','dis_cc','no_pub_trans'))

sumtable(hh_model,
         summ=c('min(x)',
                'mean(x)',
                'median(x)',
                'sd(x)',
                'pctile(x)[25]', 
                'pctile(x)[75]',
                'max(x)'))
Summary Statistics
Variable Min Mean Median Sd Pctile[25] Pctile[75] Max
sale_price 0 267494 225000 234256 135000 325000 4000000
interior_condition 0 3.7 4 0.83 4 4 7
number_of_bathrooms 0 1.1 1 0.66 1 1 5
number_of_bedrooms 0 2.6 3 1.2 2 3 10
school_1320 0 7.5 5 7.5 2 11 49
crimes.Buffer_660 0 2 1 2.4 0 3 20
own_child_fe 0 406 305 363 118 604 1876
TotalPoverty 0 1046 853 787 435 1475 4127
facilities.Buffer 0 4.7 4 5 1 7 41
dis_cc 0 672 548 639 217 957 7386
no_pub_trans 0 170 123 154 58 258 761

Map of Home Prices

This map explores how home prices varied across Philadelphia in 2021. We can see from the map that home prices are higher in Center City and the outermost neighborhoods than in other areas.

Code
hh_tract <- hh_tract %>%
  mutate(sale_class = cut(sale_price, breaks = c(0, 20000, 100000, 200000, 400000, max(hh_tract$sale_price, na.rm = TRUE))))

ggplot() +
  geom_sf(data = nhoods, fill = "grey80") +
  geom_sf(data = hh_tract, aes(colour = sale_class), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=c('0-$20k','$20k-$100k', '$100k-$200k', '$200k-$400k', '$400k+'),
                   name="Home Prices\n($)") +
  labs(title="Sale Price, Philadelphia") +
  mapTheme()
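The q5() and qBr() helpers used in the maps of this report come from the course utilities and are not defined here. A rough base-R sketch of the underlying idea (binning a numeric variable into quintiles for mapping; q5_sketch is our name for illustration):

```r
# Bin a numeric vector into 5 quantile classes, as a quintile map does.
q5_sketch <- function(v) {
  cut(v,
      breaks = quantile(v, probs = seq(0, 1, by = 0.2), na.rm = TRUE),
      include.lowest = TRUE, labels = 1:5)
}

vals <- seq(5, 50, by = 5)    # 10 evenly spaced toy values
table(q5_sketch(vals))        # each quintile bin receives two values
```

Quantile breaks keep each color class roughly equally populated, which is why skewed variables such as sale price still produce a readable map.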

Interesting Maps

High Education Level Population Distribution

The distribution of individuals with higher education closely mirrors the pattern of home prices. This demographic tends to reside predominantly in the city center or in the suburban areas on the outskirts of the city.

Code
ggplot()+
  geom_sf(data = nhoods, fill = "grey80") +
  geom_sf(data=hh_tract,aes(colour = q5(Total_Graduate_Prof_Degree)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(hh_tract,"Total_Graduate_Prof_Degree"),
                   name="Population") +
  labs(title="Distribution of population with a high level of education, Philadelphia") +
mapTheme() 

Facilities Density

Regarding facilities, the majority are concentrated in University City and the western part of Center City. This clustering is clearly visible and gradually thins outward toward the city’s periphery.

Code
ggplot() +
  geom_sf(data = nhoods, fill = "grey80") +
  stat_density2d(data = data.frame(st_coordinates(CityFacilities)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of City Facilities, Philadelphia") +
  mapTheme()

Private School Density

Private schools are concentrated in the center of the city and in its northwestern part, while the southeast corner and the south are relatively sparsely covered.

Code
ggplot() +
  geom_sf(data = nhoods, fill = "grey80") +
  stat_density2d(data = data.frame(st_coordinates(School_Private)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Private Schools, Philadelphia") +
  mapTheme()

Assault Density

Regarding the concentration of aggravated assaults, there are two distinct areas of concern. One is located to the west of University City, and the other is situated to the north of Center City. In these areas, both property sale prices and educational attainment levels among residents tend to be relatively low.

Code
ggplot() +
  geom_sf(data = nhoods, fill = "grey80") +
  stat_density2d(data = data.frame(st_coordinates(Crime)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Aggravated Assaults, Philadelphia") +
  mapTheme()

Data Modeling

Table of Results (Training)

We have decided on a filtered list of key predictor variables for estimating home values: interior condition, number of bathrooms, number of bedrooms, number of schools within 0.25 miles, number of crimes within 0.125 miles, number of female householders with their own children in the tract, total poverty count, number of amenities such as parks within 0.25 miles, distance to the nearest commercial corridor, and the number of public-transit commuters with no vehicle available in the tract.

Code
inTrain <- createDataPartition(
              y = paste(hh_tract$name), 
              p = .60, list = FALSE)
p.training <- hh_tract[inTrain,] 
p.test <- hh_tract[-inTrain,]  

indpdt_var <- c('sale_price','interior_condition','number_of_bathrooms','number_of_bedrooms','school_1320','crimes.Buffer_660','own_child_fe','TotalPoverty','facilities.Buffer','dis_cc','no_pub_trans')

reg.training <- 
  lm(sale_price ~ ., data = as.data.frame(p.training) %>% 
                        dplyr::select(indpdt_var))

summary(reg.training)

Call:
lm(formula = sale_price ~ ., data = as.data.frame(p.training) %>% 
    dplyr::select(indpdt_var))

Residuals:
    Min      1Q  Median      3Q     Max 
-526125  -92091  -25235   50473 3610064 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         520752.789  10129.071  51.412  < 2e-16 ***
interior_condition  -49177.410   2205.268 -22.300  < 2e-16 ***
number_of_bathrooms 108131.439   3519.572  30.723  < 2e-16 ***
number_of_bedrooms  -26261.869   1791.348 -14.660  < 2e-16 ***
school_1320          -5962.815    464.716 -12.831  < 2e-16 ***
crimes.Buffer_660    -6468.748   1374.538  -4.706 2.55e-06 ***
own_child_fe           -92.871      8.326 -11.154  < 2e-16 ***
TotalPoverty           -32.183      3.851  -8.357  < 2e-16 ***
facilities.Buffer     2118.017    341.230   6.207 5.55e-10 ***
dis_cc                 -12.538      2.674  -4.688 2.78e-06 ***
no_pub_trans            53.351     12.558   4.248 2.17e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 193200 on 13990 degrees of freedom
Multiple R-squared:  0.3476,    Adjusted R-squared:  0.3471 
F-statistic: 745.4 on 10 and 13990 DF,  p-value: < 2.2e-16

Table of Goodness of Fit

In this phase of the process, we train the model to predict housing prices. We supply the model with data points, each accompanied by a set of attributes linked to the sale price. The model learns the relationship between these attributes and sale prices, then generates its own predicted values when presented with new data. The table below shows the disparities between the model’s predictions and the actual values, quantified as absolute errors. Because we are dealing with high-value assets such as housing, these differences can be substantial, so it is equally important to assess them as percentages to gauge their relative magnitude.

Code
p.test <-
  p.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.training, p.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
  filter(sale_price < 5000000) 

p.test %>%
   st_drop_geometry() %>%
  summarise(MAE = mean(sale_price.AbsError),
            MAPE = abs(mean(sale_price.APE)*100)) %>%
  kbl(col.name=c('Mean Absolute Error','Mean Absolute Percentage Error')) %>%
  kable_classic()
Mean Absolute Error Mean Absolute Percentage Error
105953.8 41.75261
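A toy sketch of how the two error metrics in the table are computed, using made-up predicted and observed prices (note that the code above divides the absolute percentage error by the predicted price rather than the observed one):

```r
# Three hypothetical sales: observed prices and model predictions
observed  <- c(100000, 200000, 300000)
predicted <- c(110000, 180000, 330000)

# Mean Absolute Error: average dollar gap between prediction and truth
mae  <- mean(abs(predicted - observed))

# Mean Absolute Percentage Error, relative to the predicted price
# (mirroring the sale_price.APE definition in the chunk above)
mape <- mean(abs(predicted - observed) / predicted) * 100

mae   # 20000
mape  # about 9.76
```

MAE is easy to read in dollars, while MAPE puts a $10,000 miss on a $100,000 row on the same footing as a $30,000 miss on a $300,000 row.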

Cross Validation

Up to this point, our analysis has centered on a single training and test split. To evaluate the model’s capacity to generalize to new data, it is essential to repeat the modeling process across many different training sets. This methodology is known as k-fold cross validation: here the data are partitioned into 100 folds, and each run holds out a different fold of home sales, which is then used to assess the model’s error. The histogram below shows the Mean Absolute Error (MAE) across all 100 folds. The MAE ranges from roughly $80,000 to $130,000, and its distribution closely approximates a normal distribution, with a prominent peak around $100,000. Despite a degree of inaccuracy, this suggests that the model generalizes and yields consistent results across different training data.
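The fold mechanics that caret’s trainControl automates can be sketched by hand in base R. This toy example (random data, 5 folds instead of the report’s 100) is an illustration of the procedure, not the report’s model:

```r
set.seed(825)
k <- 5                                    # the report uses number = 100
n <- 50
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.1)           # toy linear relationship
folds <- sample(rep(seq_len(k), length.out = n))  # random fold labels

# For each fold: fit on the other folds, score MAE on the held-out fold
fold_mae <- sapply(seq_len(k), function(i) {
  fit <- lm(y ~ x, data = data.frame(x = x[folds != i], y = y[folds != i]))
  held_out <- data.frame(x = x[folds == i], y = y[folds == i])
  mean(abs(predict(fit, held_out) - held_out$y))
})

fold_mae        # one out-of-fold MAE per fold
mean(fold_mae)  # cross-validated MAE
```

The spread of fold_mae across folds is exactly what the histogram of reg.cv$resample$MAE visualizes at full scale.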

Code
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(hh_tract) %>% 
          dplyr::select(indpdt_var), 
        method = "lm", trControl = fitControl, na.action = na.pass)

ggplot(reg.cv$resample, aes(x=MAE)) + 
  geom_histogram(color='white',fill="orange",bins=100)+
  scale_x_continuous(limits=c(0,150000),breaks=c(25000,50000,75000,100000,125000,150000))+
  scale_y_continuous(breaks=c(0,2,4,6,8,10))
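caret’s `trainControl` handles the resampling internally; conceptually, each fold’s MAE comes from a loop like the following base-R sketch (the simulated data and variable names are hypothetical, not our actual model):

```r
set.seed(825)
n  <- 100
df <- data.frame(x = runif(n))
df$price <- 100000 + 50000 * df$x + rnorm(n, sd = 10000)  # simulated sales

k    <- 5
fold <- sample(rep(1:k, length.out = n))   # assign each sale to a fold
mae_by_fold <- sapply(1:k, function(i) {
  fit  <- lm(price ~ x, data = df[fold != i, ])     # train on k-1 folds
  pred <- predict(fit, newdata = df[fold == i, ])   # predict held-out fold
  mean(abs(pred - df$price[fold == i]))             # fold MAE
})
mean(mae_by_fold)   # analogous to averaging reg.cv$resample$MAE
```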

Plot Predicted Prices

The graph presented below compares our model’s predictions against “perfect predictions,” where 100% accuracy is attained. Across most of the data, our model aligns closely with the line of perfection, indicating its overall accuracy. However, for the smaller number of high-value homes, especially those exceeding $2 million, the model tends to undervalue their worth. Another noteworthy limitation is the presence of negative predictions: several points fall below zero on the prediction axis, meaning the model assigns a negative value to certain homes despite their positive actual sale prices. At the low end of the market, the model slightly overestimates values, as evidenced by the model prediction line (in green) sitting above the line of perfection. This could flag properties that require significant investment in maintenance or are in such poor condition that the cost of repairs approaches the purchase price, marking them as potential “fixer-upper” candidates.

Code
p.test %>%
  dplyr::select(sale_price.Predict, sale_price) %>%
    ggplot(aes(sale_price, sale_price.Predict)) +
  geom_point() +
  stat_smooth(aes(sale_price, sale_price), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(sale_price, sale_price.Predict), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  plotTheme()

Spatial Pattern of Errors

Map of Residual Errors

The map presented below provides a visual representation of the absolute margin of error associated with each sales price prediction. The majority of errors fall within the range of plus or minus $50,000. However, it’s important to note that there are distinct areas with more substantial errors, particularly in the northwest, where higher-value homes are located. This underscores the model’s consistent tendency to underestimate the values of these high-end properties.

Code
p.test <- p.test %>% 
  mutate(spErrorClass = cut(sale_price.Error, breaks = c(min(p.test$sale_price.Error, na.rm=TRUE)-1, -100000, -50000, 50000, 100000, max(p.test$sale_price.Error, na.rm=TRUE))))

ggplot()+
    geom_sf(data=nhoods,fill='grey80',color='transparent')+
    geom_sf(data=p.test, size=0.5,aes(colour = spErrorClass))+
    geom_sf(data=nhoods,fill='transparent',color='black')+
    scale_color_manual(values = palette5,
                    name = "Sale Price Error Margins",
                    na.value = 'grey80',
                    labels = c('Less than -100,000', '-100,000 to -50,000', '-50,000 to 50,000', '50,000 to 100,000','More than 100,000'))+
  mapTheme()

Spatial Lag Analysis

Analyzing the residuals map displayed above, we can observe a pattern of spatial clustering in the residuals. To further investigate this, we assess whether there’s a similarity between our predicted values and the predicted values of nearby homes. To do this, we employ a spatial lag calculation, which involves computing the spatial lag as the weighted sum of values from neighboring locations. The scatter plot below juxtaposes the sales price error for each home in our model against the spatial lag error for sales price, which is calculated based on the average error of the five nearest neighbors. The points in the scatter plot exhibit clustering, signifying that our model generally predicts similar errors for a home and its five nearest neighboring homes. The blue line on the plot represents the regression line, indicating this spatial similarity in error predictions.

Code
coords.test <-  st_coordinates(p.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")
 
p.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) %>%
  ggplot()+
  geom_point(aes(x =lagPriceError, y =sale_price.Error)) +
  stat_smooth(aes(lagPriceError, sale_price.Error), 
             method = "lm", se = FALSE, size = 1, colour="blue")+
  xlim(-1000000,1000000)+
  ylim(-1000000,1000000)+
  plotTheme()
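With `style = "W"`, the spatial lag computed by `lag.listw` is just a row-standardized weighted sum, i.e., the average value across each point’s k nearest neighbors. A minimal base-R sketch with toy coordinates and toy errors:

```r
set.seed(1)
n     <- 6
xy    <- cbind(runif(n), runif(n))  # toy point locations
error <- rnorm(n, sd = 50000)       # toy prediction errors

k <- 2
d <- as.matrix(dist(xy))
diag(d) <- Inf                      # a point is not its own neighbor
lag_error <- sapply(1:n, function(i) {
  nb <- order(d[i, ])[1:k]          # indices of the k nearest neighbors
  mean(error[nb])                   # style = "W": equal-weight average
})
lag_error
```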

Moran’s I Test

Moran’s I is a statistic that evaluates the tendency for data points with similar values to occur near one another. Like the nearest-neighbor analysis above, it statistically tests whether houses that are spatially close to each other have similar values, helping us identify high- and low-value clusters (i.e., different neighborhoods and their perceived value). In the graph below, the gray histogram shows the distribution of Moran’s I computed on 999 random permutations of the data, representing what we would expect if our model’s sale price errors were spatially random. The orange line marks the observed Moran’s I of the errors, approximately 0.4. Because the observed value sits far to the right of the permuted distribution, it is statistically significant, indicating that the prediction errors are spatially clustered.

Code
moranTest <- moran.mc(p.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()
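For reference, Moran’s I for a weight matrix W is I = (n / Σᵢⱼ wᵢⱼ) · Σᵢⱼ wᵢⱼ zᵢ zⱼ / Σᵢ zᵢ², where the z are deviations from the mean. A hand-rolled version on toy data (row-standardized 3-nearest-neighbor weights, hypothetical values) looks like this:

```r
set.seed(2)
n  <- 8
xy <- cbind(runif(n), runif(n))   # toy locations
x  <- rnorm(n)                    # toy values (e.g., price errors)

# Row-standardized 3-nearest-neighbor weight matrix
d <- as.matrix(dist(xy)); diag(d) <- Inf
W <- t(apply(d, 1, function(r) { w <- numeric(n); w[order(r)[1:3]] <- 1/3; w }))

moran_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}
moran_i(x, W)   # expected value under spatial randomness is -1/(n-1)
```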

Predicted Values Where ‘toPredict’ is “MODELLING” and “CHALLENGE”

We compared our modeling dataset against the “challenge” dataset, which contains the specific houses for which we needed to make predictions. The modeling map displays the houses used to train the model; these houses have observed sale prices, and the model’s predictions for them are depicted below. For the “challenge” dataset, the model had no pre-existing sale prices to reference, so its map shows purely predicted prices. Comparing the two maps, the predicted prices for the challenge houses mirror the spatial pattern of the modeling set across the city’s neighborhoods. This consistency suggests that the model can reasonably predict home values without the need for a previous sale price.

Code
hh_tract <- hh_tract %>% 
  mutate(sale_price.Predict = predict(reg.training, hh_tract))

  
 ggplot()+
   geom_sf(data=nhoods,fill='grey80',color='transparent')+
   geom_sf(data=hh_tract, size = 1, aes(color= q5(sale_price.Predict)))+
   facet_wrap(~toPredict)+
   scale_color_viridis_d(labels=qBr(hh_tract,"sale_price.Predict"),
                   name="Quintile\nBreaks") +
 mapTheme()

Predictions by Neighborhood

Code
p.test %>%
as.data.frame() %>%
  group_by(name) %>%
    summarize(meanPrediction = mean(sale_price.Predict),
              meanPrice = mean(sale_price)) %>%
      kable() %>% 
  kable_styling()
name meanPrediction meanPrice
ACADEMY_GARDENS 313183.0979 295419.05
ALLEGHENY_WEST 143262.5034 85687.48
ANDORRA 404836.8861 402793.75
ASTON_WOODBRIDGE 238714.9852 287096.43
BARTRAM_VILLAGE 161179.3921 95744.44
BELLA_VISTA 424206.4508 555164.58
BELMONT 70926.0714 138405.88
BREWERYTOWN 221028.3352 228589.87
BRIDESBURG 312132.0606 200516.32
BURHOLME 255146.4259 260000.00
BUSTLETON 332599.6579 378614.59
BYBERRY 545885.7394 361250.00
CALLOWHILL 325057.0376 287500.00
CARROLL_PARK 86130.0904 119115.04
CEDARBROOK 380714.4444 264435.45
CEDAR_PARK 296317.5155 461459.71
CHESTNUT_HILL 425388.0109 969235.33
CLEARVIEW 286770.5611 157350.55
COBBS_CREEK 183882.6411 172067.12
CRESCENTVILLE 134613.7655 204950.00
DEARNLEY_PARK 359261.3186 347276.92
DICKINSON_NARROWS 375879.7875 357342.44
DUNLAP 96908.1346 150812.50
EASTWICK 299900.8238 234138.38
EAST_FALLS 378336.5519 415235.30
EAST_KENSINGTON 334676.6554 356702.21
EAST_OAK_LANE 335013.8912 263622.78
EAST_PARKSIDE 164238.4397 101331.25
EAST_PASSYUNK 359417.6703 369944.58
EAST_POPLAR 343215.7361 167666.67
ELMWOOD 179091.8888 122608.90
FAIRHILL 127209.7276 50076.19
FAIRMOUNT 423069.9158 439904.62
FELTONVILLE 161175.8784 120185.00
FERN_ROCK 249766.4411 172133.90
FISHTOWN 389084.2584 384590.11
FITLER_SQUARE 465778.4266 939325.00
FOX_CHASE 303742.2368 315408.45
FRANCISVILLE 464114.1133 493279.24
FRANKFORD 152700.8349 127409.52
FRANKLINVILLE 112733.8112 74028.30
GARDEN_COURT 387127.4591 461555.56
GERMANTOWN_EAST 208372.2991 145838.43
GERMANTOWN_MORTON 194672.2474 163134.39
GERMANTOWN_PENN_KNOX 295326.2061 322985.71
GERMANTOWN_SOUTHWEST 210599.1570 186343.46
GERMANTOWN_WESTSIDE 290993.2816 171366.67
GERMANTOWN_WEST_CENT 328969.3698 388359.21
GERMANY_HILL 471398.6431 405656.67
GIRARD_ESTATES 329667.4836 305383.72
GLENWOOD 138715.1587 75405.88
GRADUATE_HOSPITAL 458035.6063 583837.83
GRAYS_FERRY 205775.1777 180690.16
GREENWICH 319151.8049 261468.00
HADDINGTON 112776.5640 115785.54
HARROWGATE 23448.4356 87017.84
HARTRANFT 165910.5410 105663.55
HAVERFORD_NORTH 229810.6242 175756.56
HAWTHORNE 460867.9722 592847.83
HOLMESBURG 179673.9978 201102.94
HUNTING_PARK 93308.9048 88302.09
JUNIATA_PARK 151308.2529 154565.38
KINGSESSING 190626.0585 136016.08
LAWNDALE 220614.3177 191576.50
LEXINGTON_PARK 308038.9044 307275.00
LOGAN 188461.6980 142351.80
LOGAN_SQUARE 458097.3752 526992.71
LOWER_MOYAMENSING 308267.2746 230304.63
LUDLOW 483012.1207 337166.67
MANAYUNK 390439.6649 341272.40
MANTUA 152824.5282 211866.67
MAYFAIR 227137.0179 232220.42
MCGUIRE 362.7725 65714.29
MELROSE_PARK_GARDENS 299697.5821 177327.78
MILLBROOK 324326.7602 282708.57
MILL_CREEK 130311.2099 112055.30
MODENA 316047.7297 295304.35
MORRELL_PARK 324461.4553 294357.41
MOUNT_AIRY_EAST 315985.8335 300325.22
MOUNT_AIRY_WEST 379370.4600 492501.52
NEWBOLD 340099.1481 303131.70
NICETOWN 242050.5672 100662.50
NORMANDY_VILLAGE 272431.1201 313958.33
NORTHERN_LIBERTIES 442951.4148 635777.62
NORTHWOOD 207580.7881 209995.89
NORTH_CENTRAL 182149.7097 236589.42
OGONTZ 207642.8897 146244.57
OLD_KENSINGTON 435858.7504 447569.44
OLNEY 168499.0692 144991.45
OVERBROOK 270669.8751 206751.63
OXFORD_CIRCLE 233806.2729 216263.19
PACKER_PARK 460367.1451 546490.27
PARKWOOD_MANOR 318789.1952 282820.25
PASCHALL 142326.2596 100835.42
PASSYUNK_SQUARE 408431.2784 428459.33
PENNSPORT 407954.5112 370430.30
PENNYPACK 311989.9811 306320.60
PENNYPACK_WOODS 312316.6692 261250.00
PENROSE 261804.8861 174590.91
POINT_BREEZE 379799.7166 326131.41
POWELTON 257423.8048 380000.00
QUEEN_VILLAGE 390044.9539 569240.86
RHAWNHURST 280825.4615 290088.18
RICHMOND 229070.2879 182177.13
RITTENHOUSE 426305.8777 797764.54
RIVERFRONT 374635.6916 441249.75
ROXBOROUGH 373662.5460 320127.44
ROXBOROUGH_PARK 411597.8106 527944.44
SHARSWOOD 263195.6767 134937.50
SOCIETY_HILL 417227.4487 730063.49
SOMERTON 315262.2539 342732.68
SOUTHWEST_SCHUYLKILL 183011.0433 135575.48
SPRING_GARDEN 445306.9801 443212.10
SPRUCE_HILL 378143.5800 610750.00
STADIUM_DISTRICT 333811.5238 287570.59
STANTON 58650.7560 87913.81
STRAWBERRY_MANSION 122640.1230 86970.56
SUMMERDALE 185666.4741 164812.56
TACONY 232639.7552 183643.01
TIOGA 162364.3769 96660.98
TORRESDALE 315534.0055 290113.20
UPPER_KENSINGTON 11318.5677 64203.95
UPPER_ROXBOROUGH 356465.5292 364390.14
WALNUT_HILL 246846.7255 286243.75
WASHINGTON_SQUARE 374948.3612 688845.96
WEST_KENSINGTON 280537.5148 208440.23
WEST_OAK_LANE 242813.1915 198077.28
WEST_PARKSIDE 250118.7184 163500.00
WEST_PASSYUNK 241586.9338 198819.36
WEST_POPLAR 321317.5687 407925.00
WEST_POWELTON 246067.3835 253083.33
WHITMAN 332864.5096 235917.81
WINCHESTER_PARK 343127.1764 317564.29
WISSAHICKON 389331.6720 360597.92
WISSAHICKON_HILLS 334588.7911 368833.33
WISSINOMING 201347.5909 166505.62
WISTER 175089.0203 132357.69
WYNNEFIELD 242641.1852 217276.00
WYNNEFIELD_HEIGHTS 314667.7244 160300.00
YORKTOWN 254379.3574 272875.00

Spatial Pattern of Regression by Neighborhood

Irrespective of the model’s overall accuracy and its general capability to predict house prices, it’s crucial to recognize its weaknesses on a neighborhood-by-neighborhood basis. By doing so, we can gain insight into which areas the model tends to overestimate or underestimate. A prominent observation from the map is that the model’s most significant errors are concentrated in the lower-income north and east sections of the city. This suggests that the model is less accurate in these less desirable neighborhoods, where its predictions tend to be further from the actual values.

Code
st_drop_geometry(p.test) %>%
  group_by(Regression, name) %>%
  summarize(mean.MAPE = abs(mean(sale_price.APE, na.rm = T))) %>%
  ungroup() %>% 
  left_join(nhoods) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE)) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "Mean Absolute Percent Error") +
      labs(title = "Mean test set MAPE by neighborhood") +
      mapTheme()

Scatterplot of MAPE by Neighborhood as a Function of Mean Price by Neighborhood

Code
st_drop_geometry(p.test) %>%
  group_by(Regression, name) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T), 
            mean.sale_price = mean(sale_price, na.rm = T)) %>%
  ungroup() %>% 
  left_join(nhoods) %>% 
  ggplot() +
  geom_point(aes(x =mean.sale_price, y =mean.MAPE))+
  stat_smooth(aes(mean.sale_price, mean.MAPE), 
             method = "lm", se = FALSE, size = .5, colour="red")+
  plotTheme()

Variables Split by Census Groups

These two maps show the significant income gap between races in Philadelphia: higher-income areas clearly parallel areas with smaller minority populations. Many historical, political, and socioeconomic forces created and continue to define these boundaries. That is also exactly why incorporating race as a factor in the earlier models was not recommended. Adding race to a model creates a cascading effect of discrimination, teaching future predictions to treat race as a pre-determined factor that affects house values. Race appears related to housing values only because historical systemic racism and discriminatory policies created parallels between racial and economic disparity.

Code
tracts21 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E","B06011_001"), 
          year = 2021, state='PA', county='Philadelphia', geometry=T, output = "wide") %>%
  st_transform('ESRI:102728')  %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         Median_Income = B06011_001E) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(tracts21), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(tracts21), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

Accuracy & Generalizability

The accuracy of the model is contingent on specific metrics tailored to the characteristics of Philadelphia. To scale the model for use in other cities, extensive research and data collection would be essential, especially regarding factors like school systems, crime statistics, and other external sources that might differ significantly across regions. While the quantity of data can enhance the accuracy of predictors, the quality of data, particularly data that pinpoints critical variables affecting home prices, is equally crucial for efficient model training. In terms of generalization, the model’s applicability may primarily extend to the Philadelphia region due to its familiarity with the local context and the specific dataset it was trained on. Incorporating data from locations beyond this region would likely lead to inaccuracies because of the varying local contexts and the need for entirely new datasets to capture the specific dynamics of those areas.

Conclusion

Based on our findings, we do not recommend widespread adoption of the model, primarily due to its inaccuracy in extreme situations. It’s apparent that the model’s limitations become most pronounced when dealing with exceptionally high or low house prices. However, our model has shed light on the importance of public services and security in influencing house prices, considering factors such as green space accessibility and distance to corridors. In the pursuit of fostering housing equity in Philadelphia, it is imperative to give due attention to equity in public resources. Nevertheless, we cannot disregard the model’s weakness in predicting areas where house prices are extremely high or low. It may be worthwhile to reconsider our approach when defining and handling “outliers.” Additionally, we should explore other variables that could better capture the features associated with higher or lower housing prices.