Thesis Posts

Refining a Model and Creating Bid Curves of the Orange Line Station Premium

November 29, 2016 By Jesse Simpson

Over this last week, I re-consulted the hedonic literature to gather new ideas for model specification and variables, reading through papers until the Greek letters and statistical language started to make sense. I also decided to do some more formalized data validation. I spot-checked properties exhibiting high standardized residuals in my regression analysis through Zillow, Redfin, Trulia, and the source County Assessor’s site, to ascertain whether any of the provided fields were outdated or misspecified. The outdated-data errors mostly came in the form of properties that had recently been demolished, which was readily assessed using PortlandMaps sales and permit data and/or Google Street View. If I was able to find a new figure for the house square footage, I entered that figure; if not, I deleted the field. The misspecification errors arose through the somewhat odd data management necessitated by PortlandMaps’s taxlot-based data. In Clackamas County in particular, there are many undeveloped taxlot slivers that are obviously sold as part of an associated house, yet recorded as separate transactions for the same price, though only with the characteristics of the small, undeveloped lot. For taxlot sliver errors, I removed the sliver sale from the dataset and added its lot square footage to the associated home sale.

I made a large move forward in tackling the issues of spatial autocorrelation that had been visibly plaguing my model by using spatial fixed effects, namely dummy variables for the official neighborhood that each property is within. Drawing on Duncan (2007), I have chosen to rely heavily on specifications of neighborhoods to control for neighborhood effects with no good quantitative measure. Previously, I had attempted to control for neighborhood variables through block group dummy variables, following Yan’s (2013) model. Not only was this very cumbersome, but the specification seems likely to soak up any of the locational effects of rail. Instead, I used the boundaries of neighborhood associations, which provide a larger proxy for locational effects. I spatially joined the neighborhood association shapefile to the sales points and manually created the series of dummy variables based on the assigned neighborhood. These variables were highly significant and returned coefficients in line with expectations of the housing market, representing the estimated premium or discount assigned to neighborhoods relative to the Richmond neighborhood. Including the neighborhood dummy variables increased R-squared by about 6-8% and greatly reduced the spatial autocorrelation of residuals.
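As a rough illustration of the dummy-variable setup described above, here is a minimal sketch in plain Python. The function name, the sample neighborhood list, and the choice of plain lists rather than a GIS attribute table are all assumptions for illustration; only the use of Richmond as the omitted reference category comes from the post.

```python
def neighborhood_dummies(neighborhoods, reference="Richmond"):
    """Build 0/1 indicator columns for each neighborhood except the reference.

    Omitting the reference category means every coefficient is read as a
    premium or discount relative to that neighborhood.
    """
    levels = sorted(set(neighborhoods) - {reference})
    rows = [[1 if n == level else 0 for level in levels] for n in neighborhoods]
    return levels, rows


# Hypothetical neighborhood assignments, standing in for the spatial join output.
sales = ["Richmond", "Sellwood-Moreland", "Brooklyn", "Richmond",
         "Ardenwald-Johnson Creek"]
names, X = neighborhood_dummies(sales)

for neighborhood, row in zip(sales, X):
    print(neighborhood, dict(zip(names, row)))
```

A sale in the reference neighborhood gets all zeros, which is what makes Richmond the baseline against which the other coefficients are interpreted.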

I further delimited the bounds of my study to 1.25 miles in network distance. Since this equates to roughly a 25-minute walk, it seems unreasonable that a station premium would persist past that point, and most hedonic analyses find price effects, if any, in the range of a quarter mile to one mile (Higgins and Kanaroglou 2016). Leaving a larger than necessary study area will tend to increase the errors involved in estimating the coefficients of variables. This change was prompted first by realizing that I was including large parts of the Southwest Hills in my analysis (up to 2 miles walking distance from the nearest station). There’s little reason to believe that such locations are valuing the addition of the Orange Line for providing one-hour total travel time access to Milwaukie.

I have essentially settled on the level-log specification for my model. Leaving the dependent variable (price) untransformed is attractive for theoretical and practical reasons. Taking the logarithm of the price means that every independent variable’s effect is interpreted as yielding a percentage change, and thus the dollar effects would be greater for more expensive homes. Also, interpreting logarithms is difficult, with the transformation back into regular prices apparently a big statistical no-no. Nevertheless, I would expect to see a curved function of some sort, as the value of proximity would seem to rise faster the closer one gets to the station: the change in value from half a mile to a quarter of a mile is surely greater than that of going from 1.25 miles to one, since it represents a much larger percentage decrease in the time taken to access transit. Alongside this continuous variable, I also ran models with dummy variables created for quarter mile distance bands for cross-verification purposes. In terms of other variable modification, I changed the variable measuring the median age of surrounding homes (quarter mile buffer) into a measure of the percentage of surrounding homes that were built before 1940. This variable was more significant in the model than a simple linear age term and more closely approximates the effect I had in mind when I created the age variable: the increased price associated with being around historic homes in prewar neighborhoods. I also created a variable for whether a home is attached, based on the property code description. My model suggests that this brings a ~$30,000-$45,000 discount. I also changed the posited rail disamenity variable from a continuous distance variable to dummy distance bands.
The continuous variables were quite odd to interpret with the station proximity variable, as any resultant price effects are the effects of increasing your proximity to the station while holding the distance to the line constant, which only happens in the case of disconnected street grids or by moving parallel to the line. The resultant coefficients were also much larger than anticipated while biasing the estimate of the station proximity upward. The hedonic model results obtained are posted below, with the White standard errors shown due to significant heteroskedasticity remaining. I think I need this to be my final specification, considering that we’re running out of time in the semester and I still need to do significant work on individual station models.

Overall, the results show the progressive development of a light rail premium from planning to operation. The coefficient on distance to stations (lnOLSta) goes from insignificantly positive in the planning period to insignificantly negative during construction to significantly negative in operation, resulting in a price premium equal to $230 per 1% decrease in the distance to the station, or $63,000 for properties within a quarter mile and $30,000 between a quarter and three-quarters of a mile. One can derive a dollar estimate of the price premium from the continuous logarithm coefficient. Though this value is technically calculated within the model as the expected change in price given a one-unit change in log-distance from the mean log-distance, the price premium from light rail can be derived from this estimate by assuming that the premium at 1.25 miles is zero. From this, we can create a table of the estimated percentage change in price from 1.25mi by multiplying the percentage change in distance by the coefficient. 95% confidence intervals were created by multiplying the standard error by the relevant t score and then adding this margin of error to, and subtracting it from, the coefficient.
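The bid-curve derivation can be sketched as follows, under several assumptions that are not in the post. In a level-log model a 1% change in distance shifts price by roughly the coefficient divided by 100, so "$230 per 1% decrease" is read here as a coefficient of about -23,000. The standard error and the 1.96 critical value are hypothetical placeholders, and the sketch uses the exact log difference from the 1.25-mile cutoff instead of multiplying percentage changes. It will not reproduce the distance-band figures quoted above, which presumably come from the separate dummy-variable models.

```python
import math

BETA = -23000.0   # assumed coefficient on ln(distance), implied by $230 per 1%
SE = 8000.0       # hypothetical standard error, for illustration only
T_CRIT = 1.96     # large-sample 95% critical value (assumption)
CUTOFF_MI = 1.25  # distance at which the premium is assumed to be zero


def premium(distance_mi, beta=BETA, cutoff=CUTOFF_MI):
    """Dollar premium relative to the cutoff under price = a + beta * ln(d)."""
    return beta * math.log(distance_mi / cutoff)


def premium_interval(distance_mi, beta=BETA, se=SE, t=T_CRIT):
    """Premium evaluated at the lower and upper bounds of the coefficient."""
    bounds = [premium(distance_mi, beta=b) for b in (beta - t * se, beta + t * se)]
    return min(bounds), max(bounds)


for d in (0.25, 0.5, 0.75, 1.0):
    low, high = premium_interval(d)
    print(f"{d:.2f} mi: ${premium(d):,.0f} (95% CI ${low:,.0f} to ${high:,.0f})")
```

Because the premium is the coefficient times a log ratio, the curve steepens as distance shrinks, which is the curved shape the level-log form is meant to capture.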

station bid rent graph

The Pains and Joys of Regressions

November 17, 2016 By Jesse Simpson

Over this past week, I’ve delved deep into my regression analysis of the property values around the Orange Line. It’s been an intensely iterative process, as I’ve moved back and forth between several different ways of setting up a regression model while churning through the seemingly endless mounds of previous transit-related hedonic studies, slowly building an understanding of the relevant terminology and methodologies. The foundation of my model is the hedonic model—examining the relationship between prices over time with respect to the network distance to Orange Line stations, controlling for structural and other locational attributes. There are many choices to make with regard to each of these terms, choices whose implications and motivations I only roughly understand at the moment. The dependent variable, price, can be expressed directly or as its natural logarithm. Taking the log of price both unskews the distribution and turns the term into an expression of the percentage change in price. The focus independent variable of the study, network distance from rail, can be similarly logged, creating a log-log model that expresses the relationship between distance and price as an elasticity—easily interpreted as the percent change in price given a percent change in distance. This log-log model allows for the proximity effects of transit to be viewed as an inverse exponential curve, with price effects most dramatic near the station. As one approaches the station, a given absolute change in distance will become a larger and larger percentage change, with this percentage change being the relevant factor for determining the percentage change in price.
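To make the elasticity reading concrete, here is a small sketch comparing the exact price change implied by a log-log model with the "percent times elasticity" shortcut. It uses the station-distance elasticity of -0.027 reported further down in this post; the function names are made up for illustration.

```python
ELASTICITY = -0.027  # station-distance elasticity reported later in this post


def exact_pct_change(d_from, d_to, elasticity=ELASTICITY):
    """Exact percent change in price under ln(price) = a + e * ln(distance)."""
    return ((d_to / d_from) ** elasticity - 1.0) * 100.0


def linear_pct_change(d_from, d_to, elasticity=ELASTICITY):
    """Shortcut: elasticity times the percent change in distance."""
    return elasticity * (d_to - d_from) / d_from * 100.0


# Small change: the shortcut and the exact figure nearly coincide.
print(exact_pct_change(1.00, 1.01), linear_pct_change(1.00, 1.01))

# Large change (0.25 mi vs 1.5 mi): the shortcut overstates the effect.
print(exact_pct_change(1.5, 0.25), linear_pct_change(0.25, 1.5))
```

The shortcut is only a local approximation. Over a sixfold difference in distance, the exact premium for the closer property works out to roughly 5%, while the shortcut gives 13.5% in magnitude.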

Illustration of Different Functional Forms of Regressions

heteroskedasticity with respect to station variable

It’s also worth noting that many studies use distance band specifications, examining the relationship between price and transit proximity within vs. outside a given radius. I find this rather arbitrary and think some sort of logarithmic transformation better captures the reality of diminishing utility of transit as distances increase. In the regressions I performed this week, however, the log-log form appears to be somewhat problematic, introducing a high degree of heteroskedasticity (error variance that changes with the value of a variable, here increasing with it). The plot shown at the right is the regressed relationship between the natural log of station distance (horizontal axis) and the natural log of price (vertical axis). It’s essentially the perfect image of heteroskedasticity, which is apparently quite problematic in statistical analysis. Over the coming week or two I will need to address this issue via some kind of different model specification.

The statistical choices compound further when we consider the controlling independent structural or locational variables. My present data set includes a limited number of structural variables, owing to the paucity of the public dataset—simply the age of the structure, the square footage of the building, and the square footage of the lot. This leaves out many of the structural variables commonly used in hedonic models, like the number of bedrooms, bathrooms, and stories, the presence of a garage, deck/patio, or other amenities, rated building quality, architectural style, and heating/cooling systems. The use of these variables in hedonic regression seems basically dependent on the availability of data, with more fields considered better. For the three structural variables I do have, there is additionally the option to transform the variables by using their natural log or squared value. Yan et al. (2013), for instance, take the natural log of building area and use an age-squared variable along with age (to capture the tendency for homes to acquire some historic value at a certain point). For my present and interim regression model, I took the natural log of building and lot square footage and used both the age and age-squared variables.

The subjectivities of statistical analysis compound even further when we move to the locational attributes. I began with a focus on other proximity attributes, like the distance to heavy rail, highways, parks, rivers, commercial and industrial zoning, grocery stores, community centers, and the Orange Line itself; the percentage of streets with sidewalks; the average age of structures within a quarter of a mile; and a measure of slope (the total length in feet of 10-foot contour lines contained within a tenth of a mile of the property). These attributes were driven by the availability of data and hypothesized potential amenity or disamenity effects. After reading Yan et al. (2013), who simply used dummy variables for each block group that the home sales are in to control for locational variation, I became interested in this apparently straightforward (though time-consuming) method. The resulting model, however, was slightly less explanatory than the previous model (Adj. R-squared=.52), displayed roughly the same amount of residual clustering (as measured with Moran’s I), and showed no significance for either of the Orange Line variables. The model that achieved the highest explanatory power of those tested included some block group demographic data (median household income, percent of residents with a bachelor’s degree, and percent non-white), a dummy variable for being located in the Portland city limits, the slope (FT_CONTOUR), the mean year of surrounding structures (MEANYR_STR), the log distance to downtown, and a temporal index variable (LOGSCIND, the S&P Case-Shiller index for the Portland metro area for each month). The results of this regression are below.

OLS variables Screenshot 2016-11-17 11.23.55

The most relevant variables for my thesis are LNORANGE (natural log of network distance to stations) and LNOLINE (natural log of straight-line distance to the line itself). In this particular specification, they both lie right on the edge of statistical significance: the main p-value for both is <.05, but the Koenker statistic indicates that there is significant heteroskedasticity, meaning that the Robust Probability must be used. The coefficients accord with theory, indicating a .027% decrease in price with a 1% increase in walking distance to the station and a .021% increase in price with a 1% increase in distance from the line itself (the line runs largely in an industrial right-of-way). This would mean that, holding all else (including the distance to the line) equal, a property .25 miles from the station would be worth roughly 5% more than one 1.5 miles away (6^.027 ≈ 1.05). Simply multiplying the 500% difference in distance by the elasticity would suggest 13.5%, but that linear shortcut only holds for small changes in distance. Viewed in space, with estimated values transformed from logs back into dollars, this model’s estimated values and standardized residuals are below:

Considering the dramatic diversity of transit land value uplift hedonic models extant in the literature, as surveyed by Higgins (2016), it seems that there is no singular “right choice” with regard to model specification. Nevertheless, with reference to Armstrong and Rodriguez (2006), it has also come to my attention that I need to more systematically evaluate the “spatial lag” of these variables, in conjunction with whatever treatment of the heteroskedasticity I find.
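For reference, the residual clustering check mentioned earlier in this post (Moran’s I) can be sketched in a few lines of plain Python. This is not the ArcGIS implementation. The binary adjacency matrix and the toy residual values are hypothetical, and real applications typically use distance-based or row-standardized weights.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weight matrix.

    I = (n / S0) * sum_ij w_ij * (x_i - mean) * (x_j - mean) / sum_i (x_i - mean)^2
    where S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * (cross / sum(d * d for d in dev))


# Hypothetical example: four sales along a street, each neighboring the next.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

print(morans_i([1, 2, 3, 4], W))    # similar values sit together: positive
print(morans_i([1, -1, 1, -1], W))  # values alternate: negative
```

Values near zero suggest no spatial pattern in the residuals. Clearly positive values indicate the kind of clustering that neighborhood controls or a spatial lag term are meant to absorb.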

More Goals & An Outline

November 3, 2016 By Jesse Simpson

Over this past week, I aimed to run my network distance calculations for each of the taxlots in Portland and set up my linear regression variables. The former goal ended up being more time-intensive than anticipated, as it took several rounds of calculation to figure out the optimal way to have ArcGIS process the Origin-Destination Cost Matrix for hundreds of thousands of taxlots. In the end, it seems that adding the shapefile as a single feature class to a file geodatabase is optimal. Additionally, after exploring the options for a spatio-temporal analysis, I decided that examining an integrated time-space model was too intensive in terms of my own statistical knowledge and computational power to pursue, at least in terms of data visualization. Instead, I will be examining discrete years. I processed the sales data for 2016 this week, producing an interpolated kriging map of property sale values in 2016.

In terms of linear regression variables, I did not quite meet my goal of setting all of them up for analysis, though I did make quite a bit of progress. I enumerated the variables I will be using (distance to downtown, distance to parks, school quality, distance to other job centers, distance to freeways, percent of sidewalks improved, proximity to community centers, year built, lot square footage) and created the data for the school quality variable. To do this, I used the school attendance shapefile from Portland and manually added the school attendance areas for Milwaukie, with reference to an online map. I then added the GreatSchools rating for the elementary, middle, and high schools for the attendance areas, followed by averaging these scores. I also realized that the structural variables included within public records are far too sparse for an optimal hedonic regression analysis. These variables are limited to building square footage and the year built, and even for these fields, many records have no data supplied. I may need to explore options for obtaining structural home data through Zillow’s API.

In addition to these GIS methods, I created an outline for my project, viewable at the right.

Portland Transit Isochrone Maps & Some New Literature

October 20, 2016 By Jesse Simpson

Over this past week, I’ve started working through my methodology, focusing primarily on how I will conduct my analysis of network-based (i.e. walking, rather than as-the-crow-flies) distances to transit stations. This work has been quite fruitful, and I found ArcGIS’s Network Analyst tool immensely useful for generating the transit isochrone maps I envisioned. I first built a network from the RLIS street network shapefile in ArcCatalog (excluding highways and freeways). From there, it was relatively straightforward to use the service area analyst tool to create line and polygon maps of the distance to the nearest rail station, posted below. This line map shades the streets within 1 mi of each rail station in the Portland area, with different shades used for each .2 mi increment:

This polygon map shows the area within .25, .5, and 1 mi of rail stations by network area:

My next steps for analyzing this data will be to focus the study area on the one or two corridors which I will be using and find a way to associate the parcels with this network analysis.

I also went back and read through three sources I’ve accumulated over the past half-year of research. Bartholomew and Ewing (2011) conducted a literature review of hedonic analyses of the price effects of transit on land value. This article lays out the basic elements of hedonic analysis, noting the characteristics controlled by most hedonic models (those related to structures—square footage of living space, the number of bedrooms and bathrooms, the presence of a garage, the age of the house, etc—and those related to land—proximity to amenities and disamenities). They then survey, summarize, and synthesize the existing literature on the hedonic price effects of transit, categorizing and grouping these effects as access-related effects and design-related effects. They conclude that the literature confirms that the value of pedestrian- and transit-designed development is being capitalized into real estate prices and that the valuation of transit accessibility and TOD-based design are linked synergistically.

Mackenzie (2013) analyzed neighborhood access to transit in 2000 and 2009 in relation to demographic data in the Portland metro area. He measures transit access as the average network-level walking distance from each housing parcel within a block group to transit, considering transit accessible if it’s within 1/2 of a mile of light rail or 1/3 of a mile of bus stops, and summing the number of accessible routes for each parcel in order to find the average number of accessible transit routes for each block group. He used the Moran’s I clustering technique to categorize neighborhoods as having concentrations of poverty and racial/ethnic minorities. He then conducted a regression analysis and controlled for population density, job count, working-age population, and the percentages of people renting, with postsecondary education, and with no vehicle. He found that neighborhoods with a high concentration of black and Latino populations experienced a decline in transit accessibility from 2000 to 2009, though high-poverty neighborhoods overall actually saw an increase in transit accessibility.

Wu et al. (2014) use a difference-in-difference framework to measure changes in the Beijing property market over the 2000s in relation to the opening and planning impacts of transit expansions. They lay out the methodological reasons to prefer a difference-in-difference model (comparison of a control and treatment group over time) over cross-sectional analyses (comparison of a single area before and after the opening of a transit expansion). They use a regression model to examine land prices over three periods: before 2003, between 2003 and 2008, and after 2008 (2003-2008 being a period of intense rail construction in Beijing), comparing parcels within 2km of a new station to those outside this distance band and controlling for education, the presence of public housing, density, heritage, and employment. Wu et al. find that both residential and commercial land parcels receiving increased station proximity appreciated in value significantly, that these effects are priced in to some extent during the planning process, and that the proximity effects on land prices are unevenly distributed.
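The core arithmetic of a difference-in-difference comparison can be sketched with group means, as below. This is only the simplest two-group, two-period version with made-up prices. It is not Wu et al.’s regression model, which spans three periods and adds control variables.

```python
def mean(values):
    return sum(values) / len(values)


def did_estimate(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical land prices (arbitrary units) near and far from a new station.
near_before, near_after = [100, 110], [150, 160]
far_before, far_after = [90, 100], [110, 120]

effect = did_estimate(near_before, near_after, far_before, far_after)
print(effect)
```

Subtracting the control group’s change removes market-wide appreciation, so what remains is attributed to station proximity, provided both groups would otherwise have followed parallel trends.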

Throttling Towards a Thesis

October 5, 2016 By Jesse Simpson

Framing question: How do transit-oriented development and revitalization plans shape the geography of property values, development, and equity?

Focus question: What effects have recent integrated land use-transportation plans in Portland had on property values and new construction, over the course of the planning and construction process? What policies are proposed to deal with potential displacement arising from these plans?

The major theoretical frameworks which I will employ in this investigation include: the rent gap, neoliberalism, the revanchist city, locational indifference, and the spatial fix. My major data sources will be the sales data for properties in Portland and the building footprint data, analyzed in conjunction with data on the location of transit stations and the distance via the street network to those stations. Additionally, I will use a variety of official City of Portland/TriMet sources to contextualize the project timeline of rail lines, to investigate the extent of political consideration of transit-induced gentrification, and to assess the policies proposed to proactively deal with this association.

Golub, Guhathakurta, and Sollapuram’s 2012 article “Spatial and Temporal Capitalization Effects of Light Rail in Phoenix From Conception, Planning, and Construction to Operation” and Immergluck’s 2009 article “Large Redevelopment Initiatives, Housing Values and Gentrification: The Case of the Atlanta Beltline” illustrate appropriate methodologies for determining the relationship between housing prices and transit over time. Both are concerned explicitly with the planning and construction impacts on prices and use a hedonic model incorporating time as a variable. They measure the distance from transit stations/lines as the crow flies, seeking to determine the effect of proximity to planned transit on changes in property values, controlling for the characteristics of the houses themselves and for other locational qualities. Replicating this may prove fruitful, though it would require learning more about how to properly do a regression model.

A secondary methodology to pursue aligns with that of Jones and Ley’s 2016 article “Transit-oriented development and gentrification along Metro Vancouver’s low-income SkyTrain corridor,” which uses a predominately qualitative analysis to tell a narrative of how gentrification becomes a foregone conclusion accepted by municipal plans. They weave together contextualizing maps of socioeconomic status, focus group statements, statements at meetings concerning rezoning, and interviews with city council members into a rich explanatory tapestry. For my thesis, this methodology would consist of seeking out, presenting, and interpreting qualitative data concerning the intersection of transit plans, municipal growth agendas, and neighborhood priorities.