Case Study

For this case study, you have been contracted by the Irish Revenue. Using only the data specified below, the Revenue require 3 primary objectives to be addressed:

  1. Perform an exploratory data analysis to highlight interesting patterns in the data.
  2. Identify residential sales which seem unusual i.e. residential sales which are unusually high or low considering the available features.
  3. Provide your opinion on whether a model can reliably forecast monthly stamp duty. Stamp duty is a tax applied to the sale of property. For the purpose of this question, assume stamp duty is charged on all sales (exclusive of 13.5% VAT) at a rate of 1%.

The below formula is how to calculate stamp duty:

$$ \text{Stamp Duty Due} = \text{Stamp Duty Rate} \times (\text{Sale Price}-(\text{Sale Price} \times \text{VAT})) $$$$ \text{Stamp Duty Due} = 0.01 \times (\text{Sale Price}-(\text{Sale Price} \times 0.135)) $$

Data

  1. Residential Property Price Register
  2. Residential Property Price Index - Data description

Before beginning, make sure to visit the provided links to gain a better understanding of the data.

Please read through entire assessment before beginning.

1. Download and load datasets

The first step to perform is to download and read the data by following the below steps.

Date of Sale (dd/mm/yyyy), Address, County, Price (€), VAT Exclusive, Not Full Market Price, Description of Property

2. Data Cleaning & Processing

Next, complete the following steps to obtain the dataset which the exploratory analysis will be performed on.

Note, perform any additional steps that may be warranted based on your understanding of the data that you have observed from the data itself and/or from the information provided in the links.

3. Exploratory Data Analysis

Revenue have given a brief that explains they are interested in finding patterns in the property price and residential price index information.

Examples include the change in property prices over time (adjusted and unadjusted for inflation), the number of properties sold, the distribution characteristics of prices and the difference between counties and types of properties.

This is not an all-encompassing list and should be expanded upon based on what you feel the client would be interested in knowing. The purpose of this question is to showcase your ability to extract useful information relevant for a client from a data source, visualise and elaborate on those findings.

Figure 1 Monthly Residential Property Price Index - Relative to 2015 Average

As you can see from figure 1, the inflation adjusted RPI for National (ex Dublin) begins and ends the 10 year period at a very similar level, with a substantial dip ending in 2013. This means that the real cost of housing is likely the same at the end of 2019 as it was in 2010. Conversely, the inflation adjusted RPI in Dublin at the end of 2019 is substantialy higher than its 2010 value. This means that the real cost of housing has likely risen substantialy during the same 10 year period. This graph eludes to a disparity in the changing cost of living in Dublin vs the rest of the nation over the last 10 years.

Figure 2 Monthly Quantity of Residential Properties Sold

The monthly quantity of properties sold follows a seasonal pattern. January generally has the lowest volume for the year, the volume then builds throughout the year with December being the busiest month for the market. This pattern is present for new and second hand properties inside and outside of Dublin. It is evident that there has been a gradual rise in the number of properties sold in the first half of the decade and since the volume has been relatively stable. The quantity of new houses has continued to rise in the last five years but this is a smaller proportion of the overall market.

Figure 3 Monthly Mean & Median Sale Price by location (Dublin vs non-Dublin)

Figure 3 depicts clearly the price gap, mean and median, between housing in and oustide of Dublin. The begining of the 10 year period saw a roughly 50k price gap between the median sale price and roughly a 100k price gap between the mean sale price. The inflation adjusted mean and median price for residential property outside of Dublin has not risen significantly during the 10 year period. The inflation adjusted mean and median price for residential property inside of Dublin however has risen significantly. This leaves the inflation adjusted median price gap at the end of the 10 year period at roughly 75k and the inflation adjust mean price gap at roughly 250k. This is consistent with trends observed in figure 1 and once again points at the disproportionalty high rise in real cost of living in Dublin.

Figure 4 Monthly Mean & Median Sale Price by (New vs Second Hand)

Over the 10 year period the mean and median cost of new and second hand residential property has remain relatively well matched. It is evident however that the cost of new housing has risen to a greater extent than second hand housing. In both the mean and the median plots it can be seen that the cost of second hand properties were higher in 2010 than new properties. This relationship has been flipped by the end of 2019. This may point to a lack of affordable new builds on the market. Volatility in the data for new builds is quite high as they respresent a smaller propotion of the market and therefore there is less data avaliable to correct for outlier.

Figure 5 Monthly Mean & Median Sale Price by County

Once again in figure 5 the price gap between Dublin and the other counties included in this study is evident. Volatility of the data is quite high in these plots as breaking the data down by county leaves less data in each group. Meath is the highest non-Dublin county. It is likely the higher prices in Meath are a result of its proximity to Dublin.

Figure 6 Histogram of the the Estimated December 2019 Price for all Properties by Location

This data is not perfectly normally distributed. The means lean towards zero and the data has a long tail.

Figure 7 Histogram of the the Estimated December 2019 Price for all Properties by County

This data is not perfectly normally distributed. The means lean towards zero and the data has a long tail.

4. Outlier Detection

Using an unsupervised approach, implement a method to classify whether a property sale should be considered an outlier or not. In the context of the county and type of property, the Revenue is interested in 2 types of outliers:

The Revenue are interested in creating a backlog of property sales to investigate for tax avoidance.

The Revenue are interested in identifying sales which do not represent a sale of a single residential property. The client believes there may be many instances of a single sale which are for several properties, such as apartment blocks.

Therefore, a deliverable for the client is to provide a list of properties which were sold for abnormally low and high values i.e. outliers.

The approach chosen should be informed from information you have observed in the previous step.

Prior to an implementation, please provide a detailed explanation of the reason the selected approach was chosen and provide a summary of the findings regarding the list of outliers identified.

Z-Score Outlier Detection

The z-score or standard score of an observation is a metric that indicates how many standard deviations a data point is from the sample’s mean, assuming a gaussian distribution. This makes z-score a parametric method. Very frequently data points are not to described by a gaussian distribution, this problem can be solved by applying transformations to data ie: scaling it.

After making the appropriate transformations to the selected feature space of the dataset, the z-score of any data point can be calculated with the following expression:

            z_value = (value - mean)/standard deviation

Once the z value has been calculated a threshold for acceptable values can be determined. A z value greater than 3 represents a value in the 0.1% percentile. Below is a chart describing the relationship of z value and outliers:

normal_cruve

As seen in the above histograms the data is not perfectly normally distributed. The data mean leans toward zero and has a long tail. In order to transform this data into a better normal distribution we can find the natural log of the values.

Figure 8 Histogram of the Log of the Estimated December 2019 Price for all Properties by Location

As we can see this transformation has allowed us to produce a much better normal distribution for out data.

Figure 9 Histogram of the Log Root of the Estimated December 2019 Price for all Properties County

As we can see this transformation has allowed us to produce a much better normal distribution for out data.

5. Forecasting Stamp Duty

The final client deliverable is an evaluation report which outlines the suitability of using a model, which can forecast monthly stamp duty.

For the evaluation report, please provide the following sections:

The Revenue have provided the following stamp duty formula.

$$ \text{Stamp Duty Due} = \text{Stamp Duty Rate} \times (\text{Sale Price}-(\text{Sale Price} \times \text{VAT})) $$

where, Stamp Duty Rate is assumed to be 1% for all sales and VAT is 13.5%.

Limit the evaluation to the data output from part 2 in the case study.

Forecasting Stamp Duty

In order to properly explore this problem it will be useful to plot stamp duty over the 10 year period.

Methodology & Figure 10 Monthly & Yearly Stamp Duty

As seen earlier in figure 2, the quantity of houses sold follows a seasonsal pattern with a slight upward trend. The mean sale price has also been shown to have a slight upward trend. The yearly trend of stamp duty is therefore upwards. This yearly trend of stamp duty depends on many variables, it is likely to always trend up over long periods of time due to inflation but accurately predicting yearly fluctuations depends on market conditions. A sophisticated prediction of macro economic trends is beyond the scope of this assignment so we will assume that this upward trend continues and accept this as a limitation of the study. The seasonal trend depends on the calendar months however, which are a constant and it can therefore be reasonably assumed that this consumer behaviour will continue.

Model selection

These trends point to a seasonal naive model as a baseline. A seasonal naive model predicts the last value of the same season (same month last year) when forecasting. The upward yearly trend can be superimposed onto this to give a final prediction.

Figure 11 Yearly Stamp Duty with Yearly Prediction

The polyline prediction is fitted to the data for the first 4 years, after this the line is prediction of the the stamp duty for the year based on previous year avaliable. This works reasonable well for this task.

Figure 12 Monthly Stamp Duty Prediction Seasonal Naive Model with/without Trend

This figure shows a prediction of the last 6 years of the decade analysed (as I used a minimum of 4 years to produce the trend line for prediction). The seasonal naive model looks to perform well from a visual stand point. Evaluation metrics will be provided below. It is hard to tell visually whether adding the yearly trend % change multiplier helped the prediction. This will also be evaluated below.

Summary of Findings

The seasonal naive model does a good job of predicting future stamp duty and can be used with an expectation that it will perform with an average absolut percentage error of 20%.

The model performed better without the trend prediction added when measured by root mean square error. The model performed better with the trend prediction added when measured by mean absolute percentage error. This highlights that the trend prediction is not adequate and needs to be improved.

More data can be included to better produce predictions for the yearly trend in property sales quantity and prices. This prediction requires a broader look at macro economic trends.