Melbourne Housing Market Analysis: An Exploration using R and Power BI

The aim of this portfolio project is to demonstrate my ability to leverage the R language and Power BI for data analysis and visualisation. The dataset under scrutiny is the Melbourne Housing Market dataset, sourced from Kaggle. It provides comprehensive information about property sales in Melbourne, including key details such as property type, location, size, price, and date of sale. This project walks through my process of data exploration, cleaning, visualisation, and interpretation, showcasing my analytical approach and proficiency with R and Power BI.

Harry Bui, 2023


0. Collecting and Importing the Data

The initial step in the analysis process involves gathering and importing the dataset. For this project, I utilised the Melbourne Housing Market dataset available on Kaggle (link). This dataset offers an extensive record of property sales in Melbourne, encapsulating crucial details such as the property type, location, size, sale price, and sale date. The data is in CSV format, making it easy to import into my R environment for further analysis.

1. Data Understanding and Exploration

The first analytical step after importing the data involves understanding and exploring the dataset. I aimed to familiarise myself with the data's structure, its various attributes, and a preliminary statistical summary. I began by scrutinising the first few rows of the data using the head() function, which gives a snapshot of the dataset's structure and the information contained within each attribute.

Next, I obtained the dataset's dimensions with the dim() function, which gives the number of rows (individual property sales records) and columns (property attributes). For a more comprehensive statistical overview, I used the summary() function, which produces a statistical summary for each attribute: the minimum, maximum, median, mean, and quartile values for numerical attributes, and basic length and class information for attributes stored as text.

Here's the code snippet I used:

  # Load required libraries
  library(tidyverse)
  library(lubridate)
  
  # Load the data
  df <- read.csv('Melbourne_housing_FULL.csv')
  
  # Display the first few rows of the dataframe
  head(df)
  
  # Get the shape of the dataset
  dim(df)
  
  # Get the summary of the dataset
  summary(df)       
  
[Figure: output of head(), dim(), and summary() on the raw data]

The data consists of 21 columns and 34,857 rows:

  • Suburb: Name of the suburb
  • Address: Address of the property
  • Rooms: Number of rooms in the property
  • Type: Type of the property (h - house, u - unit, etc.)
  • Price: Price of the property in Australian dollars
  • Method: Method of selling the property (S - property sold, SP - property sold prior, etc.)
  • SellerG: Name of the real estate agent
  • Date: Date when the property was sold
  • Distance: Distance of the property from the Central Business District (CBD) in kilometres
  • Postcode: Postal code of the property
  • Bedroom2: Number of bedrooms in the property (from a different source)
  • Bathroom: Number of bathrooms in the property
  • Car: Number of car spots in the property
  • Landsize: Size of the land in square metres
  • BuildingArea: Size of the building in square metres
  • YearBuilt: Year when the house was built
  • CouncilArea: Name of the governing council for the area
  • Lattitude: Latitude of the property
  • Longtitude: Longitude of the property
  • Regionname: Name of the general region (West, North West, North, North east, etc.)
  • Propertycount: Number of properties that exist in the suburb
The data types of the columns in the data frame are as follows:

[Figure: data type of each column]
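
Rather than relying on a screenshot, the column types can be checked directly in R. A minimal sketch, assuming df has been loaded as in the snippet above:

  # Class of every column, e.g. to spot numbers or dates stored as text
  sapply(df, class)

  # Compact overview of types together with the first few values
  glimpse(df)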

From the initial exploration, I can observe the following:

  • There are missing values in several columns such as 'Price', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname', and 'Propertycount'. I will need to handle these missing values in the data cleaning phase.
  • The 'Date' column was read in as text (character). I will need to convert it to a proper date type for time-based analysis.
  • The 'Rooms' column has a minimum value of 1 and a maximum value of 16. This looks reasonable.
  • The 'Price' column has a minimum value of 85,000 and a maximum value of 11,200,000. This is a wide range and I might want to check for outliers.
  • The 'Distance' column has a minimum value of 0 and a maximum value of 48.1. This also looks reasonable, but I might want to check for outliers.
  • The 'Landsize' column has a minimum value of 0 and a maximum value of 433,014. This is a wide range and might contain outliers.
  • The 'BuildingArea' column also has a minimum value of 0 and a maximum value of 44,515. This could also contain outliers.
  • The 'YearBuilt' column has a minimum value of 1196 and a maximum value of 2106. The minimum value seems very old for a house, and the maximum value is in the future. These could be errors in the data.
  • The exploration phase highlighted several areas for further investigation, such as missing values, potential outliers, and data type inconsistencies, which I address in the next step. A few quick range checks supporting these observations are sketched below.
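
Before cleaning, a few quick range checks make these suspicions concrete. This is a minimal sketch; the cut-offs used to flag implausible construction years (before 1850 or after 2018) are illustrative choices of mine, not part of the original analysis:

  # Ranges of the numeric columns flagged above
  summary(df[, c("Rooms", "Price", "Distance", "Landsize", "BuildingArea")])

  # Coerce YearBuilt in case it was read in as text, then count implausible years
  year_built <- suppressWarnings(as.numeric(as.character(df$YearBuilt)))
  sum(year_built < 1850, na.rm = TRUE)   # suspiciously old entries such as 1196
  sum(year_built > 2018, na.rm = TRUE)   # future years such as 2106

  # Largest land sizes, to eyeball potential outliers
  head(sort(df$Landsize, decreasing = TRUE), 10)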

2. Data Cleaning and Preprocessing

A crucial stage in any data analysis pipeline involves cleaning and preprocessing the data. This step ensures that the dataset is free of inconsistencies, errors, and missing values, thereby enabling more accurate analysis and reliable conclusions.

My initial exploration revealed several areas that required attention:

  • Missing Values: I noticed that several columns, such as 'Price', 'Distance', 'Postcode', 'CouncilArea', 'Regionname', and 'Propertycount', among others, had missing values.
  • Data Types: The 'Date' column was stored as text rather than as a date, and the 'Distance' column was not numerical. Both needed to be converted for effective analysis.
Here's how I tackled these issues:

  1. Missing Values: I took a strategic approach to handling missing values. For the 'Price' column, as it's crucial to my analysis, I decided to drop rows with missing values rather than filling them with averages or medians, to avoid potential biases. For 'Distance', 'Postcode', 'CouncilArea', 'Regionname', and 'Propertycount', the missing values were relatively few compared to the dataset's size, so I dropped these rows as well. For the remaining columns with substantial missing values, I opted to keep the rows and leave the values missing for this portfolio project.
  2. Data Types: I converted the 'Date' column to a date type and split it into separate 'Year', 'Month', and 'Day' columns for more detailed time series analysis. The 'Distance' column was converted to a numerical type.

Here's the R code snippet showcasing my data cleaning and preprocessing steps:

      # Check for missing values
      colSums(is.na(df))
          
      # Drop the rows with missing values 
      df_cleaned <- df %>% drop_na(Price, Distance, Postcode, CouncilArea, Regionname, Propertycount)
          
      # Check the count of missing values in each column again
      colSums(is.na(df_cleaned))
         
      # Convert 'Date' column to date type
      df_cleaned$Date <- dmy(df_cleaned$Date)
          
      # Split 'Date' into 'Year', 'Month', and 'Day' columns
      df_cleaned$Year <- year(df_cleaned$Date)
      df_cleaned$Month <- month(df_cleaned$Date)
      df_cleaned$Day <- day(df_cleaned$Date)
          
      # Drop the original 'Date' column
      df_cleaned <- df_cleaned %>% select(-Date)
          
      # Convert 'Distance' column to numerical type
      # (via as.character in case the column was read in as a factor)
      df_cleaned$Distance <- as.numeric(as.character(df_cleaned$Distance))

After these data cleaning and preprocessing steps, my dataset was ready for further analysis and visualisation. This process helps ensure the reliability and accuracy of the subsequent steps.
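
To move from R to Power BI, the cleaned data has to reach the report. One straightforward handoff, sketched below, is to verify the cleaned data frame and write it back out as a CSV that Power BI uses as its source; the verification calls and the output file name are my own additions rather than part of the original project:

  # Confirm the cleaning behaved as expected
  dim(df_cleaned)
  colSums(is.na(df_cleaned))
  class(df_cleaned$Distance)   # should now be "numeric"

  # Export the cleaned data for use as a Power BI data source (file name is illustrative)
  write.csv(df_cleaned, 'Melbourne_housing_cleaned.csv', row.names = FALSE)

Power BI Desktop can also run an R script directly as a data source, which avoids the intermediate file.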

3. Data Visualisation

Visualisations provide a powerful tool for understanding data, revealing patterns, trends, and insights that may not be apparent from raw data. For this project, I turned to Power BI, a robust platform offering a suite of interactive visualisation capabilities.

The report leverages several visualisation types, each designed to illuminate different facets of the housing market data:

  • Choropleth Map: This visualisation offers a geographical representation of median property prices across Melbourne suburbs. It enables users to easily identify areas with higher and lower median prices.
  • Line Chart: The line chart traces the price trends over time for houses, units, and townhouses, providing insights into market dynamics and fluctuations.
  • Pie Chart: By depicting the sales distribution among different property types, the pie chart gives a clear picture of market preferences.
  • Table: This table highlights the top 20 suburbs experiencing the most significant price increments, serving as a useful tool for potential investors.
  • Slicer: The slicer enhances the dashboard's interactivity, allowing users to customize data visualisations based on the number of rooms.
[Figure: the Power BI report]
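
For readers who prefer code to screenshots, the aggregations behind these visuals can be approximated in R using the cleaned data and the tidyverse packages loaded earlier. This is a rough sketch; the grouping and summary choices are mine and only mirror what the Power BI measures compute:

  # Median price by suburb (the figure behind the choropleth map)
  median_by_suburb <- df_cleaned %>%
    group_by(Suburb) %>%
    summarise(MedianPrice = median(Price, na.rm = TRUE), .groups = "drop") %>%
    arrange(desc(MedianPrice))

  # Average price per month and property type (the line chart)
  price_trend <- df_cleaned %>%
    group_by(Year, Month, Type) %>%
    summarise(AvgPrice = mean(Price, na.rm = TRUE), .groups = "drop")

  # Share of sales by property type (the pie chart)
  sales_share <- df_cleaned %>%
    count(Type) %>%
    mutate(Share = n / sum(n))

  # Suburbs with the largest change in median price between 2016 and 2017 (the table)
  price_change <- df_cleaned %>%
    filter(Year %in% c(2016, 2017)) %>%
    group_by(Suburb, Year) %>%
    summarise(MedianPrice = median(Price, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider(names_from = Year, values_from = MedianPrice, names_prefix = "Y") %>%
    filter(!is.na(Y2016), !is.na(Y2017)) %>%
    mutate(Change = Y2017 - Y2016) %>%
    arrange(desc(Change)) %>%
    slice_head(n = 20)

Filtering the data frame first, for example filter(df_cleaned, Rooms == 3), plays the same role as the room-count slicer in the report.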

This is an interactive report that allows me to select and examine specific data. For instance, when I select the option for 3-bedroom properties, the dashboard narrows its focus, displaying data exclusively for these properties.

[Figure: the dashboard filtered to 3-bedroom properties]

The dashboard reveals the following insights:

  • Southern and eastern Melbourne suburbs command higher median prices, followed by northern and western ones.
  • Albert Park, Middle Park, and Malvern are the priciest suburbs, each boasting a median price exceeding 2 million AUD for 3-bedroom properties.
  • The price for 3-bedroom properties fluctuated in 2016, drastically dropped in the first half of 2017, and then resumed fluctuating.
  • While townhouses follow a similar price trend as houses, the price gap between them diminished in 2017 compared to 2016.
  • Seasonality is evident, with prices often peaking at the year's beginning and end, and dipping mid-year.
  • Houses constitute the majority of sales (77.7%), followed by townhouses (13.2%) and units (9.0%).
  • Parkville, Strathmore Heights, and Gowanbrae witnessed substantial price increments, indicating a robust real estate market during this period.
Here is the dashboard when switching to 2-bedroom properties:

[Figure: the dashboard filtered to 2-bedroom properties]

The dashboard uncovers different market dynamics:

  • The average price for 2-room houses, townhouses, and units has generally been on an upward trajectory. Houses recorded the most pronounced growth, followed by units, and then townhouses.
  • Units represent the majority of 2-room property sales (51.4%), hinting at their popularity among smaller households or investors seeking more affordable options.
  • Ivanhoe East, Eaglemont, and Doncaster saw significant price increments from 2016 to 2017, indicative of a thriving real estate market.
  • Despite lower sales volumes, 2-room houses showed considerable price growth in certain suburbs, suggesting potential opportunities for capital growth.
The dashboard offers a comprehensive overview of Melbourne's real estate. I observed distinct trends and variations in the market, depending on property type and suburb location. For 3-bedroom properties, houses dominated the market, while for 2-bedroom properties, units were more popular. I also identified certain suburbs where prices grew substantially over a year, highlighting potential hotspots for investment. These insights can be invaluable for homebuyers, investors, and policymakers alike, helping them make informed decisions. However, real estate investment is multifaceted, and it's crucial to consider other factors such as economic outlook, infrastructure development, population growth, and personal investment goals.

4. Conclusion

The data analysis and visualisations presented in this report provide in-depth insights into Melbourne's housing market dynamics, with a specific focus on properties with 2 and 3 bedrooms. The dashboards designed in Power BI demonstrate the power of data visualisation in understanding complex data and uncovering hidden trends and patterns.

The analysis reveals that property type and location significantly impact property prices. Suburbs like Albert Park, Middle Park, and Malvern command high prices for 3-bedroom properties, while for 2-bedroom properties, units are the most sold, and suburbs like Ivanhoe East, Eaglemont, and Doncaster saw the strongest price growth. The observed seasonality in prices suggests that timing plays an important role in property investment.

While this analysis provides valuable insights, it is important to note that property investment decisions should be based on a holistic analysis that takes into account a range of factors, including individual investment goals, risk appetite, market conditions, and expert advice.

This portfolio project underscores my ability to handle complex datasets, clean and preprocess data, and use data visualisation tools like Power BI to present data in an insightful and understandable manner. The project also showcases my ability to derive actionable insights from data that can guide decision-making in real-world scenarios such as property investment. With a strong foundation in R programming and data visualisation, I am well-equipped to tackle complex data analysis tasks and provide meaningful insights.