Home Sweet Home

My Role: Data Analyst

Tools: RJMP | Tableau | Excel 

 Date: Winter 2024–25

01 | Project Summary

Real estate pricing is often opaque and unpredictable, leaving buyers and sellers uncertain about what truly drives home values. This case study analyzed 22,000+ home sales in King County, WA, using features like square footage, year built, assessed values, and location to build a predictive model.
 
The goal: create a reliable tool for estimating sale prices and identifying key value drivers.


02 | Process & Iteration

I divided the project into three key phases:

Exploration

Conducted univariate analysis to assess variable distributions.

Identified ImpsVal and LandVal as high-variance features with strong influence on price.
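As a rough illustration of the univariate step, the sketch below summarizes each numeric column and ranks columns by relative spread. The project itself used JMP, Tableau, and Excel, so this Python version is only an analogy: the helper name univariate_summary, the toy values, and the choice of coefficient of variation as the ranking measure are all assumptions, not the original method.

```python
import statistics

def univariate_summary(values):
    """Return basic distribution statistics for one numeric column."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return {
        "n": len(values),
        "mean": mean,
        "median": statistics.median(values),
        "stdev": stdev,
        # Coefficient of variation: spread relative to the mean (unitless)
        "cv": stdev / mean if mean else float("nan"),
    }

# Hypothetical toy values, NOT taken from the King County data
columns = {
    "LandVal": [120_000, 250_000, 90_000, 600_000, 310_000],
    "SqFtTotLiving": [1_400, 2_100, 1_750, 2_600, 1_900],
}

summaries = {name: univariate_summary(vals) for name, vals in columns.items()}

# Rank columns from most to least variable relative to their mean
ranked = sorted(summaries, key=lambda name: summaries[name]["cv"], reverse=True)
print(ranked)
```

Ranking by coefficient of variation rather than raw standard deviation keeps dollar-valued and square-foot columns comparable.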


Cleaning & Engineering

  • Grouped variables into categorical bins (e.g., building grade, year built, lot size)
     

  • Engineered a new Region feature by clustering ZIP codes
     

  • Removed extreme outliers to reduce modeling noise
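Two of the cleaning steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the era cut points, the toy rows, and the 1.5 × IQR fence are illustrative choices, since the case study does not say which bin edges or outlier rule were actually used.

```python
import statistics

def bin_year_built(year):
    """Map a construction year to a coarse era label (cut points are illustrative)."""
    if year < 1950:
        return "pre-1950"
    if year < 1990:
        return "1950-1989"
    return "1990+"

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences using the k * IQR rule (an assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical rows of (year built, sale price); not real King County records
sales = [
    (1925, 410_000), (1968, 455_000), (1984, 390_000), (1999, 520_000),
    (2005, 480_000), (2012, 505_000), (1975, 4_900_000),
]

prices = [price for _, price in sales]
low, high = iqr_bounds(prices)

# Keep rows whose price falls inside the fences, replacing the year with its bin
cleaned = [(bin_year_built(year), price) for year, price in sales if low <= price <= high]
print(len(sales) - len(cleaned), "row(s) removed as outliers")
```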


Model Building

I tested five model iterations, refining features and evaluating performance:

(Table image: performance of the five model iterations)

03 | Visual Artifacts

For a deeper dive into visual and statistical exploration, view the full

Key visuals include:

  • Map of Average Sale Prices by ZIP Code
     

  • Scatter Plot of Land Value vs. Sale Price
     

  • Time Series Line Graph: Price trends for new vs. existing homes
     

  • Bar Chart: Sale prices by ZIP Code and construction status
     

  • Histograms & Bivariate Plots of key features (e.g., SqFtTotLiving, BldgGrade)

04 | Findings

Land and Improvement Value dominate. Together, these two features explained over 90% of the variance in sale prices.
 

Living space had a moderate impact. SqFtTotLiving showed an R² of 0.54 but was not in the top-performing model.
 

Location underperformed. ZIP-derived Region added little value—likely because land and improvement values already encapsulated geographic influence.
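For readers less familiar with R², the sketch below shows one way a single-predictor figure like the 0.54 reported for SqFtTotLiving could be computed: fit a least-squares line, then compare residual variation with total variation. The function name r_squared and the toy numbers are assumptions for illustration and do not reproduce the project's data or results.

```python
def r_squared(x, y):
    """R² of a one-predictor least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Closed-form slope and intercept for simple linear regression
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R² = 1 - (unexplained variation / total variation)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical toy data: living area (sq ft) and sale price (thousands of dollars)
sqft = [1000, 1500, 2000, 2500, 3000]
price = [300, 340, 500, 450, 610]

r2 = r_squared(sqft, price)
print(round(r2, 3))
```

An R² near 1 means the predictor accounts for almost all of the variation in price; a value around 0.5 leaves roughly half unexplained.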

What I Learned

Data ≠ assumptions: I expected region to be a top predictor, but assessor values told a different story.
 

Iteration matters: Ensemble models like Bootstrap Forest dramatically improved performance.
 

Clarity is power: Visual artifacts and simple tables were essential for sharing insights with non-technical audiences.

Next Steps

  1) Model seasonal patterns using temporal variables (e.g., Ym)
 

  2) Test the model against post-pandemic real estate trends
 

  3) Use RapidMiner to explore:
 

  • Neural networks for capturing nonlinear dynamics.
     

  • Cross-validation to reduce overfitting.
     

  • Correlation & multicollinearity analysis for feature refinement.
     

  • Association rules to uncover actionable buyer behavior patterns.
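The cross-validation idea in the list above can be sketched without any particular tool. This is a hedged, minimal k-fold example in plain Python, not RapidMiner's implementation: the helper names, the one-predictor linear model, and the toy data are all assumptions chosen to keep the example short.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def cross_val_rmse(x, y, k=5, seed=0):
    """Average held-out RMSE across k folds."""
    scores = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        # Fit only on the training rows, then score on the held-out fold
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        mse = sum((y[i] - (intercept + slope * x[i])) ** 2 for i in held_out) / len(held_out)
        scores.append(mse ** 0.5)
    return sum(scores) / len(scores)

# Hypothetical toy data: living area (sq ft) and sale price (thousands of dollars)
sqft = [900, 1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600]
price = [260, 310, 335, 420, 450, 470, 560, 590, 640, 700]

print(round(cross_val_rmse(sqft, price, k=5), 1))
```

Because every row is scored by a model that never saw it during fitting, the averaged error is a more honest estimate of out-of-sample performance than the training error.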


These extensions aim to boost predictive accuracy and yield more strategic insights for real estate professionals, while further showcasing my skills with tools like RapidMiner.
