

Home Sweet Home
My Role: Data Analyst
Tools: R | JMP | Tableau | Excel
Date: Winter 2024–25
01 | Project Summary
Real estate pricing is often opaque and unpredictable, leaving buyers and sellers uncertain about what truly drives home values. This case study analyzed 22,000+ home sales in King County, WA, using features like square footage, year built, assessed values, and location to build a predictive model.
The goal: create a reliable tool for estimating sale prices and identifying key value drivers.

02 | Process & Iteration
I divided the project into three key phases:
Exploration
- Conducted univariate analysis to assess variable distributions.
- Identified ImpsVal and LandVal as high-variance features with strong influence on price.
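A minimal sketch of this kind of univariate check, in Python rather than the R/JMP tooling listed above. The column names ImpsVal and LandVal come from the case study; the `summarize` helper and the sample values are hypothetical and only illustrate how spread can be compared across columns.

```python
import statistics


def summarize(values):
    """Return basic univariate statistics for one numeric column."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return {
        "n": len(values),
        "mean": mean,
        "median": statistics.median(values),
        "stdev": stdev,
        # Coefficient of variation: spread relative to the mean,
        # so columns on different scales can be compared.
        "cv": stdev / mean,
    }


# Hypothetical sample values; the real dataset has 22,000+ sales.
sample = {
    "ImpsVal": [150_000, 220_000, 310_000, 95_000, 640_000],
    "LandVal": [80_000, 120_000, 400_000, 60_000, 900_000],
}

summaries = {name: summarize(col) for name, col in sample.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

A high coefficient of variation flags a column whose values swing widely relative to their mean, which is one simple way to spot "high-variance" features before modeling.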
Cleaning & Engineering
- Grouped variables into categorical bins (e.g., building grade, year built, lot size)
- Engineered a new Region feature by clustering ZIP codes
- Removed extreme outliers to reduce modeling noise
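The three cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the grade cut points, the ZIP-to-region lookup (standing in for the output of a clustering step), the field names, and the 1.5 × IQR outlier rule are all hypothetical, since the case study does not specify them.

```python
import statistics

# Hypothetical lookup standing in for the output of a ZIP-code clustering step.
ZIP_TO_REGION = {
    "98101": "Urban Core",
    "98004": "Eastside",
    "98042": "South County",
}


def bin_grade(grade):
    """Collapse a numeric building grade into a coarse category (assumed cut points)."""
    if grade <= 6:
        return "low"
    if grade <= 9:
        return "mid"
    return "high"


def iqr_fences(values, k=1.5):
    """Return (lower, upper) outlier fences using the k * IQR rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


# Hypothetical sales rows; field names are illustrative.
sales = [
    {"zip": "98101", "grade": 7, "price": 400_000},
    {"zip": "98101", "grade": 8, "price": 450_000},
    {"zip": "98004", "grade": 10, "price": 500_000},
    {"zip": "98004", "grade": 9, "price": 550_000},
    {"zip": "98042", "grade": 6, "price": 600_000},
    {"zip": "98042", "grade": 5, "price": 650_000},
    {"zip": "98199", "grade": 8, "price": 700_000},
    {"zip": "98004", "grade": 12, "price": 5_000_000},  # extreme outlier
]

low, high = iqr_fences([row["price"] for row in sales])

cleaned = [
    {
        **row,
        "grade_bin": bin_grade(row["grade"]),
        "region": ZIP_TO_REGION.get(row["zip"], "Other"),
    }
    for row in sales
    if low <= row["price"] <= high  # drop rows outside the fences
]

print(len(sales), "rows in,", len(cleaned), "rows kept")
```

Binning and region assignment happen in the same pass as outlier filtering here only for brevity; on a real dataset it is usually easier to audit each step separately.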
Model Building
I tested five model iterations, refining features and evaluating performance:

03 | Visual Artifacts
For a deeper dive into visual and statistical exploration, view the full
Key visuals include:
- Map of Average Sale Prices by ZIP Code
- Scatter Plot of Land Value vs. Sale Price
- Time Series Line Graph: Price trends for new vs. existing homes
- Bar Chart: Sale prices by ZIP Code and construction status
- Histograms & Bivariate Plots of key features (e.g., SqFtTotLiving, BldgGrade)
04 | Findings
Land and Improvement Value dominate. Together, these two features explained over 90% of variance in sale prices.
Living space had a moderate impact. SqFtTotLiving showed an R² of 0.54 but was not in the top-performing model.
Location underperformed. ZIP-derived Region added little value—likely because land and improvement values already encapsulated geographic influence.
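For readers less familiar with R², the figures quoted above measure the share of variance in sale price that a fitted model explains: R² = 1 - SS_res / SS_tot. The sketch below computes it for a one-predictor linear fit. The `fit_line` and `r_squared` helpers and the sample numbers are hypothetical and are not the project's data or model.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data: living area in square feet and sale price in dollars.
sqft = [900, 1400, 1800, 2200, 3000]
price = [310_000, 420_000, 400_000, 610_000, 690_000]

slope, intercept = fit_line(sqft, price)
r2 = r_squared(sqft, price, slope, intercept)
print(f"R^2 = {r2:.2f}")
```

An R² near 1 means the predictor accounts for almost all of the variation in price, while a value like the 0.54 reported for SqFtTotLiving leaves roughly half of the variation unexplained by that feature alone.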
What I Learned
Data ≠ assumptions: I expected region to be a top predictor, but assessor values told a different story.
Iteration matters: Ensemble models like Bootstrap Forest dramatically improved performance.
Clarity is power: Visual artifacts and simple tables were essential for sharing insights with non-technical audiences.
Next Steps
1) Model seasonal patterns using temporal variables (e.g., Ym)
2) Test the model against post-pandemic real estate trends
3) Use RapidMiner to explore:
- Neural networks for capturing nonlinear dynamics.
- Cross-validation to detect and limit overfitting.
- Correlation & multicollinearity analysis for feature refinement.
- Association rules to uncover actionable buyer behavior patterns.
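To illustrate the cross-validation idea from the list above, here is a small k-fold sketch in plain Python. It is not RapidMiner's operator and not part of the project: the helper names, the one-predictor linear model, and the sample data are assumptions chosen to keep the example short.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(x, y, k=5, seed=0):
    """Average held-out RMSE across k folds."""
    scores = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        # Fit only on the training rows, then score on the held-out rows.
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        sq_err = [(y[i] - (intercept + slope * x[i])) ** 2 for i in held_out]
        scores.append((sum(sq_err) / len(sq_err)) ** 0.5)
    return sum(scores) / len(scores)


# Hypothetical data: living area (sq ft) and sale price (dollars).
sqft = [900, 1100, 1400, 1600, 1800, 2000, 2200, 2500, 2800, 3000]
price = [310_000, 350_000, 420_000, 410_000, 480_000,
         520_000, 610_000, 600_000, 700_000, 690_000]

print(f"5-fold RMSE: {cross_validated_rmse(sqft, price):,.0f}")
```

Because each row is scored by a model that never saw it during fitting, a large gap between this held-out error and the training error is a sign of overfitting.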
These extensions aim to boost predictive accuracy and yield more strategic insights for real estate professionals, while further showcasing my skills with tools like RapidMiner.

