Building a spatial data warehouse for real estate data.

Jules Huisman
4 min readJul 1, 2021

--

An output visualization of the data warehouse; An analysis of distances to the nearest bus stop.

This is the blog post accompanying my master thesis “Integrating geospatial real estate data”.

What is the challenge?

There are a lot of open data sources that can be used for analysis in the real estate sector. Ideally, you would want one unified view of all these different sources. However, multiple hurdles arise when companies try to use this data.

First, most companies do not have the infrastructure in place to handle bigger data workloads. Most of them are currently using Microsoft Excel for their data work, severely limiting the scope of their analyses.

Second, there is a lot of heterogeneity in the external spatial sources, different file types, different file locations, different coordinate systems, etc. This makes it difficult to join these sources together.

Third, you would like to store all this data in a central data warehouse. But how do you model all these sources to allow for easy access?

Spatial data integration

This is where my research fits in, I designed and built an end-to-end data pipeline for spatial analytics in the real estate sector.

Existing research showed a multitude of difficulties when ingesting spatial sources (as mentioned above). To solve this problem I designed a custom ingestion module that allows practitioners to easily ingest different spatial sources to their heart's desire.

The ingestion module is based on an open specification (the singer specification) and allows you to ingest almost all spatial file types while simultaneously giving you the ability to transform the data to any coordinate system you want.

To make this a bit more concrete, below you can see an example of the configuration needed to ingest all liveability data of the Netherlands for 2014, 2016, and 2018. Looks simple right ;)

Ingestion module configuration example

The ingestion module can be pointed to any storage system your company uses: Postgres, BigQuery, Redshift, Snowflake, Excel, Google Sheets, S3, etc. For my thesis, I ingested five different sources and pointed them to Google BigQuery, a cloud data warehouse.

The architecture of the integration system.

Data modeling

After the data has landed in the warehouse it needs to be modeled. A dimensional modeling technique was researched and proved to be a well-suited technique for this use case.

The image below shows a visualization of the data in the warehouse. This data model allows you to perform a wide range of queries on the data. Some examples include:

How are the energy labels evolving for each municipality year over year?

What is the total surface area of offices in the Amsterdam?

Where are new residential projects being built where no supermarket is built yet?

In which neighbourhoods is the liveability going down?

A model of the data in the data warehouse.

Query results

For my thesis, I answered different questions using this warehouse. Below are some results of these queries.

What is the distance of houses to the nearest bus stop?

An analysis of distances to the nearest bus stop. LIVE PREVIEW

All residential development projects and their distance to the nearest supermarket.

An analysis of residential development projects and existing supermarkets. LIVE PREVIEW

All offices with an energy label lower than C that are located in high liveability neighbourhoods.

All offices with an energy label lower than C, that are located in good neighborhoods. LIVE PREVIEW

Conclusion

A spatial warehouse like this can help real estate companies employ a data strategy, something that is often lacking in their current model. It can be the gateway to using data analytics or even data science in their business processes. These techniques can help real estate companies improve their current processes, or even create new business processes in the future.

Sources

The geospatial integration framework can be found here: Github

The ingestion module can be found here: Github

--

--

Jules Huisman
Jules Huisman

No responses yet