Monday, October 29, 2007

ETL: Fundamental to Spatial Analysis and Sharing of Data

Extract, Transform, and Load

ETL or Extract, Transform, and Load is a process in Data Warehousing. Data Warehouses are all around us - whether we are querying the Yellow Pages on-line to geo-coding our datasets, there is most often a database or data warehouse behind the scenes.

But how we get data into the warehouse, is called ETL.

This is a short introduction to ETL as it is an important part of data warehousing. I also cover a little about Spatial ETL.

Extract

Extract is where we begin this story. Essentially, in this step, we extract the data from source systems. A source system is where the data originates. It may be a well database, an address database, a MSExcel CSV file or almost anything you can think of.

A data warehouse ends up consolidating the data from different source systems. Each separate source system (in many cases) use a different data organization or format. Common data source formats are relational databases and flat files.

Transform

As our story continues, we run into the transform stage. In this chapter, we want to apply a series of rules or functions to the extracted data from the source to derive the data to be loaded to the end target.

If we are fortunate in having clean data (this rarely happens!) the data source will require very little or even no manipulation of the data. The most common scenario is that one or more of the following transformations need to be done to meet the business and technical needs of the end users.

For example, some of the transformations that may occur are (taken from Wikipedia - a good listing of transformations):

  • Selecting only certain columns to load
  • Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female), this is called automated data cleansing; no manual cleansing occurs during ETL
  • Encoding free-form values (e.g., mapping "Male" and "1" and "Mr" into M)
  • Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
  • Joining together data from multiple sources (e.g., lookup, merge, etc.)
  • Summarizing multiple rows of data (e.g., total sales for each store, and for each region)
  • Generating surrogate key values
  • Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
  • Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
  • Applying any form of simple or complex data validation; if failed, a full, partial or no rejection of the data, and thus no, partial or all the data is handed over to the next step, depending on the rule design and exception handling.

Load

After the data has been transformed, cleansed, it is now loaded into the data warehouse.

Spatial ETL

When we apply the above principles to spatial data, we call it Spatial ETL.

As we know, spatial data can suffer tremendously in accuracy, data formats, projections, datum's, etc.

With the advent of WFS and WMS and other Web Services and definitions - knowing how accurate and clean our data is is important - especially since in Web Services we are sharing data between systems.

A common method in the Well-known text (WKT). WKT is a text markup language for representing vector geometry objects on a map. It also relates to spatial reference systems of spatial objects and of the transformations between spatial reference systems. A binary equivalent, known as well-known binary (WKB) is used to transfer and store the same information on databases, such as PostGIS. The formats are regulated by the Open Geospatial Consortium (OGC) and described in their Simple Feature Access and Coordinate Transformation Service specifications.

Frank Warmerdam, the creator and maintainer of the PRO.4 and GDAL libraries of which I've talked about before covers quite well, WKT implementations and things to be aware of in several GIS products (some of the information is out of date - most users are now using Arc 9 and above).

Quoting below (from here):

  • Oracle Spatial (WKT is used internall in MDSYS.WKT, loosely SFSQL based)
  • ESRI - The Arc8 system's projection engine uses a roughly simple features compatible description for projections. I believe ESRI provided the WKT definition for the simple features spec.
  • Cadcorp - Has the ability to read and write CT 1.0 style WKT. Cadcorp wrote the CT spec.
  • OGR/GDAL - reads/writes WKT as it's internal coordinate system description format. Attempts to support old and new forms as well as forms from ESRI.
  • FME - Includes WKT read/write capabilities built on OGR.
  • MapGuide - Uses WKT in the SDP data access API. Roughly SF compliant.
  • PostGIS - Keeps WKT in the spatial_ref_sys table, but it is up to clients to translate to
  • PROJ.4 format for actual use. I believe the spatial_ref_sys table is populated using OGR generated translations.

ETL and Spatial ETL - I'll cover more of how data warehousing and GIS and internet mapping can be tied together at a later date.

If you are really interested, take a look at the MapBender project mentioned in earlier posts. It ties together many of the concepts of ETL and data sharing quite well.

If you have ideas for Blog posts that relate to GIS, Datum's, Map Projections, Oracle, Data Warehousing, let me know. Some ideas for future blogs are:

  • The African Geoid Project
  • Map Projections of the Middle East
  • Determining Mean Sea Level
  • NTv2 files and how to create them

Let me know. Contact me via e-mail.