Data extraction is the process of obtaining data from a database or SaaS platform so that it can be replicated to a destination — such as a data warehouse — designed to support online analytical processing (OLAP). Data extraction is the first step in a data ingestion process called ETL — extract, transform, and load.
The goal of ETL is to prepare data for analysis or business intelligence (BI).
Suppose an organization wants to monitor its reputation in the marketplace. It may have data from many sources, including online reviews, social media mentions, and online transactions. An ETL tool can extract data from these sources and load it into a data warehouse, where it can be analyzed and mined for insights into brand perception.
Types of data extraction
Extraction jobs may be scheduled, or analysts may extract data on demand as dictated by business needs and analysis goals. Data can be extracted in three primary ways:
Update notification
The easiest way to extract data from a source system is to have that system issue a notification when a record has been changed. Most databases provide a mechanism for this so that they can support database replication (change data capture or binary logs), and many SaaS applications provide webhooks, which offer conceptually similar functionality.
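Below is a minimal sketch of a receiver for such notifications, assuming a hypothetical SaaS platform that POSTs JSON change events to a webhook URL you register. The endpoint path and payload fields are illustrative, not any specific vendor’s format.

```python
# Minimal webhook receiver for update notifications (a sketch, not a
# production design). Assumes a hypothetical SaaS platform that POSTs a
# JSON body like {"object": "contact", "id": "123", "action": "updated"}.
from flask import Flask, request, jsonify

app = Flask(__name__)
change_queue = []  # stand-in for a durable queue (e.g., Kafka or SQS)

@app.route("/webhooks/changes", methods=["POST"])
def handle_change():
    event = request.get_json(force=True)
    # Record which object changed; a downstream worker fetches the full
    # record from the source and writes it to the destination.
    change_queue.append((event["object"], event["id"], event["action"]))
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=8080)
```

A production receiver would also verify the webhook’s signature and persist events durably before acknowledging them.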
Incremental extraction
Some data sources are unable to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of those records. During subsequent ETL steps, the data extraction code needs to identify and propagate changes. One drawback of incremental extraction is that it may not be able to detect deleted records in source data because there’s no way to see a record that’s no longer there.
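A common way to implement incremental extraction is a high-water-mark query against a last-modified column. The sketch below assumes the source table has an updated_at timestamp; the table and column names are illustrative, and SQLite keeps the example self-contained.

```python
# Incremental extraction via a high-water-mark bookmark.
import sqlite3

def extract_incremental(conn, last_synced_at):
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_synced_at,),
    ).fetchall()
    # Advance the bookmark so the next run resumes where this one ended.
    # Note: hard-deleted rows never match this filter, which is the
    # drawback described above.
    new_bookmark = rows[-1][2] if rows else last_synced_at
    return rows, new_bookmark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', '2023-01-02T00:00:00')")
changed, bookmark = extract_incremental(conn, "2023-01-01T00:00:00")
```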
Full extraction
The first time you replicate any source you have to do a full extraction. Beyond that, some data sources have no way to identify changed data, so reloading the whole table may be the only way to get data from them. Because full extraction involves high data transfer volumes, which can put a load on the network, it’s best avoided whenever an incremental approach is possible.
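For contrast with the incremental sketch above, a full extraction is simply a complete snapshot of the table (again with illustrative names):

```python
# Full extraction sketch: reload the entire table on every run.
# Simple, and it captures deletes implicitly, but the transfer
# volume grows with the table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")

snapshot = conn.execute("SELECT * FROM customers").fetchall()
# Downstream, the destination table is typically truncated and reloaded
# (or swapped atomically) with this snapshot.
```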
The data extraction process
Whether the source is a database or a SaaS platform, the data extraction process involves the following steps:
1. Check for changes to the structure of the data, including the addition of new tables and columns. Changed data structures have to be dealt with programmatically (see the sketch after this list).
2. Retrieve the target tables and fields from the records specified by the integration’s replication scheme.
3. Extract the appropriate data, if any.
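Here is a sketch of the first step: detecting structural changes by comparing the source’s current columns with the set recorded on the previous run. It uses SQLite’s PRAGMA table_info for brevity; a real integration would query the source’s catalog (for example, information_schema) instead.

```python
# Detect added or dropped columns relative to the last-known schema.
import sqlite3

def current_columns(conn, table):
    # PRAGMA table_info returns (cid, name, type, notnull, dflt, pk) rows.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

def detect_schema_changes(conn, table, known_columns):
    cols = current_columns(conn, table)
    added, dropped = cols - known_columns, known_columns - cols
    if added or dropped:
        # Deal with the change programmatically: add new columns to the
        # destination, flag dropped ones for review, and so on.
        print(f"{table}: added={added} dropped={dropped}")
    return cols  # persist as the new known schema for the next run

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
known = detect_schema_changes(conn, "customers", {"id", "name"})
```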
Extracted data is loaded into a destination that serves as a platform for BI reporting, such as a cloud data warehouse like Amazon Redshift, Microsoft Azure SQL Data Warehouse, Snowflake, or Google BigQuery. The load process needs to be specific to the destination.
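The sketch below shows the general shape of a load step. Real warehouse loads are destination-specific, for example Redshift’s COPY from Amazon S3 or BigQuery load jobs; SQLite stands in here only to keep the example runnable, and the schema is illustrative.

```python
# Load extracted rows into a destination table.
import sqlite3

def load(dest, rows):
    dest.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER, name TEXT, updated_at TEXT)"
    )
    dest.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    dest.commit()

dest = sqlite3.connect("warehouse.db")
load(dest, [(1, "Ada", "2023-01-02T00:00:00")])  # rows from extraction
```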
[Diagram: ELT – extract, load, transform, analyze]
API-specific challenges
While it may be possible to extract data from a database using SQL, the extraction process for SaaS products relies on each platform’s application programming interface (API). Working with APIs can be challenging (a sketch of a typical extraction loop follows the list below):
APIs are different for every application.
Many APIs are poorly documented, even those from reputable, developer-friendly companies.
APIs change over time. For example, Facebook’s “move fast and break things” approach means the company frequently updates its reporting APIs – and Facebook doesn’t always notify API users in advance.
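As a concrete illustration of what that work looks like, here is a sketch of an extraction loop for a hypothetical REST API with cursor pagination and rate limiting. The endpoint, parameters, and response shape are invented, since every real API differs.

```python
# Paginated extraction from a hypothetical SaaS REST API, with basic
# handling of HTTP 429 rate-limit responses.
import time
import requests

def extract_api(base_url, token):
    records, cursor = [], None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(
            f"{base_url}/v1/records",
            headers={"Authorization": f"Bearer {token}"},
            params=params,
            timeout=30,
        )
        if resp.status_code == 429:
            # Rate limited: honor Retry-After if present, then retry.
            time.sleep(int(resp.headers.get("Retry-After", "5")))
            continue
        resp.raise_for_status()
        page = resp.json()
        records.extend(page["data"])
        cursor = page.get("next_cursor")
        if not cursor:
            return records
```

When a provider changes its API, as in the Facebook example above, code like this is exactly what breaks, which is why extraction pipelines need ongoing monitoring and maintenance.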