4 Layers of a Modern Data Pipeline

Published on Feb 5, 2020

Below is a visual representation of the four layers that make up a modern data pipeline. I included a chart that shows common examples for each layer in the AWS stack, as well as examples of open-source alternatives.

Typically, your modern data flow starts at the bottom (storage) and flows to the top (analytics). However, that is not always the case. Some traditional pipelines don't store data until it has been ingested and crunched. We will address the reasons for this difference in a later article about ETL vs ELT.

The remainder of the article explains each layer in more detail and provides additional examples for each. You can mix and match components within each layer, as long as you integrate them all together.

[Image: the four layers of the big data stack]

Layer       AWS examples          Open-source examples
Analytics   QuickSight            Metabase / Apache Superset
Crunching   EMR / Redshift        Hadoop / Pandas
Plumbing    Glue / Transfer       Apache Airflow / NiFi / SFTP
Storage     S3 / RDS / DynamoDB   File system / MySQL / Mongo

The layers in detail:

1. The Data Layer: Storage

Key Concepts

  • The data layer is the foundation of your pipeline
  • All of your "raw" data lives in the data layer, which can also hold your processed data
  • There can be different components that make up your data layer
    • For example, you might load all of your raw data into a cheap storage location before processing it. This is often something like S3.
    • Then you might load your processed data into a relational database like RDS (both steps are sketched below, after this list).
    • Both of these components are a part of your data layer
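
To make that two-part data layer concrete, here is a minimal sketch in Python. It is illustrative only: the bucket name, file names, and database credentials are hypothetical, and it assumes boto3 and psycopg2 are installed.

```python
# Minimal two-part data layer sketch: raw files land in S3, processed rows go to RDS.
# The bucket name, paths, and connection details below are hypothetical placeholders.
import boto3
import psycopg2

# 1. Land the raw file in cheap object storage, untouched.
s3 = boto3.client("s3")
s3.upload_file("sales_2020-02-05.csv", "my-raw-data-bucket", "raw/sales/2020-02-05.csv")

# 2. After processing, load the cleaned result into a relational store (RDS Postgres here).
conn = psycopg2.connect(host="my-rds-endpoint", dbname="analytics", user="etl", password="...")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO daily_sales (day, total) VALUES (%s, %s)",
        ("2020-02-05", 12345.67),
    )
```

Both destinations, raw and processed, are part of the same data layer.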

A few AWS "Serverless" options

  • S3, S3 Glacier, S3 Glacier Deep Archive
  • RDS
  • Elasticsearch Service

Common AWS-alternatives

  • Your own file system on your server
  • A MySQL/Mongo database on your server
  • HDFS
  • Google Cloud Storage
  • Azure Blob Storage

2. The Ingestion & Integration Layer

Key Concepts

  • This layer allows you to feed your "ETL" pipeline with data from the Data Layer
  • Typically, data is "cataloged" so your jobs will understand the formats and structures of your data
  • Once data is pulled into your pipeline, you are typically doing the following:
    • Sanitizing (validating or cleansing) data
    • Merging data sources together
    • Enriching data (computed columns, external API lookups, etc.)
    • Optimizing data (think aggregations, normalizing/denormalizing data, etc.)
  • Finally, you load the data into its target data store for use by other processes (see the sketch after this list)
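
Here is a rough pandas sketch of those steps. The file paths and column names are made up for illustration; the point is the shape of the work, not the specific tools.

```python
# Toy ETL step with pandas; paths and columns are hypothetical.
import pandas as pd

# Pull data in from the data layer.
orders = pd.read_csv("raw/orders.csv")
customers = pd.read_csv("raw/customers.csv")

# Sanitize: drop rows that fail basic validation.
orders = orders.dropna(subset=["customer_id", "amount"])
orders = orders[orders["amount"] > 0]

# Merge: join the two sources on a shared key.
merged = orders.merge(customers, on="customer_id", how="left")

# Enrich: add a computed column.
merged["amount_with_tax"] = merged["amount"] * 1.08

# Load: write the result to its target data store (a local file here; could be S3, RDS, etc.).
merged.to_parquet("processed/orders_enriched.parquet")
```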

A few AWS "Serverless" options

  • Transfer (Managed SFTP)
  • Glue (Crawlers and Spark jobs)

Common AWS-alternatives

  • Hadoop
  • Self-hosted Spark clusters
  • Enterprise ETL providers
    • Talend, Datadog, Informatica, Panoply
  • OSS Options
    • Apache Airflow (see the DAG sketch after this list)
    • CloverETL
    • Jaspersoft
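
If you go the Airflow route, a pipeline like the one above is expressed as a DAG. Below is a minimal sketch using Airflow 2.x imports; the DAG id and task bodies are hypothetical placeholders.

```python
# Minimal Airflow DAG sketch (Airflow 2.x style); task bodies are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the data layer")

def transform():
    print("sanitize, merge, enrich, optimize")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2020, 2, 5),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # run extract before transform
```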

3. The Processing Layer

Key Concepts

  • The processing layer allows you to "crunch" your data
  • Typically, this is used for "aggregations" and "modeling"
  • Aggregations allow you to pre-calculate data for use in a visualization layer
    • This pre-computation helps with load times (see the sketch after this list)
  • Additionally, this layer allows you to change your data model
    • Think normalizing, denormalizing, view creation, etc
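
As a small illustration of "crunching," here is a pandas aggregation that pre-computes a daily roll-up for the BI layer. The input file and column names are hypothetical, continuing the enriched output from the previous layer.

```python
# Pre-aggregate enriched orders into a small summary table the BI layer can load quickly.
import pandas as pd

orders = pd.read_parquet("processed/orders_enriched.parquet")  # hypothetical path

# Roll up to one row per day: total revenue and order count.
daily = (
    orders.groupby("order_date", as_index=False)
          .agg(total_revenue=("amount_with_tax", "sum"),
               order_count=("order_id", "count"))
)

daily.to_parquet("processed/daily_summary.parquet")
```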

A few AWS "Serverless" options

  • EMR (Elastic MapReduce)
  • Athena (run SQL directly on S3; see the sketch after this list)
  • Redshift (OLAP data warehouse) and Redshift Spectrum (its extension for querying S3 directly)
  • Technically, you can do some of this with additional Glue jobs
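
For the Athena option, kicking off a query from Python looks roughly like the boto3 sketch below. The database name, query, and results bucket are hypothetical.

```python
# Start an Athena query against data sitting in S3 (all names are placeholders).
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
# Athena runs asynchronously; poll get_query_execution with this id for status.
print(response["QueryExecutionId"])
```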

Common AWS-alternatives

  • Hadoop
  • PrestoDB (created by Facebook; this is what Athena is based on)
  • Self-hosted Python & SQL scripts

4. The Analytics & BI Layer

Key Concepts

  • This layer allows you to visualize your data
  • Typically, this is used for "insight" delivery
  • Think charts, filters, data tables, etc
  • BI layers are often used by business stakeholders and must be intuitive
  • Typically, this layer provides "interactivity" with the data (grouping, filtering, etc.)
  • This layer usually only allows read access to the data (see the sketch after this list)
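
Dashboards in tools like QuickSight or Superset are built point-and-click, but the underlying idea is easy to sketch in code: read the pre-aggregated table (read-only) and render a chart. The path and columns below are hypothetical, continuing the earlier examples.

```python
# Read the pre-aggregated summary (read-only) and render a simple chart.
import pandas as pd
import matplotlib.pyplot as plt

daily = pd.read_parquet("processed/daily_summary.parquet")  # hypothetical path

daily.plot(x="order_date", y="total_revenue", kind="bar", legend=False)
plt.ylabel("Revenue")
plt.title("Daily revenue, pre-aggregated in the processing layer")
plt.tight_layout()
plt.show()
```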

A few AWS "Serverless" options

  • QuickSight

Common AWS-alternatives

  • Metabase
  • Apache Superset
  • Tableau
  • Looker

Conclusion

In the end, you can mix and match components across these layers. However, you need to address all four layers and integrate them with one another. Doing so puts you in good shape to flexibly process and display large amounts of data. Hope this basic explanation was helpful!

Subscribe to my newsletter if you like these articles. It makes me feel good...😉