ETL pipeline testing on AWS EMR against production data without copying anything

Delivering high-quality data products requires rigorously testing pipelines before deploying them to production. Today, testing against realistic data means either using a subset of the production data or creating multiple full copies of it. Testing against a sample is not good enough, but full copies are costly and time-consuming. We will demonstrate how to test against the entire production data set without copying any of it.
You will learn how to:

  • Create multiple isolated testing environments without copying data
  • Automate the testing of your pipeline logic, using a local Airflow installation against lakeFS on AWS and S3


