How can we make data engineers work less? Part 2 — Step-by-Step guide
This post continues encouraging data engineering teams to automate their ETLs, so they can focus on developing advanced strategies instead of writing ordinary ETL code.
Here you will find code snippets to guide you through the development of the ingestion engine.
If you haven’t read Part 1 of this series, I highly recommend taking a look at it (a 3-minute read), so you can understand the purpose of the architecture that will be implemented!
Common problems in the development of data pipelines
If you happen to understand what the image presented above means, you know that we data engineers must be prepared for scenarios where countless data pipelines have to be created. This can be very time-consuming for the data engineering team if each pipeline is an isolated script for each table or set of tables. Key problems:
1. Sustainability
Every time you have to ingest more data into the platform, a new script is created. Since these pipelines are independent pieces of code, if you find a bug in one of them, you will probably have to debug the others and fix the same problem again.
2. Scalability
If you have to ingest many tables, you may end up writing essentially the same pipeline over and over again until the list is complete, which can be an exhausting task.
With the following architecture, scaling is a simple task: just add a new table configuration and move on. The ingestion engine is also organized in classes, so any bug fix applies to the whole engine, without the need to repeat it for every table.
Ingestion engine — Step-by-Step
Prerequisites
- Python & PySpark: For this solution I use Python and PySpark, but feel free to adapt the code to your favorite language/tool.
- Google Cloud Platform: This solution runs on Google Cloud Platform, using Cloud Storage and BigQuery, but the workflow can be adapted to any cloud/environment of your choice.
Workflow
- configuration.py: each table’s structure, datatypes, and any other table-specific parameters (example sketches of all four files follow this list).
- transformation.py: functions that perform transformations, for example casting datatypes.
- connection.py: functions that connect to the data sources and destinations.
- main.py: orchestrates the workflow.
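To make the workflow concrete, here is a minimal sketch of what configuration.py could contain. Every name in it (bucket paths, project/dataset/table names, columns) is an illustrative assumption, not taken from the original project; the point is that each table is fully described by one entry in a plain Python dictionary.

```python
# configuration.py
# One entry per table to be ingested. Adding a new table to the
# platform means adding a new entry here -- no new pipeline code.
# All paths, names, and schemas below are hypothetical examples.
TABLES = {
    "customers": {
        "source_path": "gs://my-landing-bucket/customers/*.csv",
        "file_format": "csv",
        "destination": "my_project.raw_zone.customers",
        "schema": {                      # column -> target datatype
            "customer_id": "int",
            "name": "string",
            "signup_date": "date",
        },
        "write_mode": "overwrite",
    },
    "orders": {
        "source_path": "gs://my-landing-bucket/orders/*.json",
        "file_format": "json",
        "destination": "my_project.raw_zone.orders",
        "schema": {
            "order_id": "int",
            "customer_id": "int",
            "amount": "double",
            "created_at": "timestamp",
        },
        "write_mode": "append",
    },
}
```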
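transformation.py can stay generic because it is driven by the configuration. The sketch below assumes a simple Transformer class with a single cast_columns method; both names are hypothetical.

```python
# transformation.py
from pyspark.sql import DataFrame
from pyspark.sql.functions import col


class Transformer:
    """Generic transformations driven by the table configuration."""

    @staticmethod
    def cast_columns(df: DataFrame, schema: dict) -> DataFrame:
        """Cast every configured column to its target datatype."""
        for column, datatype in schema.items():
            if column in df.columns:
                df = df.withColumn(column, col(column).cast(datatype))
        return df
```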
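connection.py concentrates the reading and writing logic. The sketch assumes Spark can read directly from Cloud Storage (GCS connector) and write to BigQuery through the spark-bigquery-connector; the Connector class, its constructor arguments, and the staging bucket are my own assumptions, so adapt them to your connectors and credentials.

```python
# connection.py
from pyspark.sql import DataFrame, SparkSession


class Connector:
    """Reads from Cloud Storage and writes to BigQuery."""

    def __init__(self, spark: SparkSession, temp_bucket: str):
        self.spark = spark
        # Staging bucket used by the BigQuery connector for indirect writes.
        self.temp_bucket = temp_bucket

    def read(self, source_path: str, file_format: str) -> DataFrame:
        """Read raw files from Cloud Storage into a DataFrame."""
        return (
            self.spark.read
            .format(file_format)
            .option("header", "true")   # relevant for csv sources
            .load(source_path)
        )

    def write(self, df: DataFrame, destination: str, write_mode: str) -> None:
        """Write the DataFrame to a BigQuery table."""
        (
            df.write
            .format("bigquery")
            .option("table", destination)
            .option("temporaryGcsBucket", self.temp_bucket)
            .mode(write_mode)
            .save()
        )
```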
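Finally, main.py just loops over the configured tables and wires the pieces together: read, cast, write. This is one possible orchestration sketch, not the author’s exact code; the staging bucket name is again hypothetical.

```python
# main.py
from pyspark.sql import SparkSession

from configuration import TABLES
from connection import Connector
from transformation import Transformer


def main():
    spark = SparkSession.builder.appName("ingestion-engine").getOrCreate()
    connector = Connector(spark, temp_bucket="my-temp-bucket")  # hypothetical staging bucket

    # The same generic steps run for every configured table:
    # read -> cast datatypes -> write. New tables only touch configuration.py.
    for table_name, config in TABLES.items():
        df = connector.read(config["source_path"], config["file_format"])
        df = Transformer.cast_columns(df, config["schema"])
        connector.write(df, config["destination"], config["write_mode"])
        print(f"Ingested table: {table_name}")


if __name__ == "__main__":
    main()
```

Because main.py only iterates over TABLES, onboarding a new table is a configuration change, not a code change, which is exactly what makes this design easy to scale and maintain.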
What else?
That’s it! The architecture is simple but effective. This change of paradigm makes all the difference in the scalability and maintenance of your ingestion pipelines. Give it a shot and start spending your time developing advanced strategies instead of ordinary ETL code.
Want to see more about it?
This is the second post in a series about one-click solutions. Stay tuned for more!
Previous post:
This post is powered by the ideas of Mario Abreu, a data architect and my friend.