How can we make data engineers work less? Part 1
This post aims to encourage data engineering teams to automate their ETLs, so they can focus on developing advanced strategies instead of writing only ordinary ETL code.
What is the concept of “one click ingestion”?
The idea is to automate every step of data ingestion until a table can be ingested with one click.
The main elements that make this possible are object-oriented programming and code reusability.
- Object-oriented programming (OOP): I use Python in most of my projects, but you can use Java or Scala, for example, with or without a distributed processing framework. It is up to you and your data platform’s needs.
- Reusability of code: this concept is central to OOP and is the key to scaling. Writing reusable code gives you a core pipeline shared by all your ingestion algorithms (and brings you one step closer to automated ingestion).
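To make the idea of a shared core pipeline concrete, here is a minimal sketch, assuming a hypothetical `Ingestion` base class. The class names, the `clean` step, and the shape of the inputs are illustrative assumptions, not part of any real library. Each source only supplies its own `extract`; everything else is reused.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


class Ingestion(ABC):
    """Hypothetical core pipeline: subclasses only implement extract()."""

    @abstractmethod
    def extract(self) -> List[Row]:
        """Return the raw rows of one table as dictionaries."""

    def clean(self, rows: List[Row]) -> List[Row]:
        # Shared step: strip surrounding whitespace from string values.
        return [
            {key: value.strip() if isinstance(value, str) else value
             for key, value in row.items()}
            for row in rows
        ]

    def run(self) -> List[Row]:
        # The reusable part: the same flow for every source.
        return self.clean(self.extract())


class ApiIngestion(Ingestion):
    """Assumes the API response is already parsed and has an 'items' list."""

    def __init__(self, response: Dict[str, Any]) -> None:
        self.response = response

    def extract(self) -> List[Row]:
        return list(self.response["items"])


class DatabaseIngestion(Ingestion):
    """Assumes rows arrive as tuples, as many database drivers return them."""

    def __init__(self, rows: Sequence[Tuple[Any, ...]], columns: Sequence[str]) -> None:
        self.rows = rows
        self.columns = columns

    def extract(self) -> List[Row]:
        return [dict(zip(self.columns, row)) for row in self.rows]
```

Adding a new kind of source then means writing one small subclass, while `clean` and `run` stay untouched.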
Why should I do this?
Many engineers know that daily activities can be repetitive. To ingest a new table from a data source, you will mostly follow these steps:
- Create or use some kind of connector to bring the data from the source (API or transactional database) into the first layer of your data platform's storage.
- Read this raw data from storage, process it, and load it into the Data Lake or Data Warehouse, for example.
Imagine that you can build an engine that ingests any table from your raw layer into your Data Lake. Any time a new table has to be ingested, all you have to do is configure the table’s schema in the ingestion engine. This would accelerate the ingestion process, wouldn’t it? The architecture that makes this possible consists of four key elements:
- Connection.py: a Python class with all the functions used to connect to data sources and to load the processed data into the destination (in this example, a Data Lake).
- Transformation.py: a class responsible for applying any kind of transformation to the fields of the table, for example casting data types.
- Configuration.py: a configuration file containing the details of each table being ingested, for example the schema, the dataset’s name, and the source type.
- Main.py: the class that orchestrates all the processes that have to be executed.
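The four elements above can be sketched in a single file as follows. This is only an illustration under stated assumptions: the `TABLES` dictionary layout, the method names, and the in-memory dictionaries standing in for the raw layer and the Data Lake are all hypothetical, and a real engine would talk to actual storage.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Configuration.py (assumed layout): one entry per ingested table.
TABLES: Dict[str, Dict[str, Any]] = {
    "orders": {
        "dataset": "sales",
        "source_type": "memory",
        "schema": {"order_id": int, "amount": float, "customer": str},
    },
}


# Connection.py: reads raw rows and loads processed rows.
class Connection:
    def __init__(self, raw_layer: Dict[str, List[Row]], data_lake: Dict[str, Any]) -> None:
        # Plain dictionaries stand in for real storage in this sketch.
        self.raw_layer = raw_layer
        self.data_lake = data_lake

    def read_raw(self, table: str) -> List[Row]:
        return list(self.raw_layer.get(table, []))

    def load(self, dataset: str, table: str, rows: List[Row]) -> None:
        self.data_lake.setdefault(dataset, {})[table] = rows


# Transformation.py: casts every field to the type declared in the schema.
class Transformation:
    def __init__(self, schema: Dict[str, Callable[[Any], Any]]) -> None:
        self.schema = schema

    def cast(self, rows: List[Row]) -> List[Row]:
        return [
            {column: caster(row[column]) for column, caster in self.schema.items()}
            for row in rows
        ]


# Main.py: orchestrates read, transform, and load for one table.
class Main:
    def __init__(self, connection: Connection, tables: Dict[str, Dict[str, Any]]) -> None:
        self.connection = connection
        self.tables = tables

    def ingest(self, table: str) -> int:
        config = self.tables[table]
        raw_rows = self.connection.read_raw(table)
        processed = Transformation(config["schema"]).cast(raw_rows)
        self.connection.load(config["dataset"], table, processed)
        return len(processed)  # number of rows written


if __name__ == "__main__":
    raw_layer = {"orders": [{"order_id": "1", "amount": "9.90", "customer": "ana"}]}
    data_lake: Dict[str, Any] = {}
    Main(Connection(raw_layer, data_lake), TABLES).ingest("orders")
    print(data_lake)
```

In this sketch, ingesting another table requires no new code in the three classes: only a new entry in `TABLES`.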
That’s it! Because the engine is built from reusable code, all you have to do when ingesting a new table is define its configuration in Configuration.py. The rest is ready to use.
What’s next?
The next step toward one click ingestion is to create a cloud function that receives the table’s configuration as a parameter, together with a Python script that adds it to Configuration.py.
The sky is the limit! I hope this post encourages data engineers to think beyond ordinary ETL solutions, and maybe we can make one click solutions possible.
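As a rough, provider-agnostic sketch of that registration step, the function below accepts a table configuration as a JSON string and adds it to a registry. The function name `register_table`, the required keys, and the use of an in-memory dictionary in place of Configuration.py are assumptions made for illustration. The entry-point signature of a real cloud function depends on the cloud provider and is not shown here.

```python
import json
from typing import Any, Dict

# Keys every table configuration is assumed to carry in this sketch.
REQUIRED_KEYS = {"table", "dataset", "source_type", "schema"}


def register_table(payload: str, registry: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Validate a JSON table configuration and add it to the registry."""
    config = json.loads(payload)
    if not isinstance(config, dict):
        raise ValueError("payload must be a JSON object")

    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    table = config.pop("table")
    if table in registry:
        raise ValueError(f"table {table!r} is already configured")

    # Store everything except the table name under that name.
    registry[table] = config
    return config


if __name__ == "__main__":
    registry: Dict[str, Dict[str, Any]] = {}
    body = json.dumps({
        "table": "orders",
        "dataset": "sales",
        "source_type": "api",
        # JSON cannot carry Python types, so type names are sent as strings.
        "schema": {"order_id": "int", "amount": "float"},
    })
    register_table(body, registry)
    print(sorted(registry))
```

Rejecting incomplete or duplicate configurations at this point keeps a bad request from silently changing how an existing table is ingested.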
Want to see more about it?
This is the first post in a series about one click solutions. Stay tuned for updates!
Coming soon: How can we make data engineers work less? Part 2 — Step-by-Step guide
This post is powered by the ideas of Mario Abreu and Ray Lacerda, my friends and teammates.