Leveraging AWS Glue for organizing an ETL data pipeline



Organizing an ETL (extract, load, transfer) data pipeline is a complex task, made even more challenging by the necessity of maintaining the infrastructure capable of running it. AWS Glue provides a serverless environment for running ETL jobs, so organizations can focus on managing their data, not their hardware. AWS has pioneered the movement towards a cloud based infrastructure, and Glue, one if its newer offerings, is the most fully-realized solution to bring the serverless revolution to ETL job processing. This article provides a quick overview of what is required to build an ETL pipeline in Glue.

Naturally, if you want to run your ETL jobs in the cloud, you need to establish a way for the cloud to access your data. This is very simple if you use one of the many services that AWS provides for warehousing data, such as S3 or Redshift. If you use MySQL, Oracle, Microsoft SQL Server, or PostgreSQL databases, these can be accessed by Glue if hosted in an Amazon VPC (Virtual Private Cloud). It is also possible to connect your data to Glue via a JDBC (Java Database Connectivity) connection to a data store of your choosing.

Glue builds a data catalog that stores the location, schema, and runtime metrics of your data. The schema is automatically inferred from your data by “crawlers.” A crawler takes a subset of your data and uses it to predict what the names and data types for each table should be. This allows you to set up pipelines that draw upon semi-structured data, where the crawlers do the work of finding an appropriate schema as the data flows in. However, the crawlers’ predictions may not be perfect. (Crawlers know this: they provide a rating of how confident they are about the schemas they generate.) Imagine your data includes addresses. Should the house numbers be stored as integers (int) or as strings? A crawler examining a subset of your data may determine “int”, because it missed an entry such as “132-A”. However, even if you already know the structure of your data, the crawlers provide a good check to make sure the schema is what you think it should be.

With your data in place and cataloged, you are ready to pass it to a script to transform it. Glue ETL scripts must be written in PySpark (a Python dialect) or Scala. Glue will automatically generate PySpark scripts for many standard transform functions such as applying a mapping or dropping a field. For instance, after specifying a data source and target, Glue can automatically generate the following code for transforming a database storing information about taxi fares.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sample-data",
                                                            table_name = "taxi_trips",
                                                            transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0,
                                   mappings = [
                                       ("vendorid", "long", "vendor_id", "long"),
                                       ("payment_type", "long", "payment_type", "long"),
                                       ("fare_amount", "double", "fare", "double"),
                                       ("total_amount", "double", "total_amount", "double")
                                   transformation_ctx = "applymapping1")

datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1,
                                                         connection_type = "s3",
                                                         connection_options = {"path": "s3://example-data-destination/taxi-data"},
                                                         format = "json",
                                                         transformation_ctx = "datasink2")

Of course, you may edit these scripts or simply write your own from scratch to accomplish your own goals. While Glue provides an environment for editing scripts, you may prefer to write code in an IDE or notebook. Glue provides the ability to integrate your code with development tools so that you can test your code locally. However, this integration currently only supports code written in Python 2.7.

Your ETL jobs can be triggered either by an event, a schedule, or manually. The successful completion of one job can trigger the next, so that your whole pipeline may be triggered by a single job. One powerful application is to trigger a test of your pipeline each time your code is updated, in support of a continuous testing framework. You can also use triggers to determine when crawlers should update your data catalogs, in case the structure of the data has changed.

In the end, your transformed data may be loaded into an S3 bucket or other repository, as determined by your script. This entire process, from start to end, can be managed from the AWS Glue Console. You can view which jobs are running, see their logs, and even edit their code all from the same interface. The Glue Console is not as polished as the user interfaces for some other services from AWS, but it does let you manage the entire ETL process from a single point.

A view from the Glue Console

One of the strongest cases for trying Glue is integration with other services from AWS. With your data stores hosted in S3, you can use Athena to run serverless SQL queries. You can use Lambda to kick off your ETL jobs using triggers far outside the scope of what is already available in Glue. You can use Amazon SNS to receive notification when your pipeline has completed successfully, or when it has failed. Thus, while transferring an existing ETL workflow to Glue might be a significant undertaking, doing so makes it much easier to take advantage of some very powerful tools.

Deployed in 2017, Glue is still a relatively new service, and is still in development. While it has some rough edges, new features are being rolled out frequently. Already, it is a robust tool that services each part of an ETL pipeline, and well worth consideration if you are thinking about going serverless.

Alison Forster is a Software Engineer working at 3M Health Information Services. She lives with her family in Albany, NY along with eight chickens.


Click on a tab to select how you'd like to leave your comment

Leave a Comment

Your email address will not be published. Required fields are marked *