
Integrating Teradata Vantage with AWS Glue
As many Teradata customers are interested in integrating Teradata Vantage with AWS first-party services, this getting-started guide will help you integrate AWS Glue with Teradata Vantage.

The approach this guide explains is one of many potential ways to integrate with the service, and it is offered on an as-is basis. Although the approach has been implemented and tested internally, there is no formal support from either Teradata or AWS for it.

That said, your feedback is welcome and appreciated: what worked, what didn’t, and how it could be improved.

Please send your feedback to Vinod.Raman@Teradata.com and wenjie.tehan@teradata.com.

Overview

AWS Glue is a serverless, fully managed ETL (extract, transform, and load) service that makes it easy for customers to prepare and load their data for analytics. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. We’ll be looking at the ETL functionality in this article.

AWS Glue natively supports Amazon Redshift and Amazon RDS (Amazon Aurora, MariaDB, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL). Teradata is not natively supported by AWS Glue, but data can still be moved to and from Amazon S3 using a custom JDBC database connection. The following figure shows how the data flows between Teradata Vantage and Amazon S3.
[Figure: Data flow between Teradata Vantage and Amazon S3]
This article describes the process to migrate data from Teradata Vantage to Amazon S3, and from Amazon S3 to Teradata Vantage.

Prerequisites:
  • Security group that allows Glue to access Teradata Vantage
  • S3 location for the JDBC driver (e.g., tdjdbc)
  • S3 location to store the Glue job script (e.g., tdglue/scripts)
  • S3 location for the temporary directory (e.g., tdglue/temp)
  • S3 location to store input data (e.g., tdglue/input)
  • S3 location to store output data (e.g., tdglue/output)
  • Vantage instance available to use
Instructions on how to create an S3 bucket can be found here; a scripted alternative is sketched below.
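The sketch below creates the buckets with boto3. The bucket names and region are assumptions (bucket names must be globally unique), so substitute your own.

```python
# A minimal sketch, assuming boto3 is installed and AWS credentials are
# configured. Bucket names and region are examples only.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

for bucket in ["tdjdbc", "tdglue"]:
    # Bucket names are global across AWS, so these exact names may be taken.
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )
```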

Getting Started

Download Teradata JDBC Driver

Download the latest Teradata JDBC Driver from here. Once you have the driver file, uncompress it and upload the jar file to an Amazon S3 bucket (e.g., tdjdbc) using the steps here, or with a short script like the one below.
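A minimal upload sketch with boto3; the local file name and bucket name are assumptions.

```python
# Upload the unzipped Teradata JDBC driver jar to the driver bucket.
import boto3

s3 = boto3.client("s3")
s3.upload_file("terajdbc4.jar", "tdjdbc", "terajdbc4.jar")
```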

Set up IAM role

The next step is to create the IAM role that the ETL job will use.

  • From the AWS Management Console, search for IAM. On the IAM console, choose “Roles” in the left navigation pane.
  • Choose “Create role”. For the type of trusted entity, choose “AWS service”, then specify “Glue”.
  • Choose “Next: Permissions”
  • Search for the “AWSGlueServiceRole” policy, and select it

  • Search again for “AmazonS3FullAccess” policy, and select it
  • Search for “SecretsManagerReadWrite” policy, and select it.
Note: This policy is optional. You only need it if you are using Secrets Manager (see the next section).
  • Choose “Next: Tags”, and add key/value pairs for tags if applicable.
  • Choose “Next: Review”
  • Give your role a name (e.g., GluePermissions), and confirm all the policies you selected are there.
  • Choose “Create role”
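The same role can be created programmatically. A minimal boto3 sketch follows, assuming you want all three managed policies (drop SecretsManagerReadWrite if you skip the next section).

```python
# A minimal sketch of creating the Glue job role with boto3.
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="GluePermissions",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policies used in the console steps above.
for arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/SecretsManagerReadWrite",  # optional
]:
    iam.attach_role_policy(RoleName="GluePermissions", PolicyArn=arn)
```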

Set up Secrets Manager (optional)

AWS Secrets Manager stores credentials in a secure store. In this case, you can use it to store the database connection information. This step is optional.

To set up Secrets Manager, do the following:
  • Open the console, and search for “Secrets Manager”
  • In the AWS Secrets Manager console, choose “Store a new secret”
  • Under “Select a secret type”, choose “Other type of secrets”
  • In the “Secret key/value”, set one row for each of the following parameters:
    • db_name
    • db_username
    • db_password
    • db_url (jdbc:teradata://<database server name>, e.g., jdbc:teradata://10.10.10.10)
    • db_table
    • driver_name (com.teradata.jdbc.TeraDriver)
    • output_bucket (e.g., s3://tdglue/output)
  • Choose “Next”
  • For “Secret name”, use “TD_Vantage_Connection_Info”
  • Choose “Next”
  • Keep the “Disable automatic rotation” check box selected
  • Choose “Next”
  • Choose “Store”
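Once stored, a Glue script can read the secret back with boto3, as in this minimal sketch (the region is an assumption; use the region where you stored the secret):

```python
# A minimal sketch of reading the secret from a Glue job.
import json
import boto3

client = boto3.client("secretsmanager", region_name="us-west-2")
response = client.get_secret_value(SecretId="TD_Vantage_Connection_Info")
secret = json.loads(response["SecretString"])

# Individual values are then available by key, for example:
db_url = secret["db_url"]
db_table = secret["db_table"]
```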

Authoring Jobs

Authoring an ETL Job (from Vantage to S3)

The next step is to author the AWS Glue job, following these steps:
  • In the AWS Management Console, search for “AWS Glue”
  • In the navigation pane on the left, choose “Jobs” under “ETL”
  • Choose “Add job”
  • Fill in the basic Job properties:
    • Give the job a name (e.g., td2s3).
    • Choose the IAM role that you created previously (e.g., GluePermissions)
    • For “Type” and “Glue version”, use “Spark” and the latest Spark and Python versions
    • For “This job runs”, choose “A new script to be authored by you”
    • For “S3 path where the script is stored” and “Temporary directory”, choose the buckets or folders you created in the “Prerequisites” step (e.g., tdglue/scripts and tdglue/temp).
  • In the “Security configuration, script libraries and job parameters” section, choose the location of your JDBC driver (terajdbc4.jar) for “Dependent jars path”.
  • Choose “Next”
  • On the Connections page, choose “Save job and edit script”. This creates the job and opens the script editor.
In the editor, replace the existing code with the following script.

Note: region_name in the script should be replaced with the region where you created your secret in Secrets Manager.
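A minimal sketch of such a Vantage-to-S3 script follows, assuming the Secrets Manager secret created earlier; the region and the CSV output format are assumptions to adjust for your environment.

```python
# A minimal sketch of the td2s3 job script; region, secret keys, and the
# CSV output format are assumptions, not the exact original script.
import sys
import json
import boto3
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Fetch connection details from Secrets Manager; replace region_name with
# the region where you created the secret.
client = boto3.client("secretsmanager", region_name="us-west-2")
secret = json.loads(
    client.get_secret_value(SecretId="TD_Vantage_Connection_Info")["SecretString"]
)

# Read the source table from Vantage over JDBC.
df = (
    spark.read.format("jdbc")
    .option("url", secret["db_url"] + "/DATABASE=" + secret["db_name"])
    .option("driver", secret["driver_name"])
    .option("dbtable", secret["db_table"])
    .option("user", secret["db_username"])
    .option("password", secret["db_password"])
    .load()
)

# Write the result to the S3 output location as CSV.
df.write.mode("overwrite").option("header", "true").csv(secret["output_bucket"])

job.commit()
```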
  • Click on “Save” then click on “Run job”
Job status can be monitored from the Glue console. Once the job is done, “Run status” will show “Succeeded”, and the output file will be written to the S3 output bucket you set up earlier.

Authoring an ETL Job (from S3 to Vantage)

Preparing data

In this step, we’ll create a Glue table using a crawler. First, upload your data file into an S3 bucket (e.g., tdglue/input).
  • In the AWS Management Console, search for “AWS Glue”
  • In the navigation pane on the left, choose “Databases”
  • Click on “Add database”, give it a name, then click “Create”
  • Click on “Tables” under “Databases” in the left navigation panel
  • Click on the down arrow next to “Add tables” then choose “Add tables using a crawler”
  • In the “Add information about your crawler” window, give the crawler a name and click “Next”
  • In the “Specify crawler source type” window, select “Data stores”, then click “Next”
  • In the “Add a data store” window, choose “S3” for “Choose a data store”, enter the path where your data file is, then click “Next”
  • Choose “No” for “Add another data store”, then “Next”
  • In the “Choose an IAM role” window, select “Choose an existing IAM role”, use the role you created at the “Set up IAM role” step (e.g., GluePermissions), and click “Next”
  • In the “Create a schedule for this crawler” window, use “Run on demand” for “Frequency”, then click “Next”
  • In the “Configure the crawler’s output” window, choose the database you created earlier and click “Next”
  • At the next window, review the information then click on “Finish”
  • In the “Crawlers” window, click “Run it now” in response to the question “Crawler <your crawler name> was created to run on demand. Run it now?”. If you don’t see the question, choose the crawler you just created, then click “Run crawler”
  • Once the crawler is done, its status becomes “Ready”, and a table is created that can be viewed by selecting “Tables” in the left panel, under the database you created earlier. The crawler can also be created programmatically, as sketched below.
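A minimal boto3 sketch of the same crawler setup; the crawler name, role, database name, region, and S3 path below are assumptions.

```python
# Create and start a crawler over the input S3 path programmatically.
import boto3

glue = boto3.client("glue", region_name="us-west-2")

glue.create_crawler(
    Name="td-input-crawler",      # example name
    Role="GluePermissions",       # role created earlier
    DatabaseName="td_glue_db",    # database created earlier
    Targets={"S3Targets": [{"Path": "s3://tdglue/input"}]},
)
glue.start_crawler(Name="td-input-crawler")
```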

Add a job

  • If you are using Secrets Manager, go back to the Secrets Manager console, click on “TD_Vantage_Connection_Info”, select “Retrieve secret value” under “Secret value”, then click on “Edit”. Add two more keys, s3_database and s3_table, containing the database and table names you created with the crawler. Skip this step if you are not using Secrets Manager.
  • In the navigation pane on the left, choose “Jobs” under “ETL”
  • Choose “Add job”
  • Fill in the basic Job properties:
    • Give the job a name (e.g., s32td).
    • Choose the IAM role that you created previously (e.g., GluePermissions)
    • For “Type” and “Glue version”, use “Spark” and the latest Spark and Python versions
    • For “This job runs”, choose “A new script to be authored by you”
    • For “S3 path where the script is stored” and “Temporary directory”, choose the buckets or folders you created in the “Prerequisites” step (e.g., tdglue/scripts and tdglue/temp).
  • Expand the “Security configuration, script libraries and job parameters” section, and choose the location of your JDBC driver (terajdbc4.jar) for “Dependent jars path”.
  • Choose “Next”
  • On the Connections page, choose “Save job and edit script”. This creates the job and opens the script editor.
In the editor, replace the existing code with the following script.

Note: region_name in the script should be replaced with the region where you created your secret in Secrets Manager.

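A minimal sketch of such an S3-to-Vantage script follows, assuming the same secret plus the s3_database and s3_table keys added above; the region is an assumption.

```python
# A minimal sketch of the s32td job script; region and secret keys are
# assumptions, not the exact original script.
import sys
import json
import boto3
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Fetch connection details from Secrets Manager; replace region_name with
# the region where you created the secret.
client = boto3.client("secretsmanager", region_name="us-west-2")
secret = json.loads(
    client.get_secret_value(SecretId="TD_Vantage_Connection_Info")["SecretString"]
)

# Read the crawled table from the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database=secret["s3_database"], table_name=secret["s3_table"]
)

# Write the rows to the target Vantage table over JDBC.
(
    dyf.toDF()
    .write.format("jdbc")
    .option("url", secret["db_url"] + "/DATABASE=" + secret["db_name"])
    .option("driver", secret["driver_name"])
    .option("dbtable", secret["db_table"])
    .option("user", secret["db_username"])
    .option("password", secret["db_password"])
    .mode("append")
    .save()
)

job.commit()
```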
  • Click on “Save” then click on “Run job”
Job status can be monitored from the Glue console. Once the job is done, “Run status” will show “Succeeded”.
Author: Kevin Bogusch

Kevin Bogusch is the Cloud Product Marketing Manager at Teradata. Kevin has spent his 25-year career in IT from programming to system and database administration, from teaching to sales and solution architecting. He has a BS in Computer Engineering, an MS in Computer Information Systems, and an MBA in Operations Management. He is a certified Teradata Vantage Master and certified AWS Professional Solutions Architect.
 
Author: Wenjie Tehan

Wenjie is a Technical Consulting Manager, currently working with the Teradata Global Alliances team. 
 
With over 20 years in the IT industry, Wenjie has worked as a developer, tester, business analyst, solution designer, and project manager. This breadth of experience suits her current role well: understanding how the business needs data and how that data can be managed to meet those needs.
 
Wenjie has a BS in computer science from the University of California at San Diego and an ME in computer engineering from Cornell University. She is also certified on both Teradata and AWS.
