Adding an AWS Glue (Data Catalog) Connection

Prerequisites

  • A user with sufficient permissions is required to establish a connection with AWS Glue.
  • Network traffic from Zeenea to AWS Glue must be allowed (see the ports listed below).

The Agent's host server must have sufficient credentials to connect to AWS Glue; in this case, available authentication methods are:

  • Instance Role
  • Environment Variable
  • Configuration File
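The three methods above correspond to common AWS SDK credential sources. As a rough illustration only (this is not the connector's actual resolution logic), the sketch below guesses which source would apply on a host. It assumes the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variables and the default `~/.aws/credentials` file; the function name `guess_credential_source` is hypothetical.

```python
import os
from pathlib import Path


def guess_credential_source(env=None, home=None):
    """Guess which AWS credential source would apply on this host.

    Simplified sketch: real SDKs walk a longer chain of sources.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Environment variables: both halves of the key pair must be set.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variable"

    # Configuration file: default location of the shared credentials file.
    if (home / ".aws" / "credentials").is_file():
        return "configuration file"

    # Nothing found locally: assume an instance role (not verified here).
    return "instance role"
```

Calling `guess_credential_source()` without arguments inspects the current process environment and home directory.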
Note: A configuration template can be downloaded here: aws-glue.conf

| Target | Protocol | Usual Ports |
|---|---|---|
| AWS Glue | HTTP | 443 |
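To check that the port is reachable from the Agent's host, a quick TCP test can help. The sketch below is a minimal example, assuming the public endpoint naming `glue.<region>.amazonaws.com` used in standard commercial AWS regions; the helper names are hypothetical, and a successful connection only shows that the network path is open, not that credentials are valid.

```python
import socket


def glue_endpoint(region):
    """Return the public AWS Glue API host name for a standard AWS region."""
    return f"glue.{region}.amazonaws.com"


def can_reach_glue(region, port=443, timeout=5.0):
    """Return True if a TCP connection to the Glue endpoint can be opened."""
    try:
        with socket.create_connection((glue_endpoint(region), port), timeout=timeout):
            return True
    except OSError:
        # DNS failure, refused connection, timeout, blocked route, etc.
        return False
```

For example, `can_reach_glue("eu-west-2")` attempts a connection to the Glue endpoint of the `eu-west-2` region on port 443.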

Supported Versions

The AWS Glue connector was successfully tested with the online application.

Installing the Plugin

The AWS Glue plugin can be downloaded here: Zeenea Connector Downloads.

For more information on how to install a plugin, please refer to the following article: Installing and Configuring Connectors as a Plugin.

Declaring the Connection

Creating and configuring connectors is done through a dedicated configuration file located in the /connections folder of the relevant scanner.

Read more: Managing Connections

To establish a connection with AWS Glue, the following parameters must be specified in the dedicated file:

| Parameter | Expected value |
|---|---|
| name | The name that will be displayed to catalog users for this connection. |
| code | The unique identifier of the connection on the Zeenea platform. Once registered on the platform, this code must not be modified; otherwise the connection will be considered as new and the old one removed from the scanner. |
| connector_id | The type of connector to be used for the connection. Here, the value must be aws.glue and must not be modified. |
| connection.aws.access_key_id | AWS Glue Access Key Identifier |
| connection.aws.secret_access_key | AWS Glue Secret Access Key |
| connection.aws.region | AWS region |
| connection.aws.profile | AWS Profile for authentication |
| connection.aws.multi_account.enabled | Allows a single connection to retrieve data from the data catalogs of other AWS accounts. To connect to multiple accounts, you need to configure AWS cross-account access (see https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html). Default value is false. Since version 3.3.1. |
| connection.aws.multi_account.list | Defines which accounts/regions to connect to. It must be a list of account:region entries, separated by spaces. Example: 123456789012:eu-west-2 987654321098:eu-west-2. Since version 3.3.1. |
| connection.fetch_page_size | (Advanced) Defines the number of items loaded by each request during inventory. Since version 1.0.3. |
| filter | Lets you filter based on specific characteristics. See Rich Filters below for a comprehensive explanation. Since version 3.4.1. |
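As an illustration of how these parameters fit together, the sketch below validates a `connection.aws.multi_account.list` value and prints a set of parameters as `key = "value"` lines. This is a minimal sketch under assumptions: the `key = "value"` syntax is inferred from the filter example further down this page, AWS account IDs are taken to be 12 digits, and the `name` and `code` values as well as both helper functions are hypothetical. The downloadable aws-glue.conf template remains the reference.

```python
import re

# One entry is "<12-digit account id>:<region code>", e.g. 123456789012:eu-west-2.
ENTRY = re.compile(r"^(\d{12}):([a-z0-9-]+)$")


def parse_multi_account_list(value):
    """Split 'account:region account:region ...' into (account, region) pairs."""
    pairs = []
    for entry in value.split():
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError(f"invalid account:region entry: {entry!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


def render_connection(settings):
    """Render settings as `key = "value"` lines (assumed file syntax)."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            lines.append(f"{key} = {str(value).lower()}")
        else:
            lines.append(f'{key} = "{value}"')
    return "\n".join(lines)


# Hypothetical example values; only connector_id is fixed by the documentation.
settings = {
    "name": "AWS Glue (example)",
    "code": "aws-glue-example",
    "connector_id": "aws.glue",
    "connection.aws.region": "eu-west-2",
    "connection.aws.multi_account.enabled": True,
    "connection.aws.multi_account.list": "123456789012:eu-west-2 987654321098:eu-west-2",
}

# Fail early on a malformed list before writing the file.
parse_multi_account_list(settings["connection.aws.multi_account.list"])
print(render_connection(settings))
```

Validating the list before writing the file catches entries such as a missing region or a truncated account number early.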

User Permissions

In order to collect metadata, the running user's permissions must allow them to access and read databases that need cataloging.

Roles

The user must be able to run the following actions on the Data Catalog databases and tables to be cataloged:

  • glue:GetTable
  • glue:GetTables
  • glue:GetDatabases

Example policy granting these actions (in JSON):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SearchTables",
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetDatabases"
            ],
            "Resource": "*"
        }
    ]
}
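Before attaching a policy, it can be useful to check that it covers the three required actions. The sketch below is a simplified check, not an IAM evaluator: it only looks at `Allow` statements and their `Action` entries (with `*` wildcards), and ignores `Deny` statements, conditions, and resource scoping. The function name `missing_glue_actions` is hypothetical.

```python
from fnmatch import fnmatchcase

REQUIRED_ACTIONS = ("glue:GetTable", "glue:GetTables", "glue:GetDatabases")


def _as_list(value):
    """IAM allows a single string where a list is expected."""
    return [value] if isinstance(value, str) else list(value)


def missing_glue_actions(policy):
    """Return the required actions that no Allow statement grants."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]

    allowed = []
    for statement in statements:
        if statement.get("Effect") == "Allow":
            # Action names are compared case-insensitively.
            allowed.extend(a.lower() for a in _as_list(statement.get("Action", [])))

    return [
        action
        for action in REQUIRED_ACTIONS
        if not any(fnmatchcase(action.lower(), pattern) for pattern in allowed)
    ]
```

Loading the policy with `json.load` and passing the result to `missing_glue_actions` returns an empty list when all three actions are granted.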

Rich Filters

Databases and Tables

Starting with version 3.4.1, the connector embeds a rich filter mechanism that enables you to extract only specific tables or databases matching the criteria.

| Criteria | Description |
|---|---|
| database | Database name |
| table | Table name |

Example:

filter = "database in ('production', 'qa') and table ~ /(?:hr|it|market)_figures/"

Note: The filter attribute can contain either the raw value or a file URL pointing to the content (e.g., file:///path/to/zeenea/connections/aws-glue-inventory-filter.json).

When using a side file, filter changes are taken into account without restarting the scanner.
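To get a feel for what the example filter selects, its logic can be mirrored in a few lines of Python. This is only an approximation: it assumes that `in` tests membership in the listed names and that `~` performs an unanchored regular expression search, which this page does not specify. The function name `is_selected` is hypothetical.

```python
import re

# Mirrors: database in ('production', 'qa') and table ~ /(?:hr|it|market)_figures/
ALLOWED_DATABASES = {"production", "qa"}
TABLE_PATTERN = re.compile(r"(?:hr|it|market)_figures")


def is_selected(database, table):
    """Return True if the (database, table) pair passes the example filter."""
    return database in ALLOWED_DATABASES and TABLE_PATTERN.search(table) is not None
```

Under the unanchored-search assumption, a table named `credit_figures` would also be selected, because its name contains `it_figures`. If that is not intended, anchoring the expression (for instance with `^` and `$`) is worth testing against a real scanner.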

Data Extraction

To extract information, the connector calls the AWS Glue API to retrieve databases, tables, and their metadata.
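The same kind of traversal can be reproduced outside Zeenea, for example to verify what a given user can see. The sketch below is an illustration rather than the connector's implementation. It expects an object that behaves like a boto3 Glue client and uses its `get_databases` and `get_tables` paginators; the `page_size` argument is an assumption about how a setting such as `connection.fetch_page_size` might map onto the API, and the function name `iter_glue_tables` is hypothetical.

```python
def iter_glue_tables(glue, page_size=100):
    """Yield (database name, table dict) for every table the caller can read.

    `glue` is expected to behave like boto3.client("glue").
    """
    pagination = {"PageSize": page_size}
    databases = glue.get_paginator("get_databases")
    tables = glue.get_paginator("get_tables")

    for db_page in databases.paginate(PaginationConfig=pagination):
        for database in db_page["DatabaseList"]:
            # One paginated listing of tables per database.
            for table_page in tables.paginate(
                DatabaseName=database["Name"], PaginationConfig=pagination
            ):
                for table in table_page["TableList"]:
                    yield database["Name"], table
```

With boto3 installed, a client can be created with `boto3.client("glue", region_name="eu-west-2")` and passed to `iter_glue_tables`. The calls involved correspond to the `glue:GetDatabases` and `glue:GetTables` actions listed under User Permissions.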

Collected Metadata

Inventory

The inventory collects the list of tables and views accessible by the user.

Dataset

A dataset can be a table or a view.

  • Name
  • Source Description
  • Technical Data:
    • AWS Region
    • Database Name
    • Location
    • Owner
    • CreateTime
    • UpdateTime
    • LastAccessTime
    • LastAnalyzeTime
    • TableType
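The properties above closely follow the table structure returned by the AWS Glue API. The sketch below shows one plausible mapping from a Glue table dictionary (as returned by `GetTable` or `GetTables`) to these properties. The mapping is an assumption made for illustration, not the connector's documented behavior, and the function name `dataset_metadata` is hypothetical.

```python
def dataset_metadata(table, region):
    """Map a Glue table dict to the dataset properties listed above."""
    storage = table.get("StorageDescriptor", {})
    return {
        "Name": table["Name"],
        "Source Description": table.get("Description"),
        "AWS Region": region,
        "Database Name": table.get("DatabaseName"),
        "Location": storage.get("Location"),
        "Owner": table.get("Owner"),
        "CreateTime": table.get("CreateTime"),
        "UpdateTime": table.get("UpdateTime"),
        "LastAccessTime": table.get("LastAccessTime"),
        # The Glue API spells this key "LastAnalyzedTime".
        "LastAnalyzeTime": table.get("LastAnalyzedTime"),
        "TableType": table.get("TableType"),
    }
```

Properties missing from the Glue response are returned as `None` rather than raising an error, since most of them are optional in the API.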

Field

Dataset field.

  • Name
  • Source Description
  • Type
  • Can be null: Depending on field settings
  • Multivalued: Depending on field settings
  • Primary Key: Not supported. Default value false.
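Field properties can be derived in a similar way from the `Columns` list of a Glue table's `StorageDescriptor`. The sketch below is a minimal illustration under assumptions: the column `Comment` is used as the source description, a type starting with `array` is treated as multivalued, and nullability is left out because Glue column definitions do not carry it. None of these choices is confirmed by this page, and the function name `dataset_fields` is hypothetical.

```python
def dataset_fields(table):
    """Map the columns of a Glue table dict to the field properties above."""
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    return [
        {
            "Name": column["Name"],
            "Source Description": column.get("Comment"),
            "Type": column.get("Type"),
            # Assumption: array types are the multivalued ones.
            "Multivalued": str(column.get("Type", "")).lower().startswith("array"),
            # Primary keys are not supported; the value is always False.
            "Primary Key": False,
        }
        for column in columns
    ]
```

Partition keys, which Glue stores separately under `PartitionKeys`, are not included in this sketch.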

Object Identification Keys

An identification key is associated with each object in the catalog. When an object is created by a connector, the connector builds its key.

Read more: Identification Keys

Object: Dataset
Identification Key: code/aws region/dataset identifier
Description:
  • code: Unique identifier of the connection noted in the configuration file
  • aws region: AWS region code
  • dataset identifier: Table name
    - Database schema name
    - S3 bucket key

Object: Field
Identification Key: code/aws region/dataset identifier/field name
Description:
  • code: Unique identifier of the connection noted in the configuration file
  • aws region: AWS region code
  • dataset identifier:
    - Database schema name
    - S3 bucket key
  • field name
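The key formats above can be sketched as a simple string join. The example below treats the dataset identifier as an opaque string, because its exact composition is not spelled out here; the function name `identification_key` and the sample values are hypothetical.

```python
def identification_key(code, region, dataset_identifier, field_name=None):
    """Build 'code/aws region/dataset identifier[/field name]'."""
    parts = [code, region, dataset_identifier]
    if field_name is not None:
        # Field keys extend the dataset key with the field name.
        parts.append(field_name)
    return "/".join(parts)


# Hypothetical values for illustration.
dataset_key = identification_key("aws-glue-example", "eu-west-2", "hr_figures")
field_key = identification_key("aws-glue-example", "eu-west-2", "hr_figures", "salary")
```

Because the connection code is the first component, changing it changes every key, which is consistent with the warning that a modified code makes the connection appear as new.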