Adding a BigQuery Connection

Prerequisites

In order to establish a connection to Google BigQuery, a BigQuery user with sufficient permissions is required.
Zeenea traffic flows towards Google BigQuery must be open.

Zeenea uses Google HTTP API to collect metadata.

Target	Protocol	Usual Ports
BigQuery	HTTPS	443

note

The configuration template can be downloaded here: BigQuery.conf.

Supported Versions

The BigQuery connector has been successfully tested with the Web version.

There are two possible forms for this connector:

BigQuery: This connector will catalog all BigQuery tables for one project
BigQueryOrganization: This connector catalogs all tables, for all projects inside an organization. The BigQueryOrganization connector allows for larger access to an organization's data, but it also requires a user with more permissions.

Installing the Plugin

From version 54 of the scanner, the SQL Server connector is presented as a plugin.

It can be downloaded here and requires a scanner version 64 or later: Zeenea Connector Downloads.

For more information on how to install a plugin, please refer to the following article: Installing and Configuring Connectors as a Plugin.

Declaring the Connection

Creating and configuring connectors is done through a dedicated configuration file located in the /connections folder of the relevant scanner.

Parameter	Expected Value
`name`	The name that will be displayed to catalog users for this connection
`code`	Unique identifier of the connection on the Zeenea platform. Once registered on the platform, this code must not be modified or the connection will be considered as new and the old one removed from the scanner.
`connector_id`	The type of connector to be used for the connection. Here, the value must be `BigQuery` for project level or `BigQueryOrganization` for organization level. This value must not be modified.
`connection.json_key`	Path to the access JSON Key.
`connection.billing_project_id`	(Optional) Identifier of the project used for the connection. The Google invoice is for this project. By default, one connection is established for each project that is scanned. NOTE: The retrieval of PK, FK requires this parameter.
`filter`	To filter datasets during the inventory. See Rich Filters.
`inventory.partition.pattern`	Parameter for confusing datasets represented in several tables. Must be completed with the variable part of the tables name. See examples in the template file.
`proxy.scheme`	Depending on the proxy, `http` or `https`
`proxy.hostname`	Proxy address
`proxy.port`	Proxy port
`proxy.username`	Proxy username
`proxy.password`	Proxy account password
`fingerprint.sampling_amount`	Max number of lines sampled from fingerprinting
`fingerprinting.max_blocks`	Max number of blocks read from fingerprinting
`schema_analysis.retrieve_keys`	Retrieve Primary Keys and Foreign Keys
`lineage.view.enabled`	Specifies whether Lineage View is enabled. Default value `true`.

For project level only

Parameter	Expected Value
`connection.project_id`	Project identifier
`fingerprint.project_id`	Specific Project ID
`fingerprint.json_key`	Path to the specific project access JSON Key

User Permissions

In order to collect metadata, the user permissions must allow them to list and read all sources that need to be cataloged.

Roles

To extract metadata, the technical account must have the following predefined roles:

For the BigQueryOrganization connector:
- On the project where the service account is being managed:
  - Project/Browser (roles/Browser)
  - BigQuery Metadata viewer (roles/bigquery.metadataviewer)
- On all projects that need to be cataloged:
  - BigQuery Metadata Viewer (roles/bigquery.metadataviewer)
For the BigQuery connector:
- On the project that needs to be cataloged:
  - BigQuery Metadata Viewer (roles/bigquery.metadataviewer)

The running user will also need this specific authorization: bigquery.jobs.create (required in order to retrieve Primary Keys & Foreign Keys)

If the data profiling feature is enabled, another technical account is used. The project on which the data profiling feature should focus also needs to be specified and the running user needs to have a read access on its tables.

BigQuery/BigQuery Data Viewer (roles/bigquery.dataViewer)

Rich Filters

Since version 47 of the scanner, the BigQuery connector benefits from the feature of rich filters in the configuration of the connector. This functionality also applies if on the metadata "Roles" of the datasets.

Criteria	Description	Connector
project	Google Cloud project name	BigQueryOrganization
dataset	BigQuery dataset name	BigQueryOrganization BigQuery
table	Table or view name	BigQueryOrganization BigQuery

Data Extraction

At the project level, the connector will crawl through all data sources to extract information. This action will be done on all projects at the organization level.

Collected Metadata

Inventory

The inventory collects all data sources that the user can access.

Dataset

A dataset is a Denodo derived view.

Name
Source Description
Technical Data:
- Dataset Id
- Table Id
- Project
- Dataset
- Table
- Type
- Created
- Expiration
- Last modified
- Long-Term Storage Size (bytes)
- Number of rows
- Size (bytes)
- Data location
- Is partitioned
- Partitioned by
- Partitioned on field
- Clustered by
- Labels

Field

Dataset field.

Name
Source Description
Type
Can be null: Depending on field settings
Multivalued: TRUE if the field type is STRUCT, FALSE otherwise
Primary Key: Depending on field settings
Technical Data:
- Technical Name: Field technical name
- Native type: Field native type

Data Profiling

IMPORTANT

The Data Profiling feature, which can be enabled on this connection, allows your Explorers to get a better grasp on the type of data stored in each fields. This feature, which can be activated in the Scanner, is by default set to run on a weekly basis, every Saturday. However, depending on the number of fields you've activated this feature for, the calculation can quickly become costly. Please make sure the estimated impact of this feature is acceptable and that the default frequency appropriate, before enabling it.

The data profiling feature allows you to calculate statistical profiles on your datasets.More information on this feature can be found here: Data Profiling.

To activate this feature, the account used must have a read access on the relevant tables.

In order to retrieve the statistical profiles in BigQuery, the following requests are executed:

SELECT count(*) AS result FROM tableName

The above request is used to determine the number of lines in the tableName table. Afterwards, the connector will then determine the size of the targeted data with the following request:

SELECT sum(size_bytes) AS size
FROM datasetName.__TABLES__
WHERE table_id = 'tableId'

Knowing the size of the targeted data is important, as it is necessary for the algorithm to choose the most appropriate sampling request (otherwise, it would have to scan the entire table).

Scenario 1, the targeted data size is less than 10Go:

SELECT
   field1, field2
   FROM tableName
   WHERE rand()  percentLines

Scenario 2, the size of the targeted data is greater than 10Go and the size of the collected lines is less than 10Go:

SELECT
   field1, field2
   FROM tableName
   TABLESAMPLE SYSTEM (percent10Go PERCENT)
   WHERE rand()  percentLinesFor10Go
   LIMIT 10000

Scenario 3, in which both the size of the targeted data and the size of the collected lines are greater than 10Go each (Optimization scenario):

SELECT
   field1, field2
   FROM tableName
   TABLESAMPLE SYSTEM (percentLines PERCENT)
   LIMIT 10000

The above requests will collect a data sample on fields where the feature has been enabled (fields 1 and 2). The sample contains a maximum of 10,000 lines (defined in the percentLines parameter).

This request can either be manually executed from the Admin portal, or it can be scheduled, according to the collect-fingerprint variable in the application.conf file. For more information on this, refer to Zeenea Scanner Setup.

Object Identification Keys

An identification key is associated with each object in the catalog. In the case of the object being created by a connector, the connector builds it.

More information about how it works can be found here: Identification Keys.

Object	Identification Key	Description
Dataset	BigQueryOrganization: code/project id/dataset id/table id BigQuery: code/dataset id/table id	code: Unique identifier of the connection noted in the configuration file project id: BigQuery project id dataset id: BigQuery dataset name table id: Table name
Field	BigQueryOrganization: code/project id/dataset id/table id/field name BigQuery: code/dataset id/table id/field name	code: Unique identifier of the connection noted in the configuration file project id: BigQuery project id dataset id: BigQuery dataset name table id: Table name field name

Prerequisites​

Supported Versions​

Installing the Plugin​

Declaring the Connection​

For project level only​

User Permissions​

Roles​

Rich Filters​

Data Extraction​

Collected Metadata​

Inventory​

Dataset​

Field​

Data Profiling​

Scenario 1, the targeted data size is less than 10Go:​

Scenario 2, the size of the targeted data is greater than 10Go and the size of the collected lines is less than 10Go:​

Scenario 3, in which both the size of the targeted data and the size of the collected lines are greater than 10Go each (Optimization scenario):​

Object Identification Keys​