Meltano supports multiple utilities; one of them is the DataHub utility, which provides an integration for the metadata platform DataHub. You can find the reference at both the utility-datahub level as well as the file-bundle-datahub level.
This guide explains how to use the datahub utility with either a locally running or a remotely running DataHub instance. It covers how to set up the utility, configure DataHub sources, and run ingestions.
We assume you have some familiarity with DataHub and Meltano.
Have your DataHub GMS URL and auth token ready for the setup, or follow the local setup described below.
The components needed to get DataHub & Meltano to work together are a running DataHub instance (local or remote), the Meltano datahub utility with the modules for your sources, and one ingestion recipe per source.
If you’re just testing out DataHub & Meltano, you can start a local DataHub GMS.
To get started with the local version, install the datahub CLI and then run the datahub docker quickstart
command to launch the docker-compose cluster. After that you’re all set up (you don’t need to ingest sample data, as suggested by the quickstart prompt).
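As a rough sketch, the local setup boils down to two commands (assuming Python and Docker are available; adjust the install method to your environment):
python -m pip install --upgrade acryl-datahub
datahub docker quickstart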
By default, the UI will be located at http://localhost:9002/.
We provide a sandbox version of this Meltano setup configured to point to a local DataHub.
The GMS URL for your local version defaults to http://localhost:8080 unless you change it.
Before installing the utility, take a look at the DataHub modules you will need. DataHub modules correspond to the places (called sources) you want to retrieve metadata from, for instance PostgreSQL or AWS S3.
For this example we choose the s3, postgres, and dbt modules.
To install the utility you can either define it in the meltano.yml file or use the command line to add it. For the command line, use
meltano add utility datahub[s3,postgres,dbt]
This will prepopulate your meltano.yml with the plugin as follows:
utilities:
- name: datahub
  variant: datahub-project
  pip_url: acryl-datahub[s3,postgres,dbt]
  config:
    gms_host:
    gms_auth:
Alternatively, you can add this snippet to your meltano.yml and run
meltano install
Configure the utility by setting the GMS endpoint for the datahub CLI running inside the Meltano project. If DataHub’s Metadata Service Authentication is turned off, you just need to configure the gms_host attribute:
utilities:
- name: datahub
  variant: datahub-project
  pip_url: acryl-datahub[s3,postgres,dbt]
  config:
    gms_host: http://localhost:8080
or alternatively run the equivalent CLI command
meltano config datahub set gms_host http://localhost:8080
If you have Metadata Service Authentication turned on, you will need an access token and have to configure the gms_auth attribute as well:
utilities:
- name: datahub
  variant: datahub-project
  pip_url: acryl-datahub[s3,postgres,dbt]
  config:
    gms_host: http://localhost/gms/api
    gms_auth: myToken
Alternatively, run
meltano config datahub set gms_auth myToken
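As a quick sanity check (not part of the official setup, just a suggestion), you can print the utility’s resolved configuration and confirm the bundled datahub CLI responds:
meltano config datahub
meltano invoke datahub version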
You need one so-called recipe for each source you want to ingest metadata from. You can browse the DataHub source catalog to write your own recipes for all supported sources.
They are written in YAML and stored as *.dhub.yml files.
The Meltano utility comes with one preconfigured recipe for dbt. You will need to adapt the target_platform parameter inside this recipe.
source:
  type: "dbt"
  config:
    # Coordinates
    manifest_path: ${MELTANO_PROJECT_ROOT}/.meltano/transformers/dbt/target/manifest.json
    catalog_path: ${MELTANO_PROJECT_ROOT}/.meltano/transformers/dbt/target/catalog.json
    sources_path: ${MELTANO_PROJECT_ROOT}/.meltano/transformers/dbt/target/sources.json

    # TODO: Change me to the appropriate platform, ie. bigquery, postgres, etc.
    # https://github.com/datahub-project/datahub/blob/master/metadata-service/war/src/main/resources/boot/data_platforms.json
    target_platform: "CHANGE ME"

sink:
  type: datahub-rest
  config:
    server: ${DATAHUB_GMS_HOST}
    token: ${DATAHUB_GMS_TOKEN}
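For instance, since the rest of this guide works against a PostgreSQL database via dbt-postgres, a plausible value for that parameter (an assumption for this example, not part of the bundled recipe) would be:
    target_platform: "postgres"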
The dbt recipe is placed inside the ${MELTANO_PROJECT_ROOT}/utilities/datahub/ directory, but you can place the recipes anywhere.
A sample AWS S3 ingestion recipe looks like this:
source:
  type: s3
  config:
    path_specs:
      - include: "s3://test/*.csv"
    aws_config:
      aws_access_key_id: XXX
      aws_secret_access_key: XXX
      aws_region: us-east-1
      aws_endpoint_url: http://host.docker.internal:5005 # mock, replace with yours!
    env: "PROD"
    profiling:
      enabled: false
A sample PostgreSQL ingestion recipe looks like this:
source:
  type: postgres
  config:
    # Coordinates
    host_port: host.docker.internal:5432
    database: demo

    # Credentials
    username: admin
    password: password
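Neither of these two recipes declares a sink. If you want them to state explicitly where the metadata goes, you can append the same datahub-rest sink block the dbt recipe uses (a sketch reusing the DATAHUB_GMS_HOST and DATAHUB_GMS_TOKEN environment variables; drop the token line if authentication is off):
sink:
  type: datahub-rest
  config:
    server: ${DATAHUB_GMS_HOST}
    token: ${DATAHUB_GMS_TOKEN}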
To run the ingestion, you use the following command:
meltano invoke datahub ingest -c YOURRECIPE.dhub.yml
For example, for S3 and PostgreSQL you would call:
meltano invoke datahub ingest -c s3recipe.dhub.yml
and
meltano invoke datahub ingest -c postgresrecipe.dhub.yml
To run the dbt ingestion, you can use the following Meltano command:
meltano invoke datahub:dbt-ingest
Note: For the dbt ingestion to work, you need to have run the source freshness command as well as the docs generate command from dbt beforehand. If you haven’t yet, run:
meltano invoke dbt-postgres:docs-generate
and
meltano invoke dbt-postgres:freshness
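If you want the dbt prerequisites and the metadata ingestion to run as one unit, one possible setup (a sketch using Meltano’s job and schedule commands; the job and schedule names here are made up) is:
meltano job add dbt-and-metadata --tasks "dbt-postgres:docs-generate dbt-postgres:freshness datahub:dbt-ingest"
meltano schedule add daily-metadata --job dbt-and-metadata --interval "@daily"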
There’s an example repository linked to this how-to: Meltano Toy Projects: DataHub & Meltano.
Be sure to read the two READMEs located inside the file bundle.