Let’s learn by example.
Throughout this tutorial, we’ll walk you through the creation of an end-to-end modern ELT stack. In this part, we’re going to start with the data extraction process.
We’re going to take data from a “source”, namely GitHub, and extract a list of commits to one repository.
To test that this part works, we will dump the data into JSON files.
In Part 2, we will then place this data into a PostgreSQL database.
We’ll assume you have Meltano installed already. You can check that Meltano is installed, and which version you have, by running `meltano --version`:
```console
$ meltano --version
meltano, version 2.6.0
```
This tutorial is written using meltano >= v2.0.0.
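If you would rather check the version requirement from a script, a minimal sketch could parse the text printed by `meltano --version`. The helper names `parse_meltano_version` and `meets_minimum` are hypothetical, and the sketch assumes the output looks like the example above (`meltano, version X.Y.Z`).

```python
from __future__ import annotations

import re


def parse_meltano_version(output: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from text like 'meltano, version 2.6.0'."""
    match = re.search(r"version\s+(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized version output: {output!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def meets_minimum(output: str, minimum: tuple[int, int, int] = (2, 0, 0)) -> bool:
    """Return True if the reported version is at least `minimum`."""
    return parse_meltano_version(output) >= minimum


# Assumed sample output, copied from the console example above.
print(meets_minimum("meltano, version 2.6.0"))
```

Tuples compare element by element, so `(2, 6, 0) >= (2, 0, 0)` does the right thing without any string comparison pitfalls.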
If you don’t have a GitHub account to follow along, you can either swap the commands for a different tap, like GitLab or PostgreSQL, or create a free GitHub account. You will also need a personal access token for your GitHub account.
If you're having trouble throughout this tutorial, you can always head over to the Slack channel to get help.
Create Your Meltano Project
Step 1 is to create a new Meltano project that (among other things)
will hold the plugins that implement the details of our ELT pipeline.
- Navigate to the directory that you’d like to hold your Meltano projects.
- Initialize a new project in a directory of your choosing using `meltano init`:
meltano init my-new-project
```console
$ meltano init my-new-project
Created my-new-project
Creating project files...
my-new-project/
|-- .meltano
|-- meltano.yml
|-- README.md
|-- requirements.txt
|-- output/.gitignore
|-- .gitignore
|-- extract/.gitkeep
|-- load/.gitkeep
|-- transform/.gitkeep
|-- analyze/.gitkeep
|-- notebook/.gitkeep
|-- orchestrate/.gitkeep
Creating system database... Done!
... Project my-new-project has been created!
Meltano Environments initialized with dev, staging, and prod.
To learn more about Environments visit: https://docs.meltano.com/concepts/environments
Next steps:
cd my-new-project
Visit https://docs.meltano.com/getting-started#create-your-meltano-project to learn where to go from here.
```
This action will create a new directory with, among other things, your `meltano.yml` project file. Your file will look something like this:
version: 1
default_environment: dev
project_id: <unique-GUID>
environments:
- name: dev
- name: staging
- name: prod
- Navigate to the newly created project directory:
cd my-new-project
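As a quick sanity check that you are in the right place, a small sketch like the following could confirm that the current directory contains the `meltano.yml` project file. The function name `is_meltano_project` is hypothetical, and the check only looks for the file; it does not validate its contents.

```python
from pathlib import Path


def is_meltano_project(directory: Path) -> bool:
    """Return True if `directory` contains a meltano.yml project file."""
    return (directory / "meltano.yml").is_file()


if __name__ == "__main__":
    here = Path.cwd()
    if is_meltano_project(here):
        print(f"{here} looks like a Meltano project")
    else:
        print(f"No meltano.yml found in {here}; did you cd into the project?")
```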
Now that you have your very own Meltano project, it’s time to add plugins to it. We’re going to add an extractor for GitHub to get our data. An extractor is responsible for pulling data out of a data source. In this case, we choose a specific variant with the `--variant` option to make this tutorial easy to work with.
- Add the GitHub extractor:
$ meltano add extractor tap-github --variant=meltanolabs
```console
$ meltano add extractor tap-github --variant=meltanolabs
2022-09-19T09:32:05.162591Z [info ] Environment 'dev' is active
Added extractor 'tap-github' to your Meltano project
Variant: meltanolabs (default)
Repository: https://github.com/meltanolabs/tap-github
Documentation: https://hub.meltano.com/extractors/tap-github
Installing extractor 'tap-github'...
Installed extractor 'tap-github'
To learn more about extractor 'tap-github', visit https://hub.meltano.com/extractors/tap-github
```
This will add the new plugin to your `meltano.yml` project file:
plugins:
  extractors:
  - name: tap-github
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-github.git
- Test that the installation was successful by calling `meltano invoke`:
$ meltano invoke tap-github --help
If you see the extractor’s help message printed, the plugin was installed successfully.
```console
$ meltano invoke tap-github --help
2022-09-19T09:32:05.162591Z [info ] Environment 'dev' is active
usage: tap-github [-h] -c CONFIG [-s STATE] [-p PROPERTIES] [--catalog CATALOG] [-d]
options:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Config file
-s STATE, --state STATE
State file
-p PROPERTIES, --properties PROPERTIES
Property selections: DEPRECATED, Please use --catalog instead
--catalog CATALOG Catalog file
-d, --discover Do schema discovery
```
The GitHub tap requires configuration before it can start extracting data.
- The simplest way to configure a new plugin in Meltano is to use interactive mode:
$ meltano config tap-github set --interactive
- Follow the prompts to step through all available settings. The ones you’ll need to fill out are repositories, start_date, and auth_token (your GitHub personal access token).
```console
$ meltano config tap-github set --interactive
Configuring Extractor 'tap-github'
[...]
Settings
1. user_agent: [...]
3. auth_token: GitHub token to authenticate ...
[...]
8. repositories: An array of strings containing the github repos to be ...
[...]
11. start_date:
[...]
To learn more about extractor 'tap-github' and its settings, visit https://hub.meltano.com/extractors/tap-github
Loop through all settings (all), select a setting by number (1 - 11), or exit (e)? [all]:
$ 3
[...]Description:
GitHub token to authenticate with.
New value:
$
Repeat for confirmation:
$
[... other 2 values...]
```
This will add the configuration to your `meltano.yml` project file:
plugins:
  extractors:
  - name: tap-github
    config:
      start_date: 2022-01-01
      repositories:
      - sbalnojan/meltano-lightdash
It will also add your secret auth token to the file `.env`:
TAP_GITHUB_AUTH_TOKEN='ghp_XXX' # your token!
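The variable name in `.env` combines the plugin name and the setting name, upper-cased with dashes turned into underscores. The sketch below illustrates that naming pattern as inferred from the example above. The helper `setting_env_var` is hypothetical and not part of Meltano's API, and the handling of dots in nested setting names is an assumption.

```python
def setting_env_var(plugin_name: str, setting_name: str) -> str:
    """Build an environment variable name such as TAP_GITHUB_AUTH_TOKEN."""
    prefix = plugin_name.replace("-", "_")
    # Assumption: dots and dashes in setting names also become underscores.
    suffix = setting_name.replace(".", "_").replace("-", "_")
    return f"{prefix}_{suffix}".upper()


print(setting_env_var("tap-github", "auth_token"))  # TAP_GITHUB_AUTH_TOKEN
```

Keeping secrets in `.env` rather than `meltano.yml` means the project file can be committed to version control without exposing the token.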
- Double check the config by running `meltano config tap-github`:
meltano config tap-github
```console
$ meltano config tap-github
2022-09-19T11:26:22.888257Z [info ] Environment 'dev' is active
2022-09-19T11:26:23.573556Z [info ] The default environment (dev) will be ignored for `meltano config`. To configure a specific Environment, please use option `--environment=[]`.
{
"repository": "sbalnojan/meltano-lightdash",
"start_date": "2022-01-01"
}
```
Select Entities and Attributes to Extract
Now that the extractor has been configured, it’ll know where and how to find your data, but won’t yet know which specific entities and attributes (tables and columns) you’re interested in.
By default, Meltano will instruct extractors to extract all supported entities and attributes, but we’re going to select specific ones for this tutorial.
- Find out what entities and attributes are available, using `meltano select YOUR_TAP --list --all`:
meltano select tap-github --list --all
```console
$ meltano select tap-github --list
2022-09-19T10:59:43.554214Z [info ] Environment 'dev' is active
Legend:
selected
excluded
automatic
Enabled patterns:
*.*
Selected attributes:
[selected ] assignees._sdc_repository
[automatic]
[...]
[selected ] commits.comments_url
[selected ] commits.commit
[selected ] commits.commit.author
[...]
[selected ] teams.repositories_url
[selected ] teams.slug
[selected ] teams.url
```
- Select the entities and attributes for extraction using `meltano select`:
meltano select tap-github commits url
meltano select tap-github commits sha
meltano select tap-github commits commit_timestamp
This will add the selection rules to your `meltano.yml` project file:
version: 1
default_environment: dev
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-github
        select:
        - commits.url
        - commits.sha
        - commits.commit_timestamp
- name: staging
- name: prod
project_id: YOUR_ID
plugins:
  extractors:
  - name: tap-github
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-github.git
    config:
      start_date: 2022-01-01
      repositories:
      - sbalnojan/meltano-lightdash
- Run `meltano select tap-github --list` to double-check your selection:
meltano select tap-github --list
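To build intuition for how selection rules such as `commits.url` or the default `*.*` pick out attributes, here is a simplified model that uses glob-style matching on `entity.attribute` names. This is an illustrative sketch only, not Meltano's actual implementation: it ignores exclusion rules and automatically included attributes, and the function name `selected_attributes` is hypothetical.

```python
from fnmatch import fnmatchcase


def selected_attributes(available, patterns):
    """Return the attributes that match at least one selection pattern."""
    return [
        attribute
        for attribute in available
        if any(fnmatchcase(attribute, pattern) for pattern in patterns)
    ]


# A few attribute names taken from the listing above.
available = [
    "commits.url",
    "commits.sha",
    "commits.commit_timestamp",
    "commits.comments_url",
    "teams.slug",
]

rules = ["commits.url", "commits.sha", "commits.commit_timestamp"]
print(selected_attributes(available, rules))   # only the three commits attributes
print(selected_attributes(available, ["*.*"]))  # the default pattern keeps everything
```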
Add a dummy loader to dump the data into JSON
To test that the extraction process works, we add a JSON target.
- Add the JSON target using `meltano add loader target-jsonl`.
```console
$ meltano add loader target-jsonl
2022-09-19T13:47:42.389423Z [info ] Environment 'dev' is active
To add it to your project another time so that each can be configured differently,
add a new plugin inheriting from the existing one with its own unique name:
meltano add loader target-jsonl--new --inherit-from target-jsonl
Installing loader 'target-jsonl'...
Installed loader 'target-jsonl'
To learn more about loader 'target-jsonl', visit https://hub.meltano.com/loaders/target-jsonl
```
This target requires zero configuration; it simply writes the data to a JSONL file.
Now that your Meltano project, extractor, and dummy loader are set up, we can test run the extraction process.
There’s just one step here: run your newly added extractor and JSONL loader in a pipeline using `meltano run`:
$ meltano run tap-github target-jsonl
```console
$ meltano run tap-github target-jsonl
2022-09-19T13:53:36.403099Z [info ] Environment 'dev' is active
2022-09-19T13:53:41.062802Z [info ] Found state from 2022-09-19 13:53:17.415907.
2022-09-19T13:53:41.071885Z [warning ] No state was found, complete import.
2022-09-19T13:53:43.054384Z [info ] INFO Starting sync of repository: sbalnojan/meltano-lightdash cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
2022-09-19T13:53:43.553171Z [info ] INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.4796161651611328, "tags": {"endpoint": "commits", "http_status_code": 200, "status": "succeeded"}} cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
2022-09-19T13:53:43.561190Z [info ] INFO METRIC: {"type": "counter", "metric": "record_count", "value": 1, "tags": {"endpoint": "commits"}} cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
2022-09-19T13:53:43.735250Z [info ] Incremental state has been updated at 2022-09-19 13:53:43.734535.
2022-09-19T13:53:43.820467Z [info ] Block run completed. block_type=ExtractLoadBlocks err=None set_number=0 success=True
```
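The `record_count` metric lines in the log above tell you how many records the tap emitted per endpoint. If you save the log to a file, a sketch along these lines could tally them. It assumes each metric line contains `METRIC: ` followed by a JSON object, as in the sample output; the function name `record_counts` is hypothetical.

```python
import json


def record_counts(log_lines):
    """Sum record_count metrics per endpoint from Meltano run log lines."""
    decoder = json.JSONDecoder()
    totals = {}
    for line in log_lines:
        marker = line.find("METRIC: ")
        if marker == -1:
            continue
        start = marker + len("METRIC: ")
        try:
            # raw_decode parses one JSON value and ignores the trailing key=value text.
            metric, _ = decoder.raw_decode(line, start)
        except json.JSONDecodeError:
            continue
        if metric.get("metric") != "record_count":
            continue
        endpoint = metric.get("tags", {}).get("endpoint", "unknown")
        totals[endpoint] = totals.get(endpoint, 0) + metric["value"]
    return totals


# Shortened sample lines modeled on the log output above.
sample_log = [
    '[info ] INFO METRIC: {"type": "timer", "metric": "http_request_duration", '
    '"value": 0.47, "tags": {"endpoint": "commits"}} cmd_type=elb',
    '[info ] INFO METRIC: {"type": "counter", "metric": "record_count", '
    '"value": 1, "tags": {"endpoint": "commits"}} cmd_type=elb',
]
print(record_counts(sample_log))
```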
You should see data flowing from your source into the JSONL file.
You can verify that it worked by looking inside the newly created file called `output/commits.jsonl`.
```console
$ cat output/commits.jsonl
{"sha": "409bdd601e0531833665f538bccecd0f69e101c0", "url": "https://api.github.com/repos/sbalnojan/meltano-lightdash/commits/409bdd601e0531833665f538bccecd0f69e101c0"}
```
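For a slightly more thorough check than `cat`, the following sketch parses each line of the JSONL file and reports records that lack the fields seen in the sample output. The helper names and the expected key list (`sha`, `url`) are assumptions based on the record shown above; the sample line is used only when `output/commits.jsonl` is not present.

```python
import json
from pathlib import Path


def load_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def missing_keys(record, expected=("sha", "url")):
    """Return the expected keys that are absent from a record."""
    return [key for key in expected if key not in record]


# Sample line copied from the output above; used when the file is not present.
sample = [
    '{"sha": "409bdd601e0531833665f538bccecd0f69e101c0", '
    '"url": "https://api.github.com/repos/sbalnojan/meltano-lightdash/commits/'
    '409bdd601e0531833665f538bccecd0f69e101c0"}'
]

path = Path("output/commits.jsonl")
lines = path.read_text().splitlines() if path.is_file() else sample
for record in load_jsonl(lines):
    problems = missing_keys(record)
    print(record.get("sha"), "OK" if not problems else f"missing {problems}")
```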
Next Steps
Next, head over to Part 2: Loading extracted data into a target.