
World

Topics from outside of this forum. Views and opinions represented here may not reflect those of this forum or its members.

A world of content at your fingertips…

Think of this as your global discovery feed. It brings together interesting discussions from across the web and other communities, all in one place.

While you can browse what's trending now, the best way to use this feed is to make it your own. By creating an account, you can follow specific creators and topics to filter out the noise and see only what matters to you.

Ready to dive in? Create an account to start following others, get notified when people reply to you, and save your favorite finds.

  • R

    Welcome to your brand new NodeBB forum!

    This is what a topic and post looks like. As an administrator, you can edit the post's title and content.
    To customise your forum, go to the Administrator Control Panel. You can modify all aspects of your forum there, including installation of third-party plugins.

Additional Resources

• NodeBB Documentation
• Community Support Forum
• Project repository
  • R

    Hi,

    Is anyone currently using Redpanda in production or has undergone a migration from Kafka to Redpanda? I'd love to hear your story!


  • R

    Hi,

Tonic APP has the following open positions:

    • AI Ops
    • AI Product Manager
    • Gen AI Developer

    NOTE: I'm not affiliated with the company
  • R

ASOS positions:

    • Machine Learning Engineer (Mid-Level)

    Note: I'm not affiliated with the company
  • R

I was reading this article, in which Philippe Rivière and Éric Mauvière optimized a 200 GB Parquet dataset down to 549 kB.

    This work touches on some very relevant points regarding Data Engineering procedures and best practices. I would suggest reading the article, as it explains in detail what they applied at each stage and how.

    Use Case

    "This new fascinating dataset just dropped on Hugging Face. French public domain newspapers 🤗 references about 3 million newspapers and periodicals with their full text OCR’ed and some meta-data. The data is stored in 320 large parquet files. The data loader for this Observable framework project uses DuckDB to read these files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents —, into a single highly optimized parquet file."

Undoubtedly, this dataset is immensely valuable for training and processing Large Language Models (LLMs).

    Best Practices

    I firmly believe that these best practices should be applied not only to Parquet but also to other columnar formats.

These are the key factors you should take into consideration:

1. Select only the columns that you will use

    This is one of the simplest optimizations you can make. Remember that the data is stored in a columnar layout, so selecting only the columns that matter both speeds up filtering and significantly reduces the data volume.

2. Apply the most appropriate compression algorithm

    The majority of contemporary data formats support compression. When examining the most common ones for Parquet—such as LZO, Snappy, and Gzip—we observe several notable differences (ref: sheet)

For instance, Gzip-compressed files are not splittable, which means that if you process the data with a distributed engine such as Spark, the decompression of each file cannot be parallelized across tasks.

    LZO strikes a better balance between speed and compression rate when compared to Snappy. In this specific case, I would also recommend exploring Brotli as the datasets seem to contain text. Choosing an effective algorithm is crucial.

    3. Sort the data

While it may not seem immediately relevant, sorting the rows produces long runs of constant values across multiple columns, improving the ratio achieved by the compression algorithm.
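To make this concrete, here is a small DuckDB SQL sketch combining the three practices in a single COPY statement. This is my own illustration, not code from the article: the paths and the column names (title, year) are placeholders, and Brotli support depends on your DuckDB build (ZSTD is a safe fallback).

    ```sql
    -- Sketch only: column pruning, sorting, and codec choice in one DuckDB COPY.
    -- 'source/*.parquet' and 'optimized.parquet' are placeholder paths.
    COPY (
        SELECT title, year                    -- 1. keep only the columns you need
        FROM read_parquet('source/*.parquet')
        ORDER BY year, title                  -- 3. sort to create long constant runs
    ) TO 'optimized.parquet'
      (FORMAT PARQUET,
       COMPRESSION 'brotli',                  -- 2. a text-friendly codec
       ROW_GROUP_SIZE 1000000);
    ```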

Thoughts

    They took it a step further by implementing additional optimizations, such as increasing the row_group_size. What's crucial to highlight here is the significant gains achievable through the application of good engineering practices, resulting in faster and more cost-effective processes.

    It is also important to state that the data isn't exactly the same as the source data, but is the required data to train the model. DuckDB is also exceptionally fast for executing these types of processes.

    While I'm eager to test it out, unfortunately, I find myself short on both time and disk space!

References

    • https://mastodon.social/@severo/111957633001467414
    • https://github.com/apache/parquet-format/blob/master/Compression.md
    • https://huggingface.co/spaces/observablehq/fpdn
    • https://dev.to/alexmercedcoder/parquet-file-compression-for-everyone-zstd-brotli-lz4-gzip-snappy-5gb8
  • R

For the interest of the Community:

    Data Solutions Architect

    NOTE: I'm not affiliated with the company, so I can't provide either positive or negative feedback
  • R

The following list is being maintained by Luis Parada on LinkedIn and holds several companies that are hiring at the moment in Portugal, for the interest of the community.

    https://www.linkedin.com/pulse/whos-hiring-portugal-luis-parada--vhiue/

    Check it out, as there are several options for Data Professionals.
  • R

If you use SQLite you may know that it stores and updates data at the row level. I've just bumped into this project, which aims to bring column-oriented storage to SQLite, and would like to share it with you.

    This is the project description:

    Stanchion

    Column-oriented tables in SQLite

    Why?

    Stanchion is a SQLite 3 extension that brings the power of column-oriented storage to SQLite, the most widely deployed database. SQLite exclusively supports row-oriented tables, which means it is not an ideal fit for all workloads. Using the Stanchion plugin brings all of the benefits of column-oriented storage and data warehousing to anywhere that SQLite is already deployed, including your existing tech stack.

    There are a number of situations where column-oriented storage outperforms row-oriented storage:

    • Storing and processing metric, log, and event data
    • Timeseries data storage and analysis
    • Analytical queries over many rows and a few columns (e.g. calculating the average temperature over months of hourly weather data)
    • Change tracking, history/temporal tables
    • Anchor modeling / Datomic-like data models

    Stanchion is an ideal fit for analytical queries and wide tables because it only scans data from the columns that are referenced by a given query. It uses compression techniques like run length and bit-packed encodings that significantly reduce the size of stored data, greatly reducing the cost of large data sets. This makes it an ideal solution for storing large, expanding datasets.
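As an illustration of the query shape they describe, here is a small sketch using plain (row-oriented) SQLite from the Python standard library; the table and values are made up. A row store has to read every full row to answer this aggregate, whereas a columnar layout like Stanchion's can scan only the referenced column.

    ```python
    import sqlite3

    # Illustrative only: an analytical query over many rows but a single
    # column -- the shape where columnar storage shines.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weather (ts INTEGER, station TEXT, temp_c REAL)")
    conn.executemany(
        "INSERT INTO weather VALUES (?, ?, ?)",
        # 30 days of hourly readings following a simple daily cycle
        [(h, "PORTO", 10.0 + (h % 24)) for h in range(24 * 30)],
    )
    # The aggregate touches only temp_c, yet row-oriented SQLite must still
    # read ts and station for every row.
    (avg_temp,) = conn.execute("SELECT AVG(temp_c) FROM weather").fetchone()
    print(round(avg_temp, 1))
    ```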

You can find more information in the official GitHub repo:

    https://github.com/dgllghr/stanchion
  • R

Some Data positions available at SoSafe:

    • Team Lead Data Engineer (m/f/d)
    • Team Lead Cloud Engineering (m/f/d)
    • Senior Data Engineer (m/f/d)
    • Senior Data Engineer (m/f/d)
    • Senior Backend Engineer Node.js (m/f/d)

    Note: I'm not affiliated with the Company
  • R

    Integrating dbt and ClickHouse

In this article we will follow the integration steps to use dbt and ClickHouse with sample IMDB data.

    Configure ClickHouse sources

Set up ClickHouse; check this article if you would like more information on the product.

    Then connect with a client and run the following DDL scripts:

    CREATE DATABASE imdb;
    
    CREATE TABLE imdb.actors
    (
        id         UInt32,
        first_name String,
        last_name  String,
        gender     FixedString(1)
    ) ENGINE = MergeTree ORDER BY (id, first_name, last_name, gender);
    
    CREATE TABLE imdb.directors
    (
        id         UInt32,
        first_name String,
        last_name  String
    ) ENGINE = MergeTree ORDER BY (id, first_name, last_name);
    
    CREATE TABLE imdb.genres
    (
        movie_id UInt32,
        genre    String
    ) ENGINE = MergeTree ORDER BY (movie_id, genre);
    
    CREATE TABLE imdb.movie_directors
    (
        director_id UInt32,
        movie_id    UInt64
    ) ENGINE = MergeTree ORDER BY (director_id, movie_id);
    
    CREATE TABLE imdb.movies
    (
        id   UInt32,
        name String,
        year UInt32,
        rank Float32 DEFAULT 0
    ) ENGINE = MergeTree ORDER BY (id, name, year);
    
    CREATE TABLE imdb.roles
    (
        actor_id   UInt32,
        movie_id   UInt32,
        role       String,
        created_at DateTime DEFAULT now()
    ) ENGINE = MergeTree ORDER BY (actor_id, movie_id);
    

After creating the source tables, let's fill them with data from AWS S3 by running the following code:

    INSERT INTO imdb.actors
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_actors.tsv.gz',
    'TSVWithNames');
    
    INSERT INTO imdb.directors
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_directors.tsv.gz',
    'TSVWithNames');
    
    INSERT INTO imdb.genres
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_movies_genres.tsv.gz',
    'TSVWithNames');
    
    INSERT INTO imdb.movie_directors
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_movies_directors.tsv.gz',
            'TSVWithNames');
    
    INSERT INTO imdb.movies
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_movies.tsv.gz',
    'TSVWithNames');
    
    INSERT INTO imdb.roles(actor_id, movie_id, role)
    SELECT actor_id, movie_id, role
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_roles.tsv.gz',
    'TSVWithNames');
    

Set up dbt

    Start by setting up the dbt environment:

    pip install dbt-core
    pip install dbt-clickhouse
    

Initialize the dbt project:

    dbt init imdb
    

Update the dbt_project.yml file and make sure to add the actors model configuration:

    models:
      imdb:
        # Config indicated by + and applies to all files under models/example/
        actors:
          +materialized: view
    

Create the file models/actors/schema.yml with the following content:

    version: 2
    
    sources:
    - name: imdb
      tables:
      - name: directors
      - name: actors
      - name: roles
      - name: movies
      - name: genres
      - name: movie_directors
    

Create the file models/actors/actor_summary.sql with the following content:

    {{ config(order_by='(updated_at, id, name)', engine='MergeTree()', materialized='table') }}
    
    with actor_summary as (
    SELECT id,
        any(actor_name) as name,
        uniqExact(movie_id)    as num_movies,
        avg(rank)                as avg_rank,
        uniqExact(genre)         as genres,
        uniqExact(director_name) as directors,
        max(created_at) as updated_at
    FROM (
            SELECT {{ source('imdb', 'actors') }}.id as id,
                    concat({{ source('imdb', 'actors') }}.first_name, ' ', {{ source('imdb', 'actors') }}.last_name) as actor_name,
                    {{ source('imdb', 'movies') }}.id as movie_id,
                    {{ source('imdb', 'movies') }}.rank as rank,
                    genre,
                    concat({{ source('imdb', 'directors') }}.first_name, ' ', {{ source('imdb', 'directors') }}.last_name) as director_name,
                    created_at
            FROM {{ source('imdb', 'actors') }}
                        JOIN {{ source('imdb', 'roles') }} ON {{ source('imdb', 'roles') }}.actor_id = {{ source('imdb', 'actors') }}.id
                        LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON {{ source('imdb', 'movies') }}.id = {{ source('imdb', 'roles') }}.movie_id
                        LEFT OUTER JOIN {{ source('imdb', 'genres') }} ON {{ source('imdb', 'genres') }}.movie_id = {{ source('imdb', 'movies') }}.id
                        LEFT OUTER JOIN {{ source('imdb', 'movie_directors') }} ON {{ source('imdb', 'movie_directors') }}.movie_id = {{ source('imdb', 'movies') }}.id
                        LEFT OUTER JOIN {{ source('imdb', 'directors') }} ON {{ source('imdb', 'directors') }}.id = {{ source('imdb', 'movie_directors') }}.director_id
            )
    GROUP BY id
    )
    
    select *
    from actor_summary
    

Configure the ClickHouse connection in the file ~/.dbt/profiles.yml:

    imdb:
      target: dev
      outputs:
        dev:
          type: clickhouse
          schema: imdb_dbt
          host: localhost
          port: 8123
          user: default
          password: ''
          secure: False
    

After these updates, run the dbt debug command to make sure the connection is working properly:

    dbt debug
    00:31:58  Running with dbt=1.7.6
    00:31:58  dbt version: 1.7.6
    00:31:58  python version: 3.11.6
    00:31:58  python path: /home/rramos/Development/local/dbt/bin/python
    00:31:58  os info: Linux-6.6.10-zen1-1-zen-x86_64-with-glibc2.38
    00:31:58  Using profiles dir at /home/rramos/.dbt
    00:31:58  Using profiles.yml file at /home/rramos/.dbt/profiles.yml
    00:31:58  Using dbt_project.yml file at /home/rramos/Development/local/dbt/imdb/dbt_project.yml
    00:31:58  adapter type: clickhouse
    00:31:58  adapter version: 1.7.1
    00:31:58  Configuration:
    00:31:58    profiles.yml file [OK found and valid]
    00:31:58    dbt_project.yml file [OK found and valid]
    00:31:58  Required dependencies:
    00:31:58   - git [OK found]
    ...
    00:31:58  Registered adapter: clickhouse=1.7.1
    00:31:58    Connection test: [OK connection ok]
    

If the connection test passed, you just need to build the model via dbt:

    dbt run
    

You should see similar output:

    dbt run
    00:38:13  Running with dbt=1.7.6
    00:38:13  Registered adapter: clickhouse=1.7.1
    00:38:13  Unable to do partial parsing because a project config has changed
    00:38:15  Found 1 model, 6 sources, 0 exposures, 0 metrics, 421 macros, 0 groups, 0 semantic models
    00:38:15  
    00:38:15  Concurrency: 1 threads (target='dev')
    00:38:15  
    00:38:15  1 of 1 START sql view model `imdb`.`actor_summary` ............................. [RUN]
    00:38:15  1 of 1 OK created sql view model `imdb`.`actor_summary` ........................ [OK in 0.17s]
    00:38:15  
    00:38:15  Finished running 1 view model in 0 hours 0 minutes and 0.27 seconds (0.27s).
    00:38:15  
    00:38:15  Completed successfully
    00:38:15  
    00:38:15  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
    

Test the model with a query:

    SELECT *
    FROM imdb_dbt.actor_summary
    WHERE num_movies > 5
    ORDER BY avg_rank DESC
    

    Conclusion

In this article I went through the process of setting up a ClickHouse database and configuring dbt to build models on top of IMDB test data for actors, directors, movies, etc.

These two systems work like a charm together. ClickHouse shows great performance for analytical queries, and dbt compiles and runs your analytics code against your data platform, enabling you and your team to collaborate on a single source of truth for metrics, insights, and business definitions.

I would like to extend this exercise by incorporating GitHub Actions that run dbt tests before promoting to production, and by extending the model.

    But that will be for another time.
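In the meantime, as a pointer: dbt generic tests are declared in YAML alongside the model. A minimal, hypothetical sketch for the actor_summary model above could look like this (unique and not_null are built-in dbt generic tests; the column names come from the model):

    ```yaml
    # Hypothetical sketch: generic dbt tests for actor_summary,
    # e.g. appended to models/actors/schema.yml
    models:
      - name: actor_summary
        columns:
          - name: id
            tests:
              - unique
              - not_null
          - name: num_movies
            tests:
              - not_null
    ```

    Running dbt test would then validate these constraints before any promotion step.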

Let me know if you bump into any issues or have improvement suggestions.

References

    • https://clickhouse.com/docs/en/integrations/dbt
    • https://docs.getdbt.com/guides
  • R

Harvard has several online courses available for free in their catalog: https://pll.harvard.edu/catalog/free

    Make sure to check each individual course registration page. Some examples:

    • https://pll.harvard.edu/course/cs50s-introduction-game-development
    • https://pll.harvard.edu/course/cs50-introduction-computer-science
    • https://pll.harvard.edu/course/cs50s-introduction-artificial-intelligence-python
    • https://pll.harvard.edu/course/cs50s-understanding-technology-0
    • https://pll.harvard.edu/course/cs50s-web-programming-python-and-javascript
    • https://pll.harvard.edu/course/fundamentals-tinyml
    • https://pll.harvard.edu/course/applications-tinyml
    • https://pll.harvard.edu/course/mlops-scaling-tinyml
    • https://pll.harvard.edu/course/cs50s-introduction-databases-sql
    • https://pll.harvard.edu/course/data-science-visualization
    • https://pll.harvard.edu/course/data-science-linear-regression
    • https://pll.harvard.edu/course/data-science-machine-learning
    • https://pll.harvard.edu/course/introduction-data-science-python
    • https://pll.harvard.edu/course/data-analysis-life-sciences-4-high-dimensional-data-analysis
    • https://pll.harvard.edu/course/data-science-wrangling
  • R

    Awesome, now that you've landed in the Databoost Community, you might be wondering:

    • Where should I begin?

    I can offer some suggestions, but it's entirely up to you:

• Begin by completing your profile; choose "Edit Profile" from the right menu.
    • Next, I recommend reading the Code of Conduct.
    • Create your first post introducing yourself to the community in the New Members Thread.

    You'll find a left navigation menu to easily explore our Community Forum topics.

If you come across something that's not working or have a suggestion, please kick off a new thread in the Comments & Feedback section.

    Categories

Let me provide insights into the current categories to guide you in deciding where to begin your involvement:

    • Announcements: This category will contain posts related to announcements for our community, with access limited to our Moderators group
    • General Discussion: A place to talk about whatever you want
• Projects: Share your projects and useful findings in this category
    • Job Board: A space to post job openings or express availability for new challenges
    • Learning: Learning Assets, Courses or reference Articles
    • Articles: Member-contributed articles
    • Comments & Feedback: Got a question? Ask away!

    This list will likely expand with increased community engagement, but our Moderators will strive to maintain a clean and organized environment.

If you would like to suggest a new category, just let me know; you can use the Chat section.

    Hope you enjoy a good time,
    Cheers


  • R

    Hey there, new members!

    👋 Welcome to our vibrant community at Databoost! We'd love to get to know you better. Take a moment to introduce yourself in this thread. Share a bit about your background, what brings you here, and any exciting projects you're working on. Don't be shy – our community is all about connecting and learning from each other. Looking forward to meeting you!

    Let me start ...

    Hello, I'm Rui Ramos, an Engineering Manager with several years of experience in IT. My passion lies in High-Performance Computing, Cloud-based architectures, and Data-centric services. Currently based in Porto, Portugal, I'm excited to learn and collaborate with all of you.

    Your turn, don't be shy :blush:

  • R

    Code of Conduct: Databoost Community

    Welcome to Databoost Community! To ensure a positive and inclusive community, we ask all members to adhere to the following guidelines:

    Be Respectful:

    • Treat others with kindness and respect.
    • Avoid offensive language, discrimination, and personal attacks.

    Maintain a Positive Environment:

    • Foster constructive and supportive discussions.
    • Refrain from spreading negativity or engaging in trolling behavior.

    Stay On Topic:

    • Keep discussions relevant to the forum's purpose.
    • Avoid spamming or posting unrelated content.

    Value Diversity:

    • Embrace a variety of perspectives and experiences.
    • Avoid making assumptions based on race, gender, or other personal characteristics.

    Protect Privacy:

    • Refrain from sharing personal information about yourself or others.
    • Respect the privacy of fellow community members.

    Report Inappropriate Behavior:

    • If you encounter any inappropriate content or behavior, report it to the moderators promptly.
    • Help us maintain a safe and welcoming environment for everyone.

    No Plagiarism or Copyright Violations:

    • Respect intellectual property rights. Give proper credit and avoid posting copyrighted material without permission.

    Professional Conduct:

    • Remember that this forum is a professional space; maintain professional communication and behavior.

    Follow Forum Guidelines:

    • Familiarize yourself with any specific guidelines for sub-forums or categories.
    • Adhere to instructions provided by moderators.

    Consequences for Violations:

    • Violations of this code of conduct may result in warnings or, in severe cases, account suspension or termination.

    By participating in Databoost Community, you agree to uphold these guidelines and contribute to a positive and collaborative community. Let's make Databoost Community a great place for everyone!

    Thank you for your cooperation.

    Databoost Community Moderation Team
