
databoost.dev

Posts


  • Welcome to your NodeBB!
    R rui.ms.ramos

    Welcome to your brand new NodeBB forum!

    This is what a topic and post looks like. As an administrator, you can edit the post's title and content.
    To customise your forum, go to the Administrator Control Panel. You can modify all aspects of your forum there, including installation of third-party plugins.

    Additional Resources

    • NodeBB Documentation
    • Community Support Forum
    • Project repository
    General Discussion

  • Redpanda vs Kafka
    R rui.ms.ramos

    Hi,

    Is anyone currently using Redpanda in production, or has anyone migrated from Kafka to Redpanda? I'd love to hear your story!

    General Discussion

  • Tonic App - Several AI Positions available
    R rui.ms.ramos

    Hi,

    Tonic App has the following open positions:

    • AI Ops
    • AI Product Manager
    • Gen AI Developer

    NOTE: I'm not affiliated with the company

    Job Board

  • ASOS - Machine Learning Engineer (Mid-Level) [UK]
    R rui.ms.ramos

    ASOS positions

    • Machine Learning Engineer (Mid-Level)

    Note: I'm not affiliated with the company

    Job Board

  • Parquet Compression - 200GB to 549kB
    R rui.ms.ramos

    I was reading this article in which Philippe Rivière and Éric Mauvière optimized 200GB of Parquet data down to 549kB.

    This work touches on some very relevant points regarding Data Engineering procedures and best practices. I would suggest reading the article, as it explains in detail what they applied at each stage and how.

    Use Case

    "This new fascinating dataset just dropped on Hugging Face. French public domain newspapers 🤗 references about 3 million newspapers and periodicals with their full text OCR’ed and some meta-data. The data is stored in 320 large parquet files. The data loader for this Observable framework project uses DuckDB to read these files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents —, into a single highly optimized parquet file."

    Undoubtedly, this dataset proves immensely valuable for training and processing large language models (LLMs).

    Best Practices

    I firmly believe that these best practices should be applied not only to Parquet but also to other columnar formats.

    These are the key factors you should take into consideration:

    1. Select only the columns that you will use

    This is one of the simplest optimizations you can make. Remember that the data is stored in a columnar way, so selecting only the columns that matter not only filters the data very quickly but also significantly reduces the volume read.
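    As a sketch of this point, assuming DuckDB and a hypothetical newspapers/*.parquet layout like the one in the article, listing only the needed columns means the heavy full-text column is never read:

```sql
-- Only title and year are scanned; the OCR'ed text column is skipped entirely
SELECT title, year
FROM read_parquet('newspapers/*.parquet');
```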

    2. Apply the most appropriate compression algorithm

    The majority of contemporary data formats support compression. When examining the most common ones for Parquet—such as LZO, Snappy, and Gzip—we observe several notable differences (ref: sheet)

    For instance, gzip-compressed files cannot be split, which means that if you process the data with a distributed engine like Spark, a single process must handle all the decompression.

    LZO strikes a better balance between speed and compression rate when compared to Snappy. In this specific case, I would also recommend exploring Brotli as the datasets seem to contain text. Choosing an effective algorithm is crucial.
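    To get a feel for the trade-offs, here is a small Python sketch that compares the codecs available in the standard library (so LZO, Snappy, and Brotli are not included) on a repetitive text payload, the kind of data this dataset contains:

```python
import bz2
import gzip
import lzma

# Repetitive text payload, loosely mimicking newspaper metadata rows
data = ("year:1881;title:Le Petit Journal\n" * 1000).encode()

sizes = {
    "raw": len(data),
    "gzip": len(gzip.compress(data)),
    "bz2": len(bz2.compress(data)),
    "lzma": len(lzma.compress(data)),
}
for name, size in sizes.items():
    print(f"{name:>4}: {size} bytes")
```

    The exact ratios depend on the data and the compression level, but on text with long repeated runs every codec shrinks the payload dramatically.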

    3. Sort the data

    While it may not seem immediately relevant, aligning the rows in this manner results in extended streaks of constant values across multiple columns, enhancing the compression ratio achieved by the compression algorithm.
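    This effect can be demonstrated in a few lines of Python: compressing the same column values shuffled versus sorted shows how long runs of constant values improve the ratio (zlib stands in here for a columnar codec):

```python
import random
import zlib

random.seed(42)
# Simulated low-cardinality column: many repeated values in arbitrary row order
values = [random.choice(["sports", "politics", "culture", "science"])
          for _ in range(10_000)]

shuffled_blob = "\n".join(values).encode()
sorted_blob = "\n".join(sorted(values)).encode()

shuffled_size = len(zlib.compress(shuffled_blob))
sorted_size = len(zlib.compress(sorted_blob))
print(shuffled_size, sorted_size)  # sorted data compresses far smaller
```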

    Thoughts

    They took it a step further by implementing additional optimizations, such as increasing the row_group_size. What's crucial to highlight here is the significant gains achievable through the application of good engineering practices, resulting in faster and more cost-effective processes.
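    For reference, a DuckDB sketch (file names are hypothetical) of how the row group size and codec can be tuned when writing Parquet:

```sql
COPY (SELECT title, year FROM read_parquet('newspapers/*.parquet'))
TO 'optimized.parquet'
(FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100000);
```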

    It is also important to state that the data isn't exactly the same as the source data, but is the required data to train the model. DuckDB is also exceptionally fast for executing these types of processes.

    While I'm eager to test it out, unfortunately, I find myself short on both time and disk space!

    References

    • https://mastodon.social/@severo/111957633001467414
    • https://github.com/apache/parquet-format/blob/master/Compression.md
    • https://huggingface.co/spaces/observablehq/fpdn
    • https://dev.to/alexmercedcoder/parquet-file-compression-for-everyone-zstd-brotli-lz4-gzip-snappy-5gb8
    Blogs

  • Aubay - Data Solutions Architect
    R rui.ms.ramos

    For the interest of the Community,

    • Data Solutions Architect

    NOTE: I'm not affiliated with the company and cannot provide either positive or negative feedback

    Job Board

  • Portugal Companies hiring
    R rui.ms.ramos

    The following list is maintained by Luis Parada on LinkedIn and holds several companies that are hiring at the moment in Portugal, shared for the interest of the community.

    https://www.linkedin.com/pulse/whos-hiring-portugal-luis-parada--vhiue/

    Check it out, as there are several options for Data Professionals.

    Job Board

  • Stanchion
    R rui.ms.ramos

    If you use SQLite you may know that it stores and updates data at the row level. I've just bumped into this project, which aims to bring column-oriented storage to SQLite, and would like to share it with you.

    This is the project description:

    Stanchion

    Column-oriented tables in SQLite

    Why?

    Stanchion is a SQLite 3 extension that brings the power of column-oriented storage to SQLite, the most widely deployed database. SQLite exclusively supports row-oriented tables, which means it is not an ideal fit for all workloads. Using the Stanchion plugin brings all of the benefits of column-oriented storage and data warehousing to anywhere that SQLite is already deployed, including your existing tech stack.

    There are a number of situations where column-oriented storage outperforms row-oriented storage:

    • Storing and processing metric, log, and event data
    • Timeseries data storage and analysis
    • Analytical queries over many rows and a few columns (e.g. calculating the average temperature over months of hourly weather data)
    • Change tracking, history/temporal tables
    • Anchor modeling / Datomic-like data models

    Stanchion is an ideal fit for analytical queries and wide tables because it only scans data from the columns that are referenced by a given query. It uses compression techniques like run length and bit-packed encodings that significantly reduce the size of stored data, greatly reducing the cost of large data sets. This makes it an ideal solution for storing large, expanding datasets.

    You can find more information on the official Github repo:

    • https://github.com/dgllghr/stanchion
    Projects

  • SoSafe - Open Positions
    R rui.ms.ramos

    Some Data positions available at SoSafe:

    • Team Lead Data Engineer (m/f/d)
    • Team Lead Cloud Engineering (m/f/d)
    • Senior Data Engineer (m/f/d)
    • Senior Data Engineer (m/f/d)
    • Senior Backend Engineer Node.js (m/f/d)

    Note: I'm not affiliated with the company

    Job Board

  • Integrating dbt and ClickHouse
    R rui.ms.ramos

    Integrating dbt and ClickHouse

    In this post we will follow the integration steps to use dbt and ClickHouse with sample IMDB data.

    Configure ClickHouse sources

    Set up ClickHouse; check this article if you would like more information on the product.

    Then connect with a client and run the following DDL scripts:

    CREATE DATABASE imdb;
    
    CREATE TABLE imdb.actors
    (
        id         UInt32,
        first_name String,
        last_name  String,
        gender     FixedString(1)
    ) ENGINE = MergeTree ORDER BY (id, first_name, last_name, gender);
    
    CREATE TABLE imdb.directors
    (
        id         UInt32,
        first_name String,
        last_name  String
    ) ENGINE = MergeTree ORDER BY (id, first_name, last_name);
    
    CREATE TABLE imdb.genres
    (
        movie_id UInt32,
        genre    String
    ) ENGINE = MergeTree ORDER BY (movie_id, genre);
    
    CREATE TABLE imdb.movie_directors
    (
        director_id UInt32,
        movie_id    UInt64
    ) ENGINE = MergeTree ORDER BY (director_id, movie_id);
    
    CREATE TABLE imdb.movies
    (
        id   UInt32,
        name String,
        year UInt32,
        rank Float32 DEFAULT 0
    ) ENGINE = MergeTree ORDER BY (id, name, year);
    
    CREATE TABLE imdb.roles
    (
        actor_id   UInt32,
        movie_id   UInt32,
        role       String,
        created_at DateTime DEFAULT now()
    ) ENGINE = MergeTree ORDER BY (actor_id, movie_id);
    

    After creating the source tables, let's fill them with data from AWS S3 by running the following code:

    INSERT INTO imdb.actors
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_actors.tsv.gz',
    'TSVWithNames');
    
    INSERT INTO imdb.directors
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_directors.tsv.gz',
    'TSVWithNames');
    
    INSERT INTO imdb.genres
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_movies_genres.tsv.gz',
    'TSVWithNames');
    
    INSERT INTO imdb.movie_directors
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_movies_directors.tsv.gz',
            'TSVWithNames');
    
    INSERT INTO imdb.movies
    SELECT *
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_movies.tsv.gz',
    'TSVWithNames');
    
    INSERT INTO imdb.roles(actor_id, movie_id, role)
    SELECT actor_id, movie_id, role
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_roles.tsv.gz',
    'TSVWithNames');
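    The loads can take a while; a quick sanity check (exact counts depend on the dataset version, so none are shown here) is to count the rows in each source table:

```sql
SELECT count() FROM imdb.actors;
SELECT count() FROM imdb.roles;
```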
    

    Set up dbt

    Start by setting up the dbt environment:

    pip install dbt-core
    pip install dbt-clickhouse
    

    Initialize the dbt project:

    dbt init imdb
    

    Update the dbt_project.yml file and make sure to add the actors model configuration:

    models:
      imdb:
        # Config indicated by + and applies to all files under models/example/
        actors:
          +materialized: view
    

    Create the file models/actors/schema.yml with the following content:

    version: 2
    
    sources:
    - name: imdb
      tables:
      - name: directors
      - name: actors
      - name: roles
      - name: movies
      - name: genres
      - name: movie_directors
    

    Create the file models/actors/actor_summary.sql with the following content:

    {{ config(order_by='(updated_at, id, name)', engine='MergeTree()', materialized='table') }}
    
    with actor_summary as (
    SELECT id,
        any(actor_name) as name,
        uniqExact(movie_id)    as num_movies,
        avg(rank)                as avg_rank,
        uniqExact(genre)         as genres,
        uniqExact(director_name) as directors,
        max(created_at) as updated_at
    FROM (
            SELECT {{ source('imdb', 'actors') }}.id as id,
                    concat({{ source('imdb', 'actors') }}.first_name, ' ', {{ source('imdb', 'actors') }}.last_name) as actor_name,
                    {{ source('imdb', 'movies') }}.id as movie_id,
                    {{ source('imdb', 'movies') }}.rank as rank,
                    genre,
                    concat({{ source('imdb', 'directors') }}.first_name, ' ', {{ source('imdb', 'directors') }}.last_name) as director_name,
                    created_at
            FROM {{ source('imdb', 'actors') }}
                        JOIN {{ source('imdb', 'roles') }} ON {{ source('imdb', 'roles') }}.actor_id = {{ source('imdb', 'actors') }}.id
                        LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON {{ source('imdb', 'movies') }}.id = {{ source('imdb', 'roles') }}.movie_id
                        LEFT OUTER JOIN {{ source('imdb', 'genres') }} ON {{ source('imdb', 'genres') }}.movie_id = {{ source('imdb', 'movies') }}.id
                        LEFT OUTER JOIN {{ source('imdb', 'movie_directors') }} ON {{ source('imdb', 'movie_directors') }}.movie_id = {{ source('imdb', 'movies') }}.id
                        LEFT OUTER JOIN {{ source('imdb', 'directors') }} ON {{ source('imdb', 'directors') }}.id = {{ source('imdb', 'movie_directors') }}.director_id
            )
    GROUP BY id
    )
    
    select *
    from actor_summary
    

    Configure the ClickHouse connection in the file ~/.dbt/profiles.yml:

    imdb:
      target: dev
      outputs:
        dev:
          type: clickhouse
          schema: imdb_dbt
          host: localhost
          port: 8123
          user: default
          password: ''
          secure: False
    

    After these updates, run the dbt debug command to make sure the connection is working properly:

    dbt debug
    00:31:58  Running with dbt=1.7.6
    00:31:58  dbt version: 1.7.6
    00:31:58  python version: 3.11.6
    00:31:58  python path: /home/rramos/Development/local/dbt/bin/python
    00:31:58  os info: Linux-6.6.10-zen1-1-zen-x86_64-with-glibc2.38
    00:31:58  Using profiles dir at /home/rramos/.dbt
    00:31:58  Using profiles.yml file at /home/rramos/.dbt/profiles.yml
    00:31:58  Using dbt_project.yml file at /home/rramos/Development/local/dbt/imdb/dbt_project.yml
    00:31:58  adapter type: clickhouse
    00:31:58  adapter version: 1.7.1
    00:31:58  Configuration:
    00:31:58    profiles.yml file [OK found and valid]
    00:31:58    dbt_project.yml file [OK found and valid]
    00:31:58  Required dependencies:
    00:31:58   - git [OK found]
    ...
    00:31:58  Registered adapter: clickhouse=1.7.1
    00:31:58    Connection test: [OK connection ok]
    

    If the connection test passed, you just need to build the model via dbt:

    dbt run
    

    You should see similar output:

    dbt run
    00:38:13  Running with dbt=1.7.6
    00:38:13  Registered adapter: clickhouse=1.7.1
    00:38:13  Unable to do partial parsing because a project config has changed
    00:38:15  Found 1 model, 6 sources, 0 exposures, 0 metrics, 421 macros, 0 groups, 0 semantic models
    00:38:15  
    00:38:15  Concurrency: 1 threads (target='dev')
    00:38:15  
    00:38:15  1 of 1 START sql view model `imdb`.`actor_summary` ............................. [RUN]
    00:38:15  1 of 1 OK created sql view model `imdb`.`actor_summary` ........................ [OK in 0.17s]
    00:38:15  
    00:38:15  Finished running 1 view model in 0 hours 0 minutes and 0.27 seconds (0.27s).
    00:38:15  
    00:38:15  Completed successfully
    00:38:15  
    00:38:15  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
    

    Test the model with a query:

    SELECT *
    FROM imdb_dbt.actor_summary
    WHERE num_movies > 5
    ORDER BY avg_rank DESC
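    dbt can also validate the model. A minimal sketch, assuming it is merged into models/actors/schema.yml, adds uniqueness and not-null tests on the id column, which then run with dbt test:

```yaml
version: 2

models:
  - name: actor_summary
    columns:
      - name: id
        tests:
          - unique
          - not_null
```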
    

    Conclusion

    In this article I went through the process of setting up a ClickHouse database and configuring dbt to build models over IMDB test data for actors, directors, movies, etc.

    These two systems work like a charm together. ClickHouse shows great performance for analytical queries, and dbt compiles and runs your analytics code against your data platform, enabling you and your team to collaborate on a single source of truth for metrics, insights, and business definitions.

    I would like to extend this exercise by incorporating GitHub Actions that run dbt tests before promoting to production, and by extending the model.

    But that will be for another time.

    Let me know if you bump into any issues or have improvement suggestions.

    References

    • https://clickhouse.com/docs/en/integrations/dbt
    • https://docs.getdbt.com/guides
    Blogs
