  • Announcements regarding our community

    0 Topics
    0 Posts
    Awesome, now that you've landed in the Databoost Community, you might be wondering: where should I begin? I can offer some suggestions, but it's entirely up to you:
    • Begin by completing your profile; choose "Edit Profile" from the right menu.
    • I recommend reading the Code of Conduct next.
    • Create your first post introducing yourself to the community in the New Members Thread.
    • Use the left navigation menu to easily explore our Community Forum topics.
    • If you come across something that's not working, or have a suggestion, please kick off a new thread in the Comments & Feedback section.
    Categories
    Let me provide some insight into the current categories to guide you in deciding where to begin your involvement:
    • Announcements: posts related to announcements for our community, with access limited to our Moderators group.
    • General Discussion: a place to talk about whatever you want.
    • Projects: share your projects and useful findings here.
    • Job Board: a space to post job openings or express availability for new challenges.
    • Learning: learning assets, courses, or reference articles.
    • Articles: member-contributed articles.
    • Comments & Feedback: got a question? Ask away!
    This list will likely expand with increased community engagement, but our Moderators will strive to maintain a clean and organized environment. If you would like to contribute a new category, just let me know; you can use the Chat section.
    Hope you enjoy your time here. Cheers
  • A place to talk about whatever you want

    1 Topics
    1 Posts
    Welcome to your brand new NodeBB forum! This is what a topic and post looks like. As an administrator, you can edit the post's title and content. To customise your forum, go to the Administrator Control Panel. You can modify all aspects of your forum there, including installation of third-party plugins.
    Additional Resources
    • NodeBB Documentation
    • Community Support Forum
    • Project repository
  • Got a question? Ask away!

    0 Topics
    0 Posts
    No new posts.
  • Share your projects and useful findings here.

    1 Topics
    1 Posts
    If you use SQLite you may know that it updates data at the row level. I've just bumped into a project that aims to bring column-oriented storage to SQLite and would like to share it with you. This is the project description:
    Stanchion
    Column-oriented tables in SQLite
    Why? Stanchion is a SQLite 3 extension that brings the power of column-oriented storage to SQLite, the most widely deployed database. SQLite exclusively supports row-oriented tables, which means it is not an ideal fit for all workloads. Using the Stanchion plugin brings all of the benefits of column-oriented storage and data warehousing to anywhere that SQLite is already deployed, including your existing tech stack.
    There are a number of situations where column-oriented storage outperforms row-oriented storage:
    • Storing and processing metric, log, and event data
    • Timeseries data storage and analysis
    • Analytical queries over many rows and a few columns (e.g. calculating the average temperature over months of hourly weather data)
    • Change tracking, history/temporal tables
    • Anchor modeling / Datomic-like data models
    Stanchion is an ideal fit for analytical queries and wide tables because it only scans data from the columns that are referenced by a given query. It uses compression techniques like run-length and bit-packed encodings that significantly reduce the size of stored data, greatly reducing the cost of large data sets. This makes it an ideal solution for storing large, expanding datasets.
    You can find more information on the official GitHub repo: https://github.com/dgllghr/stanchion
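    To make the row-oriented vs. column-oriented distinction concrete, here is a minimal Python sketch (not Stanchion itself; all table and column names are invented for illustration). It shows why an analytical query over one column is cheaper in a columnar layout, and how run-length encoding, one of the techniques mentioned above, exploits repeated values:

    ```python
    # Row-oriented layout: each record is stored together, so averaging
    # one column still walks every field of every row.
    rows = [
        {"ts": 1, "city": "Lisbon", "temp": 17.0},
        {"ts": 2, "city": "Lisbon", "temp": 19.5},
        {"ts": 3, "city": "Porto", "temp": 16.0},
    ]

    # Column-oriented layout: each column is a contiguous array, so the
    # same query reads only the "temp" array and ignores the rest.
    columns = {
        "ts": [1, 2, 3],
        "city": ["Lisbon", "Lisbon", "Porto"],
        "temp": [17.0, 19.5, 16.0],
    }

    avg_row_oriented = sum(r["temp"] for r in rows) / len(rows)
    avg_col_oriented = sum(columns["temp"]) / len(columns["temp"])
    assert avg_row_oriented == avg_col_oriented == 17.5

    def rle(values):
        """Run-length encode a list into [value, count] pairs."""
        out = []
        for v in values:
            if out and out[-1][0] == v:
                out[-1][1] += 1
            else:
                out.append([v, 1])
        return out

    # Repeated values in a column collapse into short runs.
    assert rle(columns["city"]) == [["Lisbon", 2], ["Porto", 1]]
    ```

    A real extension like Stanchion does this at the storage layer, so SQL queries get the benefit transparently.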
  • Blog posts from individual members

    0 Topics
    0 Posts
    Was reading this article where Philippe Rivière and Éric Mauvière optimized 200 GB of Parquet data down to 549 kB. This work touches on some very relevant points regarding Data Engineering procedures and best practices, so I would suggest reading the article, as it explains in detail what they applied at each stage and how.
    Use Case
    "This new fascinating dataset just dropped on Hugging Face. French public domain newspapers 🤗 references about 3 million newspapers and periodicals with their full text OCR’ed and some meta-data. The data is stored in 320 large parquet files. The data loader for this Observable framework project uses DuckDB to read these files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents —, into a single highly optimized parquet file."
    Undoubtedly, this dataset proves immensely valuable for training Large Language Models (LLMs).
    Best Practices
    I firmly believe these best practices should be applied not only to Parquet but also to other columnar formats. These are the key factors you should take into consideration:
    1. Select only the columns that you will use
    This is one of the simplest optimizations you can make. Remember that the data is stored in a columnar way, so picking only the columns that matter not only filters the data very quickly but also significantly reduces the volume read.
    2. Apply the most appropriate compression algorithm
    The majority of contemporary data formats support compression. When examining the most common algorithms for Parquet (such as LZO, Snappy, and Gzip), we observe several notable differences (ref: sheet). For instance, gzip cannot be split, which means that if you process the data with a distributed engine such as Spark, the driver must handle all the decompression. LZO strikes a better balance between speed and compression rate when compared to Snappy. In this specific case, I would also recommend exploring Brotli, as the dataset seems to contain mostly text. Choosing an effective algorithm is crucial.
    3. Sort the data
    While it may not seem immediately relevant, aligning the rows in this manner results in extended streaks of constant values across multiple columns, enhancing the compaction ratio achieved by the compression algorithm.
    Thoughts
    They took it a step further by implementing additional optimizations, such as increasing the row_group_size. What's crucial to highlight here is the significant gains achievable through the application of good engineering practices, resulting in faster and more cost-effective processes. It is also important to note that the resulting data isn't exactly the same as the source data, but it is the data required to train the model. DuckDB is also exceptionally fast for executing these types of processes. While I'm eager to test it out, unfortunately, I find myself short on both time and disk space!
    References
    • https://mastodon.social/@severo/111957633001467414
    • https://github.com/apache/parquet-format/blob/master/Compression.md
    • https://huggingface.co/spaces/observablehq/fpdn
    • https://dev.to/alexmercedcoder/parquet-file-compression-for-everyone-zstd-brotli-lz4-gzip-snappy-5gb8
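    Practice 3 is easy to demonstrate without Parquet at all. The following sketch uses Python's standard-library zlib (a DEFLATE compressor, in the same general-purpose family as gzip) on invented example data: the same values compress far better once sorted, because sorting creates long runs of constant values:

    ```python
    import random
    import zlib

    # Invented example data: 10,000 rows of a low-cardinality "city" column.
    random.seed(42)
    cities = [random.choice(["Lisbon", "Porto", "Braga", "Faro"]) for _ in range(10_000)]

    shuffled_bytes = "\n".join(cities).encode()
    sorted_bytes = "\n".join(sorted(cities)).encode()

    shuffled_size = len(zlib.compress(shuffled_bytes))
    sorted_size = len(zlib.compress(sorted_bytes))

    # Sorted input compresses far better: long runs of identical values
    # are exactly what DEFLATE-style compressors exploit.
    assert sorted_size < shuffled_size
    print(f"shuffled: {shuffled_size} bytes, sorted: {sorted_size} bytes")
    ```

    In Parquet the effect compounds, since each column chunk within a row group is compressed independently, so sorting by a low-cardinality column benefits that column and any columns correlated with it.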
  • A place to share job openings or availability for new challenges

    5 Topics
    5 Posts
    Hi, Tonic APP has the following open positions:
    • AI Ops
    • AI Product Manager
    • Gen AI Developer
    NOTE: I'm not affiliated with the company
  • Learning Assets, Courses or reference Articles

    1 Topics
    1 Posts
    Harvard has several online courses available for free in its catalog: https://pll.harvard.edu/catalog/free Make sure to check each course's registration page. Some examples:
    • https://pll.harvard.edu/course/cs50s-introduction-game-development
    • https://pll.harvard.edu/course/cs50-introduction-computer-science
    • https://pll.harvard.edu/course/cs50s-introduction-artificial-intelligence-python
    • https://pll.harvard.edu/course/cs50s-understanding-technology-0
    • https://pll.harvard.edu/course/cs50s-web-programming-python-and-javascript
    • https://pll.harvard.edu/course/fundamentals-tinyml
    • https://pll.harvard.edu/course/applications-tinyml
    • https://pll.harvard.edu/course/mlops-scaling-tinyml
    • https://pll.harvard.edu/course/cs50s-introduction-databases-sql
    • https://pll.harvard.edu/course/data-science-visualization
    • https://pll.harvard.edu/course/data-science-linear-regression
    • https://pll.harvard.edu/course/data-science-machine-learning
    • https://pll.harvard.edu/course/introduction-data-science-python
    • https://pll.harvard.edu/course/data-analysis-life-sciences-4-high-dimensional-data-analysis
    • https://pll.harvard.edu/course/data-science-wrangling