{
  "version": "https://jsonfeed.org/version/1",
  "title": "Ian's Digital Garden",
  "home_page_url": "https://ianwwagner.com/",
  "feed_url": "https://ianwwagner.com//tag-big-data.json",
  "description": "",
  "items": [
    {
      "id": "https://ianwwagner.com//conserving-memory-while-streaming-from-duckdb.html",
      "url": "https://ianwwagner.com//conserving-memory-while-streaming-from-duckdb.html",
      "title": "Conserving Memory while Streaming from DuckDB",
      "content_html": "<p>In the weeks since my previous post on <a href=\"working-with-arrow-and-duckdb-in-rust.html\">Working with Arrow and DuckDB in Rust</a>,\nI've found a few gripes that I'd like to address.</p>\n<h1><a href=\"#memory-usage-of-query_arrow-and-stream_arrow\" aria-hidden=\"true\" class=\"anchor\" id=\"memory-usage-of-query_arrow-and-stream_arrow\"></a>Memory usage of <code>query_arrow</code> and <code>stream_arrow</code></h1>\n<p>In the previous post, I used the <code>query_arrow</code> API.\nIt's pretty straightforward and gives you iterator-compatible access to the query results.\nHowever, there's one small problem: its memory consumption scales roughly linearly with your result set.</p>\n<p>This isn't a problem for many uses of DuckDB, but if your datasets are in the tens or hundreds of gigabytes\nand you're wanting to process a large number of rows, the RAM requirements can be excessive.\nThe memory profile of <code>query_arrow</code> seems to be &quot;create all of the <code>RecordBatch</code>es upfront\nand keep them around for as long as you hold the <code>Arrow</code> handle.</p>\n<div class=\"markdown-alert markdown-alert-note\">\n<p class=\"markdown-alert-title\">Disclaimer</p>\n<p>I have <strong>not</strong> done extensive allocation-level memory profiling as of this writing.\nIt's quite possible that I've missed something, but this seems to be what's happening\nfrom watching Activity Monitor.\nPlease let me know if I've misrepresented anything!</p>\n</div>\n<p>Fortunately, DuckDB also has another API: <a href=\"https://docs.rs/duckdb/latest/duckdb/struct.Statement.html#method.stream_arrow\"><code>stream_arrow</code></a>.\nThis appears to allocate <code>RecordBatch</code>es on demand rather than all at once.\nThere is also some overhead, which I'll revisit later that varies with result size.\nBut overall, profiling indicates that <code>stream_arrow</code> requires significantly less RAM over the life of a large <code>Arrow</code> 
iterator.</p>\n<p>Unfortunately, none of the above information about memory consumption appears to be documented,\nand there are no (serious) code samples demonstrating the use of <code>stream_arrow</code>!</p>\n<div class=\"markdown-alert markdown-alert-note\">\n<p class=\"markdown-alert-title\">Down the rabbit hole...</p>\n<p>Digging into the code in duckdb-rs raises even more questions,\nsince several underlying C functions, like <a href=\"https://duckdb.org/docs/api/c/api.html\"><code>duckdb_execute_prepared_streaming</code></a>,\nare marked as deprecated.\nPresumably, alternatives are being developed or the methods are just not stable yet.</p>\n</div>\n<h1><a href=\"#getting-a-schemaref\" aria-hidden=\"true\" class=\"anchor\" id=\"getting-a-schemaref\"></a>Getting a <code>SchemaRef</code></h1>\n<p>The signature of <code>stream_arrow</code> is a bit different from that of <code>query_arrow</code>.\nHere's what it looks like as of crate version 1.1.1:</p>\n<pre><code class=\"language-rust\">pub fn stream_arrow&lt;P: Params&gt;(\n    &amp;mut self,\n    params: P,\n    schema: SchemaRef,\n) -&gt; Result&lt;ArrowStream&lt;'_&gt;&gt;\n</code></pre>\n<p>This looks pretty familiar at first if you've used <code>query_arrow</code>,\nbut there's a new third parameter: <code>schema</code>.\n<code>SchemaRef</code> is just a type alias for <code>Arc&lt;Schema&gt;</code>.\nArrow objects have a schema associated with them,\nso this is a reasonable detail for a low-level API.\nBut DuckDB is perfectly capable of inferring this when needed!\nSurely there is a way of getting it from a query, right?\n(After all, <code>query_arrow</code> has to do something similar, but doesn't burden the caller.)</p>\n<p>My first attempt at getting a <code>Schema</code> object was to call the <a href=\"https://docs.rs/duckdb/latest/duckdb/struct.Statement.html#method.schema\"><code>schema()</code></a> method on <code>Statement</code>.\nThe <code>Statement</code> type in duckdb-rs is actually a high-level wrapper around <code>RawStatement</code>,\nand at the time of 
this writing, the schema getter <a href=\"https://github.com/duckdb/duckdb-rs/blob/2bd811e7b1b7398c4f461de4de263e629572dc90/crates/duckdb/src/raw_statement.rs#L212\">hides an <code>unwrap</code></a>.\nThe docs do tell you this (using a somewhat nonstandard heading?),\nbut basically you can't get a schema without executing a query.\nI wish they used the <a href=\"https://cliffle.com/blog/rust-typestate/\">Typestate pattern</a>\nor at least made the result an <code>Option</code>, but alas...</p>\n<p>This leaves developers with three options.</p>\n<ol>\n<li>Construct the schema manually.</li>\n<li>Construct a different <code>Statement</code> from the same SQL, but with a <code>LIMIT 0</code> clause at the end.</li>\n<li>Execute the statement, but don't load all the results into RAM.</li>\n</ol>\n<h2><a href=\"#manually-construct-a-schema\" aria-hidden=\"true\" class=\"anchor\" id=\"manually-construct-a-schema\"></a>Manually construct a Schema?</h2>\n<p>Manually constructing the schema is a non-starter for me.\nHand-writing a schema that has to be kept in sync with a SQL string is a terrible idea\non several levels.\nBesides, DuckDB clearly <em>can</em> infer the schema in <code>query_arrow</code>, so why not here?</p>\n<h2><a href=\"#query-another-nearly-identical-statement\" aria-hidden=\"true\" class=\"anchor\" id=\"query-another-nearly-identical-statement\"></a>Query another, nearly identical statement</h2>\n<p>The second idea is, amusingly, what ChatGPT o1 suggested (after half a dozen prompts;\nit seems like it will just confidently refuse to fetch documentation now,\nand hallucinates new APIs based on its outdated training data).\nThe basic idea is to add <code>LIMIT 0</code> to the end of the original query\nso it's able to get the schema, but doesn't actually return any results.</p>\n<pre><code class=\"language-rust\">fn fetch_schema_for_query(db: &amp;Connection, sql: &amp;str) -&gt; duckdb::Result&lt;SchemaRef&gt; {\n    // Append &quot;LIMIT 0&quot; to 
the original query, so we don't actually fetch anything\n    // NB: This does NOT handle cases such as the original query ending in a semicolon!\n    let schema_sql = format!(&quot;{} LIMIT 0&quot;, sql);\n\n    let mut statement = db.prepare(&amp;schema_sql)?;\n    let arrow_result = statement.query_arrow([])?;\n\n    Ok(arrow_result.get_schema())\n}\n</code></pre>\n<p>There is nothing fundamentally unsound about this approach.\nBut it requires string manipulation, which is less than ideal.\nThere is also at least one obvious edge case.</p>\n<h2><a href=\"#execute-the-statement-without-loading-all-results-first\" aria-hidden=\"true\" class=\"anchor\" id=\"execute-the-statement-without-loading-all-results-first\"></a>Execute the statement without loading all results first</h2>\n<p>The third option is not as straightforward as I expected it to be.\nAt first, I tried the <code>row_count</code> method,\nbut internally this <a href=\"https://github.com/duckdb/duckdb-rs/blob/2bd811e7b1b7398c4f461de4de263e629572dc90/crates/duckdb/src/raw_statement.rs#L79\">just calls a single FFI function</a>.\nThis doesn't actually update the internal <code>schema</code> field.\nYou really <em>do</em> need to run through a more &quot;normal&quot; execution path.</p>\n<p>A solution that <em>seems</em> reasonably clean is to do what the docs say and call <code>stmt.execute()</code>.\nIt's a bit strange to do this on a <code>SELECT</code> query to be honest,\nbut the API does indeed internally mutate the <code>Schema</code> property,\n<em>and</em> returns a row count.\nSo it seems semantically equivalent to a <code>SELECT COUNT(*) FROM (...)</code>\n(and in my case, getting the row count was helpful too).</p>\n<p>In my testing, it <em>appears</em> that this may actually allocate a non-trivial amount of memory,\nwhich may be mildly surprising.\nHowever, the peak memory required during execution is definitely lower overall.\nAny ideas why this is?</p>\n<h1><a 
href=\"#full-example-using-stream_arrow\" aria-hidden=\"true\" class=\"anchor\" id=\"full-example-using-stream_arrow\"></a>Full example using <code>stream_arrow</code></h1>\n<p>Let's bring what we've learned into a &quot;real&quot; example.</p>\n<pre><code class=\"language-rust\">// let sql = &quot;SELECT * FROM table;&quot;;\nlet mut stmt = conn.prepare(sql)?;\n// Execute the query (so we have a usable schema)\nlet size = stmt.execute([])?;\n// Now we run the &quot;real&quot; query using `stream_arrow`.\n// This returned in a few hundred milliseconds for my dataset.\nlet mut arrow = stmt.stream_arrow([], stmt.schema())?;\n// Iterate over arrow...\n</code></pre>\n<p>When you structure your code like this rather than using the easier <code>query_arrow</code>,\nyou can significantly reduce your memory footprint for large datasets.\nIn my testing, there was no appreciable impact on performance.</p>\n<h1><a href=\"#open-questions\" aria-hidden=\"true\" class=\"anchor\" id=\"open-questions\"></a>Open Questions</h1>\n<p>The above leaves me with a few open questions.\nFirst, with my use case (a dataset of around 12GB of Parquet files), <code>execute</code> took several <em>seconds</em>.\nThe &quot;real&quot; <code>stream_arrow</code> query took a few hundred milliseconds.\nWhat's going on here?\nPerhaps it's doing a scan and/or caching some data initially the way to make subsequent queries faster?</p>\n<p>Additionally, the memory profile does have a &quot;spike&quot; which makes me wonder what exactly each step loads into RAM,\nand thus, the memory requirements for working with extremely large datasets.\nIn my testing, adding a <code>WHERE</code> clause that significantly reduces the result set\nDOES reduce the memory footprint.\nThat's somewhat worrying to me, since it implies there is still measurable overhead\nproportional to the dataset size.\nWhat practical limits does this impose on dataset size?</p>\n<div class=\"markdown-alert markdown-alert-note\">\n<p 
class=\"markdown-alert-title\">Note</p>\n<p>An astute reader may be asking whether the memory profile of the <code>LIMIT 0</code> and <code>execute</code> approaches are equivalent.\nThe answer appears to be yes.</p>\n</div>\n<p>I've <a href=\"https://github.com/duckdb/duckdb-rs/issues/418\">opened issue #418</a>\nasking for clarification.\nIf any readers have any insights, post them in the issue thread!</p>\n",
      "summary": "",
      "date_published": "2024-12-31T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "rust",
        "apache arrow",
        "parquet",
        "duckdb",
        "big data",
        "data engineering"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//how-and-why-to-work-with-arrow-and-duckdb-in-rust.html",
      "url": "https://ianwwagner.com//how-and-why-to-work-with-arrow-and-duckdb-in-rust.html",
      "title": "How (and why) to work with Arrow and DuckDB in Rust",
      "content_html": "<p>My day job involves wrangling a lot of data very fast.\nI've heard a lot of people raving about several technologies like DuckDB,\n(Geo)Parquet, and Apache Arrow recently.\nBut despite being an &quot;early adopter,&quot;\nit took me quite a while to figure out how and why to leverage these practiclaly.</p>\n<p>Last week, a few things &quot;clicked&quot; for me, so I'd like to share what I learned in case it helps you.</p>\n<h1><a href=\"#geoparquet\" aria-hidden=\"true\" class=\"anchor\" id=\"geoparquet\"></a>(Geo)Parquet</h1>\n<p>(Geo)Parquet is quite possibly the best understood tech in the mix.\nIt is not exactly new.\nParquet has been around for quite a while in the big data ecosystem.\nIf you need a refresher, the <a href=\"https://guide.cloudnativegeo.org/geoparquet/\">Cloud-optimized Geospatial Formats Guide</a>\ngives a great high-level overview.</p>\n<p>Here are the stand-out features:</p>\n<ul>\n<li>It has a schema and some data types, unlike CSV (you can even have maps and lists!).</li>\n<li>On disk, values are written in groups per <em>column</em>, rather than writing one row at a time.\nThis makes the data much easier to compress, and lets readers easily skip over data they don't need.</li>\n<li>Statistics at several levels which enable &quot;predicate pushdown.&quot; Even though the files are columnar in nature,\nyou can narrow which files and &quot;row groups&quot; within each file have the data you need!</li>\n</ul>\n<p>Practically speaking, parquet lets you can distribute large datasets in <em>one or more</em> files\nwhich will be significantly <em>smaller and faster to query</em> than other familiar formats.</p>\n<h2><a href=\"#why-you-should-care\" aria-hidden=\"true\" class=\"anchor\" id=\"why-you-should-care\"></a>Why you should care</h2>\n<p>The value proposition is clear for big data processing.\nIf you're trying to get a record of all traffic accidents in California,\nor find the hottest restaurants in Paris based 
on a multi-terabyte dataset,\nParquet provides clear advantages.\nYou can skip row groups within each Parquet file or even whole files\nto narrow your search!\nAnd since datasets can be split across files,\nyou can keep adding to the dataset over time, parallelize queries,\nand do other nice things.</p>\n<p>But what if you're not doing these high-level analytical things?\nWhy not just use a more straightforward format like CSV\nthat avoids the need to &quot;rotate&quot; back into rows\nfor non-aggregation use cases?\nHere are a few reasons to like Parquet:</p>\n<ul>\n<li>You actually have a schema! This means less format shifting and validation in your code.</li>\n<li>Operating on row groups turns out to be pretty efficient, even when you're reading the whole dataset.\nBy combining batch reads with compression, your processing code will usually get faster.</li>\n<li>It's designed to be readable from object storage.\nThis means you can often process massive datasets from your laptop.\nParquet readers are smart and can skip over data you don't need.\nYou can't do this with CSV.</li>\n</ul>\n<p>The upshot of all this is that it generally gets both <em>easier</em> and <em>faster</em>\nto work with your data...\nprovided that you have the right tools to leverage it.</p>\n<h1><a href=\"#duckdb\" aria-hidden=\"true\" class=\"anchor\" id=\"duckdb\"></a>DuckDB</h1>\n<p>DuckDB describes itself as an in-process, portable, feature-rich, and fast database\nfor analytical workloads.\nDuckDB was the tool that triggered my &quot;lightbulb moment&quot; last week.\nFoursquare, an app which I've used for a decade or more,\nrecently released an <a href=\"https://location.foursquare.com/resources/blog/products/foursquare-open-source-places-a-new-foundational-dataset-for-the-geospatial-community/\">open data set</a>,\nwhich was pretty cool!\nIt was also in Parquet format (just like <a href=\"https://overturemaps.org/\">Overture</a>'s data sets).</p>\n<p>You can't just open up a Parquet file 
in a text editor or spreadsheet software like you can a CSV.\nMy friend Oliver released a <a href=\"https://wipfli.github.io/foursquare-os-places-pmtiles/\">web-based demo</a>\na few weeks ago which lets you inspect the data on a map at the point level.\nBut to do more than spot checking, you'll probably want a database that can work with Parquet.\nAnd that's where DuckDB comes in.</p>\n<h2><a href=\"#why-you-should-care-1\" aria-hidden=\"true\" class=\"anchor\" id=\"why-you-should-care-1\"></a>Why you should care</h2>\n<h3><a href=\"#its-embedded\" aria-hidden=\"true\" class=\"anchor\" id=\"its-embedded\"></a>It's embedded</h3>\n<p>I understood the in-process part of DuckDB's value proposition right away.\nIt's similar to SQLite, where you don't have to go through a server\nor over an HTTP connection.\nThis is both simpler to reason about and <a href=\"quadrupling-the-performance-of-a-data-pipeline.html\">usually quite a bit faster</a>\nthan having to call out to a separate service!</p>\n<p>DuckDB is pretty quick to compile from source.\nYou probably don't need to muck around with this if you're just using the CLI,\nbut I wanted to eventually use it embedded in some Rust code.\nCompiling from source turned out to be the easiest way to get their crate working.\nIt looks for a shared library by default, but I couldn't get this working after a <code>brew</code> install.\nThis was mildly annoying, but on the other hand,\nvendoring the library does make consistent Docker builds easier 🤷🏻‍♂️</p>\n<h3><a href=\"#features-galore\" aria-hidden=\"true\" class=\"anchor\" id=\"features-galore\"></a>Features galore!</h3>\n<p>DuckDB includes a mind-boggling number of features.\nNot in a confusing way; more in a Python stdlib way where just about everything you'd want is already there.\nYou can query a whole directory (or bucket) of CSV files,\na Postgres database, SQLite, or even an OpenStreetMap PBF file 🤯\nYou can even write a SQL query against a glob expression of Parquet 
files in S3\nas your &quot;table.&quot;\n<strong>That's really cool!</strong>\n(If you've been around the space, you may recognize this concept from\nAWS Athena and others.)</p>\n<h3><a href=\"#speed\" aria-hidden=\"true\" class=\"anchor\" id=\"speed\"></a>Speed</h3>\n<p>Writing a query against a local directory of files is actually really fast!\nIt does a bit of munging upfront, and yes,\nit's not quite as fast as if you'd prepped the data into a clean table,\nbut you actually can run quite efficient queries this way locally!</p>\n<p>When running a query against local data,\nDuckDB will make liberal use of your system memory\n(the default is 80% of system RAM)\nand as many CPUs as you can throw at it.\nBut it will reward you with excellent response times,\ncourtesy of the &quot;vectorized&quot; query engine.\nWhat I've heard of the design reminds me of how array-oriented programming languages like APL\n(or less esoteric libraries like numpy) are often implemented.</p>\n<p>I was able to do some spatial aggregation operations\n(bucketing a filtered list of locations by H3 index)\nin about <strong>10 seconds on a dataset of more than 40 million rows</strong>!\n(The full dataset is over 100 million rows, so I also got to see the selective reading in action.)\nThat piqued my interest, to say the least.\n(Here's the result of that query, visualized.)</p>\n<p><figure><img src=\"media/foursquare-os-places-density-2024.png\" alt=\"A map of the world showing heavy density in the US, southern Canada, central Mexico, parts of coastal South America, Europe, Korea, Japan, parts of SE Asia, and Australia\" /></figure></p>\n<h3><a href=\"#that-analytical-thing\" aria-hidden=\"true\" class=\"anchor\" id=\"that-analytical-thing\"></a>That analytical thing...</h3>\n<p>And now for the final buzzword in DuckDB's marketing: analytical.\nDuckDB frequently describes itself as optimized for OLAP (OnLine Analytical Processing) workloads.\nThis is contrasted with OLTP (OnLine Transaction 
Processing).\n<a href=\"https://en.wikipedia.org/wiki/Online_analytical_processing\">Wikipedia</a> will tell you some differences\nin a lot of sweepingly broad terms, like being used for &quot;business reporting&quot; and read operations\nrather than &quot;transactions.&quot;</p>\n<p>When reaching for a definition, many sources focus on things like <em>aggregation</em> queries\nas a differentiator.\nThis didn't help, since most of my use cases involve slurping most or all of the data set.\nThe DuckDB marketing and docs didn't help clarify things either.</p>\n<p>Let me know on Mastodon if you have a better explanation of what an &quot;analytical&quot; database is 🤣</p>\n<p>I think a better explanation is probably 1) you do mostly <em>read</em> queries,\nand 2) it can execute highly parallel queries.\nSo far, DuckDB has been excellent for both the &quot;aggregate&quot; and the &quot;iterative&quot; use case.\nI assume it's just not the best choice per se if your workload is a lot of single-record writes?</p>\n<h2><a href=\"#how-im-using-duckdb\" aria-hidden=\"true\" class=\"anchor\" id=\"how-im-using-duckdb\"></a>How I'm using DuckDB</h2>\n<p>Embedding DuckDB in a Rust project allowed me to deliver something with a better end-user experience\nthat is easier to maintain,\nand it saved writing hundreds of lines of code in the process.</p>\n<p>Most general-purpose languages like Python and Rust\ndon't have primitives for expressing things like joins across datasets.\nDuckDB, like most database systems, does!\nYes, I <em>could</em> write some code using the <code>parquet</code> crate\nthat would filter across a nested directory tree of 5,000 files.\nBut DuckDB does that out of the box!</p>\n<p>It feels like this is a &quot;regex moment&quot; for data processing.\nJust like you don't (usually) need to hand-roll string processing,\nthere's now little reason to hand-roll data aggregation.</p>\n<p>For the above visualization, I used the Rust DuckDB crate for the data 
processing,\nconverted the results to JSON,\nand served it up from an Axum web server.\nAll in a <em>single binary</em>!\nThat's a lot nicer than a bash script that executes SQL,\ndumps to a file, and then starts up a Python or Node web server!\nOne that breaks when you don't have Python or Node installed,\nyour OS changes its default shell,\nyou forget that some awk flag doesn't work on the GNU version,\nand so on.</p>\n<h1><a href=\"#apache-arrow\" aria-hidden=\"true\" class=\"anchor\" id=\"apache-arrow\"></a>Apache Arrow</h1>\n<p>The final thing I want to touch on is <a href=\"https://arrow.apache.org/\">Apache Arrow</a>.\nThis is yet another incredibly useful technology which I've been following for a while,\nbut never quite figured out how to properly use until last week.</p>\n<p>Arrow is a <em>language-independent memory format</em>\nthat's <em>optimized for efficient analytic operations</em> on modern CPUs and GPUs.\nThe core idea is that, rather than having to convert data from one format to another (this implies copying!),\nArrow defines a shared memory format which many systems understand.\nIn practice, this ends up being a bunch of standards which define common representations for different types,\nand libraries for working with them.\nFor example, the <a href=\"https://geoarrow.org/\">GeoArrow</a> spec\nbuilds on the Arrow ecosystem to enable operations on spatial data in a common memory format.\nPretty cool!</p>\n<h2><a href=\"#why-you-should-care-2\" aria-hidden=\"true\" class=\"anchor\" id=\"why-you-should-care-2\"></a>Why you should care</h2>\n<p>It turns out that copying and format shifting data can really eat into your processing times.\nArrow helps you sidestep that by reducing the amount of both you'll need to do,\nand by working on data in groups.</p>\n<h2><a href=\"#how-the-heck-to-use-it\" aria-hidden=\"true\" class=\"anchor\" id=\"how-the-heck-to-use-it\"></a>How the heck to use it?</h2>\n<p>Arrow is mostly hidden from view beneath other 
libraries.\nSo most of the time, especially if you're writing in a very high-level language like Python,\nyou won't even see it.</p>\n<p>But if you're writing something at a slightly lower level,\nit's something you may have to touch for critical sections.\nThe <a href=\"https://docs.rs/duckdb/latest/duckdb/\">DuckDB crate</a>\nincludes an <a href=\"https://docs.rs/duckdb/latest/duckdb/struct.Statement.html#method.query_arrow\">Arrow API</a>\nwhich will give you an iterator over <code>RecordBatch</code>es.\nThis is pretty convenient, since you can use DuckDB to gather all your data\nand just consume the stream of batches!</p>\n<p>So, how do we work with <code>RecordBatch</code>es?\nThe Arrow ecosystem, like Parquet, takes a lot of work to understand,\nand using the low-level libraries directly is difficult.\nEven as a seasoned Rustacean, I found the docs rather obtuse.</p>\n<p>After some searching, I finally found <a href=\"https://docs.rs/serde_arrow/\"><code>serde_arrow</code></a>.\nIt builds on the <code>serde</code> ecosystem with easy-to-use methods that operate on <code>RecordBatch</code>es.\nFinally, something I can use!</p>\n<p>I was initially worried about how performant the shift from columns to rows plus any (minimal) <code>serde</code> overhead would be,\nbut this turned out not to be an issue.</p>\n<p>Here's how the code looks:</p>\n<pre><code class=\"language-rust\">serde_arrow::from_record_batch::&lt;Vec&lt;FoursquarePlaceRecord&gt;&gt;(&amp;batch)\n</code></pre>\n<p>A few combinators later and you've got a proper data pipeline!</p>\n<h1><a href=\"#review-what-this-enables\" aria-hidden=\"true\" class=\"anchor\" id=\"review-what-this-enables\"></a>Review: what this enables</h1>\n<p>What this ultimately enabled for me was being able to get a lot closer to &quot;scripting&quot;\na pipeline in Rust.\nMost people turn to Python or JavaScript for tasks like this,\nbut Rust has something to add: strong typing and all the related guarantees <em>which can only 
come with some level of formalism</em>.\nBut that doesn't necessarily have to get in the way of productivity!</p>\n<p>Hopefully this sparks some ideas for making your next data pipeline both fast and correct.</p>\n",
      "summary": "",
      "date_published": "2024-12-08T00:00:00-00:00",
      "image": "media/foursquare-os-places-density-2024.png",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "rust",
        "apache arrow",
        "parquet",
        "duckdb",
        "big data",
        "data engineering",
        "gis"
      ],
      "language": "en"
    }
  ]
}