{
  "version": "https://jsonfeed.org/version/1",
  "title": "Ian's Digital Garden",
  "home_page_url": "https://ianwwagner.com/",
  "feed_url": "https://ianwwagner.com//tag-algorithms.json",
  "description": "",
  "items": [
    {
      "id": "https://ianwwagner.com//quadrupling-the-performance-of-a-data-pipeline.html",
      "url": "https://ianwwagner.com//quadrupling-the-performance-of-a-data-pipeline.html",
      "title": "Quadrupling the Performance of a Data Pipeline",
      "content_html": "<p>Over the past two weeks, I've been focused on optimizing some data pipelines.\nI inherited some old ones which seemed especially slow,\nand I finally hit a limit where an overhaul made sense.\nThe pipelines process and generate data on the order of hundreds of gigabytes,\nrequiring correlation and conflation across several datasets.</p>\n<p>The pipelines in question happened to be written in Node.js,\nwhich I will do my absolute best not to pick on too much throughout.\nNode is actually a perfectly fine solution for certain problems,\nbut was being used especially badly in this case.\nThe rewritten pipeline, using Rust, clocked in at 4x faster than the original.\nBut as we'll soon see, the choice of language wasn't even the main factor in the sluggishness.</p>\n<p>So, let's get into it...</p>\n<h1><a href=\"#problem-1-doing-cpu-bound-work-on-a-single-thread\" aria-hidden=\"true\" class=\"anchor\" id=\"problem-1-doing-cpu-bound-work-on-a-single-thread\"></a>Problem 1: Doing CPU-bound work on a single thread</h1>\n<p>Node.js made a splash in the early 2010s,\nand I can remember a few years where it was the hot new thing to write everything in.\nOne of the selling points was its ability to handle thousands (or tens of thousands)\nof connections with ease; all from JavaScript!\nThe key to this performance is <strong>async I/O</strong>.\nModern operating systems are insanely good at this, and Node made it <em>really</em> easy to tap into it.\nThis was novel to a lot of developers at the time, but it's pretty standard now\nfor building I/O-heavy apps.</p>\n<p><strong>Node performs well as long as you're dealing with I/O-bound workloads</strong>,\nbut the magic fades if your workload requires a lot of CPU work.\nBy default, Node is single-threaded.\nYou need to bring in <code>libuv</code>, worker threads (Node 10 or so), or something similar\nto access <em>parallel</em> processing from JavaScript.\nI've only seen a handful of Node programs use 
these,\nand the pipelines in question were not among them.</p>\n<h2><a href=\"#going-through-the-skeleton\" aria-hidden=\"true\" class=\"anchor\" id=\"going-through-the-skeleton\"></a>Going through the skeleton</h2>\n<p>If you ingest data files (CSV and the like) record-by-record in a naïve way,\nyou'll just read one record at a time, process, insert to the database, and so on in a loop.\nThe original pipeline code was fortunately not quite this bad (it did have batching at least),\nbut had some room for improvement.</p>\n<p>The ingestion phase, where you're just reading data from CSV, parquet, etc.,\nmaps naturally to Rust's <a href=\"https://rust-lang.github.io/async-book/05_streams/01_chapter.html\">streams</a>\n(the cousin of futures).\nThe original Node code was actually fine at this stage,\nif a bit less elegant.\nBut the Rust structure we settled on is worth a closer look.</p>\n<pre><code class=\"language-rust\">fn csv_record_stream&lt;'a, S: AsyncRead + Unpin + Send + 'a, T: TryFrom&lt;StringRecord&gt;&gt;(\n    stream: S,\n    delimiter: u8,\n) -&gt; impl Stream&lt;Item = T&gt; + 'a\nwhere\n    &lt;T as TryFrom&lt;StringRecord&gt;&gt;::Error: Debug,\n{\n    let reader = AsyncReaderBuilder::new()\n        .delimiter(delimiter)\n        // Other config elided...\n        .create_reader(stream);\n    reader.into_records().filter_map(|res| async move {\n        // NB: a `let Ok(record) = res else { ... }` can't reference the error\n        // in its else block (the value is moved), so we match explicitly.\n        let record = match res {\n            Ok(record) =&gt; record,\n            Err(e) =&gt; {\n                log::error!(&quot;Error reading from the record stream: {:?}&quot;, e);\n                return None;\n            }\n        };\n\n        match T::try_from(record) {\n            Ok(parsed) =&gt; Some(parsed),\n            Err(e) =&gt; {\n                log::error!(&quot;Error parsing record: {:?}.&quot;, e);\n                None\n            }\n        }\n    })\n}\n</code></pre>\n<p>It starts off dense, but the concept is simple.\nWe'll take an async reader,\nconfigure a CSV reader to pull records from it,\nand map them to another data type using <code>TryFrom</code>.\nIf 
there are any errors, we just drop them from the stream and log an error.\nThis usually isn't a reason to stop processing for our use case.</p>\n<p>You should <em>not</em> be putting expensive code in your <code>TryFrom</code> implementation.\nBut really quick things like verifying that you have the right number of fields,\nor that a field contains an integer or is non-blank are usually fair game.</p>\n<p>Rust's trait system really shines here.\nOur code can turn <em>any</em> CSV(-like) file\ninto an arbitrary record type.\nAnd the same techniques can apply to just about any other data format too.</p>\n<h2><a href=\"#how-to-use-tokio-for-cpu-bound-operations\" aria-hidden=\"true\" class=\"anchor\" id=\"how-to-use-tokio-for-cpu-bound-operations\"></a>How to use Tokio for CPU-bound operations?</h2>\n<p>Now that we've done the light format shifting and discarded some obviously invalid records,\nlet's turn to the heavier processing.</p>\n<pre><code class=\"language-rust\">let available_parallelism = std::thread::available_parallelism()?.get();\n// let record_pipeline = csv_record_stream(...);\nrecord_pipeline\n    .chunks(500)  // Batch the work (your optimal size may vary)\n    .for_each_concurrent(available_parallelism, |chunk| {\n        // Clone your database connection pool or whatnot before `move`.\n        // Every app is different, but this is a pretty common pattern\n        // for sqlx, Elasticsearch, hyper, and more, which use Arcs and cheap clones for pools.\n        let db_pool = db_pool.clone();\n        async move {\n            // Process your records using a blocking threadpool\n            let documents = tokio::task::spawn_blocking(move || {\n                // Do the heavy work here!\n                chunk\n                    .into_iter()\n                    .map(do_heavy_work)\n                    .collect()\n            })\n            .await\n            .expect(&quot;Problem spawning a blocking task&quot;);\n\n            // Insert processed 
data to your database\n            db_pool.bulk_insert(documents).await.expect(&quot;You probably need an error handling strategy here...&quot;);\n        }\n    })\n    .await;\n</code></pre>\n<p>We used the <a href=\"https://docs.rs/futures/latest/futures/stream/trait.StreamExt.html#method.chunks\"><code>chunks</code></a>\nadaptor to pull hundreds of items at a time for more efficient processing in batches.\nThen, we used <a href=\"https://docs.rs/futures/latest/futures/stream/trait.StreamExt.html#method.for_each_concurrent\"><code>for_each_concurrent</code></a>\nin conjunction with <a href=\"https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html\"><code>spawn_blocking</code></a>\nto introduce parallel processing.</p>\n<p>Note that neither <code>chunks</code> nor even <code>for_each_concurrent</code> implies any amount of <em>parallelism</em>\non its own.\n<code>spawn_blocking</code> is the only thing that can actually create a new thread of execution!\nChunking simply splits the work into batches (most workloads like this tend to benefit from batching).\nAnd <code>for_each_concurrent</code> allows for <em>concurrent</em> operations over multiple batches.\nBut <code>spawn_blocking</code> is what enables computation in a background thread.\nIf you don't use <code>spawn_blocking</code>,\nyou'll end up blocking Tokio's async workers,\nand your performance will tank.\nJust like the old Node.js code.</p>\n<p>The astute reader may point out that using <code>spawn_blocking</code> like this\nis not universally accepted as a solution.\nTokio is (relatively) optimized for non-blocking workloads, so some claim that you should avoid this pattern.\nBut my experience, having done this for 5+ years in production code serving over 2 billion requests/month,\nis that Tokio can be a great scheduler for heavier tasks too!</p>\n<p>One thing that's often overlooked in these discussions\nis that not all &quot;long-running operations&quot; are the same.\nOne category consists of 
graphics event loops,\nlong-running continuous computations,\nor other things that may not have an obvious &quot;end.&quot;\nBut some tasks <em>can</em> be expected to complete within some period of time\nthat's longer than a blink.</p>\n<p>In the case of the former (&quot;long-lived&quot; tasks), spawning a dedicated thread often makes sense.\nIn the latter scenario though, Tokio tasks with <code>spawn_blocking</code> can be a great choice.</p>\n<p>For our workload, we were doing a lot of the latter sort of operation.\nOne helpful rule of thumb I've seen is that if your task takes longer than tens of microseconds,\nyou should move it off the Tokio worker threads.\nUsing <code>chunks</code> and <code>spawn_blocking</code> avoids this death by a thousand cuts.\nIn our case, the parallelism resulted in a VERY clear speedup.</p>\n<h1><a href=\"#problem-2-premature-optimization-rather-than-backpressure\" aria-hidden=\"true\" class=\"anchor\" id=\"problem-2-premature-optimization-rather-than-backpressure\"></a>Problem 2: Premature optimization rather than backpressure</h1>\n<p>The original data pipeline was very careful not to overload the data store.\nPerhaps a bit too careful!\nThis may have been necessary at some point in the distant past,\nbut most data storage, from vanilla databases to multi-node clustered storage,\nhas some level of natural backpressure built-in.\nThe Node implementation was essentially limiting the amount of work in-flight that hadn't been flushed.</p>\n<p>This premature optimization and the numerous micro-pauses it introduced\nwere another death by a thousand cuts problem.\nDropping the artificial limits approximately doubled throughput.\nIt turned out that our database was able to process 2-4x more records than under the previous implementation.</p>\n<p><strong>TL;DR</strong> — set a reasonable concurrency, let the server tell you when it's chugging (usually via slower response times),\nand let your async runtime handle the 
rest!</p>\n<h1><a href=\"#problem-3-serde-round-trips\" aria-hidden=\"true\" class=\"anchor\" id=\"problem-3-serde-round-trips\"></a>Problem 3: Serde round-trips</h1>\n<p>Serde, or serialization + deserialization, can be a silent killer.\nAnd unless you're tracking things carefully, you often won't notice!</p>\n<p>I recently listened to <a href=\"https://www.recodingamerica.us/\">Recoding America</a> at the recommendation of a friend.\nOne of the anecdotes made me want to laugh and cry at the same time.\nEngineers had designed a major improvement to GPS, but the rollout was delayed\ndue to a performance problem that rendered it unusable.</p>\n<p>The project was overseen by Raytheon, a US government contractor.\nAnd they couldn't deliver because some arcane federal guidance (not even a regulation proper)\n&quot;recommends&quot; an &quot;Enterprise Service Bus&quot; in the architecture.\nThe startupper in me dies when I hear such things.\nThe &quot;recommendation&quot; boils down to a data exchange medium where one &quot;service&quot; writes data and another consumes it.\nThink message queues like you may have used before.</p>\n<p>This is fine (even necessary) for some applications,\nbut positively crippling for others.\nIn the case of the new positioning system,\nwhich was heavily dependent on timing,\nthis was a wildly inefficient architecture.\nEven worse, the guidelines stated that it should be encrypted.</p>\n<p>This wasn't even &quot;bad&quot; guidance, but in the context of the problem,\nwhich depended on rapid exchange of time-sensitive messages,\nit was a horrendously bad fit.</p>\n<p>In our data pipeline, I discovered a situation that, in retrospect, bears a humorous resemblance.\nThe pipeline was set up using a microservice architecture,\nwhich I'm sure sounded like a good idea at the time,\nbut it introduced some truly obscene overhead.\nAll services involved were capable of working with data in the same format,\nbut the Node.js implementation was split into multiple 
services with HTTP and JSON round trips in the middle!\nDouble whammy!</p>\n<p>The new data pipeline simply imports the &quot;service&quot; as a crate,\nand gets rid of all the overhead by keeping everything in-process.\nIf you really do need to have a microservice architecture (ex: to scale another service up independently),\nthen other communication + data exchange formats may improve your performance.\nBut if it's possible to keep everything in-process, your overhead is roughly zero.\nThat's hard to beat!</p>\n<h1><a href=\"#conclusion\" aria-hidden=\"true\" class=\"anchor\" id=\"conclusion\"></a>Conclusion</h1>\n<p>In the end, the new pipeline was 4x the speed of the old.\nI happened to rewrite it in Rust, but Rust itself wasn't the source of all the speedups:\nunderstanding the architecture was.\nYou could achieve similar results in Node.js or Python,\nbut Rust makes it significantly easier to reason about the architecture and correctness of your code.\nThis is especially important when it comes to parallelizing sections of a pipeline,\nwhere Rust's type system will save you from the most common mistakes.</p>\n<p>These and other non-performance-related reasons to use Rust will be the subject of a future blog post (or two).</p>\n",
      "summary": "",
      "date_published": "2024-11-29T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "algorithms",
        "rust",
        "elasticsearch",
        "nodejs",
        "data engineering",
        "gis"
      ],
      "language": "en"
    }
  ]
}