{
  "version": "https://jsonfeed.org/version/1",
  "title": "Ian's Digital Garden",
  "home_page_url": "https://ianwwagner.com/",
  "feed_url": "https://ianwwagner.com//tag-databases.json",
  "description": "",
  "items": [
    {
      "id": "https://ianwwagner.com//databases-as-an-alternative-to-application-logging.html",
      "url": "https://ianwwagner.com//databases-as-an-alternative-to-application-logging.html",
      "title": "Databases as an Alternative to Application Logging",
      "content_html": "<p>In my <a href=\"https://stadiamaps.com/\">work</a>, I've been doing a lot of ETL pipeline design recently for our geocoding system.\nThe system processes on the order of a billion records per job,\nand failures are part of the process.\nWe want to log these.</p>\n<p>Most applications start by dumping logs to <code>stderr</code>,\nat least until they overflow their terminal scrollback buffer.\nThe next step is usually text files.\nBut getting insights from 10k+ lines of text with <code>grep</code> is a chore.\nIt may even be impossible unless you've taken extra care with how your logs are formatted.</p>\n<p>In this post we'll explore some approaches to doing application logging better.</p>\n<h1><a href=\"#structured-logging\" aria-hidden=\"true\" class=\"anchor\" id=\"structured-logging\"></a>Structured logging</h1>\n<p>My first introduction to logs with a structural element was probably Logcat for Android.\nLogcat lets you filter the fire hose of Android logs down to a specific application,\nand can even refine the scope further if you learn how to use it.\nLogcat is a useful tool, but fundamentally all it can do is <em>filter</em> logs from a stream,\nand it has most of the same drawbacks as grepping plain text files.</p>\n<p>Larger systems often benefit from something like the <code>tracing</code> crate,\nwhich integrates with services like <code>journald</code> and Grafana Loki.\nThis is a great fit for a long-running <em>service</em>,\nbut is total overkill for an application that does some important stuff ™\nand exits.\nLike our ETL pipeline example.</p>\n<p>(Aside: I have a love/hate relationship with <code>journalctl</code>.\nI mostly interact with it through Ctrl+R in my shell history,\nwhich is problematic when connecting to a new server.\nBut it does have the benefit of being a nearly ubiquitous local structured logging system!)</p>\n<h1><a href=\"#databases-for-application-logs\" aria-hidden=\"true\" class=\"anchor\" 
id=\"databases-for-application-logs\"></a>Databases for application logs</h1>\n<p>Using a database as an application log can be a brilliant level up for many applications\nbecause you can actually <em>query</em> your logs with ease.\nI'll give a few examples, and then show some crazy cool stuff you can do with that.</p>\n<p>One type of failure we frequently encounter is metadata that looks like a URL where it shouldn't be.\nFor example, the name of a shop being <code>http://spam.example.com/</code>,\nor having a URL in an address or phone number field.\nIn this case, we usually drop the record, but we also want to log it so we can clean up the source data.\nSome other common failures are missing required fields, data in the wrong format, and the like.</p>\n<h2><a href=\"#a-good-schema-enables-analytics\" aria-hidden=\"true\" class=\"anchor\" id=\"a-good-schema-enables-analytics\"></a>A good schema enables analytics</h2>\n<p>Rather than logging these to <code>stderr</code> or some plain text files, we write to a DuckDB database.\nThis has a few benefits beyond the obvious.\nFirst, using a database forces you to come up with a schema.\nAnd just like using a language with types, this forces you to clarify your thinking a bit upfront.\nIn our case, we log things like the original data source, an ID, a log level (warn, error, info, etc.),\na failure code, and additional details.</p>\n<p>From here, we can do meaningful <em>analytical</em> queries like\n&quot;how many records were dropped due to invalid geographic coordinates&quot;\nor &quot;how many records were rejected due to metadata mismatches&quot;\n(ex: claiming to be a US address but appearing in North Korea).</p>\n<h2><a href=\"#cross-dataset-joins-anyone\" aria-hidden=\"true\" class=\"anchor\" id=\"cross-dataset-joins-anyone\"></a>Cross-dataset joins, anyone?</h2>\n<p>If this query uncovers a lot of rejected records from one data source,\nwouldn't it be nice if we could look at a sample?\nWe have the IDs right 
there in the log, and the data source identifier, after all.\nBut since we're in DuckDB rather than a plain text file,\nwe can pretty much effortlessly join on the data files!\n(This assumes that your data is in some halfway sane format like JSON, CSV, Parquet, or even another database.)</p>\n<p>We can even take this one step further and compare logs across imports!\nWhat's up with that spike in errors compared to last month's release from that data source?</p>\n<p>These are the sort of insights that are almost trivial to uncover when your log is a database.</p>\n<h1><a href=\"#practical-bits\" aria-hidden=\"true\" class=\"anchor\" id=\"practical-bits\"></a>Practical bits</h1>\n<p>Now that I've described all the awesome things you can do,\nlet's get down to the practical questions, like how you'd do this in your app.\nMy goals for the code were to make it easy to use and impossible to get wrong at the use site.\nFortunately that's pretty easy in Rust!</p>\n<pre><code class=\"language-rust\">#[derive(Clone)]\npub struct ImportLogger {\n    pool: Pool&lt;DuckdbConnectionManager&gt;,\n    // Implementation detail for our case: we have multiple ETL importers that share code AND logs.\n    // If you have any such attributes that will remain fixed over the life of a logger instance,\n    // consider storing them as struct fields so each event is easier to log.\n    importer_name: String,\n}\n</code></pre>\n<p>Pretty standard struct setup using DuckDB and <a href=\"https://github.com/sfackler/r2d2\"><code>r2d2</code></a> for connection pooling.\nWe put this in a shared logging crate in a workspace containing multiple importers.\nThe <code>importer_name</code> is a field that will get emitted with every log,\nand doesn't change for a logger instance.\nIf your logging has any such attributes (ex: a component name),\nstoring them as struct fields makes each log invocation easier!</p>\n<div class=\"markdown-alert markdown-alert-note\">\n<p 
class=\"markdown-alert-title\">Note</p>\n<p>At the time of this writing, I couldn't find any async connection pool integrations for DuckDB.\nIf anyone knows of one (or wants to add it to <a href=\"https://github.com/djc/bb8\"><code>bb8</code></a>), let me know!</p>\n</div>\n<pre><code class=\"language-rust\">pub fn new(config: ImportLogConfig, importer_name: String) -&gt; anyhow::Result&lt;ImportLogger&gt; {\n    let manager = DuckdbConnectionManager::file(config.import_log_path)?;\n    let pool = Pool::new(manager)?;\n\n    pool.get()?.execute_batch(include_str!(&quot;schema.sql&quot;))?;\n\n    Ok(Self {\n        pool,\n        importer_name,\n    })\n}\n</code></pre>\n<p>The constructor isn't anything special; it sets up a DuckDB connection to a file-backed database\nbased on our configuration.\nIt also initializes the schema from a file.\nThe schema file lives in the source tree, but the lovely <a href=\"https://doc.rust-lang.org/std/macro.include_str.html\"><code>include_str!</code></a>\nmacro bakes it into a static string at compile time (so we can still distribute a single binary).</p>\n<pre><code class=\"language-rust\">pub fn log(&amp;self, level: Level, source: &amp;str, id: Option&lt;&amp;str&gt;, code: &amp;str, reason: &amp;str) {\n    log::log!(level, &quot;{code}\\t{source}\\t{id:?}\\t{reason}&quot;);\n    let conn = match self.pool.get() {\n        Ok(conn) =&gt; conn,\n        Err(e) =&gt; {\n            log::error!(&quot;failed to get connection: {}&quot;, e);\n            return;\n        }\n    };\n    match conn.execute(\n        &quot;INSERT INTO logs VALUES (current_timestamp, ?, ?, ?, ?, ?, ?)&quot;,\n        params![level.as_str(), self.importer_name, source, id, code, reason],\n    ) {\n        Ok(_) =&gt; (),\n        Err(e) =&gt; log::error!(&quot;Failed to insert log entry: {}&quot;, e),\n    }\n}\n</code></pre>\n<p>And now the meat of the logging!\nThe <code>log</code> method does what you'd expect.\nThe signature is a reflection of 
the schema:\nwhat you need to log, what you may optionally log, and what type of data you're logging.</p>\n<p>For our use case, we decided to additionally log via the <code>log</code> crate.\nThis way, we can see critical errors on the console as the job is running.</p>\n<p>And that's pretty much it!\nIt took significantly more time to write this post than to actually write the code.\nSomeone could probably write a macro-based crate to generate these sorts of loggers if they had some spare time ;)</p>\n<h2><a href=\"#bonus-filter_log\" aria-hidden=\"true\" class=\"anchor\" id=\"bonus-filter_log\"></a>Bonus: <code>filter_log</code></h2>\n<p>We have a pretty common pattern in our codebase,\nwhere most operations / pipeline stages yield results,\nand we want to chain these together.\nWhen a stage succeeds, we pass the result on to the next stage.\nOtherwise, we want to log what went wrong.</p>\n<p>We called this <code>filter_log</code> because it usually shows up in <code>filter_map</code> over streams\nand as such yields an <code>Option&lt;T&gt;</code>.</p>\n<p>This was extremely easy to add to our logging struct,\nand saves loads of boilerplate!</p>\n<pre><code class=\"language-rust\">/// Converts a result to an option, logging the failure if the result is an `Err` variant.\npub fn filter_log&lt;T, E: Debug&gt;(\n    &amp;self,\n    level: Level,\n    source: &amp;str,\n    id: Option&lt;&amp;str&gt;,\n    code: &amp;str,\n    result: Result&lt;T, E&gt;,\n) -&gt; Option&lt;T&gt; {\n    match result {\n        Ok(result) =&gt; Some(result),\n        Err(err) =&gt; {\n            self.log(level, source, id, code, &amp;format!(&quot;{:?}&quot;, err));\n            None\n        }\n    }\n}\n</code></pre>\n<h1><a href=\"#conclusion\" aria-hidden=\"true\" class=\"anchor\" id=\"conclusion\"></a>Conclusion</h1>\n<p>The concept of logging to a database is not at all original with me.\nMany enterprise services log extensively to special database tables.\nBut I think the 
technique is rarely applied to applications.</p>\n<p>Hopefully this post convinced you to give it a try in the next situation where it makes sense.</p>\n",
      "summary": "",
      "date_published": "2025-01-13T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "software-engineering",
        "duckdb",
        "databases",
        "rust"
      ],
      "language": "en"
    }
  ]
}