{
  "version": "https://jsonfeed.org/version/1",
  "title": "Ian's Digital Garden",
  "home_page_url": "https://ianwwagner.com/",
  "feed_url": "https://ianwwagner.com//tag-unicode.json",
  "description": "",
  "items": [
    {
      "id": "https://ianwwagner.com//unzip-utf-8-docker-and-c-locales.html",
      "url": "https://ianwwagner.com//unzip-utf-8-docker-and-c-locales.html",
      "title": "Unzip, UTF-8, Docker, and C Locales",
      "content_html": "<p>Today's episode of &quot;things that make you go 'wat'&quot; is sponsored by <code>unzip</code>.\nYes, the venerable utility ubiquitous on UNIX-like systems.\nI mean, what could possibly go wrong?</p>\n<p>This morning I was minding my own business sipping a coffee,\nwhen suddenly the &quot;simple&quot; CI pipeline I was working on failed.\nI literally copied the command that failed out of a shell script.\nWhich ran on the exact same machine previously.\nAnd the command that failed was a simple <code>unzip</code>.\nHuh?</p>\n<p>I initially assumed the file may have been corrupted, or maybe the archive had failed in a weird way\n(I don't control the process, and was downloading it from the internet).\nWhile it was running again, I started hitting <code>Page Down</code> looking for oddities in the log.\nI was greeted by this near the end of the output:</p>\n<pre><code>se/municipality_of_savsj#U00f6-addresses-city.geojson:  mismatching &quot;local&quot; filename (se/municipality_of_savsjö-addresses-city.geojson),\n         continuing with &quot;central&quot; filename version\n</code></pre>\n<p>Weird, huh?\nIt looks like it's trying to unzip a file with a UTF-8 name, which should be supported.</p>\n<p>I <code>ssh</code>'d into the remote machine to give it a try in the terminal.\nI've run this command dozens of times, and wanted to see if I got the same output in a standard <code>bash</code> remote terminal.\nGiven that log output is sometimes inscrutable, I thought &quot;maybe it wasn't the unzip itself that failed directly?&quot;\nOr something...</p>\n<p>Well, the <code>unzip</code> worked in the regular bash login shell on the same host.\nSo there must be <em>something</em> different about the CI environment.</p>\n<h1><a href=\"#searching-for-the-root-cause\" aria-hidden=\"true\" class=\"anchor\" id=\"searching-for-the-root-cause\"></a>Searching for the root cause</h1>\n<p>As a first resort, I started digging through the help and man 
pages.\nThe first amusing factoid I learned was that apparently the last official release of this utility happened in 2009.\nI guess zip doesn't evolve much, eh?</p>\n<p>The man page was actually pretty detailed,\nbut it didn't have a lot to say about Unicode, or anything else obvious related to the error.\nAnd the project doesn't exactly have an active central issue tracker (there is one on SourceForge, but replies are few and far between).\nI couldn't find anyone else talking about this specific error in the usual places on the internet either.</p>\n<p>I did however find this interesting <a href=\"https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=918894\">bug report via a Debian mailing list</a>\nfrom 2019.\nThe thing that caught my eye was the code sample and mention of locales.</p>\n<p>Oh no...\nWhat if this actually changes behavior based on your system locale?\nThen it started coming back to me... something about a dozen or so environment variables\nthat determine the behavior of C programs in bizarre ways...</p>\n<p>So I ran <code>locale</code> on the <code>ssh</code> session and got something reasonable back:</p>\n<pre><code>$ locale\nLANG=C.UTF-8\nLANGUAGE=\nLC_CTYPE=&quot;C.UTF-8&quot;\nLC_NUMERIC=&quot;C.UTF-8&quot;\nLC_TIME=&quot;C.UTF-8&quot;\nLC_COLLATE=&quot;C.UTF-8&quot;\nLC_MONETARY=&quot;C.UTF-8&quot;\nLC_MESSAGES=&quot;C.UTF-8&quot;\nLC_PAPER=&quot;C.UTF-8&quot;\nLC_NAME=&quot;C.UTF-8&quot;\nLC_ADDRESS=&quot;C.UTF-8&quot;\nLC_TELEPHONE=&quot;C.UTF-8&quot;\nLC_MEASUREMENT=&quot;C.UTF-8&quot;\nLC_IDENTIFICATION=&quot;C.UTF-8&quot;\nLC_ALL=\n</code></pre>\n<p>Then, to avoid wasting even more hours of CI time (it's a long job whose source is a ~100GB ZIP file),\nI set about replicating the environment of the runner.\nFortunately it left a Docker volume behind for me.\nSo, I launched an interactive Docker container using the same image that the CI job used (<code>python:3.13</code>).\nSomewhat surprisingly, the output of <code>locale</code> was a 
bunch of variables set to.... nothing!\nNow we're getting somewhere!</p>\n<p>To confirm that I could indeed reproduce the failure, I ran <code>unzip</code> again in the container,\nand sure enough, I got the same log message, and the exit code was <code>1</code>.\nNow we've confirmed how to reproduce the issue at least, so we can test a fix.</p>\n<h1><a href=\"#how-locale-affects-unzip\" aria-hidden=\"true\" class=\"anchor\" id=\"how-locale-affects-unzip\"></a>How locale affects <code>unzip</code></h1>\n<p>Since <code>unzip</code> is single threaded (read: SLOW),\nI spent my time during the tests looking through the source code to try and confirm my theory about the locale variables being the issue.\nThe official version seems to be <a href=\"https://sourceforge.net/projects/infozip/\">hosted on SourceForge</a>, which is apparently still a thing.\n(I'm sure there are a lot of Debian patches, but I just wanted a quick way to peruse the code).\nAmusingly, there were 47,074 downloads/week of <code>unzip60.tar.gz</code>, and only 36 downloads/week of the ZIP version.</p>\n<p>Eventually, I found what I was looking for in <code>unzip.c</code>:</p>\n<pre><code class=\"language-c\">int unzip(__G__ argc, argv)\n    __GDEF\n    int argc;\n    char *argv[];\n{\n#ifndef NO_ZIPINFO\n    char *p;\n#endif\n#if (defined(DOS_FLX_H68_NLM_OS2_W32) || !defined(SFX))\n    int i;\n#endif\n    int retcode, error=FALSE;\n#ifndef NO_EXCEPT_SIGNALS\n#ifdef REENTRANT\n    savsigs_info *oldsighandlers = NULL;\n#   define SET_SIGHANDLER(sigtype, newsighandler) \\\n      if ((retcode = setsignalhandler(__G__ &amp;oldsighandlers, (sigtype), \\\n                                      (newsighandler))) &gt; PK_WARN) \\\n          goto cleanup_and_exit\n#else\n#   define SET_SIGHANDLER(sigtype, newsighandler) \\\n      signal((sigtype), (newsighandler))\n#endif\n#endif /* NO_EXCEPT_SIGNALS */\n\n    /* initialize international char support to the current environment */\n    SETLOCALE(LC_CTYPE, 
&quot;&quot;);\n\n#ifdef UNICODE_SUPPORT\n    /* see if can use UTF-8 Unicode locale */\n# ifdef UTF8_MAYBE_NATIVE\n    {\n        char *codeset;\n#  if !(defined(NO_NL_LANGINFO) || defined(NO_LANGINFO_H))\n        /* get the codeset (character set encoding) currently used */\n#       include &lt;langinfo.h&gt;\n\n        codeset = nl_langinfo(CODESET);\n#  else /* NO_NL_LANGINFO || NO_LANGINFO_H */\n        /* query the current locale setting for character classification */\n        codeset = setlocale(LC_CTYPE, NULL);\n        if (codeset != NULL) {\n            /* extract the codeset portion of the locale name */\n            codeset = strchr(codeset, '.');\n            if (codeset != NULL) ++codeset;\n        }\n#  endif /* ?(NO_NL_LANGINFO || NO_LANGINFO_H) */\n        /* is the current codeset UTF-8 ? */\n        if ((codeset != NULL) &amp;&amp; (strcmp(codeset, &quot;UTF-8&quot;) == 0)) {\n            /* successfully found UTF-8 char coding */\n            G.native_is_utf8 = TRUE;\n        } else {\n            /* Current codeset is not UTF-8 or cannot be determined. */\n            G.native_is_utf8 = FALSE;\n        }\n        /* Note: At least for UnZip, trying to change the process codeset to\n         *       UTF-8 does not work.  For the example Linux setup of the\n         *       UnZip maintainer, a successful switch to &quot;en-US.UTF-8&quot;\n         *       resulted in garbage display of all non-basic ASCII characters.\n         */\n    }\n# endif /* UTF8_MAYBE_NATIVE */\n</code></pre>\n<p>And there we have it, right at the start of the program...</p>\n<p>This is a bit rough to grok if you don't regularly read C,\nbut the gist of it is that it tries to look at the system locale\nusing <a href=\"https://en.cppreference.com/w/c/locale/setlocale.html\"><code>setlocale</code></a>.\nThe <code>LC_CTYPE</code> specifies the types of character used in the locale.\nThe second argument is... 
very C.\nThe function will &quot;install the specified system locale... as the new C locale.&quot;</p>\n<p>It defaults to <code>&quot;C&quot;</code> at startup, per the docs.\nIn <code>unzip.c</code> though, the authors use the magic value <code>&quot;&quot;</code>,\nwhose behavior is to set the locale to the &quot;user-preferred locale&quot;.\nThis comes from those environment variables.\nA few lines down, past the <code>#ifdefs</code> gating Unicode support,\nthey query the current locale by passing <code>NULL</code>, ANOTHER magic value\nthat simply returns the current setting, and then parse out the codeset portion.\n(So this two-step dance loads the preferred locale from the environment variables,\nwhich are empty in this case, and then inspects the codeset.)</p>\n<p>If your environment variables do not explicitly specify UTF-8,\nyou will get some... strange and undesirable behavior, apparently,\nif your archive contains Unicode file names!</p>\n<p>I'm pretty sure there's a good historical reason for this,\ngiven that many C library functions are locale-sensitive\nand UTF-8 support is much younger than UNIX and (obviously) C.\nBut I found this behavior from <code>unzip</code> to be a bit surprising nonetheless.\nAnd rather than setting a sensible default like <code>C.UTF-8</code>,\nit turns out most Docker containers start with none!</p>\n<h1><a href=\"#the-fix\" aria-hidden=\"true\" class=\"anchor\" id=\"the-fix\"></a>The Fix</h1>\n<p>Fortunately this is pretty easy to fix.\nJust <code>export LC_ALL=&quot;C.UTF-8&quot;</code> before using <code>unzip</code>.\n(There is probably a more granular approach, but the <code>LC_ALL</code> sledgehammer does the job.)\nKinda crazy that you have to do this,\nbut hopefully this saves someone else an afternoon of debugging!</p>\n",
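The codeset check above is easy to replay outside of C. Here is a minimal sketch in Python (not the original C) of the fallback branch in `unzip.c` that parses the locale name itself; the helper names `codeset_from_locale` and `native_is_utf8` are mine, not from the source:

```python
# A sketch (in Python, not the original C) of unzip.c's fallback codeset
# check: take a locale name like "C.UTF-8", split off everything after
# the '.', and compare it to "UTF-8". Helper names are hypothetical.

def codeset_from_locale(locale_name):
    """Return the codeset portion of a name such as 'C.UTF-8', or None."""
    _, dot, codeset = locale_name.partition(".")
    return codeset if dot else None

def native_is_utf8(locale_name):
    """Mirror the strcmp(codeset, "UTF-8") == 0 decision in unzip.c."""
    return codeset_from_locale(locale_name) == "UTF-8"

# The ssh session's locale passes the check...
print(native_is_utf8("C.UTF-8"))  # True
# ...while the bare "C"/"POSIX"/empty locales a fresh Docker container
# falls back to do not, which is why unzip mangled the UTF-8 names there.
print(native_is_utf8("C"))        # False
print(native_is_utf8(""))         # False
```

On real systems `unzip` prefers `nl_langinfo(CODESET)` when `langinfo.h` is available; this sketch mirrors only the string-parsing `#else` branch shown above.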
      "summary": "",
      "date_published": "2025-08-27T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "unicode"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//unicode-normalization.html",
      "url": "https://ianwwagner.com//unicode-normalization.html",
      "title": "Unicode Normalization",
      "content_html": "<p>Today I ran into an <a href=\"https://www.openstreetmap.org/node/9317391311/history/2\">amusingly named place</a>,\nthanks to some sharp eyes on the OpenStreetMap US Slack.\nThe name of this restaurant is listed as &quot;𝐊𝐄𝐁𝐀𝐁 𝐊𝐈𝐍𝐆 𝐘𝐀𝐍𝐆𝐎𝐍&quot;.\nThat isn't some font trickery; it's a bunch of Unicode math symbols\ncleverly used to emphasize the name.\n(Amusingly, this does not actually show up properly on most maps, but that's another story for another post).</p>\n<p>I was immediately curious how well the geocoder I spent the last few months building handles this.</p>\n<p><figure><img src=\"media/kebab-king-duplicates.png\" alt=\"A screenshot of a search result list showing two copies of the Kebab King Yangon, one in plain ASCII and the other using the math symbols\" /></figure></p>\n<p>Well, at least it found the place, despite the very SEO-unfriendly name!\nBut what's up with the second search result?</p>\n<p>Well, that's a consequence of us pulling in data from multiple sources.\nIn this case, the second result comes from the <a href=\"https://opensource.foursquare.com/os-places/\">Foursquare OS Places</a> dataset.\nIt seems that either the Placemaker validators decided to clean this up,\nor the Foursquare user who added the place didn't have that key on their phone keyboard.</p>\n<p>One of the things our geocoder needs to do when combining results is deduplicating results.\n(Beyond that, it needs to decide which results to keep, but that's a much longer post!)\nWe use a bunch of factors to make that decision, but one of them is roughly\n&quot;does this place have the same name,&quot; where <em>same</em> is a bit fuzzy.</p>\n<p>One of the ways we can do this is normalizing away things like punctuation and diacritics.\nThese are quite frequently inconsistent across datasets, so two nearby results with similar enough names\nare <em>probably</em> the same place.\nFortunately, Unicode provides a few standardized transformations into 
canonical forms\nthat make this easier.</p>\n<h1><a href=\"#composed-and-decomposed-characters\" aria-hidden=\"true\" class=\"anchor\" id=\"composed-and-decomposed-characters\"></a>Composed and decomposed characters</h1>\n<p>What we think of as a &quot;character&quot; does not necessarily have a single representation in Unicode.\nFor example, there are multiple ways of encoding &quot;서울&quot;, all of which look the same when rendered\nbut have different binary representations.\nThe Korean writing system is perhaps a less familiar case for many,\nbut characters with diacritical marks such as accents work the same way.\nThey can be either &quot;composed&quot; or &quot;decomposed&quot; into their component parts\nat the binary level.</p>\n<p>This composition and decomposition transform is useful for (at least) two reasons:</p>\n<ol>\n<li>It gives us a consistent form that allows for easy string comparison when multiple valid encodings exist.</li>\n<li>It lets us strip away parts that we don't want to consider in a comparison, like diacritics.</li>\n</ol>\n<p>I use the <a href=\"https://docs.rs/unicode-normalization/latest/unicode_normalization/\"><code>unicode_normalization</code></a> crate\nto do this &quot;decompose and filter&quot; operation.\nSpecifically, the <a href=\"https://docs.rs/unicode-normalization/latest/unicode_normalization/trait.UnicodeNormalization.html\"><code>UnicodeNormalization</code> trait</a>,\nwhich has helpers that work on most string-like types.</p>\n<h1><a href=\"#normalization-forms\" aria-hidden=\"true\" class=\"anchor\" id=\"normalization-forms\"></a>Normalization forms</h1>\n<p>You might notice there are four confusingly named methods in the trait:\n<code>nfd</code>, <code>nfkd</code>, <code>nfc</code>, and <code>nfkc</code>.\nThe <code>nf</code> stands for &quot;normalization form&quot;.\nThese functions <em>normalize</em> your strings.\n<code>c</code> and <code>d</code> stand for composition and decomposition.\nThe composed form 
is, roughly, the more compact form,\nwhereas the decomposed form is the version where you separate the base from the modifiers,\nthe <a href=\"https://en.wikipedia.org/wiki/List_of_Hangul_jamo\">jamo</a> from the syllables, etc.</p>\n<p>We were already decomposing strings with form NFD so that we could remove the diacritics.\nThis works great for diacritics and even Hangul,\nbut 𝐊𝐄𝐁𝐀𝐁 𝐊𝐈𝐍𝐆 𝐘𝐀𝐍𝐆𝐎𝐍 shows that we were missing something.</p>\n<p>That something is the <code>k</code>, which stands for &quot;compatibility.&quot;\nYou can refer to <a href=\"https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence\">Unicode Standard Annex #15</a>\nfor a full definition,\nbut the intuition is that <em>compatibility</em> equivalence of two characters\nis a bit more permissive than the stricter <em>canonical</em> equivalence.\nBy reducing two characters (or strings) to their canonical form,\nyou will be able to tell if they represent the same &quot;thing&quot; with the same visual appearance,\nbehavior, semantic meaning, etc.\nCompatibility equivalence is a weaker form.</p>\n<p>That weaker equivalence is extremely useful in our quest to determine whether two nearby place names\nare a fuzzy match.\nIt reduces things like ligatures, superscripts, and width variations into a standard form.\nIn the case of &quot;𝐊𝐄𝐁𝐀𝐁 𝐊𝐈𝐍𝐆 𝐘𝐀𝐍𝐆𝐎𝐍,&quot; compatibility decomposition transforms it into the ASCII\n&quot;KEBAB KING YANGON.&quot;\nAnd now we can correctly coalesce the available information into a single search result.</p>\n<p>Hopefully this shines a light on one small corner of the complexities of Unicode!</p>\n",
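The post uses the Rust `unicode_normalization` crate; the same four forms are exposed by Python's stdlib `unicodedata`, which makes for a compact sketch of the "decompose and filter" trick and of why NFKD, not NFD, is the form that rescues the kebab shop. The helper name `strip_diacritics` is mine, not from the geocoder:

```python
import unicodedata

def strip_diacritics(s):
    """Canonically decompose (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Canonical decomposition is enough for accents and diacritics...
print(strip_diacritics("Savsjö"))  # Savsjo

# ...but the math-bold letters are only *compatibility* equivalent to
# ASCII, so NFD leaves them untouched and NFKD folds them down:
fancy = "𝐊𝐄𝐁𝐀𝐁 𝐊𝐈𝐍𝐆 𝐘𝐀𝐍𝐆𝐎𝐍"
print(unicodedata.normalize("NFD", fancy) == fancy)  # True
print(unicodedata.normalize("NFKD", fancy))          # KEBAB KING YANGON
```

The Rust trait's `nfd()`/`nfkd()` iterator adapters behave the same way as these calls; Python just names the form as a string argument.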
      "summary": "",
      "date_published": "2025-05-09T00:00:00-00:00",
      "image": "media/kebab-king-duplicates.png",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "unicode",
        "rust"
      ],
      "language": "en"
    }
  ]
}