Breaking our Latin-1 assumptions (2017)

robin_reala · on June 18, 2022

This is usually the point to link to the Big List of Naughty Strings: https://github.com/minimaxir/big-list-of-naughty-strings

If your system can handle these it can probably handle most global text.

sacrosancty · on June 19, 2022

Perhaps my history is wrong but it seems like an unfortunate accident that some of these languages must have not had much printing press and typewriter use so they retained their cursive scripts a few decades longer than English and now they're being frozen in time because unicode can accommodate them while they're still popular. I guess the reason we don't care about cursive English unicode is because traditional printing couldn't do it so we've already adapted to separate characters while those Arabs and what-not must have still been writing mostly by hand. Does that sound likely?

svat · on June 19, 2022

No, this does not match history. Speaking for Indic scripts, which I know best: the first Devanagari books were printed in the 1700s. While this is a few centuries later than Gutenberg's printing press c. 1450, by the time Unicode (or even computers) came into existence, there were already several centuries of "much printing press and typewriter use" (ok, maybe only a century or so of typewriters).

Do you really think "those Arabs and what-not" did not have printing before Unicode condescended to "accommodate them" in 1988? Traditional printing could very well do Indic scripts, and so could typewriters. And so could computers: in fact, Unicode for Indic scripts takes inspiration from ISCII, an Indian standard extending ASCII for encoding scripts on computers, that dates back to 1983.

DiggyJohnson · on June 19, 2022

> Do you really think […]

I downvoted you for this: our guidelines require you to assume the strongest interpretation of the comment you are responding to. From my perspective, you’re comment is needlessly reactionary to a thoughtful, well stated question.

Sorry to include this lecture, but it helps my conscious to avoid drive-by downvoting. I’m very much open to a rebuttal.

svat · on June 19, 2022

Thanks. In fact, I did consider the guidelines: my impulse was to write "Think about what you're suggesting […] Consider whether this really seems plausible", but I thought that's too confrontational/focuses too much on the person, and "Do you really think…" was a softer/more neutral way of phrasing it. Apparently not!

(This is a different part of the guidelines than "respond to the strongest plausible interpretation" BTW: the comment very explicitly suggests non-Latin scripts not having had "much printing press and typewriter use", so I don't think there's much scope for interpretation there. And in fact, the idea is inherently a condescending and colonialist one—of Western superiority over unsophisticated "those" people—even though it may well be held innocently and stated as such. But I do take the point about not responding to the person, and responding kindly to the idea instead. Will try harder, as I think I usually do.)

(BTW "reactionary" means something specific…)

DiggyJohnson · on June 20, 2022

We still disagree, but I appreciate your reply and it is indeed logically consistent. Thanks for the explanation.

sacrosancty · on June 20, 2022

Devanagari looks like separate characters to me, just like English. Are you talking about a cursive script where each character's shape has to be modified to link to it neighbors? I'm specifically only talking about cursive scripts, not just any non-English alphabet.

svat · on June 20, 2022

For example, in Devanagari, the sequence

    • U+0915: DEVANAGARI LETTER KA (क)
    • U+093F: DEVANAGARI VOWEL SIGN I  (ि)

gives कि where the vowel sign not only attaches to the letter, but (in this case) happens to be to the left of it.

Similarly, the sequence

    • U+‎0915: DEVANAGARI LETTER KA (क)
    • U+‎094D: DEVANAGARI SIGN VIRAMA ( ्)
    • ‎U+0937: DEVANAGARI LETTER SSA (ष)
    • ‎U+0941: DEVANAGARI VOWEL SIGN U ( ु)

gives क्षु where the shapes of the "characters" are indeed modified based on which characters appear next to them.

This is kind of touched upon in the second section in the original post titled "Indic scripts": https://manishearth.github.io/blog/2017/01/15/breaking-our-l...

This is handled in traditional printing (and in modern Opentype fonts) by simply having separate pieces of type for every possible combination of consonants (there are a few hundred, see e.g. https://en.wikipedia.org/w/index.php?title=Devanagari_conjun...), and then having separate pieces of type for vowel signs (which have to have different lengths depending on the width of the consonant cluster, but a handful of lengths for each will do), the same way any good Latin-script font will include ligatures like ﬁ, and will also include separate glyphs for the accent marks (things like ´ ` ¨ ˆ ˜ which can be placed over letters to give é à ü î ñ and so on).

Incidentally, Gutenberg's very first Bible itself used nearly 300 letter forms, including ligatures and variants of several letter (see e.g. https://www.oldbookappreciator.com/libraryofhistoricaltype/e... )

(About "cursive" vs "separate" letters: conceptually it's not such a big difference, because there are only finitely many letters/characters in the script, so as long as you carefully specify the ending position of each glyph for each possible choice of following glyph, you produce the appearance of everything being joined. Indeed that's how traditional printing and modern fonts do it too. And for that matter, cursive fonts in English: try playing with https://fonts.google.com/specimen/Cedarville+Cursive which if it had been made with slightly better kerning rules would likely modify shapes of letters depending on adjacent letters. Maybe one of these fonts do it? https://fontsgeek.com/search/?q=cursive )

sacrosancty · on June 22, 2022

OK, I just multiplied 26 by 26 and realized that you would only need a few hundred different forms of letters, at worst, to print cursive English that way. Printers could have done that just as they did with Devanagari, so perhaps my whole idea is wrong.

kevin_thibedeau · on June 19, 2022

Nothing in Unicode prevents rendering Western languages in cursive. It's just that nobody wants to do the intensive work to generate the layouts and manage the required fonts as was done with Harfbuzz.

winety · on June 19, 2022

There are cursive typefaces for the Latin script, e.g. [1]. There might even be some free ones. Making (good looking) fonts is hard work, but I don’t think making cursive fonts is that much harder.

[1]: https://www.dizajndesign.sk/en/font/skolske

sacrosancty · on June 20, 2022

That supports my idea. People don't much care about cursive in English because we're already adapted to isolated characters.

EdiX · on June 19, 2022

> it seems like an unfortunate accident that some of these languages must have not had much printing press and typewriter use

it wasn't an accident, the printing press was banned by the ottoman empire in 1485.

abdulhaq · on June 19, 2022

Or is it actually fortunate? The delay in properly incorporating Arabic into the IT environment gave the rendering infrastructure time to enrich and develop more elaborate textual representation such as advanced ligatures. This means Arabic text rendering has not had to be bent into the plain linear forms that Latin text more naturally falls into.

sacrosancty · on June 20, 2022

Instead it's bent into the continuous squiggly curve forms that handwriting naturally falls into but it's not written by hand so it doesn't gain that value.

marginalia_nu · on June 19, 2022

Western print letters are fairly faithful replicas of western handwriting, specifically that of humanist minuscule (<https://en.wikipedia.org/wiki/Humanist_minuscule>). Like there's been some ligatures and shorthand, but for the most part, this type of script has always been an extremely printing press-friendly affair with distinct letters of reasonably uniform width.

Cursive has existed alongside print letters going all the way back to antiquity.

jahewson · on June 19, 2022

True indeed, though it was not these Italian scripts that were first to be printed, but the equally typesetting-friendly Gothic blackletter. https://en.m.wikipedia.org/wiki/Blackletter

tengwar2 · on June 19, 2022

But humanist minuscule is not a cursive script: the letters are separate. Cursive does not mean just hand-written, but written in a style designed for haste achieved by not lifting the pen between letters.

marginalia_nu · on June 19, 2022

Right, my point is they have existed alongside each other almost as long as the latin alphabet as we recognize it. Cursive is a bit younger, but they both trace from antiquity.

Both forms existed before print, during the inception of print, and long after print.

DemocracyFTW2 · on June 19, 2022

> while those Arabs and what-not must have still been writing mostly by hand. Does that sound likely?

It does sound rather snarky

DiggyJohnson · on June 19, 2022

Do you not see the hypocrisy in your response? I can understand the question seeming blunt, but not snarky.

Your comment on the other hand… /snark

DemocracyFTW2 · on June 21, 2022

totally disagree. You just don't say "those $other-ethnicity and what-not" where I come from, it sounds super-disrespectful.

kstenerud · on June 19, 2022

Ugh Unicode has been the bane of my existence trying to write a text format spec. I started by trying to forbid certain characters to keep files editable and avoid Unicode rendering exploits (like hiding text, or making structured text behave differently than it looks), but in the end it became so much like herding cats that I had to just settle on https://github.com/kstenerud/concise-encoding/blob/master/ct...

Basically allow everything except some separators, most control chars, and some lookalike characters (which have to be updated as more characters are added to Unicode). It's not as clean as I'd like, but it's at least manageable this way.

jfk13 · on June 19, 2022

Unfortunately, your "text safe" definition appears to exclude text in languages such as Persian (among others), where Zero Width Non-Joiner is required to write some words correctly.

kstenerud · on June 19, 2022

Oh damn good catch! Adding category Cf.

jfk13 · on June 19, 2022

Thanks.... though note that Cf also includes things like the bidi directional-override code points, which you might still prefer to exclude.

kstenerud · on June 19, 2022

Yup, I've removed a bunch of Cf codes (it really is a grab bag!), but I'm not sure if removing BIDI could be done without breaking some languages?

https://github.com/kstenerud/concise-encoding/blob/master/ce...

asiachick · on June 19, 2022

at least you finally got to the right conclusion. If you had started there, maybe you wouldn't have considered it hard to write your spec.

SeanLuke · on June 19, 2022

> Now, the Han characters are ideographs. This is not a phonetic script; individual characters represent words.

[At least in Chinese] Characters represent parts of words. Some basic words consist of just one character, but others consist of several characters. Characters impart a mood to the word -- the set of words in which a character appears generally have similar meanings. I think the closest thing metaphor would be to describe characters as latin roots.

DiggyJohnson · on June 19, 2022

Somewhere between Latin roots and syllables, maybe? Or is there no reasonable comparison between Han characters and Romance syllables?

SeanLuke · on June 21, 2022

I don't think there's any reasonable comparison except for transliterated foreign words. Latin roots seem the closest metaphor.

Dylan16807 · on June 18, 2022

> While this doesn’t affect rendering, Unicode, as a system for describing text, also has a concept of interlinear annotation characters. These are used to represent furigana / ruby. Fonts don’t render this, but it’s useful if you want to represent text that uses ruby.

Useful as long as "represent" means internal in-process use only. You're supposed to never save them into documents or send them between systems.

shadowofneptune · on June 19, 2022

The most curious errors I got when testing a Unicode-enabled lexer were not from the lexer itself.

These are valid C-style identifiers if you extend those rules to Unicode, and editors have no problem with them:

fish_3

рыбы_3

So are these, despite the combination of left-to-right and right-to-left characters:

سمك_٣ (Eastern Arabic numeral for three, note how it appears to come first to an LTR reader.)

سمك_3 (Western Arabic numeral for three)

Problem is, editors really have trouble dealing with mixed-direction words like this. The caret stops pointing to the correct part of the word, the home and end keys become useless, etc. It's an odd situation where the lexer can handle these edge cases well but I cannot think of how you'd actually get them into the compiler.

cm2187 · on June 19, 2022

On a tangential topic, naive question: what does the programming language landscape looks like in China?

English is almost universally spoken among western college graduates, and it is the same character set. But I can’t imagine the difficulty of learning programming compounded by it being in a language you barely understand with a character set you are unfamiliar with. Did Chinese language based programming languages appear? Or is everyone eating the bullet (or perhaps I am overstating the difficulty, hard to get an idea of how much english the average chinese college graduate is exposed to).

andi999 · on June 19, 2022

Basically everyone who hasnt reached retirememt age has learned pinyin which is a phonetic transcription using latin characters. For the english language used in programming the difficulty should be the same as lets say an Italien student would face.

masswerk · on June 19, 2022

> Also, not all code points have a single-codepoint uppercase version. The eszett (ß) capitalizes to “SS”. There’s also the “capital” eszett ẞ, but its usage seems to vary and I’m not exactly sure how it interacts here.

Don't use capital ß. While it may be found in a few places on the Web in posts of enthusiasts, it's not used in real life. (While it's based on an old proposal, it has been introduced only recently and it's supported only by recent OSes. At the same time – or rather, even before this –, orthographic reforms have replaced the lower-case variant by "ss" for most use cases. So it's anachronistically lost in retro-futurism, but bare of any retro-futurism charms.)

Fun fact: "eszett" is a denomination specific to Germany, in Austria it's "scharfes s" ("sharp s", or rather, acute s), and the Swiss got rid of it altogether in the first half of the 20th century (1938), generally replacing it by "ss".

dfawcus · on June 19, 2022

> At the same time – or rather, even before this –, orthographic reforms have replaced the lower-case variant by "ss" for most use cases.

Interesting. So maybe Germans will eventually experience a similar double take as I do when reading older (English) texts which used the 'long s', though that may well take a similar ~ 150 yrs.

https://en.wikipedia.org/wiki/Long_S

masswerk · on June 19, 2022

I guess so, the tendency is towards abolishing the "ß" altogether, the current state of affairs is merely a compromise, which may be rather temporary.

Regarding the "long s", this has been in use in German writing until the early 20th century, as well. One of the best things about "ß" is that there is no consensus what this actually is. Historically, it's a ligature, probably of a long s and a round s, and it can be found in renaissance Italian cursive (e.g., in samples by Palatino). Also, in 17th century type setting, it can be seen as a compositum of long and round s. This also explains, why there is only a lower-case form, as there is also (mostly) only a lower-case long s. So "eszett" or "ß" are probably historically wrong and it was only in Fraktur (non-Latin broken letter type) that "ß" became split into "s" and "z", and ever since, there is this ambiguity. Both "SZ" and "SS" are viable capitalizations, with the former being more authoritative in the mid-20th century and a strong tendency towards the latter (which is about the only form used nowadays).

(Does the ambiguity matter? Not at all. Mind that probably only a minority knows that the ampersand (&) is a ligature of "et", or that "@" is literarily "at" and that you can write "it" the same way. We're perfectly able to use these things without knowing what they are.)

Regarding letter forms, there are plenty sources of worry in German writing: There's Fraktur in books, mostly from the 18th century until 1940 (contrary to common belief, Fraktur was not the preferred typeface of Nazi-Germany, rather they abolished it), there's Schwabacher typeface, which was Latin characters, but still with broken forms, there had been long hand and short hand Kurent cursive, various national forms and epochs of Latin hand writing, even after WW II, etc. Fun!

hgs3 · on June 19, 2022

The author’s original blog post praised Swift because it chose grapheme clusters as its default text abstraction. This might be fine for languages primarily intended to operate on UI’s or DSL's but it’s a bad idea for general purpose languages. Grapheme clusters can change between Unicode versions, between locales, and confuse parsers (imagine a CSV parser that doesn’t see a “comma” anymore because the comma got unexpectedly clumped into a grapheme cluster). Furthermore, Unicode assigns properties to individual code points - not grapheme clusters. You can't query these properties if all you see are graphemes.

The correct abstraction for working with Unicode text is the code point. UTF whatever is an implementation detail and not something to get hung up on. Python is one of the few languages that gets this right.

Jasper_ · on June 19, 2022

> Furthermore, Unicode assigns properties to individual code points - not grapheme clusters. You can't query these properties if all you see are graphemes.

Languages like Swift give you the ability to iterate over encoded bytes, graphemes, or code points. You can access these properties if you really want to, however, the point is that these properties might not be useful in practice.

I kind-of disagree with the author that grapheme clusters meaningfully solve the problem. I think it's just another level of kicking the can down the road, with its own set of problems. But I think that it's at least closer than code points.

> The correct abstraction for working with Unicode text is the code point.

I am not aware of any scenario where the code point is meaningfully what you want. Feel free to inform me of a case.

> UTF whatever is an implementation detail and not something to get hung up on.

Grapheme clusters are not part of the transmission format.

hgs3 · on June 19, 2022

> I am not aware of any scenario where the code point is meaningfully what you want.

The code point is the most meaningful thing. I'm confused as to what you think you're parsing when you parse text?

> Grapheme clusters are not part of the transmission format.

Absolutely. The intent of my comment was to point out how most programming languages expose strings as some specific UTF encoding rather than exposing them as a collection of code points - exposing strings as UTF anything is a leaky abstraction. An exception could be made for lower level languages, like C, which define strings as pointers to the encoded data.

asiachick · on June 19, 2022

Meh, Python picked a subset of unicode as valid identifiers effectively baking non-inclusiveness into the language. Swift, said "all are welcome"

walrus01 · on June 18, 2022

Most of what applies to the commentary there for arabic as right-to-left is also valid for Farsi/Dari/Tajik (pretty much the historical extent of the persian empire on a map, if you image search it), and Urdu.

Interestingly enough modern tajik is written using the russian cyrillic script, which was sort of forced on them in the post-1915 era. But there is a major resurgence in the use of the Farsi/Dari script and alphabet in modern Tajikistan. It's 95% mutually intelligible with what the Dari that's spoken in Kabul. Just a weird accent.

Urdu is of course a huge deal as it's the default language for Pakistan. People whose first language is something else like Balochi, or Pashto, or other will almost certainly learn modern standard Urdu at school. In addition to Pakistan's extensive use of English, of course.

other fun things: the letter "P" or peh doesn't exist in arabic, but does exist in Farsi. So a pizza would be a bizza, and so on. The farsi alphabet is obviously derived from arabic but has some key differences. There's a stanards body in Iran that has defined the 'normal' farsi keyboard layout and unicode information.

https://en.wikipedia.org/wiki/Pe_(Persian_letter)

https://en.wikipedia.org/wiki/Persian_alphabet

in much more "recent" times than the historical person empire, much of the historical extent of the mughal empire resulted in things like the right-to-left persian alphabet and farsi derived language you see in modern standard urdu. urdu is absolutely chock full of farsi words.

https://en.wikipedia.org/wiki/Mughal_Empire#Language

https://en.wikipedia.org/wiki/Persian_language_in_the_Indian...

as to how this might impact software, forms, database fields and such: there's now a VAST population of people who might prefer to either write their info, name, fill out forms entering urdu into a text entry field, or write stuff out in English text if that's the default language they use on the Internet in modern Pakistan. Or some combination of the two. And both are totally valid. You might have somebody's name written out phonetically in English in a text field and their street address and other details are in Urdu or Farsi. Or the other way around.

irrational · on June 18, 2022

We had to figure out the RTL issue for translations. We got it working great. Recently we were pulled into a project with a third party vendor. This project would be translated, including into RTL languages. We told them that they should not put this off, but make sure this worked from the start. We shared best practices with them and even shared some code. Did they listen? Of course not. So now the project is “finished” and it doesn’t work with RTL languages at all. Sigh. We keep trying to help and suggest how they can fix the situation, but they completely refuse to listen. We are just shaking our heads.

jiggawatts · on June 19, 2022

I see i18n and regional/language support going backwards recently, especially in the public cloud, which is very US-centric.

So for example, Azure SQL Database is forced to us-english, “dmy” date formatting, and UTC time zone. This cannot be changed.

Just give up on living outside the US or using local time.

throwaway290 · on June 19, 2022

The issue is bigger than just "dmy" date formatting. There are cultures that use different calendars altogether. I believe there were relevant extensions to ISO 8601 in the works. Should databases jump on implementing it at their level though? I don't see why store time as anything other than a UTC timestamp with enough precision (probably just epoch milliseconds), converting at GUI input/output points.

So I think you underestimate the sheer scope and bloat needed to accommodate all possible languages and cultures equally well. And if we abandon that idea, perhaps it makes sense to keep things simple and support customization instead. If on the other hand Azure doesn't even allow installing any non-English FTS engine that would be questionable of course.

iso1631 · on June 19, 2022

If I create a meeting for next November at 1600 local time in paris, and it's stored at 1500 UTC, what happens if France decides to stick with summer time between now and then. The meeting is still 1600 Paris, but the time will be 1400 UTC.

Similarly if I want a task to run at 6AM local time every day

If a time is in the future you usually want to store it as local time because we don't know what the offset between local and UTC will be.

throwaway290 · on June 19, 2022

For some purposes concerning local future event scheduling you are right that it may be necessary to store date-time representation with timezone (or location, if it's possible to reliably infer timezone from location) and calendar (if non-Gregorian) rather than absolute UTC stamp. Not sure how common such cases are, compared with recording timestamps of past events.

jiggawatts · on June 19, 2022

If you’re developing a database engine from scratch with no attempt at backwards compatibility with anything, then maybe you can get away with that. But there’s three decades worth of Microsoft SQL client applications that assume that GETDATE and GETUTCDATE do different things.

Microsoft just decided to pretend that globalisation is not a problem, but proceeded to market this platform to foreign enterprises as a good migration target for legacy databases and associated applications.

iso1631 · on June 19, 2022

Doesn't the U.S. use "MDY" formatting? At least it's an eclectic mix of defaults

jiggawatts · on June 19, 2022

Apologies, you are right. I'm so used to doing it in "dmy" order that I couldn't type it 'wrong' into a comment even on purpose...

notriddle · on June 18, 2022

(2017)

RicoElectrico · on June 18, 2022

If I had a dollar for every B2B software in my corp that screws up such basic thing...

Yet the box-ticker drones from IT procurement are content with vendors who assume we live in a world without diacritic marks, I guess. Using 8-bit character sets in 2022 is a "brown M&M's" [1] indicator for me. If they can't be bothered to use Unicode, what else they don't care about?

[1] https://www.entrepreneur.com/article/232420

walrus01 · on June 18, 2022

Really worrying is when you run into something like a major Canadian bank where the online banking portal system rejects passwords longer than 14 characters and with a wide range of standard english punctuation marks in it, likely because they're storing the damned thing as a plaintext string in some mainframe database field on equipment/software from the 1980s.

It was like that up until very recently for some gargantuan banks.

spitfire · on June 19, 2022

IF it's TD, then it's AS/400. For reals.

google234123 · on June 18, 2022

The point of this article is that unicode isn't a "basic thing".