[css-content][css-fonts][css-text] Language-dependent behavior in CSS with ill-formed language tags #7098

Open
@jfkthame

This question is about CSS-related behaviors that depend on the content language, and how they respond when the content has an ill-formed lang attribute.

Examples of CSS features that are affected include font resolution (e.g. whether generic families like sans-serif used with CJK content resolve to Japanese or Chinese font faces should depend on the language), auto-hyphenation, and generation of quote marks around <q> elements.

HTML says that the lang attribute must be a BCP 47 language tag, and BCP 47 says that such tags are composed of a sequence of subtags separated by hyphens.

However, we've seen content in the wild where the lang attribute uses an underscore instead of a hyphen to separate subtags, as in en_US or en_GB. As I understand it, according to BCP 47, such a tag is ill-formed, but it's not entirely surprising that such errors show up, as the underscore separator is used in POSIX locale codes, and in other major software systems.

Question: should browsers pay any attention to such language tags, even though they are not correct BCP 47 tags?

Some testing indicates that current behavior is a bit haphazard. I've created codepen testcases to see whether an ill-formed lang tag affects (1) font resolution, (2) hyphenation, and (3) quote marks, or is ignored.

Results:

(1) In WebKit and Blink, the "bad" lang tag affects font resolution. In Gecko, it doesn't; but I just landed a patch to change this behavior, so that upcoming Firefox Nightly will behave like WebKit and Blink browsers in this respect. (This was before I realized quite how messy the current situation is. We could revert it.)

(2) In WebKit and Blink on macOS, the "bad" lang tag affects hyphenation, but in Blink on Windows, it doesn't. In Gecko, it doesn't on any platform.

(3) No browser pays attention to the "bad" lang tag for the purpose of generating quote marks.

Furthermore, as far as I can tell no browser accepts such tags in JS: calling new Intl.Locale("en_US") throws an error in all browsers I tested.
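For anyone who wants to reproduce the JS-side observation, here is a minimal sketch. The helper name `isAcceptedByIntlLocale` is made up for illustration. Note that `Intl.Locale` checks Unicode locale identifier syntax, which is close to but not identical to BCP 47 well-formedness, so treat this as an approximation rather than a strict BCP 47 validator.

```javascript
// Hypothetical helper: reports whether the engine's Intl.Locale constructor
// accepts the given tag. Structurally invalid tags cause a RangeError.
function isAcceptedByIntlLocale(tag) {
  try {
    new Intl.Locale(tag);
    return true;
  } catch (e) {
    if (e instanceof RangeError) {
      return false;
    }
    throw e; // anything else is unexpected, so let it propagate
  }
}

// Compare hyphen-separated tags with their underscore-separated variants.
for (const tag of ["en-US", "en_US", "de-AT", "de_AT"]) {
  console.log(tag, isAcceptedByIntlLocale(tag));
}
```

The hyphenated tags are accepted and the underscore variants are rejected, which matches the behavior described above.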

So on the JS side, things seem clear enough: only valid BCP 47 is accepted, anything else throws an error. But on the HTML/CSS side, it's a mess. Currently, Gecko never respects invalid tags (aside from the just-landed font-resolution patch mentioned in (1)), while WebKit and Blink do respect them for font-resolution purposes. And for hyphenation control, Blink may or may not respect them, depending on the platform.

Can we get some better interop here? Ideally, I think we should agree (and perhaps clarify in a note somewhere) that only well-formed BCP 47 language tags will have any effect on the content-language-dependent CSS features, and the browsers that are currently accepting ill-formed tags should stop doing so.

Alternatively, we should agree on exactly which kinds of ill-formed tags are accepted, and record this in a spec so that we can all converge on compatible behavior. It makes no sense that en_US enables US English hyphenation in Chrome on macOS but not in Chrome on Windows; and it makes no sense that de_AT selects Austrian-German hyphenation but does not activate Austrian-German quote marks.
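To make the second option concrete, here is a rough sketch of one possible lenient rule: treat underscores as hyphens, then require the result to be structurally valid. This is purely illustrative. The function name `normalizeLenient` and the rule itself are assumptions, not something any spec or browser is known to implement, and the validity check relies on `Intl.getCanonicalLocales`, which follows Unicode locale identifier syntax rather than BCP 47 exactly.

```javascript
// Hypothetical lenient rule: map "_" to "-", then accept the tag only if the
// result is structurally valid. Returns the canonicalized tag, or null.
function normalizeLenient(rawLang) {
  const candidate = rawLang.trim().replace(/_/g, "-");
  try {
    // Intl.getCanonicalLocales throws a RangeError for invalid tags and
    // otherwise returns an array of canonicalized tags.
    return Intl.getCanonicalLocales(candidate)[0];
  } catch (e) {
    if (e instanceof RangeError) {
      return null;
    }
    throw e;
  }
}

console.log(normalizeLenient("en_US"));  // "en-US"
console.log(normalizeLenient("de_AT"));  // "de-AT"
console.log(normalizeLenient("en__US")); // null (empty subtag is still invalid)
```

Whatever rule is chosen, the point is that all language-dependent features (fonts, hyphenation, quotes) would run the lang value through the same normalization step, so they could not disagree with each other the way they do today.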
