Module talk:Unicode data

From Wikipedia, the free encyclopedia

About RTL

I am researching RTL scripts and came across this:

  • A
0x0041 -- LATIN CAPITAL LETTER A
Latn
is_rtl: false
  • ث
0x062B -- ARABIC LETTER THEH [1]
Arab
is_rtl: false
  • ש
0x05E9 -- HEBREW LETTER SHIN [2]
Hebr
is_rtl: false
  • ߖ
0x07D6 -- NKO LETTER JA [3]
Nkoo
is_rtl: false

I'd expect the Arab, Hebr, and Nkoo characters to return is_rtl: true. Am I misunderstanding something? @Erutuon: -DePiep (talk) 20:58, 9 January 2021 (UTC)

@DePiep: The invocation {{#invoke:Unicode data|is|rtl|05E9}} checks whether the literal characters "05E9" are right-to-left. To check the directionality of the Hebrew character itself, pass the literal character or an HTML character reference: {{#invoke:Unicode data|is|rtl|ש}} or {{#invoke:Unicode data|is|rtl|&#x05E9;}}.

#invoke:Unicode data|is|rtl, like #invoke:Unicode data|is|valid_pagename and #invoke:Unicode data|is|Latin, interprets its argument as a string rather than as a hexadecimal code point, because the corresponding functions in the module take strings. (They could accept hexadecimal arguments if someone edited the module to add a parameter telling them to interpret the argument that way.) — Eru·tuon 01:02, 10 January 2021 (UTC)
@Erutuon: Thanks, that will work for me. Great module! (The second code example is {{#invoke:Unicode data|is|rtl|&#x05E9;}}.) -DePiep (talk) 17:28, 10 January 2021 (UTC)
  • The four characters, is_rtl:
using &#x...; false
using &#x...; true
using &#x...; true
using &#x...; true
-DePiep (talk) 20:23, 10 January 2021 (UTC)
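
The string-versus-code-point distinction in this thread can be sketched in plain Lua. This is only an illustration under stated assumptions, not the module's code: the range table is a hand-picked subset of blocks, `is_rtl_string` and `is_rtl_hex` are hypothetical names, and "right-to-left" here means that every character of the string falls in one of the listed ranges, which may differ from the rule the module actually applies.

```lua
-- Illustrative subset of right-to-left blocks (assumption: the real module
-- consults its own data, not this table).
local rtl_ranges = {
	{ 0x0590, 0x05FF }, -- Hebrew block
	{ 0x0600, 0x06FF }, -- Arabic block
	{ 0x07C0, 0x07FF }, -- NKo block
}

local function codepoint_is_rtl(codepoint)
	for _, range in ipairs(rtl_ranges) do
		if range[1] <= codepoint and codepoint <= range[2] then
			return true
		end
	end
	return false
end

-- Decode a well-formed UTF-8 string into a list of code points (no validation).
local function utf8_codepoints(str)
	local result = {}
	local i = 1
	while i <= #str do
		local byte = str:byte(i)
		local codepoint, length
		if byte < 0x80 then
			codepoint, length = byte, 1
		elseif byte < 0xE0 then
			codepoint, length = byte - 0xC0, 2
		elseif byte < 0xF0 then
			codepoint, length = byte - 0xE0, 3
		else
			codepoint, length = byte - 0xF0, 4
		end
		for j = i + 1, i + length - 1 do
			codepoint = codepoint * 64 + (str:byte(j) - 0x80)
		end
		result[#result + 1] = codepoint
		i = i + length
	end
	return result
end

-- String interface: the argument is treated as text, character by character.
local function is_rtl_string(str)
	local codepoints = utf8_codepoints(str)
	if #codepoints == 0 then
		return false
	end
	for _, codepoint in ipairs(codepoints) do
		if not codepoint_is_rtl(codepoint) then
			return false
		end
	end
	return true
end

-- Hypothetical hexadecimal interface: the argument is parsed as a code point.
local function is_rtl_hex(hex)
	local codepoint = tonumber(hex, 16)
	return codepoint ~= nil and codepoint_is_rtl(codepoint)
end

print(is_rtl_string("05E9")) --> false (the digits 0, 5, 9 and the letter E)
print(is_rtl_string("ש"))    --> true
print(is_rtl_hex("05E9"))    --> true
```

With the string interface, "05E9" is four Latin-script characters, which is why the hexadecimal form returned false in the examples above.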

is_pagename

Resolved

In the function is_pagename, does "pagename" stand for "blockname", or for something wider? -DePiep (talk) 05:17, 27 March 2022 (UTC)

Resolved: it refers to a valid WP pagename, that is, one free of the invalid title characters listed at WP:NCTR, such as "#". -DePiep (talk) 11:34, 27 March 2022 (UTC)

Missing documentation: Hangul, Aliases

I am developing the documentation, especially in Module:Unicode data § List of functions. To complete it, can someone point out how or where the /aliases and /Hangul data are retrieved in the implementation? DePiep (talk) 11:39, 27 March 2022 (UTC)
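
As background for the /Hangul part of this question: the name_hooks table quoted further down this page reads loader.Hangul and indexes its leads, vowels and trails tables to build syllable names. A minimal standalone sketch of that name generation, following "Hangul Syllable Name Generation" in the Unicode Standard, might look like the following. The jamo short names are inlined here as an assumption about what the data module holds, the tables are 1-based for plain Lua, and `hangul_syllable_name` is a hypothetical name.

```lua
local floor = math.floor

-- Jamo short names as given by the Unicode Standard, stored 1-based here.
local leads = { "G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS",
	"", "J", "JJ", "C", "K", "T", "P", "H" }
local vowels = { "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
	"WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I" }
local trails = { "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
	"LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
	"K", "T", "P", "H" }

local trail_count = #trails           -- 28 trailing consonants (including none)
local final_count = #vowels * #trails -- 588 vowel/trail combinations per lead

local function hangul_syllable_name(codepoint)
	if codepoint < 0xAC00 or codepoint > 0xD7A3 then
		return nil -- not a precomposed Hangul syllable
	end
	local syllable_index = codepoint - 0xAC00
	return ("HANGUL SYLLABLE %s%s%s"):format(
		leads[floor(syllable_index / final_count) + 1],
		vowels[floor((syllable_index % final_count) / trail_count) + 1],
		trails[syllable_index % trail_count + 1]
	)
end

print(hangul_syllable_name(0xAC00)) --> HANGUL SYLLABLE GA
print(hangul_syllable_name(0xD55C)) --> HANGUL SYLLABLE HAN
```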

is_RTL check?

About U+0634 ش ARABIC LETTER SHEEN [4]:

{{#invoke:Unicode data |is|rtl|0x0634}} → false

I expect true (is_rtl), right? -DePiep (talk) 23:00, 28 March 2022 (UTC)

Solved: enter the character itself (ش), not its hexadecimal code point:
  • {{#invoke:Unicode data |is|rtl|ش }} → true
DePiep (talk) 05:26, 1 June 2022 (UTC)

Edit request 20 November 2023

Description of suggested change: the module code says "-- No image data modules on Wikipedia yet."

We have them now. Can this be enabled? Alexis Jazz (talk or ping me) 05:37, 20 November 2023 (UTC)

Can you sandbox the code? — Martin (MSGJ · talk) 12:46, 20 November 2023 (UTC)
MSGJ, I don't speak Lua. I synced Module:Unicode data/sandbox with the current version and uncommented the block.
{{#invoke:Unicode data/sandbox|lookup|image|0xA9}} returns Unicode 0x00A9.svg (File:Unicode 0x00A9.svg), so I think this works? Alexis Jazz (talk or ping me) 21:19, 20 November 2023 (UTC)
 Done I'm not sure I agree with your importing of so many modules from other wikis, but in any event there was never any good reason to comment out that code as opposed to just letting uses of it fail. * Pppery * it has begun... 21:36, 22 November 2023 (UTC)

Edit request 20 April 2024

Description of suggested change: Creation of p.is_noncharacter() as a separate function

Diff:

function p.lookup_name(codepoint)
	-- U+FDD0-U+FDEF and all code points ending in FFFE or FFFF are Unassigned
	-- (Cn) and specifically noncharacters:
	-- https://www.unicode.org/faq/private_use.html#nonchar4
	if 0xFDD0 <= codepoint and (codepoint <= 0xFDEF or floor(codepoint % 0x10000) >= 0xFFFE) then
		return ("<noncharacter-%04X>"):format(codepoint)
	end
+
function p.
	-- U+FDD0-U+FDEF and all code points ending in FFFE or FFFF are Unassigned
	-- (Cn) and specifically noncharacters:
	-- https://www.unicode.org/faq/private_use.html#nonchar4
	0xFDD0 <= codepoint and (codepoint <= 0xFDEF or floor(codepoint % 0x10000) >= 0xFFFE) then
		return ("<noncharacter-%04X>"):format(codepoint)
	end

Eievie (talk) 20:48, 20 April 2024 (UTC)

 Done * Pppery * it has begun... 15:22, 21 April 2024 (UTC)
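
For readers following the request, here is a minimal sketch of how the noncharacter test could be split out of the name lookup. It is not the deployed code: the two functions are written as plain locals rather than members of the module table, the upper-bound check is an addition of this sketch, and the real name lookup is omitted.

```lua
-- Noncharacters are U+FDD0..U+FDEF plus the last two code points of every
-- plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
local function is_noncharacter(codepoint)
	if 0xFDD0 <= codepoint and codepoint <= 0xFDEF then
		return true
	end
	return codepoint <= 0x10FFFF and codepoint % 0x10000 >= 0xFFFE
end

-- Stand-in for the name lookup: only the noncharacter branch is shown.
local function lookup_name(codepoint)
	if is_noncharacter(codepoint) then
		return ("<noncharacter-%04X>"):format(codepoint)
	end
	return nil -- the real lookup would continue with other name sources
end

print(lookup_name(0xFDD0))  --> <noncharacter-FDD0>
print(lookup_name(0x1FFFF)) --> <noncharacter-1FFFF>
```

Keeping the predicate separate lets callers test for noncharacters without formatting a name.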

Edit request 1 January 2025

Description of suggested change:

Allow looking up the kCantonese Unihan property. As an example, {{#invoke:Unicode data/sandbox|lookup|kCantonese|20EB6}} returns "naap6".

Diff:

function p.lookup_kCantonese(codepoint)
	local data = loader[('Unihan/kCantonese/%02X'):format(floor(codepoint / 0x1000))]
	if data then
		return data[codepoint]
	end
end

Northern Moonlight 03:54, 1 January 2025 (UTC)

 Done * Pppery * it has begun... 23:05, 13 January 2025 (UTC)
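
To show how the requested function picks a data shard, here is a small self-contained sketch. The `loader` table below is a mock that stands in for the module's data loader (an assumption of this sketch), `shard_key` is a hypothetical helper, and the single entry is the example value quoted in the request.

```lua
local floor = math.floor

-- Mock loader: in the module this would load a data page on demand.
local loader = {
	["Unihan/kCantonese/20"] = { [0x20EB6] = "naap6" },
}

-- Each shard covers 0x1000 code points; the key is the code point divided
-- by 0x1000, written as at least two hexadecimal digits.
local function shard_key(codepoint)
	return ("Unihan/kCantonese/%02X"):format(floor(codepoint / 0x1000))
end

local function lookup_kCantonese(codepoint)
	local data = loader[shard_key(codepoint)]
	if data then
		return data[codepoint]
	end
end

print(shard_key(0x20EB6))         --> Unihan/kCantonese/20
print(lookup_kCantonese(0x20EB6)) --> naap6
```

A code point whose shard is absent, or which has no entry in its shard, yields nil.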

Edit request 15 June 2025

Description of suggested change: Reorder the name_hooks table so that its entries are sorted in code point order. binary_range_search assumes the entries are sorted this way; they currently are not, so the search does not work correctly. {{unichar}} is broken by this bug, as can be seen in CJK Unified Ideographs Extension I § Background. Specifically, U+2ED9D 𮶝 CJK UNIFIED IDEOGRAPH-2ED9D and U+2EDE0 𮷠 CJK UNIFIED IDEOGRAPH-2EDE0 incorrectly appear as reserved. I have made the change in the sandbox.

Diff: See comparison of sandbox with main. Warudo (talk) 12:20, 15 June 2025 (UTC)

-- For the algorithm used to generate Hangul Syllable names,
-- see "Hangul Syllable Name Generation" in section 3.12 of the
-- Unicode Specification:
-- https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
local name_hooks = {
	{ 0x00, 0x1F, "<control-%04X>" }, -- C0 control characters
	{ 0x7F, 0x9F, "<control-%04X>" }, -- DEL and C1 control characters
	{ 0x3400, 0x4DBF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension A
	{ 0x4E00, 0x9FFF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph
	{ 0xAC00, 0xD7A3, function (codepoint) -- Hangul Syllables
		local Hangul_data = loader.Hangul
		local syllable_index = codepoint - 0xAC00
		return ("HANGUL SYLLABLE %s%s%s"):format(
			Hangul_data.leads[floor(syllable_index / Hangul_data.final_count)],
			Hangul_data.vowels[floor((syllable_index % Hangul_data.final_count) / Hangul_data.trail_count)],
			Hangul_data.trails[syllable_index % Hangul_data.trail_count]
		)
	end },
	-- High Surrogates, High Private Use Surrogates, Low Surrogates
	{ 0xD800, 0xDFFF, "<surrogate-%04X>" },
	{ 0xE000, 0xF8FF, "<private-use-%04X>" }, -- Private Use
	-- CJK Compatibility Ideographs
	{ 0xF900, 0xFA6D, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
	{ 0xFA70, 0xFAD9, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
	{ 0x17000, 0x187F7, "TANGUT IDEOGRAPH-%04X" }, -- Tangut Ideograph
	{ 0x18800, 0x18AFF, function (codepoint)
		return ("TANGUT COMPONENT-%03d"):format(codepoint - 0x187FF)
	end },
	{ 0x18D00, 0x18D08, "TANGUT IDEOGRAPH-%04X" }, -- Tangut Ideograph Supplement
	{ 0x1B170, 0x1B2FB, "NUSHU CHARACTER-%04X" }, -- Nushu
	{ 0x20000, 0x2A6DF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension B
	{ 0x2A700, 0x2B739, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension C
	{ 0x2B740, 0x2B81D, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension D
	{ 0x2B820, 0x2CEA1, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension E
	{ 0x2CEB0, 0x2EBE0, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension F
	-- CJK Compatibility Ideographs Supplement (Supplementary Ideographic Plane)
	{ 0x2F800, 0x2FA1D, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
	{ 0xE0100, 0xE01EF, function (codepoint) -- Variation Selectors Supplement
		return ("VARIATION SELECTOR-%d"):format(codepoint - 0xE0100 + 17)
	end},
	{ 0x30000, 0x3134A, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension G
	{ 0x31350, 0x323AF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension H
	{ 0x2EBF0, 0x2EE5D, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension I
	{ 0xF0000, 0xFFFFD, "<private-use-%04X>" }, -- Plane 15 Private Use
	{ 0x100000, 0x10FFFD, "<private-use-%04X>" } -- Plane 16 Private Use
}
+
-- For the algorithm used to generate Hangul Syllable names,
-- see "Hangul Syllable Name Generation" in section 3.12 of the
-- Unicode Specification:
-- https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
local name_hooks = {
	{ 0x00, 0x1F, "<control-%04X>" }, -- C0 control characters
	{ 0x7F, 0x9F, "<control-%04X>" }, -- DEL and C1 control characters
	{ 0x3400, 0x4DBF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension A
	{ 0x4E00, 0x9FFF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph
	{ 0xAC00, 0xD7A3, function (codepoint) -- Hangul Syllables
		local Hangul_data = loader.Hangul
		local syllable_index = codepoint - 0xAC00
		return ("HANGUL SYLLABLE %s%s%s"):format(
			Hangul_data.leads[floor(syllable_index / Hangul_data.final_count)],
			Hangul_data.vowels[floor((syllable_index % Hangul_data.final_count) / Hangul_data.trail_count)],
			Hangul_data.trails[syllable_index % Hangul_data.trail_count]
		)
	end },
	-- High Surrogates, High Private Use Surrogates, Low Surrogates
	{ 0xD800, 0xDFFF, "<surrogate-%04X>" },
	{ 0xE000, 0xF8FF, "<private-use-%04X>" }, -- Private Use
	-- CJK Compatibility Ideographs
	{ 0xF900, 0xFA6D, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
	{ 0xFA70, 0xFAD9, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
	{ 0x17000, 0x187F7, "TANGUT IDEOGRAPH-%04X" }, -- Tangut Ideograph
	{ 0x18800, 0x18AFF, function (codepoint)
		return ("TANGUT COMPONENT-%03d"):format(codepoint - 0x187FF)
	end },
	{ 0x18D00, 0x18D08, "TANGUT IDEOGRAPH-%04X" }, -- Tangut Ideograph Supplement
	{ 0x1B170, 0x1B2FB, "NUSHU CHARACTER-%04X" }, -- Nushu
	{ 0x20000, 0x2A6DF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension B
	{ 0x2A700, 0x2B739, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension C
	{ 0x2B740, 0x2B81D, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension D
	{ 0x2B820, 0x2CEA1, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension E
	{ 0x2CEB0, 0x2EBE0, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension F
	{ 0x2EBF0, 0x2EE5D, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension I
	-- CJK Compatibility Ideographs Supplement (Supplementary Ideographic Plane)
	{ 0x2F800, 0x2FA1D, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
	{ 0x30000, 0x3134A, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension G
	{ 0x31350, 0x323AF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension H
	{ 0xE0100, 0xE01EF, function (codepoint) -- Variation Selectors Supplement
		return ("VARIATION SELECTOR-%d"):format(codepoint - 0xE0100 + 17)
	end},
	{ 0xF0000, 0xFFFFD, "<private-use-%04X>" }, -- Plane 15 Private Use
	{ 0x100000, 0x10FFFD, "<private-use-%04X>" } -- Plane 16 Private Use
}

--Warudo (talk) 13:56, 15 June 2025 (UTC)

 Done in Special:Diff/1296263621, thank you. U+2ED9D and U+2EDE0 are now shown correctly. I've also added a test at Template:Unichar/testcases#U+2ED9D – grass radical to show the effect. —andrybak (talk) 22:57, 18 June 2025 (UTC)
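
Why the ordering matters can be shown with a small standalone sketch. The `find_range` function below is a generic binary search over {low, high, label} ranges written for this illustration; it is not the module's binary_range_search, whose exact signature is not shown on this page. The ranges are taken from the table above, with shortened labels.

```lua
local floor = math.floor

-- Binary search over ranges; only correct if the ranges are in ascending order.
local function find_range(codepoint, ranges)
	local low, high = 1, #ranges
	while low <= high do
		local mid = floor((low + high) / 2)
		local range = ranges[mid]
		if codepoint < range[1] then
			high = mid - 1
		elseif codepoint > range[2] then
			low = mid + 1
		else
			return range[3]
		end
	end
	return nil -- no range matched: the code point would look unassigned
end

-- Order as in the table before the fix: Extension I sits after Extension H.
local unsorted = {
	{ 0x2CEB0, 0x2EBE0, "Extension F" },
	{ 0x2F800, 0x2FA1D, "Compatibility Supplement" },
	{ 0x30000, 0x3134A, "Extension G" },
	{ 0x31350, 0x323AF, "Extension H" },
	{ 0x2EBF0, 0x2EE5D, "Extension I" }, -- out of order
}

-- Same entries, sorted by the start of each range.
local sorted = {}
for i, range in ipairs(unsorted) do
	sorted[i] = range
end
table.sort(sorted, function (a, b) return a[1] < b[1] end)

print(find_range(0x2ED9D, unsorted)) --> nil
print(find_range(0x2ED9D, sorted))   --> Extension I
```

In the unsorted table the search narrows toward the low end of the list and never visits the Extension I entry, which matches the "reserved" symptom reported for U+2ED9D and U+2EDE0.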

Edit request 29 July 2025

Description of suggested change: Add Variation Selectors (not to be confused with Variation Selectors Supplement) to the name_hooks list. This fixes Template_talk:Unichar#c-Great_Brightstar-20250729153400-Some_character_names_are_not_found_by_the_template properly. (I've added the characters to c:Data:Unicode_data/names/00F.tab but that is a hack and should be reverted once the proper fix is done here.) I've provided the code in the Sandbox which was copied over from wikt:Module:Unicode data which handles this correctly.

Diff:

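{ 0xE000, 0xF8FF, "<private-use-%04X>" }, -- Private Use
-- CJK Compatibility Ideographs
+
{ 0xE000, 0xF8FF, "<private-use-%04X>" }, -- Private Use
{ 0xFE00, 0xFE0F, function (codepoint) -- Variation Selectors
	return ("VARIATION SELECTOR-%d"):format(codepoint - 0xFE00 + 1)
end},
-- CJK Compatibility Ideographs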

Warudo (talk) 16:17, 29 July 2025 (UTC)

 Done * Pppery * it has begun... 16:16, 2 August 2025 (UTC)
@Pppery: I'm sorry for reopening this, but I made a copy-paste error from Wiktionary, which means that the fix failed. The new lines must be added after the CJK compatibility ideographs instead of before them, so please make this change:
{ 0xE000, 0xF8FF, "<private-use-%04X>" }, -- Private Use
{ 0xFE00, 0xFE0F, function (codepoint) -- Variation Selectors
	return ("VARIATION SELECTOR-%d"):format(codepoint - 0xFE00 + 1)
end},
-- CJK Compatibility Ideographs
{ 0xF900, 0xFA6D, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
{ 0xFA70, 0xFAD9, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
+
{ 0xE000, 0xF8FF, "<private-use-%04X>" }, -- Private Use
-- CJK Compatibility Ideographs
{ 0xF900, 0xFA6D, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
{ 0xFA70, 0xFAD9, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
{ 0xFE00, 0xFE0F, function (codepoint) -- Variation Selectors
	return ("VARIATION SELECTOR-%d"):format(codepoint - 0xFE00 + 1)
end},

I missed this because the temporary fix on Commons masked the problem. I can confirm that this time I tested the change properly by reverting my fix on Commons first. Warudo (talk) 16:39, 2 August 2025 (UTC)

 Done * Pppery * it has begun... 16:52, 2 August 2025 (UTC)
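
For reference, the two variation selector ranges discussed on this page use the same name pattern with different offsets. The sketch below combines them in one hypothetical helper, `variation_selector_name`; the offsets are the ones quoted in the diffs above, while the function itself is an illustration and not part of the module.

```lua
local function variation_selector_name(codepoint)
	if 0xFE00 <= codepoint and codepoint <= 0xFE0F then
		-- Variation Selectors: VS1 through VS16
		return ("VARIATION SELECTOR-%d"):format(codepoint - 0xFE00 + 1)
	elseif 0xE0100 <= codepoint and codepoint <= 0xE01EF then
		-- Variation Selectors Supplement: VS17 through VS256
		return ("VARIATION SELECTOR-%d"):format(codepoint - 0xE0100 + 17)
	end
	return nil -- not a variation selector
end

print(variation_selector_name(0xFE0F))  --> VARIATION SELECTOR-16
print(variation_selector_name(0xE0100)) --> VARIATION SELECTOR-17
```

Because U+FE00 sorts after the CJK Compatibility Ideographs ranges (U+F900 to U+FAD9), the new entry has to follow them for a sorted-range lookup to find it.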