NLS and lowercase

Matthias Paul | 5 Sep 20:00 2002
Re: NLS and lowercase

On 2002-09-05, Steffen Kaiser wrote:

>> The "filename uppercase" table could map all accented letters to
>> their non-accented variants, and therefore prevent problems with
>> accessing filenames stored in one codepage when other codepage is
>> active (the problems 
>
> Would be my guess, too; but it was never actually used by MS DOS,
> as many other features, too.

I assume this as well.

FYI. In the DR-DOS family prior to 7.02 the COUNTRY.SYS file had
two tables for UCASE and FUCASE, but the kernel always used the
FUCASE table for both of them. This was changed with DR-DOS 7.02+
so that the kernel distinguishes between these tables. It does
not really make a difference, because the tables are identical
in most cases, anyway...

>>> Also, why do they usually not contain a "lowercase table" (6503),
>>> and why is there no "make lower case" function (similar to
>>> 6520-6522) or "filename lowercase table" at all?
>>
>> Maybe just because the "lowercase" table was introduced in a later
>> MS-DOS version than the other components of the NLS system?
>
> Rumors say that the lowercase table was introduced by some
> Russian KEYB and was never actually approved by MS (DOS).

Very interesting, can you tell us a bit more about these rumors?

When I first discovered the LCASE (3) vector in the MS-DOS 6.20+
COUNTRY.SYS file back in early 1994, I could find zero information
about it anywhere. All I know is that it only exists for Russia
(country code 7) in Cyrillic codepage 866. It can also be found
in Hangul, Japanese, and Chinese issues of MS-DOS 6.2x and is
supported in MS-DOS 7.0/7.10/8.0 (Windows 95/98/SE/ME), Windows 2000
(NT 5.0), and IBM PC DOS 7 & 2000 COUNTRY.SYS, while it is not
supported by PC DOS 6.1, OS/2 Warp 3, PTS-DOS, S/DOS, WinDOS,
etc. I have no info about PC DOS 6.3, OS/2 Warp 4, and Windows NT.

The odd thing is that this country/codepage tuple lacks the entry
for FCHAR (5), so that the total count of entries per tuple
remains the same, that is 6, except in Arabic and Hebrew
issues of MS-DOS, where it is 8 - they have additional CCTORC (20)
and ARAMODE (21) entries.

DR-OpenDOS and DR-DOS 7.02+ COUNTRY.SYS and NLSFUNC 4.00+ support
LCASE as well (for all countries), but only at the lower NLSFUNC
INT 2Fh/AX=14FEh/CL=03h level, not (yet) at DOS INT 21h/AX=6503h level.

On 2002-09-05, Arkady V.Belousov wrote:

> You mean to suggest that the FUCASE table should/may
> perform a stricter conversion, in which lowercase accented
> letters are converted into their ASCII equivalents?

Yes.

> Hm. This is dangerous. Firstly, you should then apply the same
> conversion to uppercase accented letters (otherwise strings in
> lower case and in upper case give different results).

Yes, but that's how it is...

> Secondly, this would rule out the idea of national letters
> in filenames altogether.

They never really were, officially.

> RBIL says that for OS/2 they are identical. Under MS-DOS the
> pointers to these tables differ. Looks like Microsoft leaves
> responsibility for this to the country.sys file designers?

The pointers differ, but the contents could still be the same.
I remember that I once checked this (my CHCC tool can derive
an NLS database by scanning a live system's CTY info and can
calculate CRCs over the various tables). But it was too long ago
for me to remember the exact outcome; I could look this up.
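
Such a comparison can be sketched offline in a few lines of Python.
This is only an illustration: which CRC CHCC actually uses is not stated
here, so zlib.crc32 serves as a stand-in, and the helper names and the
table contents below are made up for the example.

```python
import zlib

def table_crc(table: bytes) -> int:
    """CRC-32 of a raw NLS table (stand-in for whatever CRC CHCC uses)."""
    return zlib.crc32(table) & 0xFFFFFFFF

def tables_match_by_crc(ucase: bytes, fucase: bytes) -> bool:
    """Tables reached via different pointers may still hold equal data."""
    return len(ucase) == len(fucase) and table_crc(ucase) == table_crc(fucase)

# Hypothetical table bodies: one byte per character 80h..FFh.
ucase = bytes(range(0x80, 0x100))
fucase = bytes(range(0x80, 0x100))
print(tables_match_by_crc(ucase, fucase))
```

Equal CRCs only strongly suggest equal contents; a direct byte
comparison (ucase == fucase) is the definitive check.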

>>> Also, why do they usually not contain a "lowercase table" (6503),
>>> and why is there no "make lower case" function (similar to
>>> 6520-6522) or "filename lowercase table" at all?

INT 21h/AX=6523h was already used up at that time.

> To be sure: does anyone know of an alphabet where the inversion
> of the UCASE table is ambiguous (i.e. two or more other codes are
> translated to upper case code X) or where there is no inversion
> at all (i.e. upper case code X should be translated to lower
> case code Y, but Y should not be translated into X)?

The letter 'ß' in German is a sharp s and has no single-letter
uppercase equivalent (it is transcribed as SS in capital
letters); in Greek, the 'ß' is a small Beta, while the capital Beta
is 'B'.

I recall there was another issue with the Turkish small dotless 'i',
which, IIRC, has an uppercase equivalent of 'I'. Sorry, I would
have to look this up, I'm not sure about it right now. We had to
deal with it in FreeKEYB.
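
Whether a given uppercase table can be inverted unambiguously is easy
to check mechanically. A minimal sketch, assuming the table is available
as a plain code-to-code mapping; the function names are hypothetical and
the toy table uses a made-up code 8Dh as a stand-in for a dotless 'i':

```python
from collections import defaultdict
from typing import Dict, List

def invert_upper_table(upper_of: Dict[int, int]) -> Dict[int, List[int]]:
    """Group character codes by the uppercase code they are mapped to."""
    lowers_of = defaultdict(list)
    for code, upper in sorted(upper_of.items()):
        if code != upper:  # characters mapped to themselves are skipped
            lowers_of[upper].append(code)
    return dict(lowers_of)

def ambiguous_uppers(upper_of: Dict[int, int]) -> Dict[int, List[int]]:
    """Uppercase codes that two or more other codes are mapped to."""
    inverse = invert_upper_table(upper_of)
    return {u: lows for u, lows in inverse.items() if len(lows) > 1}

# Toy table: 'i' and the made-up code 8Dh both uppercase to 'I'.
toy = {ord("a"): ord("A"), ord("i"): ord("I"), 0x8D: ord("I")}
print(ambiguous_uppers(toy))
```

Any uppercase code reported here cannot be lowercased by simply
inverting the table. A character such as 'ß', which would map to itself
in a single-byte table, does not show up in this check at all.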

> RBIL says that "subfunction 03h apparently supports only
> codepage 866 in DOS 6.2x", but this is not true at least
> in my system. Another mistake in RBIL?

Yes and no, Ralf shortened my original note for unknown reasons,
and this changed the meaning somewhat (see below).

Since we are here already and the implementation of INT 21h/AH=65h
may not be too far away in FreeDOS, I'd like to give a few hints
so that the implementation of the function takes this into account
right from the start.

The DOS kernel's internal country data is a dynamically built list
of tagged records for the current country/codepage tuple. It is
built according to the COUNTRY= directive in CONFIG.SYS and
remains static until the next reboot unless NLSFUNC is loaded
and the country code or codepage is changed at runtime.
The exact internal mechanism by which this is organized is not known,
but it is clear that this list, and therefore the INT 21h/AX=65xxh
API, is not restricted to AL=01h..07h. If a COUNTRY.SYS file
contained entries for ID 08h, 09h, etc., they could be retrieved
through INT 21h/AX=6508h, AX=6509h. If a COUNTRY.SYS file did not
contain info for, say, ID 07h, calling the corresponding function
would result in an error. The obvious limit is ID 1Fh, since
INT 21h/AX=6520h+ functions already exist. Of course, there is
limited space in the kernel to hold that data, and this is where
the question arises what happens when a given COUNTRY.SYS file
contains more entries than usual per tuple. Also, strange things
happen when someone tries to retrieve LCASE data (see below).
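
For an implementation, the principle described above (the file decides
which IDs exist, not the API) can be modelled as a plain lookup. This is
a sketch under assumptions, not the kernel's real mechanism, which is
unknown; the names NoSuchInfo and get_info and the placeholder payloads
are hypothetical, and the model ignores the special behaviour MS-DOS
shows for LCASE.

```python
class NoSuchInfo(LookupError):
    """Model-only error: the active tuple has no record with this ID."""

def get_info(records, info_id):
    """Look up one tagged record of the active country/codepage tuple."""
    if not 0x01 <= info_id <= 0x1F:
        # 20h and up are other AX=65xxh functions; FFh is not modelled here.
        raise ValueError(f"ID {info_id:02X}h is outside 01h..1Fh")
    if info_id not in records:
        raise NoSuchInfo(f"no record with ID {info_id:02X}h")
    return records[info_id]

# Hypothetical tuple shaped like the Russia/866 entry: LCASE (3) present,
# FCHAR (5) absent, six records in total. Payloads are placeholders.
russia_866 = {0x01: b"CTYINFO", 0x02: b"UCASE", 0x03: b"LCASE",
              0x04: b"FUCASE", 0x06: b"COLLATE", 0x07: b"DBCS"}

print(get_info(russia_866, 0x03))
```

Keying on the ID found in the file, rather than on a fixed set of
subfunctions, lets additional IDs such as 14h and 15h work without any
change to the lookup code.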

--------D-2165-------------------------------
INT 21 - DOS 3.3+ - GET EXTENDED COUNTRY INFORMATION
        AH = 65h
        AL = info ID
             01h "CTYINFO" get general internationalization info (see also AX=6500h)
             02h "UCASE" get pointer to uppercase table
             03h "LCASE" (MS-DOS 6.20+ COUNTRY.SYS) get pointer to lowercase table
             04h "FUCASE" get pointer to filename uppercase table
             05h "FCHAR" get pointer to filename terminator table
             06h "COLLATE" get pointer to collating sequence table
             07h "DBCS" (DOS 4.0+) get pointer to Double-Byte Character Set table
             14h "CCTORC" (Arabic/Hebrew MS-DOS 5.0) get pointer to CCTORC table
             15h "ARAMODE" (Arabic/Hebrew MS-DOS 5.0) get pointer to ARAMODE table
             FFh (DOS 6.0 undocumented) return *all* info to user, not only
                 specific type of info (this, however, only works for the
                 currently loaded info in DOS, not for other countries)
        BX = code page (FFFFh=global code page) (see #01757)
             (undocumented) A codepage value of 0 will call down to NLSFUNC
             to retrieve the data from the *first* entry matching the given
             DX country in the COUNTRY.SYS file, which contains the primary
             (default) codepage for that country.
             A way to retrieve the primary codepage for a country is to call
             this function with AL=01h and BX=0. Then the returned CTYINFO
             info structure has the primary codepage for this country filled
             in. This works for both, any DR DOS 3.??+ and MS-DOS.
        DX = country ID (FFFFh=current country,
                        (undocumented) 0=system code page)
        ES:DI -> country information buffer (see #01750)
        CX = size of buffer (>= 5)
Return: CF set on error
           AX = error code (see #01680 at AH=59h/BX=0000h)
        CF clear if successful
           CX = size of country information returned
           ES:DI -> country information (see #01750)
Notes: AL=05h appears to return same info for all countries and codepages;
         it has been documented for DOS 5+, but was undocumented in
         earlier versions
       NLSFUNC must be installed to get info for countries other than
         the default
       the ES:DI buffer can be uninitialized on call, that is, it is not
         necessary to let the DWORD pointer in the structure point to some
         buffer in the calling application. The kernel will fill out this
         buffer to let the DWORD pointer address some internal buffer in
         the kernel.
         While MS-DOS returns different buffer location for different
         sub-functions, DR DOS does not (still not with DR-DOS 7.03).
         Instead, it will pass along the pointer to a single buffer in
         NLSFUNC, no matter of the function. Hence there are several
         caveats when using this info:
         - The caller must immediately copy the buffer's contents into
           a private buffer, preferably from inside a mutex, so that
           no other caller could trash the buffer contents in the meantime.
         - As DR DOS 6.0+ NLSFUNC can reside in the HMA, the buffer pointed
           to can be as well. If the segment value of the DWORD pointer in
           the structure holds FFFEh (for HMA), the caller should also
           take A20 into account.
         - The DR-DOS implementation results in a *much* smaller NLSFUNC and
           API footprint (DR-DOS: ca. 900 bytes (or since NLSFUNC 4.01+ ca.
            1.1 KB with dual DR-DOS and DOS file scanners, XAPI, and
            Arabic/Hebrew support included), MS-DOS: ca. 7 KB), at the
            cost of lower speed when accessing foreign country data
           (the data for the current country is buffered in the BDOS, though).
           On very slow machines, with very large COUNTRY.SYS file(s),
           requests may take several seconds to succeed. In normal operation,
           there is no noticeable difference, except when running NLS scanners.
       subfunctions 02h and 04h are identical under OS/2
       subfunction 03h apparently supports only codepage 866 in DOS 6.2x,
         the MS-DOS 6.20 - 8.0 COUNTRY.SYS file supports LCASE *only* for
         Russia (country code 7) in codepage 866, and for this combination
         there is no FCHAR (5) info available in the file. However, even with
         country 7 and codepage 866 active, it appears not to be possible to
         request the LCASE info via INT 21h/AX=6503h/BX=866/DX=7. This will
         still return an error, while INT 21h/AX=6505h will still return FCHAR
         info for this BX/DX combination. To actually retrieve the LCASE info,
         one can use BX=0 (primary codepage) instead. This will cause some
         screen flickering (apparently due to temporary codepage switching),
         and will correctly return the LCASE info via INT 21h/AX=6503h, and no
         FCHAR info via INT 21h/AX=6505h.
       LCASE info for *all* countries and codepages was added with
         DR-OpenDOS 7.02+ (1997-12), but it is not available via INT 21h/
         AX=6503h yet (still not with DR-DOS 7.05). This will probably be added
         at a later stage. (At the moment use INT 2Fh/AX=14FEh/CL=03h instead.)
       due to the lack of LCASE info in COUNTRY.SYS, PC DOS 7 and 2000
         do not support LCASE.
       Arabic and Hebrew COUNTRY.SYS issues of MS-DOS 5.0 contain CCTORC
         and ARAMODE info for many countries, while they still contain all
         the other info types (except for LCASE). CCTORC and ARAMODE are
         available via INT 21h/AX=6514h and INT 21h/AX=6515h. They also work
         with standard Western issues of the DOS kernel and NLSFUNC, when
         using Arabic/Hebrew COUNTRY.SYS files. In contrast to the LCASE
         special case the info can be requested via BX=codepage or BX=0.
       DR-DOS NLSFUNC 4.01+ has Arabic/Hebrew support added via INT 2Fh/
         AX=14FEh/CX=0014h/CX=0114h/CL=15h, but this is not yet available
         through INT 21h/AX=65xxh. It will probably be added at a later stage.
BUG:   a country code of DX=0 will cause Novell DOS 7 - Caldera OpenDOS 7.02
         BETA 1 to crash. This was fixed for OpenDOS 7.02 BETA 2 and later.
SeeAlso: AH=38h,AH=70h"MS-DOS 7",INT 2F/AX=1401h,INT 2F/AX=1402h
SeeAlso: INT 2F/AX=14FEh

Format of country information:
Offset  Size    Description     (Table 01750)
 00h    BYTE    info ID
---if info ID = 01h---
 01h    WORD    size of following info in bytes
 03h    WORD    country ID (see #01400 at AH=38h)
 05h    WORD    code page (see #01757)
 07h 34 BYTEs   country-dependent info (see #01399 at AH=38h)
---if info ID = 02h---
 01h    DWORD   pointer to uppercase table (see #01751)
---if info ID = 03h---
 01h    DWORD   pointer to lowercase table (see #01752)
---if info ID = 04h---
 01h    DWORD   pointer to filename uppercase table (see #01753)
---if info ID = 05h---
 01h    DWORD   pointer to filename character table (see #01754)
---if info ID = 06h---
 01h    DWORD   pointer to collating table (see #01755)
---if info ID = 07h (DOS 4.0+)---
 01h    DWORD   pointer to DBCS lead byte table (see #01756)
---if info ID = 14h (DOS 5.0???+)---
 01h    DWORD   pointer to CCTORC table
---if info ID = 15h (DOS 5.0???+)---
 01h    DWORD   pointer to ARAMODE table
SeeAlso: #01775
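
The layout above can be decoded offline, for example when inspecting a
captured buffer. A minimal Python sketch, assuming that the size WORD
counts everything after itself and that the DWORD is an ordinary x86 far
pointer (offset WORD first, then segment WORD); the function name and
dictionary keys are made up for illustration:

```python
import struct

def parse_country_info(buf: bytes) -> dict:
    """Decode one INT 21h/AH=65h result buffer (layout per Table 01750)."""
    info_id = buf[0]
    if info_id == 0x01:
        # WORD size, WORD country ID, WORD code page, then country info.
        size, country, codepage = struct.unpack_from("<HHH", buf, 1)
        return {"id": info_id, "size": size, "country": country,
                "codepage": codepage, "info": buf[7:3 + size]}
    # All other IDs: a DWORD far pointer, offset WORD then segment WORD.
    offset, segment = struct.unpack_from("<HH", buf, 1)
    linear = (segment << 4) + offset
    return {"id": info_id, "segment": segment, "offset": offset,
            "linear": linear, "above_1mb": linear >= 0x100000}

# Hypothetical UCASE result pointing into the HMA (segment FFFEh).
sample = bytes([0x02]) + struct.pack("<HH", 0x1234, 0xFFFE)
print(parse_country_info(sample))
```

The above_1mb flag marks pointers whose linear address lies at or
beyond 1 MB, which is the case where the caller has to take A20 into
account before reading the table.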

[...]

Format of Arabic/Hebrew mode table "ARAMODE":

Offset  Size    Description
 00h    WORD    table size (0008h)
 02h    WORD    fontpage
                - Hebrew: hardware fontpage 100
                - Arabic: font/codepages 161, 163, or 165
                  (Probably also 162,164,??? For more fontpages see INT 2Fh/AH=AD
                NB. It is not verified, that the BYTE at offset 03h is actually
                    the high BYTE of the fontpage WORD. However, it has always
                    been zero in the files I examined so far, and codepages usually
                    use a WORD rather than a BYTE.
 04h    BYTE    FFh                  \    Guesswork:
 05h    BYTE    08h                  /    To me, it seems as if these
 06h    BYTE    - Hebrew: FCh        \    entries contain pairs of
                - Arabic: FCh, FDh    \   BYTEs. The first byte could be
 07h    BYTE    - Hebrew: F0h         /   a special char, the second byte
                - Arabic: 60h        /    an attribute byte for this char.
 08h    BYTE    FFh                  \    Beware, this is guesswork only!
 09h    BYTE    00h                  /    Info wanted!
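
For experimenting with this table, the layout can be decoded as follows.
This is only a sketch that follows the guesswork above: treating the
bytes after the fontpage WORD as (char, attribute) pairs is an
assumption, and the function name is hypothetical. The sample bytes are
the Hebrew values listed in the table.

```python
import struct

def parse_aramode(table: bytes) -> dict:
    """Split an ARAMODE table into size, fontpage and guessed byte pairs."""
    size, fontpage = struct.unpack_from("<HH", table, 0)
    body = table[4:2 + size]  # the size WORD counts the bytes following it
    pairs = [(body[i], body[i + 1]) for i in range(0, len(body) - 1, 2)]
    return {"size": size, "fontpage": fontpage, "pairs": pairs}

# Hebrew values from the table above: fontpage 100, then three byte pairs.
hebrew = struct.pack("<HH", 0x0008, 100) + bytes([0xFF, 0x08, 0xFC, 0xF0,
                                                  0xFF, 0x00])
print(parse_aramode(hebrew))
```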

Format of Arabic/Hebrew table "CCTORC":

Offset  Size    Description
 00h    WORD    table size (0200h)
 02h 512 BYTEs  Apparently containing 256 WORDs of special char info.
                Guesswork: Since Arabic/Hebrew codepages are no DBCS
                codepages, this table could be used to define character
                shapes together with their attributes.
                Maybe: "*C*haracter *c*ode *to* *r*aw *c*ode"???
                To me it appears as if this area would consist of 256
                records of 2 BYTEs each. The first BYTE indicating
                the character shape or something similar, the second
                BYTE (which so far has always been 00, 01, 02 only)
                could be an attribute BYTE for the char, defining
                character specific special behaviour, e.g. 00=normal,
                01=alphas, 02=digits???. Since I do not speak Hebrew
                or Arabic languages, I cannot further comment on their
                character representation system. Info wanted.
---
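
To test the "256 records of 2 BYTEs" reading of CCTORC against real
files, a small decoder with an attribute histogram may help. Again a
sketch under that assumption only; the function names are hypothetical
and the table below is synthetic, not taken from any COUNTRY.SYS.

```python
import struct
from collections import Counter

def parse_cctorc(table: bytes):
    """Split a CCTORC table into (first byte, second byte) records."""
    (size,) = struct.unpack_from("<H", table, 0)
    body = table[2:2 + size]
    if len(body) != size or size % 2:
        raise ValueError("truncated or odd-sized CCTORC table")
    return [(body[i], body[i + 1]) for i in range(0, size, 2)]

def attribute_histogram(records) -> Counter:
    """Count how often each value occurs in the second byte."""
    return Counter(attr for _, attr in records)

# Synthetic table: attribute 02h for '0'..'9', 00h for everything else.
synthetic = struct.pack("<H", 0x0200) + b"".join(
    bytes([c, 2 if 0x30 <= c <= 0x39 else 0]) for c in range(256))
print(attribute_histogram(parse_cctorc(synthetic)))
```

If the second byte of real tables only ever takes a few small values,
that would support the idea that it is an attribute rather than the high
byte of a 16-bit code.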

Maybe someone from the Middle East can shed some light on ARAMODE,
CCTORC?

Hope it helps,

 Matthias

-- 
<mailto:Matthias.Paul <at> post.rwth-aachen.de>; <mailto:mpaul <at> drdos.org>
http://www.uni-bonn.de/~uzs180/mpdokeng.html; http://mpaul.drdos.org

"Programs are poems for computers."
