5 Sep 20:00 2002
Re: NLS and lowercase
Matthias Paul <Matthias.Paul <at> post.rwth-aachen.de>
2002-09-05 18:00:37 GMT
On 2002-09-05, Steffen Kaiser wrote:

>> The "filename uppercase" table could map all accented letters to
>> their non-accented variants, and therefore prevent problems with
>> accessing filenames stored in one codepage when other codepage is
>> active (the problems
>
> Would be my guess, too; but it was never actually used by MS DOS,
> as many other features, too.

I assume this as well.

FYI: In the DR-DOS family prior to 7.02, the COUNTRY.SYS file had two
tables for UCASE and FUCASE, but the kernel always used the FUCASE
table for both of them. This was changed with DR-DOS 7.02+, so that
the kernel distinguishes between these tables. It does not really make
a difference, because the tables are identical in most cases anyway...

>>> Also, why they usually not contains "lowercase table" (6503)
>>> and not provide "make lower case" (similar to 6520-6522) nor
>>> "filename lowercase table" at all?
>>
>> Maybe just because "lowercase" table was introduced in later MS-DOS
>> version than the other components of NLS system?
>
> Rumors say that the lowercase table has been introduced by some
> Russian KEYB and was never actually approved by MS (DOS).

Very interesting, can you tell us a bit more about these rumors?

When I first discovered the LCASE (3) vector in the MS-DOS 6.20+
COUNTRY.SYS file back in early 1994, I could find no information
about it anywhere. All I know is that it only exists for Russia
(country code 7) in Cyrillic codepage 866. It can also be found in
Hangul, Japanese, and Chinese issues of MS-DOS 6.2x and is supported
in MS-DOS 7.0/7.10/8.0 (Windows 95/98/SE/ME), Windows 2000 (NT 5.0),
and IBM PC DOS 7 & 2000 COUNTRY.SYS, while it is not supported by
PC DOS 6.1, OS/2 Warp 3, PTS-DOS, S/DOS, WinDOS, etc. I have no info
about PC DOS 6.3, OS/2 Warp 4, and Windows NT.
The odd thing is that this country/codepage tuple lacks the entry for
FCHAR (5), so that the total count of entries per tuple remains the
same, that is 6, except in Arabic and Hebrew issues of MS-DOS, where
it is 8: they have additional CCTORC (20) and ARAMODE (21) entries.

DR-OpenDOS and DR-DOS 7.02+ COUNTRY.SYS and NLSFUNC 4.00+ support
LCASE as well (for all countries), but only at the lower NLSFUNC
INT 2Fh/AX=14FEh/CL=03h level, not (yet) at the DOS INT 21h/AX=6503h
level.

On 2002-09-05, Arkady V.Belousov wrote:

> You mean, that you suggest, that FUCASE table should/may
> present more strict conversion, when lower accented letters
> converted into ASCII equivalence?

Yes.

> Hm. This is dangerous. Firstly, you then should apply same
> conversion for upper accented letters (or strings in lower
> case and in upper case give different results).

Yes, but that's how it is...

> Secondly, this will prohibit at all idea of national letters
> in filenames.

They never really were, officially.

> RBIL says that for OS/2 they are identical. Under MS-DOS
> pointers to these tables are differs. Look like, Microsoft
> delivers responsibility for this to country.sys file designers?

The pointers differ, but it could still be that the contents are the
same. I remember that I once checked this (my CHCC tool can derive an
NLS database by scanning a live system's CTY info and can calculate
CRCs over the various tables). But it was too long ago to remember the
exact outcome; I could look this up.

>>> Also, why they usually not contains "lowercase table" (6503)
>>> and not provide "make lower case" (similar to 6520-6522) nor
>>> "filename lowercase table" at all?

INT 21h/AX=6523h was already used up at that time.

> To be sure: are someone know some alphabet, where inversion of
> UCASE table is ambiguite (i.e. to upper case code X translated
> two or more other codes) or no inversion at all (i.e. upper
> case code X should be translated to lower case code Y, but Y
> shouldn't translated into X)?

The letter 'ß' in German is a sharp s and has no single-letter
uppercase equivalent (it would be transcribed to SS in capital
letters); in Greek the 'ß' is a small Beta, while the capital Beta is
'B'. I recall there was another issue with the Turkish small dotless
'i', which, IIRC, has an uppercase equivalent of 'I'. Sorry, I would
have to look this up, I'm not sure about it right now. We had to deal
with it in FreeKEYB.

> RBIL says that "subfunction 03h apparently supports only
> codepage 866 in DOS 6.2x", but this is not true at least
> in my system. Another mistake in RBIL?

Yes and no, Ralf shortened my original note for unknown reasons, and
this changed the meaning somewhat (see below).

Since we are here already and the implementation of INT 21h/AH=65h may
not be too far away in FreeDOS, I'd like to give a few hints so that
the implementation of the function takes this into account right from
the start.

The DOS kernel's internal country data is a dynamically built list of
tagged records for the current country/codepage tuple. It is built
according to the COUNTRY= directive in CONFIG.SYS and remains static
until the next reboot unless NLSFUNC is loaded and the country code or
codepage is changed at runtime. The exact internal mechanism by which
this is organized is not known, but it is clear that this list, and
therefore the INT 21h/AX=65xxh API, is not restricted to AL=01h..07h
only. If a COUNTRY.SYS file contained entries for ID 08h, 09h, etc.,
they could be retrieved through INT 21h/AX=6508h, AX=6509h. If a
COUNTRY.SYS file did not contain info for, say, ID 07h, calling the
corresponding function would result in an error. The obvious limit is
ID 1Fh, since INT 21h/AX=6520h+ functions already exist.
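
To make the lookup rule just described a bit more concrete, here is a
minimal sketch in Python (for brevity only, this is not kernel code).
It assumes the tagged records can be modelled as a dictionary keyed by
info ID. The names lookup_info, MAX_INFO_ID and ERR_NOT_AVAILABLE, the
placeholder error value, and the made-up ID 08h entry are all
hypothetical, since the real internal mechanism is not known. AL=FFh is
deliberately not modelled.

```python
# Hypothetical model of the kernel's tagged-record list for the current
# country/codepage tuple: info ID -> raw record payload.
MAX_INFO_ID = 0x1F          # AL=20h and above are other AH=65h functions
ERR_NOT_AVAILABLE = 0x02    # placeholder error code (an assumption)


def lookup_info(records, info_id):
    """Mimic INT 21h/AH=65h with AL=info_id; return (carry, value).

    carry is True on error (value is then an error code) and False on
    success (value is then the record payload).
    """
    if not 0x01 <= info_id <= MAX_INFO_ID:
        return True, ERR_NOT_AVAILABLE
    if info_id not in records:
        # The COUNTRY.SYS entry simply lacked this info type.
        return True, ERR_NOT_AVAILABLE
    return False, records[info_id]


# A tuple with LCASE (03h) but without FCHAR (05h), plus an invented
# ID 08h entry to show that IDs above 07h need no special casing.
records = {0x01: b"ctyinfo", 0x02: b"ucase", 0x03: b"lcase",
           0x04: b"fucase", 0x06: b"collate", 0x07: b"dbcs",
           0x08: b"extra"}
```

The point of the dictionary is that the API does not hard-code the set
of valid IDs: whatever the COUNTRY.SYS entry supplied can be retrieved,
and whatever it did not supply yields an error.
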
Of course, there is limited space in the kernel to hold that data,
and this raises the question of what happens when a given COUNTRY.SYS
file contains more entries than usual per tuple. Also, strange things
happen when someone tries to retrieve LCASE data (see below).

--------D-2165-------------------------------
INT 21 - DOS 3.3+ - GET EXTENDED COUNTRY INFORMATION
        AH = 65h
        AL = info ID
            01h "CTYINFO" get general internationalization info
                (see also AX=6500h)
            02h "UCASE" get pointer to uppercase table
            03h "LCASE" (MS-DOS 6.20+ COUNTRY.SYS) get pointer to
                lowercase table
            04h "FUCASE" get pointer to filename uppercase table
            05h "FCHAR" get pointer to filename terminator table
            06h "COLLATE" get pointer to collating sequence table
            07h "DBCS" (DOS 4.0+) get pointer to Double-Byte Character
                Set table
            14h "CCTORC" (Arabic/Hebrew MS-DOS 5.0) get pointer to
                CCTORC table
            15h "ARAMODE" (Arabic/Hebrew MS-DOS 5.0) get pointer to
                ARAMODE table
            FFh (DOS 6.0 undocumented) return *all* info to user, not
                only specific type of info (this, however, only works
                for the currently loaded info in DOS, not for other
                countries)
        BX = code page (FFFFh=global code page) (see #01757)
            (undocumented) A codepage value of 0 will call down to
            NLSFUNC to retrieve the data from the *first* entry
            matching the given DX country in the COUNTRY.SYS file,
            which contains the primary (default) codepage for that
            country. A way to retrieve the primary codepage for a
            country is to call this function with AL=01h and BX=0.
            The returned CTYINFO info structure then has the primary
            codepage for this country filled in. This works for both
            DR DOS (any 3.??+) and MS-DOS.
        DX = country ID (FFFFh=current country,
            (undocumented) 0=system code page)
        ES:DI -> country information buffer (see #01750)
        CX = size of buffer (>= 5)
Return: CF set on error
            AX = error code (see #01680 at AH=59h/BX=0000h)
        CF clear if successful
            CX = size of country information returned
            ES:DI -> country information (see #01750)
Notes:  AL=05h appears to return same info for all countries and
          codepages; it has been documented for DOS 5+, but was
          undocumented in earlier versions
        NLSFUNC must be installed to get info for countries other than
          the default
        the ES:DI buffer can be uninitialized on call, that is, it is
          not necessary to let the DWORD pointer in the structure point
          to some buffer in the calling application. The kernel will
          fill out this buffer to let the DWORD pointer address some
          internal buffer in the kernel. While MS-DOS returns a
          different buffer location for different sub-functions, DR DOS
          does not (still not with DR-DOS 7.03). Instead, it will pass
          along the pointer to a single buffer in NLSFUNC, regardless
          of the function. Hence there are several caveats when using
          this info:
          - The caller must immediately copy the buffer's contents into
            a private buffer, preferably from inside of a mutex, so
            that no other caller could trash the buffer contents in the
            meantime.
          - As DR DOS 6.0+ NLSFUNC can reside in the HMA, the buffer
            pointed to can be as well. If the segment value of the
            DWORD pointer in the structure holds FFFEh (for HMA), the
            caller should also take A20 into account.
          - The DR-DOS implementation results in a *much* smaller
            NLSFUNC and API footprint (DR-DOS: ca. 900 bytes (or since
            NLSFUNC 4.01+ ca. 1.1 Kb with dual DR-DOS and DOS file
            scanners, XAPI, and Arabic/Hebrew support included),
            MS-DOS: ca. 7 Kb), at the drawback of lower speed when
            accessing foreign country data (the data for the current
            country is buffered in the BDOS, though). On very slow
            machines, with very large COUNTRY.SYS file(s), requests may
            take several seconds to succeed.
            In normal operation, there is no noticeable difference,
            except when running NLS scanners.
        subfunctions 02h and 04h are identical under OS/2
        subfunction 03h apparently supports only codepage 866 in DOS
          6.2x; the MS-DOS 6.20 - 8.0 COUNTRY.SYS file supports LCASE
          *only* for Russia (country code 7) in codepage 866, and for
          this combination there is no FCHAR (5) info available in the
          file. However, even with country 7 and codepage 866 active,
          it appears not to be possible to request the LCASE info via
          INT 21h/AX=6503h/BX=866/DX=7. This will still return an
          error, while INT 21h/AX=6505h will still return FCHAR info
          for this BX/DX combination. To actually retrieve the LCASE
          info, one can use BX=0 (primary codepage) instead. This will
          cause some screen flickering (apparently due to temporary
          codepage switching), and will correctly return the LCASE info
          via INT 21h/AX=6503h, and no FCHAR info via INT 21h/AX=6505h.
          LCASE info for *all* countries and codepages was added with
          DR-OpenDOS 7.02+ (1997-12), but it is not available via
          INT 21h/AX=6503h yet (still not with DR-DOS 7.05). This will
          probably be added at a later stage. (At the moment use
          INT 2Fh/AX=14FEh/CL=03h instead.)
        due to the lack of LCASE info in COUNTRY.SYS, PC DOS 7 and 2000
          do not support LCASE.
        Arabic and Hebrew COUNTRY.SYS issues of MS-DOS 5.0 contain
          CCTORC and ARAMODE info for many countries, while they still
          contain all the other info types (except for LCASE). CCTORC
          and ARAMODE are available via INT 21h/AX=6514h and
          INT 21h/AX=6515h. They also work with standard Western issues
          of the DOS kernel and NLSFUNC, when using Arabic/Hebrew
          COUNTRY.SYS files. In contrast to the LCASE special case, the
          info can be requested via BX=codepage or BX=0.
          DR-DOS NLSFUNC 4.01+ has Arabic/Hebrew support added via
          INT 2Fh/AX=14FEh/CX=0014h/CX=0114h/CL=15h, but this is not yet
          available through INT 21h/AX=65xxh. It will probably be added
          at a later stage.
BUG:    a country code of DX=0 will cause Novell DOS 7 - Caldera
          OpenDOS 7.02 BETA 1 to crash. This was fixed for OpenDOS 7.02
          BETA 2 and later.
SeeAlso: AH=38h,AH=70h"MS-DOS 7",INT 2F/AX=1401h,INT 2F/AX=1402h
SeeAlso: INT 2F/AX=14FEh

Format of country information:
Offset  Size    Description     (Table 01750)
 00h    BYTE    info ID
---if info ID = 01h---
 01h    WORD    size of following info in bytes
 03h    WORD    country ID (see #01400 at AH=38h)
 05h    WORD    code page (see #01757)
 07h 34 BYTEs   country-dependent info (see #01399 at AH=38h)
---if info ID = 02h---
 01h    DWORD   pointer to uppercase table (see #01751)
---if info ID = 03h---
 01h    DWORD   pointer to lowercase table (see #01752)
---if info ID = 04h---
 01h    DWORD   pointer to filename uppercase table (see #01753)
---if info ID = 05h---
 01h    DWORD   pointer to filename character table (see #01754)
---if info ID = 06h---
 01h    DWORD   pointer to collating table (see #01755)
---if info ID = 07h (DOS 4.0+)---
 01h    DWORD   pointer to DBCS lead byte table (see #01756)
---if info ID = 14h (DOS 5.0???+)---
 01h    DWORD   pointer to CCTORC table
---if info ID = 15h (DOS 5.0???+)---
 01h    DWORD   pointer to ARAMODE table
SeeAlso: #01775

[...]

Format of Arabic/Hebrew mode table "ARAMODE":
Offset  Size    Description
 00h    WORD    table size (0008h)
 02h    WORD    fontpage
                - Hebrew: hardware fontpage 100
                - Arabic: font/codepages 161, 163, or 165
                  (Probably also 162,164,??? For more fontpages see
                  INT 2Fh/AH=AD
                NB. It is not verified that the BYTE at offset 03h is
                actually the high BYTE of the fontpage WORD. However,
                it has always been zero in the files I examined so far,
                and codepages usually use a WORD rather than a BYTE.
 04h    BYTE    FFh                 \ Guesswork:
 05h    BYTE    08h                 / To me, it seems as if these
 06h    BYTE    - Hebrew: FCh       \ entries contain pairs of
                - Arabic: FCh, FDh  \ BYTEs. The first byte could be
 07h    BYTE    - Hebrew: F0h       / a special char, the second byte
                - Arabic: 60h       / an attribute byte for this char.
 08h    BYTE    FFh                 \ Beware, this is guesswork only!
 09h    BYTE    00h                 / Info wanted!
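
The "Format of country information" record (Table 01750) above is
simple enough to decode mechanically. The following is a minimal sketch
in Python, assuming the caller already has the bytes that were returned
at ES:DI. The function name parse_country_info and the dictionary layout
are hypothetical. That the size WORD counts from offset 03h onwards is
my reading of "size of following info", not something the table states
explicitly.

```python
import struct

# Info IDs whose record is a single DWORD far pointer (Table 01750).
POINTER_IDS = {0x02: "UCASE", 0x03: "LCASE", 0x04: "FUCASE",
               0x05: "FCHAR", 0x06: "COLLATE", 0x07: "DBCS",
               0x14: "CCTORC", 0x15: "ARAMODE"}


def parse_country_info(buf):
    """Decode one country information record from raw bytes."""
    info_id = buf[0]
    if info_id == 0x01:
        size, country, codepage = struct.unpack_from("<HHH", buf, 1)
        # Assumption: 'size' counts the bytes from offset 03h onwards,
        # so the country-dependent info ends at offset 03h + size.
        return {"name": "CTYINFO", "country": country,
                "codepage": codepage, "data": bytes(buf[7:3 + size])}
    if info_id in POINTER_IDS:
        # x86 far pointer: offset WORD first, then segment WORD.
        offset, segment = struct.unpack_from("<HH", buf, 1)
        return {"name": POINTER_IDS[info_id],
                "segment": segment, "offset": offset}
    raise ValueError("unknown info ID %02Xh" % info_id)
```

A caller following the caveats in the notes would check the returned
segment (e.g. FFFEh for a buffer in the HMA) and copy the table it
points to right away, before issuing the next request.
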

Format of Arabic/Hebrew table "CCTORC":
Offset  Size    Description
 00h    WORD    table size (0200h)
 02h 512 BYTEs  Apparently containing 256 WORDs of special char info.
                Guesswork: Since Arabic/Hebrew codepages are not DBCS
                codepages, this table could be used to define character
                shapes together with their attributes.
                Maybe: "*C*haracter *c*ode *to* *r*aw *c*ode"???
                To me it appears as if this area would consist of 256
                records of 2 BYTEs each. The first BYTE indicates the
                character shape or something similar; the second BYTE
                (which so far has always been 00, 01, 02 only) could be
                an attribute BYTE for the char, defining character
                specific special behaviour, e.g. 00=normal, 01=alphas,
                02=digits???. Since I do not speak Hebrew or Arabic, I
                cannot further comment on their character
                representation system. Info wanted.
---

Maybe someone from the Middle East can shed some light on ARAMODE and
CCTORC?

Hope it helps,

 Matthias

-- 
<mailto:Matthias.Paul <at> post.rwth-aachen.de>; <mailto:mpaul <at> drdos.org>
http://www.uni-bonn.de/~uzs180/mpdokeng.html; http://mpaul.drdos.org
"Programs are poems for computers."

