From:  Greg Noel
Date:  Sat Jul 3, 1999  8:16 pm
Subject:  Re: Responses to a variety of issues

I wrote:
>.... I recall reading that Chinese has 300,000 words; it's likely
>infeasible to include all of them as individual Unicode values. ...

Wayne H. Smith writes:
> The correct number of individual characters (not compounds of two or
>more characters) would certainly not be 300,000. The largest Chinese
>dictionaries have a maximum of about 40,000 individual characters. The
>average college graduate knows (recognizes) about 6,000 different characters,
>and a knowledge of about 2,000-3,000 is sufficient to read a newspaper (albeit
>there may be a few characters on each page that exceed that figure). Each
>Chinese character is coded with 2 (not just 1) Unicode numbers [numbers that
>were already assigned to different symbols], and special programs are out
>there that reassign a particular set of dots and squiggles to each
>**pairing** of Unicode numbers so as to produce Chinese characters on the
>screen or on the page.

I meant "words" in the same sense that English has between 500,000 and
2,000,000 words (depending upon your authority); that is, it would include
the compound characters to which you allude. Supposedly, a vocabulary of
850 English words is sufficient for communication (the so-called Basic
English, also used in Pidgin English), and a typical vocabulary to read a
newspaper is probably fewer than 5,000 words (with the same caveat you have:
a few words on the page won't be in the minimal vocabulary).

But the number of words in a language is much larger than the number of
words recognized by an individual, or even by a population at large. How
many people on this list knew that the word "clientship" existed, even
though the reverse relationship ("patronage") is in pretty common usage?
Or when was the last time you saw "destrier" used for a knight's horse?

The solution for Chinese was to encode the individual characters and let
programs put them together in sequence to produce the compound words. That
way, far fewer than 300,000 Unicode slots are needed.
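
As a rough illustration of why the slot count stays small, here is a minimal
Python sketch. The example words are my own picks for illustration, not taken
from this thread, and the snippet assumes a Python 3 interpreter. It only shows
that a compound word is stored as a sequence of per-character code points, so
adding words to a vocabulary does not require adding code points.

```python
# Illustrative only: these example words are assumptions, not from the thread.
word = "中文"  # a two-character compound word

# Each character has its own code point; the word is just the sequence of them.
code_points = [f"U+{ord(ch):04X}" for ch in word]
print(word, "->", code_points)

# A small vocabulary of compound words built from a shared pool of characters.
vocabulary = ["中文", "中国", "文字", "国文", "国字"]

# Collect the distinct characters the whole vocabulary is built from.
distinct_characters = sorted(set("".join(vocabulary)))
print(len(vocabulary), "words use", len(distinct_characters), "distinct characters")
```

The count of distinct characters grows much more slowly than the count of
words, which is the economy being described above.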

SignWriting won't have as many symbols in its "alphabet" of signs, but will
have to do something similar to combine the pieces into a composite symbol.

Hope this helps,
-- Greg Noel, retired UNIX guru

