I'm reposting this to wikitech-l so that discussion doesn't get lost.
On Wed, 2002-02-20 at 15:20, lcrocker@nupedia.com wrote:
> You wrote:
> > Please take a look at the non-English non-ISO-8859-1 wikipedias
> > sometime. Hundreds of pages, with correct charset headers:
> >
> > ISO-8859-2: http://pl.wikipedia.com/
> >
> > UTF-8 with a custom conversion function for certain character
> > sequences: http://eo.wikipedia.com/
> You're right. Last time I looked at these, the test pages I retrieved
> gave 404s, and the 404 page is still served as ISO-8859-1, but the
> headers of contentful pages are indeed as you say: 8859-2 for "pl"
> and UTF-8 for "eo", etc.
>
> OK, then, I guess we do have to wade into the morass of national
> character sets.
Unless you want to switch to UTF-8, that is a given.
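(For the record, the Esperanto wiki's "custom conversion" is presumably the x-convention, where ĉ, ĝ, ĥ, ĵ, ŝ, ŭ are typed as cx, gx, etc. A rough sketch of that kind of conversion, in Python purely for illustration -- this is not the actual eo.wikipedia.com code:)

```python
# Sketch of an x-convention -> UTF-8 conversion, as the Esperanto wiki
# presumably does it. Illustrative only, not the real eo.wikipedia.com code.
X_SYSTEM = {
    'cx': '\u0109', 'gx': '\u011d', 'hx': '\u0125',   # cx -> c-circumflex, etc.
    'jx': '\u0135', 'sx': '\u015d', 'ux': '\u016d',
    'Cx': '\u0108', 'Gx': '\u011c', 'Hx': '\u0124',
    'Jx': '\u0134', 'Sx': '\u015c', 'Ux': '\u016c',
}

def x_to_utf8(text):
    """Replace each x-convention digraph with the real Unicode letter."""
    for digraph, char in X_SYSTEM.items():
        text = text.replace(digraph, char)
    return text

print(x_to_utf8('sxangxo'))   # prints the properly accented Esperanto word
```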
> I have little or no experience using actual foreign-made computers;
> but I /do/ have extensive knowledge about character sets and
> communication protocols, so I'm just trying to make sure we don't
> make the same mistakes hundreds of others have made in the past by
> not getting this stuff right up front, but just diving headlong into
> coding without stepping back a moment to design something that will
> be usable and maintainable in the future.
>
> The way it is now, for example, we won't be able to cut-and-paste
> between wikis if, say, I wanted to include a quote from some Polish
> leader or something.
Sad but true. Maybe that's a reasonable sacrifice for ease of editing
on those wikis.
Lee, let me put it this way. Imagine, if you will, that history had gone
somewhat differently. Let's say that the first computers had been
developed in a politically free, economically strong, highly
industrialized Russia and the standard computer character set around the
world had been based on the Cyrillic alphabet.
In our hypothetical world, there's a Russian version of what we would
have called Wikipedia. They set up some subsites in other languages, one
of which is English, which uses the Latin alphabet.
Now, you want to add some articles to the English site, but the site
administrators have declared that only the standard Cyrillic character
set is to be used, with special markup to allow other characters through
the use of numerical codes. This means:
* Pages display fine for viewing, but when you edit, you see nothing
but numeric escape codes.
* You can't type *a single letter of English text* without using a
special numeric escape code.
* All page titles have to be transliterated into Cyrillic, because the
escape codes aren't allowed in titles.
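(To make the scenario concrete, here's a rough sketch -- illustrative Python, not anything the wiki actually runs -- of what escaping every out-of-charset letter to a numeric code does to a line of text:)

```python
def escape_foreign(text, allowed=range(0x0400, 0x0500)):
    """Replace every character outside the 'native' range with a numeric
    character reference -- this is what the hypothetical editor would see.
    Illustrative only; the allowed range here is the Cyrillic block."""
    out = []
    for ch in text:
        if ord(ch) in allowed or ch in ' []|':
            out.append(ch)
        else:
            out.append('&#%d;' % ord(ch))
    return ''.join(out)

# Every Latin letter becomes an escape; only the Cyrillic passes through:
print(escape_foreign('[[Уикипздиа|Wikipedia]]'))
```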
Now, can you honestly tell me that you expect the average
English-speaking wiki contributor to edit a page that looks something
like this:
[[уикипздиа:Узлкомз нзукомзрс|Welcome]] to [[Уикипздиа|Wikipedia]], a
collaborative project to produce a complete encyclopedia from scratch.
We started in January 2001 and already have over '''23,000 articles'''.

?
I can't imagine that you would expect that to be acceptable to anyone
else! You'll notice that the two non-ISO-8859-1-language 'pedias that
have actual content (Polish and Esperanto) both use the Latin alphabet
with a few diacritics. So theoretically, they would be the *most*
amenable to using HTML entities -- you can almost read text in the edit
box that way -- yet users of both wikipedias took the effort to tweak
the program to make their customary character encodings work so that
they could actually find people who would be willing to edit pages.
HTML entities are great for tossing in an occasional foreign letter or
word, but at the user level they are poor for regularly used diacritics
and utterly useless for text in other alphabets.
> We could, alternatively, serve UTF-8 on all of them, but that would
> risk breaking older browsers. There are side issues of what is stored
> in the database, and what is allowable in titles/URLs, etc.
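(And the charset header genuinely decides what readers see: the same stored byte renders as different characters depending on what the server declares in Content-Type. A tiny illustration, in Python purely for exposition:)

```python
# Purely illustrative: one byte from a hypothetical Polish page means
# different things depending on the charset the server declares.
raw = b'\xb1'
print(raw.decode('iso-8859-2'))   # a-with-ogonek, correct under the pl charset
print(raw.decode('iso-8859-1'))   # plus-minus sign, what a Latin-1 browser shows
```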
Another alternative is to use the entities internally in the database,
but work some mojo to make them appear as normal characters in the edit
box. Which means you get zero advantage over simply using the national
character set -- you still have to send a character set header, you have
to know which Unicode characters can be passed through safely and which
need to be escaped, the search engine still breaks words, you still
can't capitalize non-ISO8859-1 titles, you still can't cut-n-paste, etc
etc etc. All of the pain, none of the gain.
> We really need to sit down and spec this out before we get too far
> down the road. That's one reason why I posted the proposed policy on
> foreign characters for the English Wiki; it is explicitly for the
> English one only, but we need something equivalent for the other
> ones.
>
> We had a lot of discussion about these topics in the early months of
> the project: I don't want us to ignore everything we learned back
> then just because the folks working on the code now weren't around
> back then.
Indeed. What were the conclusions of these discussions, and the
reasoning behind them?
-- brion vibber (brion@pobox.com)