I'm reposting this to wikitech-l so that discussion doesn't get lost.
On Wed, 2002-02-20 at 15:20, lcrocker@nupedia.com wrote:
> You wrote:
> > Please take a look at the non-English non-ISO-8859-1 wikipedias
> > sometime. Hundreds of pages, with correct charset headers:
> >
> > ISO-8859-2: http://pl.wikipedia.com/
> >
> > UTF-8 with a custom conversion function for certain character
> > sequences: http://eo.wikipedia.com/
> You're right. Last time I looked at these, the test pages I retrieved
> gave 404s, and the 404 page is still served as ISO-8859-1, but the
> headers of contentful pages are indeed as you say: 8859-2 for "pl"
> and UTF-8 for "eo", etc.
>
> OK, then, I guess we do have to wade into the morass of national
> character sets.
Unless you want to switch to UTF-8, that is a given.
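(For the record, the Esperanto wiki's "custom conversion" is presumably the x-convention, where ĉ, ĝ, ĥ, ĵ, ŝ, ŭ are typed as cx, gx, etc. A rough sketch of that kind of conversion, in Python purely for illustration -- this is not the actual eo.wikipedia.com code:)

```python
# Sketch of an x-convention -> UTF-8 conversion, as the Esperanto wiki
# presumably does it. Illustrative only, not the real eo.wikipedia.com code.
X_SYSTEM = {
    'cx': '\u0109', 'gx': '\u011d', 'hx': '\u0125',   # cx -> c-circumflex, etc.
    'jx': '\u0135', 'sx': '\u015d', 'ux': '\u016d',
    'Cx': '\u0108', 'Gx': '\u011c', 'Hx': '\u0124',
    'Jx': '\u0134', 'Sx': '\u015c', 'Ux': '\u016c',
}

def x_to_utf8(text):
    """Replace each x-convention digraph with the real Unicode letter."""
    for digraph, char in X_SYSTEM.items():
        text = text.replace(digraph, char)
    return text

print(x_to_utf8('sxangxo'))   # prints the properly accented Esperanto word
```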
> I have little or no experience using actual foreign-made computers;
> but I /do/ have extensive knowledge about character sets and
> communication protocols, so I'm just trying to make sure we don't
> make the same mistakes hundreds of others have made in the past by
> not getting this stuff right up front, but just diving headlong into
> coding without stepping back a moment to design something that will
> be usable and maintainable in the future.
>
> The way it is now, for example, we won't be able to cut-and-paste
> between wikis if, say, I wanted to include a quote from some Polish
> leader or something.
Sad but true. Maybe that's a reasonable sacrifice for ease of editing
on those wikis.
Lee, let me put it this way. Imagine, if you will, that history had gone
somewhat differently. Let's say that the first computers had been
developed in a politically free, economically strong, highly
industrialized Russia and the standard computer character set around the
world had been based on the Cyrillic alphabet.
In our hypothetical world, there's a Russian version of what we would
have called Wikipedia. They set up some subsites in other languages, one
of which is English, which uses the Latin alphabet.
Now, you want to add some articles to the English site, but the site
administrators have declared that only the standard Cyrillic character
set is to be used, with special markup to allow other characters through
the use of numerical codes. This means:
* Pages display fine for viewing, but when you edit, you see nothing
but numeric escape codes.
* You can't type *a single letter of English text* without using a
special numeric escape code.
* All page titles have to be transliterated into Cyrillic, because the
escape codes aren't allowed in titles.
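(To make the scenario concrete, here's a rough sketch -- illustrative Python, not anything the wiki actually runs -- of what escaping every out-of-charset letter to a numeric code does to a line of text:)

```python
def escape_foreign(text, allowed=range(0x0400, 0x0500)):
    """Replace every character outside the 'native' range with a numeric
    character reference -- this is what the hypothetical editor would see.
    Illustrative only; the allowed range here is the Cyrillic block."""
    out = []
    for ch in text:
        if ord(ch) in allowed or ch in ' []|':
            out.append(ch)
        else:
            out.append('&#%d;' % ord(ch))
    return ''.join(out)

# Every Latin letter becomes an escape; only the Cyrillic passes through:
print(escape_foreign('[[Уикипздиа|Wikipedia]]'))
```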
Now, can you honestly tell me that you expect the average
English-speaking wiki contributor to edit a page that looks something
like this:
[[уикипздиа:Узлкомз нзукомзрс|Welcome]] to [[Уикипздиа|Wikipedia]], a
collaborative project to produce a complete encyclopedia from scratch.
We started in January 2001 and already have over '''23,000 articles'''.

?
I can't imagine that you would expect that to be acceptable to anyone
else! You'll notice that the two non-ISO-8859-1-language 'pedias that
have actual content (Polish and Esperanto) both use the Latin alphabet
with a few diacritics. So theoretically, they would be the *most*
amenable to using HTML entities -- you can almost read text in the edit
box that way -- yet users of both wikipedias took the effort to tweak
the program to make their customary character encodings work so that
they could actually find people who would be willing to edit pages.
HTML entities are great for tossing in an occasional foreign letter or
word, but at the user level they are poor for regularly used diacritics
and utterly useless for text in other alphabets.
> We could, alternatively, serve UTF-8 on all of them, but that would
> risk breaking older browsers. There are side issues of what is stored
> in the database, and what is allowable in titles/URLs, etc.
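(And the charset header genuinely decides what readers see: the same stored byte renders as different characters depending on what the server declares in Content-Type. A tiny illustration, in Python purely for exposition:)

```python
# Purely illustrative: one byte from a hypothetical Polish page means
# different things depending on the charset the server declares.
raw = b'\xb1'
print(raw.decode('iso-8859-2'))   # a-with-ogonek, correct under the pl charset
print(raw.decode('iso-8859-1'))   # plus-minus sign, what a Latin-1 browser shows
```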
Another alternative is to use the entities internally in the database,
but work some mojo to make them appear as normal characters in the edit
box. Which means you get zero advantage over simply using the national
character set -- you still have to send a character set header, you have
to know which Unicode characters can be passed through safely and which
need to be escaped, the search engine still breaks words, you still
can't capitalize non-ISO8859-1 titles, you still can't cut-n-paste, etc
etc etc. All of the pain, none of the gain.
> We really need to sit down and spec this out before we get too far
> down the road. That's one reason why I posted the proposed policy on
> foreign characters for the English Wiki; it is explicitly for the
> English one only, but we need something equivalent for the other
> ones.
>
> We had a lot of discussion about these topics in the early months of
> the project: I don't want us to ignore everything we learned back
> then just because the folks working on the code now weren't around
> back then.
Indeed. What were the conclusions of these discussions, and the
reasoning behind them?
-- brion vibber (brion@pobox.com)