On Thu, 2002-02-21 at 09:59, Jan Hidders wrote:
Right now there is a localization problem wrt. indexing. The fulltext index
indexes single words and defines these as series of letters, numbers, and
the odd "'" and "_". Since the standard character set of MySQL is ISO 8859-1,
I assume that it knows which characters are letters in that character set. I really
don't know how this behaves when the character set of MySQL is changed.
Available, by the way, are big5, cp1251, cp1257, czech, danish, dec8, dos,
euc_kr, gb2312, gbk, german1, hebrew, hp8, hungarian, koi8_ru, koi8_ukr,
latin1, latin2, sjis, swe7, tis620, ujis, usa7, and win1251ukr. But I don't
think we want to go that way because then (if I understand the documentation
correctly) we need a separate MySQL server for every character set. Anyway,
in all cases the indexing breaks down for entities because it doesn't index
words with '&' and ';' in them, so it sees "G&ouml;del" as "G" and "del"
with some funny symbols in between that it doesn't index. The indexing also
has no idea that this has something to do with "Godel".
Admittedly unaware of any previous discussion on this, I would
suggest the following:
1. Internally, i.e., in the database fields and URLs we use for bodies and
titles only standard ASCII plus HTML entities. However, to allow indexing we
encode &#101; as something like '_101_' in the database fields.
2. Externally in search and edit boxes the user can type any character the
browser allows, but we always translate internally the non-ASCII ones to
entities.
3. When a request for a page is made we always translate the entities as
much as possible to the character set specified in the request, including
the contents of edit boxes.
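A minimal sketch of the '_101_' idea from point 1, in Python, assuming the intent is that a numeric entity such as &#101; is stored as _101_ so that the indexer sees only letters, digits, and "_". The function names to_indexable and from_indexable are made up for illustration and are not part of the existing code.

```python
import re

def to_indexable(internal: str) -> str:
    """Rewrite numeric entities (&#246;) as word-character tokens (_246_)."""
    return re.sub(r"&#([0-9]+);", r"_\1_", internal)

def from_indexable(stored: str) -> str:
    """Inverse mapping. Note: ambiguous if the text itself contains
    something like "_246_", so a real scheme would need an escape rule."""
    return re.sub(r"_([0-9]+)_", r"&#\1;", stored)

stored = to_indexable("G&#246;del")   # "G_246_del": one unbroken word for the indexer
restored = from_indexable(stored)     # back to "G&#246;del"
```

This only covers numeric entities; named entities would first have to be converted to numeric form.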
Ugh. Doable, though. Presumably the point of this is so that someone can
type either:
ö (actual o-with-umlaut in the display character encoding)
&ouml;
Ö
Ö
Ö
or any number of other alternatives in the edit box and put the same
actual sequence of bytes into the data?
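To make that concrete, here is a rough Python sketch of such a normalization, assuming the input has already been decoded from the browser's charset into a Unicode string and that the stored form is decimal numeric entities. The name normalize is hypothetical. Entities that decode to plain ASCII (such as &amp;) are left alone so markup is not changed.

```python
import html
import re

# Named, decimal, and hexadecimal character references.
_ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def normalize(text: str) -> str:
    """Map raw non-ASCII characters and any entity spelling of them
    to a single stored form: decimal numeric entities."""
    def one_entity(match):
        ch = html.unescape(match.group(0))
        # Only fold entities that name a single non-ASCII character.
        if len(ch) == 1 and ord(ch) >= 128:
            return ch
        return match.group(0)

    decoded = _ENTITY.sub(one_entity, text)
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in decoded)

variants = ["\u00f6", "&ouml;", "&#246;", "&#xF6;", "&#XF6;"]
stored_forms = {normalize(v) for v in variants}   # a single element
```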
Also remember that we'll still have to escape entities that _aren't_ in
the display character set in all edit boxes, so that they don't
vanish or get converted into "?"s when the user hits submit. (I'm
assuming that you don't want every non-ASCII character to show up in the
edit box as a raw HTML entity code? See
my previous message on this subject for why that's a Very Bad Idea.)
The main thing is to define the translation functions:
- string encodeEntities ( mb-string external-string, string character-set )
- mb-string decodeEntities ( string internal-string, string character-set )
(By mb-string I mean a multi-byte character string.)
cf $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready place to
do this.
(Keep in mind that ASCII-with-HTML-entities is for all intents and
purposes a multibyte character encoding. It switches from single-byte to
double-byte mode when encountering a "&", and self-recovers if it is not
followed by a correct multibyte code string ending in ";".)
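A possible shape for the two translation functions, sketched in Python rather than in the wiki's own code. Assumptions: "external" strings are bytes in an ASCII-compatible charset, the internal form uses decimal numeric entities, and an entity that the target charset cannot represent stays an entity (so it is not lost or turned into "?" on submit). The names mirror the proposal but the bodies are only illustrative.

```python
import html
import re

_ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def encode_entities(external: bytes, charset: str) -> str:
    """External bytes in `charset` -> ASCII text with numeric entities."""
    text = external.decode(charset)
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

def decode_entities(internal: str, charset: str) -> bytes:
    """ASCII-with-entities -> bytes in `charset`, decoding as much as possible."""
    def one_entity(match):
        ch = html.unescape(match.group(0))
        if len(ch) != 1 or ord(ch) < 128:
            return match.group(0)      # unknown entity, or ASCII escape like &amp;
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            return f"&#{ord(ch)};"     # not in this charset: keep it escaped
        return ch

    return _ENTITY.sub(one_entity, internal).encode(charset)

page = decode_entities("G&ouml;del &#1078;", "latin-1")
# o-umlaut becomes the single byte 0xF6; the Cyrillic letter stays "&#1078;"
```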
For localization we define the following functions:
- string canonicalTitle ( string internal-string ) translates an internal
title to its canonical form. It deals with capitalization, for example. If
two strings are translated to the same canonical form they are formally the
same title. If a string is translated to an empty string it is not a valid
title. If you don't want entities in your titles, you can define that here.
- string urlTitle ( string internal-string ) translates an internal
canonicalized title to its URL form. It probably only replaces space characters
with "+" and escapes ASCII characters that need to be escaped in a URL.
cf. wikiTitle->makeSecureTitle()
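For illustration only, the two title functions might look something like this in Python. The specific rules here (collapse whitespace, capitalize the first character, empty result means invalid) are assumptions chosen to match the description above, not the behaviour of the existing code.

```python
from urllib.parse import quote_plus

def canonical_title(internal: str) -> str:
    """Internal title -> canonical form; "" means the title is not valid."""
    title = " ".join(internal.split())      # trim and collapse whitespace
    # Capitalize the first character. A leading entity would need the
    # entity case tables instead; that is not handled in this sketch.
    return title[:1].upper() + title[1:]

def url_title(canonical: str) -> str:
    """Canonical title -> URL form: spaces become "+", and "&", "#", ";"
    and other reserved ASCII characters are percent-escaped."""
    return quote_plus(canonical)

title = canonical_title("  kurt   g&#246;del ")
link = url_title(title)
```

Two inputs that yield the same canonical_title() result would be treated as the same page.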
For these functions we also need to define arrays that associate entities
with their uppercase equivalents, and vice versa, for the relevant character
sets.
Easy enough, I can generate that from the Unicode data tables.
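Roughly like this, as a Python sketch: it leans on the Unicode case mappings built into Python's str.upper() and str.lower() as a stand-in for reading the Unicode data tables directly, and it only covers decimal numeric entities. The default range (0x80 to 0xFF) and the names are arbitrary choices for the example.

```python
def build_case_table(convert, first=0x80, last=0xFF):
    """Map each numeric entity in [first, last] to the entity of its
    case-converted character, skipping characters without a
    one-to-one mapping (e.g. sharp s, which uppercases to "SS")."""
    table = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        other = convert(ch)
        if len(other) == 1 and other != ch:
            table[f"&#{cp};"] = f"&#{ord(other)};"
    return table

TO_UPPER = build_case_table(str.upper)
TO_LOWER = build_case_table(str.lower)

upper_o_umlaut = TO_UPPER["&#246;"]   # entity for capital O-umlaut
```

The two tables are built separately rather than by inverting one of them, since case mappings are not always symmetric.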
Having said all this I also want to emphasize that we first need to have a
document that describes exactly how we are going to do this, before we code
another line for localization. We have to realize that we are a real project
now.
Yes, a real project that's already running and has thousands of pages
that don't conform to the as-yet-nonexistent document. Hopefully we can
munge them together!
-- brion vibber (brion @ pobox.com)