Right now there is a localization problem with respect to indexing. The
fulltext index indexes single words and defines these as series of letters,
numbers, and the odd "'" and "_". Since the standard character set of MySQL
is ISO 8859-1, I assume that it knows which characters are letters in that
character set. I really don't know how this behaves when the character set
of MySQL is changed.
Available, by the way, are big5, cp1251, cp1257, czech, danish, dec8, dos,
euc_kr, gb2312, gbk, german1, hebrew, hp8, hungarian, koi8_ru, koi8_ukr,
latin1, latin2, sjis, swe7, tis620, ujis, usa7, and win1251ukr. But I don't
think we want to go that way because then (if I understand the documentation
correctly) we need a separate MySQL server for every character set. Anyway,
in all cases the indexing breaks down for entities because it doesn't index
words with '&' and ';' in them, so it sees "G&ouml;del" as "G" and "del"
with some funny symbols in between that it doesn't index. The indexing also
has no idea that this has something to do with "Godel".
I am admittedly unaware of any previous discussion on this, but I would
suggest the following:
1. Internally, i.e., in the database fields and URLs, we use only standard
ASCII plus HTML entities for bodies and titles. However, to allow indexing
we encode &#101; as something like '_101_' in the database fields.
2. Externally, in search and edit boxes, the user can type any character the
browser allows, but internally we always translate the non-ASCII ones to
entities.
3. When a request for a page is made we always translate the entities as
much as possible to the character set specified in the request, including
the contents of edit boxes.
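
To make point 1 concrete, here is a rough sketch in Python. The word pattern
only approximates the rule described above (letters, numbers, "'" and "_")
and is not MySQL's actual tokenizer. The names to_index_form,
from_index_form and index_words are made up for this illustration.

```python
import re

# Approximation of the word rule described above; an assumption, not MySQL code.
WORD = re.compile(r"[A-Za-z0-9'_]+")
NUMERIC_ENTITY = re.compile(r"&#(\d+);")
INDEX_FORM = re.compile(r"_(\d+)_")


def to_index_form(internal: str) -> str:
    """Rewrite numeric entities such as '&#246;' as '_246_' for storage."""
    return NUMERIC_ENTITY.sub(r"_\1_", internal)


def from_index_form(stored: str) -> str:
    """Turn '_246_' back into '&#246;' when reading from the database."""
    return INDEX_FORM.sub(r"&#\1;", stored)


def index_words(text: str) -> list[str]:
    """Split text into the words this approximate indexer would see."""
    return WORD.findall(text)


print(index_words("G&#246;del"))                 # the entity breaks the word apart
print(index_words(to_index_form("G&#246;del")))  # the stored form stays one word
```

One thing the design document would have to settle: text that already
contains something like '_101_' literally cannot be told apart from an
encoded entity, so the real encoding needs an escape rule for that case.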
The main thing is to define the translation functions:
- string encodeEntities ( mb-string external-string, string character-set )
- mb-string decodeEntities ( string internal-string, string character-set )
(By mb-string I mean a multi-byte character string.)
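
As a minimal sketch of what these two functions could look like, here is a
Python version. It assumes the mb-string is a byte string in the given
character set, that non-ASCII characters become numeric entities, and that
entities for ASCII characters (such as &amp;) are left alone when decoding
so that markup is not changed. None of this is decided; it only illustrates
the idea.

```python
import re
from html.entities import name2codepoint

# Matches numeric (&#246;) and named (&ouml;) entities.
_ENTITY = re.compile(r"&(#\d+|[A-Za-z][A-Za-z0-9]*);")


def encode_entities(external: bytes, charset: str) -> str:
    """Translate external text to ASCII plus numeric entities."""
    text = external.decode(charset)
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def decode_entities(internal: str, charset: str) -> bytes:
    """Translate entities to the target charset as far as possible."""

    def replace(match: re.Match) -> str:
        body = match.group(1)
        if body.startswith("#"):
            code = int(body[1:])
        else:
            code = name2codepoint.get(body)
        # Keep unknown entities and entities for ASCII characters as they are.
        if code is None or code < 128 or code > 0x10FFFF:
            return match.group(0)
        return chr(code)

    text = _ENTITY.sub(replace, internal)
    # Characters the charset cannot represent fall back to numeric entities.
    return text.encode(charset, "xmlcharrefreplace")


print(encode_entities("Gödel".encode("latin-1"), "latin-1"))  # G&#246;del
print(decode_entities("G&ouml;del", "latin-1"))               # b'G\xf6del'
```

Note that decoding is deliberately lossy in one direction only: a character
that does not exist in the requested character set simply stays an entity.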
For localization we define the following functions:
- string canonicalTitle ( string internal-string ) translates an internal
title to its canonical form. It deals with capitalization, for example. If
two strings are translated to the same canonical form they are formally the
same title. If a string is translated to an empty string it is not a valid
title. If you don't want entities in your titles, you can define that here.
- string urlTitle ( string internal-string ) translates an internal
canonical title to its URL form. It probably only replaces space characters
with "+" and escapes ASCII characters that need to be escaped in a URL.
For these functions we also need to define arrays that associate entities
with their uppercase equivalents, and vice versa, for the relevant character
sets.
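
A possible shape for these two functions and the case arrays, again as a
Python sketch only. The capitalization rule (uppercase the first character,
collapse whitespace) and the three table entries are assumptions picked for
illustration; a real localization would supply its own rules and a complete
table.

```python
import re
from urllib.parse import quote_plus

# Hypothetical excerpt of the entity case table; a real one covers the
# whole character set.
ENTITY_UPPER = {
    "&auml;": "&Auml;",
    "&ouml;": "&Ouml;",
    "&eacute;": "&Eacute;",
}
ENTITY_LOWER = {upper: lower for lower, upper in ENTITY_UPPER.items()}

_LEADING_ENTITY = re.compile(r"&[A-Za-z]+;")


def canonical_title(internal: str) -> str:
    """Return the canonical form of a title, or '' if it is not valid."""
    title = " ".join(internal.split())  # trim and collapse whitespace
    if not title:
        return ""
    match = _LEADING_ENTITY.match(title)
    if match:
        entity = match.group(0)
        return ENTITY_UPPER.get(entity, entity) + title[match.end():]
    return title[0].upper() + title[1:]


def url_title(canonical: str) -> str:
    """Replace spaces with '+' and escape characters unsafe in a URL."""
    return quote_plus(canonical)


print(canonical_title("  &ouml;sterreich "))  # &Ouml;sterreich
print(url_title("Kurt g&ouml;del"))           # Kurt+g%26ouml%3Bdel
```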
Having said all this I also want to emphasize that we first need to have a
document that describes exactly how we are going to do this, before we code
another line for localization. We have to realize that we are a real project
now.
-- Jan Hidders