Re: [Wikitech-l] New case conversion functions

22 Feb 2002

On Thu, 2002-02-21 at 09:59, Jan Hidders wrote:
...
 Right now there is a localization problem wrt. indexing. The fulltext index
 indexes single words and defines these as series of letters, numbers, and
 the odd "'" and "_". Since the standard character set of MySQL is ISO 8859-1
 I assume that it knows what are letters in that character set. I really
 don't know how this behaves when the character set of MySQL is changed.
 Available, by the way, are big5, cp1251, cp1257, czech, danish, dec8, dos,
 euc_kr, gb2312, gbk, german1, hebrew, hp8, hungarian, koi8_ru, koi8_ukr,
 latin1, latin2, sjis, swe7, tis620, ujis, usa7, and win1251ukr. But I don't
 think we want to go that way because then (if I understand the documentation
 correctly) we need a separate MySQL server for every character set. Anyway,
 in all cases the indexing breaks down for entities because it doesn't index
 words with '&' and ';' in them, so it sees "G&ouml;del" as "G" and "del"
 with some funny symbols in between that it doesn't index. The indexing also
 has no idea that this has something to do with "Godel".
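
 To make the failure concrete, here is a minimal Python sketch. The regular
 expression is only a rough stand-in for the word definition described above
 (letters, numbers, "'" and "_"); it is an assumption, not MySQL's actual
 tokenizer, and index_words is a hypothetical name.

```python
import re

# Approximation of the word definition described above: runs of ASCII
# letters, digits, "'" and "_". Assumed for illustration only.
WORD = re.compile(r"[A-Za-z0-9'_]+")

def index_words(text):
    """Return the fragments such a tokenizer would treat as words."""
    return WORD.findall(text)

# The name is broken up at '&' and ';', so no fragment matches "Godel".
print(index_words("G&ouml;del"))
```

 Whether the middle fragment ends up in the index depends on the real
 tokenizer; the point is only that the name never appears as one word.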

 Admittedly unaware of any previous discussion on this, I would
 suggest the following:
 1. Internally, i.e., in the database fields and URLs, we use only standard
 ASCII plus HTML entities for bodies and titles. However, to allow indexing, we
 encode &#101; as something like '_101_' in the database fields.
 2. Externally in search and edit boxes the user can type any character the
 browser allows, but we always translate internally the non-ASCII ones to
 entities.
 3. When a request for a page is made we always translate the entities as
 much as possible to the character set specified in the request, including
 the contents of edit boxes.

Ugh. Doable, though. Presumably the point of this is so that someone can
type either:
  ö  (actual o-with-umlaut in the display character encoding)
  &ouml;
  &#246;
  &#xf6;
  &#x00F6;
or any number of other alternatives in the edit box and put the same
actual sequence of bytes into the data?
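
A minimal Python sketch of that normalization, assuming the internal form is
ASCII plus decimal numeric entities (the choice of decimal is an assumption,
and to_internal is a hypothetical name):

```python
import html

def to_internal(text):
    """Normalize typed text to ASCII plus decimal numeric entities."""
    # html.unescape resolves named, decimal, and hex references alike.
    chars = html.unescape(text)
    # Anything outside ASCII is written back as a decimal reference.
    return chars.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

spellings = ["G\u00f6del", "G&ouml;del", "G&#246;del", "G&#xf6;del", "G&#x00F6;del"]
# All five spellings collapse to the same stored string.
print({to_internal(s) for s in spellings})  # {'G&#246;del'}
```

Note that html.unescape also turns &amp; and &lt; into literal characters, so a
real version would have to leave the markup-significant entities alone.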

Also remember that we'll still have to escape entities that _aren't_ in
the display character set in all edit boxes, so that they won't be
disappeared or converted into "?"s when the user hits submit. (I'm
assuming that you don't want _every_ non-ASCII character to show up in the
edit box as its raw entity code? See
my previous message on this subject for why that's a Very Bad Idea.)
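
As a sketch of what that means for the edit box, assuming Python and a
hypothetical helper name: decode everything the display character set can
represent, and leave only the remainder as numeric entities.

```python
import html

def for_edit_box(internal, charset):
    """Return edit-box bytes in charset, keeping entities only where needed."""
    chars = html.unescape(internal)
    # The xmlcharrefreplace handler emits &#N; for anything charset lacks,
    # so those characters survive the round trip instead of becoming "?".
    return chars.encode(charset, errors="xmlcharrefreplace")

stored = "G&#246;del in &#321;&#243;d&#378;"
print(for_edit_box(stored, "latin-1"))
print(for_edit_box(stored, "utf-8"))
```

With latin-1 the ö and ó come out as raw bytes while the two Polish letters
stay as entities; with utf-8 nothing needs escaping. The sketch ignores the
separate problem of text that contained a literal &amp; to begin with.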

...
  The main thing is to define the translation functions:

 - string encodeEntities ( mb-string external-string, string character-set )
 - mb-string decodeEntities ( string internal-string, string character-set )

 (With mb-string I mean a multi-byte character string.)

cf. $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready place to
do this.

(Keep in mind that ASCII-with-HTML-entities is for all intents and
purposes a multibyte character encoding. It switches from single-byte to
double-byte mode when encountering a "&", and self-recovers if it is not
followed by a correct multibyte code string ending in ";".)
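
To illustrate the mode switch and the self-recovery, here is a small scanner
sketched in Python. The function names are hypothetical, and the grammar for
a reference (name, decimal, or hex, ending in ";") is a simplification.

```python
import re
from html.entities import name2codepoint

# "&", then a name, "#" + decimal, or "#x" + hex, then ";".
REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def code_point(body):
    """Return the code point for a reference body, or None if unknown."""
    if body[:2] in ("#x", "#X"):
        return int(body[2:], 16)
    if body.startswith("#"):
        return int(body[1:])
    return name2codepoint.get(body)

def scan(text):
    """Yield decoded characters from ASCII-with-entities text."""
    i = 0
    while i < len(text):
        if text[i] == "&":
            m = REFERENCE.match(text, i)
            code = code_point(m.group(1)) if m else None
            if code is not None and code <= 0x10FFFF:
                yield chr(code)  # multi-byte mode: one character consumed
                i = m.end()
                continue
        # Not a valid reference: recover by treating "&" as itself.
        yield text[i]
        i += 1

print("".join(scan("G&ouml;del, AT&T, &#xf6;, &bogus;")))
```

A bare "&" or an unknown name such as &bogus; passes through unchanged, which
is the self-recovering behaviour described above.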

...
  For localization we define the following functions:

 - string canonicalTitle ( string internal-string ) translates an internal
 title to its canonical form. It deals with capitalization, for example. If
 two strings are translated to the same canonical form they are formally the
 same title. If a string is translated to an empty string it is not a valid
 title. If you don't want entities in your titles, you can define that here.
 - string urlTitle ( string internal-string ) translates an internal
 canonicalized title to its URL form. It probably only replaces space
 characters with "+" and escapes ASCII characters that need to be escaped in
 a URL.

cf. wikiTitle->makeSecureTitle()
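
For comparison, a minimal Python sketch of the two proposed functions. The
specific rules (collapse whitespace, capitalize the first character) are
assumptions chosen for illustration, not what makeSecureTitle() does.

```python
from urllib.parse import quote_plus

def canonical_title(internal):
    """Map an internal title to canonical form; "" means not a valid title."""
    words = internal.split()  # drops leading, trailing, and repeated spaces
    if not words:
        return ""
    title = " ".join(words)
    return title[0].upper() + title[1:]

def url_title(canonical):
    """Escape a canonical title for use in a URL, with spaces as "+"."""
    return quote_plus(canonical)

title = canonical_title("  kurt   g&#246;del ")
print(title)
print(url_title(title))
```

Note that a title beginning with an entity, such as "&#246;sterreich", is not
capitalized by this sketch, since upper() only sees the "&".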

...
  For these functions we also need to define arrays that associate entities
 with their uppercase equivalents, and vice versa, for the relevant character
 sets.

Easy enough, I can generate that from the Unicode data tables.
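
A minimal sketch of such a generator in Python, whose str.upper() and
str.lower() draw on the Unicode case data. The decimal-entity keys and the
Latin-1 default range are assumptions; build_case_map is a hypothetical name.

```python
def build_case_map(convert, first=0x80, last=0xFF):
    """Map "&#N;" to the entity of its case-converted character."""
    table = {}
    for code in range(first, last + 1):
        mapped = convert(chr(code))
        # Skip characters that do not change, and those whose conversion is
        # more than one character (U+00DF uppercases to "SS").
        if len(mapped) == 1 and ord(mapped) != code:
            table[f"&#{code};"] = f"&#{ord(mapped)};"
    return table

UPPER = build_case_map(str.upper)
LOWER = build_case_map(str.lower)

print(UPPER["&#246;"], LOWER["&#214;"])  # &#214; &#246;
```

The two tables are built separately rather than by inverting one, because
some mappings leave the range and are not symmetric.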

...
  Having said all this I also want to emphasize that we first need to have a
 document that describes exactly how we are going to do this, before we code
 another line for localization. We have to realize that we are a real project
 now.

Yes, a real project that's already running and has thousands of pages
that don't conform to the as-yet-nonexistent document. Hopefully we can
munge them together!

-- brion vibber (brion @ pobox.com)
