On Fri, 2002-02-22 at 02:33, Jan Hidders wrote:
From: "Brion Vibber" <brion@pobox.com>
Ugh. Doable, though. Presumably the point of this is so that someone can
type either:
ö (actual o-with-umlaut in the display character encoding)
&ouml;
&#246;
&#xf6;
&#x00f6;
or any number of other alternatives in the edit box and put the same
actual sequence of bytes into the data?
Exactly.
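The equivalence being discussed can be illustrated with a small sketch (Python here as a stand-in, not the wiki's actual PHP; the particular entity spellings are examples):

```python
from html import unescape

# Several spellings of the same character: a literal o-umlaut, the named
# entity, and decimal/hex numeric character references.
spellings = ["\u00f6", "&ouml;", "&#246;", "&#xf6;", "&#x00f6;"]

# All of them decode to the identical one-character string, so the same
# actual sequence of bytes can be stored no matter which form was typed.
decoded = {unescape(s) for s in spellings}
print(decoded)  # every spelling collapses to a single character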
Also remember that we'll still have to escape entities that _aren't_ in
the display character set in all edit boxes, so that they won't be
silently dropped or converted into "?"s when the user hits submit. (I'm
assuming that you don't want the raw HTML entity codes for _every_
non-ASCII character to appear in the edit box? See my previous message
on this subject for why that's a Very Bad Idea.)
No, no, of course not. What I meant was that we check which display
character encoding (I said character set, but that is probably not the
right word) is given with the request. Suppose someone asks for an edit
page and the browser tells us that it uses ISO-8859-5 (which supports
Cyrillic); then we present the contents of the edit box such that all
entities that have a direct encoding in ISO-8859-5 are translated, and
all the other entities simply stay as they are. I think we need
multibyte-character support for this.
(Not necessarily, just lots of transliteration tables. Or perhaps
compile in iconv support... does iconv allow partial transliteration
between HTML entities and other character sets? ie, HTML entities that
are not in the destination charset are left intact?)
The nice thing about this would be that you can cut'n'paste anything
from any other Wikipedia by cutting it from the edit box.
If I understand correctly, you're suggesting that the default character
encoding should *not* be based on the language used, but on some ability
of the browser to specify a preferred encoding (for instance, the HTTP
Accept-Charset header), such that the same user would see wikipedias in
different languages come up with the same character encoding?
I'm not convinced that the default value of that would always (or even
often) be acceptable, and most users won't know how to change it.
Simple, obvious-at-first-sight manual switching between UTF-8 and a
standard transliteration format is a non-negotiable requirement for the
Esperanto 'pedia, so retaining the manual override is necessary.
> The main thing is to define the translation functions:
>
> - string encodeEntities ( mb-string external-string, string character-set )
> - mb-string decodeEntities ( string internal-string, string character-set )
(With mb-string I mean a multi-byte character string.)
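The encode direction of these proposed signatures might look like the following sketch (Python as a stand-in for illustration; the function name mirrors the proposal, and a real version would also need to escape literal ampersands):

```python
def encode_entities(external: str, charset: str) -> str:
    """Replace every character that `charset` cannot represent with a
    decimal numeric character reference, leaving the rest untouched."""
    out = []
    for ch in external:
        try:
            ch.encode(charset)
            out.append(ch)             # representable in the display charset
        except UnicodeEncodeError:
            out.append("&#%d;" % ord(ch))  # escape so it survives round-trip
    return "".join(out)
```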
cf. $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready place to
do this.
Those don't have the right arguments. But these are implementation
details; we first should agree on the architecture.
There's no character set argument because that's a global variable. At
present $wikiCharset specifies the default encoding (that used in the
database), and optional alternate external encodings are in
$wikiCharsetEncodings[] with the user-selected index in
$user->options["encoding"].
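As a rough Python rendering of that configuration lookup (the variable names mirror the PHP globals Brion names, but the specific encoding values here are illustrative guesses, not the wiki's real configuration):

```python
# Assumed stand-ins for the PHP globals $wikiCharset,
# $wikiCharsetEncodings[], and $user->options["encoding"].
wiki_charset = "iso-8859-1"                       # default / database encoding
wiki_charset_encodings = ["utf-8", "iso-8859-5"]  # optional alternates

def current_charset(user_options: dict) -> str:
    """Pick the external encoding: the user-selected alternate if one is
    set and valid, otherwise the default (database) encoding."""
    idx = user_options.get("encoding")
    if isinstance(idx, int) and 0 <= idx < len(wiki_charset_encodings):
        return wiki_charset_encodings[idx]
    return wiki_charset
```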
-- brion vibber (brion@pobox.com)