On Fri, 2002-02-22 at 02:33, Jan Hidders wrote:
From: "Brion Vibber" <brion@pobox.com>
Ugh. Doable, though. Presumably the point of this is so that someone can
type either:
ö (actual o-with-umlaut in the display character encoding)
&ouml;
&#246;
&#xf6;
&#x00f6;
or any number of other alternatives in the edit box and put the same
actual sequence of bytes into the data?
Exactly.
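The equivalence being discussed can be illustrated with a small sketch (Python here as a stand-in, not the wiki's actual PHP; the particular entity spellings are examples):

```python
from html import unescape

# Several spellings of the same character: a literal o-umlaut, the named
# entity, and decimal/hex numeric character references.
spellings = ["\u00f6", "&ouml;", "&#246;", "&#xf6;", "&#x00f6;"]

# All of them decode to the identical one-character string, so the same
# actual sequence of bytes can be stored no matter which form was typed.
decoded = {unescape(s) for s in spellings}
print(decoded)  # every spelling collapses to a single character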
Also remember that we'll still have to escape entities that _aren't_ in
the display character set in all edit boxes, so that they won't be
silently dropped or converted into "?"s when the user hits submit. (I'm
assuming that you don't want the raw HTML entity codes for _every_
non-ASCII character to appear in the edit box? See my previous message
on this subject for why that's a Very Bad Idea.)
No, no, of course not. What I meant was that we check which display
character encoding (I said character set, but that is probably not the
right word) is given with the request. Suppose someone asks for an edit
page and the browser tells us that it uses ISO-8859-5 (which supports
Cyrillic); then we present the contents of the edit box such that all
entities that have a direct encoding in ISO-8859-5 are translated, and
all the other entities simply stay as they are. I think we need
multibyte-character support for this.
(Not necessarily, just lots of transliteration tables. Or perhaps
compile in iconv support... does iconv allow partial transliteration
between HTML entities and other character sets? ie, HTML entities that
are not in the destination charset are left intact?)
The nice thing about this would be that you can cut'n'paste anything
from any other Wikipedia by cutting it from the edit box.
If I understand correctly, you're suggesting that the default character
encoding should *not* be based on the language used, but on some ability
of the browser to specify a preferred encoding (for instance, the HTTP
Accept-Charset header), such that the same user would see wikipedias in
different languages come up with the same character encoding?
I'm not convinced that the default value of that would always (or even
often) be acceptable, and most users won't know how to change it.
Simple, obvious-at-first-sight manual switching between UTF-8 and a
standard transliteration format is a non-negotiable requirement for the
Esperanto 'pedia, so retaining the manual override is necessary.
> The main thing is to define the translation functions:
>
> - string encodeEntities ( mb-string external-string, string character-set )
> - mb-string decodeEntities ( string internal-string, string character-set )
(With mb-string I mean a multi-byte character string.)
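The encode direction of these proposed signatures might look like the following sketch (Python as a stand-in for illustration; the function name mirrors the proposal, and a real version would also need to escape literal ampersands):

```python
def encode_entities(external: str, charset: str) -> str:
    """Replace every character that `charset` cannot represent with a
    decimal numeric character reference, leaving the rest untouched."""
    out = []
    for ch in external:
        try:
            ch.encode(charset)
            out.append(ch)             # representable in the display charset
        except UnicodeEncodeError:
            out.append("&#%d;" % ord(ch))  # escape so it survives round-trip
    return "".join(out)
```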
cf. $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready place to
do this.
Those don't have the right arguments. But these are implementation
details; we first should agree on the architecture.
There's no character set argument because that's a global variable. At
present $wikiCharset specifies the default encoding (that used in the
database), and optional alternate external encodings are in
$wikiCharsetEncodings[] with the user-selected index in
$user->options["encoding"].
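As a rough Python rendering of that configuration lookup (the variable names mirror the PHP globals Brion names, but the specific encoding values here are illustrative guesses, not the wiki's real configuration):

```python
# Assumed stand-ins for the PHP globals $wikiCharset,
# $wikiCharsetEncodings[], and $user->options["encoding"].
wiki_charset = "iso-8859-1"                       # default / database encoding
wiki_charset_encodings = ["utf-8", "iso-8859-5"]  # optional alternates

def current_charset(user_options: dict) -> str:
    """Pick the external encoding: the user-selected alternate if one is
    set and valid, otherwise the default (database) encoding."""
    idx = user_options.get("encoding")
    if isinstance(idx, int) and 0 <= idx < len(wiki_charset_encodings):
        return wiki_charset_encodings[idx]
    return wiki_charset
```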
-- brion vibber (brion@pobox.com)