Talk:Unicode/Archive 4

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Noncharacters U+FDD0 - U+FDEF

Are any characters assigned to the Unicode code points U+FDD0 - U+FDEF? --84.61.23.172 09:47, 23 June 2006 (UTC)

Please don't ask that kind of question here. This page is about the content of the article and nothing else. Mlewan 12:53, 23 June 2006 (UTC)

I disagree with you about the purpose of talk pages. It is a very good place to ask questions about the topic (and I have seen it used much for this purpose in the past, to the benefit of those involved, and often of benefit to the article as well). If you don't want to answer them, you can ignore them. - Rainwarrior 19:19, 3 July 2006 (UTC)

You use "you" for two persons here. I added the comment above but never removed anything. You can check the history if you want to see who removed what. The initial question was lazy and does not add anything to the discussion. If 84.61.23.172 had simply read the article and followed the links, s/he would have found the answer. The question below is different. The answer is not obvious and it is at least possibly interesting. Mlewan 19:46, 3 July 2006 (UTC)

(Yes, the edit comments were addressed to the person who reverted the addition of the second question. The talk page comment was only addressed to you. Sorry if that was confusing.) I wouldn't have commented if you had just said "you can easily find this information in the article" to the person, but you said "this page is about the content of the article and nothing else", so I took it to mean just that. - Rainwarrior 19:54, 3 July 2006 (UTC)

It's not a very good place to ask questions about the topic. It's inappropriate and distracts people actually working on the article. If it were one or two questions, it wouldn't be a big deal, but this is needlessly bloating the talk page. You go anywhere and start asking a lot of questions that show you haven't done your homework, ignore the mission of that group of people, and ignore anyone who tries talking to you about better ways of doing things, you're going to be considered rude. Frankly, I think he's a troll, the way he always has more questions the instant the last one is answered and never gives any explanation. If not, he's just horribly rude and demanding.--Prosfilaes 00:45, 4 July 2006 (UTC)

Definitely a troll. I firmly approve of removing this user's comments when they're of no relevance to the content of the page. It'd be a different matter if the amount of pointless questions asked wasn't trolling. It's as if the user has a list of Top 100 Unicode questions and reels one off when there has been a large time gap or as soon as the last one has been answered. Sukh | ਸੁਖ | Talk 00:51, 4 July 2006 (UTC)

Ah, I hadn't looked at the rest of the page. I'm sorry. I retract any of my earlier statements. Treat this guy however you will. - Rainwarrior 01:45, 4 July 2006 (UTC)

Displaying Unicode Characters

To display special Unicode characters on web pages using one of the Unicode fonts installed on your computer: if the character is inside a table, chart or box, specify class="Unicode" in the table's TR tag (or in each TD tag, though using it once per TR is easier). In wiki table code, put it after the TR equivalent "|-" (like |- class="Unicode"). For individual characters, the template code {{Unicode|char}} can be used; you may use an HTML decimal or hexadecimal reference in place of char. If a paragraph with lots of special Unicode characters needs to be displayed, <p class="Unicode"> ... </p> can be used. Thanks. ~ Tarikash 22:42, 14 July 2006 (UTC).

Bibliography or Further Reading?

Could this article use a bibliography or a list of books for further reading? Seems to me there are some good texts in print regarding Unicode.

Tamil?

"However, several issues remain as far as Indic language support is concerned. For instance, the Tamil language has been assigned only 127 blocks, which, while enabling correct text display, causes problems when text is copied onto a word processor. This problem can easily be rectified if an additional 130 blocks are allotted to Tamil."

What does "blocks" mean in this case? What is the problem with word processors, and how will more "blocks" help? I've commented this out until it is explained. Plugwash 18:18, 13 August 2006 (UTC)

Hi

By blocks I was referring to code points allotted in the code chart for the Tamil language. The problem is not with word processors but with the allocation of spaces in the Unicode standard itself. The Tamil language has 12 vowels and 18 consonants. Simple math yields 216+12+18=246 characters. Tamil also has a special character called 'aytha ezhuthu'. Put together there are 247 letters of the alphabet. However, the powers that be at Unicode have decided that Tamil does not have to be allocated so many points. Instead they have allotted a few code points for joiners and modifiers. The problem arises when text is copied and pasted. The joiners are rendered as independent characters ('ku' is displayed as 'k'+'u', for instance). Illogical ordering of letters and modifiers is another problem.

Regards

C Ramesh —The preceding unsigned comment was added by 203.199.211.197 (talk • contribs) 12:57, 14 August 2006 (UTC)

Ramesh, this still isn't very clear (for anyone who wonders where the 216 comes from, I assume it is all consonant/vowel pairs). Are you saying that 'ku' is representable in the source application and destination application, but not in the clipboard? What representation are the source and destination applications using that allows them to represent the 'ku' character? Chovain 13:58, 14 August 2006 (UTC)

Hi Chovain

You are right. The 216 letters are vowel-consonant pairs, but they are all treated as individual letters, unlike in English.

I think the problem would be better understood with this illustration:

க = the letter ka

ு = the 'u' vowel sign

கு = the letter 'ku'

When I copy 'கு' from a Web page and paste it onto a word processor, it would appear as க ு (without the space between the two letters). The letter and the vowel sign must not appear as separate letters in Tamil. That's my biggest quibble with Unicode. It's perfect for Tamil text display but fails miserably when it comes to text representation in a word processor or text editor.

C Ramesh —The preceding unsigned comment was added by 203.199.211.197 (talk • contribs) 15:29, 14 August 2006 (UTC)
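As a rough illustration of the example above, the following Python sketch (standard library only; the variable names are made up for this illustration) lists the code points that are stored for the syllable 'ku'. It only shows what is in the text, not how any particular browser or word processor chooses to draw it.

```python
import unicodedata

# The syllable 'ku' as it is stored in Unicode text: two code points.
ku = "\u0B95\u0BC1"

# Look up the code point and the official character name of each element.
parts = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in ku]

for code_point, name in parts:
    print(code_point, name)

# The vowel sign is classified as a mark (general category starting with "M"),
# which is the hint a renderer can use to attach it to the preceding consonant.
vowel_sign_is_mark = unicodedata.category("\u0BC1").startswith("M")
print(len(ku), vowel_sign_is_mark)
```

Whether the pair is drawn as one joined shape or as two side-by-side shapes is up to the font and the rendering engine; the stored sequence is the same in both cases.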

Ok - that makes much more sense. I've rewritten the paragraph in question. Let me know what you think. Chovain 23:48, 15 August 2006 (UTC)

I do not understand the recent addition that says that "TSCII does not suffer from this problem". The current article on TSCII says that it only uses 128 characters for "non-ASCII" - which would make it impossible to encode the 216 vowel-consonant combinations. So how does TSCII solve the problem? --Alvestrand 05:55, 16 August 2006 (UTC)

Hmm - that's a good point. C Ramesh (or anyone for that matter): Any chance you could shed some light on this? (Don't forget to sign your posts with "~~~~", too) Chovain 07:04, 16 August 2006 (UTC)

Actually, I think I've got it right now. Not 'all' uyirmei require a special glyph, so it 'can' be done in 128 slots, but Unicode doesn't even use all of the 128 slots allocated. Does anyone have a better reference for this stuff? This is feeling too much like original research. Chovain 07:16, 16 August 2006 (UTC)

The most official TSCII to Unicode conversion guide is Unicode technical note 15, referenced on the TSCII page. [1] - even though Unicode technical notes are not parts of the standard, I don't think many people want to deviate from that. Note that this refers to Unicode version 4.0; 4.1 added another character. --Alvestrand 07:41, 17 August 2006 (UTC)

Hi Chovain

Thanks for the rewrite. It certainly provides a lot more clarity.

C Ramesh —The preceding unsigned comment was added by 203.199.211.197 (talk • contribs) 12:25, 16 August 2006 (UTC)

Ramesh - PLEASE sign your comments with ~~~~. See WP:SIG if you don't know what I am talking about. Chovain 03:22, 17 August 2006 (UTC)

So in summary it seems the real issue is that word processor authors haven't fully dealt with the fact that "user perceived character" is not the same as "code point" despite having had many years now to do so. Plugwash 16:17, 16 August 2006 (UTC)

That still doesn't stop this from being a valid issue. Requiring programmers to treat particular characters differently is a pretty serious issue in itself. If I were writing a word processor that I wanted to handle any language, I certainly wouldn't know to treat these cases any differently. If it took 2 characters to represent French's "ç" character (the "c" and the dangly bit - sorry, I don't actually know French :)), this would be considered a very serious problem. Chovain 03:13, 17 August 2006 (UTC)

There's no feasible way to handle the world's languages without intelligence. Many, many scripts need position-sensitive shaping to look right. And ç can be stored as 2 code points, a c and a combining cedilla, and many Latin languages use letters that must be stored as two or more code points.--Prosfilaes 04:18, 17 August 2006 (UTC)
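The point about ç can be checked with a short Python sketch using the standard `unicodedata` module (the variable names are only for this illustration). It decomposes the single-code-point ç into c plus a combining cedilla and recombines it again.

```python
import unicodedata

precomposed = "\u00E7"          # ç as a single code point
decomposed = "c\u0327"          # c followed by COMBINING CEDILLA

# NFD splits the precomposed letter into base letter plus combining mark.
nfd = unicodedata.normalize("NFD", precomposed)
# NFC recombines the two-code-point sequence into the single code point.
nfc = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(ch):04X}" for ch in nfd])   # ['U+0063', 'U+0327']
print(nfc == precomposed)                    # True
```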

Wait a bit. If the main problem is that கு becomes க ு in a word processor this is a non issue. It works perfectly fine for me when I copy from Safari and paste in TextEdit, Pages or a large number of other MacOS X word processors. Pasting in MS Word fails, exactly as described, but that is a shortcoming of MS Word for Mac - not a shortcoming of Unicode. Could someone confirm that this is indeed the problem and revert the article? Mlewan 05:36, 17 August 2006 (UTC)

No, that's not the problem as I understand it. You think you're seeing it correctly in your OSX apps because everything is displaying it wrong. To make matters worse, anyone with a Tamil-enabled browser is going to see 'கு' differently to the rest of us :). The character we are discussing (கு) is not meant to look like a க and a ு joined together; it's meant to look like க with the tail extending around like a backwards '@'. See 6th char along the top row of this image. Chovain 06:16, 17 August 2006 (UTC)

I see exactly the 6th character in both Safari and TextEdit exactly as you describe it and as the picture shows. Mlewan 06:55, 17 August 2006 (UTC)

Ok, great. But does the fact that many OSX apps have worked around this issue stop it from being a Unicode issue? Can TextEdit individually represent the characters 'ka' and 'u' without a space between them? There are 3 separate characters in Tamil, and Unicode can only represent 2 of them. This is not an issue with the number of code points allocated (as originally described). As I understand it, it is just a problem that Unicode does not define all of Tamil's characters as other encodings do.

Yes, ka and u can be written (displayed when they are pasted) next to each other without a space. I do have three different characters in front of me in a TextEdit document: 0BC1 (u), 0b95 (ka) and the mix of them as per your picture. Displayed in any order. No spaces. Or with.

This is not something "many OS X apps" have worked around. The solution is built into the OS. It doesn't work in MS Word, as Word uses its own text rendering engine. Mlewan 08:12, 17 August 2006 (UTC)

(unindenting) So you are able to display 4 different things: 'ka', 'u', the correct 'ku' glyph, and the incorrect 'ku' glyph. The hex values are: 'ka'=>0x0B95; 'u'=>0x0BC1; 'ku'=>0x0B95,0x0BC1; and how is the incorrect one (MS-style side-by-side representation) represented in non-MS apps? If it is also represented by a 0x0B95,0x0BC1 combination, then I'm betting it gets displayed correctly again when you save and reopen the file. Chovain 12:01, 17 August 2006 (UTC)

If I get your question right, you want to know what happens if I put க ு (0x0B95 and 0x0BC1) displayed as two separate characters in a TextEdit document, save it, close it and reopen it. The answer is that it displays exactly the same as when I saved it. If you want to know more about the options, I suggest you try a Mac out at your nearest dealer. To actually use Tamil input may not be trivial, but you have information about that at http://discussions.apple.com/message.jspa?messageID=1200527#1200527 . Mlewan 20:25, 17 August 2006 (UTC)

Some further test results: A UTF16 encoded text file from a Mac shows கு perfectly fine in Notepad on Windows XP. However, the trick MacOS uses to be able to have க and ு next to each other without a space, is to actually save it with a space. The consequence is that an additional space is displayed in Notepad. If you type a space between க and ு on MacOS, the file is saved with two spaces: க ு. Even on Windows you can paste கு successfully to both Notepad and OpenOffice, but MS Office fails. Mlewan 13:32, 22 August 2006 (UTC)

(Sorry for the indent mess below. I do not know what was an answer to what anymore. Mlewan 13:32, 22 August 2006 (UTC))

Unicode gives Latin 'a', 'e', and 'æ'. It does not rely on the operating system to look for all occurrences of 'ae' and display them as 'æ'. It gives French a 'c', '¸' and a 'ç'. (If it didn't, I couldn't write 'c¸' as 2 separate characters to illustrate this example). Chovain 07:31, 17 August 2006 (UTC)

This is simply for historic and compatibility reasons. Having separate characters for combined letters is redundant. Of course it would be possible to write c and ¸ as two separate characters; the word processor just needs to add an invisible blank. -- 80.156.42.129 13:20, 28 November 2006 (UTC)

To elaborate: TSCII is able to represent the க character (0xB8), and can apply ு as a modifier to a consonant (0xA4). It has a special (single) character for கு though (0xCC). See tscii_charlist for the full chart. The table listed on the TSCII page displays 0xCC incorrectly, as it is just sending us the Unicode. Either way, the paragraph as it stands needs improvement. I'll take a shot at it soon if no one beats me to it. Chovain 06:40, 17 August 2006 (UTC)

The Unicode support of Tamil is perfectly able to fulfill all user requirements (except perhaps some strange issues concerning text markup), but the software implementation needs somewhat more sophistication than visual order encodings like TSCII. Note that Thai got its visual order encoding grandfathered into Unicode, but most Unicode experts consider the Unicode Thai implementation an odd deviation, needing special-casing here and there (e.g. in UCA). --Pjacobi 12:11, 17 August 2006 (UTC)

You can also try my test pages at http://www.jodelpeter.de/i18n/tamil/index.htm to verify that identical display can be achieved using Unicode or TSCII, provided you've installed your OS and fonts correctly.

Pjacobi 19:10, 17 August 2006 (UTC)

"The table listed on the TSCII page displays 0xCC incorrectly, as it is just sending us the Unicode": sending the Unicode equivalents is all we can do in HTML to display a charset, and at least here the display in that box does look substantially similar. An image may be a better option for displaying minority (read: not supported by many people's systems) charsets, though. It's certainly an option to consider. Plugwash 11:43, 18 August 2006 (UTC)

From the TSCII proposal (linked from tscii.org), it's pretty clear that TSCII encodes glyphs in order to make text processing easier for systems that can't compose க+ ு=கு. Representing glyphs means instead of கொ (letter ka, vowel sign o), one can use ெகா (vowel sign e, letter ka, vowel sign aa), dispensing with the need for one-glyph-per-consonant-vowel: the two look (almost) identical. I don't think "க ு" is valid Tamil. (OSX actually combines ு with the previous character, so it seems to be one character while it really is two). The comparison to æ is nonsensical - "æ" is semantically different from "ae", and the OS shouldn't display one as the other. However, while there is an fi ligature, OS X does go through your text and ligaturise fi when possible (in fonts with suitable glyphs). There used to be a bug where moving your cursor over a fi would move across the whole ligature - now it treats it properly, and the cursor sits between the f and i, in the middle of a glyph. A glyph is not necessarily the same as a character. For another example, look at Arabic - the OS has to do a lot of work to convert from characters to the right glyphs. If text looks wrong in MS Office, then you either need better fonts or a better word processor. Elektron 19:35, 24 August 2007 (UTC)
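The கொ example can be sketched in Python as follows. This is only an illustration under stated assumptions: the "visual" string uses Unicode code points as stand-ins for the glyph order described for TSCII (it is not real TSCII byte data), and the constant names are invented for this sketch.

```python
import unicodedata

KA = "\u0B95"            # TAMIL LETTER KA
SIGN_E = "\u0BC6"        # TAMIL VOWEL SIGN E (drawn to the left of the consonant)
SIGN_AA = "\u0BBE"       # TAMIL VOWEL SIGN AA (drawn to the right)
SIGN_O = "\u0BCA"        # TAMIL VOWEL SIGN O (drawn on both sides)

logical = KA + SIGN_O                 # Unicode order: consonant, then vowel sign
visual = SIGN_E + KA + SIGN_AA        # glyph order, left to right on screen

# Canonical decomposition splits VOWEL SIGN O into E + AA, but keeps both
# parts after the consonant, so the stored order stays logical.
decomposed = unicodedata.normalize("NFD", logical)

print([unicodedata.name(ch) for ch in decomposed])
print(decomposed == KA + SIGN_E + SIGN_AA)   # True
print(decomposed == visual)                  # False
```

In other words, the reordering of the left-hand part to the front is left to the renderer in Unicode, whereas a visual-order encoding stores it that way.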

Issues section - references are 404'd

In the Issues section, both the [1] and [2] references ('alternatives to Unicode' and 'Thai problems in collation') link to the same dead page at IBM.

WGL-4, MES-1 and MES-2 table

Can something be done to improve this table? The reader is left to figure out for themselves the correlation between the bolding and italicising and which codepoint ranges are included in the subsets. I presume that bold means it is in WGL-4, italics that it is in MES-1 (actually, there don't appear to be any examples of this), bold italics that it is in both WGL-4 and MES-1 and that all mentioned codepoint ranges are included in MES-2. Is this right? I'm still not sure why in the F0 line, 01-02 are given in parentheses. Perhaps there is another scheme (like using colours) to make all this clearer (and not quite so ugly)? Is there a particular reason the table was forced into a different font to the rest of the article? And finally, the notes [1] and [2] in the title don't seem to do anything (their content seems to only appear when you edit them). (I note from the history that the table was originally inserted by User:Crissov back in April.) Thylacoleo 03:06, 23 August 2006 (UTC)

I've tried to improve the lead-in text and the table heading. Comments, corrections and (especially) improvements are welcome. Cheers, CWC 08:43, 20 March 2007 (UTC)

I'm not sure why this table is even here at all - yay, some old piece of software supports their own proprietary set of extensions (intensions?) of Unicode. What value does it serve here? Shouldn't it be in the NT article or WGL or MES one or something? --moof (talk) 07:10, 26 December 2007 (UTC)

Input methods???

I'm tempted to delete the entire section "Input methods". They are essentially unrelated to Unicode. --Pjacobi 22:33, 25 August 2006 (UTC)

OS List

Would a more comprehensive listing of operating systems be of benefit?

I don't think so. The question of what it means to support Unicode is hard, and all recent OSs support Unicode in some way, with the exception of some very low-level stuff.--Prosfilaes 23:59, 7 September 2006 (UTC)

It would definitely be of benefit. If someone could make a comparison between different OSs and how they support (or claim to support) unicode, that would definitely be interesting. However, I see the potential for a lot of details, so it may be better to dedicate a page to it. If there is someone prepared to collect the information, of course. Mlewan 04:32, 8 September 2006 (UTC)

I certainly wouldn't want to clutter up the article, but different OSes have varying degrees of being Unicode-enabled. Maybe a separate article that had a comparison and notes would be of some benefit.

Translating HP fonts to Unicode

Considering how HP is a party to the Consortium that attempts to be responsible for Unicode, maybe they can appear out of the blue, pipe up, and explain a simple way to translate my custom-designed HP laserfonts (originally composed in 1988 or 1989) to Unicode? (No, I don't use a PC and I don't use a Mac, and admit I am dealing with a fairly ordinary 68000 environment contained (and accessed) in a non-FAT, non-PC filesystem.)

The information here at Wikipedia is simply not explicit enough for me to translate my laserfonts to Unicode. There's got to be a way, but I need a lot more information than what is currently in the main article. And I shouldn't have to buy a PC running under Windows just to see the Unicodes.

Doesn't Unicode worry about the depths of margins into the bitmaps, or the heights and widths of the relevant data of the bitmaps?

Somebody should put together an article about the way Unicode stores its data. —The preceding unsigned comment was added by 198.177.27.18 (talk • contribs) 06:47, 11 October 2006 (UTC)

Thank you for your suggestion! When you feel an article needs improvement, please feel free to look up the details elsewhere, and make those changes. Wikipedia is a wiki, so anyone can edit almost any article by simply following the Edit this page link at the top. You don't even need to log in (although there are many reasons why you might want to). The Wikipedia community encourages you to be bold in updating pages. Don't worry too much about making honest mistakes — they're likely to be found and corrected quickly. If you're not sure how editing works, check out how to edit a page, or use the sandbox to try out your editing skills. New contributors are always welcome. Chovain 08:01, 11 October 2006 (UTC)

I think you have something of a disconnect.... a font is used to represent characters, and Unicode is used to identify characters. If your font is not very fancy/intelligent/complicated/convoluted, and you know which character set your font used to represent, you can probably find a conversion table from unicode.org and move the characters around until they make sense for Unicode input. But Unicode isn't a mechanism for representing fonts. --Alvestrand 10:40, 11 October 2006 (UTC)

A typical HP font from 1988 or 1989 allows dynamic re-editing of fonts by encouraging the user to specify particular, individual characters, and then rewriting the data (having first successfully identified the font), so there is a great similarity between locating a character in a set of HP fonts, and locating a character defined by the Unicode consortium. —The preceding unsigned comment was added by 198.177.27.20 (talk • contribs) 21:08, 11 October 2006 (UTC)

You were talking about bitmaps and 'the way Unicode stores its data'. Unicode, however, is not a font specification. It specifies a mapping of characters to codes, and a few other things, like character properties, case-folding algorithms, and the like. All this data should be findable at the Unicode website. You need to identify a font format that supports Unicode. JamesFox 22:43, 11 October 2006 (UTC)
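To make "character properties" and "case folding" a little more concrete, here is a minimal Python sketch using only the standard library. The choice of example characters is arbitrary, and nothing here has anything to do with glyph bitmaps or font metrics.

```python
import unicodedata

ch = "\u00E9"  # é

# A few of the per-character properties that come from the Unicode data files.
properties = {
    "code point": f"U+{ord(ch):04X}",
    "name": unicodedata.name(ch),
    "general category": unicodedata.category(ch),
}
print(properties)

# Case folding, e.g. for caseless comparison: German sharp s folds to "ss".
folded = "Straße".casefold()
print(folded)
```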

I need a bit of help...

I've been searching all over on how to update my PC's unicode registry, as very few non-keyboard characters appear as anything other than boxes, but to no avail. Can anyone tell me how I do this? (Also, might be something to incorporate into the article) --Nintendorulez talk 20:58, 21 October 2006 (UTC)

I never heard of a PC's "unicode registry". Usually when unusual characters appear as boxes the problems are in the font or the application settings or the document itself. Which characters do you look for in particular? Chinese? Russian? Which application do they not appear in? Internet Explorer? MS Word? Firefox? What kind of documents have you tried? HTML? Text? Word? Which operating system do you have? Windows? MacOS X? Linux? Mlewan 22:02, 21 October 2006 (UTC)

Redrafted "Origin and development" section

I went to tidy up a recent edit to the "Origin and development" section of the article, and ended up rewriting the whole section. Here's what I came up with.

By the late 1980s, character encodings were available for many of the world's writing systems. However, those encodings are mutually incompatible and most are only useful in particular regions of the world. Moreover, writing software that can use multiple encodings is quite difficult. (For instance, in ISO/IEC 2022, the meaning of a byte depends on all the bytes preceding it.)

Unicode was developed as a solution to this problem: a single character encoding that includes all existing encodings, covers the whole world, and can encode texts in which arbitrary writing systems are mixed together.

The Unicode standard also includes a number of related items, such as character properties, text normalisation forms, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).

===Design Principles===

Unicode encodes the underlying characters — graphemes and grapheme-like units — rather than the variant glyphs (renderings) for such characters. It assigns a unique code point — a number, not a glyph — for each character. In other words, Unicode represents a character in an abstract way, and leaves the visual rendering (size, shape, font or style) to other software, such as a web browser or word processor. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).

To encourage the use of Unicode, its designers also emphasised round-trip compatibility: any text in a standard encoding can be translated to Unicode and back with no loss of information. For instance, the first 256 code points were made identical to the content of ISO 8859-1, making it trivial to convert most existing western text.

The tension between these two goals has required some compromises. In many cases, essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings, for the sake of round-trip compatibility. For example, the "fullwidth forms" section of code points encompasses a full Latin alphabet that is separate from the main Latin alphabet section. In Chinese, Japanese, and Korean (CJK) fonts, these characters are rendered at the same width as CJK ideographs rather than at half the width. For other examples, see Duplicate characters in Unicode.

Also, while Unicode allows for combining characters, it also contains precomposed versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler and allow applications to use Unicode as an internal text format without having to implement combining characters. For example é can be represented in Unicode as U+0065 (Latin small letter e) followed by U+0301 (combining acute) but it can also be represented as the precomposed character U+00E9 (Latin small letter e with acute).

The main changes I'm aware of (it's late here) are the mentions of ISO/IEC 2022 and round-trip compatibility. Does anyone think this should go in the article? Cheers, CWC(talk) 17:50, 27 October 2006 (UTC)
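For anyone who wants to check the three compatibility points in the draft (ISO 8859-1 round trip, fullwidth duplicates, precomposed é), here is a small Python sketch. It relies only on the standard library; the variable names are invented for this illustration, and it is a sanity check rather than anything taken from the standard itself.

```python
import unicodedata

# 1. Round trip: every ISO 8859-1 byte maps to the code point with the same
#    number, so decoding and re-encoding loses nothing.
latin1_bytes = bytes(range(256))
text = latin1_bytes.decode("iso-8859-1")
round_trip_ok = text.encode("iso-8859-1") == latin1_bytes
same_numbers = all(ord(ch) == b for ch, b in zip(text, latin1_bytes))

# 2. Duplicate encodings: fullwidth A is a separate code point from A;
#    compatibility normalisation (NFKC) maps it back to the ordinary letter.
fullwidth_a = "\uFF21"
folded_a = unicodedata.normalize("NFKC", fullwidth_a)

# 3. Precomposed versus combining: both spellings of é are canonically
#    equivalent and normalise to the same string.
combining = "\u0065\u0301"
precomposed = "\u00E9"
equivalent = unicodedata.normalize("NFC", combining) == precomposed

print(round_trip_ok, same_numbers, folded_a, equivalent)
```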

UCS-2 obsolete?

Can you really call UCS-2 "obsolete"? To me, "obsolete" means something that no longer is used. However MS SQL Server still uses UCS-2 internally, and that means that a lot of us use it indirectly every day. Our bank may use it, our HR System that pays our salary, some of our favourite web sites... Mlewan 06:28, 4 November 2006 (UTC)

No, what Microsoft and many others use is officially UTF-16, see that article and the UCS-2 article. --Red King 11:52, 4 November 2006 (UTC)

Obsolete doesn't mean it's no longer used. There are still programs written for systems so obsolete they are being emulated on an emulator written for a system that is itself obsolete and thus being run on an emulated system on real hardware. It means that a replacement has come out and that it is no longer being supported and use of it, especially in new programs, is discouraged. That applies to UCS-2; Unicode has been strongly discouraging use of it for years.--Prosfilaes 13:28, 4 November 2006 (UTC)

From wiktionary: "no longer in use; gone into disuse; disused or neglected (often by preference for something newer, which replaces the subject)."

You can say as much as you want that that is not what it means to you, but it does to me, and probably to a fair number of other people. If I read that something is obsolete, I assume that there is no need to learn anything about it. However, as MS uses it, I have to learn what it is and what restrictions come with it. Besides, I guess that they deliberately have not moved to UTF-16 for some reason - perhaps indexing and performance - and then UCS-2 even has some benefits over UTF-16. Mlewan 14:22, 4 November 2006 (UTC)

Red King, the linked-to article says nothing about UTF-16, and hopefully Microsoft could be trusted to make the distinction. Unfortunately, many programmers, apparently including the ones working on MS SQL Server, don't feel the need to move from UCS-2 to UTF-16, which means that UTF-16 can usually be used masquerading as UCS-2, but the program won't properly handle the difference.--Prosfilaes 13:28, 4 November 2006 (UTC)

Remember that UCS-2 is a subset of UTF-16. Software that uses UCS-2 will automatically handle UTF-16 correctly as long as it doesn't do character-specific processing like normalization, collation, rendering, word-splitting etc. These days, support for characters outside the BMP (especially the Supplementary Ideographic Plane) is almost mandatory, so I'd be really surprised if MS SQL Server does not handle UTF-16. (I'd be a lot less surprised if some Microsoft technical writers weren't up to speed on the difference between UTF-16 and UCS-2.)

There is an important piece of software that fully supports UCS-2 but is clumsier with UTF-16: Java. Not the implementations, the language itself! Java was designed before surrogates were added to Unicode. Bad timing, that.

Regards, CWC(talk) 03:02, 5 November 2006 (UTC)

The article from MS, which I start this section with, is by no means the only reference to it. Also see blogs.msdn.com or just Google for it. I think everyone who hears about this thing for the first time is surprised, but it nevertheless seems a fact that MS Sql Server uses UCS-2. And MySQL recommends UCS-2 over UTF-8 in some situations, as their implementation of UTF-8 does not support the supplementary plane either.

I saw that thing with Java as well, and was equally surprised. So one of the world's most used programming languages and (at least) two of the world's most used databases still choose UCS-2 over alternatives in some situations.

I think the word "obsolete" does not describe the situation correctly. Mlewan 07:25, 5 November 2006 (UTC)

Really the only bits of software that need to consider UTF-16 surrogates as a special case are those that deal with actually rendering the text and those that convert to/from other encodings. As far as everything else is concerned surrogates are just 16 bit words like any other. Is there any evidence that those data types can't store surrogates and if so where is that evidence? Plugwash 17:11, 5 November 2006 (UTC)

Things that count characters and anything that treats the text as more than a black box needs to understand surrogates.--Prosfilaes 13:29, 6 November 2006 (UTC)

I'm not sure I understand the purpose of Plugwash's question in this context, but in addition to Prosfilaes's answer, there is sorting and finding text - both things a database server is supposed to be able to do. Mlewan 13:56, 6 November 2006 (UTC)

Code written to store and find UCS-2 would work on UTF-16 if it didn't barf on the reserved codepoints that are now used for surrogates. OTOH, string lengths, collated sorts, code conversions, etc would all fail badly. In practice, characters outside the BMP are relatively rare at present. I guess that's why Microsoft and MySQL don't see a cost/benefit advantage in supporting them. CWC(talk) 18:08, 6 November 2006 (UTC)
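To illustrate why string lengths differ between a UCS-2 view and a UTF-16 view, here is a short Python sketch of how a code point outside the BMP is split into a surrogate pair. The helper name `to_utf16_units` is made up for this example, and it deliberately skips validation of the input.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units for one code point (no validation)."""
    if code_point < 0x10000:
        return [code_point]                    # BMP: one unit, same as UCS-2
    offset = code_point - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)             # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)            # bottom 10 bits
    return [high, low]


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
units = to_utf16_units(ord(clef))

print([f"{u:04X}" for u in units])           # ['D834', 'DD1E']
print(len(clef))                              # 1 code point
print(len(clef.encode("utf-16-be")) // 2)     # 2 code units
```

Code that simply stores and copies the two units never notices the difference; code that counts "characters" by counting units reports 2 here.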

The Java situation: Character handling in J2SE 5 is based on version 4.0 of the Unicode standard. This includes support for supplementary characters, which has been specified by the JSR 204 expert group and implemented throughout the JDK. See the article Supplementary Characters in the Java Platform, the Java Specification Request 204 or the Character class documentation for more information.

http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp

The Microsoft OS situation

Windows 2000 introduced support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters. Also, supplementary characters are not supported in Windows 95/98/Me.

http://windowssdk.msdn.microsoft.com/en-us/library/ms776414.aspx

The MS SQL server situation

Since these characters’ surrogate pairs are considered two separate Unicode code points, the size of nvarchar(n) needs to be 2 to hold a single supplementary character (i.e. space for a surrogate pair)
String operations are not supplementary character aware. Thus operations such as Substring(nvarchar(2),1,1) will result in only the high surrogate of the supplementary characters surrogate pair. Also the Len operation will return the count of two characters for every supplementary character encountered – one for the high surrogate and one for the low surrogate.
In sorting and searching, all supplementary characters compare equal to all other supplementary characters

http://www.microsoft.com/globaldev/DrIntl/columns/021/default.mspx#EHD

Pjacobi 18:37, 6 November 2006 (UTC)
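The length and substring behaviour quoted above for SQL Server can be reproduced in a few lines. This is only a rough sketch in Python, not SQL Server code: the helper names `utf16_units` and `is_high_surrogate` are made up for illustration, and the sample string is an arbitrary choice.

```python
def utf16_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    """True if the 16-bit unit is a high (leading) surrogate."""
    return 0xD800 <= unit <= 0xDBFF


# 'a' followed by U+10400, a supplementary-plane character.
sample = "a\U00010400"

units = utf16_units(sample)
code_point_count = len(sample)  # Python strings count code points
unit_count = len(units)         # what a UCS-2 style length function reports

# A naive "first two units" substring ends in a lone high surrogate.
naive_prefix = units[:2]
prefix_is_broken = is_high_surrogate(naive_prefix[-1])
```

Here the code-unit count is one higher than the code-point count, and the naive prefix cuts the surrogate pair in half, which matches the Len and Substring behaviour described in the quoted text.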

"OTOH, string lengths, collated sorts, code conversions, etc would all fail badly"

Let's go through these one at a time.

A simple concept of string length is dead in the water with Unicode anyway. Most of the time, unless you are writing a display engine, the number of units in memory is the main thing you need to be concerned with.

A sort based on 16-bit word values will, when applied to UTF-16, produce an order that is different from, but not in any obvious way worse than, say, a sort that orders supplementary codepoints by codepoint number.

Code conversions in and out of the 16-bit format are indeed one of the main things that need to be changed (the other main one being the rendering engine) for workable support of supplementary characters. Plugwash 20:11, 6 November 2006 (UTC)
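For readers unfamiliar with what such a conversion involves, the arithmetic for mapping a supplementary codepoint to a surrogate pair and back is short. A minimal sketch in Python follows; the function names `to_surrogate_pair` and `from_surrogate_pair` are illustrative only and not taken from any library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) as an example.
pair = to_surrogate_pair(0x1D11E)
round_trip = from_surrogate_pair(*pair)
```

A converter that was written for UCS-2 and passes each 16-bit unit through on its own would skip the second function and emit two invalid characters instead of one.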

Good points, Plugwash. (Actually, sorting on 16-bit word values is exactly equivalent to sorting by codepoint. The surrogate stuff is very well designed. The only bad thing about it is the name "surrogate", IMO.) OTOH, sorting by raw codepoint is very user-hostile, and locale-specific collated sorts written for UCS-2 will mess up on non-BMP codepoints. (Aside: the variety of rules different cultures use for sorting is quite striking.)

Talking about simple concepts of strings, not only is the concept of string length dead, the concept of a character is on its deathbed as well. Good programmers should no longer write code that treats strings as sequences of characters; instead, strings should be treated as sequences of codepoints (the low-level view) or sequences of graphemes (the medium level view) or sequences of higher-level units (words, lines, etc).
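A small illustration of the gap between codepoints and graphemes, as a sketch in Python using only the standard `unicodedata` module. The helper `approx_grapheme_count` is a made-up name and only a crude approximation (it counts codepoints that are not combining marks); it is not the full grapheme cluster algorithm of UAX #29.

```python
import unicodedata


def approx_grapheme_count(text):
    """Crude estimate: count code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "e\u0301"                      # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

decomposed_len = len(decomposed)            # code points before normalization
composed_len = len(composed)                # code points after NFC
decomposed_graphemes = approx_grapheme_count(decomposed)
```

The same user-perceived character is two codepoints in one form and one in the other, so neither count is "the" length of the string.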

This is why JSR 204 can get away with retaining char as a 16-bit type and storing non-BMP codepoints as a surrogate pair. Code that processes strings character by character has to be rewritten to use codePointAt and similar methods which JSR 204 added to java.lang.String and java.lang.StringBuffer, but it's better to use a higher-level ICU4J facility such as BreakIterator. (See also the brief rationale for JSR 204 in Supplementary Characters in the Java Platform.) The days when any competent programmer could write production-quality text-processing tools from scratch are over.

Thanks also to Pjacobi for those very useful links above.

Going back to the original question, my answer is that UCS-2 is obsolete (or at least becoming obsolete), but many systems written to store and (to a lesser extent) process UCS-2 text are not. Of course, this is a fairly narrow distinction.

Cheers, CWC(talk) 09:25, 12 November 2006 (UTC)

"Actually, sorting on 16-bit word values is exactly equivalent to sorting by codepoint"

Incorrect: sorting on 16-bit word values will put the supplementary characters before the characters in the range U+E000-U+FFFF.

As for locale-specific collated sorts written for UCS-2, I presume they will treat surrogates like any other characters from outside their locale and thereby provide a sort which behaves consistently but not necessarily in a user-friendly way. Plugwash 11:31, 12 November 2006 (UTC)
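The difference between the two orderings is easy to check. The following is a rough Python sketch with arbitrarily chosen sample strings; it relies on Python comparing strings by codepoint, and on the fact that comparing big-endian UTF-16 bytes gives the same order as comparing 16-bit word values.

```python
def utf16_sort_key(text):
    """Sort key equivalent to comparing 16-bit UTF-16 word values."""
    # Big-endian bytes compare in the same order as the 16-bit units.
    return text.encode("utf-16-be")


# One ASCII letter, two BMP characters above the surrogate range,
# and one supplementary character (stored as the pair D800 DC00).
strings = ["\uFF5E", "\U00010000", "\uE000", "A"]

by_code_point = sorted(strings)                     # code point order
by_code_unit = sorted(strings, key=utf16_sort_key)  # 16-bit word order
```

In codepoint order U+10000 comes last, while in 16-bit word order its leading surrogate 0xD800 places it ahead of U+E000 and U+FF5E.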

Quite true. My mistake. (IIRC, UTR10 does use codepoint as the "sort of last resort".) Cheers, CWC(talk) 12:12, 12 November 2006 (UTC)