Talk:Unicode/Archive 2


Sentence

"To address the short coming, Unicode is being revised periodically with the addition of more characters and increase in the size of characters potentially represented in unicode."

It's something of a moot point now, but in case it comes up in the future, the reason I cut that sentence is because it was inaccurate. They don't add more characters to address the shortcoming (one word) that people don't use Unicode; there are probably fewer than a hundred thousand people who would use any of the scripts that are going to be added to Unicode. And for several of the scripts, like Egyptian Hieroglyphics or Hungarian Runic or Tengwar, there's no commercial interest in the script, and there's little to no academic interest in encoding the script (the Egyptologist community has basically told Unicode to go away and come back in a few decades). Hobbyist demand for unencoded scripts isn't a huge shortcoming that Unicode is trying to overcome.

What does "increase in the size of characters potentially represented in unicode" mean? I assume by size, you mean number (since you can increase the size of characters just by using a larger font), but I'm not sure what "potentially" means here. As I read it, it's redundant with "addition of more characters". --Prosfilaes 03:38, 11 Dec 2004 (UTC)

The simplest representation of Unicode (giving every character the same number of bits, rather than a more complicated variable-width encoding) has historically increased from 16 bits to about 20 bits. There are (currently) about 2^20 "potential" characters. I suspect the original author expected that in the future, *more* than (roughly) 20 bits will be required, and that the consortium is planning to "periodically" increase the number of bits. --DavidCary 22:17, 11 Feb 2005 (UTC)
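For concreteness, a minimal sketch in Python of that arithmetic (the code-space limit U+10FFFF is standard Unicode today, and is assumed here rather than stated in the comment above):

    import math

    # Code points run from U+0000 to U+10FFFF: 17 planes of 2**16 code points each.
    code_points = 0x10FFFF + 1
    print(code_points)              # 1114112
    print(math.log2(code_points))   # ~20.09, i.e. "about 20 bits" per character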

The consortium doesn't plan to increase the number of bits. In 15 years, two planes of characters have almost been filled, out of 15. Just as importantly, those two planes include virtually every character used on a computer; a few people use Tengwar or pIqaD or Cuneiform or Egyptian hieroglyphics, but they're incredibly rare, and they amount to a few thousand characters, not the more than half a million it would take to require expansion. And honestly, if it were a matter of expanding for those or ignoring them, their concerns are minor enough, and the changes to every piece of Unicode software major enough, that I suspect they would get ignored. --Prosfilaes 00:30, 1 Jun 2005 (UTC)
It depends on exactly how you define "filled".
The BMP (plane 0) is basically full, mostly with fully allocated and standardised code points.
The SMP (plane 1) is mostly stuff in various stages of approval, but still has quite a bit of room marked as completely unknown (less than half, though).
The SIP (plane 2) is more than half filled by "CJK Unified Ideographs Extension B", and most of the rest is pencilled in for yet more CJK material.
The SSP (plane 14) is mostly empty right now.
IIRC, planes 15 and 16 are reserved for private use, but I'm not sure.
So if you count the areas that are pencilled in for future scripts, then a LOT more than 2 planes are in use.


I define filled as allocated. Pencilling in is meaningless; many of the pencilled-in scripts don't have anyone interested in formalizing them for Unicode. Everything in the SMP right now can fit in the unused space in the SIP. 128 characters are used in SSP, which is barely touched. Even with everything pencilled-in, only three planes would be filled, which doesn't qualify as a LOT more. --Prosfilaes 01:22, 27 Jun 2005 (UTC)
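As a minimal sketch of the plane arithmetic discussed above (Python; the plane of a code point is simply its value shifted right by 16 bits, and the abbreviations are the standard Unicode ones):

    # Map a code point to its plane: plane = cp >> 16.
    PLANES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP", 15: "PUA-A", 16: "PUA-B"}

    def plane_of(cp: int) -> str:
        plane = cp >> 16
        return PLANES.get(plane, "plane %d" % plane)

    print(plane_of(0x4E2D))   # BMP (a common CJK ideograph)
    print(plane_of(0x1D11E))  # SMP (a musical symbol)
    print(plane_of(0x20000))  # SIP (start of CJK Extension B)
    print(plane_of(0xE0001))  # SSP (a tag character)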

Chinese Punctuation

"Unicode also has a number of serious mistakes in the area of CJK punctuation. For example, it mistakenly treats partial punctuation marks in the various CJK encodings as full punctuation marks, for instance treating half of a CJK ellipsis as the same as an English ellipsis, even though the two glyphs are both semantically and visually dissimilar (considering that the CJK ellipsis can be centred between the baseline and ascender, but the English ellipsis must always be placed on the baseline)." --Gniw 06:53, 6 Feb 2005 (added to article)

This page should not be a page of everyone's minor complaints about Unicode. I've read the Unicode list for four or five years, I've read the Standard, I've read both pro- and anti-Unicode pages (including all the Tron pages in English, and they include about every general or Japanese-specific Unicode complaint possible) and I've never heard this before. Given that it seems to be one person's complaint, I don't think it's worthy of being added to an encyclopedia article. --Prosfilaes 21:48, 6 Feb 2005 (UTC)

This is not a minor complaint if you do bilingual typesetting or write bilingual (Chinese and English) web pages. The ellipsis misidentification in Unicode results in very ugly mixed English-and-Chinese web pages. But given the sad state of punctuation typesetting taught at art schools these days, and the way English computing has changed Chinese typesetting, I'm not surprised that no one has talked about this. Ah Wing 22:49, 9 Feb 2005 (UTC)
I stand by my position. This is an encyclopedia, not a list of what's wrong with Unicode. If there are no English pages on the issue, then most of the people who could fix the issue have never heard of it; and if no one has ever seen fit to bring it before them, I hardly see it as a major issue. I wouldn't post bug reports about a program on Wikipedia, so I don't see this as appropriate.
But please, if someone else has an opinion on this, please chime in.--Prosfilaes 03:45, 11 Feb 2005 (UTC)
Why isn't this a big issue? The triviality of this is precisely the reason it is important; it shows that Unicode has mistakes that even primary school students should be able to spot, yet here it is in the standard. This just shows how sloppy Unicode is regarding CJK.
Do you really think that if the people who are likely to be affected by the issue have mentioned it, and the discussion happens not to be in English, then it is not an issue?!
What you mean is "the use of English is a requirement for an issue to be recognized as an issue", or "no matter whether people have discussed it or not, if it has never been discussed in English then it cannot possibly be an issue". Or, in short, "English is the measure of all things". If this is not Western imperialism, I don't know what is. And you don't understand why the Japanese are opposed to Unicode? Opposition to Unicode is not really so much a technical problem as a perception of a lack of respect; the fact that my contribution was deleted on New Year says a lot. 24.101.156.72 19:18, 11 Feb 2005 (UTC)
If a Chinese encyclopedia wrote an article complaining about some problem in the English Wikipedia, and they never mentioned it to anyone who could fix it, we'd be a little pissed. Bring the issue before us, and if we choose not to fix it, then there's a valid complaint, but we can't fix what we don't know about. If it doesn't matter enough to bring it to the people who can fix it, or the people discussing it don't respect the standard enough to try and fix it, it's not an important issue.
I think it says a lot that you're not discussing the issue; you're complaining about imperialism and that somehow people shouldn't correct articles on holidays. I will repeat again: this is a thirty-year-old problem made by Chinese standards. You can't do better using Big5 or any other Chinese standard. Which says a lot to me about the importance of the problem.
While we're on the subject of "Western imperialism", I will note that the US-based Summer Institute of Linguistics and the Ireland-based Michael Everson have been instrumental in getting new scripts (e.g. several Philippine scripts like Buhid) into the standard, while the Japanese standards body sent a letter to the ISO working group asking for such new standards efforts to cease. Such accusations are insulting and provably inaccurate. --Prosfilaes 23:46, 11 Feb 2005 (UTC)
Thank you for saying so, Prosfilaes. Buhid was years ago, though. Recent scripts I worked on encoding include N'Ko, Vai, Balinese, Cuneiform, Phoenician and lots more. And the letter from JIS was also years ago, and JIS is now Secretariat for SC2, so they're hardly asking to stop the work from proceeding. Evertype 16:32, 12 February 2006 (UTC)
Excuse me. Do you know what a "double byte character set" is? Big5 (as well as GB, EUC-KR, EUC-JP, and Shift JIS) is a DBCS, and by the very nature of a DBCS, you can't encode a whole CJK ellipsis. We have to encode half of the ellipsis. Now when the Unicode committee look at the CJK national character sets and decide that half a CJK ellipsis is equal to a full English ellipsis, that is incredible sloppiness. This is not a "thirty year old problem made by Chinese standards" in the context of Unicode.
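A minimal sketch of the encoding situation being described, assuming Python's built-in big5 codec (which maps the Big5 double-byte character 0xA14B to U+2026, the very unification being criticized):

    # A full CJK ellipsis is written with two of the three-dot halves; each half
    # is one double-byte Big5 character, and Unicode maps each half to U+2026
    # HORIZONTAL ELLIPSIS, the same code point as the English ellipsis.
    cjk_ellipsis = "\u2026\u2026"       # one CJK punctuation mark, two code points
    print(len(cjk_ellipsis))            # 2
    print(cjk_ellipsis.encode("big5"))  # b'\xa1K\xa1K', i.e. 0xA14B twice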
If you have a problem, write up a disunification paper, give evidence, and submit it to the UTC or through your ISO National Body member in SC2. Badmouthing Unicode on this talk page will accomplish nothing. Evertype 16:32, 12 February 2006 (UTC)
And how do you want me to discuss the issue? When whatever I write will simply get deleted. 66.163.1.120 00:05, 12 Feb 2005 (UTC)
It's not incredible sloppiness. It's a unification decision that had some negative side effects. (And we could discuss the incredible sloppiness involved in assuming that every non-ASCII character was double-width, one that still sometimes plagues Russians who get the pleasure of dealing with double-width Cyrillic.) And I want you to discuss it here, on the talk page, instead of making changes on the main page, until some sort of consensus is reached. (And I'd really like a third party to chime in.) --Prosfilaes 01:32, 12 Feb 2005 (UTC)
I cannot understand why this is not sloppiness. The two are completely different. As I originally wrote, (1) they are different in form (the CJK ellipsis can be set on the baseline, or between the baseline and the ascender; the English ellipsis can only be set on the baseline), and (2) they are different in meaning (two "ideographic three dot leader"s, as some Japanese people think it should be called, are required to make one true ellipsis, and the leader itself is meaningless; one "horizontal ellipsis" (U+2026) is meaningful by itself). The two cannot be unified, no matter whether unification is considered to be based on form or on meaning.
OK, you might argue that this only means they were unable to spot the differences. But they put so much effort into distinguishing between almost-indistinguishable variations in ideogram forms (many are really typographic stylistic variations that unfortunately came to be associated with different countries); not making a comparable effort to distinguish these two glyphs certainly sounds extremely strange. Had they even checked the punctuation sections of a Chinese or Japanese dictionary, they would have realized that the "ideographic three dot leader" is not itself a punctuation mark. And this has the added benefit that dictionaries usually set the ellipsis between the baseline and the ascender, so they would simultaneously have realized that the two are different in form. In short, there is simply no basis for "unification": yet they got "unified". Aside from "incredible sloppiness" I really cannot explain this.
(I do accept that Unicode unifications are sometimes based on form, though I think this is contrary to the spirit of Unicode unifications. I personally don't like the CJK unification myself, and you won't understand why I feel this way until you try to work on a Unicode font yourself. But if you ask for my objections to unification decisions, I'll say the unification of the umlaut and the diaeresis really makes no sense, considering they dis-unify a lot of other things (I'm talking about Western script, not CJK) that look 100% identical. In the case of the CJK vs English ellipses, form is not even a question, since they are different in form.)
I do agree with the double-width mess. For us the opposite problem occurs, that all the box-drawing characters become single-width, making Unicode almost useless in terminal emulators if box-drawing characters are to appear anywhere. --Wing 03:45, 12 Feb 2005 (UTC)
First, I stand by my point: for 15 years, this unification has stood, and no one has complained to Unicode. For probably ten of those years, there would have been no problem disunifying the characters, yet not a single standards body made the request. If they were so completely inappropriately unified, there has been incredible sloppiness and apathy on the part of the users of the affected scripts.
You make too many assumptions about what I do and don't understand. I believe I understand the reasons why people disagree with CJK unification, and seriously doubt that making a font would make a bit of difference. The whole question is whether the difference is a difference in preferred fonts or a difference in script.
You are apparently a splitter. Besides the fundamental backward compatibility problems, I can't imagine trying to explain to the people at Distributed Proofreaders that coöperate uses a different ö from Köln. Splitting these would cause a world of pain to the advantage of a few librarians. In any case, the various opinions on when to split and when to unify are a much more general and interesting topic to add to the page. --Prosfilaes 00:31, 13 Feb 2005 (UTC)
Well, I think I am correct in assuming that you have never worked on a Unicode font. Before I attempted to work on a Unicode font some time ago, I thought just like you (being content with the state of the Han unification).
In the current state of the Han unification, there are many characters that are not unified. However, after adding a radical, the new characters are all unified.
If I want to make one Unicode font containing all the ideograms (not an unreasonable thing, since making such a font requires so much effort), which style should I choose? If adding the radicals did not make the new characters unified, I'd be all happy too (it would just mean that all variants are distinguished, as opposed to variants not being distinguished); as it is, no matter which style I choose, I end up with a font that is wrong.
Regarding the ellipsis itself, it is not a difference in font. Would you consider an ellipsis-like glyph that is raised above the baseline (to about x-height) suitable for typesetting English? From your viewpoint, this is exactly what unification of U+2026 and the hypothetical "ideographic three dot leader" means.
In a sense, the mis-unification of the ellipsis and the "ideographic three dot leader" can be thought of as equivalent to the problem of having full-width Cyrillic letters (in that both mistakenly equate a glyph that's only appropriate in C/J/K with an incompatible Western glyph). If you find full-width Cyrillic letters unacceptable and "incredible sloppiness", I fail to understand why an ellipsis raised to x-height for English is acceptable or is not the result of sloppiness.
I would not object to your saying we have "incredible apathy" regarding Unicode. We have already acquired "incredible apathy" after using the suboptimal national character sets for so long; and many of our typesetting and/or punctuation conventions have been destroyed by Western-centric computing for so long. (Can you imagine: just about ten years ago, even Westerners knew that in C/J/K numbers should be grouped by myriads, but now many Chinese do not even know this, and instead group digits by thousands and then laboriously count the digits every time a large number is read… And many Chinese are so used to Western-style underlining that they are now desensitized to the grammatical mistakes they make every time they underline Chinese words that are not proper names…) I definitely think that this is pathetic enough, and there is no need for Unicode to make this kind of mistake to further worsen the situation.
I am not saying that the knowledge of proper punctuation has not deteriorated in the West; but at least the deterioration has not been codified into an international standard (unless I count this ellipsis mis-unification)… --Wing 04:30, 13 Feb 2005 (UTC)
PS: Perhaps there is; other than this ellipsis thing, there is also this hyphen-dash confusion. It seems to be just as bad…
AFAICT the hyphen-dash issue comes from the fact that ASCII and other encodings of its era came from the days when characters on computers were fixed width. Given that, and the limited number of code values available in ASCII, it seemed totally reasonable to unify the hyphens, dashes, and minus signs. There was also the unification of beta and sharp s in IBM code page 437. Plugwash 02:46, 1 Jun 2005 (UTC)
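For reference, Unicode does dis-unify what ASCII crammed into a single code value; a quick Python listing of the distinct code points (the character names come straight from the Unicode Character Database):

    import unicodedata

    # Distinct hyphen/dash/minus code points that ASCII unified as 0x2D.
    for ch in "\u002D\u2010\u2012\u2013\u2014\u2212":
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
    # U+002D HYPHEN-MINUS
    # U+2010 HYPHEN
    # U+2012 FIGURE DASH
    # U+2013 EN DASH
    # U+2014 EM DASH
    # U+2212 MINUS SIGN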

Revision history year-wikilinks

The year wikilinks in the revisions list are a little confusing; I clicked through thinking I was going to be led to that particular revision, but found myself on a general-year page. Could you reconsider these links please? Thanks. Courtland

Good luck in changing this deeply entrenched policy of irrelevantly wikilinking each and every year number. --84.188.146.200 02:58, 9 February 2006 (UTC)
There has recently been some movement away from this, with some people doing mass unlinking of years. But regardless of that, the years can be linked to something else; the people who link every year don't seem to mind where the link goes, just that the years are nice and blue. Qutezuce 03:20, 9 February 2006 (UTC)
Is there a policy on year-linking, or a place where this is being discussed? I can see arguments on both sides of the fence - especially in historical articles, I think it's fun to click on years and find "what else happened then". But since the linking of years is done so much, it seems wise to include more text in the link than the year if you want to link something else. --Alvestrand 04:02, 9 February 2006 (UTC)
The issue is talked about on Wikipedia:Only make links that are relevant to the context. Qutezuce 04:16, 9 February 2006 (UTC)

Unicode adoption in e-mail

The adoption of Unicode in e-mail has been very slow. Most East-Asian text is still encoded in a local encoding such as ISO-2022-JP, and many commonly used e-mail programs still cannot handle Unicode data correctly. This situation is not expected to change in the foreseeable future.

This doesn't look like an accurate picture to me. Mac OS X's default Mail.app client has transparently supported Unicode since 2001. Didn't Windows 95's Internet Mail and News or Outlook Express have Unicode support even earlier? I don't know how widely used Unicode is, but hasn't it been very widely supported for years? Michael Z. 2005-04-12 21:20 Z

Keep in mind that the fact that some programs support Unicode does not mean they can handle text encoded in Unicode correctly. The situation may have changed since then, but I used to hear that you should not send mail in Unicode because many programs have problems with it. I heard a report that even Gmail does not correctly handle the subjects of e-mails. More research would certainly help, but I don't think the above is far from reality. -- Taku 02:35, Apr 13, 2005 (UTC)
The situation is changing all the time - for one thing, Outlook now seems to have debugged most of its Unicode support. I've been sending email in UTF-8 for a year or so, and very few people report problems with it. Of course, it helps that most of my mail is in English, and the rest is in Norwegian.... some systems handle UTF-8 OK, but only if the output is within the Latin1 charset, for instance.
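For anyone curious what the two encodings under discussion actually look like on the wire, a minimal Python sketch (codec names as spelled in Python's standard library):

    # The same Japanese text in the legacy mail encoding and in UTF-8.
    text = "日本語"
    print(text.encode("iso2022_jp"))  # escape-sequence based, 7-bit safe
    print(text.encode("utf-8"))       # b'\xe6\x97\xa5...', 3 bytes per character here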

Input methods

On Windows XP, any Unicode character can be input by pressing Alt, then, with Alt down (and using only the numeric keypad keys), pressing the decimal digits of the Unicode characters one after the other. For example, Alt, then, with Alt still down, 9, then 6 and then 0 yields π (Greek lowercase letter Pi). For values less than 256, precede the digits with a 0, to avoid code page translation (see Extended ASCII), e.g. Alt 0, 1, 6, 5 yields ¥.

This just doesn't work when I try it. Pressing Alt-9-6-0 gives me └, which appears to be "Box Drawings Light Up And Right", character x2514/9,492 (└). However, Alt-0-x-x-x does work for me and always has (I can get the yen symbol fine). Does this statement need correction or clarification? —Simetrical (talk) 01:57, 8 May 2005 (UTC)

Forgot to mention, I do use Windows XP, English-language SP 2 to be precise. —Simetrical (talk) 02:31, 8 May 2005 (UTC)

I use WinXP, Spanish-language SP2, and it does not work for me, either. Nor does it work for anyone I know who uses WinXP. By the way, the character '└' can also be obtained by pressing Alt+192; moreover, I have found that under WinXP, Alt+number produces the same output as Alt+(number modulo 256) (provided that any zeroes before the original number are preserved). So, Alt+289 produces '!', Alt+416 produces 'á', and Alt+0416 produces ' ', the non-breaking space.
I think that paragraph should be removed. --Fibonacci 21:53, 21 May 2005 (UTC)
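The modulo behaviour described above can be reproduced outside Windows; a sketch assuming Python, and assuming the usual pairing on English systems of cp437 as the OEM code page (three-digit Alt codes) and cp1252 as the ANSI code page (Alt codes with a leading zero):

    # Alt+416 behaves like Alt+(416 % 256) = Alt+160, looked up in the OEM code page.
    print(bytes([416 % 256]).decode("cp437"))  # 'á', matching Alt+416
    # Alt+0160 is looked up in the ANSI code page instead.
    print(bytes([160]).decode("cp1252"))       # '\xa0', a no-break space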
It seems to depend on the edit control in use. Stuff that uses the standard edit control (e.g. Notepad) doesn't allow Unicode entry with Alt+numpad, whereas stuff that uses the standard richedit control (e.g. WordPad) does (tested on English WinXP non-SP2; not sure if it's original or SP1). Plugwash 22:37, 21 May 2005 (UTC)
The way I understand it, a four-digit or longer number enters the Unicode character. A three-digit number under 256 enters the character in the current code page, which I suppose would be Win CP-1252 for English and some European languages (don't know if that includes Spanish). It appears that three-digit numbers over 255 are processed with some funky math (Shouldn't numbers over 255 be Unicode? Can anyone think of a reason for using modulo-256 except programmer laziness?). Michael Z. 2005-05-25 17:45 Z
NO NO NO.
In apps that use the Windows EDIT control (i.e. Notepad) you CANNOT enter Unicode with Alt+numpad (unless the app makes special provisions, which some apps seem to do), and numbers entered with Alt+numpad are treated modulo 256 regardless of length.
In apps that use the Windows RICHEDIT control, numbers over 256 and all numbers of 4 digits or more are Unicode (for numbers like 052 the local code page matches Unicode anyway, so it's impossible to really tell).
Other apps that set up their own edit controls may behave differently again. Plugwash 18:40, 25 May 2005 (UTC)
In Windows (at least versions XP, 2000, 2003) you need to have in your registry, at HKEY_CURRENT_USER\Control Panel\Input Method, the value EnableHexNumpad set to "1". If you have to add it, set the type to REG_SZ. WARNING: Don't mess with the Windows registry unless you know what you are doing. The only problem is, how do you add a hex number like 39A from your numpad? The numpad doesn't include A-F, and keying 'A' (for example) invokes a menu entry since the ALT key is pressed. Any ideas on this one? EGT. 15:25 12 July 2005 (GMT+2)
You have to remember that Unicode values are in hex, so to enter a Unicode character with the number pad, use the decimal equivalent of the hex value. In the case of the Unicode value 39A, enter the decimal value 0922. Mr. McCoy 11:52 Jan 12 2006 (GMT -8)
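That conversion is plain hex-to-decimal; a short sketch in Python (U+039A is GREEK CAPITAL LETTER KAPPA):

    # To type U+039A via the decimal Alt-code method, convert hex 39A to decimal.
    cp = int("39A", 16)
    print(cp)       # 922, so type Alt+0922
    print(chr(cp))  # 'Κ'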

Another way to enter Unicode characters is Alt + PLUS SIGN + HEX CODE. It worked for me on Windows XP. --ʀʇʉʀɵ 23:57, 25 January 2007 (UTC)

The article says: "it is possible to create Unicode characters by pressing Alt + PLUS + #, where # represents the hexadecimal code point up to FFFF; for example, Alt + PLUS + F1 will produce the Unicode character ñ." But I can't get this to work on Win2000. In MS Word, I just get the most recently opened file from the Recent Files list. Does "+" mean the keys should be pressed sequentially or all together?

Nifty resource.

I found, at some point, a nifty resource for Unicode at fileformat.info. It has some rather decent tools for looking up individual codepoints, like U+0023 or U+20AC. Each page includes a browser test and font support info. Perhaps it would be useful to link U+F00F the same way we link PMID, ISBN and RFC IDs now. grendel|khan 16:50, 2005 May 25 (UTC)

-1. Not as long as they keep those ads running. An idea more in line with the Wikipedia ethos would be to link to the Wiktionary entries, like . However, this cannot be done consistently without some manual intervention for certain intercepted characters like "+", "]", etc., and the Wiktionary entries for things like € and Latin letters are not very exciting, if present at all. — mjb 1 July 2005 02:57 (UTC)
+1. I didn't realize there was a policy against linking to sites with ads. If there is enough interest, I'm sure we can find a way to get rid of the ads. Please let me know! Andrew M, FileFormat.Info author. 4 July 2005
There isn't such a policy, but of course we prefer resources with more usefulness and less advertising. The fileformat.info site seems pretty good. I would put it in right after the Letter Database, which seems to offer similar functionality without ads. Michael Z. 2005-07-5 04:21 Z
Of course there's no set policy, but the examples so far are leaning in that direction, and I'm sure I'm not the only one who prefers it that way. Traffic from Wikipedia and the sites that mirror its content would be a windfall for an ad-supported site; we should be very careful who we choose to "support" in this way. Rather than favoring one particular info/library/retail source for books, the automatic links on ISBN numbers go to a generated portal. Automatic RFC links to faqs.org are fairly innocuous, as well; RFCs are static documents and all that is done at faqs.org is some minor reformatting and hyperlinking. I think if faqs.org were to be using frames and ads like zvon.org, people would not be so happy about it, and would be more likely to favor linking directly to the plain text documents in the original IETF repository. So for character information, I want to see something equally neutral and encyclopedic. fileformat.info is good, but not thorough or encyclopedic enough, even without ads. — mjb 5 July 2005 06:44 (UTC)

I think there should be a lot more discussion about a character-linking strategy. When it comes to character information, the information sources Wikipedia, Wiktionary, fileformat.info, and the Letter Database are all great in their own way, but none of them are complete. For some characters, some sources are better than others. Wikipedia has great info on Latin script punctuation, Wiktionary has great entries for East Asian characters, fileformat.info has a lot more character set info than the Letter Database, and the Letter Database has some unique properties of its own, like language data. Compare the entry for 京 at Wiktionary, fileformat.info, and the Letter Database (ouch!).

Another complication is that what we call "characters" are actually a codified abstraction of graphemes and constructs of similar utility (control codes, zero-width joiners, and such); how might this affect what kind of information we want to link to? Take the Latin script for example: it has one hyphen grapheme, but Unicode has codified it as a half-dozen characters in order to accommodate different rendering behaviors, languages and legacy encodings. And East Asian scripts have other complications, as noted in Han unification. For example, decisions were made that can result in one logogram appearing at multiple code points depending on purpose, and similar logograms appearing at one code point but requiring sometimes substantially different renderings depending on language. So far, none of the info sources take any of this into account, although some of the cryptic Han data in fileformat.info might be indicative of a few of these properties, I'm not sure. In any case, I'd question whether it is sufficient to only provide "character" data when grapheme info may be useful more often, depending on whether the researcher is coming at it from a lay person / linguist's point of view, or from a programmer / computer professional's point of view. I suggest developing some kind of meta-article, along the line of the ISBN pages. — mjb 5 July 2005 06:44 (UTC)

There are two very different approaches to resource construction: manual (Wiktionary) and automatic (FileFormat.Info). The manual sources will always have more in-depth info, but the automatic will have better coverage (and, if both the site and data source are maintained, be more up to date). An automatic site is much easier to link to. The unicode information at FileFormat.Info is from the Unicode Character Database, the Java run-time, and the dotNet runtime. While I made a spot for per-character custom information, there isn't any since I'm not any sort of authority. I could add Wiktionary links if I could figure out a standard URL (or a standard and a list of exceptions). Note: I'm definitely link-whoring, but not because I'm hoping for a windfall: unicode searchers don't seem to be worth much in the advertising world. I'm willing to give up the ads. Andrew M. 5 July 2005
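Constructing such links is mostly a matter of percent-encoding the character; a hypothetical sketch (it assumes entries live directly at en.wiktionary.org/wiki/<character>, which holds for the examples discussed above but would still need the exception list mentioned earlier):

    from urllib.parse import quote

    # Build a Wiktionary URL for one character, percent-encoding its UTF-8 bytes.
    def wiktionary_url(ch: str) -> str:
        return "https://en.wiktionary.org/wiki/" + quote(ch)

    print(wiktionary_url("京"))  # https://en.wiktionary.org/wiki/%E4%BA%AC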

Unicode 4.1.0

Can someone give me a link so that I can download Unicode 4.1.0 for free? JarlaxleArtemis 00:14, May 27, 2005 (UTC)

http://www.unicode.org --Monedula 05:56, 27 May 2005 (UTC)