Talk:List of XML and HTML character entity references

Lists Low‑importance

	This article is within the scope of WikiProject Lists, an attempt to structure and organize all list pages on Wikipedia. If you wish to help, please visit the project page, where you can join the project and/or contribute to the discussion.
Low	This article has been rated as Low-importance on the project's importance scale.

Computing: Software Low‑importance

	This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Low	This article has been rated as Low-importance on the project's importance scale.
	This article is supported by WikiProject Software.

Typography Low‑importance

	This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Low	This article has been rated as Low-importance on the importance scale.

This article is a former featured list candidate. Please view its sub-page to see why the nomination failed. Once the objections have been addressed you may resubmit the article for featured list status.

Canonical Reference

The W3C standard XML Entity Definitions for Characters (April 1, 2010) is the final authority on entity names. The original ISO standards committee (ISO/IEC JTC 1/SC 34) invited the W3C MathML working group to take over the maintenance and development of entity names. The Unicode Consortium accepts the ISO recommendation. Since there is one defining document for all entity names, it should be referenced as the authoritative document for them. Other references for entity names should be shown for historical reasons, since some entity names have been associated with different characters over time (for example, 'lang' and 'rang' moved from U+2329 and U+232A to U+27E8 and U+27E9 respectively). —Preceding unsigned comment added by Joejava (talk • contribs) 17:04, 16 November 2012 (UTC)

Octal anyone?

I don't think that the standard references octal numbers, but some of my text tools (e.g. Unix's od(1) command) output octal representations of data. It sure would be convenient to search on whatever we've got without having to convert to hex. Since &nbsp; was the one that triggered this thought, here's a proposal for an alternative.

Name	Character	Unicode code point (decimal octal)	Standard	DTD^[1]	Old ISO subset^[2]	Description^[3]
quot	"	U+0022 (34 042)	HTML 2.0	HTMLspecial	ISOnum	quotation mark (= APL quote)
nbsp		U+00A0 (160 0240)	HTML 3.2	HTMLlat1	ISOnum	no-break space (= non-breaking space)^[4]

And... since I'm guessing that this is machine generated, here's a Perl snippet that prints the (augmented) cell.

# Print the augmented cell: hex, decimal, and octal (with a leading 0).
foreach my $code_point (34, 160) {
    printf "| U+%04X (%d 0%o)\n", ($code_point) x 3;
}

MichaelRWolf (talk) 14:03, 28 November 2009 (UTC)

P.S. I'd be glad to flesh out this line of Perl to generate the entire table, should you like.
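For comparison with the JavaScript used further down this page, here is a rough equivalent of the Perl line above. It is only a sketch: the function name `codePointCell` is made up for illustration, and the cell layout simply follows the proposal in this thread.

```javascript
// Format one table cell with the hex, decimal and octal forms of a code point.
// The octal form gets a leading "0", as in the proposed table.
function codePointCell(cp) {
  const hex = cp.toString(16).toUpperCase().padStart(4, "0");
  return "U+" + hex + " (" + cp + " 0" + cp.toString(8) + ")";
}

for (const cp of [34, 160]) {
  console.log("| " + codePointCell(cp));
}
```

Run on 34 and 160, this should reproduce the two cells shown in the sample rows above.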

&#nnn; or &#nnnn;

Some cheat sheets show 3 digit references, some show 4 digit references. If I'm correct, the 3 digit references refer to ISO-8859-1 and the 4 digit references refer to ISO 10646/Unicode.

For example, I'd like to use an en dash on my site, but I'm not sure whether to use &#150; or &#8211;…

Which should I be using, or does it depend on my encoding (or something else)?

Thanks,
Wulf (2006-08-28T23:28:00Z)

Your encoding and the number of digits don't matter, but the range of numbers represented by those digits does. &#128; through &#159;, whether you write them like that or with any number of leading zeroes (or in hexadecimal form preceded by 'x'), are technically not allowed in HTML documents, and if they were, they'd be, according to the specs, referring to non-printing control codes.

Browsers that render some refs in that range as if they were references to Windows-1252 bytes, rather than UCS code points, are doing so only for backward compatibility with pre-HTML 4 browsers that were trying to accommodate authors who were using those refs in an attempt to put certain then-illegal characters (such as the Euro symbol, en dash, em dash, and curved quotation marks) in their documents. If you use the proper codes for the characters you want (most of which would indeed require 4 digits), you should see them in all modern browsers and environments. —mjb 05:36, 30 August 2006 (UTC)
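To make the advice above concrete, here is a minimal sketch that turns a reference in the 128 to 159 range into the reference for the intended character. The names `cp1252ToUnicode` and `properReference` are hypothetical, and the table covers only the characters mentioned in this thread (Euro sign, dashes, curved quotation marks), not every Windows-1252 byte.

```javascript
// Partial table: Windows-1252 byte value -> Unicode code point of the
// character the author most likely intended. Deliberately incomplete.
const cp1252ToUnicode = new Map([
  [128, 0x20AC], // euro sign
  [145, 0x2018], // left single quotation mark
  [146, 0x2019], // right single quotation mark
  [147, 0x201C], // left double quotation mark
  [148, 0x201D], // right double quotation mark
  [150, 0x2013], // en dash
  [151, 0x2014], // em dash
]);

// Return the decimal reference to write instead of one in the 128-159
// range, or null when this partial table has no entry for it.
function properReference(n) {
  if (n < 128 || n > 159) {
    return "&#" + n + ";"; // outside the problem range: leave as is
  }
  const cp = cp1252ToUnicode.get(n);
  return cp === undefined ? null : "&#" + cp + ";";
}

console.log(properReference(150)); // &#8211;
console.log(properReference(128)); // &#8364;
```

So for the en dash in the original question, the four-digit form is the one to use.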

Thanks :) –Wulf 03:30, 1 September 2006 (UTC)

need to add

ř is a Czech character that is used in the name of the composer Dvořák, but I don't know the rest of the information for that row. I just know it would be useful to list. Symphony Girl (talk) 00:43, 6 May 2008 (UTC)

character entity reference

I'd like to know what allowable names are for non-numeric entity references. a-z, numbers, dashes seem to be allowed, but what about underscores? Other characters? Case sensitivity? How long can a name be?

Also, it appears that at least in SGML entity values are not restricted to one character. Is there a length limit, and how does it compare to XML? 85.178.100.140 (talk) 17:40, 8 December 2007 (UTC)

Vertical bar

What is the code for "|"? Since the code for the broken vertical bar exists, shouldn't one exist for the "original", unbroken version? __meco (talk) 14:40, 9 June 2010 (UTC)

(U+007C). | . fileformat.info. Dan ☺ 19:54, 10 June 2010 (UTC)

|. —Tamfang (talk) 20:17, 10 June 2010 (UTC)

In the article. __meco (talk) 21:14, 10 June 2010 (UTC)

We're funnin' ya. Since the common-or-garden pipe is not a special character in HTML, nor an extension to the "original" character set, it needs no code other than "|"; but any character can be specified by its Unicode number, as shown above. Same goes for the "original" unaccented 'e'. —Tamfang (talk) 02:34, 11 June 2010 (UTC)
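For anyone who wants the numeric forms anyway, a small sketch of how to derive them from the character itself. `numericReferences` is a made-up helper name, not something from the article or any spec.

```javascript
// Build the decimal and hexadecimal numeric character references for the
// first code point of a string.
function numericReferences(ch) {
  const cp = ch.codePointAt(0);
  return {
    decimal: "&#" + cp + ";",
    hex: "&#x" + cp.toString(16).toUpperCase() + ";",
  };
}

console.log(numericReferences("|")); // decimal &#124;, hex &#x7C;
console.log(numericReferences("e")); // decimal &#101;, hex &#x65;
```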

Case sensitivity of named character entities

The article does not mention anywhere whether (XML and/or HTML) named entities are case sensitive or not.

I.e. do &apos;, &APOS;, &Apos; and &apoS; all signify the same apostrophe character, or is only the first of the preceding list valid?

For HTML character entities, there are separate definitions that differ only by case (e.g. &Oslash; and &oslash; for an upper-/lowercase letter "O" with a forward slash, Ø and ø). But does the standard allow "free case" where no ambiguity exists?

—Preceding unsigned comment added by Mortenhattesen (talk • contribs) 08:31, 6 December 2010 (UTC)

-- No idea how to reply but they are case-sensitive in both HTML and XML. —Preceding unsigned comment added by 83.85.115.123 (talk) 17:57, 4 January 2011 (UTC)

Entity names have been case sensitive since HTML 2.0. See RFC 1866 section 3.2.3, which says "Element and attribute names are not case sensitive, but entity names are. For example, `<BLOCKQUOTE>', `<BlockQuote>', and `<blockquote>' are equivalent, whereas `&amp;' is different from `&AMP;'."

However, the OP's question asked about &apos;, &APOS;, &Apos; and &apoS;. None of those are valid entity names for HTML 2.0 through 4.01^[5]. &apos; is part of the HTML 5.0^[6] proposal and is in XHTML 1.0.^[7] --Marc Kupper|talk 18:24, 5 September 2011 (UTC)
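A small illustration of what case sensitivity means in practice: entity names are looked up by exact match. The table below is a hand-picked subset of HTML 4 names, and `sampleEntities` and `lookupEntity` are hypothetical names used only for this sketch.

```javascript
// A tiny, hand-picked subset of HTML 4 entity names (not the full list).
// Map keys are compared exactly, so "Oslash" and "oslash" are distinct.
const sampleEntities = new Map([
  ["amp", "&"],
  ["Oslash", "\u00D8"], // capital O with stroke
  ["oslash", "\u00F8"], // small o with stroke
]);

// Return the character for an entity name, or undefined if the name
// (with this exact capitalisation) is not in the table.
function lookupEntity(name) {
  return sampleEntities.get(name);
}

console.log(lookupEntity("Oslash")); // Ø
console.log(lookupEntity("oslash")); // ø
console.log(lookupEntity("OSLASH")); // undefined
```

No case folding is applied, so a name with different capitalisation is either a different entity or not an entity at all.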

Apos entity

HTML 4 doesn't include the "apos" entity. However, with "apos", the list consists of 253 items. — Preceding unsigned comment added by 85.50.221.168 (talk) 14:55, 31 October 2011 (UTC)

Title

As XML does not have "character entity references" but "predefined entities", is this the best title? Widefox (talk) 10:13, 13 June 2012 (UTC)

HTML5

HTML5 adds a truckload of new named references, and changes a few from HTML 4.0 (like &lang; and &rang;). How should we handle this? -- [[User:Edokter]] {{talk}} 08:25, 15 October 2014 (UTC)

Perpendicular or bottom?

Unicode spec says:

22A4 ⊤ DOWN TACK

= top

→ 2E06 ⸆ raised interpolation marker

→ 1F768 🝨 alchemical symbol for crucible-4

22A5 ⊥ UP TACK

= base, bottom

→ 27C2 ⟂ perpendicular

So how is the XML perp defined? 22A5 would not make sense

I'm sorry I don't have time to investigate now :( — Preceding unsigned comment added by 37.152.9.190 (talk) 17:29, 2 December 2015 (UTC)

Spaces

This page defines a complete set of space codes in the range U+2000 to U+200B but does not give them character entity codes. This page shows some, possibly all of them. Sorry, I do not feel moved to chase up their history and add the missing ones to this table. — RHaworth (talk · contribs) 10:02, 19 January 2019 (UTC)

Updated spec from WHATWG

I understand that since this W3C announcement, the canonical reference for the named entities is the WHATWG’s list of named references. I updated the spec link and table accordingly. Two major changes are that:

some entities are also valid without the trailing semicolon (they seem to be those of the DTD HTMLlat1, and some of HTMLspecial);
some entities correspond to two code points (but still to one grapheme).

Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)
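The two changes can be seen directly in the shape of the WHATWG data. Below is a rough sketch that works on a few hand-copied entries in the same shape as entities.json (names as keys, each with a `codepoints` array). The `sample` object is only an excerpt written out for illustration, not the real file.

```javascript
// Hand-copied excerpt in the shape of entities.json; illustrative only.
const sample = {
  "&amp;":    { codepoints: [38] },
  "&amp":     { codepoints: [38] },        // also valid without semicolon
  "&ge;":     { codepoints: [8805] },
  "&nbumpe;": { codepoints: [8783, 824] }, // two code points
};

const names = Object.keys(sample);

// Names that are valid without the trailing semicolon.
const noSemicolon = names.filter(name => !name.endsWith(";"));

// Names that expand to more than one code point.
const multiCodePoint = names.filter(
  name => sample[name].codepoints.length > 1);

console.log(noSemicolon);    // [ '&amp' ]
console.log(multiCodePoint); // [ '&nbumpe;' ]
```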

Automated checking

I programmatically verified that the table was respecting a few rules:

A. all named entities from the spec are in the table

B. no named entity outside of the spec is in the table

C. the decimal code points corresponding to the named entities are as per the spec

D. the code points are in format "U+HHHH (D)"

E. the hexadecimal value of the code points match the decimal value

F. the default order of the entities is a) per ascending number of code points, b) per ascending value of the code points

G. there are no duplicate code points (so that named entities with the same code points are grouped in the same row)

H. the descriptions of the named entities consist of the name of the Unicode code points as per the Unicode standard, optionally followed by a wiki reference and/or additional text in parentheses

I. the characters match the decimal code points

This checks the correctness of four of the seven columns of the table: “Names”, “Character”, “Unicode code point (decimal)”, “Description”. I am not sure about the three other columns, and I may have made mistakes. In particular I have added entities with the value “HTML 5.0” for the “Standard” column, but I think that the WHATWG only has a living standard (as opposed to the W3C, which has versions). So please feel free to fix those if need be. — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Making use of the code

To anyone maintaining the table, please consider making use of the code I used to check the rules mentioned above. This is JavaScript code to run in the browser console. Note that if you do not trust this code, do not execute it. Running untrusted code can present security risks.

To make use of the code, go to the article page, then open the JavaScript console of the browser (F12 is a common shortcut for that), then paste the following snippets:

const wikiTable = document.querySelector(".wikitable.sortable > tbody");

This assigns the tbody element of the table to a variable. It will be needed for the various checks. — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Check list of named entities

There are two steps to perform the checks A, B, C. First open a tab on https://whatwg.org, and in the console paste the following function:

async function makeReference() {
  const referenceURL = 'https://html.spec.whatwg.org/entities.json';
  const json = await (await fetch(referenceURL)).json();
  const refMap = {};
  for (const name in json) {
    const key = json[name].codepoints.join("_");
    if (key in refMap) {
      refMap[key].push(name);
    } else {
      refMap[key] = [ name ];
    }
  }
  // Remove every <style> element; copy the live collection before mutating it.
  for (const style of Array.from(document.getElementsByTagName("style"))) {
    style.remove();
  }
  document.body.innerHTML = "<pre>const refMap = " + JSON.stringify(
      refMap,
      (k, v) => typeof(v) === "string" ? v.replace("&", "&AMP;") : v,
      4 )
    + ";</pre>";
  return "done";
}

Call it as follows to replace the content of the page with the JavaScript object refMap which contains the spec of the named entities.

await makeReference()

Copy the content of that page, and paste it in the console of the tab for the Wikipedia article.

Then, paste the following function which checks the Wikipedia table using the object above.

function checkNamedEntitiesList(refMap, wikiTable) {
  console.log("=== BEGIN checkNamedEntitiesList ===");
  const optionalSemiRefName = "[a]";
  const wikiMap = {};
  for (const tr of wikiTable.children) {
    const rawNames = tr.children[0].textContent.split(", ");
    const rawCP = tr.children[2].textContent.split(", ");
    const codepoints = rawCP.map( cp => +cp.split("(")[1].split(")")[0] );
    const names = [];
    for (const rawName of rawNames) {
      const name = rawName.trim();
      const regexMatch = name.match(/(.*)(\[[a-z]+\])/);
      if (regexMatch) {
        names.push( "&" + regexMatch[1] + ";" );
        if (regexMatch[2] === optionalSemiRefName) {
          names.push( "&" + regexMatch[1] );
        }
      } else {
        names.push( "&" + name + ";" );
      }
    }
    const key = codepoints.join("_");
    wikiMap[key] = names;
  }

  const missing = [];
  for (const key in refMap) {
    const hasKey = key in wikiMap;
    for (const name of refMap[key]) {
      if (!hasKey || wikiMap[key].indexOf(name) < 0) {
        missing.push( [key, name] );
      }
    }
  }

  const extra = [];
  for (const key in wikiMap) {
    const hasKey = key in refMap;
    for (const name of wikiMap[key]) {
      if (!hasKey || refMap[key].indexOf(name) < 0) {
        extra.push( [key, name] );
      }
    }
  }

  console.log("There are", missing.length, "missing entities, and",
    extra.length, "extra entities in the wikipedia table");

  if (missing.length > 0) {
    console.log("The missing entities are: (name : decimal code point(s))");
    for (const [key, name] of missing) {
      console.log(name, ":", key.split("_").join());
    }
    console.log("Note: named entities without a trailing semicolon",
      "need to be marked with the reference", optionalSemiRefName);
  }

  if (extra.length > 0) {
    console.log("The extra entities are: (name : decimal code point(s))");
    for (const [key, name] of extra) {
      console.log(name, ":", key.split("_").join());
    }
  }
  console.log("=== END checkNamedEntitiesList ===");
}

Call it as follows:

checkNamedEntitiesList(refMap, wikiTable)

Fix all errors before proceeding. — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Check code points

The following function performs the checks D, E, F, G:

function checkCodepoints(wikiTable) {
  console.log("=== BEGIN checkCodepoints ===");
  let errorCount = 0;
  function errorCheck(errorCount, msg) {
    if (errorCount > 0) throw { errorCount, msg };
  }
  try {
    console.log("1. Code point format check");
    const entitiesCP = [];
    for (let i = 0; i < wikiTable.children.length; ++i) {
      const tr = wikiTable.children[i];
      const namedEntities = tr.children[0].textContent.trim();
      const rawCP = tr.children[2].textContent.split(", ");
      for (const cp of rawCP) {
        const regexMatch = cp.match(/^U\+([0-9A-F]{4,}) \(([0-9]+)\)\n{0,1}$/);
        if (!regexMatch) {
          console.log( "Code point in wrong format for entities:",
            namedEntities, "->", '"' + cp + '"' );
          ++errorCount;
          continue;
        }
        if (i in entitiesCP) {
          entitiesCP[i].push( [ +regexMatch[2], regexMatch[1] ] );
        } else {
          entitiesCP[i] = [ [ +regexMatch[2], regexMatch[1], namedEntities ] ];
        }
      }
    }
    errorCheck(errorCount, 'Note: code point format is "U+HHHH (D)"\n'
      + '      HHHH is 4 or more characters in range 0-9A-F;'
      + ' D is in character range 0-9\n'
      + '      multiple code points are separated with ", " (comma space)\n'
      + '      single trailing new line is accepted' );

    console.log("2. Hexadecimal/decimal value match check");
    for (const codepoints of entitiesCP) {
      const namedEntities = codepoints[0][2];
      for (const [dec, hex] of codepoints) {
        if ( parseInt(hex, 16) !== dec ) {
          console.log("Hex does not match decimal value for entities:",
            namedEntities, "-> (hex)", hex, "!= (dec)", dec);
          ++errorCount;
        }
      }
    }
    errorCheck(errorCount, '');

    console.log("3. Order check");
    let prevDec = [];
    for (const codepoints of entitiesCP) {
      const namedEntities = codepoints[0][2];
      if (codepoints.length < prevDec.length) {
        console.log("Entities", namedEntities,
          "have", codepoints.length,
          "code point(s) but are located after entity having",
          prevDec.length, "code point(s)");
        ++errorCount;
      }

      if (codepoints.length !== prevDec.length) {
        prevDec = new Array(codepoints.length).fill(-1);
      }

      for (let i = 0; i < codepoints.length; ++i) {
        const dec = codepoints[i][0];
        if (dec === prevDec[i]) {
          if ( i === codepoints.length - 1 ) {
            console.log("Entities", namedEntities,
              "have duplicate decimal code point(s) [",
              prevDec.join(", "), "]");
            ++errorCount;
          }
          continue;
        }
        if (dec < prevDec[i]) {
          console.log("Entities", namedEntities,
            "have decimal code point", dec,
            "but are located after entity having code point",
            prevDec[i], "(at code point #" + (i + 1) + ")");
          ++errorCount;
        }
        prevDec = codepoints.map( cp => cp[0] );
        break;
      }
    }
    errorCheck(errorCount, "Note: order of entities is\n"
      + "      a) per increasing number of code points\n"
      + "      b) per increasing code point value, from left to right");

    console.log("Check complete: no error found.");
  } catch({errorCount, msg}) {
    if (msg.length > 0) { console.log(msg); }
    console.log("checkCodepoints:", errorCount, "error(s) found. Exiting");
  }
  console.log("=== END checkCodepoints ===");
}

Call it as follows:

checkCodepoints(wikiTable)

Fix all errors before proceeding. — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Check descriptions

To perform the check H, two steps are necessary. First, go to the Unicode data page URL at https://www.unicode.org/Public/UNIDATA/UnicodeData.txt and run the following code:

document.body.innerHTML = document.body.textContent.split("\n").map((line,i) => {
  if (line.length === 0) { return ""; }
  const f = line.split(";");
  const pre = i? "" : "<pre>const nameRef = {\n";
  return pre + '"' + f[0] + '": "' + (f[1] === "<control>" ? f[10] : f[1]);
} ).join('",\n') + "};</pre>", 0;

This will transform the data in the page into the JS object nameRef needed to perform the description check. Copy the updated content of the page, and paste it in the JavaScript console of the Wikipedia article page. Beware however that it may slow down your browser considerably (on my system, after I pasted the object in the console, trying to use the DOM inspector on that page caused Firefox to freeze). I suspect that this is because the object is very large.

Then paste the following function to check the table:

function checkDescriptions(nameRef, wikiTable) {
  console.log("=== BEGIN checkDescriptions ===");
  const descList = [];
  for (const tr of wikiTable.children) {
    const namedEntities = tr.children[0].textContent.trim();
    const rawCP = tr.children[2].textContent.split(", ");
    const rawDesc = tr.children[6].textContent;
    const refDesc = rawCP.map( cp => {
      const hex = cp.split(" ")[0].slice(2);
      return nameRef[hex].toLowerCase();
    } ).join(", ");
    descList.push( {namedEntities, rawDesc, refDesc} );
  }

  let errorCount = 0;
  function errorCheck(errorCount) { if(errorCount) throw errorCount; }
  try {
    for (const {namedEntities, rawDesc, refDesc} of descList) {
      const desc = rawDesc.toLowerCase().trim();
      if ( !desc.startsWith(refDesc) ) {
        console.log("The description of entities", namedEntities,
          "does not match the Unicode name of its code point(s):\n",
          "    -> Unicode name is \"" + refDesc + '"\n',
          "    -> the wiki description is \"" + desc + '"');
        ++errorCount;
      }
    }
    errorCheck(errorCount);

    let warningCount = 0;
    for (const {namedEntities, rawDesc, refDesc} of descList) {
      const desc = rawDesc.toLowerCase().trim();
      const tail = desc.slice(refDesc.length);
      const regexMatch = tail.match(/(.*)(\[[a-z]+\])/);
      const noRefTail = regexMatch ? regexMatch[1] : tail;
      const noParensNoRefTail = noRefTail.split("(")[0];
      const tailText = noParensNoRefTail.trim();
      if (tailText.length > 0) {
        console.log("The description of entities", namedEntities,
          "has extra text after the Unicode name of its code point(s):\n",
          "    -> Unicode name is \"" + refDesc + '"\n',
          "    -> extra tailing text is \"" + tailText + '"');
        ++warningCount;
      }
    }
    if (warningCount > 0) {
      console.log("Note: the tailing part excludes wiki references",
        "and content after parenthesis");
    }
    console.log("Warning(s) found:", warningCount);
    
    console.log("Check complete: no error found.");
  } catch(errorCount) {
    console.log("checkDescriptions:", errorCount, "error(s) found. Exiting");
  }
  console.log("=== END checkDescriptions ===");
}

Call it as follows:

checkDescriptions(nameRef, wikiTable)

Currently it outputs two warnings, as there is extra explanation for rows as follows:

The description of entities equiv, Congruent has extra text after the Unicode name of its code point(s):
    -> Unicode name is "identical to"
    -> extra tailing text is "; sometimes used for 'equivalent to' or 'congruent'"
The description of entities nequiv, NotCongruent has extra text after the Unicode name of its code point(s):
    -> Unicode name is "not identical to"
    -> extra tailing text is "; sometimes used for 'not congruent'"

— Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Check characters

The following function performs the check I:

function checkCharacters(wikiTable) {
  console.log("=== BEGIN checkCharacters ===");
  let errorCount = 0;
  for (const tr of wikiTable.children) {
    const namedEntities = tr.children[0].textContent.trim();
    const rawChars = tr.children[1].textContent;
    const rawCP = tr.children[2].textContent.split(", ");
    const aRawChars = Array.from(rawChars);
    const codepoints = rawCP.map( cp => +cp.split("(")[1].split(")")[0] );
    codepoints.push(10);
    if ( aRawChars.length !== codepoints.length
      || aRawChars.some( (c, i) => c.codePointAt(0) !== codepoints[i] ) )
    {
      codepoints.pop();
      console.log("The character field for entities", namedEntities,
        "does not contain the entity code point(s) plus new line (10):\n",
        "    -> code point(s): [", codepoints.join(", "), "]\n",
        "    -> code point(s) in character field: [",
          aRawChars.map( c => c.codePointAt(0) ).join(", "), "]");
      ++errorCount;
    }
  }
  console.log("Error(s) found:", errorCount);
  console.log("=== END checkCharacters ===");
}

Call it as follows:

checkCharacters(wikiTable)

— Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Automate table row code creation

Here is a helper function which generates a new row for entities with a given code point. It needs the object nameRef generated above.

function makeRow(nameRef, line) {
  const [ names, vals ] = line.split(" : ");
  const deces = vals.split(",").map(Number);
  const hexes = deces.map( d => {
    const hex = d.toString(16).toUpperCase();
    return "0".repeat( Math.max(0, 4 - hex.length) ) + hex;
  } );
  const uninames = hexes.map(d => nameRef[d].toLowerCase());
  const cp = hexes.map( (h, i) => "U+" + h + " ("+ deces[i] +")" );
  return `|-
| ${names}
| ${String.fromCodePoint(...deces)}
| ${cp.join(", ")}
| HTML 5.0
|
|
| ${uninames.join(", ")}`;
}

It can be called for example as follows:

makeRow(nameRef, "nbumpe, NotHumpEqual : 8783,824")

— Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Representation in HTML

There should be another (second?) column in the table at List of XML and HTML character entity references#Character entity references in HTML showing the code as written, e.g. &Tab;, &NewLine;, &excl; etc. It's hard to find e.g. the "ge" entity now; "&ge;" would be much easier to find.

Do you support this, or could I add it? (see this) Segu (talk) 20:12, 15 February 2020 (UTC)

Does Wikipedia support HTML5 ?

The codes from HTML version 4 and earlier work, for example Â ( Â ), but the codes from HTML5 do not work for me, for example &Scedil; ( Ş - Ş ). I tried in Chrome and Firefox.

Does Wikipedia not support HTML5? — Ark25 (talk) 17:44, 28 March 2020 (UTC)

It doesn't work for me either, in SeaMonkey 2.53.2 (Gecko 60.3.2) on Linux. I see your Ş when written in UTF-8, but the following entity is spelled out as an uninterpreted entity. What does work is using either the decimal value (&#350; gives Ş) or the hex value (&#x15E; gives Ş and &#x015E; gives Ş), but of course to use them you have to know the decimal or hex value of the code point, which is not always easy to find. The Unicode scripts chart and the Unicode character name index can help you. Every script or character link there leads to a PDF file containing a part of the current Unicode character list. — Tonymec (talk) 18:27, 28 March 2020 (UTC)

This is remarkably hard to find a solid answer for. 8-(

AFAICS, Mediawiki has been HTML5 (in the output) for some years. They've also adopted the idea of a rigorously XML-compliant output model, and with full Unicode support. There's also strong encouragement for authors to use Unicode directly, rather than entities.

An XML output model raises an old issue with XHTML: which entities are permitted? For some XML parsing models, none of them (except the five XML entities) are usable. In others, the HTML DTD is parsed (or assumed) and the HTML entities are permissible. But which set of entities? In particular, HTML5 doesn't indicate the DTD to be used (it's implicit, by defined HTML5 behaviour outside the normal XML or SGML parsing models). Clearly (by observation), Mediawiki passes HTML 4 entities through as entities, but anything else (including HTML5 entities) is &ampersand; escaped. I can't explain this choice, and I can't find a source for the decision.[8] I'm puzzled as to why: passing them through would work (HTML5 is accepted as effectively universal), converting them to characters would work (it's Unicode clear throughout), but this behaviour gives an unexpected result for editors, depending on whether an entity is HTML5 or HTML 4.

Note that this isn't a browser behaviour. What Mediawiki puts out in the source for a page is only too clear. Andy Dingley (talk) 01:51, 29 March 2020 (UTC)
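To illustrate the behaviour observed above (and only that; this is not MediaWiki's actual code, which I have not looked at), here is a rough model: references to names on an allow-list pass through, and every other named reference gets its ampersand escaped. The `allowed` set is a made-up, tiny subset of the HTML 4 names, and `escapeUnknownEntities` is a hypothetical function name.

```javascript
// Hypothetical allow-list: a few HTML 4 entity names only.
const allowed = new Set(["amp", "lt", "gt", "quot", "nbsp", "Acirc"]);

// Leave references to allowed names untouched; escape the ampersand of
// every other named reference so it appears literally on the page.
function escapeUnknownEntities(text) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) =>
    allowed.has(name) ? match : "&amp;" + name + ";");
}

console.log(escapeUnknownEntities("&Acirc; and &Scedil;"));
// &Acirc; and &amp;Scedil;
```

With a model like this, a browser receives `&amp;Scedil;` and correctly displays the literal text, which matches what editors see.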

Really strange, imo. I just created a .htm file in Notepad, and Chrome, Firefox and IE have no problem showing &Scedil; as "Ş". Why MediaWiki refuses to parse it is quite a mystery. — Ark25 (talk) 21:42, 17 April 2020 (UTC)

Possibly this might help one day? Unless it has been removed by this. All the best: Rich Farmbrough (the apparently calm and reasonable) 17:53, 17 May 2020 (UTC).

[1]

[2]

[3]

[4]