A numeric character reference in HTML refers to a character by its Universal Character Set/Unicode code point, and uses the format
&#nnnn; or &#xhhhh;
where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form. The x must be lowercase in XML documents. The nnnn or hhhh may be any number of digits and may include leading zeros. The hhhh may mix uppercase and lowercase, though uppercase is the usual style.
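The decimal and hexadecimal forms can be illustrated with a short Python sketch. The helper names `decimal_reference` and `hex_reference` are hypothetical and chosen only for this example. Decoding uses the standard library's `html.unescape`.

```python
import html


def decimal_reference(ch: str) -> str:
    """Build the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def hex_reference(ch: str) -> str:
    """Build the hexadecimal form, with a lowercase x and uppercase digits."""
    return f"&#x{ord(ch):X};"


euro = "\u20ac"  # the euro sign, code point 8364 (hexadecimal 20AC)
print(decimal_reference(euro))
print(hex_reference(euro))

# Leading zeros and mixed-case hexadecimal digits decode to the same character.
print(html.unescape("&#0008364;") == euro)
print(html.unescape("&#x20ac;") == html.unescape("&#x20AC;"))
```

Both helpers expect a single character. A longer string would need to be processed one character at a time.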
Not all web browsers or email clients used by receivers of HTML documents, or text editors used by authors of HTML documents, will be able to render all HTML characters. Most modern software is able to display most or all of the characters for the user’s language, and will draw a box or other clear indicator for any character it cannot render.
Most characters with codes from 0 to 127, the original 7-bit ASCII standard set, can be used without a character reference. Codes from 160 to 255 can all be created using character entity names. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal numeric character reference.
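As a rough illustration of these ranges, the following Python sketch prefers an entity name where one exists and otherwise falls back to a decimal reference. It assumes the HTML 4.01 entity table shipped in Python's `html.entities` module. The function name `to_reference` is hypothetical, and the sketch does not check whether a code point is permitted in HTML at all.

```python
from html.entities import codepoint2name


def to_reference(ch: str) -> str:
    """Return an entity name if one exists, else plain ASCII or a decimal reference."""
    cp = ord(ch)
    if cp in codepoint2name:
        # Named entity, e.g. &eacute; (this also covers &amp;, &lt;, &gt;, &quot;).
        return f"&{codepoint2name[cp]};"
    if cp < 128:
        # Remaining ASCII characters are written directly.
        return ch
    # No entity name available: fall back to the decimal form.
    return f"&#{cp};"


# Every code point from 160 to 255 has an entity name in this table.
print(all(cp in codepoint2name for cp in range(160, 256)))

for ch in ["A", "&", "\u00e9", "\u2122", "\u4e2d"]:
    print(repr(ch), "->", to_reference(ch))
```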
HTML forbids the use of the characters with Universal Character Set/Unicode code points
- 0 to 31, except 9, 10, and 13 (C0 control characters)
- 127 (DEL character)
- 128 to 159 (C1 control characters)
- 55296 to 57343 (hexadecimal D800 to DFFF, the UTF-16 surrogate halves)
These characters are not allowed even by reference. That is, you should not write them as numeric character references. However, references to characters 128–159 are commonly interpreted by lenient web browsers as if they were references to the characters assigned to bytes 128–159 (decimal) in the Windows-1252 character encoding. This is in violation of HTML and SGML standards, and the characters are already assigned to higher code points, so HTML document authors should always use the higher code points. For example, for the trademark sign (™), use &#8482;, not &#153;.
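The forbidden ranges and the Windows-1252 reinterpretation can be sketched in Python as follows. The function name `is_forbidden` is hypothetical. The last line relies on Python's `html.unescape`, which follows the lenient browser behaviour for references in the 128–159 range. It is shown only to illustrate that behaviour, not to recommend it.

```python
import html


def is_forbidden(cp: int) -> bool:
    """True if the code point falls in one of the ranges listed above."""
    if cp in (9, 10, 13):  # tab, linefeed and carriage return are allowed
        return False
    return (
        cp <= 31                      # C0 control characters
        or cp == 127                  # DEL
        or 128 <= cp <= 159           # C1 control characters
        or 0xD800 <= cp <= 0xDFFF     # UTF-16 surrogate halves
    )


print(is_forbidden(153), is_forbidden(8482))

# Byte 153 in Windows-1252 is the trademark sign, whose code point is 8482.
trademark = bytes([153]).decode("cp1252")
print(ord(trademark))

# A lenient decoder treats the forbidden reference like the correct one.
print(html.unescape("&#153;") == html.unescape("&#8482;"))
```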
The characters 9 (tab), 10 (linefeed), and 13 (carriage return) are allowed in HTML documents, but, along with 32 (space), are all considered “whitespace”. The “form feed” control character, which would be at 12, is not allowed in HTML documents, but is also mentioned as being one of the “white space” characters, perhaps an oversight in the specifications. In HTML, most consecutive occurrences of white space characters, except in a <pre> block, are interpreted as comprising a single “word separator” for rendering purposes. A word separator is typically rendered as a single en-width space in European languages, but not in others.
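The collapsing of consecutive white space into a single word separator can be approximated with a regular expression. This is a minimal Python sketch that assumes the text lies outside any <pre> block. The names `HTML_WHITESPACE` and `collapse_whitespace` are hypothetical, and real rendering engines apply further rules that are not modelled here.

```python
import re

# Space, tab, linefeed, carriage return and form feed, as listed above.
HTML_WHITESPACE = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of white space characters with a single space."""
    return HTML_WHITESPACE.sub(" ", text)


print(collapse_whitespace("one \t\n two\r\n\fthree"))
```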