Character encodings in HTML

Aug 18, 2010  ¦¦  by isr.coder  ¦¦  HTML  ¦¦  49 Comments

HTML (Hypertext Markup Language) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version in which international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information’s integrity, and universal browser display.

Specifying the document’s character encoding

There are several ways to specify which character encoding is used in the document. First, the web server can include the character encoding or “charset” in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:

Content-Type: text/html; charset=ISO-8859-1
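On the receiving side, the charset parameter can be read out of such a header value in a few lines of Python. This is only a sketch: the function name charset_from_content_type is made up for illustration, and it borrows the standard library's email.message.Message for parameter parsing rather than using a dedicated HTTP library.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value (lowercased), or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # None
```

A result of None corresponds to the case where the server sent no charset at all.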

For HTML (but not in XHTML) it is possible to include this information inside the head element near the top of the document:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5 also allows the following syntax to mean exactly the same:

<meta charset="utf-8">
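Both meta forms can be recognized with Python's standard html.parser module. The following is a rough sketch, not how any browser actually does it: the class MetaCharsetFinder and the helper find_meta_charset are hypothetical names, and the code assumes the document has already been decoded well enough to be parsed as text.

```python
from email.message import Message
from html.parser import HTMLParser

class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 short form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: the charset is a parameter inside the content attribute
            msg = Message()
            msg["Content-Type"] = attrs.get("content") or ""
            self.charset = msg.get_content_charset()

def find_meta_charset(html_text):
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset

print(find_meta_charset('<head><meta charset="utf-8"></head>'))  # utf-8
```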

XHTML documents have a third option: to express the character encoding in the XML declaration, as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>

A common misconception about <meta http-equiv="Content-Type"> is that the meta element is intended to be interpreted directly by the browser, like an ordinary HTML tag. According to the World Wide Web Consortium (W3C), its purpose is to help the HTTP server generate headers when it serves the document. Under the HTTP/1.1 specification, an HTML document must be labeled with an appropriate encoding in the Content-Type header; a missing charset= parameter is treated as ISO-8859-1 (so HTTP/1.1 formally has no such thing as an unspecified character encoding), and the header takes precedence over any HTML (or XHTML) meta element. This can pose a problem if the server generates an incorrect header and the author does not have the access or the knowledge to change it.

As each of these methods explains to the receiver how the file being sent should be interpreted, it would be inappropriate for these declarations not to match the actual character encoding used. Because a server usually can't know how a document is encoded (especially if documents are created on different platforms or in different regions), many servers simply do not include a "charset" in the Content-Type header, thus avoiding making false promises. However, if the document does not specify the encoding either, this may result in the equally bad situation where the user agent displays mojibake because it cannot find out which character encoding was used. Because the HTTP charset= parameter is so widely and persistently neglected on the server side, the World Wide Web Consortium has become disappointed with HTTP/1.1’s strict approach and encourages browser developers to apply some fixes in violation of RFC 2616.
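What mojibake looks like is easy to reproduce. In this small Python illustration the word "café" is just an arbitrary sample: it is sent as UTF-8 bytes, and the receiver decodes them once with a wrong guess and once with the correct encoding.

```python
text = "caf\u00e9"                   # "café"
raw = text.encode("utf-8")           # bytes actually sent: b'caf\xc3\xa9'

wrong = raw.decode("windows-1252")   # receiver guesses the wrong encoding
right = raw.decode("utf-8")          # receiver was told the real encoding

print(wrong)  # cafÃ©
print(right)  # café
```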

If a user agent reads a document with no character encoding information, it can fall back to using some other information. For example, it can rely on the user's settings, either browser-wide or specific to a given document, or it can pick a default encoding based on the user's language. For Western European languages, it is typical and fairly safe to assume Windows-1252, which is similar to ISO-8859-1 but has printable characters in place of some control codes. The consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In CJK environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually allow the user to override an incorrect charset label manually.
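The difference between the two Western European encodings, and the fallback idea itself, can be sketched in Python. The function pick_encoding is a hypothetical helper, and its order (HTTP header, then in-document declaration, then a Windows-1252 default) is a simplification of what real user agents do.

```python
byte = b"\x93"
as_latin1 = byte.decode("iso-8859-1")    # U+0093, an invisible control code
as_cp1252 = byte.decode("windows-1252")  # U+201C, a left double quotation mark

print(hex(ord(as_latin1)), hex(ord(as_cp1252)))  # 0x93 0x201c

def pick_encoding(http_charset=None, document_charset=None,
                  fallback="windows-1252"):
    """Use the first encoding label that is present, else the fallback."""
    return http_charset or document_charset or fallback

print(pick_encoding(None, "utf-8"))  # utf-8
print(pick_encoding())               # windows-1252
```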

It is increasingly common for multilingual websites and websites in non-Western languages to use UTF-8, which allows use of the same encoding for all languages. UTF-16 or UTF-32, which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
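The size difference is easy to measure. The snippet below encodes a tiny, made-up HTML fragment (seven ASCII characters plus one Greek letter) in all three encodings; the "-le" codec variants are used so that no byte order mark is counted.

```python
sample = "<p>\u03bb</p>"   # "<p>λ</p>"

def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_sizes(sample))
# {'utf-8': 9, 'utf-16-le': 16, 'utf-32-le': 32}
```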

Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.

Character references

In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.

HTML character references

Numeric character references can be in decimal format, &#DD;, where DD is a variable number of decimal digits. Similarly there is a hexadecimal format, &#xHHHH;, where HHHH is a variable number of hexadecimal digits. Hexadecimal character references are case-insensitive in HTML. For example, the character 'λ' can be represented as &#955;, &#x03BB; or &#X03bb;. Numeric references always refer to Unicode code points, regardless of the page's encoding. Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.
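Python's standard html module follows HTML parsing rules for character references, so it can be used to check the examples above. This is an illustration of the behaviour, not a validator: html.unescape silently accepts the forbidden reference &#153; and applies the Windows-1252 reading.

```python
import html

print(ord("\u03bb"), hex(ord("\u03bb")))   # 955 0x3bb  (the code point of λ)

decoded = [html.unescape(ref) for ref in ("&#955;", "&#x03BB;", "&#X03bb;")]
print(decoded)                             # ['λ', 'λ', 'λ']

legacy = html.unescape("&#153;")           # read as Windows-1252 byte 0x99
print(legacy, hex(ord(legacy)))            # ™ 0x2122
```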

Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string. For example, 'λ' can also be encoded as &lambda; in an HTML document. The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup. This notably does not include XML's &apos; (') entity. For a list of all named HTML character entity references (approximately 250 entries), see List of XML and HTML character entity references.
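Named references can be explored with Python's html and html.entities modules. The sample string passed to html.escape is invented for illustration; note that html.escape covers only the markup delimiters (plus the apostrophe, which it writes as a numeric reference).

```python
import html
from html.entities import name2codepoint

print(html.unescape("&lambda;"))   # λ
print(name2codepoint["lambda"])    # 955  (lower-case lambda)
print(name2codepoint["Lambda"])    # 923  (names are case-sensitive: capital lambda)

escaped = html.escape('<a title="x">R&D</a>')
print(escaped)   # &lt;a title=&quot;x&quot;&gt;R&amp;D&lt;/a&gt;
```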

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for the markup-delimiting characters mentioned above, and for a few special characters (or not at all if a native Unicode encoding like UTF-8 is used).

XML character references

Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:

  • &amp; → & (ampersand, U+0026)
  • &lt; → < (less-than sign, U+003C)
  • &gt; → > (greater-than sign, U+003E)
  • &quot; → " (quotation mark, U+0022)
  • &apos; → ' (apostrophe, U+0027)

All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML entity set, along with XML's predefined entities.
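These XML rules can be observed with a strict XML parser. The sketch below uses Python's xml.etree.ElementTree; the helper parses_as_xml and the element name t are made up for the example.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(source):
    """Return True if the string is accepted as well-formed XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False

# The five predefined entities need no declaration.
print(ET.fromstring("<t>&lt;&amp;&gt;&quot;&apos;</t>").text)  # <&>"'

print(parses_as_xml("<t>&eacute;</t>"))  # False: entity was never defined
print(parses_as_xml("<t>&#xE9;</t>"))    # True:  lowercase x is fine
print(parses_as_xml("<t>&#XE9;</t>"))    # False: uppercase X is rejected
```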

However, use of &apos; in XHTML should generally be avoided for compatibility reasons. &#39; or &#x0027; may be used instead.

&amp; has the special problem that it starts with the character to be escaped. A simple Internet search finds thousands of sequences &amp;amp;amp;amp; ... in HTML pages where the algorithm that replaces an ampersand with the corresponding character entity reference was applied more than once.
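The effect is simple to reproduce: escaping is not idempotent, so running it over text that is already escaped adds another layer. A short Python illustration with an arbitrary sample string:

```python
import html

once = html.escape("R&D")    # correct: escaped exactly once
twice = html.escape(once)    # bug: escaping already-escaped text

print(once)    # R&amp;D
print(twice)   # R&amp;amp;D

# Each unescape pass removes only one layer.
print(html.unescape(twice))  # R&amp;D
```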
