HTML Odds and Ends

Comments

To embed a comment in your HTML code, use the following structure:

<!-- Comment Text -->

Entities

There are two reasons to use entities in HTML code.

Entities for reserved characters

There are a few characters that cannot be typed directly in HTML code because they have special meaning in the language. These characters always have to be replaced by entity codes:

< - replace with &lt;
> - replace with &gt;
" - replace with &quot;
& - replace with &amp;
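If the text you want to put in a page comes from a program rather than from your keyboard, the same replacements can be done automatically. Here is a minimal sketch using the escape function from Python's standard html module. The sample string is made up for illustration.

```python
import html

# A made-up line of text containing all four reserved characters.
raw = 'if (a < b && b > c) print("ok")'

# html.escape replaces &, <, and > and, with quote=True, the double quote.
escaped = html.escape(raw, quote=True)
print(escaped)
```

The escaped string is what you would paste into the HTML source. The browser turns the entity codes back into the original characters when it displays the page.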

Entities for non-ASCII characters

With the exception of the special characters listed above, you can type every other character on the standard US English keyboard directly into HTML without having to replace it with an entity code. All of the characters that appear on a standard US English keyboard are members of the ASCII character set. Characters outside of that basic character set can be produced using an HTML entity. Here are some basic examples:

Accented characters used in languages such as French, Spanish, or German: &egrave; gets rendered as è
Greek characters used in mathematics: &pi; gets rendered as π
Special symbols: &Cross; gets rendered as ⨯
Virtually every character can be represented by using its Unicode code number. For example, the Greek letter theta (θ) gets represented as &#x3B8;.
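To check what a given entity code stands for without opening a browser, you can decode it in a few lines. This sketch assumes Python's standard html module, whose unescape function knows the HTML 5 entity names as well as numeric codes in decimal and in base 16.

```python
import html

# Named entities, then the same Greek letter theta written two ways:
# by decimal code number and by base-16 code number.
samples = ["&egrave;", "&pi;", "&Cross;", "&#952;", "&#x3B8;"]

decoded = [html.unescape(s) for s in samples]
for entity, character in zip(samples, decoded):
    print(entity, "->", character)
```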

Complete reference

Here is a site with a handy table of commonly used entity codes.

This site has a more extensive listing of entity code names.

Entity code numbers, binary and base-16

All information that gets stored in a computer is stored in the form of numbers. To represent text, computers use a system of code numbers that assigns a numeric code to each character. The simplest method for doing this is the ASCII system, which was established in the United States in the 1960s.

Another relevant fact about computers is that they store all numbers using a base 2 (or binary) representation. Furthermore, computers typically group the binary digits (or bits) of a base 2 number into groups of 8 bits, each called a byte.

Here is an example of how this works. The ASCII code number for the letter 'A' is 65, which looks like

01000001

when we write it out in base 2 notation.
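You can confirm this for yourself with a short script. The sketch below uses only Python built-ins: ord looks up the code number of a character, and format writes that number as 8 binary digits.

```python
# Look up the code number for the letter A.
code = ord("A")

# Write the code number as a full byte: 8 binary digits, padded with zeros.
byte_bits = format(code, "08b")

print(code, byte_bits)
```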

When computer scientists work with base 2 numbers, they very frequently use a simple trick to make the representation of the number more compact. Instead of using base 2 to write out the number, they use a more convenient base, base 16. In this number system the digits range from 0 to 15 instead of 0 to 9 as in base 10. To represent the digits 10 through 15 in base 16 we use the letters A through F.

Here is how to convert a base 2 number to base 16. We start by grouping the bits of the number into groups of size 4:

0100 0001

Each of these groups of size 4 can be represented by a single digit in base 16:

4 1

To keep from confusing base 16 (or hexadecimal) numbers with base 10 numbers, computer scientists use the convention of putting an 'x' in front of the number:

x41
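The grouping trick is mechanical enough to write as a small function. This is only a sketch of the procedure described above; the function name binary_to_hex is my own choice and not part of any library.

```python
def binary_to_hex(bits: str) -> str:
    """Convert a string of binary digits to base 16 by groups of 4 bits."""
    bits = bits.replace(" ", "")
    # Pad on the left with zeros so the length is a multiple of 4.
    padded = bits.zfill((len(bits) + 3) // 4 * 4)
    digits = "0123456789ABCDEF"
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    # Each group of 4 bits is a number from 0 to 15: one base-16 digit.
    return "".join(digits[int(group, 2)] for group in groups)


print("x" + binary_to_hex("0100 0001"))
```

The same function works for longer numbers: it simply produces one base-16 digit for every 4 bits.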

Unicode

All of this information about number systems becomes much more important when we switch to a character encoding system that is much more extensive than the ASCII system. The most extensive encoding system in use today is the Unicode system, which is large enough to encompass the characters used in all written human languages.

For example, the character 気, which corresponds to the Chinese qi or Japanese ki, is encoded in Unicode via the code number

x6C17

To embed this character in a web page, you can construct an entity code that gives the appropriate code number:

&#x6C17;
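If you have the character itself but not its code number, a script can build the entity for you. Here is a minimal sketch; the helper name to_hex_entity is hypothetical, and the html module is only used to check the result by decoding it again.

```python
import html


def to_hex_entity(character: str) -> str:
    """Build a numeric entity code from a single character."""
    # ord gives the Unicode code number; the X format writes it in base 16.
    return f"&#x{ord(character):X};"


entity = to_hex_entity("気")
print(entity)
print(html.unescape(entity))
```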

UTF-8

Constructing HTML entities using the Unicode code number for the symbol you want to embed in a document works just fine for limited use. If you have the occasional need to embed a special symbol in your text you can use entity codes for the purpose. However, this approach quickly becomes impractical if you need to embed, say, a long quotation in Chinese in your web page.

The last example above shows an obvious complication that Unicode introduces, and that is that many characters require more than one byte to represent. This character would take at least two bytes to represent. This causes a big problem for HTML, because HTML typically expects you to use the ASCII encoding to construct tags. This means that in practice we need to find some way to use both ASCII to encode the letters that make up tags and Unicode for content in languages other than English.

The UTF-8 system is a clever hack that allows us to have both ASCII and Unicode in the same HTML document.

Here are the basic ideas behind UTF-8:

ASCII characters in the range from 0 to 127 get represented via their usual ASCII codes, using a single byte for each character.
All of the binary representations of the code numbers 0 to 127 have a 0 as their first bit.
Every byte used to encode a code number above 127 has a 1 as its first bit, so those bytes can never be mistaken for ASCII characters.
To represent numbers that would normally require more than one byte, UTF-8 uses a specialized encoding that spreads the required bits across several bytes.
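These rules are easy to see by looking at the bytes of a few encoded characters. The sketch below uses Python's built-in str.encode; the helper name utf8_bits and the choice of sample characters are mine.

```python
def utf8_bits(character: str) -> list[str]:
    """Return the UTF-8 bytes of a character as strings of 8 binary digits."""
    return [format(byte, "08b") for byte in character.encode("utf-8")]


# An ASCII letter, an accented letter, and a Chinese/Japanese character.
for character in "Aè気":
    print(character, utf8_bits(character))
```

The letter A comes out as a single byte starting with 0, while every byte of the other two characters starts with 1.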

Here is a concrete example to illustrate how the UTF-8 encoding works. Consider again our example character 気. In base 16 that character gets represented via the code number

6C 17

The binary equivalent of that number is

01101100 00010111

UTF-8 will use a total of three bytes to represent this number, using a structure that looks like this:

1110xxxx 10xxxxxx 10xxxxxx

The sequence 1110 at the start of the first byte indicates that the character we are about to represent covers a total of 3 bytes. The 10 sequences at the front of the second and third bytes mark them as continuations of the same character. The 16 bits in the code number above get split into groups of 4, 6, and 6 bits (0110 110000 010111) and spread across the x positions in the structure, producing something that looks like this in binary:

11100110 10110000 10010111
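The bit shuffling above can be written out as a short function. This is a sketch of the 3-byte case only, under the assumption that the code number lies between x0800 and xFFFF; the function name utf8_three_bytes is hypothetical. The last line compares the result against Python's own UTF-8 encoder.

```python
def utf8_three_bytes(code_point: int) -> bytes:
    """Encode a code number using the pattern 1110xxxx 10xxxxxx 10xxxxxx."""
    # Only code numbers x0800..xFFFF use this pattern, and the range
    # xD800..xDFFF is reserved and never encoded.
    if not 0x0800 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("code number does not use the 3-byte pattern")
    first = 0b11100000 | (code_point >> 12)                 # top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0b00111111)  # middle 6 bits
    third = 0b10000000 | (code_point & 0b00111111)          # bottom 6 bits
    return bytes([first, second, third])


encoded = utf8_three_bytes(0x6C17)
bits = " ".join(format(byte, "08b") for byte in encoded)
print(bits)
print(encoded == "気".encode("utf-8"))
```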

Despite the fact that the technical details behind UTF-8 are fairly complex, in practice UTF-8 is very easy to use in a web page. All you need to do is to make sure that the editor you use to construct the page supports UTF-8. (Most modern text editors, including Notepad++ and TextWrangler, support UTF-8 content.) You can freely copy and paste content in languages other than English into your web page, and the editor will automatically encode that content in UTF-8.

The only extra step needed to make sure that a browser renders the page correctly is to put a special declaration at the top of the page. In the head element of the page, place this element:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
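To see why the declaration matters, it helps to look at what happens when UTF-8 bytes are read with the wrong encoding. The sketch below is only an illustration in Python, not a description of what any particular browser does; Latin-1 is used here as an example of a wrong guess.

```python
original = "è"

# Saving the character in UTF-8 produces two bytes.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as UTF-8 recovers the character.
correct = utf8_bytes.decode("utf-8")

# Reading them as Latin-1 treats each byte as a separate character.
wrong = utf8_bytes.decode("latin-1")

print(correct, wrong)
```

The second result is the kind of garbled text you sometimes see on pages whose encoding was declared incorrectly or not at all.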

A gratuitous example

To demonstrate that the approach described above works just fine, here is a longer quote in UTF-8:

David Bowie war meine erste große Liebe. Ich war 13, als sein Album "The Rise And Fall Of Ziggy Stardust And The Spiders From Mars" veröffentlicht wurde. Danach atmete ich Bowie förmlich. Ich wollte sein wie er. Ich kannte jedes Wort seiner Songs auswendig, hörte sie ununterbrochen. Ich ließ mir sogar seinen Haarschnitt verpassen, und war mächtig stolz darauf.

-Steve Blame, quoted on www.spiegel.de

HTML 4 vs HTML 5

The HTML language has undergone considerable evolution over the last 25 years. HTML itself has evolved through 5 versions of the language, with most web sites you are likely to encounter using either HTML 4 or HTML 5.

To indicate to a browser which version of HTML you are using for a page, you put a DOCTYPE declaration as the first line of the file. HTML 4 uses the declaration

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

while HTML 5 uses

<!DOCTYPE html>

With very few exceptions, HTML 5 is backward compatible with HTML 4. This means that if you have a page encoded in HTML 4, you can simply change the DOCTYPE declaration to the HTML 5 form and everything should continue to work. Moving to HTML 5 will give you access to some new language features (described below).

The only thing you have to watch out for when using HTML 5 is that some older browsers (such as Internet Explorer 6 on Windows XP) do not support HTML 5.

New elements in HTML 5

HTML 5 offers a number of new elements. Below I will describe various groups of these new elements.

Semantic elements

The purpose of these elements is to describe the role that a particular portion of a document plays. Before HTML 5, HTML programmers would set up distinct sections of a page by using the <div> element in combination with a class attribute (we'll see more about classes and styles when we cover CSS). For example, to set up a section of navigation links, programmers would do

<div class="nav"> Content </div>

In HTML 5 there is now a <nav> element that serves the same purpose.

Here is a listing of some of these semantic elements.

<article>
<section>
<aside>
<header>
<footer>
<nav>
<figure>

Many of these semantic elements will become more useful to us once we start using CSS. After we have covered the basics of CSS I will revisit some of these elements.

New input types for forms

HTML 5 expands the range of input elements with elements for inputting dates, email addresses, and URLs. The text discusses these new elements in chapter 7.

Video element

Embedding video in a web page used to be a somewhat clunky process. At the present time, the most straightforward way to put video in a page is to use a video sharing service such as YouTube or Vimeo. These services will allow you to upload your video and then provide you with a snippet of HTML code that you can paste into your page. Here is an example of a YouTube video embedded in a page:

[Embedded YouTube video]

HTML 5 offers a new <video> element that makes it easier to embed video in a page without using a hosting service. That element is discussed in chapter 9. Here is a link to a page that demonstrates this new element.