Due to the problems pointed out by this article, SGML comments have been removed from Acid 2, and future HTML versions will not require SGML comments. Browsers that have implemented them are now expected to remove their support for SGML comments, for all HTML versions.
Comments allow you to include HTML or text in your document that you want the browser to ignore. Maybe you want to leave a note for yourself to remind yourself of what you were doing. Maybe you just need to temporarily hide part of your page for a few days.
In its simplest form, an HTML comment starts like this: "<!--", and ends like this: "-->". So this is a basic comment:
<!-- this text is ignored by the browser -->
Strictly speaking, it is not ignored, since scripts can still access it. But never mind that for now. Let's just concentrate on the syntax. Initially, this is how all browsers interpreted it. You could put anything inside the comment, except for "-->", as that would end the comment.
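To make the old-style rule concrete, here is a minimal JavaScript sketch of it. The function name stripOldStyleComments is made up for this illustration, it only deals with comments (not tags or scripts), and dropping the rest of the input when no "-->" is found is an assumption, since old browsers differed on that.

```javascript
// Rough model of old-style comment handling: a comment runs from "<!--"
// to the next "-->", and nothing inside it has any special meaning.
// stripOldStyleComments is a hypothetical name used only for this sketch.
function stripOldStyleComments(html) {
  let out = "";
  let pos = 0;
  while (pos < html.length) {
    const start = html.indexOf("<!--", pos);
    if (start === -1) {
      out += html.slice(pos); // no more comments, keep the rest as it is
      break;
    }
    out += html.slice(pos, start); // keep the text before the comment
    const end = html.indexOf("-->", start + 4);
    if (end === -1) {
      break; // assumption: an unterminated comment swallows the rest
    }
    pos = end + 3; // continue after the "-->"
  }
  return out;
}

console.log(stripOldStyleComments("a<!-- this text is ignored by the browser -->b")); // prints ab
```

Everything between the two markers is thrown away, no matter how many dashes or > characters it contains.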
Suddenly it is not so easy any more. You see, browsers were wrong. HTML was created as a subset of SGML, and SGML dictates a more complicated view of comments. Browsers all ignored SGML comments though, and stuck with the comment format they had always used. This was a sensible approach, in my opinion.
For a little while, Opera experimented with "correct" comment handling, and found that predictably, no Web page authors were aware of it, resulting in a lot of broken sites. So Opera changed back to using the format that everyone understood. Then Mozilla decided to implement them "properly" as well. It was implemented only in strict mode, but that did not stop it causing problems. Then the Acid 2 test came along, and for debatable reasons, they decided to include SGML comments in it.
It would have been better to rewrite the HTML standard to reflect the reality of what authors were using, but no. So browser vendors are now forced to implement SGML comments, or risk embarrassment, even though they will cause Web pages to break. Why will they break?
To put it simply, the double dashes at the start and end of the comment do not start and end the comment. A double dash indicates a change in what the comment is allowed to contain. The first -- starts the comment, and tells the browser that the comment is allowed to contain > characters without ending the comment. The second -- does not end the comment. It tells the browser that if it encounters a > character, it must then end the comment. If another -- is added, then it goes back to allowing the > characters:
<!-- this can contain > characters -- this can not, so the comment ends here>
Each time a double dash is encountered, it changes the format between allowing, and not allowing the > characters to be inside the comment:
<!-- this can contain > characters -- this can not -- this can contain > characters -- this can not, so the comment ends here>
That example is not actually valid HTML, since the last part (between the last -- and the closing >) is not allowed to contain anything except whitespace. However, the SGML parsing rules will cause it to behave as described, even if there are some other non-whitespace characters in there:
<!-- this can contain > characters -- this can not -- this can contain > characters -->
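The toggling rule is easier to follow as a tiny state machine. This is only a sketch in JavaScript: sgmlCommentEnd is a hypothetical helper name, it pairs dashes greedily from left to right, and it ignores the whitespace-only validity rule mentioned above, just as the parsing rules do.

```javascript
// Sketch of the SGML rule: every "--" flips between "> is allowed" and
// "> ends the comment". sgmlCommentEnd is a hypothetical helper.
// Returns the index just after the ">" that ends the comment, or -1 if
// the comment never ends.
function sgmlCommentEnd(html, start = 0) {
  if (!html.startsWith("<!--", start)) return -1; // not a comment at all
  let gtAllowed = false; // flipped to true by the first "--"
  let i = start + 2;     // skip the "<!"
  while (i < html.length) {
    if (html.startsWith("--", i)) {
      gtAllowed = !gtAllowed; // each double dash changes the mode
      i += 2;
    } else if (html[i] === ">" && !gtAllowed) {
      return i + 1;           // ">" outside the allowed mode ends it
    } else {
      i += 1;                 // ordinary character (or a single dash)
    }
  }
  return -1;
}

const sample = "<!-- a > b -- c>visible";
console.log(sample.slice(sgmlCommentEnd(sample))); // prints visible
```

The first > is skipped because it sits after an odd number of double dashes. The second one comes after an even number, so it ends the comment.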
Note, XML (and therefore also XHTML when served using an XML based content-type) took the sensible step of making it not valid to have -- inside a comment. As a result, trying to use it should result in a parsing error. Because of this, XML and XHTML do not have the SGML comment problem. In practice, I have never seen any real need for SGML comments, so I favour the XML approach. Note that XHTML, if served using the text/html content-type, will be treated as HTML, so the SGML comment parsing rules will be applied.
Also note that if there is a space between the "<!" and the first -- then it is not treated as a comment. It is treated as an unknown SGML tag, and ends at the first > character that is encountered:
<! -- > This is not inside a comment, because there is a space before the first dashes. The "unknown SGML tag" ended at the first > character. -->
In the real Web, people were not aware of the SGML comment rules. All people knew was what browsers did. So they used as many dashes as they wanted inside the comment. And now browsers are going to change the way they treat this.
The browsers that have implemented SGML comments so far have only done so if the page's doctype declaration triggers the browser's strict mode (see my comparison table for more details). If a page does trigger strict mode, though, its comments will be parsed differently, and common constructs that the author intended as comments will no longer be considered valid comments:
<!------- start wrapper -------->
And yes, pages will break. Parts of pages will appear or disappear in different browsers.
In each case, assume this is the only content of the body, and assume that there are no other (attempts to make) comments on the page.
<!------->
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
--- | Content of comment |
--> | End of comment |
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
-- | Comment may contain > |
- | Not a double -- so it is treated as part of the comment |
> | This part of the comment is allowed to contain the > character, so this is part of the comment |
The comment never ends, so browsers will treat this as an invalid comment (and invalid HTML). They should reparse the comment assuming the first < character was meant to be a &lt; (part of a text node). Note that some browsers have implemented SGML comments but do not reparse the invalid comment (since HTML does not specify error handling). Those browsers should be updated to perform a reparse as described here.
Characters | SGML interpretation after reparse |
---|---|
<!-------> | Text content containing < and > characters, without being correctly written as entities |
<!------->
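One way to picture the reparse step is a renderer that tries the SGML rules first and, when the comment never ends, gives up and emits the "<" as text. The sketch below uses hypothetical helper names (sgmlCommentEnd, renderSgml) and treats everything that is not a comment as plain text, which real browsers obviously do not.

```javascript
// Hypothetical helper: index just after the ">" that ends the SGML
// comment starting at `start`, or -1 if the comment never ends.
function sgmlCommentEnd(html, start = 0) {
  if (!html.startsWith("<!--", start)) return -1;
  let gtAllowed = false;
  let i = start + 2;
  while (i < html.length) {
    if (html.startsWith("--", i)) {
      gtAllowed = !gtAllowed; // each "--" flips the mode
      i += 2;
    } else if (html[i] === ">" && !gtAllowed) {
      return i + 1;
    } else {
      i += 1;
    }
  }
  return -1;
}

// Sketch of rendering with the reparse fallback. Only comments are
// modelled; tags and entities are deliberately ignored.
function renderSgml(html) {
  let out = "";
  let i = 0;
  while (i < html.length) {
    if (html.startsWith("<!--", i)) {
      const end = sgmlCommentEnd(html, i);
      if (end !== -1) {
        i = end; // valid comment: skip it entirely
        continue;
      }
      // invalid comment: fall through and treat "<" as text
    }
    out += html[i];
    i += 1;
  }
  return out;
}

console.log(renderSgml("<!------->")); // prints <!------->
```

Because the fallback only gives up on the single "<" character, any later "<!--" in the same text gets a fresh attempt at being a comment.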
<!-- -- foo>bar
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
-- | Comment is ready to end |
foo | Usually ignored (assumed to be a mistake) |
> | Comment has ended |
bar | Rendered as page content |
There is no "proper" --> to end the comment, so in most old browsers, the "-- foo>" is considered to be the end (the browser assumes the comment is invalid). Note that some old browsers may ignore everything up to the end of the page.
bar
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
foo | Content of the comment (content is not actually allowed here) |
> | This part of the comment is not allowed to contain the > character, so this ends the comment |
bar | Rendered as page content |
bar
<!-- -- foo>bar<!-- -->
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
-- | There are more comment ends later, so this is assumed to be part of the comment |
foo>bar<!-- | Content of the comment |
--> | Comment has ended |
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
foo | Content of the comment (content is not actually allowed here) |
> | This part of the comment is not allowed to contain the > character, so this ends the comment |
bar | Rendered as page content |
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
> | This part of the comment is not allowed to contain the > character, so this ends the comment |
bar
<!-- -- -->bar<!-- -- -->
This is basically the same as the example in the Acid 2 test, except they used a few more isolated dashes to throw you off.
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
-- | Content of the comment |
--> | Comment has ended |
bar | Rendered as page content |
<!-- | Start of comment |
-- | Content of the comment |
--> | Comment has ended |
bar
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
-- | Comment may contain > |
>bar<! | This part of the comment is allowed to contain the > character, so this is part of the comment |
-- | Comment may not contain > (using a > will cause the comment to end) |
-- | Comment may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
> | This part of the comment is not allowed to contain the > character, so this ends the comment |
<!-- -- -->bar<!-- -->
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
-- | Content of the comment |
--> | Comment has ended |
bar | Rendered as page content |
<!-- | Start of comment |
--> | Comment has ended |
bar
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
-- | Comment may contain > |
>bar<! | This part of the comment is allowed to contain the > character, so this is part of the comment |
-- | Comment may not contain > (using a > will cause the comment to end) |
-- | Comment may contain > |
> | This part of the comment is allowed to contain the > character, so this is part of the comment |
The comment never ends, so browsers will treat this as an invalid comment (and invalid HTML). They should reparse the comment assuming the first < character was meant to be a &lt; (part of a text node):
Characters | SGML interpretation after reparse |
---|---|
<!-- -- -->bar | Text content containing < and > characters, without being correctly written as entities |
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
> | This part of the comment is not allowed to contain the > character, so this ends the comment |
<!-- -- -->bar
<!-- -- <!-- -- -->
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
-- <!-- -- | Content of the comment |
--> | Comment has ended |
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
<! | Content of the comment |
-- | Comment may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
-- | Comment may contain > |
> | This part of the comment is allowed to contain the > character, so this is part of the comment |
The comment never ends, so browsers will treat this as an invalid comment (and invalid HTML). They should reparse the comment assuming the first < character was meant to be a &lt; (part of a text node):
Characters | SGML interpretation after reparse |
---|---|
<!-- -- | Text content containing a < character, without being correctly written as an entity |
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
-- | Comment may contain > |
> | This part of the comment is allowed to contain the > character, so this is part of the comment |
The new comment also never ends, so browsers will treat this as an invalid comment (and invalid HTML). They should reparse the new comment assuming the first < character was meant to be a &lt; (part of a text node):
Characters | SGML interpretation after second reparse |
---|---|
<!-- -- | Text content containing a < character, without being correctly written as an entity |
<!-- -- --> | Text content containing < and > characters, without being correctly written as entities |
<!-- -- <!-- -- -->
<!-- > -- -->
This is a simple version of an earlier example, and highlights the mistake made by browsers that fail to reparse invalid comments.
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
> -- | Content of the comment |
--> | Comment has ended |
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
> | Content of the comment |
-- | Comment may not contain > (using a > will cause the comment to end) |
-- | Comment may contain > |
> | This part of the comment is allowed to contain the > character, so this is part of the comment |
The comment never ends, so browsers will treat this as an invalid comment (and invalid HTML). They should reparse the comment assuming the first < character was meant to be a &lt; (part of a text node):
Characters | SGML interpretation after reparse |
---|---|
<!-- > -- --> | Text content containing < and > characters, without being correctly written as entities |
<!-- > -- -->
These browsers all understand basic SGML comments, but fail to correctly reparse when they encounter an invalid comment.
Gecko ignores only the opening <! so it outputs:
-- > -- -->
KHTML treats the < as the start of an unknown SGML tag, and ignores everything up to the first > so it outputs:
-- -->
iCab treats the comment as an old-style comment, so it outputs nothing.
<!-- foo <!-- bar -->
This is a mistake sometimes seen on pages, where the author forgets to close the first comment before starting the second.
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
foo <!-- bar | Content of the comment |
--> | Comment has ended |
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
foo <! | Content of the comment |
-- | Comment may not contain > (using a > will cause the comment to end) |
bar | Content of the comment |
-- | Comment may contain > |
> | This part of the comment is allowed to contain the > character, so this is part of the comment |
The comment never ends, so browsers will treat this as an invalid comment (and invalid HTML). They should reparse the comment assuming the first < character was meant to be a &lt; (part of a text node):
Characters | SGML interpretation after reparse |
---|---|
<!-- foo | Text content containing a < character, without being correctly written as an entity |
<! | Start of comment structure |
-- | Start of comment, may contain > |
bar | Content of the comment |
-- | Comment may not contain > (using a > will cause the comment to end) |
> | This part of the comment is not allowed to contain the > character, so this ends the comment |
<!-- foo
<!-- foo -->
The basic comment. You will be happy to hear that this still works - even in XHTML and XML.
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
foo | Content of the comment |
--> | Comment has ended |
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
foo | Content of the comment |
-- | Comment may not contain > (using a > will cause the comment to end) |
> | This part of the comment is not allowed to contain the > character, so this ends the comment |
<! -- > foo -->
The space before the first -- causes this not to be treated as a comment.
Characters | Interpretation |
---|---|
<! | Start of unknown SGML structure |
-- | Content of the structure |
> | End of the structure |
foo --> | Text content containing a > character, without being correctly written as an entity |
foo -->
<!-- -- -->foo<!-- >bar<!-- -->
An example of this was found on a real page, that thought it was using non-existent conditional comments for Firefox. The space before the second > is important to avoid tripping over a bug in IE.
Characters | Old-style interpretation |
---|---|
<!-- | Start of comment |
-- | Content of the comment |
--> | Comment has ended |
foo | Rendered as page content |
<!-- | Start of comment |
>bar<!-- | Content of the comment |
--> | Comment has ended |
foo
Characters | SGML interpretation |
---|---|
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
-- | Comment may contain > |
>foo<! | This part of the comment is allowed to contain the > character, so this is part of the comment |
-- | Comment may not contain > (using a > will cause the comment to end) |
> | This part of the comment is not allowed to contain the > character, so this ends the comment |
bar | Rendered as page content |
<! | Start of comment structure |
-- | Start of comment, may contain > |
-- | Comment may not contain > (using a > will cause the comment to end) |
> | This part of the comment is not allowed to contain the > character, so this ends the comment |
bar
<!-- -- Failed>Your browser is SGML compliant.<!-- --><!-- -- -->Your browser is not SGML compliant.<!-- -- -->
This combines two of the examples given above. An SGML compliant browser outputs "Your browser is SGML compliant.", while a browser using old-style comment parsing outputs:
Your browser is not SGML compliant.
Note: although this test is valid, the HTML is not strictly valid, so yes, this page will not validate. I am aware of this ;)
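If you want to check that test without a browser, the test string can be fed through rough models of both parsing styles. This is a sketch under stated assumptions: the function names (renderOldStyle, sgmlCommentEnd, renderSgml) are invented for illustration, only comments are handled, and everything else is treated as text.

```javascript
// Old-style model: a comment is "<!--" up to the nearest "-->".
// Assumption: an unterminated comment is simply left in place as text.
function renderOldStyle(html) {
  return html.replace(/<!--[\s\S]*?-->/g, "");
}

// SGML model: each "--" flips whether ">" is allowed inside the comment.
function sgmlCommentEnd(html, start) {
  let gtAllowed = false;
  let i = start + 2; // skip the "<!"
  while (i < html.length) {
    if (html.startsWith("--", i)) {
      gtAllowed = !gtAllowed;
      i += 2;
    } else if (html[i] === ">" && !gtAllowed) {
      return i + 1;
    } else {
      i += 1;
    }
  }
  return -1; // the comment never ends
}

// SGML model with the reparse fallback for comments that never end.
function renderSgml(html) {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const end = html.startsWith("<!--", i) ? sgmlCommentEnd(html, i) : -1;
    if (end !== -1) {
      i = end;        // valid comment, skip it
    } else {
      out += html[i]; // ordinary text (or a "<" that failed to be a comment)
      i += 1;
    }
  }
  return out;
}

const complianceTest =
  "<!-- -- Failed>Your browser is SGML compliant.<!-- -->" +
  "<!-- -- -->Your browser is not SGML compliant.<!-- -- -->";

console.log(renderOldStyle(complianceTest)); // prints Your browser is not SGML compliant.
console.log(renderSgml(complianceTest));     // prints Your browser is SGML compliant.
```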
When a script starting tag is encountered, the content following it is assumed to be CDATA, until the next occurrence of "</script>" (that is why / characters in script strings are supposed to be written as \/). Everything in between is passed to the script engine for processing. SGML comments do not apply inside scripts. Inline scripts may optionally start with "<!--" and end with "//-->" but that has nothing to do with SGML.
The script engine is responsible for dealing with these, and will treat the opening comment as a single line script comment (the closing comment must always be commented out as a single line script comment). These HTML comment structures are not allowed anywhere else inside the scripts (that doesn't seem to stop people from using them, but they are still wrong to do so).
These HTML comment structures inside the script are to protect very old browsers (the age of Netscape 1 and Mosaic). Scripts may contain -- (the decrement operator), but that should not be an issue, as the very old browsers do not understand SGML comments anyway, and compliant browsers will treat it as CDATA anyway.
But you may want to temporarily comment out your scripts, so the script itself is disabled, and this is where you will have problems. Imagine you have a script that contains a decrement operator, and then maybe later, it contains a > (greater than) operator - it does not need one of these to make a mess, but it will help illustrate the point. And then you comment it out to test something:
<!--
<script type="text/javascript">
for( var i = 10; i > 0; i-- ) {
  if( myar[i].status > 3 ) {
    ntlp++;
  }
}
</script>
-->
Can you see the problem yet? In older browsers it works as you expected it to. But in compliant browsers, it gets a bit more convoluted.
The comment starts where you told it to start. The first -- means that > characters are allowed (so "i > 0" is ignored), but then there is the decrement operator (i--), followed by the > character inside the "if" conditional. This ends the comment. Everything afterwards is treated as HTML. The </script> tag is treated as a closing tag for an element that was never opened (invalid and ignored), and the "-->" is treated as text containing a > character that was not correctly written as an entity. As a result, the rendered page shows this:
3 ) { ntlp++; } } -->
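The same breakage can be reproduced with a rough model of the SGML rules. In this sketch, sgmlCommentEnd is a hypothetical helper, and the page is reduced to just the commented-out script from the example above.

```javascript
// Hypothetical helper: index just after the ">" that ends the SGML
// comment starting at index 0, or -1 if the comment never ends.
function sgmlCommentEnd(html) {
  if (!html.startsWith("<!--")) return -1;
  let gtAllowed = false;
  let i = 2; // skip the "<!"
  while (i < html.length) {
    if (html.startsWith("--", i)) {
      gtAllowed = !gtAllowed; // "i--" in the script flips this too
      i += 2;
    } else if (html[i] === ">" && !gtAllowed) {
      return i + 1;
    } else {
      i += 1;
    }
  }
  return -1;
}

const page = [
  "<!--",
  '<script type="text/javascript">',
  "for( var i = 10; i > 0; i-- ) {",
  "if( myar[i].status > 3 ) {",
  "ntlp++;",
  "}",
  "}",
  "</script>",
  "-->",
].join("\n");

// Everything after the point where the comment ends is parsed as HTML.
const leftover = page.slice(sgmlCommentEnd(page));
console.log(leftover.startsWith(" 3 ) {")); // prints true
```

The leftover still contains the stray </script> tag and the final "-->", which is exactly the debris that ends up being handed to the HTML parser.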
A large number of pages contain comments saying "<!-- start of navigation -->" ... "<!-- end of navigation -->", or "<!-- start of content -->" ... "<!-- end of content -->". However, some go a little overboard with the number of dashes, and they will be made to pay for this mistake, because now those dashes actually mean something. It is easily possible to accidentally comment out large sections of documents, making them disappear in SGML compliant browsers:
<!------ Start navigation ----->
<ul><li><a href="one/">One</a></li>
<li><a href="two/">Two</a></li>
...</ul>
<!------ End navigation ----->
In that example, the entire source code of the navigation would be ignored in SGML compliant browsers, simply because the author did not know they needed to check the number of double dashes that they had used. Such a simple mistake, and so many different ways to make it.
If you want to make your pages work in new and old browsers (and for now, most browsers are in the "old browsers" category), this is the solution: start your comments with "<!-- ", and end them with " -->", and never ever ever have "--" inside your comments. Treat all your comments like XHTML comments, and it will work in both new and old browsers, no matter what HTML doctype you do (or don't) use.
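A quick way to audit existing pages against this advice is to scan for comments that would not be legal XHTML comments. The following is only a sketch: findRiskyComments is a made-up name, it locates comments the old-style way, and it flags any whose body contains "--" or ends with "-" (the XML rule). It does not check for the spaces after "<!--" and before "-->".

```javascript
// Sketch of a comment audit. findRiskyComments is a hypothetical helper
// that returns the source text of every comment that breaks the XML rule.
function findRiskyComments(html) {
  const risky = [];
  let pos = 0;
  while (true) {
    const start = html.indexOf("<!--", pos);
    if (start === -1) break; // no more comments
    const end = html.indexOf("-->", start + 4);
    if (end === -1) {
      risky.push(html.slice(start)); // never closed: always a problem
      break;
    }
    const body = html.slice(start + 4, end);
    // XML forbids "--" inside a comment and a "-" directly before "-->".
    if (body.includes("--") || body.endsWith("-")) {
      risky.push(html.slice(start, end + 3));
    }
    pos = end + 3;
  }
  return risky;
}

// Logs an array containing only the first comment.
console.log(findRiskyComments("<!------ Start navigation -----><p>hi</p><!-- ok -->"));
```

Anything this reports is a comment that old and new browsers may disagree about, so rewriting it with plain "<!-- " and " -->" markers is the safe fix.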