Saving Web pages
I get enough emails from visitors who have trouble with pages they have saved from this site that I have put this page together to give some advice on how to avoid these problems.
I am happy for you to save copies of my pages for private or educational use (please do not put them on any public Web sites). Many of you like to use that as a way to make your own pages based on my script demo pages.
Encoding problems
A typical example is a page that looks OK on my site, but when you upload it to your own server, some of the characters look wrong, appearing as the wrong character, a black square, or a diamond with a question mark.
All pages on this site are served using UTF-8 encoding. In most cases, these pages are compatible with most encodings, as I use HTML entities for characters that are not in the base character set. However, a few pages need to use characters that are not in the base character set, and cannot use HTML entities (for example, JavaScripts or PHP scripts for email validation).
If you serve the pages from your own server using UTF-8 encoding they should work (assuming you have not opened and saved them in an editor that has broken them). Ask your server administrator for instructions.
If your text or HTML editor displays incorrect characters, try telling your editor to interpret the page as UTF-8 - if there is no option for this, try a better editor. Some editors may break these characters if you save pages in that editor without the correct encoding.
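As a rough illustration of how these symptoms arise, the sketch below pushes one non-ASCII character through the wrong encoding in each direction. The choice of Windows-1252 as the "wrong" encoding is an assumption for the example; your server or editor may use a different one.

```python
# One character outside ASCII: e with an acute accent (U+00E9).
original = "\u00e9"

# The page is stored on disk as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# A server or editor that treats those bytes as Windows-1252 shows
# two wrong characters instead of one correct character.
wrong_chars = utf8_bytes.decode("cp1252")

# The reverse mistake: the character was saved as Windows-1252, then
# read as UTF-8. The lone byte is invalid UTF-8, so it is replaced by
# U+FFFD, which browsers usually draw as a diamond with a question mark.
cp1252_bytes = original.encode("cp1252")
replaced = cp1252_bytes.decode("utf-8", errors="replace")

# An editor that saves the misread text back as UTF-8 makes the damage
# permanent: the file now really contains the two wrong characters.
double_encoded = wrong_chars.encode("utf-8")

print(utf8_bytes, ascii(wrong_chars), ascii(replaced), double_encoded)
```

The last step is why opening and re-saving a page in an editor with the wrong encoding can break it even if your server is configured correctly.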
Broken saved pages
Two of the most common problems I hear about are:
- A saved page displays incorrectly in all browsers except Internet Explorer.
- Scripts do not work in the saved page, either failing to run and throwing errors, or duplicating page content.
Internet Explorer breaks CSS
In the first case, the problem is virtually always because the page was saved using Internet Explorer on Windows. It breaks any CSS that it does not understand:
- It splits styles like "border" into the separate styles that it represents, such as "border-color". In most cases this change is harmless, but it can sometimes change the meaning. (It also removes the last semi-colon, and changes all style names to upper case, but that is also harmless.)
- It breaks selectors it does not understand. Yes, plenty of us use the > selector to target advanced browsers. IE removes the selector of the rule completely, and replaces it with "UNKNOWN". It then removes all contents of that style rule. It is a totally useless thing to do, and serves no purpose except to break saved pages for other browsers.
- It changes all selectors to upper case ("BODY" instead of "body"), meaning that if the page was going to be served to good browsers as XHTML, the stylesheet will not be applied at all.
- It replaces all media types it does not know with "Unknown", so media query blocks and speech media are corrupted beyond use. Some of these blocks may be removed completely.
- It fails to use relative imported URLs correctly in stylesheets so the relative resource cannot be located (and it causes IE to abort saving).
Since most of my CSS files - like most CSS used on the Web - use at least some CSS that IE does not understand (carefully used to ensure that IE will use the simpler CSS, and other browsers will use the advanced CSS), that means that it will always break CSS.
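If you are not sure whether a stylesheet you already saved has been damaged in this way, a quick heuristic check can help. The sketch below looks only for the symptoms described above ("UNKNOWN" selectors, "Unknown" media types, upper case element selectors and style names). The function name `find_ie_damage` is made up for this example, and the regular expressions are a simplification, not a real CSS parser.

```python
import re

def find_ie_damage(css):
    """Return a list of warnings about likely damage in a saved stylesheet."""
    warnings = []
    # Media types that were not understood are replaced with "Unknown".
    if re.search(r"@media[^{]*\bUnknown\b", css):
        warnings.append("media type replaced by Unknown")
    # Naive split into "selector { body }" rules (innermost blocks only).
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        selector = selector.strip()
        if selector == "UNKNOWN":
            warnings.append("selector replaced by UNKNOWN")
        elif re.search(r"(?<![.#\w-])[A-Z][A-Z0-9]*(?![\w-])", selector):
            # An element name such as BODY or DIV written in upper case.
            warnings.append("upper-case element selector: " + selector)
        if re.search(r"(?:^|;)\s*[A-Z][A-Z-]*\s*:", body):
            warnings.append("upper-case style name in rule: " + selector)
    return warnings

damaged = "BODY { BORDER-COLOR: red }\nUNKNOWN { }\n"
clean = "body > p { border: 1px solid red; }\n"

print(find_ie_damage(damaged))
print(find_ie_damage(clean))
```

A stylesheet that produces warnings here is best discarded and saved again with a different browser, since the removed rules cannot be recovered from the damaged file.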
Summary: Never save Web pages using Internet Explorer.
Mozilla/Firefox/Netscape, Konqueror, and Safari break scripts
More accurately, they break pages that use scripts. Instead of saving the source code of the page, they save the page as the scripts have modified it. But they also save the scripts. This means that if a script creates some extra content, the extra content will be saved, but then the script will create it again when the saved page is opened, meaning that it will be duplicated.
Also, if the script expected a specific layout of the markup, it will not be correct, and the script will not be able to find what it was looking for, and will not be able to run.
Almost all of my pages use scripts that add extra content (the stylesheet selector on the navigation panel, for example), so they will all be broken at least a little.
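The duplication can be shown with a toy model. This is not browser code; the page is just a list of strings, and `enhance` is a hypothetical stand-in for a page script that adds one extra element every time it runs.

```python
EXTRA = "<select id='styles'>"

def enhance(page):
    """Stand-in for a page script: returns the page with extra content added."""
    return page + [EXTRA]

# The page source as the author wrote it, including the script reference.
source = ["<h1>Demo</h1>", "<script src='enhance.js'></script>"]

# Viewing the page online runs the script once.
live = enhance(source)

# Saving the source: reopening the saved file runs the script once.
reopened_from_source = enhance(source)

# Saving the modified page plus the script: the extra content is already
# in the file, and the script adds it again when the file is opened.
reopened_from_snapshot = enhance(live)

print(reopened_from_source.count(EXTRA))    # 1
print(reopened_from_snapshot.count(EXTRA))  # 2
```

A real script may fail outright instead of duplicating, if the already modified markup no longer matches the layout it expects.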
Summary: Never save Web pages using Mozilla/Firefox/Netscape, Konqueror, Safari, or related browsers.
So what should you use to save Web pages
Opera. Not only is it free and exceptionally good at standards, it is also very useful for Web development, and it is the only browser that saves Web pages with all their associated files without breaking anything. You will not have to uninstall your current browser, and you do not have to use Opera as your main browser, but it can help you to produce standards compliant pages, and you can use it to save pages without breaking them.
What gets saved
This section only deals with current major desktop browsers. There are four ways that browsers will usually save Web pages:
- Basic HTML: Contains only the page itself, without any images or external content such as stylesheets or scripts.
- Complete Web page: Contains the page itself, any images, and external content such as stylesheets or scripts. Each is carefully rewritten to ensure that the page uses the locally saved copies of these files. Browsers may also rewrite relative links to point to their online locations. Some browsers save external content in a dedicated folder to keep the clutter out of sight, while others put all content in the same folder, which can be a little more awkward from a user perspective.
- Web archive: Contains the page itself, any images, and external content such as stylesheets or scripts. They are bundled into the archive file without any rewriting, and when the browser loads the page, it uses the local version instead of the online version, if it is available. This is usually done using MHTML (like a MIME email), where each file is stored (separated by a unique identifier), along with the location it represents. Files inside an MHTML archive can usually be edited in a text editor if needed, as they are usually stored in plain text. An alternative Web archive is similar to a zip or tar archive, which cannot easily be edited, and since each type of archive is used by only one browser, it cannot be shared with other browsers. However, it may be possible to unpack the archive and use the unpacked files as a Web page, which can be read by all browsers, edited, and converted into a new page, as long as the files all use relative links (many do not).
- Plain text: The page is converted into a basic text file, without any images, scripts, stylesheets, etc.
The most commonly used format for saving is the complete Web page. It makes it very easy to save pages from sites like this, so they still look and behave properly, and can be edited directly. The general problem with this format is that if extra external content (such as a stylesheet, another script, an image, an iframe, or a plugin) is included by a script, that cannot be added, as it is impossible to work out which parts of the script need to be rewritten to reference the local version (since there are an infinite number of ways that a script could compute the content location).
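The limitation can be seen in a small sketch of the rewriting step. It assumes a saver that only looks at literal `src="..."` attributes; the function name `rewrite_static_refs`, the `page_files` folder, and the example.com URLs are all invented for illustration, and real browsers do considerably more than this.

```python
import re

def rewrite_static_refs(html, local_dir):
    """Point literal src="http(s)://..." attributes at local copies."""
    def to_local(match):
        # Keep only the file name and place it in the local folder.
        filename = match.group(1).rsplit("/", 1)[-1]
        return 'src="' + local_dir + "/" + filename + '"'
    return re.sub(r'\bsrc="(https?://[^"]+)"', to_local, html)

page = (
    '<img src="http://example.com/img/logo.png">\n'
    '<script>\n'
    'var base = "http://example.com/" + "img/";\n'
    'document.write("<img src=\'" + base + "extra.png\'>");\n'
    '</script>'
)

saved = rewrite_static_refs(page, "page_files")
print(saved)
```

The first image now points at `page_files/logo.png`, but the URL that the script assembles at run time is untouched, so `extra.png` is still fetched from the original site (or is missing when offline).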
Web archives can partially solve this problem (assuming the script always adds the same external content), and are a very neat storage mechanism, but they make it difficult to share with other browsers, since not many browsers support them. In addition, although their contents can be edited, it is not easy to convert them into a new page. As a result, the complete Web page approach is usually more useful for saving pages from this site if you intend to use them as a base for making your own pages. The Web archive may be preferred for saving the tutorials for private use.
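To show what "like a MIME email" means in practice, here is a minimal sketch that builds an MHTML-style bundle with Python's standard `email` package and reads the stored locations back out. The helper names and the example.com URLs are assumptions, and the output is only meant to illustrate the structure; it is not guaranteed to open in any particular browser.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_archive(files):
    """Bundle (location, subtype, content) entries into one MIME message."""
    archive = MIMEMultipart("related")
    for location, subtype, content in files:
        # ASCII content with no explicit charset is stored as readable text.
        part = MIMEText(content, subtype)
        # Each part records the online location that it represents.
        part["Content-Location"] = location
        archive.attach(part)
    return archive.as_string()

def list_locations(archive_text):
    """Return the Content-Location of every stored file, in order."""
    message = email.message_from_string(archive_text)
    return [part["Content-Location"] for part in message.walk()
            if part["Content-Location"]]

archive_text = build_archive([
    ("http://example.com/page.html", "html", "<p>Hello</p>"),
    ("http://example.com/style.css", "css", "p { color: red; }"),
])

print(archive_text)
print(list_locations(archive_text))
```

Printing the archive shows the parts separated by a generated boundary string, which is the "unique identifier" mentioned above, with each file's text sitting in plain view where a text editor can reach it.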
The plain text and basic HTML generally lead to a fairly broken page that cannot do very much except provide you with text. If all you want to do is read, that may be OK, but anything relying on script or styling will generally fail very badly, and most links will fail. This typically makes it unsuitable for anything containing a demonstration or example (so not very helpful for anything on this site, or many others either).
What does each browser provide
If a browser provides more than one option, the choice is provided on the save dialog (except in Konqueror, where archiving is in the Tools menu).
Browser | HTML | Complete | Archive | Text |
---|---|---|---|---|
iCab | Yes | No | Zip | Yes |
IE 6+ Win | Yes | Yes | MHTML | Yes |
IE Mac | Yes | No | MHTML | Yes |
Konqueror | Yes | No | Tar | No |
Mozilla/Firefox | Yes | Yes | No | Yes |
Opera 8 | Yes | Yes | No | Yes |
Opera 9 | Yes | Yes | MHTML | Yes |
OmniWeb | Yes | No | No | No |
Safari | Yes | No | Custom | No |
In Internet Explorer 6 on Windows, and Mozilla/Firefox, the default is complete Web page. In Opera, the default is basic HTML (select 'HTML file with images' from the 'Save as type' dropdown for complete Web page). Internet Explorer 7 on Windows and Internet Explorer on Mac use Web archive as the default. All others use basic HTML as the default (or only) option. Firefox has an extension to support saving and opening of MHTML, but this is not part of the default install so is not considered here. In any case, it is subject to the same problems as normal saving in Firefox.
Many operating systems also allow an option to 'print' to PDF, which would basically be like saving a picture of the page - no interaction would work. This is no better for most purposes than saving basic HTML.
How well do they work
This deals only with complete Web page and archive formats, since the plain text and basic HTML formats do nothing special and always save minimal content, losing all styling, scripting, and interactivity (and losing most of their usefulness in the process).
Pages can be made up of several things: the page itself, external script files, external stylesheets, alternate stylesheets, imported stylesheets, images, framed pages, plugins, and dynamically added external resources. The browser may also rewrite links. When saving in complete Web page format, it may use an external folder in order to reduce clutter. The browser should not break stylesheets, or pages using dynamic scripts. Browsers should not download linked resources unless they are part of the current page. Browsers should not have to download the pages again in order to save them, since they are already in the cache. Browsers may or may not be able to save pages generated entirely by scripts (held in memory, not as a real file). So, let's see how well they do:
- IE C: Internet Explorer on Windows; Complete
- IE A: Internet Explorer on Windows; Archive
- IEM A: Internet Explorer on Mac; Archive
- Op C: Opera; Complete
- Op A: Opera; Archive
- MZ C: Mozilla/Firefox; Complete
- Kq A: Konqueror; Archive
- S A: Safari; Archive
- iC A: iCab; Archive
Content type | IE C | IE A | IEM A | Op C | Op A | MZ C | Kq A | S A | iC A |
---|---|---|---|---|---|---|---|---|---|
Page | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
External scripts | Yes | Yes | Yes | Yes | Yes | Can break | Can break | Can break | Yes |
External stylesheets | Broken | Broken | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Alternate stylesheets | Broken | Broken | Yes | Yes | Yes | Yes | Yes | No | Yes |
Imported stylesheets | Wrong, Broken | Broken | No | Yes | Yes | No | Some | No | Yes |
Images | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Images in stylesheets | No | No | No | Yes | Yes | No | No | No | Some |
Framed pages | Yes | Yes | Yes | Yes | Yes | Yes * | Broken | Yes | Yes |
Plugins | No | No | Yes | Yes | Yes | Yes | No | Yes | Yes |
Dynamically added external resources | N/A | No | No | N/A | No | Broken | Broken | Broken | Yes |
Rewrites links | Yes | N/A | N/A | Yes | N/A | Yes | No | N/A | N/A |
External folder | Yes | N/A | N/A | 9+ | N/A | Yes | N/A | N/A | N/A |
Undamaged stylesheets | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Undamaged dynamic scripts | Yes | Yes | Yes | Yes | Yes | No | No | No | Yes |
Relevant LINK resources only | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
Saves the cached page | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |
Script generated pages | Broken in IE 6+ | Broken | Broken | No | No | Yes | Yes | Broken | Yes |
* Mozilla/Firefox does not correctly rewrite links to external resources (such as stylesheets) in framed pages.
IE Mac's Web archive goes a little overboard and saves everything that is referenced via <link> tags, such as 'next', 'prev', 'index', etc, even though these are not used as part of the page itself. Not only does it bloat the archive with files that are not needed, but it also does not use the cached copy of the page. Instead it downloads everything again. Images, stylesheets, HTML, scripts; everything. This not only wastes bandwidth for both you and the Web site host, but also means that pages whose dynamic content is different every time will have the wrong content saved. In the case of online shopping, saving the order confirmation page (as proof of purchase) could easily end up submitting your order a second time. Ouch.
Konqueror makes the same basic mistake as IE Mac - it saves everything referenced via <link> tags, bloating the archive. Fortunately it uses the cached versions if it has them, but saving one page on my site can make it save five or more separate redundant pages, which can waste bandwidth for both of us (in the worst case, it can save over 30 pages when you try to save 1). It also fails to save imported stylesheets unless the master stylesheet is a part of the main page (not an imported or linked stylesheet), and it fails to save any images referenced by stylesheets. It breaks any scripts in the same way as Mozilla/Firefox, duplicating their dynamic content, or breaking them completely. When saving files in frames, it saves them with the file extension they had on the server (such as .php or .asp), which it then fails to load if you try to view the archived page. On top of that, it also fails to rewrite any links, so they all point to tar:/linkpath (which doesn't exist), and it uses a format that no other browser understands.
Safari saves in its own custom Web archive format that is not supported by any other browser. In addition, it breaks scripts in the same way as Mozilla/Firefox, duplicating their dynamic content, or breaking them completely. It fails to save imported stylesheets, alternate stylesheets, and images used by stylesheets, meaning that even though you have saved the Web page, you will still need an Internet connection to view the saved version properly.
OK, so clearly one browser managed to survive more challenges than the others, and that was iCab. The only thing it failed to do correctly was to save all images used by the stylesheets (it stored the images referred to by the currently enabled stylesheet, not the alternate stylesheets). However, its archive format is incompatible with other browsers. In addition, most of you who save demonstration or example pages from this site will want to be able to edit them to make your own pages, so the complete Web page format is much more useful.
Internet Explorer and Mozilla/Firefox make a real mess with their breaking of stylesheets and dynamic page content - to the extent that they are unusable. The only thing that Mozilla/Firefox is useful for is saving pages that are completely generated by JavaScript (almost always popup content), and that does not happen very often at all. Internet Explorer failed so badly to handle saving of stylesheets that for a while, the ability to save complete or archived Web pages was actually removed in early versions of IE 7.
That leaves just one browser: Opera. Remember to select 'HTML file with images' from the 'Save as type' dropdown.