html2text

Introduction

The html2text PHP script renders HTML as text. It was originally designed to construct text-based email content from HTML pages, but it is useful wherever a text representation of HTML is required, and it works especially well with makeMIME. The output is similar to how text browsers like Lynx render Web pages, but this script is designed for more restrictive environments, where documents cannot respond to forms or hyperlinks, and features such as text underlines, bold fonts and colours are not available.

Version 2 is a complete rewrite using a totally different approach. The output is compatible with version 1, but with significantly improved functionality. Instead of relying on rudimentary pattern matching, it now uses a proper HTML parser to load and interpret the HTML, then a text rendering engine to determine the layout. The main function may still be called in exactly the same way, meaning that in most cases, the new version can simply be dropped directly in as a replacement for the old one.

Though it is significantly more complex and has a much larger code size, it outperforms version 1 in all respects.

Benefits of version 2+

Twice as fast as version 1, as it does not have to run regular expressions against long strings.
Easily configurable and extendable with support for new elements.
Stylable with content before and after the element, element margins and preformatting.
Proper margin collapsing model avoids unwanted gaps between elements, and allows preformatted elements to have multiple blank lines.
Proper collapsing of whitespace avoids odd indents and gaps inside elements.
Better support for invalid HTML (based on PHP's DOMDocument::loadHTML), which also copes with HTML fragments.
No longer suffers from various replacement loops and lockups.
Support for base href.
Support for numbered lists.
Basic support for indenting nested lists.
Basic HTML 5 support.
Default styles for more HTML 4 elements, including several important phrasing elements.
Better default styling for headings.
Optional basic XHTML parsing mode (using an XML parser).
Optional loading directly from file instead of a string.
Optional processing of pre-prepared DOMDocument objects.
Better support for unicode characters.

Limitations of version 2+

Support depends on PHP 5's DOMDocument. It is normally installed by default, but some installations need it to be specifically enabled (for example, Fedora/CentOS distributions need the "php-xml" package installed: yum install php-xml).
PHP 4 does not support HTML parsing, and even with XML parsing, its DOM support uses non-standard method calls and property names. As a result, it cannot be made to use version 2 of this script. PHP 4 installations will need to use the now-unsupported version 1.
Support for encodings is at the mercy of PHP and libxml. They have many limitations, discussed below.
Support for HTML parsing is at the mercy of libxml. It cannot currently cope with certain constructs, such as "</" in scripts (but "<\/" is OK), so you may need to strip scripts on pages that rely on that syntax.
Base href support is not able to cope with a base href containing unicode characters that are not in the base PHP encoding.
The element tagName is used instead of namespaces when styling - namespace prefixes do not work.
Table cells cannot always be aligned correctly when they have long contents - restrict tables to small amounts of data.
Wordwrapping does not take tabs or text indents into account - this is a limitation of PHP's wordwrap function.
Wordwrapping is not multibyte-safe, so it may occasionally fail to wrap long lines of unicode characters (in most cases, it just works) - this is a limitation of PHP's wordwrap function.
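
For pages affected by the "</" limitation above, one option is to remove script elements before the markup reaches the parser. The following is a minimal sketch and not part of html2text; the function name strip_scripts is hypothetical, and a regular expression like this is only a rough filter that assumes each script is closed with a literal </script> tag:

```php
<?php
// Hypothetical helper (not part of html2text): drop <script> elements from an
// HTML string before it is parsed, so "</" inside scripts cannot confuse libxml.
function strip_scripts($html) {
    // Non-greedy match from an opening <script ...> tag to the nearest </script>.
    // The s flag lets . match newlines, the i flag ignores tag case.
    $stripped = preg_replace('#<script\b[^>]*>.*?</script\s*>#is', '', $html);
    // preg_replace returns null on failure; fall back to the original string.
    return $stripped === null ? $html : $stripped;
}

$HTMLstring = '<p>Before</p><script>document.write("<b>hi</b>");</script><p>After</p>';
$cleaned = strip_scripts($HTMLstring);
```

The cleaned string can then be passed to html2text in the usual way.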

Table cell alignment, wordwrapping, and the namespace prefix limitations also affect version 1.
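
The multibyte limitation comes from PHP's wordwrap function counting bytes rather than characters. A minimal sketch that shows the effect on its own, independent of html2text (the strings and the width of 11 are arbitrary values chosen for illustration):

```php
<?php
// "\xC3\xA9" is the two-byte utf-8 encoding of "é".
$eAcute    = "\xC3\xA9";
$ascii     = 'eeeee eeeee';                                          // 11 characters, 11 bytes
$multibyte = str_repeat($eAcute, 5) . ' ' . str_repeat($eAcute, 5);  // 11 characters, 21 bytes

// Both strings are 11 characters long, so neither should need wrapping at width 11.
$wrappedAscii     = wordwrap($ascii, 11, "\n");
$wrappedMultibyte = wordwrap($multibyte, 11, "\n");

// wordwrap measures the second string as 21 bytes, so it breaks at the space.
var_dump($wrappedAscii === $ascii);          // true: left unchanged
var_dump($wrappedMultibyte === $multibyte);  // false: a line break was inserted
```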

Using the script

To use this script library, put the following line in your script before the part that needs it:

require('PATH_TO_FILE/html2text.php');

Then choose the type of conversion that you require - note that only the first type of call to html2text can strip PHP, and even then it may fail if there are encoding problems in the source string:

To convert an HTML/PHP string to text

$textVersion = html2text( $HTMLstring );

To convert an XHTML string to text

$textVersion = html2text( $HTMLstring, true );

To convert an HTML file to a text string

$textVersion = html2text( $filepath, false, true );

To convert an XHTML file to a text string

$textVersion = html2text( $filepath, true, true );

To convert a pre-prepared case-insensitive HTML DOMDocument object (some harmless properties beginning with "h2t_" will be added to the object)

$textVersion = html2text( $domdocumentobject );

To convert a pre-prepared case-sensitive XHTML DOMDocument object (some harmless properties beginning with "h2t_" will be added to the object)

$textVersion = html2text( $domdocumentobject, true );

Note that there are limitations in its ability to strip PHP contained in the HTML string. It can fail to remove all PHP in some circumstances (such as strings containing PHP tags, or encoding problems, as discussed below), and this may expose the contents of the PHP code contained in the HTML string. Ensure that there is no sensitive content in any PHP you pass to the html2text function. For best results, only pass HTML.

Note that depending on the validity of the markup you are using (or the existence of the files you ask it to load), DOMDocument may generate various warnings or notifications.
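
When you prepare the DOMDocument yourself, libxml's parse problems can be collected rather than emitted as PHP warnings. The following is a minimal sketch using PHP's standard libxml functions. The sample markup and the $parseProblems variable are illustrative, and the html2text call is guarded only so that the snippet also runs when html2text.php has not been included:

```php
<?php
// Ask libxml to store parse problems internally instead of raising PHP warnings.
$previousSetting = libxml_use_internal_errors(true);

$domdocumentobject = new DOMDocument();
// Deliberately sloppy markup: the b element is never closed.
$domdocumentobject->loadHTML('<html><body><p>Hello <b>world</p></body></html>');

// Keep the messages for logging or inspection, then reset libxml's error buffer.
$parseProblems = array();
foreach (libxml_get_errors() as $error) {
    $parseProblems[] = trim($error->message);
}
libxml_clear_errors();
libxml_use_internal_errors($previousSetting);

// html2text() is defined by html2text.php, which must be included first.
if (function_exists('html2text')) {
    $textVersion = html2text($domdocumentobject);
}
```

This only affects warnings raised by the parser while you build the document; any warnings that html2text issues itself are a separate matter.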

The returned string will only use Windows line breaks (\r\n), as this is required by RFC 2822, the email format specification. Though normally unnecessary, if you need to convert these into Unix line breaks (\n) for whatever reason, use a multibyte-safe function to replace one with the other. For example:

$textVersion = preg_replace( "/\r\n/u", "\n", $textVersion );

Encodings

Encoding problems will occur when you try to feed characters or byte sequences into html2text that are not valid in the encoding that PHP is assuming when interpreting the code. The html2text function attempts to detect and recover from simple encoding issues (be warned that it will not strip PHP from the source string if this happens, and that may expose sensitive contents), and will issue a warning if it detects something wrong. However, it cannot detect all issues, and may fail completely in some cases. You will know that you have a more serious encoding problem with your HTML when some characters appear malformed in the html2text output, or cause the output - or part of the output - to disappear completely.

PHP's own support for unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP's own handling of encodings. PHP does not really understand unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are feeding a valid character into html2text, you may well not be.

Now add the complications of libxml, used by PHP as the DOMDocument HTML parser. Libxml does recover from some encoding problems, and can sometimes be made to work in other cases by the markup having the correct encoding set in a meta tag before the rest of the content, but it will not always work correctly. For the most reliable response, use HTML entities for all special characters. An alternative is to use XHTML mode with valid XHTML containing an XML prolog specifying the encoding, and make html2text load the markup from a file. If neither of these works, please refer to the PHP documentation for DOMDocument and libxml to find solutions to the problem.

Using one of these approaches, the script can output any unicode character that the target viewer can display, served as utf-8. This is a significant improvement over version 1, which typically just displayed broken characters.
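
The entity-based approach can be automated when the source is known to be valid utf-8. The following is a minimal sketch, assuming utf-8 input; the function name utf8_to_entities is hypothetical and not part of html2text, and the code avoids the mbstring module by decoding each multibyte sequence by hand:

```php
<?php
// Hypothetical helper (not part of html2text): replace every non-ASCII
// character in a utf-8 string with a numeric HTML entity such as &#233;.
function utf8_to_entities($html) {
    $converted = preg_replace_callback('/[^\x00-\x7F]/u', function ($match) {
        $bytes = array_values(unpack('C*', $match[0]));
        $count = count($bytes);
        // The lead byte carries fewer payload bits as the sequence gets longer.
        $codepoint = $bytes[0] & (0xFF >> ($count + 1));
        // Each continuation byte contributes its low six bits.
        for ($i = 1; $i < $count; $i++) {
            $codepoint = ($codepoint << 6) | ($bytes[$i] & 0x3F);
        }
        return '&#' . $codepoint . ';';
    }, $html);
    // With the u modifier, invalid utf-8 makes the call return null;
    // in that case hand back the input untouched.
    return $converted === null ? $html : $converted;
}

// "\xC3\xA9" is "é" and "\xE2\x82\xAC" is "€" in utf-8.
$HTMLstring = "<p>Caf\xC3\xA9: 3 \xE2\x82\xAC</p>";
$entityVersion = utf8_to_entities($HTMLstring); // <p>Caf&#233;: 3 &#8364;</p>
```

The result contains only ASCII bytes, so it no longer depends on how PHP or libxml guesses the encoding of the string.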

Adding and configuring elements

The script holds formatting information about elements in an array. Once the script has been included, other elements can be added into the array as needed. Existing elements can also have their formatting details redefined as needed. Subsequent calls to html2text will use the updated element details. Note, however, that in HTML mode, libxml may use error handling when it sees unrecognised tags, and it may not build the expected DOM. Custom (non-standard or non-supported) tags are best used only in XHTML mode.

To add or change an element's formatting details, use the following format:

$html2text_elements['tagName'] = Array( isFlow, isPreformatted, isVoid, marginTop, marginBottom, dropFirstChildMargin, before, after, isBeforeFunction, isAfterFunction, dropOnFirst );

The array values are as follows:

isFlow: Boolean: says if the element should be treated as block/flow level (drop any pending spaces, ignore leading whitespace if not preformatted).
isPreformatted: Boolean: says if the element and all its children should be treated as preformatted text (whitespace is not collapsed), like PRE in HTML.
isVoid: Boolean: drops all childNodes completely - for elements whose contents must not be displayed (even if error handling creates contents).
marginTop: Positive integer: minimum number of line breaks to display before this element (basic margin collapsing will take place).
marginBottom: Positive integer: minimum number of line breaks to display after this element (basic margin collapsing will take place).
dropFirstChildMargin: Boolean: ignore marginTop of first rendered childNode (for use where the child must line up with the 'before' content of this element). Recommended only for elements that have some 'before' content. Use in other cases can cause the element's own margin to disappear completely.
before: String: text to insert before element (whitespace is not collapsed).
after: String: text to insert after element (whitespace is not collapsed).
isBeforeFunction: Boolean: execute 'before' as a function instead of treating directly as a text node.
isAfterFunction: Boolean: execute 'after' as a function instead of treating directly as a text node.
dropOnFirst: Boolean: says if the element's 'before' should be ignored if the element is the firstElementChild of its parent. (Note that although this could also be achieved with a 'before' function that checks if the element is a first child, the results are slightly different; if a function returns an empty string, it will not output pending spaces or linebreaks, but it will if dropOnFirst is used instead - this makes it most useful for elements such as a potentially empty table cell.)

The element's tagName should be lower case for HTML, and is case sensitive for XHTML mode. Functions used for 'before' or 'after' will be passed the element node and its childElement index as parameters. They must return a string (return empty strings if no text is wanted).

The following simple example shows how to define the poem element as a preformatted block element for use in X(HT)ML. It will have a large gap above and below, and leading '[' and trailing ']' characters:

$html2text_elements['poem'] = Array(true,true,false,3,3,false,'[',']',false,false,false);
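
Because the eleven values are positional, it is easy to put one in the wrong slot. The following is a minimal sketch of a helper that builds the array from named options. The function name element_definition and its default values are hypothetical and not part of html2text; only the order of the values is taken from the format described above:

```php
<?php
// Hypothetical helper (not part of html2text): build an element definition
// from named options, in the positional order that html2text expects.
function element_definition($options) {
    // Defaults chosen for this sketch: an inline element with no margins or extra text.
    $defaults = array(
        'isFlow' => false,
        'isPreformatted' => false,
        'isVoid' => false,
        'marginTop' => 0,
        'marginBottom' => 0,
        'dropFirstChildMargin' => false,
        'before' => '',
        'after' => '',
        'isBeforeFunction' => false,
        'isAfterFunction' => false,
        'dropOnFirst' => false,
    );
    // Reject misspelled option names rather than silently appending them.
    $unknown = array_diff_key($options, $defaults);
    if ($unknown) {
        throw new InvalidArgumentException('Unknown option: ' . implode(', ', array_keys($unknown)));
    }
    // array_merge keeps the key order of $defaults, so array_values yields
    // the values in the documented positions.
    return array_values(array_merge($defaults, $options));
}

// Equivalent to the poem definition above.
$html2text_elements['poem'] = element_definition(array(
    'isFlow' => true, 'isPreformatted' => true,
    'marginTop' => 3, 'marginBottom' => 3,
    'before' => '[', 'after' => ']',
));
```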

By default, the first and last margins created in the document are ignored. You can change this by redefining the 'the document' virtual element as follows:

//Array( ignore start margin, ignore end margin )
$html2text_elements['the document'] = Array(false,false);

Useful functions

When using functions for 'before' or 'after', there are two utility functions provided by the script, which can help tidy up the values retrieved from element attributes. The first is html2text_cleanspace, which accepts a single string value and returns a whitespace-collapsed version of the string. The second is html2text_resolve, which accepts two parameters: the first is a string that is a relative or absolute URL, and the second is the element that was passed to the initial function. It returns a resolved URL if a BASE HREF tag has been found; if not, it just returns the string it was passed. If no BASE HREF has been found yet, and the string passed to the function contains only an HTML fragment identifier, it returns an empty string.

The following example assumes the poem element has both an author attribute containing the author's name, and a url attribute, pointing to the author's website. It shows how to extract and clean/resolve those attributes, using a 'before' function, and the utility functions:

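//isBeforeFunction is true, so 'poemdetail' is called to produce the 'before' text
$html2text_elements['poem'] = Array(true,true,false,3,3,false,'poemdetail',']',true,false,false);
function poemdetail($element,$index) {
  $author = $url = '';
  if( $element->hasAttribute('author') ) {
    //collapse stray whitespace in the author's name
    $author = html2text_cleanspace($element->getAttribute('author'));
  }
  if( $element->hasAttribute('url') ) {
    //resolve the URL against any BASE HREF found in the document
    $url = html2text_resolve($element->getAttribute('url'),$element);
  }
  //returns '[' plus the author and URL followed by a line break, or just '[' if neither is set
  return '['.$author.($url?(' '.$url):'').(($author||$url)?"\r\n":'');
}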

To download the script(s), and see the script license, use the links on the navigation panel at the top of this page.