html2text
--------------------
Introduction
The html2text PHP script renders HTML as text. It was originally designed
to construct text-based email content from HTML pages, but it is useful
wherever text representations of HTML are required, and it works especially
well with makeMIME. The rendering is similar to how text browsers like Lynx
render Web pages, but this script is designed for more restrictive
environments, where documents cannot respond to forms or hyperlinks, and
features such as text underlines, bold fonts and colours are not available.
Version 2 is a complete rewrite using a totally different approach. The
output is compatible with version 1, but with significantly improved
functionality. Instead of relying on rudimentary pattern matching, it now
uses a proper HTML parser to load and interpret the HTML, then a text
rendering engine to determine the layout. The main function may still be
called in exactly the same way, meaning that in most cases, the new version
can simply be dropped directly in as a replacement for the old one.
Though significantly more complex and much larger in code size, it
outperforms version 1 in all respects.
Benefits of version 2+
· Twice as fast as version 1, as it does not have to run regular
expressions against long strings.
· Easily configurable and extendable with support for new elements.
· Stylable with content before and after the element, element margins and
preformatting.
· Proper margin collapsing model avoids unwanted gaps between elements,
and allows preformatted elements to have multiple blank lines.
· Proper collapsing of whitespace avoids odd indents and gaps inside
elements.
· Better support for invalid HTML (based on PHP's DOMDocument::loadHTML),
which also copes with HTML fragments.
· No longer suffers from various replacement loops and lockups.
· Support for base href.
· Support for numbered lists.
· Basic support for indenting nested lists.
· Basic HTML 5 support.
· Default styles for more HTML 4 elements, including several important
phrasing elements.
· Better default styling for headings.
· Optional basic XHTML parsing mode (using an XML parser).
· Optional loading directly from file instead of a string.
· Optional processing of pre-prepared DOMDocument objects.
· Better support for unicode characters.
Limitations of version 2+
· Support depends on PHP 5's DOMDocument. It is normally installed by
default, but some installations may need this feature to be specifically
enabled (for example, Fedora/CentOS distributions need the "php-xml"
package to be installed: yum install php-xml).
· PHP 4 does not support HTML parsing, and even with XML parsing, its DOM
support uses non-standard method calls and property names. As a result, it
cannot be made to use version 2 of this script. PHP 4 installations will
need to use the now-unsupported version 1.
· Support for encodings is at the mercy of PHP and libxml. They have many
limitations, discussed below.
· Support for HTML parsing is at the mercy of libxml. It cannot currently
cope with certain constructs, such as "</" in scripts (but "<\/" is OK), so
you may need to strip scripts on pages that rely on that syntax.
· Base href support is not able to cope with a base href containing
unicode characters that are not in the base PHP encoding.
· The element tagName is used instead of namespaces when styling -
namespace prefixes do not work.
· Table cells cannot always be aligned correctly when they have long
contents - restrict tables to small amounts of data.
· Wordwrapping does not take tabs or text indents into account - this is a
limitation of PHP's wordwrap function.
· Wordwrapping is not multibyte-safe, so it may occasionally fail to wrap
long lines of unicode characters (in most cases, it just works) - this is a
limitation of PHP's wordwrap function.
Table cell alignment, wordwrapping, and the namespace prefix limitations
also affect version 1.
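Because version 2 depends on DOMDocument, a calling script may want to check for it before including the library. The following is a minimal sketch and not part of html2text itself; the function name h2t_dom_available and the status messages are hypothetical, introduced here only for illustration.

```php
<?php
// Hypothetical helper (not part of html2text): reports whether the DOM
// extension that version 2 relies on is available in this PHP installation.
function h2t_dom_available() {
    return extension_loaded('dom') && class_exists('DOMDocument');
}

if (h2t_dom_available()) {
    // It should be safe to include the library at this point, for example:
    // require('PATH_TO_FILE/html2text.php');
    $h2t_status = 'DOMDocument is available';
} else {
    $h2t_status = 'DOMDocument is missing; enable the PHP DOM/XML extension';
}
echo $h2t_status, "\n";
```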
Using the script
To use this script library, put the following line in your script before
the part that needs it:
require('PATH_TO_FILE/html2text.php');
Then choose the type of conversion that you require - note that only the
first type of call to html2text can strip PHP, and even then it may fail if
there are encoding problems in the source string:
To convert an HTML/PHP string to text
$textVersion = html2text( $HTMLstring );
To convert an XHTML string to text
$textVersion = html2text( $HTMLstring, true );
To convert an HTML file to a text string
$textVersion = html2text( $filepath, false, true );
To convert an XHTML file to a text string
$textVersion = html2text( $filepath, true, true );
To convert a pre-prepared case-insensitive HTML DOMDocument object (some
harmless properties beginning with "h2t_" will be added to the object)
$textVersion = html2text( $domdocumentobject );
To convert a pre-prepared case-sensitive XHTML DOMDocument object (some
harmless properties beginning with "h2t_" will be added to the object)
$textVersion = html2text( $domdocumentobject, true );
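One reason to pass a pre-prepared DOMDocument is to alter the document before it is rendered. The sketch below removes all form elements and then hands the document over. The sample markup and the variable names are illustrative assumptions, and the html2text call is guarded so that the snippet also runs when the library has not been included.

```php
<?php
// Illustrative markup; any HTML string would do.
$sampleHTML = '<html><body><h1>Title</h1>'
    . '<form action="search.php"><input name="q"></form>'
    . '<p>Body text.</p></body></html>';

$domdocumentobject = new DOMDocument();
$domdocumentobject->loadHTML($sampleHTML);

// Copy the live node list to an array first, so that removing nodes
// does not disturb the iteration.
$forms = iterator_to_array($domdocumentobject->getElementsByTagName('form'));
foreach ($forms as $form) {
    $form->parentNode->removeChild($form);
}

// Only call the library if it has been included.
if (function_exists('html2text')) {
    $textVersion = html2text($domdocumentobject);
    echo $textVersion;
}
```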
Note that there are limitations in its ability to strip PHP contained in
the HTML string. It can fail to remove all PHP in some circumstances (such
as strings containing PHP tags, or encoding problems, as discussed below),
and this may expose the contents of the PHP code contained in the HTML
string. Ensure that there is no sensitive content in any PHP you pass to
the html2text function. For best results, only pass HTML.
Note that depending on the validity of the markup you are using (or the
existence of the files you ask it to load), DOMDocument may generate
various warnings or notifications.
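If those warnings are unwanted, one option is PHP's libxml_use_internal_errors function, which makes libxml collect parse errors instead of emitting them as PHP warnings. The sketch below demonstrates it on a plain DOMDocument. Whether it also silences every message raised while html2text itself parses a string or file is an assumption, not something the script documents.

```php
<?php
// Collect libxml parse errors internally instead of emitting PHP warnings.
$previousSetting = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Deliberately sloppy markup: the b element is never closed.
$doc->loadHTML('<html><body><p>Hello <b>world</p></body></html>');

// Gather whatever libxml reported (possibly nothing), then clear the buffer.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
}
libxml_clear_errors();

// Restore the previous behaviour for the rest of the script.
libxml_use_internal_errors($previousSetting);

echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```

The collected $messages can be logged or discarded as the calling script sees fit.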
The returned string will only use Windows line breaks (\r\n), as this is
required by RFC2822, the email format specification. Though normally
unnecessary, if you need to convert these into Unix line breaks (\n) for
whatever reason, use a multibyte-safe function to replace one with the
other. For example:
$textVersion = preg_replace( "/\r\n/u", "\n", $textVersion );
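Be aware that preg_replace with the /u modifier returns NULL when the subject is not valid utf-8. As a hedged alternative, a plain str_replace is also safe on utf-8 text, because the bytes for \r and \n never occur inside a multibyte utf-8 sequence. The sample string below is illustrative and stands in for real html2text output.

```php
<?php
// Illustrative value standing in for the output of html2text.
$textVersion = "First line\r\nSecond line\r\n";

// Byte-wise replacement; safe for utf-8 because CR and LF are single
// bytes that never form part of a multibyte sequence.
$unixVersion = str_replace("\r\n", "\n", $textVersion);

echo $unixVersion;
```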
Encodings
Encoding problems will occur when you try to feed characters or byte
sequences into html2text that are not valid in the encoding that PHP is
assuming when interpreting the code. The html2text function attempts to
detect and recover from simple encoding issues (be warned that it will not
strip PHP from the source string if this happens, and that may expose
sensitive contents), and will issue a warning if it detects something
wrong. However, it cannot detect all issues, and may fail completely in
some cases. You will know that you have a more serious encoding problem
with your HTML when some characters appear malformed in the html2text
output, or cause the output - or part of the output - to disappear
completely.
PHP's own support for unicode is quite poor, and it does not always handle
utf-8 correctly. Although the html2text script uses unicode-safe methods
(without needing the mbstring module), it cannot always cope with PHP's own
handling of encodings. PHP does not really understand unicode or encodings
like utf-8, and uses the base encoding of the system, which tends to be one
of the ISO-8859 family. As a result, what may look to you like a valid
character in your text editor, in either utf-8 or single-byte, may well be
misinterpreted by PHP. So even though you think you are feeding a valid
character into html2text, you may well not be.
Now add the complications of libxml, used by PHP as the DOMDocument HTML
parser. Libxml does recover from some encoding problems, and can sometimes
be made to work in other cases by the markup having the correct encoding
set in a meta tag before the rest of the content, but it will not always
work correctly. For the most reliable response, use HTML entities for all
special characters. An alternative is to use XHTML mode with valid XHTML
containing an XML prolog specifying the encoding, and make html2text load
the markup from a file. If neither of these works, please refer to the PHP
documentation for DOMDocument
(http://www.php.net/manual/en/class.domdocument.php) and libxml to find
solutions to the problem.
Using one of these approaches, the script can output any unicode character
that the target viewer can display, served as utf-8. This is a significant
improvement over version 1, which typically just displayed broken
characters.
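As a sketch of the entity approach, the function below replaces every non-ASCII character in a utf-8 string with a numeric character reference before the markup is handed to html2text. The name h2t_entity_encode is hypothetical and not part of the script. It assumes the input really is valid utf-8; if it is not, the string is returned unchanged.

```php
<?php
// Hypothetical helper (not part of html2text): converts non-ASCII
// characters in a utf-8 string to numeric character references.
function h2t_entity_encode($utf8) {
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function ($match) {
            // Decode one utf-8 sequence into its code point by hand,
            // so that the mbstring module is not needed.
            $bytes = array_values(unpack('C*', $match[0]));
            $count = count($bytes);
            $codepoint = $bytes[0] & (0xFF >> ($count + 1));
            for ($i = 1; $i < $count; $i++) {
                $codepoint = ($codepoint << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $codepoint . ';';
        },
        $utf8
    );
    // preg_replace_callback returns NULL if the input is not valid utf-8.
    return ($result === null) ? $utf8 : $result;
}

// The accented character is written as explicit bytes so the example does
// not depend on the encoding this file is saved in.
$HTMLstring = h2t_entity_encode("<p>caf\xC3\xA9</p>");
echo $HTMLstring, "\n";
```

The resulting $HTMLstring contains only ASCII bytes, so it no longer depends on how PHP or libxml guess the encoding.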
Adding and configuring elements
The script holds formatting information about elements in an array. Once
the script has been included, other elements can be added into the array as
needed. Existing elements can also have their formatting details redefined
as needed. Subsequent calls to html2text will use the updated element
details. Note, however, that in HTML mode, libxml may use error handling
when it sees unrecognised tags, and it may not build the expected DOM.
Custom (non-standard or non-supported) tags are best used only in XHTML
mode.
To add or change an element's formatting details, use the following format:
$html2text_elements['tagName'] = Array( isFlow, isPreformatted, isVoid,
marginTop, marginBottom, dropFirstChildMargin, before, after,
isBeforeFunction, isAfterFunction, dropOnFirst );
The array values are as follows:
isFlow
Boolean: says if the element should be treated as block/flow level (drop
any pending spaces, ignore leading whitespace if not preformatted).
isPreformatted
Boolean: says if the element and all its children should be treated as
preformatted text (whitespace is not collapsed), like PRE in HTML.
isVoid
Boolean: drops all childNodes completely - for elements whose contents
must not be displayed (even if error handling creates contents).
marginTop
Positive integer: minimum number of line breaks to display before this
element (basic margin collapsing will take place).
marginBottom
Positive integer: minimum number of line breaks to display after this
element (basic margin collapsing will take place).
dropFirstChildMargin
Boolean: ignore marginTop of first rendered childNode (for use where the
child must line up with the 'before' content of this element). Recommended
only for elements that have some 'before' content. Use in other cases can
cause the element's own margin to disappear completely.
before
String: text to insert before element (whitespace is not collapsed).
after
String: text to insert after element (whitespace is not collapsed).
isBeforeFunction
Boolean: execute 'before' as a function instead of treating directly as a
text node.
isAfterFunction
Boolean: execute 'after' as a function instead of treating directly as a
text node.
dropOnFirst
Boolean: says if the element's 'before' should be ignored if the element
is the firstElementChild of its parent. (Note that although this could also
be achieved with a 'before' function that checks if the element is a first
child, the results are slightly different; if a function returns an empty
string, it will not output pending spaces or linebreaks, but it will if
dropOnFirst is used instead - this makes it most useful for elements such
as a potentially empty table cell.)
The element's tagName should be lower case for HTML, and is case sensitive
for XHTML mode. Functions used for 'before' or 'after' will be passed the
element node and its childElement index as parameters. They must return a
string (return empty strings if no text is wanted).
The following simple example shows how to define the poem element as a
preformatted block element for use in X(HT)ML. It will have a large gap
above and below, and leading '[' and trailing ']' characters:
$html2text_elements['poem'] =
Array(true,true,false,3,3,false,'[',']',false,false,false);
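Because the eleven values are positional, a definition can be hard to read at a glance. The following sketch builds the same poem definition from named keys. The helper h2t_define, its key names as array keys, and the default of 0 for the margins are hypothetical conveniences introduced here, not part of the script.

```php
<?php
// Hypothetical helper (not part of html2text): builds the positional
// formatting array from named options, filling in assumed defaults.
function h2t_define(array $options) {
    $defaults = array(
        'isFlow' => false,
        'isPreformatted' => false,
        'isVoid' => false,
        'marginTop' => 0,
        'marginBottom' => 0,
        'dropFirstChildMargin' => false,
        'before' => '',
        'after' => '',
        'isBeforeFunction' => false,
        'isAfterFunction' => false,
        'dropOnFirst' => false,
    );
    // Keep only known keys; array_merge preserves the order of $defaults,
    // which is the order the script expects.
    $merged = array_merge($defaults, array_intersect_key($options, $defaults));
    return array_values($merged);
}

$html2text_elements['poem'] = h2t_define(array(
    'isFlow' => true,
    'isPreformatted' => true,
    'marginTop' => 3,
    'marginBottom' => 3,
    'before' => '[',
    'after' => ']',
));
```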
By default, the first and last margins created in the document are ignored.
You can change this by redefining the 'the document' virtual element as
follows:
//Array( ignore start margin, ignore end margin )
$html2text_elements['the document'] = Array(false,false);
Useful functions
When using functions for 'before' or 'after', there are two utility
functions provided by the script, which can help tidy up the values
retrieved from element attributes. The first is html2text_cleanspace, which
accepts a single string value. It returns a whitespace-collapsed version of
the string. The second is html2text_resolve, which accepts two parameters;
the first is a string that is a relative or absolute URL, and the second is
the element that was passed to the initial function. It returns a resolved
URL if a BASE HREF tag has been found, or if not, it just returns the
string it was passed. If no BASE HREF has been found yet, and the string
passed to the function contains only a HTML fragment identifier, it returns
an empty string.
The following example assumes the poem element has both an author attribute
containing the author's name, and a url attribute, pointing to the author's
website. It shows how to extract and clean/resolve those attributes, using
a 'before' function, and the utility functions:
$html2text_elements['poem'] =
Array(true,true,false,3,3,false,'poemdetail',']',true,false,false);
function poemdetail($element,$index) {
    $author = $url = '';
    if( $element->hasAttribute('author') ) {
        $author = html2text_cleanspace($element->getAttribute('author'));
    }
    if( $element->hasAttribute('url') ) {
        $url = html2text_resolve($element->getAttribute('url'),$element);
    }
    return '['.$author.($url?(' '.$url):'').(($author||$url)?"\r\n":'');
}