net.htmlparser.jericho
Class Renderer

java.lang.Object
  extended by net.htmlparser.jericho.Renderer
All Implemented Interfaces:
CharStreamSource

public class Renderer
extends java.lang.Object
implements CharStreamSource

Performs a simple rendering of HTML markup into text.

This provides a human readable version of the segment content that is modelled on the way Mozilla Thunderbird and other email clients provide an automatic conversion of HTML content to text in their alternative MIME encoding of emails.

The output using default settings complies with the "text/plain; format=flowed" (DelSp=No) protocol described in RFC3676.

Many properties are available to customise the output, the most significant of which is probably MaxLineLength. See the individual property descriptions for details.

Use one of the following methods to obtain the output:
writeTo(Writer)
appendTo(Appendable)
toString()

The rendering of some constructs, especially tables, is very rudimentary. No attempt is made to render nested tables properly, except to ensure that all of the text content is included in the output.

Rendering an entire Source object performs a full sequential parse automatically.

Any aspect of the algorithm not specifically mentioned here is subject to change without notice in future versions.

To extract pure text without any rendering of the markup, use the TextExtractor class instead.


Constructor Summary
Renderer(Segment segment)
          Constructs a new Renderer based on the specified Segment.
 
Method Summary
 void appendTo(java.lang.Appendable appendable)
          Appends the output to the specified Appendable object.
 int getBlockIndentSize()
          Returns the size of the indent to be used for anything other than LI elements.
 boolean getConvertNonBreakingSpaces()
          Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.
 boolean getDecorateFontStyles()
          Indicates whether decoration characters are to be included around the content of some font style elements and phrase elements.
 long getEstimatedMaximumOutputLength()
          Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.
 boolean getIncludeHyperlinkURLs()
          Indicates whether hyperlink URLs are included in the output.
 char[] getListBullets()
          Returns the bullet characters to use for list items inside UL elements.
 int getListIndentSize()
          Returns the size of the indent to be used for LI elements.
 int getMaxLineLength()
          Returns the column at which lines are to be wrapped.
 java.lang.String getNewLine()
          Returns the string to be used to represent a newline in the output.
 java.lang.String getTableCellSeparator()
          Returns the string that is to separate table cells.
 java.lang.String renderHyperlinkURL(StartTag startTag)
          Renders the hyperlink URL from the specified StartTag.
 Renderer setBlockIndentSize(int blockIndentSize)
          Sets the size of the indent to be used for anything other than LI elements.
 Renderer setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
          Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.
 Renderer setDecorateFontStyles(boolean decorateFontStyles)
          Sets whether decoration characters are to be included around the content of some font style elements and phrase elements.
 Renderer setIncludeHyperlinkURLs(boolean includeHyperlinkURLs)
          Sets whether hyperlink URLs are included in the output.
 Renderer setListBullets(char[] listBullets)
          Sets the bullet characters to use for list items inside UL elements.
 Renderer setListIndentSize(int listIndentSize)
          Sets the size of the indent to be used for LI elements.
 Renderer setMaxLineLength(int maxLineLength)
          Sets the column at which lines are to be wrapped.
 Renderer setNewLine(java.lang.String newLine)
          Sets the string to be used to represent a newline in the output.
 Renderer setTableCellSeparator(java.lang.String tableCellSeparator)
          Sets the string that is to separate table cells.
 java.lang.String toString()
          Returns the output as a string.
 void writeTo(java.io.Writer writer)
          Writes the output to the specified Writer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Renderer

public Renderer(Segment segment)
Constructs a new Renderer based on the specified Segment.

Parameters:
segment - the segment containing the HTML to be rendered.
See Also:
Segment.getRenderer()
Method Detail

writeTo

public void writeTo(java.io.Writer writer)
             throws java.io.IOException
Description copied from interface: CharStreamSource
Writes the output to the specified Writer.

Specified by:
writeTo in interface CharStreamSource
Parameters:
writer - the destination java.io.Writer for the output.
Throws:
java.io.IOException - if an I/O exception occurs.

appendTo

public void appendTo(java.lang.Appendable appendable)
              throws java.io.IOException
Description copied from interface: CharStreamSource
Appends the output to the specified Appendable object.

Specified by:
appendTo in interface CharStreamSource
Parameters:
appendable - the destination java.lang.Appendable object for the output.
Throws:
java.io.IOException - if an I/O exception occurs.

getEstimatedMaximumOutputLength

public long getEstimatedMaximumOutputLength()
Description copied from interface: CharStreamSource
Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.

The returned value should be used as a guide for efficiency purposes only, for example to set an initial StringBuilder capacity. There is no guarantee that the length of the output is indeed less than this value, as classes implementing this method often use assumptions based on typical usage to calculate the estimate.

Although implementations of this method should never return a value less than -1, users of this method must not assume that this will always be the case. Standard practice is to interpret any negative value as meaning that no estimate is available.

Specified by:
getEstimatedMaximumOutputLength in interface CharStreamSource
Returns:
the estimated maximum number of characters in the output, or -1 if no estimate is available.
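
The advice above about treating the estimate as a hint can be sketched as follows. This is a minimal illustration in plain Java and is not part of the library: the class OutputBuffers, its methods, the fallback capacity of 256 and the upper bound of 1 << 20 are all hypothetical choices made for the example. Any negative estimate is treated as "no estimate available", and large values are clamped because a StringBuilder capacity is an int.

```java
// Hypothetical helper, not part of Jericho: turns an estimate such as the one
// returned by getEstimatedMaximumOutputLength() into an initial buffer capacity.
final class OutputBuffers {
    // Assumed fallback capacity when no estimate is available.
    static final int DEFAULT_CAPACITY = 256;
    // Assumed upper bound so a very large estimate does not pre-allocate too much.
    static final int MAX_INITIAL_CAPACITY = 1 << 20;

    static int initialCapacity(long estimate) {
        // Interpret any negative value, not only -1, as "no estimate".
        if (estimate < 0) return DEFAULT_CAPACITY;
        return (int) Math.min(estimate, MAX_INITIAL_CAPACITY);
    }

    static StringBuilder newBuffer(long estimate) {
        // The builder still grows on demand if the output exceeds the estimate.
        return new StringBuilder(initialCapacity(estimate));
    }
}
```

A caller could pass the result of newBuffer(renderer.getEstimatedMaximumOutputLength()) to appendTo(Appendable); the output remains correct even when the estimate turns out to be too low.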

toString

public java.lang.String toString()
Description copied from interface: CharStreamSource
Returns the output as a string.

Specified by:
toString in interface CharStreamSource
Overrides:
toString in class java.lang.Object
Returns:
the output as a string.

setMaxLineLength

public Renderer setMaxLineLength(int maxLineLength)
Sets the column at which lines are to be wrapped.

Lines that would otherwise exceed this length are wrapped onto a new line at a word boundary.

A line may still exceed this length if it consists of a single word, where the length of the word plus the line indent exceeds the maximum length. In this case the line is wrapped immediately after the end of the word.

The default value is 76, which reflects the maximum line length for sending email data specified in RFC2049 section 3.5.

Parameters:
maxLineLength - the column at which lines are to be wrapped.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getMaxLineLength()
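
The wrapping rule described for this property can be modelled with a simple greedy algorithm. The sketch below is an assumption-laden illustration in plain Java, not the library's implementation: the class WrapSketch and its wrap method are hypothetical, line indents are ignored, and a line of exactly maxLineLength characters is assumed to be allowed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only, not the library's algorithm:
// greedy wrapping at word boundaries.
final class WrapSketch {
    static List<String> wrap(String text, int maxLineLength) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) continue; // only occurs for blank input
            // Start a new line if appending this word would exceed the limit.
            if (line.length() > 0 && line.length() + 1 + word.length() > maxLineLength) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) line.append(' ');
            // A single word longer than the limit still occupies a line by itself.
            line.append(word);
        }
        if (line.length() > 0) lines.add(line.toString());
        return lines;
    }
}
```

With a limit of 5, the input "a verylongword b" yields three lines in this model, the middle one exceeding the limit because it consists of a single word.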

getMaxLineLength

public int getMaxLineLength()
Returns the column at which lines are to be wrapped.

See the setMaxLineLength(int) method for a full description of this property.

Returns:
the column at which lines are to be wrapped.

setNewLine

public Renderer setNewLine(java.lang.String newLine)
Sets the string to be used to represent a newline in the output.

The default value is "\r\n" (CR+LF) regardless of the platform on which the library is running. This is so that the default configuration produces valid MIME text/plain output, which mandates the use of CR+LF for line breaks.

Specifying a null argument causes the output to use the same newline string as the source document, which is determined via the Source.getNewLine() method. If the source document does not contain any newlines, a "best guess" is made by either taking the newline string of a previously parsed document, or using the value from the static Config.NewLine property.

Parameters:
newLine - the string to be used to represent a newline in the output, may be null.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getNewLine()
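
The fallback order described for a null argument can be sketched as below. This is a plain-Java model under stated assumptions, not the library's code: the class NewLineSketch and its methods are hypothetical, the detection of the first line terminator only approximates what Source.getNewLine() might do, and the fallback parameter stands in for the "best guess" value (a previously parsed document's newline or Config.NewLine).

```java
// Hypothetical model of the newline fallback order; not part of Jericho.
final class NewLineSketch {
    // Returns the first line terminator found in the text, or null if there is none.
    static String detect(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') return "\n";
            if (c == '\r') {
                boolean lfFollows = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return lfFollows ? "\r\n" : "\r";
            }
        }
        return null;
    }

    // configured: the value passed to setNewLine, may be null.
    // fallback: assumed stand-in for the "best guess" newline string.
    static String resolve(String configured, CharSequence sourceText, String fallback) {
        if (configured != null) return configured;
        String detected = detect(sourceText);
        return detected != null ? detected : fallback;
    }
}
```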

getNewLine

public java.lang.String getNewLine()
Returns the string to be used to represent a newline in the output.

See the setNewLine(String) method for a full description of this property.

Returns:
the string to be used to represent a newline in the output.

setIncludeHyperlinkURLs

public Renderer setIncludeHyperlinkURLs(boolean includeHyperlinkURLs)
Sets whether hyperlink URLs are included in the output.

The default value is true.

When this property is true, the URL of each hyperlink is included in the output as determined by the implementation of the renderHyperlinkURL(StartTag) method.

Example:

Assuming the default implementation of renderHyperlinkURL(StartTag), when this property is true, the following HTML:

<a href="http://jericho.htmlparser.net/">Jericho HTML Parser</a>
produces the following output:
Jericho HTML Parser <http://jericho.htmlparser.net/>

Parameters:
includeHyperlinkURLs - specifies whether hyperlink URLs are included in the output.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getIncludeHyperlinkURLs()

getIncludeHyperlinkURLs

public boolean getIncludeHyperlinkURLs()
Indicates whether hyperlink URLs are included in the output.

See the setIncludeHyperlinkURLs(boolean) method for a full description of this property.

Returns:
true if hyperlink URLs are included in the output, otherwise false.

renderHyperlinkURL

public java.lang.String renderHyperlinkURL(StartTag startTag)
Renders the hyperlink URL from the specified StartTag.

A return value of null indicates that the hyperlink URL should not be rendered at all.

The default implementation of this method returns null if the href attribute of the specified start tag is '#', starts with "javascript:", or is missing. In all other cases it returns the value of the href attribute enclosed in angle brackets.

See the documentation of the setIncludeHyperlinkURLs(boolean) method for an example of how a hyperlink is rendered by the default implementation.

This method can be overridden in a subclass to customise the rendering of hyperlink URLs.

Rendering of hyperlink URLs can be disabled completely without overriding this method by setting the IncludeHyperlinkURLs property to false.

Example:
To render hyperlink URLs without the enclosing angle brackets:

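Renderer renderer=new Renderer(segment) {
    @Override
    public String renderHyperlinkURL(StartTag startTag) {
        String href=startTag.getAttributeValue("href");
        // Same exclusions as the default implementation: missing href, "#", or a javascript: URL.
        if (href==null || href.equals("#") || href.startsWith("javascript:")) return null;
        // Return the bare URL instead of enclosing it in angle brackets.
        return href;
    }
};
String renderedSegment=renderer.toString();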

Parameters:
startTag - the start tag of the hyperlink element, must not be null.
Returns:
The rendered hyperlink URL from the specified StartTag, or null if the hyperlink URL should not be rendered.
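
The documented default rule can be restated as a small function on the href value. The sketch below is a plain-Java model that runs without the library; the class HyperlinkRuleSketch and both method names are hypothetical. The renderLink method additionally assumes that a link whose URL is not rendered produces just its text, and that the text and URL are separated by a single space as in the example for setIncludeHyperlinkURLs(boolean).

```java
// Hypothetical model of the documented default behaviour; not part of Jericho.
final class HyperlinkRuleSketch {
    // Null for a missing href, "#", or a javascript: URL;
    // otherwise the href enclosed in angle brackets.
    static String render(String href) {
        if (href == null || href.equals("#") || href.startsWith("javascript:")) return null;
        return "<" + href + ">";
    }

    // Assumption: link text, a space, then the rendered URL when one is available.
    static String renderLink(String text, String href, boolean includeHyperlinkURLs) {
        String url = includeHyperlinkURLs ? render(href) : null;
        return url == null ? text : text + " " + url;
    }
}
```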

setDecorateFontStyles

public Renderer setDecorateFontStyles(boolean decorateFontStyles)
Sets whether decoration characters are to be included around the content of some font style elements and phrase elements.

The default value is false.

Below is a table summarising the decorated elements.

Elements       Character  Example Output
B and STRONG   *          *bold text*
I and EM       /          /italic text/
U              _          _underlined text_
CODE           |          |code|

Parameters:
decorateFontStyles - specifies whether decoration characters are to be included around the content of some font style elements.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getDecorateFontStyles()
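
The table of decorated elements above amounts to a lookup from element name to decoration character. The following plain-Java sketch models that lookup; it is not the library's implementation, and the class DecorationSketch and its decorate method are hypothetical. Case-insensitive matching of element names is an assumption of the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of the decoration table; not part of Jericho.
final class DecorationSketch {
    private static final Map<String, Character> DECORATION = new HashMap<>();
    static {
        DECORATION.put("b", '*');
        DECORATION.put("strong", '*');
        DECORATION.put("i", '/');
        DECORATION.put("em", '/');
        DECORATION.put("u", '_');
        DECORATION.put("code", '|');
    }

    // Wraps the content in the element's decoration character when decoration
    // is enabled and the element is in the table; otherwise returns it unchanged.
    static String decorate(String elementName, String content, boolean decorateFontStyles) {
        Character c = decorateFontStyles
                ? DECORATION.get(elementName.toLowerCase(Locale.ROOT))
                : null;
        return c == null ? content : c + content + c;
    }
}
```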

getDecorateFontStyles

public boolean getDecorateFontStyles()
Indicates whether decoration characters are to be included around the content of some font style elements and phrase elements.

See the setDecorateFontStyles(boolean) method for a full description of this property.

Returns:
true if decoration characters are to be included around the content of some font style elements, otherwise false.

setConvertNonBreakingSpaces

public Renderer setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.

The default value is that of the static Config.ConvertNonBreakingSpaces property at the time the Renderer is instantiated.

Parameters:
convertNonBreakingSpaces - specifies whether non-breaking space (&nbsp;) character entity references are converted to spaces.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getConvertNonBreakingSpaces()

getConvertNonBreakingSpaces

public boolean getConvertNonBreakingSpaces()
Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.

See the setConvertNonBreakingSpaces(boolean) method for a full description of this property.

Returns:
true if non-breaking space (&nbsp;) character entity references are converted to spaces, otherwise false.

setBlockIndentSize

public Renderer setBlockIndentSize(int blockIndentSize)
Sets the size of the indent to be used for anything other than LI elements.

At present this applies to BLOCKQUOTE and DD elements.

The default value is 4.

Parameters:
blockIndentSize - the size of the indent.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getBlockIndentSize()

getBlockIndentSize

public int getBlockIndentSize()
Returns the size of the indent to be used for anything other than LI elements.

See the setBlockIndentSize(int) method for a full description of this property.

Returns:
the size of the indent to be used for anything other than LI elements.

setListIndentSize

public Renderer setListIndentSize(int listIndentSize)
Sets the size of the indent to be used for LI elements.

The default value is 6.

This applies to LI elements inside both UL and OL elements.

The bullet or number of the list item is included as part of the indent.

Parameters:
listIndentSize - the size of the indent.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getListIndentSize()

getListIndentSize

public int getListIndentSize()
Returns the size of the indent to be used for LI elements.

See the setListIndentSize(int) method for a full description of this property.

Returns:
the size of the indent to be used for LI elements.

setListBullets

public Renderer setListBullets(char[] listBullets)
Sets the bullet characters to use for list items inside UL elements.

The values in the default array are *, o, + and #.

If the nesting of rendered lists goes deeper than the length of this array, the bullet characters start repeating from the first in the array.

WARNING: If any of the characters in the default array are modified, this will affect all other instances of this class using the default array.

Parameters:
listBullets - an array of characters to be used as bullets, must have at least one entry.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getListBullets()
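
The repetition of bullets for deeply nested lists, and a way to avoid the shared-array pitfall mentioned in the warning, can be sketched as follows. This is a plain-Java model, not the library's code: the class BulletSketch, its methods and the zero-based depth parameter are hypothetical, and the array literal merely repeats the documented default values.

```java
// Hypothetical model of bullet selection; not part of Jericho.
final class BulletSketch {
    // The documented default bullets, repeated here for illustration.
    static final char[] DEFAULT_BULLETS = {'*', 'o', '+', '#'};

    // depth is zero-based: 0 for a top-level UL, 1 for a UL nested inside it, and so on.
    // Bullets repeat from the start once the depth exceeds the array length.
    static char bulletFor(char[] listBullets, int depth) {
        return listBullets[depth % listBullets.length];
    }

    // Returns a modified copy so that a shared array is left untouched.
    static char[] withFirstBullet(char[] listBullets, char first) {
        char[] copy = listBullets.clone();
        copy[0] = first;
        return copy;
    }
}
```

Following the same idea, an array obtained from getListBullets() could be cloned before it is modified and the copy then passed to setListBullets(char[]), so that other Renderer instances using the default array are unaffected.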

getListBullets

public char[] getListBullets()
Returns the bullet characters to use for list items inside UL elements.

See the setListBullets(char[]) method for a full description of this property.

Returns:
the bullet characters to use for list items inside UL elements.

setTableCellSeparator

public Renderer setTableCellSeparator(java.lang.String tableCellSeparator)
Sets the string that is to separate table cells.

The default value is " \t" (a space followed by a tab).

Parameters:
tableCellSeparator - the string that is to separate table cells.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getTableCellSeparator()
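
As a rough picture of what the separator does, a table row can be modelled as its cell texts joined by the separator string. The plain-Java sketch below is an assumption, not the library's implementation: the class TableRowSketch and its renderRow method are hypothetical, and the source does not state whether a separator also follows the last cell, so the example places it between cells only.

```java
import java.util.List;

// Hypothetical model of a rendered table row; not part of Jericho.
final class TableRowSketch {
    // The documented default: a space followed by a tab.
    static final String DEFAULT_SEPARATOR = " \t";

    // Assumption: the separator appears between cells only, not after the last cell.
    static String renderRow(List<String> cells, String tableCellSeparator) {
        return String.join(tableCellSeparator, cells);
    }
}
```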

getTableCellSeparator

public java.lang.String getTableCellSeparator()
Returns the string that is to separate table cells.

See the setTableCellSeparator(String) method for a full description of this property.

Returns:
the string that is to separate table cells.