Extracts the textual content from HTML markup.
The output is ideal for feeding into a text search engine such as Apache Lucene, especially when the IncludeAttributes property has been set to true.
Use the writeTo(Writer) or toString() method to obtain the output.
The process removes all of the tags and decodes the result, collapsing all white space.
A space character is included in the output where a normal tag is present in the source, unless the tag belongs to an inline-level element.
An exception to this is the BR element, which is also converted to a space despite being an inline-level element.
Text inside SCRIPT and STYLE elements contained within this segment is ignored.
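The rules above can be approximated in plain Java. The sketch below is not the library's implementation: it uses regular expressions rather than a parser, the class name SimpleTextSketch and its short list of inline-level elements are made up for illustration, and only a handful of character references are decoded.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only; it does not use or reproduce the TextExtractor class.
class SimpleTextSketch {
    // Hypothetical, deliberately short list of inline-level element names.
    // BR is left out on purpose so that it produces a space, as described above.
    private static final Set<String> INLINE =
            new HashSet<>(Arrays.asList("a", "b", "i", "em", "strong", "span", "code"));

    // SCRIPT and STYLE elements are dropped together with their content.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Matches a simple start or end tag and captures the element name.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    static String extract(String html) {
        String input = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        StringBuilder sb = new StringBuilder(input.length());
        Matcher m = TAG.matcher(input);
        int last = 0;
        while (m.find()) {
            sb.append(input, last, m.start());
            // Tags of non-inline elements become a space; inline tags simply vanish.
            if (!INLINE.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                sb.append(' ');
            }
            last = m.end();
        }
        sb.append(input, last, input.length());
        // Decode a few character references; &amp; goes last to avoid double decoding.
        String decoded = sb.toString()
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
        // Collapse runs of white space and drop leading and trailing white space.
        return decoded.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(extract("<div><b>O</b>ne</div><p>Two<br>Th<script>//x</script>ree</p>"));
    }
}
```

Because the B tags are treated as inline, "O" and "ne" join into one word, while the DIV, P and BR tags each contribute a space that is later collapsed.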
Setting the ExcludeNonHTMLElements property to true excludes any content within a non-HTML element.
See the excludeElement(StartTag) method for details on how to implement a more complex mechanism to determine whether the content of each Element is to be excluded from the output.
All tags that are not normal tags, such as server tags and comments, are removed without adding whitespace to the output.
Note that segments on which the Segment.ignoreWhenParsing() method has been called are treated as text rather than markup, resulting in their inclusion in the output.
To remove specific segments before extracting the text, create an OutputDocument and call its remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment to be removed.
Then create a new source document using new Source(outputDocument.toString()) and perform the text extraction on this new source object.
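Extracting the text from an entire Source object performs a full sequential parse automatically.
To perform a simple rendering of HTML markup into text, which is more readable than the output of this class, use the Renderer class instead.
Example: the source segment
<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>
produces the text
One Two Three
The word Two comes from the title attribute, so this output assumes that the IncludeAttributes property has been set to true.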
excludeElement
public boolean excludeElement(StartTag startTag)
Indicates whether the text inside the Element of the specified start tag should be excluded from the output.
During the text extraction process, every start tag encountered in the segment is checked using this method to determine whether the text inside its associated element should be excluded from the output.
The default implementation always returns false, so that every element is included, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each start tag.
All elements nested inside an excluded element are also implicitly excluded, as are all SCRIPT and STYLE elements.
Such elements are skipped over without calling this method, so there is no way to include them by overriding the method.
Example: to extract the text from segment, excluding any text inside elements with the attribute class="NotIndexed":
TextExtractor textExtractor=new TextExtractor(segment) {
    // Exclude the text inside any element whose class attribute is "NotIndexed".
    @Override
    public boolean excludeElement(StartTag startTag) {
        return "NotIndexed".equalsIgnoreCase(startTag.getAttributeValue("class"));
    }
};
String extractedText=textExtractor.toString();
startTag - the start tag of the element to check for inclusion.
- true if the text inside the Element of the specified start tag should be excluded from the output, otherwise false.
getConvertNonBreakingSpaces
public boolean getConvertNonBreakingSpaces()
Indicates whether non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces.
See the setConvertNonBreakingSpaces(boolean) method for a full description of this property.
- true if non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces, otherwise false.
getEstimatedMaximumOutputLength
public long getEstimatedMaximumOutputLength()
Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.
The returned value should be used as a guide for efficiency purposes only, for example to set an initial StringBuffer capacity.
There is no guarantee that the length of the output is indeed less than this value, as classes implementing this method often use assumptions based on typical usage to calculate the estimate.
- getEstimatedMaximumOutputLength in interface CharStreamSource
- the estimated maximum number of characters in the output, or -1 if no estimate is available.
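As a rough illustration of using the estimate purely as a sizing hint, the sketch below picks an initial buffer size from it. The OutputSource interface is a hypothetical stand-in exposing only the two methods documented in this section, and the fallback size and upper cap are arbitrary values chosen for the example. A StringWriter is used in place of a bare StringBuffer so that the buffer can be handed to writeTo(Writer).

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical stand-in with only the two methods documented in this section.
interface OutputSource {
    void writeTo(Writer writer) throws IOException;
    long getEstimatedMaximumOutputLength();
}

class EstimateSketch {
    private static final int DEFAULT_CAPACITY = 256; // arbitrary fallback when no estimate exists
    private static final int MAX_INITIAL = 1 << 20;  // arbitrary cap on the up-front allocation

    // Converts the long estimate into a safe int capacity.
    static int initialCapacity(long estimate) {
        if (estimate < 0) {
            return DEFAULT_CAPACITY; // -1 means that no estimate is available
        }
        return (int) Math.min(estimate, MAX_INITIAL);
    }

    static String render(OutputSource source) throws IOException {
        StringWriter writer =
                new StringWriter(initialCapacity(source.getEstimatedMaximumOutputLength()));
        source.writeTo(writer);
        // The buffer grows on demand, so an estimate that is too low is harmless.
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        OutputSource demo = new OutputSource() {
            public void writeTo(Writer writer) throws IOException {
                writer.write("One Two Three");
            }
            public long getEstimatedMaximumOutputLength() {
                return -1;
            }
        };
        System.out.println(render(demo));
    }
}
```

Capping the value guards against allocating a very large buffer on the strength of an estimate that, as noted above, carries no guarantee.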
getExcludeNonHTMLElements
public boolean getExcludeNonHTMLElements()
- true if the content of non-HTML elements is excluded from the output, otherwise false.
getIncludeAttributes
public boolean getIncludeAttributes()
- true if the attribute values are to be included in the output, otherwise false.
setConvertNonBreakingSpaces
public TextExtractor setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
Sets whether non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces.
The default value is true.
convertNonBreakingSpaces - specifies whether non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces.
- this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
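To see why the distinction can matter downstream, the sketch below decodes &nbsp; either to an ordinary space or to the non-breaking space character U+00A0 and then collapses white space with Java's default \s class. The NbspSketch class and its method are hypothetical, and the sketch makes no claim about how TextExtractor handles this internally.

```java
// Illustrative stand-in only; it does not use the TextExtractor class.
class NbspSketch {
    static final char NBSP = '\u00A0';

    // Decodes &nbsp; references and then collapses runs of white space.
    static String decodeAndCollapse(String text, boolean convertNonBreakingSpaces) {
        String replacement = convertNonBreakingSpaces ? " " : String.valueOf(NBSP);
        String decoded = text.replace("&nbsp;", replacement);
        // Java's default \s class does not match U+00A0, so non-breaking spaces
        // that were not converted survive the collapsing step unchanged.
        return decoded.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String converted = decodeAndCollapse("a&nbsp;&nbsp;b", true);
        String preserved = decodeAndCollapse("a&nbsp;&nbsp;b", false);
        System.out.println(converted.length() + " " + preserved.length());
    }
}
```

With conversion enabled the two references collapse into a single space, whereas the unconverted characters remain as two separate non-breaking spaces, which a tokenizer that splits on ordinary white space might treat as part of one word.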
setExcludeNonHTMLElements
public TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements)
Sets whether the content of non-HTML elements is excluded from the output.
The default value is false, meaning that content from all elements meeting the other criteria is included.
excludeNonHTMLElements - specifies whether the content of non-HTML elements is excluded from the output.
- this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
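The chaining mentioned in the return description relies on each setter returning the object it was called on. The sketch below shows that pattern with a hypothetical ExtractorOptions class that mirrors the three properties and their documented defaults; it is a stand-in for illustration, not the TextExtractor class itself.

```java
// Hypothetical stand-in that mirrors the three documented properties.
class ExtractorOptions {
    private boolean includeAttributes = false;       // documented default
    private boolean excludeNonHTMLElements = false;  // documented default
    private boolean convertNonBreakingSpaces = true; // documented default

    // Each setter returns this, which is what makes chaining possible.
    ExtractorOptions setIncludeAttributes(boolean value) {
        includeAttributes = value;
        return this;
    }

    ExtractorOptions setExcludeNonHTMLElements(boolean value) {
        excludeNonHTMLElements = value;
        return this;
    }

    ExtractorOptions setConvertNonBreakingSpaces(boolean value) {
        convertNonBreakingSpaces = value;
        return this;
    }

    boolean getIncludeAttributes() { return includeAttributes; }
    boolean getExcludeNonHTMLElements() { return excludeNonHTMLElements; }
    boolean getConvertNonBreakingSpaces() { return convertNonBreakingSpaces; }

    public static void main(String[] args) {
        // With the documented methods the equivalent chain would read:
        // new TextExtractor(segment).setIncludeAttributes(true).setExcludeNonHTMLElements(true).toString()
        ExtractorOptions options = new ExtractorOptions()
                .setIncludeAttributes(true)
                .setExcludeNonHTMLElements(true);
        System.out.println(options.getIncludeAttributes() + " " + options.getExcludeNonHTMLElements());
    }
}
```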
setIncludeAttributes
public TextExtractor setIncludeAttributes(boolean includeAttributes)
Sets whether the values of title, alt, label, summary, and content attributes of normal tags are to be included in the output.
The value of a content attribute is only included if a name attribute is also present, as the content attribute of a META tag only contains human readable text if the name attribute is used as opposed to an http-equiv attribute.
The default value is false.
includeAttributes - specifies whether the attribute values are included in the output.
- this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
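The selection rule described above can be written down as a small function. The sketch below is a hypothetical stand-in operating on a plain map of attribute names to values; it assumes lower-case attribute names, and the order in which values are returned is an arbitrary choice of the example, not something this documentation specifies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in only; it does not use the TextExtractor class.
class AttributeRuleSketch {
    // Attributes whose values are always candidates for inclusion.
    private static final String[] ALWAYS = {"title", "alt", "label", "summary"};

    // Returns the attribute values that the rule above would include.
    static List<String> includedValues(Map<String, String> attributes) {
        List<String> values = new ArrayList<>();
        for (String name : ALWAYS) {
            String value = attributes.get(name);
            if (value != null) {
                values.add(value);
            }
        }
        // content counts only when a name attribute is also present,
        // which rules out META tags that use http-equiv instead.
        String content = attributes.get("content");
        if (content != null && attributes.containsKey("name")) {
            values.add(content);
        }
        return values;
    }

    // Small helper for building an attribute map from name/value pairs.
    static Map<String, String> attrs(String... pairs) {
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            map.put(pairs[i], pairs[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(includedValues(attrs("name", "description", "content", "A page")));
        System.out.println(includedValues(attrs("http-equiv", "refresh", "content", "5")));
    }
}
```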
writeTo
public void writeTo(Writer writer) throws IOException
Writes the output to the specified Writer.
- writeTo in interface CharStreamSource
writer - the destination java.io.Writer for the output.
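A typical destination is a file. The sketch below streams output through a buffered writer that the caller opens and closes with try-with-resources. WritableText is a hypothetical stand-in for any object offering writeTo(Writer), and the sketch assumes the caller remains responsible for closing the Writer, since this documentation does not say whether writeTo flushes or closes it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for any object that offers writeTo(Writer).
interface WritableText {
    void writeTo(Writer writer) throws IOException;
}

class WriteToSketch {
    // Streams the output to a file; the caller owns and closes the Writer.
    static void save(WritableText text, Path target) throws IOException {
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            text.writeTo(writer);
        }
    }

    // Writes a sample to a temporary file and reads it back, to try the sketch out.
    static String roundTrip(String sample) {
        try {
            Path target = Files.createTempFile("extracted", ".txt");
            try {
                save(writer -> writer.write(sample), target);
                return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            } finally {
                Files.delete(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("One Two Three"));
    }
}
```

Writing through a Writer avoids holding the whole output in memory as a single String, which may matter for large documents.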