au.id.jericho.lib.html

Class Segment

Implemented Interfaces:
CharSequence, Comparable
Known Direct Subclasses:
Attribute, CharacterReference, Element, FormControl, Source, Tag

public class Segment
extends java.lang.Object
implements Comparable, CharSequence

Represents a segment of a Source document.

The span of a segment is defined by the combination of its begin and end character positions.

Constructor Summary

Segment(Source source, int begin, int end)
Constructs a new Segment within the specified source document with the specified begin and end character positions.

Method Summary

char
charAt(int index)
Returns the character at the specified index.
int
compareTo(Object o)
Compares this Segment object to another object.
boolean
encloses(Segment segment)
Indicates whether this Segment encloses the specified Segment.
boolean
encloses(int pos)
Indicates whether this segment encloses the specified character position in the source document.
boolean
equals(Object object)
Compares the specified object with this Segment for equality.
String
extractText()
Extracts the text content of this segment.
String
extractText(boolean includeAttributes)
Extracts the text content of this segment.
List
findAllCharacterReferences()
Returns a list of all CharacterReference objects that are enclosed by this segment.
List
findAllComments()
Deprecated. Use findAllTags(StartTagType.COMMENT) instead.
List
findAllElements()
Returns a list of all Element objects that are enclosed by this segment.
List
findAllElements(String name)
Returns a list of all Element objects with the specified name that are enclosed by this segment.
List
findAllElements(StartTagType startTagType)
Returns a list of all Element objects with start tags of the specified type that are enclosed by this segment.
List
findAllStartTags()
Returns a list of all StartTag objects that are enclosed by this segment.
List
findAllStartTags(String name)
Returns a list of all StartTag objects with the specified name that are enclosed by this segment.
List
findAllStartTags(String attributeName, String value, boolean valueCaseSensitive)
Returns a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment.
List
findAllTags()
Returns a list of all Tag objects that are enclosed by this segment.
List
findAllTags(TagType tagType)
Returns a list of all Tag objects of the specified type that are enclosed by this segment.
List
findFormControls()
Returns a list of the FormControl objects that are enclosed by this segment.
FormFields
findFormFields()
Returns the FormFields object representing all form fields that are enclosed by this segment.
List
findWords()
Deprecated. No replacement.
int
getBegin()
Returns the character position in the Source document at which this segment begins.
List
getChildElements()
Returns a list of the immediate children of this segment in the document element hierarchy.
String
getDebugInfo()
Returns a string representation of this object useful for debugging purposes.
int
getEnd()
Returns the character position in the Source document immediately after the end of this segment.
String
getSourceText()
Deprecated. Use toString() instead.
String
getSourceTextNoWhitespace()
Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.
int
hashCode()
Returns a hash code value for the segment.
void
ignoreWhenParsing()
Causes this segment to be ignored when parsing.
boolean
isComment()
Deprecated. Use this instanceof Tag && ((Tag)this).getTagType()==StartTagType.COMMENT instead.
boolean
isWhiteSpace()
Indicates whether this segment consists entirely of white space.
static boolean
isWhiteSpace(char ch)
Indicates whether the specified character is white space.
int
length()
Returns the length of the segment.
Attributes
parseAttributes()
Parses any Attributes within this segment.
CharSequence
subSequence(int beginIndex, int endIndex)
Returns a new character sequence that is a subsequence of this sequence.
String
toString()
Returns the source text of this segment as a String.

Constructor Details

Segment

public Segment(Source source,
               int begin,
               int end)
Constructs a new Segment within the specified source document with the specified begin and end character positions.
Parameters:
source - the Source document, must not be null.
begin - the character position in the source where this segment begins.
end - the character position in the source where this segment ends.

Method Details

charAt

public final char charAt(int index)
Returns the character at the specified index.

This is logically equivalent to toString().charAt(index) for valid argument values 0 <= index < length().

However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for an invalid argument value.

Parameters:
index - the index of the character.
Returns:
the character at the specified index.
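
The relationship between a segment index and a position in the source text can be sketched as follows. This is a minimal stand-in, not the library's code: the class CharAtSketch and its fields are hypothetical, and it assumes that the character is read from the source text at begin + index with no check against the segment bounds.

```java
// Hypothetical stand-in for illustration only; not part of the library.
final class CharAtSketch {
    private final String sourceText; // text of the whole source document
    private final int begin;         // position at which the segment begins
    private final int end;           // position immediately after the segment

    CharAtSketch(String sourceText, int begin, int end) {
        this.sourceText = sourceText;
        this.begin = begin;
        this.end = end;
    }

    // Assumption: the lookup goes straight to the source text, offset by begin.
    char charAt(int index) {
        return sourceText.charAt(begin + index);
    }

    @Override
    public String toString() {
        return sourceText.substring(begin, end);
    }

    public static void main(String[] args) {
        CharAtSketch seg = new CharAtSketch("<p>Hello</p>", 3, 8);
        System.out.println(seg);           // prints Hello
        System.out.println(seg.charAt(0)); // prints H, same as toString().charAt(0)
        System.out.println(seg.charAt(5)); // prints <, a character outside the segment
    }
}
```

Under that assumption an index equal to length() silently reads the character just past the segment, which is why callers should validate indexes themselves rather than rely on an exception.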

compareTo

public int compareTo(Object o)
Parameters:
o - the segment to be compared
Returns:
a negative integer, zero, or a positive integer as this segment is before, equal to, or after the specified segment.
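
One plausible ordering consistent with "before, equal to, or after" is sketched below. The class OrderSketch is hypothetical, and the rule it uses (compare begin positions first, then end positions) is an assumption made for illustration rather than something this documentation states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for illustration only; not part of the library.
final class OrderSketch implements Comparable<OrderSketch> {
    final int begin;
    final int end;

    OrderSketch(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // Assumed rule: earlier begin sorts first; ties are broken by earlier end.
    @Override
    public int compareTo(OrderSketch other) {
        if (begin != other.begin) {
            return Integer.compare(begin, other.begin);
        }
        return Integer.compare(end, other.end);
    }

    @Override
    public String toString() {
        return "[" + begin + "," + end + ")";
    }

    // Returns a sorted copy of the given spans.
    static List<OrderSketch> sorted(OrderSketch... spans) {
        List<OrderSketch> list = new ArrayList<>(Arrays.asList(spans));
        Collections.sort(list);
        return list;
    }

    public static void main(String[] args) {
        System.out.println(sorted(
                new OrderSketch(5, 9),
                new OrderSketch(0, 12),
                new OrderSketch(0, 4))); // prints [[0,4), [0,12), [5,9)]
    }
}
```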

encloses

public final boolean encloses(Segment segment)
Indicates whether this Segment encloses the specified Segment.

This is the case if getBegin()<=segment.getBegin() && getEnd()>=segment.getEnd().

Parameters:
segment - the segment to be tested for being enclosed by this segment.
Returns:
true if this Segment encloses the specified Segment, otherwise false.
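
The condition above can be tried out on plain begin and end positions. The class EnclosesSketch below is a hypothetical helper written for illustration; it only restates the documented comparison and does not use the library.

```java
// Hypothetical helper for illustration only; not part of the library.
final class EnclosesSketch {
    // Mirrors getBegin()<=segment.getBegin() && getEnd()>=segment.getEnd().
    static boolean encloses(int begin, int end, int otherBegin, int otherEnd) {
        return begin <= otherBegin && end >= otherEnd;
    }

    public static void main(String[] args) {
        System.out.println(encloses(0, 10, 2, 5));  // true: strictly inside
        System.out.println(encloses(0, 10, 0, 10)); // true: identical spans enclose each other
        System.out.println(encloses(0, 5, 3, 8));   // false: overlapping is not enclosing
    }
}
```

Because both comparisons are inclusive, a segment always encloses itself.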

encloses

public final boolean encloses(int pos)
Parameters:
pos - the position in the Source document.
Returns:
true if this segment encloses the specified character position in the source document, otherwise false.

equals

public final boolean equals(Object object)
Parameters:
object - the object to be compared for equality with this Segment.
Returns:
true if the specified object is equal to this Segment, otherwise false.

extractText

public String extractText()
Extracts the text content of this segment.

This method removes all of the tags from the segment and decodes the result, collapsing all white space.

See the documentation of the extractText(boolean includeAttributes) method for more details.

This is equivalent to calling extractText(false).

Returns:
the text content of this segment.

extractText

public String extractText(boolean includeAttributes)
Extracts the text content of this segment.

This method removes all of the tags from the segment and decodes the result, collapsing all white space. Tags are also converted to whitespace unless they belong to an inline-level element. An exception to this is the BR element, which is also converted to whitespace despite being an inline-level element.

Text inside SCRIPT and STYLE elements contained within this segment is ignored.

Specifying a value of true as an argument to the includeAttributes parameter causes the values of title, alt, label, and summary attributes of normal tags to be included in the extracted text.

Example source segment: <div><b>O</b>ne</div><div><b>T</b><script>//a script </script>wo</div>
Extracted text: One Two

Note that in version 2.1, no tags were converted to whitespace and text inside SCRIPT and STYLE elements was included. The example above produced the text "OneT//a script wo".

Parameters:
includeAttributes - indicates whether the values of title, alt, label, and summary attributes are included.
Returns:
the text content of this segment.

findAllCharacterReferences

public List findAllCharacterReferences()
Returns a list of all CharacterReference objects that are enclosed by this segment.
Returns:
a list of all CharacterReference objects that are enclosed by this segment.

findAllComments

public List findAllComments()

Deprecated. Use findAllTags(StartTagType.COMMENT) instead.

Returns a list of all StartTag objects representing HTML comments that are enclosed by this segment.

This method has been deprecated as of version 2.0 in favour of the more generic findAllTags(TagType) method.

Returns:
a list of all StartTag objects representing HTML comments that are enclosed by this segment.

findAllElements

public List findAllElements()
Returns a list of all Element objects that are enclosed by this segment.

The elements returned correspond exactly with the start tags returned in the findAllStartTags() method.

Returns:
a list of all Element objects that are enclosed by this segment.

findAllElements

public List findAllElements(String name)
Returns a list of all Element objects with the specified name that are enclosed by this segment.

The elements returned correspond exactly with the start tags returned in the findAllStartTags(String name) method.

Specifying a null argument to the name parameter is equivalent to findAllElements().

This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.

Parameters:
name - the name of the elements to find.
Returns:
a list of all Element objects with the specified name that are enclosed by this segment.

findAllElements

public List findAllElements(StartTagType startTagType)
Returns a list of all Element objects with start tags of the specified type that are enclosed by this segment.

The elements returned correspond exactly with the start tags returned in the findAllTags(TagType) method.

Parameters:
startTagType - the type of start tags to find, must not be null.
Returns:
a list of all Element objects with start tags of the specified type that are enclosed by this segment.

findAllStartTags

public List findAllStartTags()
Returns a list of all StartTag objects that are enclosed by this segment.

See the Tag class documentation for more details about the behaviour of this method.

Returns:
a list of all StartTag objects that are enclosed by this segment.

findAllStartTags

public List findAllStartTags(String name)
Returns a list of all StartTag objects with the specified name that are enclosed by this segment.

See the Tag class documentation for more details about the behaviour of this method.

Specifying a null argument to the name parameter is equivalent to findAllStartTags().

This method also returns unregistered tags if the specified name is not a valid XML tag name.

Parameters:
name - the name of the start tags to find.
Returns:
a list of all StartTag objects with the specified name that are enclosed by this segment.

findAllStartTags

public List findAllStartTags(String attributeName,
                             String value,
                             boolean valueCaseSensitive)
Returns a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment.

findAllTags

public List findAllTags()
Returns a list of all Tag objects that are enclosed by this segment.

See the Tag class documentation for more details about the behaviour of this method.

Returns:
a list of all Tag objects that are enclosed by this segment.

findAllTags

public List findAllTags(TagType tagType)
Returns a list of all Tag objects of the specified type that are enclosed by this segment.

See the Tag class documentation for more details about the behaviour of this method.

Specifying a null argument to the tagType parameter is equivalent to findAllTags().

Parameters:
tagType - the type of tags to find.
Returns:
a list of all Tag objects of the specified type that are enclosed by this segment.

findFormControls

public List findFormControls()
Returns a list of the FormControl objects that are enclosed by this segment.
Returns:
a list of the FormControl objects that are enclosed by this segment.

findFormFields

public FormFields findFormFields()
Returns the FormFields object representing all form fields that are enclosed by this segment.

This is equivalent to new FormFields(findFormControls()).

Returns:
the FormFields object representing all form fields that are enclosed by this segment.

findWords

public final List findWords()

Deprecated. No replacement.

Returns a list of Segment objects representing every word in this segment separated by white space. Note that any markup contained in this segment is regarded as normal text for the purposes of this method.

This method has been deprecated as of version 2.0 as it has no discernible use.

Returns:
a list of Segment objects representing every word in this segment separated by white space.

getBegin

public final int getBegin()
Returns the character position in the Source document at which this segment begins.
Returns:
the character position in the Source document at which this segment begins.

getChildElements

public List getChildElements()
Returns:
a list of the immediate children of this segment in the document element hierarchy, guaranteed not null.

getDebugInfo

public String getDebugInfo()
Returns a string representation of this object useful for debugging purposes.
Returns:
a string representation of this object useful for debugging purposes.

getEnd

public final int getEnd()
Returns the character position in the Source document immediately after the end of this segment.

The character at the position specified by this property is not included in the segment.

Returns:
the character position in the Source document immediately after the end of this segment.

getSourceText

public String getSourceText()

Deprecated. Use toString() instead.

Returns the source text of this segment.

This method has been deprecated as of version 2.0 as it now duplicates the functionality of the toString() method.

Returns:
the source text of this segment.

getSourceTextNoWhitespace

public final String getSourceTextNoWhitespace()

Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.

Returns the source text of this segment without white space.

All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space.

This method has been deprecated as of version 2.0 as it is no longer used internally and has no practical use as a public method. It is similar to the new CharacterReference.decodeCollapseWhiteSpace(CharSequence) method, but does not decode the text after collapsing the white space.

Returns:
the source text of this segment without white space.
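
The collapsing rule described above (drop leading and trailing white space, replace each internal run with a single space, decode nothing) can be sketched without the library. The class CollapseSketch is hypothetical, and the set of white space characters is assumed to be the one listed for isWhiteSpace(char).

```java
// Hypothetical helper for illustration only; not part of the library.
final class CollapseSketch {
    // Assumed white space set: space, tab, form feed, line feed,
    // carriage return and zero-width space.
    private static final String WHITE_SPACE = " \t\f\n\r\u200B";

    static String collapse(CharSequence text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (WHITE_SPACE.indexOf(ch) >= 0) {
                // Remember the gap only once some text has been emitted,
                // which discards leading white space.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(ch);
            }
        }
        // A gap still pending here is trailing white space and is dropped.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("  Jericho \t\n HTML  Parser ")); // prints Jericho HTML Parser
        System.out.println(collapse("a &amp;  b"));                   // prints a &amp; b (not decoded)
    }
}
```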

hashCode

public int hashCode()
Returns a hash code value for the segment.

The current implementation returns the sum of the begin and end positions, although this is not guaranteed in future versions.

Returns:
a hash code value for the segment.
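
A consequence of the sum described above is that different spans can share a hash code. The helper HashSketch below is hypothetical and simply reproduces the stated sum; as noted, the library does not guarantee this formula in future versions.

```java
// Hypothetical helper for illustration only; not part of the library.
final class HashSketch {
    // Reproduces the documented current behaviour: begin plus end.
    static int hash(int begin, int end) {
        return begin + end;
    }

    public static void main(String[] args) {
        System.out.println(hash(2, 8)); // prints 10
        System.out.println(hash(4, 6)); // prints 10: a different span, same hash code
    }
}
```

Collisions like this are allowed by the hashCode contract, but they mean a hash code must not be used on its own to decide whether two segments are equal.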

ignoreWhenParsing

public void ignoreWhenParsing()
Causes this segment to be ignored when parsing.

This method is usually used to exclude server tags or other non-HTML segments from the source text so that they do not interfere with the parsing of the surrounding HTML.

This is necessary because many server tags are used as attribute values and in other places within HTML tags, and very often contain characters that prevent the parser from recognising the surrounding tag.

Any tags appearing in this segment that are found before this method is called will remain in the tag cache, and so will continue to be found by the tag search methods. If this is undesirable, the Source.clearCache() method can be called to remove them from the cache. Calling the Source.fullSequentialParse() method after this method clears the cache automatically.

For efficiency reasons, this method should be called on all segments that need to be ignored without calling any of the tag search methods in between.

See Also:
Source.ignoreWhenParsing(Collection segments)

isComment

public boolean isComment()

Deprecated. Use this instanceof Tag && ((Tag)this).getTagType()==StartTagType.COMMENT instead.

Indicates whether this segment is a Tag of type StartTagType.COMMENT.

This method has been deprecated as of version 2.0 as it is not a robust method of checking whether an HTML comment spans this segment.

Returns:
true if this segment is a Tag of type StartTagType.COMMENT, otherwise false.

isWhiteSpace

public final boolean isWhiteSpace()
Returns:
true if this segment consists entirely of white space, otherwise false.

isWhiteSpace

public static final boolean isWhiteSpace(char ch)
Indicates whether the specified character is white space.

The HTML 4.01 specification section 9.1 specifies the following white space characters:

  • space (U+0020)
  • tab (U+0009)
  • form feed (U+000C)
  • line feed (U+000A)
  • carriage return (U+000D)
  • zero-width space (U+200B)

Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not recognise it as white space and renders it as an unprintable character (empty square). Even zero-width spaces included using the numeric character reference &#x200B; are rendered this way.

Parameters:
ch - the character to test.
Returns:
true if the specified character is white space, otherwise false.
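
The character list above can be expressed as a small check. The class WhiteSpaceSketch is a hypothetical stand-in that tests exactly the six listed characters; treating an empty sequence as white space in the second method is an assumption of this sketch.

```java
// Hypothetical helper for illustration only; not part of the library.
final class WhiteSpaceSketch {
    // True only for the six characters listed in HTML 4.01 section 9.1.
    static boolean isWhiteSpace(char ch) {
        switch (ch) {
            case ' ':      // space (U+0020)
            case '\t':     // tab (U+0009)
            case '\f':     // form feed (U+000C)
            case '\n':     // line feed (U+000A)
            case '\r':     // carriage return (U+000D)
            case '\u200B': // zero-width space
                return true;
            default:
                return false;
        }
    }

    // True if every character is white space (assumed true for an empty sequence).
    static boolean isWhiteSpace(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isWhiteSpace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWhiteSpace('\t'));     // true
        System.out.println(isWhiteSpace('\u00A0')); // false: no-break space is not in the list
        System.out.println(isWhiteSpace(" \r\n"));  // true
        System.out.println(isWhiteSpace(" x "));    // false
    }
}
```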

length

public final int length()
Returns the length of the segment. This is defined as the number of characters between the begin and end positions.
Returns:
the length of the segment.

parseAttributes

public Attributes parseAttributes()
Parses any Attributes within this segment. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.

This is equivalent to source.parseAttributes(getBegin(),getEnd()).

Returns:
the Attributes within this segment, or null if too many errors occur while parsing.

subSequence

public final CharSequence subSequence(int beginIndex,
                                      int endIndex)
Returns a new character sequence that is a subsequence of this sequence.

This is logically equivalent to toString().subSequence(beginIndex,endIndex) for valid values of beginIndex and endIndex.

However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for invalid argument values as described in the String.subSequence(int,int) method.

Parameters:
beginIndex - the begin index, inclusive.
endIndex - the end index, exclusive.
Returns:
a new character sequence that is a subsequence of this sequence.

toString

public String toString()
Returns the source text of this segment as a String.

The returned String is newly created with every call to this method, unless this segment is itself an instance of Source.

Note that before version 2.0 this returned a representation of this object useful for debugging purposes, which can now be obtained via the getDebugInfo() method.

Returns:
the source text of this segment as a String.