java.lang.Object
au.id.jericho.lib.html.Segment
public class Segment
extends java.lang.Object
implements Comparable, CharSequence
Represents a segment of a Source document.
The span of a segment is defined by the combination of its begin and end character positions.
public Segment(Source source, int begin, int end)
Constructs a new Segment within the specified source document with the specified begin and end character positions.
- Parameters:
source
- the Source document, must not be null.
begin
- the character position in the source where this segment begins.
end
- the character position in the source where this segment ends.
public final char charAt(int index)
Returns the character at the specified index. This is logically equivalent to toString().charAt(index) for valid argument values 0 <= index < length(). However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for an invalid argument value.
- Parameters:
index
- the index of the character.
- Returns:
- the character at the specified index.
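The caveat about invalid indexes can be illustrated with a minimal sketch. It assumes the lookup is simply offset by the begin position and read from the source string; the class SegmentViewSketch and its fields are hypothetical and are not the library's code.

```java
// Hypothetical stand-in for a segment: a window onto a source string.
final class SegmentViewSketch {
    private final String source; // full document text
    private final int begin;     // inclusive
    private final int end;       // exclusive

    SegmentViewSketch(String source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    // Assumed behaviour: the index is offset by begin and read from the
    // source, with no check against the segment's own bounds.
    char charAt(int index) {
        return source.charAt(begin + index);
    }

    @Override
    public String toString() {
        return source.substring(begin, end);
    }

    public static void main(String[] args) {
        SegmentViewSketch seg = new SegmentViewSketch("Hello World", 6, 11);
        // Valid index: agrees with toString().charAt(0).
        System.out.println(seg.charAt(0) == seg.toString().charAt(0));
        // Invalid index for the segment, yet no exception is thrown here:
        // the character just before the segment is returned instead.
        System.out.println(seg.charAt(-1) == ' ');
    }
}
```

Under this assumption an out-of-range index only fails when it also falls outside the whole source string, so callers should validate indexes themselves.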
public int compareTo(Object o)
Compares this Segment object to another object. If the argument is not a Segment, a ClassCastException is thrown. A segment is considered to be before another segment if its begin position is earlier, or in the case that both segments begin at the same position, its end position is earlier. Segments that begin and end at the same position are considered equal for the purposes of this comparison, even if they relate to different source documents.
Note: this class has a natural ordering that is inconsistent with equals. This means that this method may return zero in some cases where calling the equals(Object) method with the same argument returns false.
- Parameters:
o
- the segment to be compared
- Returns:
- a negative integer, zero, or a positive integer as this segment is before, equal to, or after the specified segment.
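The ordering rule and its inconsistency with equals can be modelled in a few lines. This is a sketch under stated assumptions: SpanOrderSketch is a hypothetical class, and a plain string stands in for the Source document.

```java
// Hypothetical model of the documented ordering; not the library's source code.
final class SpanOrderSketch implements Comparable<SpanOrderSketch> {
    final String sourceName; // stands in for the Source document
    final int begin;
    final int end;

    SpanOrderSketch(String sourceName, int begin, int end) {
        this.sourceName = sourceName;
        this.begin = begin;
        this.end = end;
    }

    // Earlier begin sorts first; ties are broken by earlier end.
    // The source document plays no part in the comparison.
    @Override
    public int compareTo(SpanOrderSketch other) {
        if (begin != other.begin) {
            return Integer.compare(begin, other.begin);
        }
        return Integer.compare(end, other.end);
    }

    // Equality additionally requires the same source.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SpanOrderSketch)) {
            return false;
        }
        SpanOrderSketch s = (SpanOrderSketch) o;
        return sourceName.equals(s.sourceName) && begin == s.begin && end == s.end;
    }

    @Override
    public int hashCode() {
        return begin + end;
    }

    public static void main(String[] args) {
        SpanOrderSketch a = new SpanOrderSketch("a.html", 3, 9);
        SpanOrderSketch b = new SpanOrderSketch("b.html", 3, 9);
        System.out.println(a.compareTo(b)); // 0
        System.out.println(a.equals(b));    // false
    }
}
```

One practical consequence: a sorted collection that relies on compareTo, such as a TreeSet, would treat a and b above as duplicates even though equals distinguishes them.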
public final boolean encloses(Segment segment)
Indicates whether this Segment encloses the specified Segment. This is the case if getBegin() <= segment.getBegin() && getEnd() >= segment.getEnd().
- Parameters:
segment
- the segment to be tested for being enclosed by this segment.
- Returns:
- true if this Segment encloses the specified Segment, otherwise false.
public final boolean encloses(int pos)
Indicates whether this segment encloses the specified character position in the source document. This is the case if getBegin() <= pos < getEnd().
- Parameters:
pos
- the position in the Source document.
- Returns:
- true if this segment encloses the specified character position in the source document, otherwise false.
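The two encloses conditions differ at the end boundary, which a small sketch makes visible. The helper class EnclosesSketch is hypothetical and works on bare begin and end positions rather than real Segment objects.

```java
// Hypothetical helpers restating the two documented conditions.
final class EnclosesSketch {
    // Segment test: begin <= otherBegin && end >= otherEnd.
    static boolean enclosesSegment(int begin, int end, int otherBegin, int otherEnd) {
        return begin <= otherBegin && end >= otherEnd;
    }

    // Position test: begin <= pos < end (the end position is excluded).
    static boolean enclosesPosition(int begin, int end, int pos) {
        return begin <= pos && pos < end;
    }

    public static void main(String[] args) {
        System.out.println(enclosesSegment(2, 10, 2, 10)); // true: a span encloses itself
        System.out.println(enclosesSegment(2, 10, 5, 11)); // false: other span ends later
        System.out.println(enclosesPosition(2, 10, 9));    // true
        System.out.println(enclosesPosition(2, 10, 10));   // false: end is exclusive
    }
}
```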
public final boolean equals(Object object)
Compares the specified object with this Segment for equality. Returns true if and only if the specified object is also a Segment, and both segments have the same Source, and the same begin and end positions.
- Parameters:
object
- the object to be compared for equality with this Segment.
- Returns:
- true if the specified object is equal to this Segment, otherwise false.
public String extractText()
Extracts the text content of this segment. This method removes all of the tags from the segment and decodes the result, collapsing all white space. See the documentation of the extractText(boolean includeAttributes) method for more details. This is equivalent to calling extractText(false).
- Returns:
- the text content of this segment.
public String extractText(boolean includeAttributes)
Extracts the text content of this segment. This method removes all of the tags from the segment and decodes the result, collapsing all white space. Tags are also converted to whitespace unless they belong to an inline-level element. An exception to this is the BR element, which is also converted to whitespace despite being an inline-level element. Text inside SCRIPT and STYLE elements contained within this segment is ignored.
Specifying a value of true as an argument to the includeAttributes parameter causes the values of title, alt, label, and summary attributes of normal tags to be included in the extracted text.
Example:
- source segment: <div><b>O</b>ne</div><div><b>T</b><script>//a script </script>wo</div>
- extracted text: One Two
Note that in version 2.1, no tags were converted to whitespace and text inside SCRIPT and STYLE elements was included. The example above produced the text "OneT//a script wo".
- Returns:
- the text content of this segment.
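The rules described for extractText can be approximated with regular expressions. This is a rough sketch, not the library's algorithm: ExtractTextSketch is a hypothetical class, the list of inline-level elements is an illustrative subset, and character references are not decoded.

```java
// Regex-based approximation of the documented rules; a sketch only.
final class ExtractTextSketch {
    static String extractText(String html) {
        // 1. Drop SCRIPT and STYLE elements together with their content.
        String s = html.replaceAll("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>", "");
        // 2. Remove tags of some inline-level elements without adding whitespace.
        //    BR is deliberately absent so that it falls through to step 3.
        s = s.replaceAll("(?i)</?(?:b|i|em|strong|span|a)\\b[^>]*>", "");
        // 3. Convert every remaining tag to a space.
        s = s.replaceAll("<[^>]+>", " ");
        // 4. Collapse runs of white space and trim the ends.
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String segment = "<div><b>O</b>ne</div><div><b>T</b><script>//a script </script>wo</div>";
        System.out.println(extractText(segment)); // One Two
    }
}
```

A regex approach like this mishandles comments, malformed markup and attribute values containing ">", which is why a real parser is preferable outside of illustrations.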
public List findAllCharacterReferences()
Returns a list of all CharacterReference objects that are enclosed by this segment.
- Returns:
- a list of all CharacterReference objects that are enclosed by this segment.
public List findAllComments()
Deprecated. Use findAllTags(StartTagType.COMMENT) instead.
Returns a list of all StartTag objects representing HTML comments that are enclosed by this segment. This method has been deprecated as of version 2.0 in favour of the more generic findAllTags(TagType) method.
public List findAllElements()
Returns a list of all Element objects that are enclosed by this segment. The elements returned correspond exactly with the start tags returned in the findAllStartTags() method.
public List findAllElements(String name)
Returns a list of all Element objects with the specified name that are enclosed by this segment. The elements returned correspond exactly with the start tags returned in the findAllStartTags(String name) method. Specifying a null argument to the name parameter is equivalent to findAllElements(). This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
name
- the name of the elements to find.
public List findAllElements(StartTagType startTagType)
Returns a list of all Element objects with start tags of the specified type that are enclosed by this segment. The elements returned correspond exactly with the start tags returned in the findAllTags(TagType) method.
- Parameters:
startTagType
- the type of start tags to find, must not be null.
public List findAllStartTags()
public List findAllStartTags(String name)
Returns a list of all StartTag objects with the specified name that are enclosed by this segment. See the Tag class documentation for more details about the behaviour of this method. Specifying a null argument to the name parameter is equivalent to findAllStartTags(). This method also returns unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
name
- the name of the start tags to find.
public List findAllStartTags(String attributeName, String value, boolean valueCaseSensitive)
Returns a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
attributeName
- the attribute name (case insensitive) to search for, must not be null.
value
- the value of the specified attribute to search for, must not be null.
valueCaseSensitive
- specifies whether the attribute value matching is case sensitive.
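The interplay of the three parameters can be sketched on a plain map of attributes. AttributeMatchSketch and its matches method are hypothetical; how the library stores and compares attributes internally is an assumption here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical matcher mirroring the documented parameter semantics.
final class AttributeMatchSketch {
    static boolean matches(Map<String, String> attributes, String attributeName,
                           String value, boolean valueCaseSensitive) {
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            // Attribute names are always compared without regard to case.
            if (entry.getKey().equalsIgnoreCase(attributeName)) {
                String actual = entry.getValue();
                if (actual == null) {
                    return false; // attribute present but without a value
                }
                // Value comparison honours the valueCaseSensitive flag.
                return valueCaseSensitive ? actual.equals(value) : actual.equalsIgnoreCase(value);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("CLASS", "Header");
        System.out.println(matches(attrs, "class", "header", false)); // true
        System.out.println(matches(attrs, "class", "header", true));  // false
    }
}
```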
public List findAllTags()
public List findAllTags(TagType tagType)
Returns a list of all Tag objects of the specified type that are enclosed by this segment. See the Tag class documentation for more details about the behaviour of this method. Specifying a null argument to the tagType parameter is equivalent to findAllTags().
- Parameters:
tagType
- the type of tags to find.
public List findFormControls()
Returns a list of the FormControl objects that are enclosed by this segment.
- Returns:
- a list of the FormControl objects that are enclosed by this segment.
public FormFields findFormFields()
Returns the FormFields object representing all form fields that are enclosed by this segment. This is equivalent to new FormFields(findFormControls()).
- Returns:
- the FormFields object representing all form fields that are enclosed by this segment.
- See Also:
findFormControls()
public final List findWords()
Deprecated. No replacement.
Returns a list of Segment objects representing every word in this segment separated by white space. Note that any markup contained in this segment is regarded as normal text for the purposes of this method. This method has been deprecated as of version 2.0 as it has no discernible use.
- Returns:
- a list of Segment objects representing every word in this segment separated by white space.
public final int getBegin()
Returns the character position in the Source document at which this segment begins.
- Returns:
- the character position in the Source document at which this segment begins.
public List getChildElements()
Returns a list of the immediate children of this segment in the document element hierarchy. The returned list may include an element that extends beyond the end of this segment, as long as it begins within this segment. The objects in the list are all of type Element. See the Source.getChildElements() method for more details.
- Returns:
- a list of the immediate children of this segment in the document element hierarchy, guaranteed not null.
- See Also:
Element.getParentElement()
public String getDebugInfo()
Returns a string representation of this object useful for debugging purposes.
- Returns:
- a string representation of this object useful for debugging purposes.
public final int getEnd()
Returns the character position in the Source document immediately after the end of this segment. The character at the position specified by this property is not included in the segment.
- Returns:
- the character position in the Source document immediately after the end of this segment.
public String getSourceText()
Deprecated. Use toString() instead.
Returns the source text of this segment. This method has been deprecated as of version 2.0 as it now duplicates the functionality of the toString() method.
- Returns:
- the source text of this segment.
public final String getSourceTextNoWhitespace()
Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.
Returns the source text of this segment without white space. All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space. This method has been deprecated as of version 2.0 as it is no longer used internally and has no practical use as a public method. It is similar to the new CharacterReference.decodeCollapseWhiteSpace(CharSequence) method, but does not decode the text after collapsing the white space.
- Returns:
- the source text of this segment without white space.
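The collapsing rule (drop leading and trailing white space, reduce each internal run to one space) can be sketched as a single pass over the text. CollapseWhiteSpaceSketch is a hypothetical class, and it uses Character.isWhitespace as a stand-in for the library's own white space test, which may accept a different set of characters.

```java
// Single-pass white space collapsing; a sketch, not the library's code.
final class CollapseWhiteSpaceSketch {
    static String collapse(CharSequence text) {
        StringBuilder out = new StringBuilder();
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isWhitespace(ch)) {
                // Remember the gap, but only once some text has been emitted,
                // so that leading white space is dropped.
                pendingSpace = out.length() > 0;
            } else {
                // Emit one space for the whole preceding run, then the character.
                if (pendingSpace) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(ch);
            }
        }
        // A pending space at the end is never emitted, so trailing runs vanish.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("  a \t\n b  ") + "]"); // [a b]
    }
}
```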
public int hashCode()
Returns a hash code value for the segment. The current implementation returns the sum of the begin and end positions, although this is not guaranteed in future versions.
- Returns:
- a hash code value for the segment.
public void ignoreWhenParsing()
Causes this segment to be ignored when parsing. This method is usually used to exclude server tags or other non-HTML segments from the source text so that they do not interfere with the parsing of the surrounding HTML. This is necessary because many server tags are used as attribute values and in other places within HTML tags, and very often contain characters that prevent the parser from recognising the surrounding tag.
Any tags appearing in this segment that are found before this method is called will remain in the tag cache, and so will continue to be found by the tag search methods. If this is undesirable, the Source.clearCache() method can be called to remove them from the cache. Calling the Source.fullSequentialParse() method after this method clears the cache automatically.
For efficiency reasons, this method should be called on all segments that need to be ignored without calling any of the tag search methods in between.
- See Also:
Source.ignoreWhenParsing(Collection segments)
public boolean isComment()
Deprecated. Use this instanceof Tag && ((Tag)this).getTagType() == StartTagType.COMMENT instead.
Indicates whether this segment is a Tag of type StartTagType.COMMENT. This method has been deprecated as of version 2.0 as it is not a robust method of checking whether an HTML comment spans this segment.
- Returns:
- true if this segment is a Tag of type StartTagType.COMMENT, otherwise false.
public final boolean isWhiteSpace()
Indicates whether this segment consists entirely of white space.
- Returns:
- true if this segment consists entirely of white space, otherwise false.
public static final boolean isWhiteSpace(char ch)
Indicates whether the specified character is white space. The HTML 4.01 specification section 9.1 specifies the following white space characters:
- space (U+0020)
- tab (U+0009)
- form feed (U+000C)
- line feed (U+000A)
- carriage return (U+000D)
- zero-width space (U+200B)
Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not recognise it as whitespace and renders it as an unprintable character (empty square). Even zero-width spaces included using the numeric character reference &#x200B; are rendered this way.
- Parameters:
ch
- the character to test.
- Returns:
- true if the specified character is white space, otherwise false.
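The character list above translates directly into a membership test. The sketch below assumes the method accepts exactly the six listed characters; WhiteSpaceSketch is a hypothetical class and the real implementation may differ.

```java
// Hypothetical test for the six white space characters listed above.
final class WhiteSpaceSketch {
    static boolean isWhiteSpace(char ch) {
        switch (ch) {
            case ' ':      // space (U+0020)
            case '\t':     // tab (U+0009)
            case '\f':     // form feed (U+000C)
            case '\n':     // line feed (U+000A)
            case '\r':     // carriage return (U+000D)
            case '\u200B': // zero-width space (U+200B)
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWhiteSpace('\t'));     // true
        System.out.println(isWhiteSpace('\u200B')); // true
        System.out.println(isWhiteSpace('\u00A0')); // false: no-break space is not in the list
    }
}
```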
public final int length()
Returns the length of the segment. This is defined as the number of characters between the begin and end positions.
- Returns:
- the length of the segment.
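Because the end position is exclusive, the length reduces to a subtraction. A minimal sketch on a plain string, with the hypothetical class SpanLengthSketch standing in for a segment:

```java
// Half-open span arithmetic: begin is inclusive, end is exclusive.
final class SpanLengthSketch {
    static int length(int begin, int end) {
        return end - begin;
    }

    public static void main(String[] args) {
        String source = "<p>Hi</p>";
        int begin = 3; // position of 'H'
        int end = 5;   // position just after 'i'
        System.out.println(source.substring(begin, end)); // Hi
        System.out.println(length(begin, end));           // 2
    }
}
```

The same half-open convention is used by String.substring, so a span's text and its length always agree.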
public Attributes parseAttributes()
Parses any Attributes within this segment. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations. This is equivalent to source.parseAttributes(getBegin(), getEnd()).
- Returns:
- the Attributes within this segment, or null if too many errors occur while parsing.
public final CharSequence subSequence(int beginIndex, int endIndex)
Returns a new character sequence that is a subsequence of this sequence. This is logically equivalent to toString().subSequence(beginIndex,endIndex) for valid values of beginIndex and endIndex. However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for invalid argument values as described in the String.subSequence(int,int) method.
- Parameters:
beginIndex
- the begin index, inclusive.
endIndex
- the end index, exclusive.
- Returns:
- a new character sequence that is a subsequence of this sequence.
public String toString()
Returns the source text of this segment as a String. The returned String is newly created with every call to this method, unless this segment is itself an instance of Source. Note that before version 2.0 this returned a representation of this object useful for debugging purposes, which can now be obtained via the getDebugInfo() method.
- Returns:
- the source text of this segment as a String.