java.lang.Object
au.id.jericho.lib.html.Segment
public class Segment
extends java.lang.Object
implements Comparable, CharSequence
Represents a segment of a Source document.
Many of the tag search methods are defined in this class.
The span of a segment is defined by the combination of its begin and end character positions.
public Segment(Source source, int begin, int end)
Constructs a new Segment within the specified source document with the specified begin and end character positions.
- Parameters:
source - the Source document, must not be null.
begin - the character position in the source where this segment begins.
end - the character position in the source where this segment ends.
public final char charAt(int index)
Returns the character at the specified index. This is logically equivalent to toString().charAt(index) for valid argument values 0 <= index < length(). However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for an invalid argument value.
- Parameters:
index - the index of the character.
- Returns:
- the character at the specified index.
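The caveat about invalid indexes can be illustrated with a minimal sketch. It assumes the segment is a window over the source string and that the index is simply offset by the begin position; the class and method names are hypothetical stand-ins, not the library's code.

```java
public class CharAtSketch {
    // Hypothetical stand-in for a segment: a [begin, end) window over a source string.
    // The index is offset into the source without checking it against the window.
    static char charAt(String source, int begin, int index) {
        return source.charAt(begin + index);
    }

    public static void main(String[] args) {
        String source = "<p>Hello</p>";  // hypothetical source text
        int begin = 3;                   // the window "Hello" starts here and ends before position 8
        System.out.println(charAt(source, begin, 0)); // first character of the window
        System.out.println(charAt(source, begin, 5)); // outside the window, but still inside the source
    }
}
```

Under this assumption an index just past the window returns a character from the surrounding source rather than throwing, which is why callers should validate indexes themselves.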
public int compareTo(Object o)
Compares this Segment object to another object. If the argument is not a Segment, a ClassCastException is thrown.
A segment is considered to be before another segment if its begin position is earlier, or in the case that both segments begin at the same position, its end position is earlier.
Segments that begin and end at the same position are considered equal for the purposes of this comparison, even if they relate to different source documents.
Note: this class has a natural ordering that is inconsistent with equals. This means that this method may return zero in some cases where calling the equals(Object) method with the same argument returns false.
- Parameters:
o - the segment to be compared
- Returns:
- a negative integer, zero, or a positive integer as this segment is before, equal to, or after the specified segment.
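The ordering rule and its inconsistency with equals can be sketched with a small stand-in class. The Span class below is hypothetical and only models the documented behaviour: sourceId plays the role of the source document, and the hash code mirrors the sum of positions mentioned for hashCode().

```java
public class OrderingSketch {
    // Hypothetical stand-in for a segment; not the library's class.
    static final class Span implements Comparable<Span> {
        final String sourceId;
        final int begin;
        final int end;

        Span(String sourceId, int begin, int end) {
            this.sourceId = sourceId;
            this.begin = begin;
            this.end = end;
        }

        // Earlier begin sorts first; ties are broken by the earlier end.
        // The source is deliberately not part of the comparison.
        @Override
        public int compareTo(Span other) {
            if (begin != other.begin) return Integer.compare(begin, other.begin);
            return Integer.compare(end, other.end);
        }

        // Equality additionally requires the same source.
        @Override
        public boolean equals(Object object) {
            if (!(object instanceof Span)) return false;
            Span other = (Span) object;
            return sourceId.equals(other.sourceId) && begin == other.begin && end == other.end;
        }

        @Override
        public int hashCode() {
            return begin + end;
        }
    }

    public static void main(String[] args) {
        Span a = new Span("docA", 0, 5);
        Span b = new Span("docB", 0, 5);
        System.out.println(a.compareTo(b)); // same positions compare as equal
        System.out.println(a.equals(b));    // different sources are not equal
    }
}
```

One practical consequence is that sorted collections such as TreeSet, which rely on compareTo, would treat a and b as duplicates, while hash-based collections would keep both.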
public final boolean encloses(Segment segment)
Indicates whether this Segment encloses the specified Segment. This is the case if getBegin() <= segment.getBegin() && getEnd() >= segment.getEnd().
- Parameters:
segment - the segment to be tested for being enclosed by this segment.
- Returns:
- true if this Segment encloses the specified Segment, otherwise false.
public final boolean encloses(int pos)
Indicates whether this segment encloses the specified character position in the source document. This is the case if getBegin() <= pos < getEnd().
- Parameters:
pos - the position in the Source document.
- Returns:
- true if this segment encloses the specified character position in the source document, otherwise false.
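The two enclosure conditions differ at the end position: a segment encloses another segment ending at the same position, but does not enclose the end position itself. A minimal sketch of both conditions on plain integers follows; the helper names are hypothetical.

```java
public class EnclosesSketch {
    // Condition documented for encloses(Segment): begin <= other.begin && end >= other.end.
    static boolean enclosesSpan(int begin, int end, int otherBegin, int otherEnd) {
        return begin <= otherBegin && end >= otherEnd;
    }

    // Condition documented for encloses(int): begin <= pos < end.
    static boolean enclosesPos(int begin, int end, int pos) {
        return begin <= pos && pos < end;
    }

    public static void main(String[] args) {
        System.out.println(enclosesSpan(2, 8, 2, 8)); // a span encloses an identical span
        System.out.println(enclosesPos(2, 8, 8));     // the end position is excluded
    }
}
```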
public final boolean equals(Object object)
Compares the specified object with this Segment for equality. Returns true if and only if the specified object is also a Segment, and both segments have the same Source, and the same begin and end positions.
- Parameters:
object - the object to be compared for equality with this Segment.
- Returns:
- true if the specified object is equal to this Segment, otherwise false.
public String extractText()
Deprecated. Use getTextExtractor().toString() instead.
Extracts the textual content from the HTML markup of this segment. This method has been deprecated as of version 2.4 and replaced with the getTextExtractor() method.
- Returns:
- the textual content from the HTML markup of this segment.
public String extractText(boolean includeAttributes)
Deprecated. Use getTextExtractor().setIncludeAttributes(includeAttributes).toString() instead.
Extracts the textual content from the HTML markup of this segment. This method has been deprecated as of version 2.4 and replaced with the getTextExtractor() method.
- Returns:
- the textual content from the HTML markup of this segment.
public List findAllCharacterReferences()
Returns a list of all CharacterReference objects that are enclosed by this segment.
- Returns:
- a list of all CharacterReference objects that are enclosed by this segment.
public List findAllElements()
Returns a list of all Element objects that are enclosed by this segment.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
The elements returned correspond exactly with the start tags returned in the findAllStartTags() method.
public List findAllElements(String name)
Returns a list of all Element objects with the specified name that are enclosed by this segment.
The elements returned correspond exactly with the start tags returned in the findAllStartTags(String name) method.
Specifying a null argument to the name parameter is equivalent to findAllElements().
This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
name - the name of the elements to find.
public List findAllElements(String attributeName, String value, boolean valueCaseSensitive)
Returns a list of all Element objects with the specified attribute name/value pair that are enclosed by this segment.
The elements returned correspond exactly with the start tags returned in the findAllStartTags(String attributeName, String value, boolean valueCaseSensitive) method.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
public List findAllElements(StartTagType startTagType)
Returns a list of all Element objects with start tags of the specified type that are enclosed by this segment.
The elements returned correspond exactly with the start tags returned in the findAllTags(TagType) method.
- Parameters:
startTagType - the type of start tags to find, must not be null.
public List findAllStartTags()
Returns a list of all StartTag objects that are enclosed by this segment.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
See the Tag class documentation for more details about the behaviour of this method.
public List findAllStartTags(String name)
Returns a list of all StartTag objects with the specified name that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the name parameter is equivalent to findAllStartTags().
This method also returns unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
name - the name of the start tags to find.
public List findAllStartTags(String attributeName, String value, boolean valueCaseSensitive)
Returns a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
public List findAllTags()
Returns a list of all Tag objects that are enclosed by this segment.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
See the Tag class documentation for more details about the behaviour of this method.
public List findAllTags(TagType tagType)
Returns a list of all Tag objects of the specified type that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the tagType parameter is equivalent to findAllTags().
- Parameters:
tagType - the type of tags to find.
public List findFormControls()
Returns a list of the FormControl objects that are enclosed by this segment.
- Returns:
- a list of the FormControl objects that are enclosed by this segment.
public FormFields findFormFields()
Returns the FormFields object representing all form fields that are enclosed by this segment. This is equivalent to new FormFields(findFormControls()).
- Returns:
- the FormFields object representing all form fields that are enclosed by this segment.
- See Also:
findFormControls()
public final int getBegin()
Returns the character position in the Source document at which this segment begins.
- Returns:
- the character position in the Source document at which this segment begins.
public List getChildElements()
Returns a list of the immediate children of this segment in the document element hierarchy.
The returned list may include an element that extends beyond the end of this segment, as long as it begins within this segment.
An element found at the start of this segment is included in the list. Note however that if this segment is an Element, the overriding Element.getChildElements() method is called instead, which only returns the children of the element.
Calling getChildElements() on an Element is usually more efficient than calling it on a Segment.
The objects in the list are all of type Element.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
See the Source.getChildElements() method for more details.
- Returns:
- a list of the immediate children of this segment in the document element hierarchy, guaranteed not null.
- See Also:
Element.getParentElement()
public String getDebugInfo()
Returns a string representation of this object useful for debugging purposes.
- Returns:
- a string representation of this object useful for debugging purposes.
public final int getEnd()
Returns the character position in the Source document immediately after the end of this segment. The character at the position specified by this property is not included in the segment.
- Returns:
- the character position in the Source document immediately after the end of this segment.
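Because the end position is exclusive, begin and end behave like the arguments of String.substring. The sketch below works on a plain string with hypothetical positions to show which characters such a span covers; it does not use the library.

```java
public class HalfOpenSketch {
    // Length of a half-open span [begin, end): the end position itself is excluded.
    static int length(int begin, int end) {
        return end - begin;
    }

    // Text covered by the span; String.substring is also end-exclusive.
    static String text(String source, int begin, int end) {
        return source.substring(begin, end);
    }

    public static void main(String[] args) {
        String source = "<b>bold</b>"; // hypothetical source text
        int begin = 3;                 // hypothetical begin position
        int end = 7;                   // hypothetical end position, one past the last covered character
        System.out.println(text(source, begin, end)); // the covered text
        System.out.println(length(begin, end));       // number of characters covered
        System.out.println(source.charAt(end));       // first character after the span
    }
}
```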
public Renderer getRenderer()
Performs a simple rendering of the HTML markup in this segment into text. The output can be configured by setting any number of properties on the returned Renderer instance before obtaining its output.
- Returns:
- an instance of Renderer based on this segment.
- See Also:
getTextExtractor()
public TextExtractor getTextExtractor()
Extracts the textual content from the HTML markup of this segment. The output can be configured by setting properties on the returned TextExtractor instance before obtaining its output.
- Returns:
- an instance of TextExtractor based on this segment.
- See Also:
getRenderer()
public int hashCode()
Returns a hash code value for the segment. The current implementation returns the sum of the begin and end positions, although this is not guaranteed in future versions.
- Returns:
- a hash code value for the segment.
public void ignoreWhenParsing()
Causes this segment to be ignored when parsing.
Ignored segments are treated as blank spaces by the parsing mechanism, but are included as normal text in all other functions.
This method was originally the only means of preventing server tags located inside normal tags from interfering with the parsing of the tags. The most common scenario is where the attributes of a normal tag use server tags to dynamically set the values of the attributes.
As of version 2.4 it is no longer necessary to use this method to ignore common server tags inside normal tags, as the attributes parser now automatically ignores common server tags. As of version 2.5 it is also unnecessary to use this method to ignore the contents of SCRIPT elements, as the parser automatically ignores this content when performing a full sequential parse.
This leaves only a few scenarios where calling this method still provides a significant benefit. One such case is where XML-style server tags are used inside normal tags. Here is an example using an XML-style JSP tag:
<a href="<i18n:resource path="/Portal"/>?BACK=TRUE">back</a>
The first double-quote of "/Portal" will be interpreted as the end quote for the href attribute, as there is no way for the parser to recognise the i18n:resource element as a server tag. Such use of XML-style server tags inside normal tags is generally seen as bad practice, but it is nevertheless valid JSP. The only way to ensure that this library is able to parse the normal tag surrounding it is to find these server tags first and call the ignoreWhenParsing method to ignore them before parsing the rest of the document.
It is important to understand the difference between ignoring the segment when parsing and removing the segment completely. Any text inside a segment that is ignored when parsing is treated by most functions as content, and as such is included in the output of tools such as TextExtractor and Renderer.
To remove segments completely, create an OutputDocument and call its remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment. Then create a new source document using new Source(outputDocument.toString()) and perform the desired operations on this new source object.
Calling this method after the Source.fullSequentialParse() method has been called is not permitted and throws an IllegalStateException.
Any tags appearing in this segment that are found before this method is called will remain in the tag cache, and so will continue to be found by the tag search methods. If this is undesirable, the Source.clearCache() method can be called to remove them from the cache. Calling the Source.fullSequentialParse() method after this method clears the cache automatically.
For best performance, this method should be called on all segments that need to be ignored without calling any of the tag search methods in between.
- See Also:
Source.ignoreWhenParsing(Collection segments)
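The idea of treating a span as blank space can be sketched on a plain string. Overwriting the server tag with spaces keeps every other character at its original position, which is what makes the surrounding tag parseable without shifting offsets. The blankOut helper below is hypothetical and does not use the library's OutputDocument.

```java
public class BlankOutSketch {
    // Hypothetical helper: overwrite the characters in [begin, end) with spaces
    // so that all other character positions remain unchanged.
    static String blankOut(String text, int begin, int end) {
        StringBuilder result = new StringBuilder(text);
        for (int i = begin; i < end; i++) {
            result.setCharAt(i, ' ');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"<i18n:resource path=\"/Portal\"/>?BACK=TRUE\">back</a>";
        // Locate the XML-style server tag by simple string search (an assumption for this sketch).
        int begin = html.indexOf("<i18n:resource");
        int end = html.indexOf("/>", begin) + 2;
        String blanked = blankOut(html, begin, end);
        System.out.println(blanked);
        System.out.println(blanked.length() == html.length()); // positions are preserved
    }
}
```

After blanking, the only double quotes left inside the start tag are the two that delimit the href value, so a quote-based attribute parser no longer stops at "/Portal".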
public final boolean isWhiteSpace()
Indicates whether this segment consists entirely of white space.
- Returns:
- true if this segment consists entirely of white space, otherwise false.
public static final boolean isWhiteSpace(char ch)
Indicates whether the specified character is white space.
The HTML 4.01 specification section 9.1 specifies the following white space characters:
- space (U+0020)
- tab (U+0009)
- form feed (U+000C)
- line feed (U+000A)
- carriage return (U+000D)
- zero-width space (U+200B)
Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not recognise it as white space and renders it as an unprintable character (empty square). Even zero-width spaces included using the numeric character reference &#x200B; are rendered this way.
- Parameters:
ch - the character to test.
- Returns:
- true if the specified character is white space, otherwise false.
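A check against the six characters listed above can be sketched as follows. This is an illustration of the listed set only, with hypothetical method names; whether the library treats any further characters as white space is not stated here.

```java
public class WhiteSpaceSketch {
    // True only for the six characters listed from HTML 4.01 section 9.1.
    static boolean isHtmlWhiteSpace(char ch) {
        switch (ch) {
            case ' ':       // space (U+0020)
            case '\t':      // tab (U+0009)
            case '\f':      // form feed (U+000C)
            case '\n':      // line feed (U+000A)
            case '\r':      // carriage return (U+000D)
            case '\u200B':  // zero-width space (U+200B)
                return true;
            default:
                return false;
        }
    }

    // True if every character of the text is in the listed set.
    static boolean isAllWhiteSpace(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isHtmlWhiteSpace(text.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isHtmlWhiteSpace('\u200B'));   // zero-width space is in the list
        System.out.println(isHtmlWhiteSpace('\u00A0'));   // no-break space is not in the list
        System.out.println(isAllWhiteSpace(" \t\r\n"));   // only listed characters
    }
}
```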
public final int length()
Returns the length of the segment. This is defined as the number of characters between the begin and end positions.
- Returns:
- the length of the segment.
public Attributes parseAttributes()
Parses any Attributes within this segment. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.
This is equivalent to source.parseAttributes(getBegin(), getEnd()).
- Returns:
- the Attributes within this segment, or null if too many errors occur while parsing.
public final CharSequence subSequence(int beginIndex, int endIndex)
Returns a new character sequence that is a subsequence of this sequence. This is logically equivalent to toString().subSequence(beginIndex, endIndex) for valid values of beginIndex and endIndex. However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for invalid argument values as described in the String.subSequence(int,int) method.
- Parameters:
beginIndex - the begin index, inclusive.
endIndex - the end index, exclusive.
- Returns:
- a new character sequence that is a subsequence of this sequence.
public String toString()
Returns the source text of this segment as a String. The returned String is newly created with every call to this method, unless this segment is itself an instance of Source.
Note that before version 2.0 this returned a representation of this object useful for debugging purposes, which can now be obtained via the getDebugInfo() method.
- Returns:
- the source text of this segment as a String.