au.id.jericho.lib.html

Class Source

Implemented Interfaces:
CharSequence, Comparable

public class Source
extends Segment

Represents a source HTML document.

The first step in parsing an HTML document is always to construct a Source object from the source data, which can be a String, Reader, InputStream or URL. Each constructor uses all the evidence available to determine the original character encoding of the data.

Once the Source object has been created, you can immediately start searching for tags or elements within the document using the tag search methods. It is strongly advised, however, to first consider how many of the document's tags you will need to parse. If you will be searching for all or most of the tags, performance can be greatly improved by first calling the fullSequentialParse() method. If you only need to parse a few tags, performance will probably be better in the default parse-on-demand mode.

It can also be useful to set the location of the log writer before calling any tag search methods so that important log messages can be traced while the document is being parsed.

Note that many of the useful functions which can be performed on the source document are defined in its superclass, Segment. The source object is itself a segment which spans the entire document.

Most of the methods defined in this class are useful for determining the elements and tags surrounding or neighbouring a particular character position in the document.

For information on how to create a modified version of this source document, see the OutputDocument class.

See Also:
Segment

Constructor Summary

Source(CharSequence text)
Constructs a new Source object from the specified text.
Source(InputStream inputStream)
Constructs a new Source object by loading the content from the specified InputStream.
Source(Reader reader)
Constructs a new Source object by loading the content from the specified Reader.
Source(URL url)
Constructs a new Source object by loading the content from the specified URL.

Method Summary

void
clearCache()
Clears the tag cache of all tags.
List
findAllElements()
Returns a list of all elements in this source document.
List
findAllStartTags()
Returns a list of all start tags in this source document.
List
findAllTags()
Returns a list of all tags in this source document.
Segment
findEnclosingComment(int pos)
Deprecated. Use findEnclosingTag(pos,StartTagType.COMMENT) instead.
Element
findEnclosingElement(int pos)
Returns the most nested Element that encloses the specified position in the source document.
Element
findEnclosingElement(int pos, String name)
Returns the most nested Element with the specified name that encloses the specified position in the source document.
StartTag
findEnclosingStartTag(int pos)
Deprecated. Use findEnclosingTag(int pos) instead.
Tag
findEnclosingTag(int pos)
Returns the Tag that encloses the specified position in the source document.
Tag
findEnclosingTag(int pos, TagType tagType)
Returns the Tag of the specified type that encloses the specified position in the source document.
int
findNameEnd(int pos)
Returns the end position of the XML Name that starts at the specified position.
CharacterReference
findNextCharacterReference(int pos)
Returns the CharacterReference beginning at or immediately following the specified position in the source document.
StartTag
findNextComment(int pos)
Deprecated. Use findNextTag(pos,StartTagType.COMMENT) instead.
Element
findNextElement(int pos)
Returns the Element beginning at or immediately following the specified position in the source document.
Element
findNextElement(int pos, String name)
Returns the Element with the specified name beginning at or immediately following the specified position in the source document.
EndTag
findNextEndTag(int pos)
Returns the EndTag beginning at or immediately following the specified position in the source document.
EndTag
findNextEndTag(int pos, String name)
Returns the normal EndTag with the specified name beginning at or immediately following the specified position in the source document.
EndTag
findNextEndTag(int pos, String name, EndTagType endTagType)
Returns the EndTag with the specified name and type beginning at or immediately following the specified position in the source document.
StartTag
findNextStartTag(int pos)
Returns the StartTag beginning at or immediately following the specified position in the source document.
StartTag
findNextStartTag(int pos, String name)
Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.
StartTag
findNextStartTag(int pos, String attributeName, String value, boolean valueCaseSensitive)
Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
Tag
findNextTag(int pos)
Returns the Tag beginning at or immediately following the specified position in the source document.
Tag
findNextTag(int pos, TagType tagType)
Returns the Tag of the specified type beginning at or immediately following the specified position in the source document.
CharacterReference
findPreviousCharacterReference(int pos)
Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.
EndTag
findPreviousEndTag(int pos)
Returns the EndTag beginning at or immediately preceding the specified position in the source document.
EndTag
findPreviousEndTag(int pos, String name)
Returns the normal EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
StartTag
findPreviousStartTag(int pos)
Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.
StartTag
findPreviousStartTag(int pos, String name)
Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
Tag
findPreviousTag(int pos)
Returns the Tag beginning at or immediately preceding (or enclosing) the specified position in the source document.
Tag
findPreviousTag(int pos, TagType tagType)
Returns the Tag of the specified type beginning at or immediately preceding (or enclosing) the specified position in the source document.
Tag[]
fullSequentialParse()
Parses all of the tags in this source document sequentially from beginning to end.
String
getCacheDebugInfo()
Returns a string representation of the tag cache, useful for debugging purposes.
List
getChildElements()
Returns a list of the top-level elements in the document element hierarchy.
int
getColumn(int pos)
Returns the column number of the specified character position in the source document.
String
getDocumentSpecifiedEncoding()
Returns the document encoding specified within the text of the document.
Element
getElementById(String id)
Returns the Element with the specified id attribute value.
String
getEncoding()
Returns the original encoding of the source document.
String
getEncodingSpecificationInfo()
Returns a simple description of how the encoding of the source document was determined.
Writer
getLogWriter()
Returns the destination Writer for log messages.
Iterator
getNextTagIterator(int pos)
Deprecated. Use findAllTags().iterator() instead, or multiple calls to the Tag.findNextTag() method.
ParseText
getParseText()
Returns the parse text of this source document.
int
getRow(int pos)
Returns the row number of the specified character position in the source document.
RowColumnVector
getRowColumnVector(int pos)
Returns a RowColumnVector object representing the row and column number of the specified character position in the source document.
Tag
getTagAt(int pos)
Returns the Tag at the specified position in the source document.
void
ignoreWhenParsing(Collection segments)
Causes all of the segments in the specified collection to be ignored when parsing.
void
ignoreWhenParsing(int begin, int end)
Causes the specified range of the source text to be ignored when parsing.
CharStreamSource
indent(String indentText, boolean tidyTags, boolean collapseWhiteSpace, boolean indentAllElements)
Reproduces the source text with indenting that represents the document element hierarchy of this source document.
boolean
isLoggingEnabled()
Indicates whether logging is currently enabled.
boolean
isXML()
Indicates whether the source document is likely to be XML.
void
log(String message)
Writes the specified message to the log.
Attributes
parseAttributes(int pos, int maxEnd)
Parses any Attributes starting at the specified position.
Attributes
parseAttributes(int pos, int maxEnd, int maxErrorCount)
Parses any Attributes starting at the specified position.
void
setLogWriter(Writer writer)
Sets the destination Writer for log messages.
String
toString()
Returns the source text as a String.

Methods inherited from class au.id.jericho.lib.html.Segment

charAt, compareTo, encloses, encloses, equals, extractText, extractText, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findAllTags, findAllTags, findFormControls, findFormFields, findWords, getBegin, getChildElements, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString

Constructor Details

Source

public Source(CharSequence text)
Constructs a new Source object from the specified text.
Parameters:
text - the source text.

Source

public Source(InputStream inputStream)
            throws IOException
Constructs a new Source object by loading the content from the specified InputStream.

The algorithm for detecting the character encoding of the source document from the raw bytes of the specified input stream is the same as that for the Source(URL) constructor with the following exceptions:

  • Step 1 is not possible as there is no Content-Type header to check.
  • Step 6 is not performed as it is not possible to know whether the input stream was acquired from an HTTP connection.
Parameters:
inputStream - the java.io.InputStream from which to load the source text.

Source

public Source(Reader reader)
            throws IOException
Constructs a new Source object by loading the content from the specified Reader.
Parameters:
reader - the java.io.Reader from which to load the source text.

Source

public Source(URL url)
            throws IOException
Constructs a new Source object by loading the content from the specified URL.

The algorithm for detecting the character encoding of the source document is as follows:

  1. If the URLConnection.getContentType() specifies an encoding (where a charset parameter is included in the value of the stream's Content-Type header), then this is used to decode the input stream and is returned verbatim by the getEncoding() method of the created source object. Otherwise:
  2. Get the content input stream via the URLConnection.getInputStream() method.
  3. If the input stream is empty, the created source document has zero length and its getEncoding() method returns null. Otherwise:
  4. Determine a preliminary encoding by examining the first 4 bytes of the input stream:
    • If the first two bytes match the byte order mark (U+FEFF) in either big or little endian order:
      • If the third byte is 00, assume a 32-bit encoding (UTF-32).
      • Otherwise, assume a 16-bit encoding (UTF-16).
    • If the first byte is 00:
      • If the second or fourth byte is 00, assume a 32-bit encoding (UTF-32).
      • Otherwise, assume a big endian 16-bit encoding without byte order mark (UTF-16BE).
    • If the second byte is 00:
      • If the third byte is 00, assume a 32-bit encoding (UTF-32).
      • Otherwise, assume a little endian 16-bit encoding without byte order mark (UTF-16LE).
    • If the first four bytes match the EBCDIC encoding of "<?xm", the preliminary encoding is Cp037.
    • Otherwise, assume an 8-bit encoding (UTF-8).
  5. Preview the first 2048 characters of the source document (hereafter referred to as the preview segment) using the preliminary encoding. If the preview segment contains an encoding specification (which is always at or near the top of the document), the specified encoding is used to decode the input stream and is returned verbatim by the getEncoding() method of the created source object. Otherwise:
  6. If the preview segment does not contain an encoding specification, and the URLConnection is an instance of HttpURLConnection, then section 3.7.1 of the HTTP protocol specifies that an encoding of ISO-8859-1 can be assumed. An XML document should not assume this as it would require an XML declaration to specify this encoding, which would have been detected in one of the previous steps. So if the preview segment is not determined to be XML, and the preliminary encoding is 8-bit, then the encoding ISO-8859-1 is used to decode the input stream and is returned by the getEncoding() method of the created source object.
  7. Otherwise, the preliminary encoding is used to decode the input stream and is returned by the getEncoding() method of the created source object.
Parameters:
url - the URL from which to load the source text.
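
The byte checks in step 4 can be sketched in plain Java as shown below. This is only an illustration of the rules as listed above, not the library's implementation: the class name PreliminaryEncodingSketch and the method guess are hypothetical, and treating missing bytes in inputs shorter than four bytes as "no match" is an assumption that the description does not cover (an empty stream is already handled by step 3).

```java
import java.nio.charset.StandardCharsets;

final class PreliminaryEncodingSketch {
    // Returns the unsigned byte at index i, or -1 if the input is shorter (assumption).
    private static int at(byte[] b, int i) {
        return i < b.length ? (b[i] & 0xFF) : -1;
    }

    // Applies the step 4 checks, in the order listed, to the first bytes of a stream.
    static String guess(byte[] head) {
        int b1 = at(head, 0), b2 = at(head, 1), b3 = at(head, 2), b4 = at(head, 3);
        // Byte order mark U+FEFF in big endian (FE FF) or little endian (FF FE) order.
        boolean bom = (b1 == 0xFE && b2 == 0xFF) || (b1 == 0xFF && b2 == 0xFE);
        if (bom) return b3 == 0x00 ? "UTF-32" : "UTF-16";
        if (b1 == 0x00) return (b2 == 0x00 || b4 == 0x00) ? "UTF-32" : "UTF-16BE";
        if (b2 == 0x00) return b3 == 0x00 ? "UTF-32" : "UTF-16LE";
        // "<?xm" encoded in EBCDIC is 4C 6F A7 94.
        if (b1 == 0x4C && b2 == 0x6F && b3 == 0xA7 && b4 == 0x94) return "Cp037";
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(guess("<htm".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(guess("<?".getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(guess("<?".getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

The result is only a preliminary guess; steps 5 to 7 decide which encoding is finally used.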

          Method Details

          clearCache

          public void clearCache()
          Clears the tag cache of all tags.

          This method may be useful after calling the Segment.ignoreWhenParsing() method so that any tags previously found within the ignored segments will no longer be returned by the tag search methods.


          findAllElements

          public List findAllElements()
          Returns a list of all elements in this source document.

          The fullSequentialParse() method should be called after construction of the Source object if this method is to be used.

          The elements returned correspond exactly with the start tags returned in the findAllStartTags() method.

          Overrides:
          findAllElements in class Segment
          Returns:
          a list of all elements in this source document.

          findAllStartTags

          public List findAllStartTags()
          Returns a list of all start tags in this source document.

          The fullSequentialParse() method should be called after construction of the Source object if this method is to be used.

          See the Tag class documentation for more details about the behaviour of this method.

          Overrides:
          findAllStartTags in class Segment
          Returns:
          a list of all start tags in this source document.

          findAllTags

          public List findAllTags()
          Returns a list of all tags in this source document.

          The fullSequentialParse() method should be called after construction of the Source object if this method is to be used.

          See the Tag class documentation for more details about the behaviour of this method.

          Overrides:
          findAllTags in class Segment
          Returns:
          a list of all tags in this source document.

          findEnclosingComment

          public Segment findEnclosingComment(int pos)

          Deprecated. Use findEnclosingTag(pos,StartTagType.COMMENT) instead.

          Returns the Segment object representing the HTML comment that encloses the specified position in the source document.

          This method has been deprecated as of version 2.0 in favour of the more generic findEnclosingTag(int pos, TagType) method.

          Parameters:
          pos - the position in the source document.
          Returns:
          the Segment object representing the HTML comment that encloses the specified position in the source document, or null if the position is not within a comment.

          findEnclosingElement

          public Element findEnclosingElement(int pos)
          Returns the most nested Element that encloses the specified position in the source document.

          The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document, may be out of bounds.
          Returns:
          the most nested Element that encloses the specified position in the source document, or null if the position is not within an element or is out of bounds.
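
To illustrate what "most nested" means here, the following sketch picks the innermost of several character spans that contain a position. It is not the library's algorithm: the class EnclosingSpanSketch and the method findInnermost are hypothetical, elements are modelled as plain {begin, end} pairs, and treating the end position as exclusive is an assumption.

```java
final class EnclosingSpanSketch {
    // Returns the index of the innermost span containing pos, or -1 if no span
    // contains it (including when pos is out of bounds).
    // Each span is {begin, end}, with end treated as exclusive (assumption).
    static int findInnermost(int[][] spans, int pos) {
        int best = -1;
        for (int i = 0; i < spans.length; i++) {
            int begin = spans[i][0];
            int end = spans[i][1];
            if (pos < begin || pos >= end) continue; // span does not enclose pos
            // With properly nested spans, the innermost one begins last;
            // ties on begin are broken in favour of the shorter span.
            if (best == -1
                    || begin > spans[best][0]
                    || (begin == spans[best][0] && end < spans[best][1])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[][] spans = { {0, 100}, {10, 50}, {20, 30} };
        System.out.println(findInnermost(spans, 25));
        System.out.println(findInnermost(spans, 200));
    }
}
```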

          findEnclosingElement

          public Element findEnclosingElement(int pos,
                                              String name)
          Returns the most nested Element with the specified name that encloses the specified position in the source document.

          The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment.

          See the Tag class documentation for more details about the behaviour of this method.

          This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.

          Parameters:
          pos - the position in the source document, may be out of bounds.
          name - the name of the element to search for.
          Returns:
          the most nested Element with the specified name that encloses the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findEnclosingStartTag

          public StartTag findEnclosingStartTag(int pos)

          Deprecated. Use findEnclosingTag(int pos) instead. (see caveat)

          Returns the StartTag that encloses the specified position in the source document.

          This method has been deprecated as of version 2.0 in favour of the more generic findEnclosingTag(int pos) method.

          Caveat - The returned tag from findEnclosingTag(int pos) may be an instance of EndTag. In most cases this should be interpreted in the same way as if this method returned a null, since an end tag normally does not exist inside of a start tag. There is however one situation where this may occur legitimately, where a server-side end tag appears within a normal start tag. It is up to the developer to decide whether this situation requires special handling when updating code that uses this deprecated method.

          Parameters:
          pos - the position in the source document.
          Returns:
          the StartTag that encloses the specified position in the source document, or null if the position is not within a start tag.
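
When updating code that uses this deprecated method, the caveat above suggests treating an EndTag result as if it were null in most cases. A minimal sketch of that check follows. The nested Tag, StartTag and EndTag classes are stand-ins declared only so that the example compiles on its own; they are not the library's classes, and the helper name asStartTagOrNull is hypothetical.

```java
final class EnclosingTagMigrationSketch {
    // Stand-in types for illustration only; NOT the library's classes.
    static class Tag {}
    static class StartTag extends Tag {}
    static class EndTag extends Tag {}

    // Mimics the old return contract: a StartTag, or null for anything else,
    // including an EndTag (the common-case interpretation given in the caveat).
    static StartTag asStartTagOrNull(Tag enclosing) {
        return (enclosing instanceof StartTag) ? (StartTag) enclosing : null;
    }

    public static void main(String[] args) {
        System.out.println(asStartTagOrNull(new StartTag()) != null);
        System.out.println(asStartTagOrNull(new EndTag()) == null);
    }
}
```

Code that must handle a server-side end tag inside a normal start tag would need its own branch instead of discarding the EndTag.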

          findEnclosingTag

          public Tag findEnclosingTag(int pos)
          Returns the Tag that encloses the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document, may be out of bounds.
          Returns:
          the Tag that encloses the specified position in the source document, or null if the position is not within a tag or is out of bounds.

          findEnclosingTag

          public Tag findEnclosingTag(int pos,
                                      TagType tagType)
          Returns the Tag of the specified type that encloses the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document, may be out of bounds.
          tagType - the TagType to search for.
          Returns:
          the Tag of the specified type that encloses the specified position in the source document, or null if the position is not within a tag of the specified type or is out of bounds.

          findNameEnd

          public int findNameEnd(int pos)
          Returns the end position of the XML Name that starts at the specified position.

          This implementation first checks that the character at the specified position is a valid XML Name start character as defined by the Tag.isXMLNameStartChar(char) method. If this is not the case, the value -1 is returned.

          Once the first character has been checked, subsequent characters are checked using the Tag.isXMLNameChar(char) method until one is found that is not a valid XML Name character or the end of the document is reached. This position is then returned.

          Parameters:
          pos - the position in the source document of the first character of the XML Name.
          Returns:
          the end position of the XML Name that starts at the specified position, or -1 if the character at the specified position is not a valid XML Name start character.
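
The scan described above can be sketched as follows. The two predicates are simplified stand-ins for Tag.isXMLNameStartChar(char) and Tag.isXMLNameChar(char), whose exact character sets are not given here, and returning -1 for an out-of-bounds position is an assumption. The class name NameEndSketch is hypothetical.

```java
final class NameEndSketch {
    // Simplified stand-in for Tag.isXMLNameStartChar(char) (assumption).
    static boolean isNameStartChar(char c) {
        return Character.isLetter(c) || c == '_' || c == ':';
    }

    // Simplified stand-in for Tag.isXMLNameChar(char) (assumption).
    static boolean isNameChar(char c) {
        return isNameStartChar(c) || Character.isDigit(c) || c == '-' || c == '.';
    }

    // Returns the position of the first character after the name starting at pos,
    // or -1 if pos does not hold a name start character.
    static int findNameEnd(CharSequence text, int pos) {
        if (pos < 0 || pos >= text.length()) return -1; // assumption: out of bounds gives -1
        if (!isNameStartChar(text.charAt(pos))) return -1;
        int i = pos + 1;
        while (i < text.length() && isNameChar(text.charAt(i))) i++;
        return i; // first non-name character, or the end of the text
    }

    public static void main(String[] args) {
        System.out.println(findNameEnd("<o:p class>", 1));
        System.out.println(findNameEnd("<o:p class>", 0));
    }
}
```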

          findNextCharacterReference

          public CharacterReference findNextCharacterReference(int pos)
          Returns the CharacterReference beginning at or immediately following the specified position in the source document.

          Character references positioned within an HTML comment are NOT ignored.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          Returns:
          the CharacterReference beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findNextComment

          public StartTag findNextComment(int pos)

          Deprecated. Use findNextTag(pos,StartTagType.COMMENT) instead.

          Returns the StartTag object representing the HTML comment beginning at or immediately following the specified position in the source document.

          This method has been deprecated as of version 2.0 in favour of the more generic findNextTag(int pos, TagType) method.

          Parameters:
          pos - the position in the source document from which to start the search.
          Returns:
          the StartTag object representing the HTML comment beginning at or immediately following the specified position in the source document, or null if none exists.

          findNextElement

          public Element findNextElement(int pos)
          Returns the Element beginning at or immediately following the specified position in the source document.

          This is equivalent to findNextStartTag(pos).getElement(), assuming the result is not null.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          Returns:
          the Element beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findNextElement

          public Element findNextElement(int pos,
                                         String name)
          Returns the Element with the specified name beginning at or immediately following the specified position in the source document.

          This is equivalent to findNextStartTag(pos,name).getElement(), assuming the result is not null.

          Specifying a null argument to the name parameter is equivalent to findNextElement(pos).

          Specifying an argument to the name parameter that ends in a colon (:) searches for all elements in the specified XML namespace.

          This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          name - the name of the element to search for.
          Returns:
          the Element with the specified name beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
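
The handling of the name argument described above (null matches everything, a trailing colon selects a namespace prefix) can be sketched like this. The class NameFilterSketch and the method matches are hypothetical, and the sketch assumes both names have already been normalised to the same case, which this section does not specify.

```java
final class NameFilterSketch {
    // Decides whether a tag name satisfies the search name.
    // Assumption: tagName and searchName are already in the same case.
    static boolean matches(String tagName, String searchName) {
        if (searchName == null) return true;          // no name filter
        if (searchName.endsWith(":")) {
            return tagName.startsWith(searchName);    // any name in that namespace
        }
        return tagName.equals(searchName);            // exact name
    }

    public static void main(String[] args) {
        System.out.println(matches("o:p", "o:"));
        System.out.println(matches("p", "o:"));
        System.out.println(matches("div", null));
    }
}
```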

          findNextEndTag

          public EndTag findNextEndTag(int pos)
          Returns the EndTag beginning at or immediately following the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          Returns:
          the EndTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findNextEndTag

          public EndTag findNextEndTag(int pos,
                                       String name)
          Returns the normal EndTag with the specified name beginning at or immediately following the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          name - the name of the end tag to search for, must not be null.
          Returns:
          the normal EndTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findNextEndTag

          public EndTag findNextEndTag(int pos,
                                       String name,
                                       EndTagType endTagType)
          Returns the EndTag with the specified name and type beginning at or immediately following the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          name - the name of the end tag to search for, must not be null.
          endTagType - the type of the end tag to search for, must not be null.
          Returns:
          the EndTag with the specified name and type beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findNextStartTag

          public StartTag findNextStartTag(int pos)
          Returns the StartTag beginning at or immediately following the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          Returns:
          the StartTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findNextStartTag

          public StartTag findNextStartTag(int pos,
                                           String name)
          Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Specifying a null argument to the name parameter is equivalent to findNextStartTag(pos).

          Specifying an argument to the name parameter that ends in a colon (:) searches for all start tags in the specified XML namespace.

          This method also returns unregistered tags if the specified name is not a valid XML tag name.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          name - the name of the start tag to search for.
          Returns:
          the StartTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findNextStartTag

          public StartTag findNextStartTag(int pos,
                                           String attributeName,
                                           String value,
                                           boolean valueCaseSensitive)
          Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          attributeName - the attribute name (case insensitive) to search for, must not be null.
          value - the value of the specified attribute to search for, must not be null.
          valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
          Returns:
          the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
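
The effect of the valueCaseSensitive parameter can be illustrated with a small matcher over a plain map of attributes. This is a sketch under stated assumptions rather than the library's code: the class AttributeMatchSketch is hypothetical, and representing a start tag's attributes as a Map is purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class AttributeMatchSketch {
    // Returns true if attrs holds the given name/value pair.
    // The attribute name is always compared case insensitively; the value
    // comparison is case sensitive only when valueCaseSensitive is true.
    static boolean matches(Map<String, String> attrs, String attributeName,
                           String value, boolean valueCaseSensitive) {
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (!e.getKey().equalsIgnoreCase(attributeName)) continue;
            String actual = e.getValue();
            if (actual == null) continue; // attribute present without a value
            boolean same = valueCaseSensitive
                    ? actual.equals(value)
                    : actual.equalsIgnoreCase(value);
            if (same) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<String, String>();
        attrs.put("CLASS", "Menu");
        System.out.println(matches(attrs, "class", "menu", false));
        System.out.println(matches(attrs, "class", "menu", true));
    }
}
```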

          findNextTag

          public Tag findNextTag(int pos)
          Returns the Tag beginning at or immediately following the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          Returns:
          the Tag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findNextTag

          public Tag findNextTag(int pos,
                                 TagType tagType)
          Returns the Tag of the specified type beginning at or immediately following the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          tagType - the TagType to search for.
          Returns:
          the Tag of the specified type beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findPreviousCharacterReference

          public CharacterReference findPreviousCharacterReference(int pos)
          Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.

          Character references positioned within an HTML comment are NOT ignored.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          Returns:
          the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findPreviousEndTag

          public EndTag findPreviousEndTag(int pos)
          Returns the EndTag beginning at or immediately preceding the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          Returns:
          the EndTag beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findPreviousEndTag

          public EndTag findPreviousEndTag(int pos,
                                           String name)
          Returns the normal EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          name - the name of the end tag to search for, must not be null.
          Returns:
          the normal EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findPreviousStartTag

          public StartTag findPreviousStartTag(int pos)
          Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          Returns:
          the StartTag at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findPreviousStartTag

          public StartTag findPreviousStartTag(int pos,
                                               String name)
          Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Specifying a null argument to the name parameter is equivalent to findPreviousStartTag(pos).

          This method also returns unregistered tags if the specified name is not a valid XML tag name.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          name - the name of the start tag to search for.
          Returns:
          the StartTag with the specified name at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findPreviousTag

          public Tag findPreviousTag(int pos)
          Returns the Tag beginning at or immediately preceding (or enclosing) the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          Returns:
          the Tag beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.

          findPreviousTag

          public Tag findPreviousTag(int pos,
                                     TagType tagType)
          Returns the Tag of the specified type beginning at or immediately preceding (or enclosing) the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          Parameters:
          pos - the position in the source document from which to start the search, may be out of bounds.
          tagType - the TagType to search for.
          Returns:
          the Tag with the specified type beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.

          fullSequentialParse

          public Tag[] fullSequentialParse()
          Parses all of the tags in this source document sequentially from beginning to end.

          Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed. It is typically called before any of the tag search methods are called on this Source object, directly after setting the location of the log writer.

          By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a valid position.

          Generally speaking, a tag is in a valid position if it does not appear inside any other tag. Server tags can appear anywhere in a document, including inside other tags, so this relates only to non-server tags. Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed; otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.

          When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical. This compromise involves only checking whether the position is enclosed by other tags with certain tag types. The added complexity of this technique makes parsing each tag slower compared to when a full sequential parse is performed, but when only a few tags need parsing this is an extremely beneficial trade-off.

          The documentation of the TagType.isValidPosition(Source, int pos) method, which is called internally by the parser to perform the valid position check, includes a more detailed explanation of the differences between the two modes of operation.

          If the findAllTags(), findAllStartTags() or findAllElements() method is called on the Source object without having called this method first, a log message is generated recommending its use.

          This method returns the same list of tags as the Source.findAllTags() method, but as an array instead of a list.

          If this method is called after any of the tag search methods are called, the cache is cleared of any previously found tags before being restocked via the full sequential parse. This is significant if the Segment.ignoreWhenParsing() method has been called since the tags were first found, as any tags inside the ignored segments will no longer be returned by any of the tag search methods.

          See also the Tag class documentation for more general details about how tags are parsed.

          Returns:
          an array of all tags in this source document.
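
          The valid-position idea can be illustrated with a simplified sketch in plain Java. This is not the library's parser: the class SequentialScanSketch and its method are hypothetical, and the sketch assumes only two kinds of tag (comments delimited by "<!--" and "-->", and everything else delimited by '<' and the next '>'). Because the scan runs from beginning to end, a '<' that falls inside an already parsed tag is never considered.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration only; this is not the library's parser. */
final class SequentialScanSketch {

    /**
     * Scans the text from beginning to end and returns the begin position of
     * every tag that does not start inside a previously found tag.
     * Comments run from "<!--" to "-->"; any other tag runs from '<' to the next '>'.
     */
    static List<Integer> findTagPositions(String text) {
        List<Integer> positions = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int begin = text.indexOf('<', pos);
            if (begin == -1) break;
            int end;
            if (text.startsWith("<!--", begin)) {
                int close = text.indexOf("-->", begin + 4);
                end = (close == -1) ? -1 : close + 3;
            } else {
                int close = text.indexOf('>', begin + 1);
                end = (close == -1) ? -1 : close + 1;
            }
            if (end == -1) break; // unterminated tag: stop scanning
            positions.add(begin);
            pos = end; // everything between begin and end is inside this tag
        }
        return positions;
    }

    public static void main(String[] args) {
        // The "<b>" at position 8 lies inside the comment, so it is not reported.
        System.out.println(findTagPositions("<p><!-- <b> --><i>x</i></p>")); // [0, 3, 15, 19, 23]
    }
}
```

          In parse on demand mode no such running position is available, which is why a separate and more expensive enclosure check is needed for each potential tag.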

          getCacheDebugInfo

          public String getCacheDebugInfo()
          Returns a string representation of the tag cache, useful for debugging purposes.
          Returns:
          a string representation of the tag cache, useful for debugging purposes.

          getChildElements

          public List getChildElements()
          Returns a list of the top-level elements in the document element hierarchy.

          The fullSequentialParse() method should be called after construction of the Source object if this method is to be used.

          The objects in the list are all of type Element.

          The term top-level element refers to an element that is not nested within any other element in the document.

          The term document element hierarchy refers to the hierarchy of elements that make up this source document. While the document itself is theoretically at the top of the hierarchy, this library only considers Element objects to be part of the hierarchy, so the top-level elements are the immediate children of the source document.

          The Element.getChildElements() method can be used to get the descendants of the top-level elements.

          The document element hierarchy differs from that of the Document Object Model in that it is only a representation of the elements that are physically present in the source text. Unlike the DOM, it does not include any "implied" HTML elements such as TBODY if they are not present in the source text.

          Elements formed from server tags are not included in the hierarchy at all.

          Structural errors in this source document such as overlapping elements are reported in the log. In the case that two elements are found to overlap, the position of the start tag determines the location of the element in the hierarchy.

          A visual representation of the document element hierarchy can be obtained by calling indent("  ",true,true,true).

          Overrides:
          getChildElements in class Segment
          Returns:
          a list of the top-level elements in the document element hierarchy, guaranteed not null.

          getColumn

          public int getColumn(int pos)
          Returns the column number of the specified character position in the source document.
          Parameters:
          pos - the position in the source document.
          Returns:
          the column number of the specified character position in the source document.
          See Also:
          getRow(int pos), getRowColumnVector(int pos)

          getDocumentSpecifiedEncoding

          public String getDocumentSpecifiedEncoding()
          Returns the document encoding specified within the text of the document.

          The document encoding can be specified within the document text in two ways. They are referred to generically in this library as an encoding specification, and are listed below in order of precedence:

          1. An XML text declaration at the start of the document, which is essentially an XML declaration with an encoding attribute. This is only used in XML documents, and must be present if an XML document has an encoding other than UTF-8 or UTF-16.
            <?xml version="1.0" encoding="ISO-8859-1" ?>
          2. A META declaration, which is in the form of a META tag with attribute http-equiv="Content-Type". The encoding is specified in the charset parameter of a Content-Type HTTP header value, which is placed in the value of the meta tag's content attribute. This META declaration should appear as early as possible in the HEAD element.
            <META http-equiv=Content-Type content="text/html; charset=iso-8859-1">

          Both of these tags must only use Unicode characters in the range U+0000 to U+007F, and in the case of the META declaration must use ASCII encoding. This, along with the fact that they must occur at or near the beginning of the document, assists in their detection and decoding without the need to know the exact encoding of the full text.

          Returns:
          the document encoding specified within the text of the document.
          See Also:
          getEncoding()
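
          The order of precedence described above can be sketched with two regular expressions. This is only an illustration under stated assumptions, not the library's detection code: the class EncodingSpecSketch and its patterns are hypothetical, the XML declaration is required at the very start of the text, and the META pattern accepts any META tag containing a charset parameter without checking the http-equiv attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of finding an encoding specification in document text. */
final class EncodingSpecSketch {

    // 1. XML declaration at the very start, e.g. <?xml version="1.0" encoding="ISO-8859-1" ?>
    private static final Pattern XML_DECL = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    // 2. charset parameter inside a META tag (simplification: http-equiv is not checked)
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([^\"'\\s;>]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the encoding named in the text, or null if no specification is found. */
    static String documentSpecifiedEncoding(String text) {
        Matcher xml = XML_DECL.matcher(text);
        if (xml.find()) return xml.group(1);   // the XML declaration takes precedence
        Matcher meta = META_CHARSET.matcher(text);
        if (meta.find()) return meta.group(1);
        return null;
    }

    public static void main(String[] args) {
        System.out.println(documentSpecifiedEncoding(
            "<META http-equiv=Content-Type content=\"text/html; charset=iso-8859-1\">")); // iso-8859-1
    }
}
```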

          getElementById

          public Element getElementById(String id)
          Returns the Element with the specified id attribute value.

          This simulates the script method getElementById defined in DOM HTML level 1.

          This is equivalent to findNextStartTag(0,"id",id,true).getElement(), assuming that the element exists.

          A well formed HTML document should have no more than one element with any given id attribute value.

          Parameters:
          id - the id attribute value (case sensitive) to search for, must not be null.
          Returns:
          the Element with the specified id attribute value, or null if no such element exists.

          getEncoding

          public String getEncoding()
          Returns the original encoding of the source document.

          The encoding of a document defines how the original byte stream was encoded into characters. The HTTP specification section 3.4 defines the term "character set" to refer to the encoding, and the term "charset" is similarly used in Java (see the class java.nio.charset.Charset). This is an unfortunate convention that often causes confusion, as a character set is not the same thing as a character encoding. For example, the Unicode character set has several encodings, such as UTF-8, UTF-16, and UTF-32.

          This method makes the best possible effort to return the name of the encoding used to decode the original source text byte stream into character data. This decoding takes place in the constructor when a parameter based on a byte stream such as an InputStream or URL is used to specify the source text. The documentation of the Source(InputStream) and Source(URL) constructors describes how the return value of this method is determined in these cases. It is also possible in some circumstances for the encoding to be determined in the Source(Reader) constructor.

          If a constructor was used that specifies the source text directly in character form (not requiring the decoding of a byte sequence) then the document itself is searched for an encoding specification. In this case, this method returns the same value as the getDocumentSpecifiedEncoding() method.

          The getEncodingSpecificationInfo() method returns a simple description of how the value of this method was determined.

          Returns:
          the original encoding of the source document.
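
          The distinction between a character set and a character encoding can be seen by encoding one Unicode character in several ways. The sketch below uses only standard JDK classes; the class name EncodingVersusCharacterSet and its helper method are illustrative and not part of this library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Shows that one character has different byte representations in different encodings. */
final class EncodingVersusCharacterSet {

    /** Encodes the text with the given charset and returns the bytes as lowercase hex. */
    static String hex(String text, Charset encoding) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(encoding)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // the single Unicode character U+00E9
        System.out.println(hex(eAcute, StandardCharsets.UTF_8));      // c3a9
        System.out.println(hex(eAcute, StandardCharsets.UTF_16BE));   // 00e9
        System.out.println(hex(eAcute, StandardCharsets.ISO_8859_1)); // e9
    }
}
```

          Decoding a byte stream with the wrong encoding therefore yields different characters, which is why the constructors try to determine the encoding before any parsing takes place.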

          getEncodingSpecificationInfo

          public String getEncodingSpecificationInfo()
          Returns a simple description of how the encoding of the source document was determined.

          The description is intended for informational purposes only. It is not guaranteed to have any particular format and cannot be reliably parsed.

          Returns:
          a simple description of how the encoding of the source document was determined.
          See Also:
          getEncoding()

          getLogWriter

          public Writer getLogWriter()
          Returns the destination Writer for log messages.

          By default, the log writer is set to null, which suppresses log messages.

          Returns:
          the destination Writer for log messages.

          getNextTagIterator

          public Iterator getNextTagIterator(int pos)

          Deprecated. Use findAllTags().iterator() instead, or multiple calls to the Tag.findNextTag() method.

          Returns an iterator of Tag objects beginning at and following the specified position in the source document.

          This method has been deprecated as of version 2.2 as it was originally only included because it was more efficient than consecutive calls to findNextTag(int pos). The most efficient replacement is to use multiple calls to Tag.findNextTag() if a full sequential parse was performed, otherwise use findAllTags().iterator() and skip over the tags that begin before the position specified in the pos argument of this method.

          Parameters:
          pos - the position in the source document from which to start the iteration.
          Returns:
          an iterator of Tag objects beginning at and following the specified position in the source document.

          getParseText

          public final ParseText getParseText()
          Returns the parse text of this source document.

          This method is normally only of interest to users who wish to create custom tag types.

          The parse text is defined as the entire text of the source document in lower case, with all ignored segments replaced by space characters.

          Returns:
          the parse text of this source document.
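
          The definition of the parse text can be sketched in a few lines of plain Java. This is a hypothetical illustration rather than the ParseText implementation: the class ParseTextSketch is invented for the example, it handles a single ignored segment, and it assumes that the segment's end position is exclusive. Each character is lower-cased individually so that every position in the result lines up with the same position in the source text.

```java
/** Hypothetical sketch: lower-case copy of the text with one ignored segment blanked out. */
final class ParseTextSketch {

    /**
     * @param sourceText  the original document text
     * @param ignoreBegin begin position of the ignored segment (inclusive)
     * @param ignoreEnd   end position of the ignored segment (exclusive, an assumption)
     */
    static String parseText(String sourceText, int ignoreBegin, int ignoreEnd) {
        char[] chars = sourceText.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            boolean ignored = i >= ignoreBegin && i < ignoreEnd;
            // Per-character conversion keeps the length, and hence all positions, unchanged.
            chars[i] = ignored ? ' ' : Character.toLowerCase(chars[i]);
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // The server tag at positions 3..8 is replaced by six spaces.
        System.out.println("[" + parseText("<P><%=x%></P>", 3, 9) + "]"); // [<p>      </p>]
    }
}
```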

          getRow

          public int getRow(int pos)
          Returns the row number of the specified character position in the source document.
          Parameters:
          pos - the position in the source document.
          Returns:
          the row number of the specified character position in the source document.
          See Also:
          getColumn(int pos), getRowColumnVector(int pos)

          getRowColumnVector

          public RowColumnVector getRowColumnVector(int pos)
          Returns a RowColumnVector object representing the row and column number of the specified character position in the source document.
          Parameters:
          pos - the position in the source document.
          Returns:
          a RowColumnVector object representing the row and column number of the specified character position in the source document.
          See Also:
          getRow(int pos), getColumn(int pos)
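
          A row and column can be derived from a character position by counting the line breaks that precede it. The sketch below is a minimal plain-Java illustration, not the library's implementation: the class RowColumnSketch is hypothetical, and it assumes that rows and columns are numbered from 1 and that only '\n' ends a row.

```java
/** Hypothetical sketch of converting a character position to a row and column. */
final class RowColumnSketch {

    /** Returns {row, column}, both 1-based (an assumption), counting only '\n' as a line break. */
    static int[] rowColumn(CharSequence text, int pos) {
        if (pos < 0 || pos > text.length()) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        int row = 1;
        int lineStart = 0; // position of the first character of the current row
        for (int i = 0; i < pos; i++) {
            if (text.charAt(i) == '\n') {
                row++;
                lineStart = i + 1;
            }
        }
        return new int[] { row, pos - lineStart + 1 };
    }

    public static void main(String[] args) {
        int[] rc = rowColumn("ab\ncd", 4); // position of 'd'
        System.out.println("(" + rc[0] + "," + rc[1] + ")"); // (2,2)
    }
}
```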

          getTagAt

          public final Tag getTagAt(int pos)
          Returns the Tag at the specified position in the source document.

          See the Tag class documentation for more details about the behaviour of this method.

          This method also returns unregistered tags.

          Parameters:
          pos - the position in the source document, may be out of bounds.
          Returns:
          the Tag at the specified position in the source document, or null if no tag exists at the specified position or it is out of bounds.

          ignoreWhenParsing

          public void ignoreWhenParsing(Collection segments)

          ignoreWhenParsing

          public void ignoreWhenParsing(int begin,
                                        int end)
          Parameters:
          begin - the beginning character position in the source text.
          end - the end character position in the source text.

          indent

          public CharStreamSource indent(String indentText,
                                         boolean tidyTags,
                                         boolean collapseWhiteSpace,
                                         boolean indentAllElements)
          Reproduces the source text with indenting that represents the document element hierarchy of this source document. Any indenting present in the original source text is removed.

          The output text is functionally equivalent to the original source and should be rendered identically unless specified below.

          The following points describe the process in general terms. Any aspect of the algorithm not specifically mentioned here is subject to change without notice in future versions.

          • Every element that is not an inline-level element appears on a new line with an indent corresponding to its depth in the document element hierarchy.
          • The indent is formed by writing n repetitions of the string specified in the indentText argument, where n is the depth of the indent.
          • The content of an indented element starts on a new line and is indented at a depth one greater than that of the element, with the end tag appearing on a new line at the same depth as the start tag. If the content contains only text and inline-level elements, it may continue on the same line as the start tag. Additionally, if the output content contains no new lines, the end tag may also continue on the same line.
          • The content of preformatted elements such as PRE and TEXTAREA is not indented, nor is the white space modified in any way.
          • Only normal and document type declaration elements are indented. All others are treated as inline-level elements.
          • White space and indenting inside HTML comments, CDATA sections, or any server tag is preserved, but with the indenting of new lines starting at a depth one greater than that of the surrounding text.
          • White space and indenting inside SCRIPT elements is preserved, but with the indenting of new lines starting at a depth one greater than that of the SCRIPT element.
          • If the tidyTags option is used, every tag in the document is replaced with the output from its Tag.tidy() method. If this argument is set to false, the tag from the original text is used, including all white space, but with any new lines indented at a depth one greater than that of the element.
          • If the collapseWhiteSpace option is used, every string of one or more white space characters located outside of a tag is replaced with a single space in the output. White space located adjacent to a non-inline-level element tag (except server tags) may be removed.
          • If the indentAllElements option is used, every element appears indented on a new line, including inline-level elements. This generates output that is a good representation of the actual document element hierarchy, but is very likely to introduce white space that affects the functional equivalency of the document.
          • If the source document contains server tags, the functional equivalency of the output document may be compromised.

          Use one of the following methods to obtain the output from the returned CharStreamSource object:
          CharStreamSource.writeTo(Writer)
          CharStreamSourceUtil.toString(CharStreamSource)
          CharStreamSourceUtil.getReader(CharStreamSource)

          Parameters:
          indentText - the string to use for each indent, must not be null.
          tidyTags - specifies whether to replace the original text of each tag with the output from its Tag.tidy() method.
          collapseWhiteSpace - specifies whether to collapse the white space in the text between the tags.
          indentAllElements - specifies whether to indent all elements, including inline-level elements and those with preformatted contents.
          Returns:
          a CharStreamSource from which an indented copy of this source document can be obtained.
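
          Two of the points above, forming the indent from n repetitions of indentText and collapsing white space, can be sketched in isolation. The class IndentSketch and its methods are hypothetical helpers in plain Java and do not reproduce the library's algorithm. In particular, the regular expression \s+ is an assumption about what counts as white space.

```java
/** Hypothetical helpers illustrating two steps of the indenting process. */
final class IndentSketch {

    /** Prefixes the content with depth repetitions of indentText. */
    static String indentLine(String indentText, int depth, String content) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append(indentText);
        }
        return sb.append(content).toString();
    }

    /** Replaces every run of one or more white space characters with a single space. */
    static String collapseWhiteSpace(String text) {
        return text.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + indentLine("  ", 2, "<p>") + "]");    // [    <p>]
        System.out.println("[" + collapseWhiteSpace("a \n\t b") + "]"); // [a b]
    }
}
```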

          isLoggingEnabled

          public boolean isLoggingEnabled()
          Indicates whether logging is currently enabled.

          The current implementation of this method is equivalent to getLogWriter()!=null.

          For best performance you should check that this method returns true before constructing the string to send to the log(String message) method.

          Returns:
          true if logging is currently enabled, otherwise false.

          isXML

          public boolean isXML()
          Indicates whether the source document is likely to be XML.

          The algorithm used to determine this is designed to be relatively inexpensive and to provide an accurate result in most normal situations. An exact determination of whether the source document is XML would require a much more complex analysis of the text.

          The algorithm is as follows:

          1. If the document begins with an XML declaration, it is an XML document.
          2. If the document contains a document type declaration that contains the text "xhtml", it is an XHTML document, and hence also an XML document.
          3. If the document does NOT have an HTML element, assume it is XML. This assumption is based on the premise that the library is used to parse HTML or XML documents only.
          4. If none of the above conditions are met, assume the document is normal HTML, and therefore not an XML document.
          Returns:
          true if the source document is likely to be XML, otherwise false.
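
          The four steps above can be approximated with plain string checks. The sketch below is a rough illustration, not the library's code: the class IsXmlSketch is hypothetical, and it assumes that a case-insensitive search for "<?xml", "<!doctype" and "<html" is an adequate stand-in for locating the XML declaration, the document type declaration and the HTML element.

```java
import java.util.Locale;

/** Hypothetical approximation of the documented heuristic using substring checks. */
final class IsXmlSketch {

    static boolean isLikelyXml(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        // Step 1: the document begins with an XML declaration.
        if (lower.startsWith("<?xml")) return true;
        // Step 2: a document type declaration containing the text "xhtml".
        int doctype = lower.indexOf("<!doctype");
        if (doctype != -1) {
            int end = lower.indexOf('>', doctype);
            String declaration = (end == -1) ? lower.substring(doctype) : lower.substring(doctype, end);
            if (declaration.contains("xhtml")) return true;
        }
        // Step 3: no HTML element, so assume XML.
        if (!lower.contains("<html")) return true;
        // Step 4: otherwise assume normal HTML.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isLikelyXml("<root><item/></root>"));       // true
        System.out.println(isLikelyXml("<html><body></body></html>")); // false
    }
}
```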

          log

          public void log(String message)
          Writes the specified message to the log.

          The log destination is set via the setLogWriter(Writer) method. By default, log messages are not sent anywhere.

          A newline character is added to the message and the Writer is flushed after every call to this method.

          If an IOException is thrown while writing to the log, this method throws a RuntimeException with the original IOException as its cause.

          Parameters:
          message - the message to log
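
          The documented logging behaviour can be mimicked with a small stand-alone class. LogSketch below is a hypothetical stand-in for the logging methods of this class, written only against java.io, and it assumes that the newline character added to each message is '\n'. The demo method also shows the recommended pattern of checking isLoggingEnabled() before building the message string.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical stand-in that mirrors the documented logging behaviour. */
final class LogSketch {
    private Writer logWriter; // null by default, so messages are discarded

    void setLogWriter(Writer writer) { logWriter = writer; }

    boolean isLoggingEnabled() { return logWriter != null; }

    void log(String message) {
        if (logWriter == null) return;
        try {
            logWriter.write(message);
            logWriter.write('\n'); // assumption: a single '\n' is the newline added
            logWriter.flush();     // flushed after every call
        } catch (IOException ex) {
            throw new RuntimeException(ex); // the original IOException becomes the cause
        }
    }

    /** Logs once without a writer and once with one, then returns what was written. */
    static String demo() {
        LogSketch source = new LogSketch();
        source.log("dropped"); // no writer set yet, so nothing is written
        StringWriter out = new StringWriter();
        source.setLogWriter(out);
        if (source.isLoggingEnabled()) {
            // The string concatenation only happens when logging is enabled.
            source.log("tags found: " + 3);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo()); // tags found: 3
    }
}
```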

          parseAttributes

          public Attributes parseAttributes(int pos,
                                            int maxEnd)
          Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.

          The returned Attributes segment always begins at pos, and ends at the end of the last attribute before either maxEnd or the first occurrence of "/>" or ">" outside of a quoted attribute value, whichever comes first.

          Only returns null if the segment contains a major syntactical error or more than the default maximum number of minor syntactical errors.

          This is equivalent to parseAttributes(pos,maxEnd,Attributes.getDefaultMaxErrorCount()).

          Parameters:
          pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
          maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
          Returns:
          the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
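
          The boundary rule described above (maxEnd, or the first "/>" or ">" outside a quoted attribute value, whichever comes first) can be sketched as a single scan. The class AttributeBoundarySketch is hypothetical and only locates the boundary. It does not parse attributes or count syntactical errors, and it assumes that pos is within the bounds of the text.

```java
/** Hypothetical sketch that locates where an attribute list must stop. */
final class AttributeBoundarySketch {

    /**
     * Returns the position of the first '>' or "/>" outside a quoted value at or
     * after pos, capped at maxEnd (-1 means no maximum) and at the end of the text.
     */
    static int findBoundary(String text, int pos, int maxEnd) {
        int limit = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char quote = 0; // 0 while outside a quoted attribute value
        for (int i = pos; i < limit; i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0; // closing quote found
            } else if (c == '"' || c == '\'') {
                quote = c;                 // opening quote found
            } else if (c == '>' || (c == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>')) {
                return i;
            }
        }
        return limit;
    }

    public static void main(String[] args) {
        // The '>' at position 4 is inside a quoted value, so the boundary is the '>' at 11.
        System.out.println(findBoundary("a=\"x>y\" b=2>rest", 0, -1)); // 11
    }
}
```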

          parseAttributes

          public Attributes parseAttributes(int pos,
                                            int maxEnd,
                                            int maxErrorCount)
          Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.

          Only returns null if the segment contains a major syntactical error or more than the specified number of minor syntactical errors.

          The maxErrorCount argument overrides the default maximum error count.

          See parseAttributes(int pos, int maxEnd) for more information.

          Parameters:
          pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
          maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
          maxErrorCount - the maximum number of minor errors allowed while parsing.
          Returns:
          the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
          See Also:
          StartTag.getAttributes(), parseAttributes(int pos, int maxEnd)

          setLogWriter

          public void setLogWriter(Writer writer)
          Sets the destination Writer for log messages.

          When required, this method should normally be called immediately after the construction of the Source object.

          Parameters:
          writer - the destination java.io.Writer for log messages.

          toString

          public String toString()
          Returns the source text as a String.
          Overrides:
          toString in class Segment
          Returns:
          the source text as a String.