java.lang.Object
au.id.jericho.lib.html.Segment
au.id.jericho.lib.html.Source
public class Source
extends Segment
Represents a source HTML document. A Source object is constructed from the source data, which can be a String, Reader, InputStream or URL.
Each constructor uses all the evidence available to determine the original character encoding of the data.
Once the Source object has been created, you can immediately start searching for tags or elements within the document using the tag search methods.
It is strongly advised, however, to first think about how many of the document's tags you will need to parse.
If you will be searching for all or most of the tags, performance can be greatly improved by first calling the fullSequentialParse() method.
If you only need to parse a few tags, performance will probably be better if you use the default parse on demand mode.
It can also be useful to set the location of the log writer before calling any tag search methods
so that important log messages can be traced while the document is being parsed.
Note that many of the useful functions which can be performed on the source document are defined in its superclass, Segment.
The source object is itself a segment which spans the entire document.
Most of the methods defined in this class are useful for determining the elements and tags surrounding or neighbouring a particular character position in the document.
For information on how to create a modified version of this source document, see the OutputDocument class.
Segment
Methods inherited from class au.id.jericho.lib.html.Segment
charAt, compareTo, encloses, encloses, equals, extractText, extractText, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findAllTags, findAllTags, findFormControls, findFormFields, findWords, getBegin, getChildElements, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString
public Source(CharSequence text)
Constructs a new Source object from the specified text.
- Parameters:
text - the source text.
- See Also:
setLogWriter(Writer)
public Source(InputStream inputStream) throws IOException
Constructs a new Source object by loading the content from the specified InputStream.
The algorithm for detecting the character encoding of the source document from the raw bytes of the specified input stream is the same as that for the Source(URL) constructor, with the following exceptions:
- Step 1 is not possible as there is no Content-Type header to check.
- Step 6 is not performed as it is not possible to know whether the input stream was acquired from an HTTP connection.
- Parameters:
inputStream - the java.io.InputStream from which to load the source text.
- See Also:
getEncoding(), setLogWriter(Writer)
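The order in which the candidate encodings are considered can be sketched in plain Java. This is only an illustration of the decision order described for the Source(URL) constructor, not the library's code: the class name, the method names and the isEightBit rule are assumptions made for the example. For an InputStream, the header charset is passed as null (no step 1) and the http flag as false (no step 6).

```java
// Hypothetical sketch of the documented decision order; not part of the library.
final class EncodingResolutionSketch {

    // Assumption: any preliminary encoding that is not UTF-16 or UTF-32 counts as 8-bit.
    static boolean isEightBit(String preliminary) {
        return !preliminary.startsWith("UTF-16") && !preliminary.startsWith("UTF-32");
    }

    // headerCharset     - charset from a Content-Type header, or null (always null for an InputStream)
    // preliminary       - encoding guessed from the first 4 bytes, or null for an empty stream
    // documentSpecified - encoding specification found in the preview segment, or null
    // http              - true only if the data is known to come from an HTTP connection
    // xml               - true if the preview segment was determined to be XML
    static String resolve(String headerCharset, String preliminary, String documentSpecified,
                          boolean http, boolean xml) {
        if (headerCharset != null) return headerCharset;         // step 1
        if (preliminary == null) return null;                    // step 3: empty stream
        if (documentSpecified != null) return documentSpecified; // step 5
        if (http && !xml && isEightBit(preliminary)) {
            return "ISO-8859-1";                                 // step 6
        }
        return preliminary;                                      // step 7
    }

    public static void main(String[] args) {
        // InputStream case: no header, not known to be HTTP.
        System.out.println(resolve(null, "UTF-8", null, false, false));
        // URL case over HTTP with no encoding specification in the document.
        System.out.println(resolve(null, "UTF-8", null, true, false));
    }
}
```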
public Source(Reader reader) throws IOException
Constructs a new Source object by loading the content from the specified Reader.
If the specified reader is an instance of InputStreamReader, the getEncoding() method of the created source object returns the encoding from InputStreamReader.getEncoding().
- Parameters:
reader - the java.io.Reader from which to load the source text.
- See Also:
setLogWriter(Writer)
public Source(URL url) throws IOException
Constructs a new Source object by loading the content from the specified URL.
The algorithm for detecting the character encoding of the source document is as follows:
1. If the URLConnection.getContentType() specifies an encoding (where a charset parameter is included in the value of the stream's Content-Type header), then this is used to decode the input stream and is returned verbatim by the getEncoding() method of the created source object. Otherwise:
2. Get the content input stream via the URLConnection.getInputStream() method.
3. If the input stream is empty, the created source document has zero length and its getEncoding() method returns null. Otherwise:
4. Determine a preliminary encoding by examining the first 4 bytes of the input stream:
   - If the first two bytes match the byte order mark (U+FEFF) in either big or little endian order:
     - If the third byte is 00, assume a 32-bit encoding (UTF-32).
     - Otherwise, assume a 16-bit encoding (UTF-16).
   - If the first byte is 00:
     - If the second or fourth byte is 00, assume a 32-bit encoding (UTF-32).
     - Otherwise, assume a big endian 16-bit encoding without byte order mark (UTF-16BE).
   - If the second byte is 00:
     - If the third byte is 00, assume a 32-bit encoding (UTF-32).
     - Otherwise, assume a little endian 16-bit encoding without byte order mark (UTF-16LE).
   - If the first four bytes match the EBCDIC encoding of "<?xm", the preliminary encoding is Cp037.
   - Otherwise, assume an 8-bit encoding (UTF-8).
5. Preview the first 2048 characters of the source document (hereafter referred to as the preview segment) using the preliminary encoding. If the preview segment contains an encoding specification (which is always at or near the top of the document), the specified encoding is used to decode the input stream and is returned verbatim by the getEncoding() method of the created source object. Otherwise:
6. If the preview segment does not contain an encoding specification, and the URLConnection is an instance of HttpURLConnection, then the HTTP protocol section 3.7.1 specifies that an encoding of ISO-8859-1 can be assumed. An XML document should not assume this as it would require an XML declaration to specify this encoding, which would have been detected in one of the previous steps. So if the preview segment is not determined to be XML, and the preliminary encoding is 8-bit, then the encoding ISO-8859-1 is used to decode the input stream and is returned by the getEncoding() method of the created source object.
7. Otherwise, the preliminary encoding is used to decode the input stream and is returned by the getEncoding() method of the created source object.
- Parameters:
url - the URL from which to load the source text.
- See Also:
getEncoding(), setLogWriter(Writer)
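The preliminary encoding check on the first 4 bytes can be illustrated with a small standalone sketch. It applies the documented rules literally and in the documented order; it is not the library's implementation. The class and method names are hypothetical, and treating missing bytes (input shorter than 4 bytes) as non-zero, non-matching values is an assumption of the sketch.

```java
// Hypothetical sketch of the documented 4-byte check; not part of the library.
final class PreliminaryEncodingSketch {

    // Returns byte i as 0..255, or -1 if the input is shorter (assumption of this sketch).
    private static int at(byte[] head, int i) {
        return i < head.length ? (head[i] & 0xFF) : -1;
    }

    // Returns the preliminary encoding name, or null for an empty stream.
    static String detect(byte[] head) {
        if (head.length == 0) return null;
        int b0 = at(head, 0), b1 = at(head, 1), b2 = at(head, 2), b3 = at(head, 3);

        // Byte order mark U+FEFF in big or little endian order.
        boolean bom = (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
        if (bom) return b2 == 0x00 ? "UTF-32" : "UTF-16";

        if (b0 == 0x00) return (b1 == 0x00 || b3 == 0x00) ? "UTF-32" : "UTF-16BE";
        if (b1 == 0x00) return b2 == 0x00 ? "UTF-32" : "UTF-16LE";

        // "<?xm" encoded in EBCDIC code page 037 is 4C 6F A7 94.
        if (b0 == 0x4C && b1 == 0x6F && b2 == 0xA7 && b3 == 0x94) return "Cp037";

        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {'<', 'h', 't', 'm'}));
        System.out.println(detect(new byte[] {0x00, '<', 0x00, '?'}));
    }
}
```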
public void clearCache()
Clears the tag cache of all tags.
This method may be useful after calling the Segment.ignoreWhenParsing() method so that any tags previously found within the ignored segments will no longer be returned by the tag search methods.
public List findAllElements()
Returns a list of all elements in this source document.
The fullSequentialParse() method should be called after construction of the Source object if this method is to be used.
The elements returned correspond exactly with the start tags returned by the findAllStartTags() method.
- Overrides:
- findAllElements in class Segment
- Returns:
- a list of all elements in this source document.
public List findAllStartTags()
Returns a list of all start tags in this source document.
The fullSequentialParse() method should be called after construction of the Source object if this method is to be used.
See the Tag class documentation for more details about the behaviour of this method.
- Overrides:
- findAllStartTags in class Segment
- Returns:
- a list of all start tags in this source document.
public List findAllTags()
Returns a list of all tags in this source document.
The fullSequentialParse() method should be called after construction of the Source object if this method is to be used.
See the Tag class documentation for more details about the behaviour of this method.
- Overrides:
- findAllTags in class Segment
- Returns:
- a list of all tags in this source document.
public Segment findEnclosingComment(int pos)
Deprecated. Use findEnclosingTag(pos, StartTagType.COMMENT) instead.
Returns the Segment object representing the HTML comment that encloses the specified position in the source document.
This method has been deprecated as of version 2.0 in favour of the more generic findEnclosingTag(int pos, TagType) method.
- Parameters:
pos - the position in the source document.
public Element findEnclosingElement(int pos)
Returns the most nested Element that encloses the specified position in the source document.
The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document, may be out of bounds.
public Element findEnclosingElement(int pos, String name)
Returns the most nested Element with the specified name that encloses the specified position in the source document.
The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment.
See the Tag class documentation for more details about the behaviour of this method.
This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
pos - the position in the source document, may be out of bounds.
name - the name of the element to search for.
public StartTag findEnclosingStartTag(int pos)
Deprecated. Use findEnclosingTag(int pos) instead (see caveat).
Returns the StartTag that encloses the specified position in the source document.
This method has been deprecated as of version 2.0 in favour of the more generic findEnclosingTag(int pos) method.
Caveat: the tag returned by findEnclosingTag(int pos) may be an instance of EndTag. In most cases this should be interpreted in the same way as if this method returned null, since an end tag normally does not exist inside a start tag. There is, however, one situation where this may occur legitimately: a server-side end tag appearing within a normal start tag. It is up to the developer to decide whether this situation requires special handling when updating code that uses this deprecated method.
- Parameters:
pos - the position in the source document.
public Tag findEnclosingTag(int pos)
Returns the Tag that encloses the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document, may be out of bounds.
public Tag findEnclosingTag(int pos, TagType tagType)
Returns the Tag of the specified type that encloses the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document, may be out of bounds.
tagType - the TagType to search for.
public int findNameEnd(int pos)
Returns the end position of the XML Name that starts at the specified position.
This implementation first checks that the character at the specified position is a valid XML Name start character as defined by the Tag.isXMLNameStartChar(char) method. If this is not the case, the value -1 is returned.
Once the first character has been checked, subsequent characters are checked using the Tag.isXMLNameChar(char) method until one is found that is not a valid XML Name character or the end of the document is reached. This position is then returned.
- Parameters:
pos - the position in the source document of the first character of the XML Name.
- Returns:
- the end position of the XML Name that starts at the specified position.
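The scan described above can be sketched without the library as follows. The two character tests are simplified, ASCII-only stand-ins for Tag.isXMLNameStartChar(char) and Tag.isXMLNameChar(char), and returning -1 for an out-of-bounds position is an assumption of the sketch rather than documented behaviour.

```java
// Hypothetical sketch of the documented name scan; not part of the library.
final class NameEndSketch {

    // Simplified stand-in for Tag.isXMLNameStartChar(char): ASCII letters, '_' and ':' only.
    static boolean isNameStartChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_' || c == ':';
    }

    // Simplified stand-in for Tag.isXMLNameChar(char): adds ASCII digits, '-' and '.'.
    static boolean isNameChar(char c) {
        return isNameStartChar(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    // Returns the position just past the name starting at pos, or -1 if no name starts there.
    static int findNameEnd(CharSequence text, int pos) {
        if (pos < 0 || pos >= text.length() || !isNameStartChar(text.charAt(pos))) {
            return -1; // out-of-bounds handling is an assumption of this sketch
        }
        int end = pos + 1;
        while (end < text.length() && isNameChar(text.charAt(end))) {
            end++;
        }
        return end;
    }

    public static void main(String[] args) {
        String text = "<div class>";
        int end = findNameEnd(text, 1);
        System.out.println(text.substring(1, end)); // the name that starts at position 1
    }
}
```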
public CharacterReference findNextCharacterReference(int pos)
Returns the CharacterReference beginning at or immediately following the specified position in the source document.
Character references positioned within an HTML comment are NOT ignored.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the CharacterReference beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public StartTag findNextComment(int pos)
Deprecated. Use findNextTag(pos, StartTagType.COMMENT) instead.
Returns the StartTag object representing the HTML comment beginning at or immediately following the specified position in the source document.
This method has been deprecated as of version 2.0 in favour of the more generic findNextTag(int pos, TagType) method.
- Parameters:
pos - the position in the source document from which to start the search.
public Element findNextElement(int pos)
Returns the Element beginning at or immediately following the specified position in the source document.
This is equivalent to findNextStartTag(pos).getElement(), assuming the result is not null.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the Element beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public Element findNextElement(int pos, String name)
Returns the Element with the specified name beginning at or immediately following the specified position in the source document.
This is equivalent to findNextStartTag(pos,name).getElement(), assuming the result is not null.
Specifying a null argument to the name parameter is equivalent to findNextElement(pos).
Specifying an argument to the name parameter that ends in a colon (:) searches for all elements in the specified XML namespace.
This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the element to search for.
public EndTag findNextEndTag(int pos)
Returns the EndTag beginning at or immediately following the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the EndTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public EndTag findNextEndTag(int pos, String name)
Returns the normal EndTag with the specified name beginning at or immediately following the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the end tag to search for, must not be null.
public EndTag findNextEndTag(int pos, String name, EndTagType endTagType)
Returns the EndTag with the specified name and type beginning at or immediately following the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
public StartTag findNextStartTag(int pos)
Returns the StartTag beginning at or immediately following the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the StartTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public StartTag findNextStartTag(int pos, String name)
Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the name parameter is equivalent to findNextStartTag(pos).
Specifying an argument to the name parameter that ends in a colon (:) searches for all start tags in the specified XML namespace.
This method also returns unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the start tag to search for.
public StartTag findNextStartTag(int pos, String attributeName, String value, boolean valueCaseSensitive)
Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
- Returns:
- the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public Tag findNextTag(int pos)
Returns the Tag beginning at or immediately following the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the Tag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public Tag findNextTag(int pos, TagType tagType)
Returns the Tag of the specified type beginning at or immediately following the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
tagType - the TagType to search for.
public CharacterReference findPreviousCharacterReference(int pos)
Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.
Character references positioned within an HTML comment are NOT ignored.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the CharacterReference beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.
public EndTag findPreviousEndTag(int pos)
Returns the EndTag beginning at or immediately preceding the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the EndTag beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.
public EndTag findPreviousEndTag(int pos, String name)
Returns the normal EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the end tag to search for, must not be null.
public StartTag findPreviousStartTag(int pos)
Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the StartTag at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.
public StartTag findPreviousStartTag(int pos, String name)
Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the name parameter is equivalent to findPreviousStartTag(pos).
This method also returns unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the start tag to search for.
public Tag findPreviousTag(int pos)
Returns the Tag beginning at or immediately preceding (or enclosing) the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the Tag beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.
public Tag findPreviousTag(int pos, TagType tagType)
Returns the Tag of the specified type beginning at or immediately preceding (or enclosing) the specified position in the source document.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
tagType - the TagType to search for.
public Tag[] fullSequentialParse()
Parses all of the tags in this source document sequentially from beginning to end.
Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed. It is typically called before any of the tag search methods are called on this Source object, directly after setting the location of the log writer.
By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a valid position.
Generally speaking, a tag is in a valid position if it does not appear inside any other tag. Server tags can appear anywhere in a document, including inside other tags, so this relates only to non-server tags.
Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed, otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.
When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical. This compromise involves only checking whether the position is enclosed by other tags with certain tag types. The added complexity of this technique makes parsing each tag slower compared to when a full sequential parse is performed, but when only a few tags need parsing this is an extremely beneficial trade-off.
The documentation of the TagType.isValidPosition(Source, int pos) method, which is called internally by the parser to perform the valid position check, includes a more detailed explanation of the differences between the two modes of operation.
If the findAllTags(), findAllStartTags() or findAllElements() method is called on the Source object without having called this method first, a log message is generated recommending its use.
This method returns the same tags as the findAllTags() method, but as an array instead of a list.
If this method is called after any of the tag search methods are called, the cache is cleared of any previously found tags before being restocked via the full sequential parse. This is significant if the Segment.ignoreWhenParsing() method has been called since the tags were first found, as any tags inside the ignored segments will no longer be returned by any of the tag search methods.
See also the Tag class documentation for more general details about how tags are parsed.
- Returns:
- an array of all tags in this source document.
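Why a sequential pass makes the valid position check easy can be shown with a deliberately tiny model. The sketch below is not the library's parser. It recognises only HTML comments and plain `<...>` runs, ignores attributes and server tags, and all names in it are hypothetical. Because it resumes scanning only after the end of each tag it finds, a candidate that lies inside an earlier tag (such as `<b>` inside a comment) is never considered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified model of a full sequential pass; not part of the library.
final class SequentialScanSketch {

    // Returns {begin, end} pairs (end exclusive) of the tags found in one left-to-right pass.
    static List<int[]> scan(String text) {
        List<int[]> tags = new ArrayList<>();
        int pos = 0;
        while (true) {
            int begin = text.indexOf('<', pos);
            if (begin < 0) break;
            int end;
            if (text.startsWith("<!--", begin)) {
                int close = text.indexOf("-->", begin + 4);
                end = close < 0 ? -1 : close + 3;
            } else {
                int close = text.indexOf('>', begin + 1);
                end = close < 0 ? -1 : close + 1;
            }
            if (end < 0) {
                pos = begin + 1; // unterminated, so not treated as a tag in this sketch
                continue;
            }
            tags.add(new int[] {begin, end});
            pos = end; // resume after this tag, so nothing inside it is considered
        }
        return tags;
    }

    public static void main(String[] args) {
        String text = "<!-- <b> --><i>";
        for (int[] tag : scan(text)) {
            System.out.println(text.substring(tag[0], tag[1]));
        }
    }
}
```

A parse on demand lookup that starts in the middle of the document has no such running position to rely on, which is why the text above describes a compromise check for that mode.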
public String getCacheDebugInfo()
Returns a string representation of the tag cache, useful for debugging purposes.
- Returns:
- a string representation of the tag cache, useful for debugging purposes.
public List getChildElements()
Returns a list of the top-level elements in the document element hierarchy.
The fullSequentialParse() method should be called after construction of the Source object if this method is to be used.
The objects in the list are all of type Element.
The term top-level element refers to an element that is not nested within any other element in the document.
The term document element hierarchy refers to the hierarchy of elements that make up this source document. While the document itself is theoretically at the top of the hierarchy, this library only considers Element objects to be part of the hierarchy, so the top-level elements are the immediate children of the source document. The Element.getChildElements() method can be used to get the descendants of the top-level elements.
The document element hierarchy differs from that of the Document Object Model in that it is only a representation of the elements that are physically present in the source text. Unlike the DOM, it does not include any "implied" HTML elements such as TBODY if they are not present in the source text.
Elements formed from server tags are not included in the hierarchy at all.
Structural errors in this source document such as overlapping elements are reported in the log. In the case that two elements are found to overlap, the position of the start tag determines the location of the element in the hierarchy.
A visual representation of the document element hierarchy can be obtained by calling indent(" ",true,true,true).
- Overrides:
- getChildElements in class Segment
- Returns:
- a list of the top-level elements in the document element hierarchy, guaranteed not null.
public int getColumn(int pos)
Returns the column number of the specified character position in the source document.
- Parameters:
pos
- the position in the source document.
- Returns:
- the column number of the specified character position in the source document.
- See Also:
getRow(int pos), getRowColumnVector(int pos)
public String getDocumentSpecifiedEncoding()
Returns the document encoding specified within the text of the document.
The document encoding can be specified within the document text in two ways. They are referred to generically in this library as an encoding specification, and are listed below in order of precedence:
- An XML text declaration at the start of the document, which is essentially an XML declaration with an encoding attribute. This is only used in XML documents, and must be present if an XML document has an encoding other than UTF-8 or UTF-16.
  <?xml version="1.0" encoding="ISO-8859-1" ?>
- A META declaration, which is in the form of a META tag with attribute http-equiv="Content-Type". The encoding is specified in the charset parameter of a Content-Type HTTP header value, which is placed in the value of the meta tag's content attribute. This META declaration should appear as early as possible in the HEAD element.
  <META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
Both of these tags must only use Unicode characters in the range U+0000 to U+007F, and in the case of the META declaration must use ASCII encoding. This, along with the fact that they must occur at or near the beginning of the document, assists in their detection and decoding without the need to know the exact encoding of the full text.
- Returns:
- the document encoding specified within the text of the document.
- See Also:
getEncoding()
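The two forms of encoding specification, and their order of precedence, can be illustrated with a regex-based sketch. This is not how the library locates them: the class name, the method name and both patterns are assumptions of the example. For brevity the META pattern looks only for a charset parameter inside a META tag and does not verify the http-equiv attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based sketch; not part of the library.
final class EncodingSpecificationSketch {

    // XML text declaration at the very start of the document.
    private static final Pattern XML_DECLARATION = Pattern.compile(
            "^<\\?xml\\s[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']");

    // charset parameter anywhere inside a META tag (simplification, see the note above).
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the specified encoding, or null if the text contains no encoding specification.
    static String find(String preview) {
        Matcher xml = XML_DECLARATION.matcher(preview);
        if (xml.find()) return xml.group(1); // the XML declaration takes precedence
        Matcher meta = META_CHARSET.matcher(preview);
        if (meta.find()) return meta.group(1);
        return null;
    }

    public static void main(String[] args) {
        System.out.println(find("<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>"));
        System.out.println(find(
                "<META http-equiv=Content-Type content=\"text/html; charset=iso-8859-1\">"));
    }
}
```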
public Element getElementById(String id)
Returns the Element with the specified id attribute value.
This simulates the script method getElementById defined in DOM HTML level 1.
This is equivalent to findNextStartTag(0,"id",id,true).getElement(), assuming that the element exists.
A well formed HTML document should have no more than one element with any given id attribute value.
- Parameters:
id - the id attribute value (case sensitive) to search for, must not be null.
- Returns:
- the Element with the specified id attribute value, or null if no such element exists.
public String getEncoding()
Returns the original encoding of the source document.
The encoding of a document defines how the original byte stream was encoded into characters. The HTTP specification section 3.4 defines the term "character set" to refer to the encoding, and the term "charset" is similarly used in Java (see the class java.nio.charset.Charset). This is an unfortunate convention that often causes confusion, as a character set is not the same thing as a character encoding. For example, the Unicode character set has several encodings, such as UTF-8, UTF-16, and UTF-32.
This method makes the best possible effort to return the name of the encoding used to decode the original source text byte stream into character data. This decoding takes place in the constructor when a parameter based on a byte stream such as an InputStream or URL is used to specify the source text. The documentation of the Source(InputStream) and Source(URL) constructors describes how the return value of this method is determined in these cases. It is also possible in some circumstances for the encoding to be determined in the Source(Reader) constructor.
If a constructor was used that specifies the source text directly in character form (not requiring the decoding of a byte sequence) then the document itself is searched for an encoding specification. In this case, this method returns the same value as the getDocumentSpecifiedEncoding() method.
The getEncodingSpecificationInfo() method returns a simple description of how the value of this method was determined.
- Returns:
- the original encoding of the source document.
- See Also:
getEncodingSpecificationInfo()
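The difference between a character set and a character encoding can be seen by encoding one Unicode character in several ways. The sketch below uses only the standard java.nio.charset API. The class and method names are chosen for the example, and it assumes the UTF-32BE charset is available in the running JDK, which is the case for standard OpenJDK builds.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: one Unicode character, three different byte representations.
final class EncodingLengthSketch {

    // Returns the number of bytes the text occupies in the given encoding.
    static int encodedLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // EURO SIGN, a single character in the Unicode character set
        System.out.println("UTF-8:    " + encodedLength(euro, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE: " + encodedLength(euro, StandardCharsets.UTF_16BE));
        System.out.println("UTF-32BE: " + encodedLength(euro, Charset.forName("UTF-32BE")));
    }
}
```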
public String getEncodingSpecificationInfo()
Returns a simple description of how the encoding of the source document was determined. The description is intended for informational purposes only. It is not guaranteed to have any particular format and can not be reliably parsed.
- Returns:
- a simple description of how the encoding of the source document was determined.
- See Also:
getEncoding()
public Writer getLogWriter()
Returns the destination Writer for log messages.
By default, the log writer is set to null, which suppresses log messages.
- Returns:
- the destination Writer for log messages.
public Iterator getNextTagIterator(int pos)
Deprecated. Use findAllTags().iterator() instead, or multiple calls to the Tag.findNextTag() method.
Returns an iterator of Tag objects beginning at and following the specified position in the source document. This method has been deprecated as of version 2.2 as it was originally only included because it was more efficient than consecutive calls to findNextTag(int pos). The most efficient replacement is to use multiple calls to Tag.findNextTag() if a full sequential parse was performed, otherwise use findAllTags().iterator() and skip over the tags that begin before the position specified in the pos argument of this method.
- Parameters:
pos
- the position in the source document from which to start the iteration.
- Returns:
- an iterator of
Tag
objects beginning at and following the specified position in the source document.
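The "skip over the tags that begin before the position" replacement described above can be sketched without the library. In this sketch the TagStub class is a hypothetical stand-in for Tag that carries only a name and a begin position, and the list passed in plays the role of the result of findAllTags(); none of these names are part of the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration class; not part of the library documented here. */
final class SkipTagsBeforePositionSketch {

    /** Hypothetical stand-in for a tag; only the begin position matters here. */
    static final class TagStub {
        final String name;
        final int begin;

        TagStub(String name, int begin) {
            this.name = name;
            this.begin = begin;
        }
    }

    /**
     * Returns the tags that begin at or after pos, preserving document order.
     * Assumes allTags is already in document order, as a full tag list would be.
     */
    static List<TagStub> tagsFrom(List<TagStub> allTags, int pos) {
        List<TagStub> result = new ArrayList<TagStub>();
        for (TagStub tag : allTags) {
            if (tag.begin >= pos) {
                result.add(tag); // keep tags beginning at and following pos
            }
        }
        return result;
    }
}
```

Iterating over the returned list then corresponds to what the deprecated iterator provided.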
public final ParseText getParseText()
Returns the parse text of this source document. This method is normally only of interest to users who wish to create custom tag types. The parse text is defined as the entire text of the source document in lower case, with all ignored segments replaced by space characters.
- Returns:
- the parse text of this source document.
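The definition of the parse text can be illustrated with a small stand-alone sketch. This is not the library's implementation: the class ParseTextSketch, its method, and the representation of ignored segments as begin/end pairs with an exclusive end are all assumptions made for illustration.

```java
/** Hypothetical illustration class; not part of the library documented here. */
final class ParseTextSketch {

    /**
     * Builds a lower-case copy of the source in which every ignored range is
     * replaced by spaces. Each range is {begin, end} with end exclusive (an
     * assumption). Characters are converted one at a time so that every
     * position in the result lines up with the same position in the source.
     */
    static String parseText(String source, int[][] ignoredRanges) {
        char[] chars = source.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(chars[i]);
        }
        for (int[] range : ignoredRanges) {
            for (int i = range[0]; i < range[1]; i++) {
                chars[i] = ' '; // blank out the ignored segment
            }
        }
        return new String(chars);
    }
}
```

Because the length is preserved, a position found in the parse text can be used directly as a position in the original source text.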
public int getRow(int pos)
Returns the row number of the specified character position in the source document.
- Parameters:
pos
- the position in the source document.
- Returns:
- the row number of the specified character position in the source document.
- See Also:
getColumn(int pos), getRowColumnVector(int pos)
public RowColumnVector getRowColumnVector(int pos)
Returns a RowColumnVector object representing the row and column number of the specified character position in the source document.
- Parameters:
pos
- the position in the source document.
- Returns:
- a
RowColumnVector
object representing the row and column number of the specified character position in the source document.
- See Also:
getRow(int pos), getColumn(int pos)
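The relationship between a character position and its row and column can be sketched as follows. This is only an illustration under stated assumptions: numbering is taken to start at 1, only the line feed character is treated as a line terminator, and the class and method names are hypothetical. The library's own rules may differ.

```java
/** Hypothetical illustration class; not part of the library documented here. */
final class RowColumnSketch {

    /**
     * Returns {row, column} for the character at pos, both counted from 1
     * (an assumption). Only '\n' is treated as a line terminator here.
     */
    static int[] rowColumn(CharSequence text, int pos) {
        int row = 1;
        int column = 1;
        for (int i = 0; i < pos; i++) {
            if (text.charAt(i) == '\n') {
                row++;      // a new line starts after the line feed
                column = 1;
            } else {
                column++;
            }
        }
        return new int[] {row, column};
    }
}
```

Scanning from the start of the text on every call is linear in the position; a real implementation would more likely cache the positions of the line starts.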
public final Tag getTagAt(int pos)
Returns the Tag at the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method. This method also returns unregistered tags.
- Parameters:
pos
- the position in the source document, may be out of bounds.
- Returns:
- the Tag at the specified position in the source document, or null if no tag exists at the specified position or it is out of bounds.
public void ignoreWhenParsing(Collection segments)
Causes all of the segments in the specified collection to be ignored when parsing. This is equivalent to calling Segment.ignoreWhenParsing() on each segment in the collection.
public void ignoreWhenParsing(int begin, int end)
Causes the specified range of the source text to be ignored when parsing. See the documentation of the Segment.ignoreWhenParsing() method for more information.
- Parameters:
begin
- the beginning character position in the source text.
end
- the end character position in the source text.
public CharStreamSource indent(String indentText, boolean tidyTags, boolean collapseWhiteSpace, boolean indentAllElements)
Reproduces the source text with indenting that represents the document element hierarchy of this source document. Any indenting present in the original source text is removed. The output text is functionally equivalent to the original source and should be rendered identically unless specified below. The following points describe the process in general terms. Any aspect of the algorithm not specifically mentioned here is subject to change without notice in future versions.
- Every element that is not an inline-level element appears on a new line with an indent corresponding to its depth in the document element hierarchy.
- The indent is formed by writing n repetitions of the string specified in the indentText argument, where n is the depth of the indent.
- The content of an indented element starts on a new line and is indented at a depth one greater than that of the element, with the end tag appearing on a new line at the same depth as the start tag. If the content contains only text and inline-level elements, it may continue on the same line as the start tag. Additionally, if the output content contains no new lines, the end tag may also continue on the same line.
- The content of preformatted elements such as PRE and TEXTAREA is not indented, nor is the white space modified in any way.
- Only normal and document type declaration elements are indented. All others are treated as inline-level elements.
- White space and indenting inside HTML comments, CDATA sections, or any server tag is preserved, but with the indenting of new lines starting at a depth one greater than that of the surrounding text.
- White space and indenting inside SCRIPT elements is preserved, but with the indenting of new lines starting at a depth one greater than that of the SCRIPT element.
- If the tidyTags option is used, every tag in the document is replaced with the output from its Tag.tidy() method. If this argument is set to false, the tag from the original text is used, including all white space, but with any new lines indented at a depth one greater than that of the element.
- If the collapseWhiteSpace option is used, every string of one or more white space characters located outside of a tag is replaced with a single space in the output. White space located adjacent to a non-inline-level element tag (except server tags) may be removed.
- If the indentAllElements option is used, every element appears indented on a new line, including inline-level elements. This generates output that is a good representation of the actual document element hierarchy, but is very likely to introduce white space that affects the functional equivalency of the document.
- If the source document contains server tags, the functional equivalency of the output document may be compromised.
Use one of the following methods to obtain the output from the returned CharStreamSource object:
- CharStreamSource.writeTo(Writer)
- CharStreamSourceUtil.toString(CharStreamSource)
- CharStreamSourceUtil.getReader(CharStreamSource)
- Parameters:
indentText
- the string to use for each indent, must not be null.
tidyTags
- specifies whether to replace the original text of each tag with the output from its Tag.tidy() method.
collapseWhiteSpace
- specifies whether to collapse the white space in the text between the tags.
indentAllElements
- specifies whether to indent all elements, including inline-level elements and those with preformatted contents.
- Returns:
- a
CharStreamSource
from which an indented copy of this source document can be obtained.
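The rule that an indent consists of n repetitions of indentText, where n is the depth, can be sketched in isolation. The class IndentSketch and its methods are hypothetical and cover only this one rule; they do not reproduce the rest of the algorithm described above.

```java
/** Hypothetical illustration class; not part of the library documented here. */
final class IndentSketch {

    /** Returns indentText repeated depth times; depth 0 gives an empty string. */
    static String indentFor(String indentText, int depth) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append(indentText);
        }
        return indent.toString();
    }

    /** Formats one output line at the given depth, terminated by a line feed. */
    static String line(String indentText, int depth, String content) {
        return indentFor(indentText, depth) + content + "\n";
    }

    public static void main(String[] args) {
        // A nested structure printed with a tab per level of depth.
        System.out.print(line("\t", 0, "<div>"));
        System.out.print(line("\t", 1, "<p>text</p>"));
        System.out.print(line("\t", 0, "</div>"));
    }
}
```

Passing a string of spaces rather than a tab as indentText changes only the width of each level, not the structure of the output.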
public boolean isLoggingEnabled()
Indicates whether logging is currently enabled. The current implementation of this method is equivalent to getLogWriter() != null. For best performance you should check that this method returns true before constructing the string to send to the log(String message) method.
- Returns:
- true if logging is currently enabled, otherwise false.
public boolean isXML()
Indicates whether the source document is likely to be XML. The algorithm used to determine this is designed to be relatively inexpensive and to provide an accurate result in most normal situations. An exact determination of whether the source document is XML would require a much more complex analysis of the text. The algorithm is as follows:
- If the document begins with an XML declaration, it is an XML document.
- If the document contains a document type declaration that contains the text "xhtml", it is an XHTML document, and hence also an XML document.
- If the document does NOT have an HTML element, assume it is XML. This assumption is based on the premise that the library is used to parse HTML or XML documents only.
- If none of the above conditions are met, assume the document is normal HTML, and therefore not an XML document.
- Returns:
- true if the source document is likely to be XML, otherwise false.
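The order of the checks listed above can be sketched with plain string and regular expression tests. This is a rough approximation rather than the library's implementation: it does not parse the document, it assumes the "xhtml" match is case-insensitive, and the class name, method name and patterns are all hypothetical.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration class; not part of the library documented here. */
final class LikelyXmlSketch {

    // Crude approximations of "document type declaration" and "HTML element".
    private static final Pattern DOCTYPE =
            Pattern.compile("<!doctype[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HTML_START_TAG =
            Pattern.compile("<html[\\s>]", Pattern.CASE_INSENSITIVE);

    static boolean isLikelyXml(String text) {
        // 1. The document begins with an XML declaration.
        if (text.startsWith("<?xml")) {
            return true;
        }
        // 2. A document type declaration containing "xhtml" (case ignored: an assumption).
        Matcher doctype = DOCTYPE.matcher(text);
        if (doctype.find() && doctype.group().toLowerCase(Locale.ROOT).contains("xhtml")) {
            return true;
        }
        // 3. No HTML element at all, so assume XML.
        if (!HTML_START_TAG.matcher(text).find()) {
            return true;
        }
        // 4. Otherwise assume normal HTML.
        return false;
    }
}
```

Each check is a single scan of the text, which is in keeping with the stated aim of a relatively inexpensive test that is accurate in most normal situations.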
public void log(String message)
Writes the specified message to the log. The log destination is set via the setLogWriter(Writer) method. By default, log messages are not sent anywhere. A newline character is added to the message and the Writer is flushed after every call to this method. If an IOException is thrown while writing to the log, this method throws a RuntimeException with the original IOException as its cause.
- Parameters:
message
- the message to log
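The logging behaviour described for log(String), together with the isLoggingEnabled() guard, can be sketched with a small stand-alone class. LogWriterSketch is a hypothetical stand-in that mirrors the documented method names; it is not the library's code, and the use of a line feed as the newline character is an assumption.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical stand-in that mirrors the documented logging behaviour. */
final class LogWriterSketch {

    private Writer logWriter; // null by default, so messages go nowhere

    void setLogWriter(Writer writer) {
        logWriter = writer;
    }

    boolean isLoggingEnabled() {
        return logWriter != null;
    }

    void log(String message) {
        if (logWriter == null) {
            return; // logging is suppressed
        }
        try {
            logWriter.write(message);
            logWriter.write('\n'); // assumed newline character
            logWriter.flush();     // flushed after every call
        } catch (IOException ex) {
            throw new RuntimeException(ex); // original exception kept as the cause
        }
    }

    /** Shows the guard: the message string is only built when logging is enabled. */
    static String demo() {
        LogWriterSketch sketch = new LogWriterSketch();
        StringWriter out = new StringWriter();
        sketch.setLogWriter(out);
        if (sketch.isLoggingEnabled()) {
            sketch.log("parsed " + 3 + " tags");
        }
        return out.toString();
    }
}
```

The guard matters most when building the message is expensive, since the string concatenation is skipped entirely while no log writer is set.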
public Attributes parseAttributes(int pos, int maxEnd)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations. The returned Attributes segment always begins at pos, and ends at the end of the last attribute before either maxEnd or the first occurrence of "/>" or ">" outside of a quoted attribute value, whichever comes first. Returns null only if the segment contains a major syntactical error or more than the default maximum number of minor syntactical errors. This is equivalent to parseAttributes(pos, maxEnd, Attributes.getDefaultMaxErrorCount()).
- Parameters:
pos
- the position in the source document at the beginning of the attribute list, may be out of bounds.
maxEnd
- the maximum end position of the attribute list, or -1 if no maximum.
- Returns:
- the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
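The boundary rule described above (stop at maxEnd or at the first "/>" or ">" outside of a quoted attribute value, whichever comes first) can be sketched as a simple scan. This sketch only locates that limit; it does not parse attributes or count syntax errors, and the class and method names are hypothetical.

```java
/** Hypothetical illustration class; not part of the library documented here. */
final class AttributeLimitSketch {

    /**
     * Returns the position at which scanning for attributes must stop: the
     * index of the first ">" or "/>" outside quotes, or maxEnd, or the end of
     * the text, whichever comes first. A maxEnd of -1 means no maximum.
     */
    static int findLimit(CharSequence text, int pos, int maxEnd) {
        int limit = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char openQuote = 0; // 0 while outside a quoted attribute value
        for (int i = pos; i < limit; i++) {
            char c = text.charAt(i);
            if (openQuote != 0) {
                if (c == openQuote) {
                    openQuote = 0; // closing quote found
                }
                continue;
            }
            if (c == '"' || c == '\'') {
                openQuote = c;
            } else if (c == '>') {
                return i;
            } else if (c == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>') {
                return i;
            }
        }
        return limit;
    }
}
```

For the text a="x>y" b=2> the scan skips the ">" inside the quoted value and stops at the final ">", which is the behaviour the description implies for quoted attribute values.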
public Attributes parseAttributes(int pos, int maxEnd, int maxErrorCount)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations. Returns null only if the segment contains a major syntactical error or more than the specified number of minor syntactical errors. The maxErrorCount argument overrides the default maximum error count. See parseAttributes(int pos, int maxEnd) for more information.
- Parameters:
pos
- the position in the source document at the beginning of the attribute list, may be out of bounds.
maxEnd
- the maximum end position of the attribute list, or -1 if no maximum.
maxErrorCount
- the maximum number of minor errors allowed while parsing.
- Returns:
- the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
- See Also:
StartTag.getAttributes(), parseAttributes(int pos, int maxEnd)
public void setLogWriter(Writer writer)
Sets the destination Writer for log messages. When required, this method should normally be called immediately after the construction of the Source object.
- Parameters:
writer
- the destination java.io.Writer for log messages.
- See Also:
getLogWriter()