java.lang.Object
au.id.jericho.lib.html.Segment
au.id.jericho.lib.html.Source
public class Source
extends Segment
A Source object is constructed from the source data, which can be a String, Reader, InputStream, URLConnection or URL.
Each constructor uses all the evidence available to determine the original character encoding of the data.
Once the Source object has been created, you can immediately start searching for tags or elements within the document using the tag search methods.
In certain circumstances you may be able to improve performance by calling the fullSequentialParse() method before calling any tag search methods. See the documentation of the fullSequentialParse() method for details.
Any issues encountered while parsing are logged to a Logger object. The setLogger(Logger) method can be used to explicitly set a Logger implementation for a particular Source instance, otherwise the static Config.LoggerProvider property determines how the logger is set by default for all Source instances. See the documentation of the Config.LoggerProvider property for information about how the default logging provider is determined.
Note that many of the useful functions which can be performed on the source document are defined in its superclass, Segment. The source object is itself a segment which spans the entire document.
Most of the methods defined in this class are useful for determining the elements and tags surrounding or neighbouring a particular character position in the document.
For information on how to create a modified version of this source document, see the OutputDocument class.
Methods inherited from class au.id.jericho.lib.html.Segment
charAt, compareTo, encloses, encloses, equals, extractText, extractText, findAllCharacterReferences, findAllElements, findAllElements, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findAllTags, findAllTags, findFormControls, findFormFields, getBegin, getChildElements, getDebugInfo, getEnd, getNodeIterator, getRenderer, getSource, getTextExtractor, hashCode, ignoreWhenParsing, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString
public Source(CharSequence text)
Constructs a new Source object from the specified text.
- Parameters:
text
- the source text.
public Source(InputStream inputStream) throws IOException
Constructs a new Source object by loading the content from the specified InputStream. The algorithm for detecting the character encoding of the source document from the raw bytes of the specified input stream is the same as that for the Source(URL) constructor, except that the first step is not possible as there is no Content-Type header to check.
- Parameters:
inputStream
- the java.io.InputStream from which to load the source text.
- See Also:
getEncoding()
public Source(Reader reader) throws IOException
Constructs a new Source object by loading the content from the specified Reader. If the specified reader is an instance of InputStreamReader, the getEncoding() method of the created source object returns the encoding from InputStreamReader.getEncoding().
- Parameters:
reader
- the java.io.Reader from which to load the source text.
public Source(URL url) throws IOException
Constructs a new Source object by loading the content from the specified URL. This is equivalent to Source(url.openConnection()).
- Parameters:
url
- the URL from which to load the source text.
- See Also:
getEncoding()
public Source(URLConnection urlConnection) throws IOException
Constructs a new Source object by loading the content from the specified URLConnection. The algorithm for detecting the character encoding of the source document is as follows:
(process termination is marked by ♦)
- If the HTTP headers received from the URL connection include a Content-Type header specifying a charset parameter, then use the encoding specified in the value of the charset parameter. ♦
- Read the first four bytes of the input stream.
- If the input stream is empty, the created source document has zero length and its getEncoding() method returns null. ♦
- If the input stream starts with a unicode Byte Order Mark (BOM), then use the encoding signified by the BOM. ♦
BOM Bytes | Encoding
EF BB BF | UTF-8
FF FE 00 00 | UTF-32 (little-endian)
00 00 FE FF | UTF-32 (big-endian)
FF FE | UTF-16 (little-endian)
FE FF | UTF-16 (big-endian)
0E FE FF | SCSU
2B 2F 76 | UTF-7
DD 73 66 73 | UTF-EBCDIC
FB EE 28 | BOCU-1
- If the stream contains fewer than four bytes, then:
  - If the stream contains either one or three bytes, then use the encoding ISO-8859-1. ♦
  - If the stream starts with a zero byte, then use the encoding UTF-16BE. ♦
  - If the second byte of the stream is zero, then use the encoding UTF-16LE. ♦
  - Otherwise use the encoding ISO-8859-1. ♦
- Determine a preliminary encoding by examining the first four bytes of the input stream. See the getPreliminaryEncodingInfo() method for details.
- Read the first 2048 bytes of the input stream and decode it using the preliminary encoding to create a "preview segment". If the detected preliminary encoding is not supported on this platform, create the preview segment using ISO-8859-1 instead (this incident is logged at warn level).
- Search the preview segment for an encoding specification, which should always appear at or near the top of the document. If an encoding specification is found:
  - If the specified encoding is supported on this platform, use it. ♦
  - If the specified encoding is not supported on this platform, use the encoding that was used to create the preview segment, which is normally the detected preliminary encoding. ♦
- If the document looks like XML, then use UTF-8. ♦
  Section 4.3.3 of the XML 1.0 specification states that an XML file that is not encoded in UTF-8 must contain either a UTF-16 BOM or an encoding declaration in its XML declaration. Since neither of these was detected, we can assume the encoding is UTF-8.
- Use the encoding that was used to create the preview segment, which is normally the detected preliminary encoding. ♦
  This is the best guess, in the absence of any explicit information about the encoding, based on the first four bytes of the stream. The HTTP protocol section 3.7.1 states that an encoding of ISO-8859-1 can be assumed if no charset parameter was included in the HTTP Content-Type header. This is consistent with the preliminary encoding detected in this scenario.
- Parameters:
urlConnection
- the URL connection from which to load the source text.
- See Also:
getEncoding()
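The BOM check and the short-stream fallback described for this constructor can be sketched with the JDK alone. This is a minimal illustration, not the library's implementation: the class BomSniffer, its method names and the returned encoding labels are hypothetical. It relies on the UTF-8 mark being the three bytes EF BB BF, and it tests four-byte marks before two-byte marks so that FF FE 00 00 is not mistaken for UTF-16.

```java
/** Illustrative BOM sniffer (hypothetical helper, not part of the library). */
final class BomSniffer {
    // Longer signatures come first so FF FE 00 00 (UTF-32LE) wins over FF FE (UTF-16LE).
    private static final int[][] MARKS = {
        {0xFF, 0xFE, 0x00, 0x00},
        {0x00, 0x00, 0xFE, 0xFF},
        {0xDD, 0x73, 0x66, 0x73},
        {0xEF, 0xBB, 0xBF},
        {0x0E, 0xFE, 0xFF},
        {0x2B, 0x2F, 0x76},
        {0xFB, 0xEE, 0x28},
        {0xFF, 0xFE},
        {0xFE, 0xFF},
    };
    // Labels chosen for this sketch; the names the library reports may differ.
    private static final String[] NAMES = {
        "UTF-32LE", "UTF-32BE", "UTF-EBCDIC", "UTF-8", "SCSU",
        "UTF-7", "BOCU-1", "UTF-16LE", "UTF-16BE",
    };

    /** Returns the encoding signified by a leading BOM, or null if there is none. */
    static String detect(byte[] head) {
        for (int i = 0; i < MARKS.length; i++) {
            int[] mark = MARKS[i];
            if (head.length < mark.length) continue;
            boolean match = true;
            for (int j = 0; j < mark.length; j++) {
                if ((head[j] & 0xFF) != mark[j]) { match = false; break; }
            }
            if (match) return NAMES[i];
        }
        return null;
    }

    /** Fallback for a stream of fewer than four bytes that has no BOM. */
    static String guessShort(byte[] head) {
        if (head.length == 0) return null;                       // empty stream
        if (head.length == 1 || head.length == 3) return "ISO-8859-1";
        if (head[0] == 0) return "UTF-16BE";                     // leading zero byte
        if (head[1] == 0) return "UTF-16LE";                     // zero second byte
        return "ISO-8859-1";
    }
}
```

A caller would pass the first four bytes read from the stream to detect and fall back to guessShort only when the stream is shorter than four bytes and detect returns null.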
public void clearCache()
Clears the tag cache of all tags. This method may be useful after calling the Segment.ignoreWhenParsing() method so that any tags previously found within the ignored segments will no longer be returned by the tag search methods.
public List findAllElements()
Returns a list of all elements in this source document. Calling this method on the Source object performs a full sequential parse automatically. The elements returned correspond exactly with the start tags returned in the findAllStartTags() method.
- Overrides:
- findAllElements in class Segment
- Returns:
- a list of all elements in this source document.
public List findAllStartTags()
Returns a list of all start tags in this source document. Calling this method on the Source object performs a full sequential parse automatically. See the Tag class documentation for more details about the behaviour of this method.
- Overrides:
- findAllStartTags in class Segment
- Returns:
- a list of all start tags in this source document.
public List findAllTags()
Returns a list of all tags in this source document. Calling this method on the Source object performs a full sequential parse automatically. See the Tag class documentation for more details about the behaviour of this method.
- Overrides:
- findAllTags in class Segment
- Returns:
- a list of all tags in this source document.
public Element findEnclosingElement(int pos)
Returns the most nested normal Element that encloses the specified position in the source document. The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document, may be out of bounds.
public Element findEnclosingElement(int pos, String name)
Returns the most nested normal Element with the specified name that encloses the specified position in the source document. The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment. See the Tag class documentation for more details about the behaviour of this method. This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
pos
- the position in the source document, may be out of bounds.
name
- the name of the element to search for.
public Tag findEnclosingTag(int pos)
Returns the Tag that encloses the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document, may be out of bounds.
public Tag findEnclosingTag(int pos, TagType tagType)
Returns the Tag of the specified type that encloses the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document, may be out of bounds.
tagType
- the TagType to search for.
public int findNameEnd(int pos)
Returns the end position of the XML Name that starts at the specified position. This implementation first checks that the character at the specified position is a valid XML Name start character as defined by the Tag.isXMLNameStartChar(char) method. If this is not the case, the value -1 is returned. Once the first character has been checked, subsequent characters are checked using the Tag.isXMLNameChar(char) method until one is found that is not a valid XML Name character or the end of the document is reached. This position is then returned.
- Parameters:
pos
- the position in the source document of the first character of the XML Name.
- Returns:
- the end position of the XML Name that starts at the specified position.
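The scan described for findNameEnd can be sketched as follows. This is only an illustration under stated assumptions: the two character predicates are simplified, hypothetical stand-ins for Tag.isXMLNameStartChar(char) and Tag.isXMLNameChar(char), whose exact rules are not reproduced here, and returning -1 for an out-of-bounds position is a choice made for this sketch.

```java
/** Illustrative XML Name scanner (hypothetical helper, not part of the library). */
final class XmlNameScanner {
    // Simplified stand-in for Tag.isXMLNameStartChar(char); an assumption of this sketch.
    static boolean isNameStartChar(char c) {
        return Character.isLetter(c) || c == '_' || c == ':';
    }

    // Simplified stand-in for Tag.isXMLNameChar(char); an assumption of this sketch.
    static boolean isNameChar(char c) {
        return isNameStartChar(c) || Character.isDigit(c) || c == '-' || c == '.';
    }

    /** Returns the index just past the name starting at pos, or -1 if no name starts there. */
    static int findNameEnd(CharSequence text, int pos) {
        if (pos < 0 || pos >= text.length()) return -1;       // sketch-specific bounds handling
        if (!isNameStartChar(text.charAt(pos))) return -1;    // first character must start a name
        int end = pos + 1;
        while (end < text.length() && isNameChar(text.charAt(end))) {
            end++;                                            // consume name characters
        }
        return end;                                           // first non-name character, or end of text
    }
}
```

For the text `<svg:rect x='1'>`, scanning from position 1 stops at the space after `svg:rect`, while scanning from position 0 yields -1 because `<` cannot start a name.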
public CharacterReference findNextCharacterReference(int pos)
Returns the CharacterReference beginning at or immediately following the specified position in the source document. Character references positioned within an HTML comment are NOT ignored.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the CharacterReference beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public Element findNextElement(int pos)
Returns the Element beginning at or immediately following the specified position in the source document. This is equivalent to findNextStartTag(pos).getElement(), assuming the result is not null.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the Element beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public Element findNextElement(int pos, String name)
Returns the normal Element with the specified name beginning at or immediately following the specified position in the source document. This is equivalent to findNextStartTag(pos,name).getElement(), assuming the result is not null. Specifying a null argument to the name parameter is equivalent to findNextElement(pos). Specifying an argument to the name parameter that ends in a colon (:) searches for all elements in the specified XML namespace. This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
name
- the name of the element to search for.
public Element findNextElement(int pos, String attributeName, String value, boolean valueCaseSensitive)
Returns the Element with the specified attribute name/value pair beginning at or immediately following the specified position in the source document. This is equivalent to findNextStartTag(pos,attributeName,value,valueCaseSensitive).getElement(), assuming the result is not null.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
attributeName
- the attribute name (case insensitive) to search for, must not be null.
value
- the value of the specified attribute to search for, must not be null.
valueCaseSensitive
- specifies whether the attribute value matching is case sensitive.
- Returns:
- the Element with the specified attribute name/value pair beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public EndTag findNextEndTag(int pos)
Returns the EndTag beginning at or immediately following the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the EndTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public EndTag findNextEndTag(int pos, String name)
Returns the normal EndTag with the specified name beginning at or immediately following the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
name
- the name of the end tag to search for, must not be null.
public EndTag findNextEndTag(int pos, String name, EndTagType endTagType)
Returns the EndTag with the specified name and type beginning at or immediately following the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
public StartTag findNextStartTag(int pos)
Returns the StartTag beginning at or immediately following the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the StartTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public StartTag findNextStartTag(int pos, String name)
Returns the normal StartTag with the specified name beginning at or immediately following the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method. Specifying a null argument to the name parameter is equivalent to findNextStartTag(pos). Specifying an argument to the name parameter that ends in a colon (:) searches for all start tags in the specified XML namespace. This method also returns unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
name
- the name of the start tag to search for, may be null.
public StartTag findNextStartTag(int pos, String attributeName, String value, boolean valueCaseSensitive)
Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
attributeName
- the attribute name (case insensitive) to search for, must not be null.
value
- the value of the specified attribute to search for, must not be null.
valueCaseSensitive
- specifies whether the attribute value matching is case sensitive.
- Returns:
- the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public StartTag findNextStartTag(int pos, String name, StartTagType startTagType)
Returns the StartTag with the specified name and type beginning at or immediately following the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method. Specifying StartTagType.NORMAL as the argument to the startTagType parameter is equivalent to findNextStartTag(pos,name).
public Tag findNextTag(int pos)
Returns the Tag beginning at or immediately following the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method. Use Tag.findNextTag() to find the tag immediately following another tag.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the Tag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
public Tag findNextTag(int pos, TagType tagType)
Returns the Tag of the specified type beginning at or immediately following the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
tagType
- the TagType to search for.
public CharacterReference findPreviousCharacterReference(int pos)
Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document. Character references positioned within an HTML comment are NOT ignored.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the CharacterReference beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.
public EndTag findPreviousEndTag(int pos)
Returns the EndTag beginning at or immediately preceding the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the EndTag beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.
public EndTag findPreviousEndTag(int pos, String name)
Returns the normal EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
name
- the name of the end tag to search for, must not be null.
public StartTag findPreviousStartTag(int pos)
Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the StartTag at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.
public StartTag findPreviousStartTag(int pos, String name)
Returns the normal StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method. Specifying a null argument to the name parameter is equivalent to findPreviousStartTag(pos). This method also returns unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
name
- the name of the start tag to search for.
public StartTag findPreviousStartTag(int pos, String name, StartTagType startTagType)
Returns the StartTag with the specified name and type at or immediately preceding (or enclosing) the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method. Specifying StartTagType.NORMAL as the argument to the startTagType parameter is equivalent to findPreviousStartTag(pos,name).
public Tag findPreviousTag(int pos)
Returns the Tag beginning at or immediately preceding (or enclosing) the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
- Returns:
- the Tag beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.
public Tag findPreviousTag(int pos, TagType tagType)
Returns the Tag of the specified type beginning at or immediately preceding (or enclosing) the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
pos
- the position in the source document from which to start the search, may be out of bounds.
tagType
- the TagType to search for.
public Tag[] fullSequentialParse()
Parses all of the tags in this source document sequentially from beginning to end. Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.
Calling the findAllTags(), findAllStartTags(), findAllElements() or getChildElements() method on the Source object performs a full sequential parse automatically. There are however still circumstances where it should be called manually, such as when it is known that most or all of the tags in the document will need to be parsed, but none of the abovementioned methods are used, or are called only after calling one or more other tag search methods.
If this method is called manually, it should be called soon after the Source object is created, before any tag search methods are called.
By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a valid position.
Generally speaking, a tag is in a valid position if it does not appear inside any other tag. Server tags can appear anywhere in a document, including inside other tags, so this relates only to non-server tags.
Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed, otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.
When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical. This compromise involves only checking whether the position is enclosed by other tags with certain tag types.
The added complexity of this technique makes parsing each tag slower compared to when a full sequential parse is performed, but when only a few tags need parsing this is an extremely beneficial trade-off.
The documentation of the TagType.isValidPosition(Source, int pos, int[] fullSequentialParseData) method, which is called internally by the parser to perform the valid position check, includes a more detailed explanation of the differences between the two modes of operation.
Calling this method a second or subsequent time has no effect.
This method returns the same list of tags as the Source.findAllTags() method, but as an array instead of a list.
If this method is called after any of the tag search methods are called, the cache is cleared of any previously found tags before being restocked via the full sequential parse. This is significant if the Segment.ignoreWhenParsing() method has been called since the tags were first found, as any tags inside the ignored segments will no longer be returned by any of the tag search methods.
See also the Tag class documentation for more general details about how tags are parsed.
- Returns:
- an array of all tags in this source document.
public String getCacheDebugInfo()
Returns a string representation of the tag cache, useful for debugging purposes.
- Returns:
- a string representation of the tag cache, useful for debugging purposes.
public List getChildElements()
Returns a list of the top-level elements in the document element hierarchy. The objects in the list are all of type Element.
The term top-level element refers to an element that is not nested within any other element in the document.
The term document element hierarchy refers to the hierarchy of elements that make up this source document. The source document itself is not considered to be part of the hierarchy, meaning there is typically more than one top-level element. Even when the source represents an entire HTML document, the document type declaration and/or an XML declaration often exist as top-level elements along with the HTML element itself.
The Element.getChildElements() method can be used to get the children of the top-level elements, with recursive use providing a means to visit every element in the document hierarchy.
The document element hierarchy differs from that of the Document Object Model in that it is only a representation of the elements that are physically present in the source text. Unlike the DOM, it does not include any "implied" HTML elements such as TBODY if they are not present in the source text.
Elements formed from server tags are not included in the hierarchy at all.
Structural errors in this source document such as overlapping elements are reported in the log. When elements are found to overlap, the position of the start tag determines the location of the element in the hierarchy.
Calling this method on the Source object performs a full sequential parse automatically.
A visual representation of the document element hierarchy can be obtained by calling:
getSourceFormatter().setIndentAllElements(true).setCollapseWhiteSpace(true).setTidyTags(true).toString()
- Overrides:
- getChildElements in class Segment
- Returns:
- a list of the top-level elements in the document element hierarchy, guaranteed not null.
public int getColumn(int pos)
Returns the column number of the specified character position in the source document.
- Parameters:
pos
- the position in the source document.
- Returns:
- the column number of the specified character position in the source document.
- See Also:
getRow(int pos), getRowColumnVector(int pos)
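How a character position maps to a row and column can be sketched with a single pass over the text. This is an illustration only: the class RowColumn is hypothetical, and both the 1-based numbering and the treatment of "\r\n" as a single line terminator are assumptions of this sketch rather than documented behaviour of getRow(int) or getColumn(int).

```java
/** Illustrative row/column calculator (hypothetical helper, not part of the library). */
final class RowColumn {
    /** Returns {row, column} for pos; both values are 1-based in this sketch (assumption). */
    static int[] of(CharSequence text, int pos) {
        if (pos < 0 || pos > text.length()) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        int row = 1;
        int column = 1;
        for (int i = 0; i < pos; i++) {
            char c = text.charAt(i);
            boolean loneCarriageReturn = c == '\r'
                    && !(i + 1 < text.length() && text.charAt(i + 1) == '\n');
            if (c == '\n' || loneCarriageReturn) {
                row++;          // a line terminator moves to the start of the next row
                column = 1;
            } else {
                column++;       // any other character advances the column
            }
        }
        return new int[] {row, column};
    }
}
```

Scanning from the start on every call is linear in pos; code that needs many lookups would more likely cache the offsets of line starts and search them.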
public String getDocumentSpecifiedEncoding()
Returns the document encoding specified within the text of the document.
The document encoding can be specified within the document text in two ways. They are referred to generically in this library as an encoding specification, and are listed below in order of precedence:
- An encoding declaration within the XML declaration of an XML document, which must be present if it has an encoding other than UTF-8 or UTF-16.
<?xml version="1.0" encoding="ISO-8859-1" ?>
- A META declaration, which is in the form of a META tag with attribute http-equiv="Content-Type". The encoding is specified in the charset parameter of a Content-Type HTTP header value, which is placed in the value of the meta tag's content attribute. This META declaration should appear as early as possible in the HEAD element.
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
Both of these tags must only use characters in the range U+0000 to U+007F, and in the case of the META declaration must use ASCII encoding. This, along with the fact that they must occur at or near the beginning of the document, assists in their detection and decoding without the need to know the exact encoding of the full text.
- Returns:
- the document encoding specified within the text of the document, or null if no encoding is specified.
- See Also:
getEncoding()
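The two kinds of encoding specification, and their order of precedence, can be illustrated with a pair of regular expressions over the start of the document. This is a rough sketch, not the library's detection logic: the class DeclaredEncoding and both patterns are hypothetical and far more permissive than a real parser. Charset.isSupported is the standard JDK check for whether a named encoding is available on the platform.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative encoding-specification finder (hypothetical helper, not part of the library). */
final class DeclaredEncoding {
    // XML declaration at the very start of the text, e.g. <?xml version="1.0" encoding="..." ?>
    private static final Pattern XML_DECLARATION = Pattern.compile(
            "^\\s*<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);
    // META tag carrying a charset parameter in its content attribute.
    private static final Pattern META_DECLARATION = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared encoding name, or null if the text declares none. */
    static String find(CharSequence text) {
        Matcher xml = XML_DECLARATION.matcher(text);
        if (xml.find()) return xml.group(1);       // XML declaration takes precedence
        Matcher meta = META_DECLARATION.matcher(text);
        if (meta.find()) return meta.group(1);
        return null;
    }

    /** Returns true if the named encoding can be used on this platform. */
    static boolean isUsable(String name) {
        try {
            return name != null && Charset.isSupported(name);
        } catch (IllegalArgumentException e) {     // includes IllegalCharsetNameException
            return false;
        }
    }
}
```

Because both declarations are restricted to ASCII characters near the top of the document, a search like this can run on text decoded with only a preliminary guess at the encoding.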
public Element getElementById(String id)
Returns the Element with the specified id attribute value. This simulates the script method getElementById defined in DOM HTML level 1. This is equivalent to findNextStartTag(0,"id",id,true).getElement(), assuming that the element exists. A well formed HTML document should have no more than one element with any given id attribute value.
- Parameters:
id
- the id attribute value (case sensitive) to search for, must not be null.
- Returns:
- the Element with the specified id attribute value, or null if no such element exists.
public String getEncoding()
Returns the character encoding scheme of the source byte stream used to create this object.
The encoding of a document defines how the original byte stream was encoded into characters. The HTTP specification section 3.4 uses the term "character set" to refer to the encoding, and the term "charset" is similarly used in Java (see the class java.nio.charset.Charset). This often causes confusion, as a modern "coded character set" such as Unicode can have several encodings, such as UTF-8, UTF-16, and UTF-32. See the Wikipedia character encoding article for an explanation of the terminology.
This method makes the best possible effort to return the name of the encoding used to decode the original source byte stream into character data. This decoding takes place in the constructor when a parameter based on a byte stream such as an InputStream or URL is used to specify the source text. The documentation of the Source(InputStream) and Source(URL) constructors describe how the return value of this method is determined in these cases. It is also possible in some circumstances for the encoding to be determined in the Source(Reader) constructor.
If a constructor was used that specifies the source text directly in character form (not requiring the decoding of a byte sequence) then the document itself is searched for an encoding specification. In this case, this method returns the same value as the getDocumentSpecifiedEncoding() method.
The getEncodingSpecificationInfo() method returns a simple description of how the value of this method was determined.
- Returns:
- the character encoding scheme of the source byte stream used to create this object, or null if the encoding is not known.
- See Also:
getEncodingSpecificationInfo()
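The distinction between a coded character set and its encodings can be illustrated with the standard java.nio.charset classes alone. The following is a minimal sketch, not part of this library: the class name EncodingTerminologySketch and the helper hex are hypothetical. It encodes the same single Unicode character with three different encodings and shows that the resulting byte sequences differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; it does not use the Source API.
class EncodingTerminologySketch {

    /** Renders the bytes of text, encoded with the given charset, as space-separated hex pairs. */
    static String hex(String text, Charset encoding) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(encoding)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One character (U+00E9) of the Unicode coded character set ...
        String text = "\u00E9";
        // ... has a different byte representation in each encoding.
        System.out.println("UTF-8:    " + hex(text, StandardCharsets.UTF_8));    // C3 A9
        System.out.println("UTF-16BE: " + hex(text, StandardCharsets.UTF_16BE)); // 00 E9
        System.out.println("UTF-16LE: " + hex(text, StandardCharsets.UTF_16LE)); // E9 00
    }
}
```

The name returned by getEncoding() identifies one of these byte-level encodings, not the character set itself.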
public String getEncodingSpecificationInfo()
Returns a concise description of how the encoding of the source document was determined. The description is intended for informational purposes only. It is not guaranteed to have any particular format and can not be reliably parsed.
- Returns:
- a concise description of how the encoding of the source document was determined.
- See Also:
getEncoding()
public Writer getLogWriter()
Deprecated. Use ((WriterLogger)getLogger()).getWriter() instead.
Returns the destination Writer for log messages. This method has been deprecated as of version 2.4 in favour of the more generic getLogger() method. Returns null if the current logger is not an instance of WriterLogger.
- Returns:
- the destination Writer for log messages, or null if the current logger is not an instance of WriterLogger.
public Logger getLogger()
Returns the Logger that handles log messages. A logger instance is created automatically for each Source object using the LoggerProvider specified by the static Config.LoggerProvider property. This can be overridden by calling the setLogger(Logger) method. The name used for all automatically created logger instances is "net.htmlparser.jericho".
- Returns:
- the Logger that handles log messages, or null if logging is disabled.
public String getNewLine()
Returns the newline character sequence used in the source document. If the document does not contain any newline characters, this method returns null. The three possible return values (aside from null) are "\n", "\r\n" and "\r".
- Returns:
- the newline character sequence used in the source document, or null if none is present.
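The possible return values can be illustrated with a small standalone sketch. It assumes that the first line break found in the text decides the result, which the description above does not state explicitly. The class name NewLineSketch and the method firstNewLine are hypothetical and are not part of this library.

```java
// Hypothetical illustration class; it does not use the Source API.
class NewLineSketch {

    /**
     * Returns "\r\n", "\n" or "\r" depending on the first line break found in the text,
     * or null if the text contains no line break at all.
     * Assumption: the first line break encountered is taken as representative.
     */
    static String firstNewLine(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') return "\n";
            if (c == '\r') {
                boolean followedByLf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return followedByLf ? "\r\n" : "\r";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("\r\n".equals(firstNewLine("<p>\r\n</p>"))); // true
        System.out.println(firstNewLine("<p></p>") == null);            // true
    }
}
```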
public final ParseText getParseText()
Returns the parse text of this source document. This method is normally only of interest to users who wish to create custom tag types. The parse text is defined as the entire text of the source document in lower case, with all ignored segments replaced by space characters.
- Returns:
- the parse text of this source document.
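The definition of the parse text can be sketched in a few lines of standalone Java. This is only an approximation under stated assumptions: the ignored segments are passed in as hypothetical {begin, end} position pairs (end exclusive), and lower-casing is done character by character so that every position in the parse text still matches the same position in the source. The class name ParseTextSketch and its method are not part of this library.

```java
// Hypothetical illustration class; it does not use the Source or ParseText API.
class ParseTextSketch {

    /**
     * Builds a lower-case copy of the source in which every character inside an
     * ignored range is replaced by a space. Each range is {begin, end}, end exclusive.
     */
    static String parseText(String source, int[][] ignoredRanges) {
        char[] chars = new char[source.length()];
        for (int i = 0; i < chars.length; i++) {
            // Per-character conversion keeps the length, and therefore all positions, unchanged.
            chars[i] = Character.toLowerCase(source.charAt(i));
        }
        for (int[] range : ignoredRanges) {
            for (int i = range[0]; i < range[1]; i++) {
                chars[i] = ' ';
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // The server tag at positions 3 to 10 is treated as an ignored segment.
        String source = "<P><%= x %></P>";
        System.out.println("[" + parseText(source, new int[][] {{3, 11}}) + "]");
    }
}
```

Because the length is preserved, a position found by searching the parse text can be used directly as a position in the original source text.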
public String getPreliminaryEncodingInfo()
Returns the preliminary encoding of the source document together with a concise description of how it was determined.
It is sometimes necessary for the Source(InputStream) and Source(URL) constructors to search the document for an encoding specification in order to determine the exact encoding of the source byte stream. In order to search for the document specified encoding before the exact encoding is known, a preliminary encoding is determined using the first four bytes of the input stream.
Because the encoding specification must only use characters in the range U+0000 to U+007F, the preliminary encoding need only have the following basic properties determined:
- Code unit size (8-bit, 16-bit or 32-bit)
- Byte order (big-endian or little-endian) if the code unit size is 16-bit or 32-bit
- Basic encoding of characters in the range U+0000 to U+007F (current implementation only distinguishes between ASCII and EBCDIC)
The encodings used to represent the most commonly encountered combinations of these basic properties are:
- ISO-8859-1: 8-bit ASCII-compatible encoding
- Cp037: 8-bit EBCDIC-compatible encoding
- UTF-16BE: 16-bit big-endian encoding
- UTF-16LE: 16-bit little-endian encoding
- UTF-32BE: 32-bit big-endian encoding (not supported on most Java platforms)
- UTF-32LE: 32-bit little-endian encoding (not supported on most Java platforms)
Note: all encodings with a code unit size greater than 8 bits are assumed to use an ASCII-compatible low-order byte.
In some descriptions returned by this method, and the documentation below, a pattern is used to help demonstrate the contents of the first four bytes of the stream. The patterns use the characters "00" to signify a zero byte, "XX" to signify a non-zero byte, and "??" to signify a byte that can be either zero or non-zero.
The algorithm for determining the preliminary encoding is as follows:
- Byte pattern "00 00..." : If the stream starts with two zero bytes, the default 32-bit big-endian encoding UTF-32BE is used.
- Byte pattern "00 XX..." : If the stream starts with a single zero byte, the default 16-bit big-endian encoding UTF-16BE is used.
- Byte pattern "XX ?? 00 00..." : If the third and fourth bytes of the stream are zero, the default 32-bit little-endian encoding UTF-32LE is used.
- Byte pattern "XX 00..." or "XX ?? XX 00..." : If the second or fourth byte of the stream is zero, the default 16-bit little-endian encoding UTF-16LE is used.
- Byte pattern "XX XX 00 XX..." : If the third byte of the stream is zero, the default 16-bit big-endian encoding UTF-16BE is used (assumes the first character is > U+00FF).
- Byte pattern "4C XX XX XX..." : If the first four bytes are consistent with the EBCDIC encoding of an XML declaration ("<?xm") or a document type declaration ("<!DO"), or any other string starting with the EBCDIC character '<' followed by three non-ASCII characters (8th bit set), which is consistent with EBCDIC alphanumeric characters, the default EBCDIC-compatible encoding Cp037 is used.
- Byte pattern "XX XX XX XX..." : Otherwise, if all of the first four bytes of the stream are non-zero, the default 8-bit ASCII-compatible encoding ISO-8859-1 is used.
If it was not necessary to search for a document specified encoding when determining the encoding of this source document from a byte stream, this method returns null.
See the documentation of the Source(InputStream) and Source(URL) constructors for more detailed information about when the detection of a preliminary encoding is required.
The description returned by this method is intended for informational purposes only. It is not guaranteed to have any particular format and can not be reliably parsed.
- Returns:
- the preliminary encoding of the source document together with a concise description of how it was determined, or null if no preliminary encoding was required.
- See Also:
getEncoding()
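The byte pattern rules listed for the preliminary encoding can be sketched as a chain of checks applied in the documented order. This is a standalone approximation, not the library's implementation. The class name PreliminaryEncodingSketch and its methods are hypothetical, the handling of streams shorter than four bytes is not covered, and the EBCDIC byte values used for "<?xm" (4C 6F A7 94) and "<!DO" (4C 5A C4 D6) are taken from the Cp037 code page as an assumption about what "consistent with" means here.

```java
// Hypothetical illustration class; it does not use the Source API.
class PreliminaryEncodingSketch {

    /** Applies the documented byte pattern rules, in order, to the first four bytes of a stream. */
    static String detect(byte[] first4) {
        if (first4.length < 4) {
            // Assumption: this sketch does not model shorter streams.
            throw new IllegalArgumentException("need at least four bytes");
        }
        int b0 = first4[0] & 0xFF, b1 = first4[1] & 0xFF;
        int b2 = first4[2] & 0xFF, b3 = first4[3] & 0xFF;

        if (b0 == 0) return b1 == 0 ? "UTF-32BE" : "UTF-16BE"; // "00 00..." or "00 XX..."
        if (b2 == 0 && b3 == 0) return "UTF-32LE";             // "XX ?? 00 00..."
        if (b1 == 0 || b3 == 0) return "UTF-16LE";             // "XX 00..." or "XX ?? XX 00..."
        if (b2 == 0) return "UTF-16BE";                        // "XX XX 00 XX..."
        if (b0 == 0x4C && looksLikeEbcdic(b1, b2, b3)) return "Cp037"; // "4C XX XX XX..."
        return "ISO-8859-1";                                   // "XX XX XX XX..."
    }

    /** Checks the three bytes that follow an EBCDIC '<' (0x4C). */
    private static boolean looksLikeEbcdic(int b1, int b2, int b3) {
        boolean xmlDeclaration = b1 == 0x6F && b2 == 0xA7 && b3 == 0x94; // "?xm" in Cp037
        boolean docTypeDeclaration = b1 == 0x5A && b2 == 0xC4 && b3 == 0xD6; // "!DO" in Cp037
        boolean allHighBitSet = (b1 & b2 & b3 & 0x80) != 0; // 8th bit set in all three bytes
        return xmlDeclaration || docTypeDeclaration || allHighBitSet;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {0x3C, 0x00, 0x3F, 0x00})); // "<?" as 16-bit little-endian
        System.out.println(detect(new byte[] {0x3C, 0x68, 0x74, 0x6D})); // "<htm" in ASCII
    }
}
```

The order of the checks matters: "XX 00 00 00" matches both the UTF-32LE and the UTF-16LE patterns, and is resolved as UTF-32LE only because that rule is tested first.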
public int getRow(int pos)
Returns the row number of the specified character position in the source document.
- Parameters:
pos
- the position in the source document.
- Returns:
- the row number of the specified character position in the source document.
- See Also:
getColumn(int pos), getRowColumnVector(int pos)
public RowColumnVector getRowColumnVector(int pos)
Returns a RowColumnVector object representing the row and column number of the specified character position in the source document.
- Parameters:
pos - the position in the source document.
- Returns:
- a RowColumnVector object representing the row and column number of the specified character position in the source document.
- See Also:
getRow(int pos), getColumn(int pos)
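The mapping from a character position to a row and column can be sketched as follows. This is a standalone approximation, not the library's implementation: it assumes that rows and columns are numbered from 1 and that "\n", "\r\n" and "\r" each count as a single line break. The class name RowColumnSketch and its method are hypothetical.

```java
// Hypothetical illustration class; it does not use the Source or RowColumnVector API.
class RowColumnSketch {

    /**
     * Returns {row, column} for the given position.
     * Assumptions: both numbers are 1-based, and "\r\n" is one line break, not two.
     */
    static int[] rowColumn(CharSequence text, int pos) {
        if (pos < 0 || pos > text.length()) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        int row = 1;
        int lineStart = 0; // position of the first character of the current row
        for (int i = 0; i < pos; i++) {
            char c = text.charAt(i);
            boolean crBeforeLf = c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n';
            // A '\r' that is followed by '\n' is skipped so that the pair ends only one row.
            if (c == '\n' || (c == '\r' && !crBeforeLf)) {
                row++;
                lineStart = i + 1;
            }
        }
        return new int[] {row, pos - lineStart + 1};
    }

    public static void main(String[] args) {
        int[] rc = rowColumn("<html>\n<body>", 7); // position of the second '<'
        System.out.println("row " + rc[0] + ", column " + rc[1]);
    }
}
```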
public SourceFormatter getSourceFormatter()
Formats the HTML source by laying out each non-inline-level element on a new line with an appropriate indent. The output format can be configured by setting any number of properties on the returned SourceFormatter instance before obtaining its output. To create a SourceFormatter instance based on a Segment rather than an entire Source document, use new SourceFormatter(segment) instead.
- Returns:
- an instance of SourceFormatter based on this source document.
public final Tag getTagAt(int pos)
Returns the Tag at the specified position in the source document. See the Tag class documentation for more details about the behaviour of this method. This method also returns unregistered tags.
- Parameters:
pos - the position in the source document, may be out of bounds.
- Returns:
- the Tag at the specified position in the source document, or null if no tag exists at the specified position or it is out of bounds.
public void ignoreWhenParsing(Collection segments)
Causes all of the segments in the specified collection to be ignored when parsing. This is equivalent to calling Segment.ignoreWhenParsing() on each segment in the collection.
public void ignoreWhenParsing(int begin, int end)
Causes the specified range of the source text to be ignored when parsing. See the documentation of the Segment.ignoreWhenParsing() method for more information.
- Parameters:
begin - the beginning character position in the source text.
end - the end character position in the source text.
public CharStreamSource indent(String indentString, boolean tidyTags, boolean collapseWhiteSpace, boolean indentAllElements)
Deprecated. Use getSourceFormatter().setIndentString(indentString).setTidyTags(tidyTags).setCollapseWhiteSpace(collapseWhiteSpace).setIndentAllElements(indentAllElements) instead.
Formats the HTML source by laying out each non-inline-level element on a new line with an appropriate indent. This method has been deprecated as of version 2.4 and replaced with the getSourceFormatter() method.
- Parameters:
indentString - the string to use for indentation.
tidyTags - specifies whether to replace the original text of each tag with the output from its Tag.tidy() method.
collapseWhiteSpace - specifies whether to collapse the white space in the text between the tags.
indentAllElements - specifies whether to indent all elements, including block-level elements and those with preformatted contents.
- Returns:
- a CharStreamSource that produces the output.
public boolean isLoggingEnabled()
Deprecated. Use getLogger().isInfoEnabled() instead.
Indicates whether logging is currently enabled. This method has been deprecated as of version 2.4 as its purpose was to allow efficient use of the log(String) method, which has been deprecated.
- Returns:
- true if logging is currently enabled, otherwise false.
public boolean isXML()
Indicates whether the source document is likely to be XML. The algorithm used to determine this is designed to be relatively inexpensive and to provide an accurate result in most normal situations. An exact determination of whether the source document is XML would require a much more complex analysis of the text.
The algorithm is as follows:
- If the document begins with an XML declaration, it is an XML document.
- If the document contains a document type declaration that contains the text "xhtml", it is an XHTML document, and hence also an XML document.
- If none of the above conditions are met, assume the document is normal HTML, and therefore not an XML document.
As of version 2.5, this method no longer returns true if the document doesn't contain an HTML element. The library is often used to parse partial HTML documents, so the lack of an HTML element is not a reliable test for an XML document.
- Returns:
- true if the source document is likely to be XML, otherwise false.
public void log(String message)
Deprecated. Use getLogger().info(message) instead.
Writes the specified message to the log. This method has been deprecated as of version 2.4 as logging is now performed via the Logger interface obtained via the getLogger() method.
- Parameters:
message - the message to log
public Attributes parseAttributes(int pos, int maxEnd)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.
The returned Attributes segment always begins at pos, and ends at the end of the last attribute before either maxEnd or the first occurrence of "/>" or ">" outside of a quoted attribute value, whichever comes first.
Only returns null if the segment contains a major syntactical error or more than the default maximum number of minor syntactical errors.
This is equivalent to parseAttributes(pos,maxEnd,Attributes.getDefaultMaxErrorCount()).
- Parameters:
pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
- Returns:
- the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
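The rule that bounds the attribute list can be sketched as a scan that tracks whether the current character lies inside a quoted value. The sketch only computes the upper limit (maxEnd or the first unquoted "/>" or ">", whichever comes first). It does not parse attributes or count syntax errors, and it assumes pos is within bounds. The class name AttributeListLimitSketch and its method are hypothetical and not part of this library.

```java
// Hypothetical illustration class; it does not use the Source or Attributes API.
class AttributeListLimitSketch {

    /**
     * Returns the position at which scanning for attributes must stop: the first
     * "/>" or ">" outside of a quoted value, or maxEnd if that comes first.
     * A maxEnd of -1 means no maximum, so the end of the text is used instead.
     */
    static int limit(CharSequence text, int pos, int maxEnd) {
        int hardEnd = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char quote = 0; // 0 while outside a quoted attribute value
        for (int i = pos; i < hardEnd; i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0; // closing quote reached
                continue;
            }
            if (c == '"' || c == '\'') {
                quote = c; // opening quote
                continue;
            }
            if (c == '>') return i;
            if (c == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>') return i;
        }
        return hardEnd;
    }

    public static void main(String[] args) {
        String text = "<a href=\"x>y\" id=z>";
        // The '>' inside the quoted href value is skipped; the scan stops at the final '>'.
        System.out.println(limit(text, 3, -1));
    }
}
```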
public Attributes parseAttributes(int pos, int maxEnd, int maxErrorCount)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.
Only returns null if the segment contains a major syntactical error or more than the specified number of minor syntactical errors.
The maxErrorCount argument overrides the default maximum error count.
See parseAttributes(int pos, int maxEnd) for more information.
- Parameters:
pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
maxErrorCount - the maximum number of minor errors allowed while parsing.
- Returns:
- the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
- See Also:
StartTag.getAttributes(), parseAttributes(int pos, int maxEnd)
public void setLogWriter(Writer writer)
Deprecated. Use setLogger(new WriterLogger(writer)) instead.
Sets the logger to an implementation that sends all output to a specified Writer. This method has been deprecated as of version 2.4 in favour of the more generic setLogger(Logger) method.
- Parameters:
writer - the destination java.io.Writer for log messages.
public void setLogger(Logger logger)
Sets the Logger that handles log messages. Specifying a null argument disables logging completely for operations performed on this Source object.
A logger instance is created automatically for each Source object using the LoggerProvider specified by the static Config.LoggerProvider property. The name used for all automatically created logger instances is "net.htmlparser.jericho". Use of this method with a non-null argument is therefore not usually necessary, unless specifying an instance of WriterLogger or a user-defined Logger implementation.
- Parameters:
logger - the logger that will handle log messages, or null to disable logging.
- See Also:
Config.LoggerProvider