au.id.jericho.lib.html

Class Tag

Implemented Interfaces:
CharSequence, Comparable, HTMLElementName
Known Direct Subclasses:
EndTag, StartTag

public abstract class Tag
extends Segment
implements HTMLElementName

Represents either a StartTag or EndTag in a specific source document.

Tag Parsing Process

The following process describes how each tag is identified by the parser:
  1. Every '<' character found in the source document is considered to be the start of a tag. The characters following it are compared with the start delimiters of all the registered tag types, and a list of matching tag types is determined.
  2. A more detailed analysis of the source is performed according to the features of each matching tag type from the first step, in order of precedence, until a valid tag can be constructed.

    The analysis performed in relation to each candidate tag type is a two-stage process:

    1. The position of the tag is checked to determine whether it is valid. In theory, a server tag is valid in any position, but a non-server tag is not valid inside another non-server tag.

      The TagType.isValidPosition(Source, int pos) method is responsible for this check and has a common default implementation for all tag types (although custom tag types can override it if necessary). Its behaviour differs depending on whether or not a full sequential parse is performed. See the documentation of the isValidPosition method for full details.

    2. A final analysis is performed by the TagType.constructTagAt(Source, int pos) method of the candidate tag type. This method returns a valid Tag object if all conditions of the candidate tag type are met, otherwise it returns null and the process continues with the next candidate tag type.
  3. If the source does not match the start delimiter or syntax of any registered tag type, the segment spanning it and the next '>' character is taken to be an unregistered tag. Some tag search methods ignore unregistered tags. See the isUnregistered() method for more information.

    See the documentation of the TagType class for more details on how tags are recognised.
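
    The three steps above can be sketched in plain Java as follows. This is only an illustration of the control flow under simplifying assumptions, not the library's implementation: the CandidateType interface, the delimited helper and the identifyTagAt method are hypothetical names, tags are represented as labelled strings, and the position check always succeeds.

```java
import java.util.Arrays;
import java.util.List;

final class TagParsingSketch {

    /** Hypothetical stand-in for a tag type; this is not the library's TagType class. */
    interface CandidateType {
        String startDelimiter();
        boolean isValidPosition(String source, int pos);
        /** Returns a label plus the tag text, or null if the syntax does not match at pos. */
        String constructTagAt(String source, int pos);
    }

    /** Builds a simple candidate type spanning from a start delimiter to a closing delimiter. */
    static CandidateType delimited(final String label, final String start, final String close) {
        return new CandidateType() {
            public String startDelimiter() { return start; }
            // Assumption for this sketch: every position is valid.
            public boolean isValidPosition(String source, int pos) { return true; }
            public String constructTagAt(String source, int pos) {
                int end = source.indexOf(close, pos + start.length());
                return end == -1 ? null : label + ":" + source.substring(pos, end + close.length());
            }
        };
    }

    /** Identifies the tag at pos, trying candidate types in the given order of precedence. */
    static String identifyTagAt(String source, int pos, List<CandidateType> typesByPrecedence) {
        if (pos < 0 || pos >= source.length() || source.charAt(pos) != '<') return null;
        for (CandidateType type : typesByPrecedence) {
            // Step 1: only types whose start delimiter matches at pos are candidates.
            if (!source.startsWith(type.startDelimiter(), pos)) continue;
            // Step 2, stage 1: the position must be valid for this candidate.
            if (!type.isValidPosition(source, pos)) continue;
            // Step 2, stage 2: the candidate attempts to construct the tag.
            String tag = type.constructTagAt(source, pos);
            if (tag != null) return tag;
        }
        // Step 3: no registered type matched, so fall back to an unregistered tag.
        int end = source.indexOf('>', pos + 1);
        return end == -1 ? null : "unregistered:" + source.substring(pos, end + 1);
    }

    public static void main(String[] args) {
        List<CandidateType> types = Arrays.asList(
            delimited("comment", "<!--", "-->"),
            delimited("php", "<?php", "?>"));
        System.out.println(identifyTagAt("a<!-- x -->b", 1, types)); // comment:<!-- x -->
        System.out.println(identifyTagAt("<foo>", 0, types));        // unregistered:<foo>
    }
}
```

    The order of the list passed to identifyTagAt plays the role of precedence: the first candidate that both matches the start delimiter and constructs a tag wins.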

    Tag Search Methods

    Methods that find tags in a source document are collectively referred to as Tag Search Methods. They are found mostly in the Source and Segment classes, and can be generally categorised as follows:

    Open Search:
    Methods that search for tags of any name and type.
    Named Search:
    Methods that take a name parameter specifying the name of the tag to search for:
    • Segment.findAllElements(String name)
    • Segment.findAllStartTags(String name)
    • Source.findPreviousStartTag(int pos, String name)
    • Source.findNextStartTag(int pos, String name)
    • Source.findPreviousEndTag(int pos, String name)
    • Source.findNextEndTag(int pos, String name)
    • Source.findNextEndTag(int pos, String name, EndTagType)
    Tag Type Search:
    Methods that take a tagType parameter specifying the type of the tag to search for. In some methods this parameter is a StartTagType rather than a general TagType.
    Other Search:
    Methods that search on other criteria, such as attribute values:
    • Segment.findAllStartTags(String attributeName, String value, boolean valueCaseSensitive)
    • Source.findNextStartTag(int pos, String attributeName, String value, boolean valueCaseSensitive)
    Field Summary

    static String
    DOCTYPE_DECLARATION
    Deprecated. Use StartTagType.DOCTYPE_DECLARATION in combination with tag type search methods instead.
    static String
    PROCESSING_INSTRUCTION
    Deprecated. Use StartTagType.XML_PROCESSING_INSTRUCTION in combination with tag type search methods instead.
    static String
    SERVER_COMMON
    Deprecated. Use StartTagType.SERVER_COMMON in combination with tag type search methods instead.
    static String
    SERVER_MASON_COMPONENT_CALL
    Deprecated. Use MasonTagTypes.MASON_COMPONENT_CALL in combination with tag type search methods instead.
    static String
    SERVER_MASON_COMPONENT_CALLED_WITH_CONTENT
    Deprecated. Use MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT in combination with tag type search methods instead.
    static String
    SERVER_MASON_NAMED_BLOCK
    Deprecated. Use MasonTagTypes.MASON_NAMED_BLOCK in combination with tag type search methods instead.
    static String
    SERVER_PHP
    Deprecated. Use PHPTagTypes.PHP_STANDARD in combination with tag type search methods instead.
    static String
    XML_DECLARATION
    Deprecated. Use StartTagType.XML_DECLARATION in combination with tag type search methods instead.

    Fields inherited from interface au.id.jericho.lib.html.HTMLElementName

    A, ABBR, ACRONYM, ADDRESS, APPLET, AREA, B, BASE, BASEFONT, BDO, BIG, BLOCKQUOTE, BODY, BR, BUTTON, CAPTION, CENTER, CITE, CODE, COL, COLGROUP, DD, DEL, DFN, DIR, DIV, DL, DT, EM, FIELDSET, FONT, FORM, FRAME, FRAMESET, H1, H2, H3, H4, H5, H6, HEAD, HR, HTML, I, IFRAME, IMG, INPUT, INS, ISINDEX, KBD, LABEL, LEGEND, LI, LINK, MAP, MENU, META, NOFRAMES, NOSCRIPT, OBJECT, OL, OPTGROUP, OPTION, P, PARAM, PRE, Q, S, SAMP, SCRIPT, SELECT, SMALL, SPAN, STRIKE, STRONG, STYLE, SUB, SUP, TABLE, TBODY, TD, TEXTAREA, TFOOT, TH, THEAD, TITLE, TR, TT, U, UL, VAR

    Method Summary

    Tag
    findNextTag()
    Returns the next tag in the source document.
    Tag
    findPreviousTag()
    Returns the previous tag in the source document.
    abstract Element
    getElement()
    Returns the element that is started or ended by this tag.
    String
    getName()
    Returns the name of this tag, always in lower case.
    Segment
    getNameSegment()
    Returns the segment spanning the name of this tag.
    abstract TagType
    getTagType()
    Returns the type of this tag.
    Object
    getUserData()
    Returns the general purpose user data object that has previously been associated with this tag via the setUserData(Object) method.
    abstract boolean
    isUnregistered()
    Indicates whether this tag has a syntax that does not match any of the registered tag types.
    static boolean
    isXMLName(CharSequence text)
    Indicates whether the specified text is a valid XML Name.
    static boolean
    isXMLNameChar(char ch)
    Indicates whether the specified character is valid anywhere in an XML Name.
    static boolean
    isXMLNameStartChar(char ch)
    Indicates whether the specified character is valid at the start of an XML Name.
    abstract String
    regenerateHTML()
    Deprecated. Use tidy() instead.
    void
    setUserData(Object userData)
    Associates the specified general purpose user data object with this tag.
    abstract String
    tidy()
    Returns an XML representation of this tag.

    Methods inherited from class au.id.jericho.lib.html.Segment

    charAt, compareTo, encloses, encloses, equals, extractText, extractText, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findAllTags, findAllTags, findFormControls, findFormFields, findWords, getBegin, getChildElements, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString

    Field Details

    DOCTYPE_DECLARATION

    public static final String DOCTYPE_DECLARATION

    Deprecated. Use StartTagType.DOCTYPE_DECLARATION in combination with tag type search methods instead.


    PROCESSING_INSTRUCTION

    public static final String PROCESSING_INSTRUCTION

    Deprecated. Use StartTagType.XML_PROCESSING_INSTRUCTION in combination with tag type search methods instead.


    SERVER_COMMON

    public static final String SERVER_COMMON

    Deprecated. Use StartTagType.SERVER_COMMON in combination with tag type search methods instead.

    Common server tag (<% ... %>)

    SERVER_MASON_COMPONENT_CALL

    public static final String SERVER_MASON_COMPONENT_CALL

    Deprecated. Use MasonTagTypes.MASON_COMPONENT_CALL in combination with tag type search methods instead.


    SERVER_MASON_COMPONENT_CALLED_WITH_CONTENT

    public static final String SERVER_MASON_COMPONENT_CALLED_WITH_CONTENT

    Deprecated. Use MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT in combination with tag type search methods instead.


    SERVER_MASON_NAMED_BLOCK

    public static final String SERVER_MASON_NAMED_BLOCK

    Deprecated. Use MasonTagTypes.MASON_NAMED_BLOCK in combination with tag type search methods instead.

    Mason named block (<%name ... > ... </%name>)

    SERVER_PHP

    public static final String SERVER_PHP

    Deprecated. Use PHPTagTypes.PHP_STANDARD in combination with tag type search methods instead.

    Standard PHP tag (<?php ... ?>)

    XML_DECLARATION

    public static final String XML_DECLARATION

    Deprecated. Use StartTagType.XML_DECLARATION in combination with tag type search methods instead.

    Method Details

    findNextTag

    public Tag findNextTag()
    Returns the next tag in the source document.

    If a full sequential parse has been performed, this method is very efficient.

    If not, it is equivalent to source.findNextTag(getBegin()+1).

    See the Tag class documentation for more details about the behaviour of this method.

    Returns:
    the next tag in the source document, or null if this is the last tag.

    findPreviousTag

    public Tag findPreviousTag()
    Returns the previous tag in the source document.

    If a full sequential parse has been performed, this method is very efficient.

    If not, it is equivalent to source.findPreviousTag(getBegin()-1).

    See the Tag class documentation for more details about the behaviour of this method.

    Returns:
    the previous tag in the source document, or null if this is the first tag.

    getElement

    public abstract Element getElement()
    Returns the element that is started or ended by this tag.

    StartTag.getElement() is guaranteed not to return null.

    EndTag.getElement() can return null if the end tag is not properly matched to a start tag.

    Returns:
    the element that is started or ended by this tag.

    getName

    public final String getName()
    Returns the name of this tag, always in lower case.

    The name always starts with the name prefix defined in this tag's type. For some tag types, the name consists only of this prefix, while in others it must be followed by a valid XML name (see StartTagType.isNameAfterPrefixRequired()).

    If the name is equal to one of the constants defined in the HTMLElementName interface, this method is guaranteed to return the constant itself. This allows comparisons to be performed using the == operator instead of the less efficient String.equals(Object) method.

    For example, the following expression can be used to test whether a StartTag is from a SELECT element:
    startTag.getName()==HTMLElementName.SELECT

    To get the name of this tag in its original case, use getNameSegment().toString().

    Returns:
    the name of this tag, always in lower case.
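
    The reason the == comparison works can be shown with a small self-contained sketch. It assumes one possible way of providing the guarantee (a lookup that maps a lower-cased name to the constant instance); the class NameConstantSketch and its canonicalName method are hypothetical and are not part of the library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NameConstantSketch {
    // Stand-ins for two of the HTMLElementName constants.
    static final String SELECT = "select";
    static final String OPTION = "option";

    // Maps each constant's value to the constant instance itself.
    private static final Map<String, String> CONSTANTS = new HashMap<>();
    static {
        CONSTANTS.put(SELECT, SELECT);
        CONSTANTS.put(OPTION, OPTION);
    }

    /** Lower-cases the raw name and returns the constant instance when one exists. */
    static String canonicalName(String rawName) {
        String lower = rawName.toLowerCase(Locale.ROOT);
        String constant = CONSTANTS.get(lower);
        return constant != null ? constant : lower;
    }

    public static void main(String[] args) {
        // The lower-cased "SELECT" is a new String object, but the lookup returns the constant.
        System.out.println(canonicalName("SELECT") == SELECT); // true
        // Names without a constant are still lower-cased, so equals is needed for them.
        System.out.println(canonicalName("Custom").equals("custom")); // true
    }
}
```

    The identity comparison is only safe for names that have a constant; for any other name, String.equals(Object) remains the correct test.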

    getNameSegment

    public Segment getNameSegment()
    Returns the segment spanning the name of this tag.

    The code getNameSegment().toString() can be used to retrieve the name of this tag in its original case.

    Every call to this method constructs a new Segment object.

    Returns:
    the segment spanning the name of this tag.
    See Also:
    getName()

    getTagType

    public abstract TagType getTagType()
    Returns the type of this tag.
    Returns:
    the type of this tag.

    getUserData

    public Object getUserData()
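    Returns the general purpose user data object that has previously been associated with this tag via the setUserData(Object) method.

    Returns:
    the general purpose user data object that has previously been associated with this tag via the setUserData(Object) method.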

    isUnregistered

    public abstract boolean isUnregistered()
    Indicates whether this tag has a syntax that does not match any of the registered tag types.

    The only requirement of an unregistered tag type is that it starts with '<' and there is a closing '>' character at some position after it in the source document.

    If no '/' character follows the initial '<', the unregistered tag is a StartTag with a type of StartTagType.UNREGISTERED. If a '/' character is present, it is an EndTag with a type of EndTagType.UNREGISTERED.

    There are no restrictions on the characters that might appear between these delimiters, including other '<' characters. This may result in a '>' character that is identified as the closing delimiter of two separate tags, one an unregistered tag, and the other a tag of any type that begins in the middle of the unregistered tag. As explained below, unregistered tags are usually only found when specifically looking for them, so it is up to the user to detect and deal with any such nonsensical results.

    Unregistered tags are returned only by the Source.getTagAt(int pos) method, by named search methods where the specified name matches the first characters inside the tag, and by tag type search methods where the specified tagType is either StartTagType.UNREGISTERED or EndTagType.UNREGISTERED.

    Open tag searches and other searches always ignore unregistered tags, although every discovery of an unregistered tag is logged by the parser.

    The logic behind this design is that unregistered tag types are usually the result of a '<' character in the text that was mistakenly left unencoded, or a less-than operator inside a script, or some other occurrence which is of no interest to the user. By returning unregistered tags in named and tag type search methods, the library allows the user to specifically search for tags with a certain syntax that does not match any existing TagType. This expediency feature avoids the need for the user to create a custom tag type to define the syntax before searching for these tags. By not returning unregistered tags in the less specific search methods, it is providing only the information that most users are interested in.

    Returns:
    true if this tag has a syntax that does not match any of the registered tag types, otherwise false.
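
    The rules described above can be sketched without the library. The class UnregisteredTagSketch and its methods are hypothetical and operate on a plain String; they show how the '/' character decides between start and end tag, and how two tags can share the same closing '>' character.

```java
final class UnregisteredTagSketch {
    enum Kind { START_TAG, END_TAG }

    /** Returns the index just past the next '>' after pos, or -1 if pos is not a '<' or no '>' follows. */
    static int findEnd(String source, int pos) {
        if (pos < 0 || pos >= source.length() || source.charAt(pos) != '<') return -1;
        int close = source.indexOf('>', pos + 1);
        return close == -1 ? -1 : close + 1;
    }

    /** Classifies by the character after '<'; returns null if no unregistered tag can start at pos. */
    static Kind classify(String source, int pos) {
        if (findEnd(source, pos) == -1) return null;
        // A '>' exists after pos, so pos + 1 is always a valid index here.
        return source.charAt(pos + 1) == '/' ? Kind.END_TAG : Kind.START_TAG;
    }

    public static void main(String[] args) {
        String source = "<a <b>";
        // Both the span starting at index 0 and the tag starting at index 3 end at the same '>'.
        System.out.println(findEnd(source, 0)); // 6
        System.out.println(findEnd(source, 3)); // 6
        System.out.println(classify("</%x>", 0)); // END_TAG
    }
}
```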

    isXMLName

    public static final boolean isXMLName(CharSequence text)
    Indicates whether the specified text is a valid XML Name.

    This implementation first checks that the first character of the specified text is a valid XML Name start character as defined by the isXMLNameStartChar(char) method, and then checks that the rest of the characters are valid XML Name characters as defined by the isXMLNameChar(char) method.

    Note that this implementation does not exactly adhere to the formal definition of an XML Name, but the differences are unlikely to be significant in real-world XML or HTML documents.

    Parameters:
    text - the text to test.
    Returns:
    true if the specified text is a valid XML Name, otherwise false.
    See Also:
    Source.findNameEnd(int pos)
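
    As a rough illustration, the check described above can be composed from the two character tests documented for isXMLNameStartChar(char) and isXMLNameChar(char). The class XmlNameSketch is hypothetical, and treating null or empty text as invalid is an assumption of this sketch rather than documented behaviour.

```java
final class XmlNameSketch {

    /** Same expression as documented for isXMLNameStartChar(char). */
    static boolean isNameStartChar(char ch) {
        return Character.isLetter(ch) || ch == '_' || ch == ':';
    }

    /** Same expression as documented for isXMLNameChar(char). */
    static boolean isNameChar(char ch) {
        return Character.isLetterOrDigit(ch) || ch == '.' || ch == '-' || ch == '_' || ch == ':';
    }

    /** First character must be a name start character, the rest must be name characters. */
    static boolean isName(CharSequence text) {
        // Assumption: null or empty text is not a valid name.
        if (text == null || text.length() == 0) return false;
        if (!isNameStartChar(text.charAt(0))) return false;
        for (int i = 1; i < text.length(); i++) {
            if (!isNameChar(text.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isName("xml:lang")); // true
        System.out.println(isName("1abc"));     // false, a digit cannot start a name
        System.out.println(isName("a b"));      // false, a space is not a name character
    }
}
```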

    isXMLNameChar

    public static final boolean isXMLNameChar(char ch)
    Indicates whether the specified character is valid anywhere in an XML Name.

    The XML 1.0 specification section 2.3 uses the entity NameChar to represent this set of characters, which is defined as
    (Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender).

    This method uses the expression
    Character.isLetterOrDigit(ch) || ch=='.' || ch=='-' || ch=='_' || ch==':'.

    Note that there are many differences between these definitions, but these differences are unlikely to be significant in real-world XML or HTML documents.

    Parameters:
    ch - the character to test.
    Returns:
    true if the specified character is valid anywhere in an XML Name, otherwise false.
    See Also:
    Source.findNameEnd(int pos)
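
    Two concrete examples of the differences mentioned above: the XML 1.0 productions list U+00B7 (middle dot) as an Extender and U+0301 (combining acute accent) as a CombiningChar, so both are NameChar characters in the specification, yet the documented Java expression rejects them. The class NameCharDifferenceSketch below is a hypothetical, self-contained copy of that expression.

```java
final class NameCharDifferenceSketch {

    /** The expression documented for isXMLNameChar(char). */
    static boolean documentedNameChar(char ch) {
        return Character.isLetterOrDigit(ch) || ch == '.' || ch == '-' || ch == '_' || ch == ':';
    }

    public static void main(String[] args) {
        char middleDot = '\u00B7';       // Extender in the XML 1.0 NameChar production
        char combiningAcute = '\u0301';  // CombiningChar in the XML 1.0 NameChar production
        System.out.println(documentedNameChar(middleDot));      // false
        System.out.println(documentedNameChar(combiningAcute)); // false
        System.out.println(documentedNameChar('7'));            // true
    }
}
```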

    isXMLNameStartChar

    public static final boolean isXMLNameStartChar(char ch)
    Indicates whether the specified character is valid at the start of an XML Name.

    The XML 1.0 specification section 2.3 defines a Name as starting with one of the characters
    (Letter | '_' | ':').

    This method uses the expression
    Character.isLetter(ch) || ch=='_' || ch==':'.

    Note that there are many differences between the Character.isLetter() definition of a Letter and the XML definition of a Letter, but these differences are unlikely to be significant in real-world XML or HTML documents.

    Parameters:
    ch - the character to test.
    Returns:
    true if the specified character is valid at the start of an XML Name, otherwise false.
    See Also:
    Source.findNameEnd(int pos)

    regenerateHTML

    public abstract String regenerateHTML()

    Deprecated. Use tidy() instead.

    Regenerates the HTML text of this tag.

    This method has been deprecated as of version 2.2 and replaced with the exactly equivalent tidy() method.

    Returns:
    the regenerated HTML text of this tag.

    setUserData

    public void setUserData(Object userData)
    Associates the specified general purpose user data object with this tag.

    Parameters:
    userData - general purpose user data of any type.

    tidy

    public abstract String tidy()
    Returns an XML representation of this tag.

    Returns:
    an XML representation of this tag.