au.id.jericho.lib.html

Class Element

Implemented Interfaces:
CharSequence, Comparable, HTMLElementName

public final class Element
extends Segment
implements HTMLElementName

Represents an element in a specific source document, which encompasses a start tag, an optional end tag and all content in between.

The term normal element refers to an element having a start tag with a type of StartTagType.NORMAL. This comprises all HTML elements and non-HTML elements.

Element instances are obtained using one of the following methods:

See also the HTMLElements class, and the XML 1.0 specification for elements.

Element Structure

The three possible structures of an element are listed below:

Single Tag Element
Consists only of a start tag and has no element content (the start tag itself may still have tag content).
getEndTag()==null
isEmpty()==true
getEnd()==getStartTag().getEnd()

Explicitly Terminated Element
Consists of a start tag, content and an end tag.
getEndTag()!=null
isEmpty()==false
getEnd()==getEndTag().getEnd()

Implicitly Terminated Element
Consists of a start tag and content, but no end tag.
getEndTag()==null
isEmpty()==false
getEnd()!=getStartTag().getEnd()

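This structure occurs only in an HTML element whose end tag is optional.

The element ends at the start of an implicitly terminating tag. If that tag immediately follows the element's start tag, the element is classed as a single tag element.

See the element parsing rules for HTML elements with optional end tags to determine which tags can implicitly terminate a given element.

See also HTMLElements.getEndTagOptionalElementNames().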
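
The three structures can be told apart from the conditions listed above. The following is a minimal sketch, not library code: ElementStructureSketch and its parameters are hypothetical stand-ins for the values returned by getEndTag(), getStartTag().getEnd() and getEnd().

```java
// Stand-in model, not the library's Element class: it takes only the three
// facts that the structure descriptions depend on.
final class ElementStructureSketch {

    enum Structure { SINGLE_TAG, EXPLICITLY_TERMINATED, IMPLICITLY_TERMINATED }

    /**
     * @param hasEndTag   stands in for getEndTag() != null
     * @param startTagEnd stands in for getStartTag().getEnd()
     * @param elementEnd  stands in for getEnd()
     */
    static Structure classify(boolean hasEndTag, int startTagEnd, int elementEnd) {
        if (hasEndTag) {
            return Structure.EXPLICITLY_TERMINATED;
        }
        // No end tag: the element is a single tag element only if it stops
        // exactly where its start tag stops.
        return elementEnd == startTagEnd
                ? Structure.SINGLE_TAG
                : Structure.IMPLICITLY_TERMINATED;
    }

    public static void main(String[] args) {
        // "<br>": the start tag and the element both end at position 4.
        System.out.println(classify(false, 4, 4));
        // "<p>text</p>": the start tag ends at 3, the element ends at 11.
        System.out.println(classify(true, 3, 11));
        // "<p>text<p>": the first element ends at 7, where the next <p> begins.
        System.out.println(classify(false, 3, 7));
    }
}
```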

Element Parsing Rules

The following rules describe the algorithm used in the StartTag.getElement() method to construct an element. The detection of the start tag's matching end tag or other terminating tags always takes into account the possible nesting of elements.

  • If the end tag for an element of this name is optional, the parser searches not only for the start tag's matching end tag, but also for any other tag that implicitly terminates the element.
    For each tag (T2) following the start tag (ST1) of this element (E1):
  • If T2 is an end tag:
  • If no more tags are present in the source document, then E1 ends at the end of the file, and an implicitly terminated element is created.
    Note that the syntactical indication of an empty-element tag in the start tag is ignored when determining the end of HTML elements. See the documentation of the isEmptyElementTag() method for more information.
  • If the name of the start tag does not match one of the recognised HTML element names (indicating a non-HTML element):
  • If the start tag has any type other than StartTagType.NORMAL:
    See Also:
    HTMLElements
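
The search performed for an element whose end tag is optional can be sketched roughly as follows. This is a deliberate simplification and not the parser's implementation: it ignores element nesting, which the real algorithm takes into account, and the Tag and Result types and the caller-supplied set of terminating tags are hypothetical. Which tags actually terminate a given element is decided by the library's own rules.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OptionalEndTagSketch {

    /** Hypothetical tag token: name, start or end tag, and its span in the source. */
    static final class Tag {
        final String name;
        final boolean endTag;
        final int begin;
        final int end;

        Tag(String name, boolean endTag, int begin, int end) {
            this.name = name;
            this.endTag = endTag;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Where E1 ends, and whether a matching end tag was found. */
    static final class Result {
        final int end;
        final boolean explicitlyTerminated;

        Result(int end, boolean explicitlyTerminated) {
            this.end = end;
            this.explicitlyTerminated = explicitlyTerminated;
        }
    }

    /**
     * Scans the tags (T2) that follow the start tag ST1 of element E1.
     * Terminators are written as "name" for start tags and "/name" for end tags.
     */
    static Result findEnd(String st1Name, List<Tag> following,
                          Set<String> terminators, int sourceLength) {
        for (Tag t2 : following) {
            if (t2.endTag && t2.name.equals(st1Name)) {
                // Matching end tag: E1 includes it.
                return new Result(t2.end, true);
            }
            String key = t2.endTag ? "/" + t2.name : t2.name;
            if (terminators.contains(key)) {
                // Implicitly terminating tag: E1 stops where T2 begins.
                return new Result(t2.begin, false);
            }
        }
        // No more tags: E1 runs to the end of the source.
        return new Result(sourceLength, false);
    }

    public static void main(String[] args) {
        // Assumed terminators for a "p" element, for illustration only.
        Set<String> pTerminators = new HashSet<String>(Arrays.asList("p", "/div"));

        // Source "<p>one<p>two": the second <p> spans positions 6..9.
        List<Tag> afterFirstP = Arrays.asList(new Tag("p", false, 6, 9));
        Result first = findEnd("p", afterFirstP, pTerminators, 12);
        System.out.println(first.end + " " + first.explicitlyTerminated);

        // Nothing follows the second <p>, so it ends at the end of the source.
        Result second = findEnd("p", Collections.<Tag>emptyList(), pTerminators, 12);
        System.out.println(second.end + " " + second.explicitlyTerminated);
    }
}
```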

    Fields inherited from interface au.id.jericho.lib.html.HTMLElementName

    A, ABBR, ACRONYM, ADDRESS, APPLET, AREA, B, BASE, BASEFONT, BDO, BIG, BLOCKQUOTE, BODY, BR, BUTTON, CAPTION, CENTER, CITE, CODE, COL, COLGROUP, DD, DEL, DFN, DIR, DIV, DL, DT, EM, FIELDSET, FONT, FORM, FRAME, FRAMESET, H1, H2, H3, H4, H5, H6, HEAD, HR, HTML, I, IFRAME, IMG, INPUT, INS, ISINDEX, KBD, LABEL, LEGEND, LI, LINK, MAP, MENU, META, NOFRAMES, NOSCRIPT, OBJECT, OL, OPTGROUP, OPTION, P, PARAM, PRE, Q, S, SAMP, SCRIPT, SELECT, SMALL, SPAN, STRIKE, STRONG, STYLE, SUB, SUP, TABLE, TBODY, TD, TEXTAREA, TFOOT, TH, THEAD, TITLE, TR, TT, U, UL, VAR

    Method Summary

    String
    getAttributeValue(String attributeName)
    Returns the decoded value of the attribute with the specified name (case insensitive).
    Attributes
    getAttributes()
    Returns the attributes specified in this element's start tag.
    List
    getChildElements()
    Returns a list of the immediate children of this element in the document element hierarchy.
    Segment
    getContent()
    Returns the segment representing the content of the element.
    String
    getContentText()
    Deprecated. Use isEmpty() ? null : getContent().toString() instead.
    String
    getDebugInfo()
    Returns a string representation of this object useful for debugging purposes.
    int
    getDepth()
    Returns the nesting depth of this element in the document element hierarchy.
    EndTag
    getEndTag()
    Returns the end tag of the element.
    FormControl
    getFormControl()
    Returns the FormControl defined by this element.
    String
    getName()
    Returns the name of the start tag of this element, always in lower case.
    Element
    getParentElement()
    Returns the parent of this element in the document element hierarchy.
    StartTag
    getStartTag()
    Returns the start tag of the element.
    static boolean
    isBlock(String elementName)
    Deprecated. Use HTMLElements.getBlockLevelElementNames().contains(elementName.toLowerCase()) instead.
    boolean
    isEmpty()
    Indicates whether this element has zero-length content.
    boolean
    isEmptyElementTag()
    Indicates whether this element is an empty-element tag.
    static boolean
    isInline(String elementName)
    Deprecated. Use HTMLElements.getInlineLevelElementNames().contains(elementName.toLowerCase()) instead.

    Methods inherited from class au.id.jericho.lib.html.Segment

    charAt, compareTo, encloses, encloses, equals, extractText, extractText, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findAllTags, findAllTags, findFormControls, findFormFields, findWords, getBegin, getChildElements, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString

    Method Details

    getAttributeValue

    public String getAttributeValue(String attributeName)
    Returns the decoded value of the attribute with the specified name (case insensitive).

    Returns null if the start tag of this element does not have attributes, no attribute with the specified name exists or the attribute has no value.

    This is equivalent to getStartTag().getAttributeValue(attributeName).

    Parameters:
    attributeName - the name of the attribute to get.
    Returns:
    the decoded value of the attribute with the specified name, or null if the attribute does not exist or has no value.
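
    The three situations in which null is returned can be illustrated with a small stand-in. This sketch does not use the library: AttributeLookupSketch and its map-based representation of attributes are assumptions made purely to show the lookup semantics described above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class AttributeLookupSketch {

    /**
     * @param attributes lower-cased attribute names mapped to decoded values;
     *                   a null map models a start tag without attributes, and
     *                   a null value models an attribute written without a value
     */
    static String getAttributeValue(Map<String, String> attributes, String attributeName) {
        if (attributes == null) {
            return null; // the start tag has no attributes at all
        }
        // Map.get returns null both for a missing key and for a null value.
        return attributes.get(attributeName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Models <input TYPE="text" disabled>
        Map<String, String> attributes = new HashMap<String, String>();
        attributes.put("type", "text");
        attributes.put("disabled", null);

        System.out.println(getAttributeValue(attributes, "Type"));     // text
        System.out.println(getAttributeValue(attributes, "disabled")); // null
        System.out.println(getAttributeValue(attributes, "name"));     // null
        System.out.println(getAttributeValue(null, "type"));           // null
    }
}
```

    Because all three cases collapse to null, the return value alone cannot tell a missing attribute from one that has no value.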

    getAttributes

    public Attributes getAttributes()
    Returns the attributes specified in this element's start tag.

    This is equivalent to getStartTag().getAttributes().

    Returns:
    the attributes specified in this element's start tag.

    getChildElements

    public final List getChildElements()
    Returns a list of the immediate children of this element in the document element hierarchy.
    Overrides:
    getChildElements in class Segment
    Returns:
    a list of the immediate children of this element in the document element hierarchy, guaranteed not null.

    getContent

    public Segment getContent()
    Returns the segment representing the content of the element.

    This segment spans from the end of the start tag to the start of the end tag. If there is no end tag, the content extends to the end of the element.

    Note that before version 2.0 this method returned null if the element was empty, whereas now a zero-length segment is returned.

    Returns:
    the segment representing the content of the element, guaranteed not null.

    getContentText

    public String getContentText()

    Deprecated. Use isEmpty() ? null : getContent().toString() instead.

    Returns the content text of the element.

    This method has been deprecated as of version 2.0 as the Segment returned by the getContent() method now implements CharSequence and can be used directly in many cases. Use getContent().toString() if a String is required.

    Returns:
    the content text of the element, or null if the element is empty.
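
    Code that relied on the null return value can keep that behaviour with the replacement expression from the deprecation note. The sketch below is self-contained and does not use the library: ElementLike is a hypothetical stand-in that exposes only the two methods the expression needs.

```java
final class ContentTextMigrationSketch {

    /** Minimal stand-in for the two Element methods used by the replacement. */
    interface ElementLike {
        boolean isEmpty();
        CharSequence getContent();
    }

    /** Same result as the deprecated getContentText(): null for empty content. */
    static String contentTextOrNull(ElementLike element) {
        return element.isEmpty() ? null : element.getContent().toString();
    }

    /** Builds a stand-in element with the given content, for demonstration. */
    static ElementLike withContent(final String content) {
        return new ElementLike() {
            public boolean isEmpty() {
                return content.length() == 0;
            }

            public CharSequence getContent() {
                return content;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(contentTextOrNull(withContent("hello"))); // hello
        System.out.println(contentTextOrNull(withContent("")));      // null
    }
}
```

    Where an empty string is as acceptable as null, getContent().toString() on its own is sufficient.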

    getDebugInfo

    public String getDebugInfo()
    Returns a string representation of this object useful for debugging purposes.
    Overrides:
    getDebugInfo in class Segment
    Returns:
    a string representation of this object useful for debugging purposes.

    getDepth

    public int getDepth()
    Returns the nesting depth of this element in the document element hierarchy.

    The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used.

    A top-level element has a nesting depth of 0.

    An element formed from a server tag always has a nesting depth of 0, regardless of whether it is nested inside a normal element.

    See the Source.getChildElements() method for more details.

    Returns:
    the nesting depth of this element in the document element hierarchy.

    getEndTag

    public EndTag getEndTag()
    Returns the end tag of the element.

    If the element has no end tag this method returns null.

    Returns:
    the end tag of the element, or null if the element has no end tag.

    getFormControl

    public FormControl getFormControl()
    Returns the FormControl defined by this element.
    Returns:
    the FormControl defined by this element, or null if it is not a control.

    getName

    public String getName()
    Returns the name of the start tag of this element, always in lower case.

    This is equivalent to getStartTag().getName().

    See the Tag.getName() method for more information.

    Returns:
    the name of the start tag of this element, always in lower case.

    getParentElement

    public Element getParentElement()
    Returns the parent of this element in the document element hierarchy.

    The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used.

    This method returns null for a top-level element, as well as any element formed from a server tag, regardless of whether it is nested inside a normal element.

    See the Source.getChildElements() method for more details.

    Returns:
    the parent of this element in the document element hierarchy, or null if this element is a top-level element.
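
    The relationship between the parent and the nesting depth can be pictured as a walk up the parent chain. This is an illustrative sketch only: Node is a hypothetical stand-in for an element, and it assumes that the depth equals the number of ancestors. The documentation states only that a top-level element has a depth of 0 and a null parent.

```java
final class DepthSketch {

    /** Hypothetical stand-in for an element; parent is null at the top level. */
    static final class Node {
        final String name;
        final Node parent;

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Counts the ancestors of the given node; a top-level node yields 0. */
    static int depthOf(Node element) {
        int depth = 0;
        for (Node p = element.parent; p != null; p = p.parent) {
            depth++;
        }
        return depth;
    }

    public static void main(String[] args) {
        Node html = new Node("html", null);
        Node body = new Node("body", html);
        Node p = new Node("p", body);

        System.out.println(depthOf(html)); // 0
        System.out.println(depthOf(p));    // 2
    }
}
```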

    getStartTag

    public StartTag getStartTag()
    Returns the start tag of the element.
    Returns:
    the start tag of the element.

    isBlock

    public static boolean isBlock(String elementName)

    Deprecated. Use HTMLElements.getBlockLevelElementNames().contains(elementName.toLowerCase()) instead.

    Indicates whether the specified element name is an HTML block-level element.

    This method has been deprecated as of version 2.0 as the HTMLElements.getBlockLevelElementNames() method now provides a complete set of the element names for which this method returns true.

    Parameters:
    elementName - an element name.
    Returns:
    true if the specified element name is an HTML block-level element, otherwise false.

    isEmpty

    public boolean isEmpty()
    Indicates whether this element has zero-length content.

    This is equivalent to getContent().length()==0.

    Note that this is a broader definition than that of both the HTML definition of an empty element, which is only those elements whose end tag is forbidden, and the XML definition of an empty element, which is "either a start-tag immediately followed by an end-tag, or an empty-element tag". The other possibility covered by this property is the case of an HTML element with an optional end tag that is immediately followed by another tag that implicitly terminates the element.

    Returns:
    true if this element has zero-length content, otherwise false.

    isEmptyElementTag

    public boolean isEmptyElementTag()
    Indicates whether this element is an empty-element tag.

    An empty-element tag is signified by an empty element whose start tag ends with the characters "/>".

    This is equivalent to isEmpty() && getStartTag().isEmptyElementTag().

    The StartTag.isEmptyElementTag() property only checks whether the start tag is syntactically an empty-element tag, whereas this property also makes sure the element is in fact empty.

    A syntactical empty-element tag that is not actually empty can occur if the end tag of an HTML element is either required or optional, but the start tag is erroneously terminated with the characters "/>" in the source document. All major browsers ignore the syntactical hint of an empty element in this case, even in an XHTML document, so this parser does the same.

    Returns:
    true if this element is an empty-element tag, otherwise false.
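
    The difference between the syntactic check and the element-level check can be sketched as follows. This is a stand-in and not the library's implementation: the start tag is modelled as its source text and the content as a length, both of which are assumptions made for illustration.

```java
final class EmptyElementTagSketch {

    /** Stands in for StartTag.isEmptyElementTag(): a purely syntactic check. */
    static boolean startTagIsSyntacticallyEmpty(String startTagText) {
        return startTagText.endsWith("/>");
    }

    /** Stands in for Element.isEmptyElementTag(): syntax plus empty content. */
    static boolean isEmptyElementTag(String startTagText, int contentLength) {
        boolean isEmpty = contentLength == 0;
        return isEmpty && startTagIsSyntacticallyEmpty(startTagText);
    }

    public static void main(String[] args) {
        // "<br/>": syntactically empty and no content.
        System.out.println(isEmptyElementTag("<br/>", 0)); // true
        // "<p/>text": the "/>" hint is ignored, so the content is "text".
        System.out.println(isEmptyElementTag("<p/>", 4));  // false
        // "<br>": empty content, but not written as an empty-element tag.
        System.out.println(isEmptyElementTag("<br>", 0));  // false
    }
}
```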

    isInline

    public static boolean isInline(String elementName)

    Deprecated. Use HTMLElements.getInlineLevelElementNames().contains(elementName.toLowerCase()) instead.

    Indicates whether the specified element name is an HTML inline-level element.

    This method has been deprecated as of version 2.0 as the HTMLElements.getInlineLevelElementNames() method now provides a complete set of the element names for which this method returns true.

    Parameters:
    elementName - an element name.
    Returns:
    true if the specified element name is an HTML inline-level element, otherwise false.
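
    The replacements given for the deprecated isBlock and isInline methods both reduce to a lookup of the lower-cased name in a set. The sketch below shows that pattern without using the library: the two sets are small illustrative subsets chosen for this example, not the contents returned by HTMLElements.getBlockLevelElementNames() or HTMLElements.getInlineLevelElementNames().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ElementNameSetSketch {

    // Illustrative subsets only; the real sets come from the HTMLElements class.
    static final Set<String> BLOCK_LEVEL = Collections.unmodifiableSet(
            new HashSet<String>(Arrays.asList("div", "p", "table", "ul")));
    static final Set<String> INLINE_LEVEL = Collections.unmodifiableSet(
            new HashSet<String>(Arrays.asList("a", "em", "span", "strong")));

    static boolean isBlock(String elementName) {
        // Locale.ROOT keeps the lower-casing independent of the default locale.
        return BLOCK_LEVEL.contains(elementName.toLowerCase(Locale.ROOT));
    }

    static boolean isInline(String elementName) {
        return INLINE_LEVEL.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isBlock("DIV"));   // true
        System.out.println(isInline("DIV"));  // false
        System.out.println(isInline("Span")); // true
    }
}
```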