au.id.jericho.lib.html

Class StartTag

Implemented Interfaces:
CharSequence, Comparable, HTMLElementName

public final class StartTag
extends Tag

Represents the start tag of an element in a specific source document.

A start tag always has a type that is a subclass of StartTagType, meaning that any tag that does not start with the characters '</' is categorised as a start tag.

This includes many tags which stand alone, without a corresponding end tag, and would not intuitively be categorised as a "start tag". For example, an HTML comment is represented as a single start tag that spans the whole comment, and does not have an end tag at all.

See the static fields defined in the StartTagType class for a list of the standard start tag types.

StartTag instances are obtained using one of the following methods:

The methods above which accept a name parameter are categorised as named search methods.

In such methods dealing with start tags, specifying an argument to the name parameter that ends in a colon (:) searches for all start tags in the specified XML namespace.

The constants defined in the HTMLElementName interface can be used directly as arguments to these name parameters. For example, source.findAllStartTags(HTMLElementName.A) is equivalent to source.findAllStartTags("a"), and finds all hyperlink start tags.
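
The name-matching rule described above can be modelled in a few lines of plain Java. This is a minimal sketch and not the library's implementation: the class NameMatchSketch and its matches method are hypothetical names introduced for illustration, and the case-insensitive comparison of names is an assumption.

```java
import java.util.Locale;

final class NameMatchSketch {
    // Returns true if a tag with the given name would be selected by the search argument.
    // Assumption: names are compared case-insensitively.
    static boolean matches(String tagName, String searchName) {
        String tag = tagName.toLowerCase(Locale.ROOT);
        String search = searchName.toLowerCase(Locale.ROOT);
        if (search.endsWith(":")) {
            // A search argument ending in a colon selects every name in that namespace.
            return tag.startsWith(search);
        }
        // Otherwise the whole name must match.
        return tag.equals(search);
    }
}
```

Under these assumptions, matches("o:p", "o:") is true while matches("object", "o:") is false, because only names that begin with the full prefix, colon included, belong to the namespace.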

The Tag superclass defines a method called getName() to get the name of this start tag.

See also the XML 1.0 specification for start tags.

See Also:
Tag, Element, EndTag

Field Summary

Fields inherited from class au.id.jericho.lib.html.Tag

DOCTYPE_DECLARATION, PROCESSING_INSTRUCTION, SERVER_COMMON, SERVER_MASON_COMPONENT_CALL, SERVER_MASON_COMPONENT_CALLED_WITH_CONTENT, SERVER_MASON_NAMED_BLOCK, SERVER_PHP, XML_DECLARATION

Fields inherited from interface au.id.jericho.lib.html.HTMLElementName

A, ABBR, ACRONYM, ADDRESS, APPLET, AREA, B, BASE, BASEFONT, BDO, BIG, BLOCKQUOTE, BODY, BR, BUTTON, CAPTION, CENTER, CITE, CODE, COL, COLGROUP, DD, DEL, DFN, DIR, DIV, DL, DT, EM, FIELDSET, FONT, FORM, FRAME, FRAMESET, H1, H2, H3, H4, H5, H6, HEAD, HR, HTML, I, IFRAME, IMG, INPUT, INS, ISINDEX, KBD, LABEL, LEGEND, LI, LINK, MAP, MENU, META, NOFRAMES, NOSCRIPT, OBJECT, OL, OPTGROUP, OPTION, P, PARAM, PRE, Q, S, SAMP, SCRIPT, SELECT, SMALL, SPAN, STRIKE, STRONG, STYLE, SUB, SUP, TABLE, TBODY, TD, TEXTAREA, TFOOT, TH, THEAD, TITLE, TR, TT, U, UL, VAR

Method Summary

EndTag
findEndTag()
Deprecated. Use getElement().getEndTag() instead.
static String
generateHTML(String tagName, Map attributesMap, boolean emptyElementTag)
Generates the HTML text of a normal start tag with the specified tag name and attributes map.
String
getAttributeValue(String attributeName)
Returns the decoded value of the attribute with the specified name (case insensitive).
Attributes
getAttributes()
Returns the attributes specified in this start tag.
String
getDebugInfo()
Returns a string representation of this object useful for debugging purposes.
Element
getElement()
Returns the element that is started by this start tag.
Segment
getFollowingTextSegment()
Deprecated. Use new Segment(source,getEnd(),findNextTag().getBegin()) instead.
FormControl
getFormControl()
Returns the FormControl defined by this start tag.
FormControlType
getFormControlType()
Deprecated. Use getFormControl().getFormControlType() instead.
StartTagType
getStartTagType()
Returns the type of this start tag.
Segment
getTagContent()
Returns the segment between the end of the tag's name and the start of its end delimiter.
TagType
getTagType()
Returns the type of this tag.
boolean
isComment()
Deprecated. Use getTagType()==StartTagType.COMMENT instead.
boolean
isCommonServerTag()
Deprecated. Use getTagType()==StartTagType.SERVER_COMMON instead.
boolean
isDocTypeDeclaration()
Deprecated. Use getTagType()==StartTagType.DOCTYPE_DECLARATION instead.
boolean
isEmptyElementTag()
Indicates whether this start tag is syntactically an empty-element tag.
boolean
isEndTagForbidden()
Indicates whether a matching end tag is forbidden.
boolean
isEndTagOptional()
Deprecated. Use getStartTagType()==StartTagType.NORMAL && HTMLElements.getEndTagOptionalElementNames().contains(getName()) instead.
boolean
isEndTagRequired()
Indicates whether a matching end tag is required.
boolean
isMasonComponentCall()
Deprecated. Use getTagType()==MasonTagTypes.MASON_COMPONENT_CALL instead.
boolean
isMasonComponentCalledWithContent()
Deprecated. Use getTagType()==MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT instead.
boolean
isMasonNamedBlock()
Deprecated. Use getTagType()==MasonTagTypes.MASON_NAMED_BLOCK instead.
boolean
isMasonTag()
Deprecated. Use MasonTagTypes.isParsedByMason(getTagType()) instead.
boolean
isPHPTag()
Deprecated. Use getTagType()==PHPTagTypes.PHP_STANDARD instead.
boolean
isProcessingInstruction()
Deprecated. Use charAt(1)=='?' instead for backward compatibility.
boolean
isServerTag()
Deprecated. Use getTagType().isServerTag() instead.
boolean
isUnregistered()
Indicates whether this tag has a syntax that does not match any of the registered tag types.
boolean
isXMLDeclaration()
Deprecated. Use getTagType()==StartTagType.XML_DECLARATION instead.
Attributes
parseAttributes()
Parses the attributes specified in this start tag, regardless of the type of start tag.
Attributes
parseAttributes(int maxErrorCount)
Parses the attributes specified in this start tag, regardless of the type of start tag.
String
regenerateHTML()
Deprecated. Use tidy() instead.
String
tidy()
Returns an XML representation of this start tag.
String
tidy(boolean toXHTML)
Returns an XML or XHTML representation of this start tag.

Methods inherited from class au.id.jericho.lib.html.Tag

findNextTag, findPreviousTag, getElement, getName, getNameSegment, getTagType, getUserData, isUnregistered, isXMLName, isXMLNameChar, isXMLNameStartChar, regenerateHTML, setUserData, tidy

Methods inherited from class au.id.jericho.lib.html.Segment

charAt, compareTo, encloses, encloses, equals, extractText, extractText, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findAllTags, findAllTags, findFormControls, findFormFields, findWords, getBegin, getChildElements, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString

Method Details

findEndTag

public EndTag findEndTag()

Deprecated. Use getElement().getEndTag() instead.

Returns the end tag that corresponds to this start tag.

This method has been deprecated as of version 2.0 because it existed mainly for backward compatibility with version 1.0.

The getElement() method is much more useful as it determines the span of the element even if the end tag is optional and is not present in the source document.

This method, on the other hand, simply returns null in that case, revealing nothing about where the element ends.

Returns:
the end tag that corresponds to this start tag, or null if it does not exist in the source document.

generateHTML

public static String generateHTML(String tagName,
                                  Map attributesMap,
                                  boolean emptyElementTag)
Generates the HTML text of a normal start tag with the specified tag name and attributes map.

The output of the attributes is as described in the Attributes.generateHTML(Map attributesMap) method.

The emptyElementTag parameter specifies whether the start tag should be an empty-element tag, in which case a slash is inserted before the closing angle bracket, separated from the name or last attribute by a single space.

Example:

The following code:
 LinkedHashMap attributesMap=new LinkedHashMap();
 attributesMap.put("name","Company");
 attributesMap.put("value","G\u00fcnter O'Reilly & Associés");
 System.out.println(StartTag.generateHTML("INPUT",attributesMap,true));
generates the following output:

<INPUT name="Company" value="G&uuml;nter O'Reilly &amp; Associ&eacute;s" />

Parameters:
tagName - the name of the start tag.
attributesMap - a map containing attribute name/value pairs.
emptyElementTag - specifies whether the start tag should be an empty-element tag.
Returns:
the HTML text of a normal start tag with the specified tag name and attributes map.
See Also:
EndTag.generateHTML(String tagName)
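
The layout rules described for this method (attributes in map order, each preceded by a single space, and an optional " /" before the closing angle bracket) can be sketched with the JDK alone. This is a simplified model rather than the library's code: StartTagTextSketch, escape and generate are hypothetical names, values are assumed to be non-null, and only the four characters that would break a double-quoted value are escaped, whereas the library's output shown above also emits references such as &uuml;.

```java
import java.util.Map;

final class StartTagTextSketch {
    // Escapes only the characters that would break a double-quoted attribute value.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    // Builds the text of a start tag from a name and an ordered attribute map.
    static String generate(String tagName, Map<String, String> attributes, boolean emptyElementTag) {
        StringBuilder sb = new StringBuilder("<").append(tagName);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ').append(e.getKey())
              .append("=\"").append(escape(e.getValue())).append('"');
        }
        // The slash is separated from the name or last attribute by a single space.
        if (emptyElementTag) {
            sb.append(" /");
        }
        return sb.append('>').toString();
    }
}
```

Passing a LinkedHashMap keeps the attributes in insertion order, which is why the example above uses one.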

getAttributeValue

public String getAttributeValue(String attributeName)
Returns the decoded value of the attribute with the specified name (case insensitive).

Returns null if this start tag does not have attributes, no attribute with the specified name exists or the attribute has no value.

This is equivalent to getAttributes().getValue(attributeName), except that it returns null if this start tag does not have attributes instead of throwing a NullPointerException.

Parameters:
attributeName - the name of the attribute to get.
Returns:
the decoded value of the attribute with the specified name, or null if the attribute does not exist or has no value.

getAttributes

public Attributes getAttributes()
Returns the attributes specified in this start tag.

The return value is non-null if and only if getStartTagType().hasAttributes() returns true.

To force the parsing of attributes in other start tag types, use the parseAttributes() method instead.

Returns:
the attributes specified in this start tag, or null if the type of this start tag does not have attributes.
See Also:
parseAttributes(), Source.parseAttributes(int pos, int maxEnd)

getDebugInfo

public String getDebugInfo()
Returns a string representation of this object useful for debugging purposes.
Overrides:
getDebugInfo in class Segment
Returns:
a string representation of this object useful for debugging purposes.

getElement

public Element getElement()
Returns the element that is started by this start tag. Guaranteed not null.

Example 1: Elements for which the end tag is required

 1. <div>
 2.   <div>
 3.     <div>
 4.       <div>This is line 4</div>
 5.     </div>
 6.     <div>This is line 6</div>
 7.   </div>
  • The start tag on line 1 returns an empty element spanning only the start tag. This is because the end tag of a <div> element is required, making the sample code invalid as all the end tags are matched with other start tags.
  • The start tag on line 2 returns an element spanning to the end of line 7.
  • The start tag on line 3 returns an element spanning to the end of line 5.
  • The start tag on line 4 returns an element spanning to the end of line 4.
  • The start tag on line 6 returns an element spanning to the end of line 6.

Example 2: Elements for which the end tag is optional

 1. <ul>
 2.   <li>item 1
 3.   <li>item 2
 4.     <ul>
 5.       <li>subitem 1</li>
 6.       <li>subitem 2
 7.     </ul>
 8.   <li>item 3</li>
 9. </ul>
  • The start tag on line 1 returns an element spanning to the end of line 9.
  • The start tag on line 2 returns an element spanning to the start of the <li> start tag on line 3.
  • The start tag on line 3 returns an element spanning to the start of the <li> start tag on line 8.
  • The start tag on line 4 returns an element spanning to the end of line 7.
  • The start tag on line 5 returns an element spanning to the end of line 5.
  • The start tag on line 6 returns an element spanning to the start of the </ul> end tag on line 7.
  • The start tag on line 8 returns an element spanning to the end of line 8.
Overrides:
getElement in class Tag
Returns:
the element that is started by this start tag.

getFollowingTextSegment

public Segment getFollowingTextSegment()

Deprecated. Use new Segment(source,getEnd(),findNextTag().getBegin()) instead.

Returns the segment containing the text that immediately follows this start tag up until the start of the following tag.

Guaranteed not null.

This method has been deprecated as of version 2.0 as it is no longer used internally and has no practical use as a public method.

Returns:
the segment containing the text that immediately follows this start tag up until the start of the following tag.

getFormControl

public FormControl getFormControl()
Returns the FormControl defined by this start tag.

This is equivalent to getElement().getFormControl().

Returns:
the FormControl defined by this start tag, or null if it is not a control.

getFormControlType

public FormControlType getFormControlType()

Deprecated. Use getFormControl().getFormControlType() instead.

Returns the FormControlType of this start tag.

This method has been deprecated as of version 2.0 as it is no longer used internally and has no practical use as a public method.

Returns:
the form control type of this start tag, or null if it is not a control.

getStartTagType

public StartTagType getStartTagType()
Returns the type of this start tag.

This is equivalent to (StartTagType)getTagType().

Returns:
the type of this start tag.

getTagContent

public Segment getTagContent()
Returns the segment between the end of the tag's name and the start of its end delimiter.

This method is normally only of use for start tags whose content is something other than attributes.

A new Segment object is created with each call to this method.

Returns:
the segment between the end of the tag's name and the start of the end delimiter.

getTagType

public TagType getTagType()
Returns the type of this tag.
Overrides:
getTagType in class Tag
Returns:
the type of this tag.

isComment

public boolean isComment()

Deprecated. Use getTagType()==StartTagType.COMMENT instead.

Indicates whether this start tag is of type StartTagType.COMMENT.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Overrides:
isComment in class Segment
Returns:
true if this start tag is of type StartTagType.COMMENT, otherwise false.

isCommonServerTag

public boolean isCommonServerTag()

Deprecated. Use getTagType()==StartTagType.SERVER_COMMON instead.

Indicates whether this start tag has a type of StartTagType.SERVER_COMMON.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if this start tag has a type of StartTagType.SERVER_COMMON, otherwise false.

isDocTypeDeclaration

public boolean isDocTypeDeclaration()

Deprecated. Use getTagType()==StartTagType.DOCTYPE_DECLARATION instead.

Indicates whether this start tag has a type of StartTagType.DOCTYPE_DECLARATION.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if this start tag has a type of StartTagType.DOCTYPE_DECLARATION, otherwise false.

isEmptyElementTag

public boolean isEmptyElementTag()
Indicates whether this start tag is syntactically an empty-element tag.

This is signified by the characters "/>" at the end of the start tag.

Only a normal start tag can be an empty-element tag.

This property simply reports whether the syntax of the start tag is consistent with that of an empty-element tag; it does not guarantee that this start tag's element is actually empty.

This possible discrepancy reflects the way major browsers interpret illegal empty element tags used in HTML elements, and is explained further in the documentation of the Element.isEmptyElementTag() property.

Compare this property with the Element.isEmptyElementTag() property, which does check that the element is actually empty.

Returns:
true if this start tag is syntactically an empty-element tag, otherwise false.
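
The purely syntactic nature of this property can be illustrated without the library. The sketch below is a simplified model, with hypothetical class and method names, that inspects only the text of a tag; it ignores tag types and makes no attempt to decide whether the element actually has content.

```java
final class EmptyElementSyntaxSketch {
    // True if the text begins with '<' and ends with the characters "/>".
    static boolean looksLikeEmptyElementTag(String startTagText) {
        return startTagText.startsWith("<") && startTagText.endsWith("/>");
    }
}
```

Under this model both "<br />" and "<div/>" qualify, which is the kind of case where the syntactic answer and the element-level check in Element.isEmptyElementTag() may differ.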

isEndTagForbidden

public boolean isEndTagForbidden()
Indicates whether a matching end tag is forbidden.

This property returns true if one of the following conditions is met:

If this property returns true then this start tag's element will always be a single tag element.

Returns:
true if a matching end tag is forbidden, otherwise false.

isEndTagOptional

public boolean isEndTagOptional()

Deprecated. Use getStartTagType()==StartTagType.NORMAL && HTMLElements.getEndTagOptionalElementNames().contains(getName()) instead.

Indicates whether a matching end tag is optional according to the HTML 4.01 specification.

This method has been deprecated as of version 2.0 and replaced with the HTMLElements.getEndTagOptionalElementNames() static method.

This property is only relevant to start tags forming part of an HTML element and returns false in all other cases.

Returns:
true if a matching end tag is optional according to the HTML 4.01 specification, otherwise false.

isEndTagRequired

public boolean isEndTagRequired()
Indicates whether a matching end tag is required.

This property returns true if one of the following conditions is met:

Returns:
true if a matching end tag is required, otherwise false.

isMasonComponentCall

public boolean isMasonComponentCall()

Deprecated. Use getTagType()==MasonTagTypes.MASON_COMPONENT_CALL instead.

Indicates whether this start tag has a type of MasonTagTypes.MASON_COMPONENT_CALL.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if this start tag has a type of MasonTagTypes.MASON_COMPONENT_CALL, otherwise false.

isMasonComponentCalledWithContent

public boolean isMasonComponentCalledWithContent()

Deprecated. Use getTagType()==MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT instead.

Indicates whether this start tag has a type of MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if this start tag has a type of MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT, otherwise false.

isMasonNamedBlock

public boolean isMasonNamedBlock()

Deprecated. Use getTagType()==MasonTagTypes.MASON_NAMED_BLOCK instead.

Indicates whether this start tag has a type of MasonTagTypes.MASON_NAMED_BLOCK.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if this start tag has a type of MasonTagTypes.MASON_NAMED_BLOCK, otherwise false.

isMasonTag

public boolean isMasonTag()

Deprecated. Use MasonTagTypes.isParsedByMason(getTagType()) instead.

Indicates whether this start tag would be parsed by a Mason server.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if this start tag would be parsed by a Mason server, otherwise false.

isPHPTag

public boolean isPHPTag()

Deprecated. Use getTagType()==PHPTagTypes.PHP_STANDARD instead.

Indicates whether this start tag has a type of PHPTagTypes.PHP_STANDARD.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if this start tag has a type of PHPTagTypes.PHP_STANDARD, otherwise false.

isProcessingInstruction

public boolean isProcessingInstruction()

Deprecated. Use charAt(1)=='?' instead for backward compatibility.

Indicates whether this start tag has a type of StartTagType.XML_PROCESSING_INSTRUCTION or is any other tag starting with "<?".

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if this start tag has a type of StartTagType.XML_PROCESSING_INSTRUCTION or is any other tag starting with "<?", otherwise false.

isServerTag

public boolean isServerTag()

Deprecated. Use getTagType().isServerTag() instead.

Indicates whether the start tag is a server tag.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if the start tag is a server tag, otherwise false.

isUnregistered

public boolean isUnregistered()
Indicates whether this tag has a syntax that does not match any of the registered tag types.

The only requirement of an unregistered tag type is that it starts with '<' and there is a closing '>' character at some position after it in the source document.

The absence or presence of a '/' character after the initial '<' determines whether an unregistered tag is respectively a StartTag with a type of StartTagType.UNREGISTERED or an EndTag with a type of EndTagType.UNREGISTERED.

There are no restrictions on the characters that might appear between these delimiters, including other '<' characters. This may result in a '>' character that is identified as the closing delimiter of two separate tags, one an unregistered tag, and the other a tag of any type that begins in the middle of the unregistered tag. As explained below, unregistered tags are usually only found when specifically looking for them, so it is up to the user to detect and deal with any such nonsensical results.

Unregistered tags are returned only by the Source.getTagAt(int pos) method; by named search methods, where the specified name matches the first characters inside the tag; and by tag type search methods, where the specified tagType is either StartTagType.UNREGISTERED or EndTagType.UNREGISTERED.

Open tag searches and other searches always ignore unregistered tags, although every discovery of an unregistered tag is logged by the parser.

The logic behind this design is that unregistered tag types are usually the result of a '<' character in the text that was mistakenly left unencoded, or a less-than operator inside a script, or some other occurrence which is of no interest to the user. By returning unregistered tags in named and tag type search methods, the library allows the user to specifically search for tags with a certain syntax that does not match any existing TagType. This expediency feature avoids the need for the user to create a custom tag type to define the syntax before searching for these tags. By not returning unregistered tags in the less specific search methods, it is providing only the information that most users are interested in.
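
The two structural rules given above (a '<' with a '>' somewhere after it, and a '/' directly after the '<' deciding between start tag and end tag) can be sketched as follows. This is an illustrative model only: UnregisteredTagSketch, Kind and classify are hypothetical names, and the sketch does not first rule out registered tag types as the parser would.

```java
final class UnregisteredTagSketch {
    enum Kind { START_TAG, END_TAG, NOT_A_TAG }

    // Classifies the text beginning at pos using only the minimal delimiter rules.
    static Kind classify(String source, int pos) {
        if (pos < 0 || pos >= source.length() || source.charAt(pos) != '<') {
            return Kind.NOT_A_TAG;
        }
        // A closing '>' must exist at some position after the '<'.
        if (source.indexOf('>', pos + 1) == -1) {
            return Kind.NOT_A_TAG;
        }
        // A '/' immediately after the '<' marks an end tag; anything else is a start tag.
        boolean slashFollows = pos + 1 < source.length() && source.charAt(pos + 1) == '/';
        return slashFollows ? Kind.END_TAG : Kind.START_TAG;
    }
}
```

For the text "a < b > c", classify at position 2 yields START_TAG, which mirrors the unencoded less-than case described above and shows why such results are kept out of the general search methods.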

Overrides:
isUnregistered in class Tag
Returns:
true if this tag has a syntax that does not match any of the registered tag types, otherwise false.

isXMLDeclaration

public boolean isXMLDeclaration()

Deprecated. Use getTagType()==StartTagType.XML_DECLARATION instead.

Indicates whether this start tag has a type of StartTagType.XML_DECLARATION.

This method has been deprecated as of version 2.0 as its functionality is now easily performed without a dedicated method.

Returns:
true if this start tag has a type of StartTagType.XML_DECLARATION, otherwise false.

parseAttributes

public Attributes parseAttributes()
Parses the attributes specified in this start tag, regardless of the type of start tag. This method is only required in the unusual situation where attributes exist in a start tag whose type doesn't have attributes.

This method returns the cached attributes from the getAttributes() method if its value is not null, otherwise the source is physically parsed with each call to this method.

This is equivalent to parseAttributes(Attributes.getDefaultMaxErrorCount()).

Overrides:
parseAttributes in class Segment
Returns:
the attributes specified in this start tag, or null if too many errors occur while parsing.
See Also:
getAttributes(), Source.parseAttributes(int pos, int maxEnd)

parseAttributes

public Attributes parseAttributes(int maxErrorCount)
Parses the attributes specified in this start tag, regardless of the type of start tag. This method is only required in the unusual situation where attributes exist in a start tag whose type doesn't have attributes.

See the documentation of the parseAttributes() method for more information.

Parameters:
maxErrorCount - the maximum number of minor errors allowed while parsing.
Returns:
the attributes specified in this start tag, or null if too many errors occur while parsing.

regenerateHTML

public String regenerateHTML()

Deprecated. Use tidy() instead.

Returns an XML representation of this start tag.

This method has been deprecated as of version 2.2 and replaced with the exactly equivalent tidy() method.

Overrides:
regenerateHTML in class Tag
Returns:
an XML representation of this start tag, or the source text if it is of a type that does not have attributes.

tidy

public String tidy()
Returns an XML representation of this start tag.

This is equivalent to tidy(false), thereby keeping the name of the tag in its original case.

See the documentation of the tidy(boolean toXHTML) method for more details.

Overrides:
tidy in class Tag
Returns:
an XML representation of this start tag, or the source text if it is of a type that does not have attributes.

tidy

public String tidy(boolean toXHTML)
Returns an XML or XHTML representation of this start tag.

The tidying of the tag is carried out as follows:

  • if this start tag is of a type that does not have attributes, then the original source text is returned.
  • name converted to lower case if the toXHTML argument is true and this is a normal start tag
  • attributes separated by a single space
  • attribute names in original case
  • attribute values are enclosed in double quotes and re-encoded
  • if this start tag forms an HTML element that has no end tag, a slash is inserted before the closing angle bracket, separated from the name or last attribute by a single space.
  • if an attribute value contains a server tag it is inserted verbatim instead of being encoded.

The toXHTML parameter determines only whether the name is converted to lower case for normal tags. In all other respects the generated tag is already valid XHTML.

Example:

The following source text:

<INPUT name=Company value='G&uuml;nter O&#39;Reilly &amp Associés'>

produces the following regenerated HTML:

<input name="Company" value="G&uuml;nter O'Reilly &amp; Associ&eacute;s" />

Parameters:
toXHTML - specifies whether the output is XHTML.
Returns:
an XML or XHTML representation of this start tag, or the source text if it is of a type that does not have attributes.
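
The tidying steps listed for this method can be approximated in plain Java. The following is a minimal sketch under stated assumptions, not the library's implementation: TidySketch and its methods are hypothetical names, the attribute values are assumed to be already decoded and non-null, the hasEndTag flag stands in for the library's own knowledge of the element, and the re-encoding handles only four characters.

```java
import java.util.Locale;
import java.util.Map;

final class TidySketch {
    // Simplified re-encoding of a decoded attribute value.
    static String encode(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    static String tidy(String name, Map<String, String> attributes, boolean toXHTML, boolean hasEndTag) {
        StringBuilder sb = new StringBuilder("<");
        // The name is lower-cased only when XHTML output is requested.
        sb.append(toXHTML ? name.toLowerCase(Locale.ROOT) : name);
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            // Attribute names keep their original case; values are double-quoted.
            sb.append(' ').append(a.getKey())
              .append("=\"").append(encode(a.getValue())).append('"');
        }
        // An element without an end tag gets " /" before the closing angle bracket.
        if (!hasEndTag) {
            sb.append(" /");
        }
        return sb.append('>').toString();
    }
}
```

The only difference between the toXHTML settings in this model is the case of the tag name, matching the statement above that the generated tag is otherwise already valid XHTML.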