au.id.jericho.lib.html

Class TagType

Known Direct Subclasses:
EndTagType, StartTagType

public abstract class TagType
extends java.lang.Object

Defines the syntax for a tag type that can be recognised by the parser.

This class is the root abstract class common to all tag types, and contains methods to register and deregister tag types as well as various methods to aid in their implementation.

Every tag type is represented by an instance of a class (usually a singleton) that must be a subclass of either StartTagType or EndTagType. These two abstract classes, the only direct descendants of this class, represent the two major classifications under which every tag type exists.

The term predefined tag type refers to any of the tag types defined in this library, including both standard and extended tag types.

The term standard tag type refers to any of the tag types represented by instances in static fields of the StartTagType and EndTagType subclasses. Standard tag types are registered by default, and define the tags most commonly found in HTML documents.

The term extended tag type refers to any predefined tag type that is not a standard tag type. The PHPTagTypes and MasonTagTypes classes contain extended tag types related to their respective server platforms. The tag types defined within them must be registered by the user before they are recognised by the parser.

The term custom tag type refers to any user-defined tag type, or any tag type that is not a predefined tag type.

The tag recognition process of the parser gives each tag type a precedence level, which is primarily determined by the length of its start delimiter. A tag type with a more specific start delimiter is chosen in preference to one with a less specific start delimiter, assuming they both share the same prefix. If two tag types have exactly the same start delimiter, the one which was registered later has the higher precedence.
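The precedence rule above can be illustrated with a small self-contained sketch. It does not use the library's classes: PrecedenceSketch and its choose method are hypothetical names, and the code models only the rule as stated here (the longest matching start delimiter wins, and among identical delimiters the later registration wins). Case-insensitive matching is an assumption of the sketch.

```java
// Hypothetical model of the precedence rule; not part of the library.
class PrecedenceSketch {
    // 'registered' holds start delimiters in registration order (earliest first).
    // Returns the index of the highest-precedence delimiter that matches the
    // text at pos, or -1 if none of them match.
    static int choose(String[] registered, String text, int pos) {
        int best = -1;
        for (int i = 0; i < registered.length; i++) {
            String d = registered[i];
            if (!text.regionMatches(true, pos, d, 0, d.length())) continue;
            // ">=" lets a later registration win when the lengths are equal.
            if (best == -1 || d.length() >= registered[best].length()) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        String[] registered = {"<", "<?", "<?xml", "<?"};
        System.out.println(choose(registered, "<?xml version=\"1.0\"?>", 0)); // prints 2
        System.out.println(choose(registered, "<?php echo 1; ?>", 0));        // prints 3
    }
}
```

Because every candidate matches at the same position, all matching delimiters share a common prefix, so comparing lengths is enough to find the most specific one.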

The two special tag types StartTagType.UNREGISTERED and EndTagType.UNREGISTERED represent tags that do not match the syntax of any other tag type. They have the lowest precedence of all the tag types. The Tag.isUnregistered() method provides a detailed explanation of unregistered tags.

See the documentation of the tag parsing process for more information on how each tag is identified by the parser.

Note that the standard HTML element names do not represent different tag types. All standard HTML tags have a tag type of StartTagType.NORMAL or EndTagType.NORMAL, and are also referred to as normal tags.

Apart from the registration related methods, all of the methods in this class and its subclasses relate to the implementation of custom tag types and are not relevant to the majority of users who just use the predefined tag types.

For performance reasons, this library only allows tag types that start with a '<' character. The character following this defines the immediate subclass of the tag type. An EndTagType always has a slash ('/') as the second character, while a StartTagType has any character other than a slash as the second character. This definition means that tag types which are not intuitively classified as either start tag types or end tag types (such as an HTML comment) are mostly classified as start tag types.
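As a rough illustration of this classification, the following sketch decides between the two kinds from the start delimiter alone. DelimiterKindSketch and classify are hypothetical names that do not exist in the library, and the string results stand in for the StartTagType and EndTagType classes.

```java
// Hypothetical model of the start/end classification; not part of the library.
class DelimiterKindSketch {
    // Returns "end" if the second character is '/', otherwise "start".
    static String classify(String startDelimiter) {
        if (startDelimiter.isEmpty() || startDelimiter.charAt(0) != '<') {
            throw new IllegalArgumentException("start delimiter must begin with '<'");
        }
        boolean slashSecond = startDelimiter.length() > 1 && startDelimiter.charAt(1) == '/';
        return slashSecond ? "end" : "start";
    }

    public static void main(String[] args) {
        System.out.println(classify("<!--")); // an HTML comment is classified as a start tag type
        System.out.println(classify("</"));
    }
}
```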

Every method in this and the StartTagType and EndTagType abstract classes can be categorised as one of the following:

Properties:
Abstract implementation methods:
Default implementation methods:
Implementation assistance methods:
Registration related methods:

Method Summary

protected abstract Tag
constructTagAt(Source source, int pos)
Constructs a tag of this type at the specified position in the specified source document if it matches all of the required features.
void
deregister()
Deregisters this tag type.
String
getClosingDelimiter()
Returns the character sequence that marks the end of the tag.
String
getDescription()
Returns a description of this tag type useful for debugging purposes.
protected String
getNamePrefix()
Returns the name prefix required by this tag type.
static List
getRegisteredTagTypes()
Returns a list of all the currently registered tag types in order of lowest to highest precedence.
String
getStartDelimiter()
Returns the character sequence that marks the start of the tag.
static TagType[]
getTagTypesIgnoringEnclosedMarkup()
Returns an array of all the tag types inside which the parser ignores all other non-server tags in parse on demand mode.
boolean
isServerTag()
Indicates whether this tag type represents a server tag.
protected boolean
isValidPosition(Source source, int pos, int[] fullSequentialParseData)
Indicates whether a tag of this type is valid in the specified position of the specified source document.
void
register()
Registers this tag type for recognition by the parser.
static void
setTagTypesIgnoringEnclosedMarkup(TagType[] tagTypes)
Sets the tag types inside which the parser ignores all other non-server tags.
protected boolean
tagEncloses(Source source, int pos)
Indicates whether a tag of this type encloses the specified position of the specified source document.
String
toString()
Returns a string representation of this object useful for debugging purposes.

Method Details

constructTagAt

protected abstract Tag constructTagAt(Source source,
                                      int pos)
Constructs a tag of this type at the specified position in the specified source document if it matches all of the required features.
(abstract implementation method)

The implementation of this method must check that the text at the specified position meets all of the criteria of this tag type, including such checks as the presence of the correct or well formed closing delimiter, name, attributes, end tag, or any other distinguishing features.

It can be assumed that the specified position starts with the start delimiter of this tag type, and that all other tag types with higher precedence (if any) have already been rejected as candidates. Tag types with lower precedence will be considered if this method returns null.

This method is only called after a successful check of the tag's position, i.e. isValidPosition(source,pos,fullSequentialParseData)==true.

The StartTagTypeGenericImplementation and EndTagTypeGenericImplementation subclasses provide default implementations of this method that allow the required functions to be carried out using much simpler properties and implementation assistance methods.

Parameters:
source - the Source document.
pos - the position in the source document.
Returns:
a tag of this type at the specified position in the specified source document if it meets all of the required features, or null if it does not meet the criteria.

deregister

public final void deregister()
See Also:
register()

getClosingDelimiter

public final String getClosingDelimiter()
Returns the character sequence that marks the end of the tag.
(property method)

The character sequence must be all in lower case.

In a StartTag of a type that has attributes, characters appearing inside a quoted attribute value are ignored when determining the location of the closing delimiter.
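The effect of ignoring quoted attribute values can be sketched as follows. This is a simplified model rather than the library's algorithm: ClosingDelimiterSketch and findClosingDelimiter are hypothetical names, and the sketch treats every single or double quote after the starting position as delimiting an attribute value.

```java
// Hypothetical model of closing delimiter detection; not part of the library.
class ClosingDelimiterSketch {
    // Returns the index of the first occurrence of closingDelimiter at or after
    // 'from' that is not inside a quoted value, or -1 if there is none.
    static int findClosingDelimiter(String text, int from, String closingDelimiter) {
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;      // matching quote closes the value
            } else if (c == '"' || c == '\'') {
                quote = c;                      // quote opens a value
            } else if (text.startsWith(closingDelimiter, i)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String tag = "<a title=\"x > y\" href='z'>rest";
        // The '>' inside the title value is skipped; the final '>' is reported.
        System.out.println(findClosingDelimiter(tag, 1, ">"));
    }
}
```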

Note that the optional '/' character preceding the closing '>' in an empty-element tag is not considered part of the closing delimiter. This property must define the closing delimiter common to all instances of the tag type.

Tag Type                                 Closing Delimiter
StartTagType.UNREGISTERED                >
StartTagType.NORMAL                      >
StartTagType.COMMENT                     -->
StartTagType.XML_DECLARATION             ?>
StartTagType.XML_PROCESSING_INSTRUCTION  ?>
StartTagType.DOCTYPE_DECLARATION         >
StartTagType.MARKUP_DECLARATION          >
StartTagType.CDATA_SECTION               ]]>
StartTagType.SERVER_COMMON               %>
EndTagType.UNREGISTERED                  >
EndTagType.NORMAL                        >

Tag Type                                               Closing Delimiter
PHPTagTypes.PHP_SCRIPT                                 >
PHPTagTypes.PHP_SHORT                                  ?>
PHPTagTypes.PHP_STANDARD                               ?>
MasonTagTypes.MASON_COMPONENT_CALL                     &>
MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT      &>
MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT_END  >
MasonTagTypes.MASON_NAMED_BLOCK                        >
MasonTagTypes.MASON_NAMED_BLOCK_END                    >
Returns:
the character sequence that marks the end of the tag.

getDescription

public final String getDescription()
Returns:
a description of this tag type useful for debugging purposes.

getNamePrefix

protected final String getNamePrefix()
Returns the name prefix required by this tag type.
(property method)

This string is identical to the start delimiter, except that it does not include the initial "<" or "</" characters that always prefix the start delimiter of a StartTagType or EndTagType respectively.
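The relationship between the start delimiter and the name prefix can be sketched as a simple string operation. NamePrefixSketch and namePrefix are hypothetical names used only for illustration; in the library the value is a property of the tag type rather than something the caller computes.

```java
// Hypothetical model of the name prefix derivation; not part of the library.
class NamePrefixSketch {
    // Strips the leading "</" (end tag types) or "<" (start tag types).
    static String namePrefix(String startDelimiter) {
        if (!startDelimiter.startsWith("<")) {
            throw new IllegalArgumentException("start delimiter must begin with '<'");
        }
        int skip = startDelimiter.startsWith("</") ? 2 : 1;
        return startDelimiter.substring(skip);
    }

    public static void main(String[] args) {
        System.out.println(namePrefix("<!doctype")); // prints !doctype
        System.out.println(namePrefix("</%"));       // prints %
    }
}
```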

The name of a tag of this type may or may not include extra characters after the prefix. This is determined by properties such as StartTagType.isNameAfterPrefixRequired() or EndTagTypeGenericImplementation.isStatic().

Tag Type                                 Name Prefix
StartTagType.UNREGISTERED                (empty string)
StartTagType.NORMAL                      (empty string)
StartTagType.COMMENT                     !--
StartTagType.XML_DECLARATION             ?xml
StartTagType.XML_PROCESSING_INSTRUCTION  ?
StartTagType.DOCTYPE_DECLARATION         !doctype
StartTagType.MARKUP_DECLARATION          !
StartTagType.CDATA_SECTION               ![cdata[
StartTagType.SERVER_COMMON               %
EndTagType.UNREGISTERED                  (empty string)
EndTagType.NORMAL                        (empty string)

Tag Type                                               Name Prefix
PHPTagTypes.PHP_SCRIPT                                 script
PHPTagTypes.PHP_SHORT                                  ?
PHPTagTypes.PHP_STANDARD                               ?php
MasonTagTypes.MASON_COMPONENT_CALL                     &
MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT      &|
MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT_END  &
MasonTagTypes.MASON_NAMED_BLOCK                        %
MasonTagTypes.MASON_NAMED_BLOCK_END                    %
Returns:
the name prefix required by this tag type.

getRegisteredTagTypes

public static final List getRegisteredTagTypes()
Returns:
a list of all the currently registered tag types in order of lowest to highest precedence.

getStartDelimiter

public final String getStartDelimiter()
Returns the character sequence that marks the start of the tag.
(property method)

The character sequence must be all in lower case.

The first character in this property must be '<'. This is a deliberate limitation of the system which is necessary to retain reasonable performance.

The second character in this property must be '/' if the implementing class is an EndTagType. It must not be '/' if the implementing class is a StartTagType.

Tag Type                                 Start Delimiter
StartTagType.UNREGISTERED                <
StartTagType.NORMAL                      <
StartTagType.COMMENT                     <!--
StartTagType.XML_DECLARATION             <?xml
StartTagType.XML_PROCESSING_INSTRUCTION  <?
StartTagType.DOCTYPE_DECLARATION         <!doctype
StartTagType.MARKUP_DECLARATION          <!
StartTagType.CDATA_SECTION               <![cdata[
StartTagType.SERVER_COMMON               <%
EndTagType.UNREGISTERED                  </
EndTagType.NORMAL                        </

Tag Type                                               Start Delimiter
PHPTagTypes.PHP_SCRIPT                                 <script
PHPTagTypes.PHP_SHORT                                  <?
PHPTagTypes.PHP_STANDARD                               <?php
MasonTagTypes.MASON_COMPONENT_CALL                     <&
MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT      <&|
MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT_END  </&
MasonTagTypes.MASON_NAMED_BLOCK                        <%
MasonTagTypes.MASON_NAMED_BLOCK_END                    </%
Returns:
the character sequence that marks the start of the tag.

getTagTypesIgnoringEnclosedMarkup

public static final TagType[] getTagTypesIgnoringEnclosedMarkup()
Returns an array of all the tag types inside which the parser ignores all other non-server tags in parse on demand mode.
(implementation assistance method)

The tag types returned by this property (referred to in the following paragraphs as the "listed types") default to StartTagType.COMMENT and StartTagType.CDATA_SECTION.

In parse on demand mode, every new non-server tag found by the parser (referred to as a "new tag") undergoes a check to see whether it is enclosed by a tag of one of the listed types, including new tags of the listed types themselves. The recursive nature of this check means that all tags of the listed types occurring before the new tag must be found by the parser before it can determine whether the new tag should be ignored. To mitigate any performance issues arising from this process, the listed types are given special treatment in the tag cache. This dramatically decreases the time taken to search on these tag types, so adding a tag type to this array that is easily recognised and occurs infrequently only results in a small degradation in overall performance.
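The enclosure check can be approximated with a sequential scan over the default listed types. This is only a sketch under stated assumptions: EnclosedMarkupSketch, isEnclosed and indexOfIgnoreCase are hypothetical names, the scan starts from the beginning of the text instead of using the library's tag cache, and an unterminated comment or CDATA section is treated as enclosing nothing.

```java
// Hypothetical model of the listed-type enclosure check; not part of the library.
class EnclosedMarkupSketch {
    // Start and closing delimiters of the default listed types.
    static final String[][] LISTED = { {"<!--", "-->"}, {"<![cdata[", "]]>"} };

    static int indexOfIgnoreCase(String text, String target, int from) {
        for (int i = Math.max(from, 0); i + target.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, target, 0, target.length())) return i;
        }
        return -1;
    }

    // Returns true if pos lies inside a comment or CDATA section that begins before it.
    static boolean isEnclosed(String text, int pos) {
        int i = 0;
        while (i < pos) {
            // Find the earliest listed-type tag starting at or after i.
            int start = -1, type = -1;
            for (int t = 0; t < LISTED.length; t++) {
                int s = indexOfIgnoreCase(text, LISTED[t][0], i);
                if (s != -1 && (start == -1 || s < start)) { start = s; type = t; }
            }
            if (start == -1 || start >= pos) return false; // nothing begins before pos
            int close = indexOfIgnoreCase(text, LISTED[type][1], start + LISTED[type][0].length());
            if (close == -1) return false;                 // unterminated: ignored in this sketch
            int end = close + LISTED[type][1].length();
            if (pos < end) return true;                    // pos is inside this tag
            i = end;                                       // markup inside the tag was skipped
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "<!-- <b>x</b> --><i>y</i><![CDATA[<u>]]>";
        System.out.println(isEnclosed(doc, doc.indexOf("<b>"))); // prints true
        System.out.println(isEnclosed(doc, doc.indexOf("<i>"))); // prints false
    }
}
```

Resuming the scan after the end of each listed tag is what keeps a comment delimiter that appears inside a CDATA section (or the reverse) from being counted as a tag of its own.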

Theoretically, non-server tags appearing inside any other non-server tag should be ignored. One situation where a tag can legitimately contain a sequence of characters that resembles a tag, which shouldn't be recognised as a tag by the parser, is within an attribute value. The HTML 4.01 specification section 5.3.2 specifically allows the presence of '<' and '>' characters within attribute values. A common occurrence of this is in event attributes such as onclick, which contain scripts that often dynamically load new HTML into the document (see the file samples/data/Test.html for an example).

Performing a full sequential parse of the source document prevents these attribute values from being recognised as tags, but can be very expensive if only a few tags in the document need to be parsed. The penalty of not parsing every tag in the document is that the exactness of this check is compromised, but in practical terms the difference is inconsequential. The default listed types of comments and CDATA sections yield sensible results in the vast majority of practical applications with only a minor impact on performance.

In XHTML, '<' and '>' characters must be represented in attribute values as character references (see the XML 1.0 specification section 3.1), so the situation should never arise that a tag is found inside another tag unless one of them is a server tag.

This method is called from the default implementation of the isValidPosition(Source, int pos, int[] fullSequentialParseData) method.

Returns:
an array of all the tag types inside which the parser ignores all other non-server tags.

isServerTag

public final boolean isServerTag()
Indicates whether this tag type represents a server tag.
(property method)

Server tags are typically parsed by some process on the web server and substituted with other text or markup before delivery to the user agent. This parser therefore handles them differently to non-server tags in that they can occur at any position in the document without regard for the HTML document structure. As a result they can occur anywhere inside any other tag and vice versa.

To avoid the problem of server tags interfering with the proper parsing of the rest of the document, the Segment.ignoreWhenParsing() method can be called on all server tags found in the document before parsing the non-server tags.

The documentation of the tag parsing process explains in detail how the value of this property affects the recognition of a tag.

Tag Type                                 Is Server Tag
StartTagType.UNREGISTERED                false
StartTagType.NORMAL                      false
StartTagType.COMMENT                     false
StartTagType.XML_DECLARATION             false
StartTagType.XML_PROCESSING_INSTRUCTION  false
StartTagType.DOCTYPE_DECLARATION         false
StartTagType.MARKUP_DECLARATION          false
StartTagType.CDATA_SECTION               false
StartTagType.SERVER_COMMON               true
EndTagType.UNREGISTERED                  false
EndTagType.NORMAL                        false

Tag Type                                               Is Server Tag
PHPTagTypes.PHP_SCRIPT                                 true
PHPTagTypes.PHP_SHORT                                  true
PHPTagTypes.PHP_STANDARD                               true
MasonTagTypes.MASON_COMPONENT_CALL                     true
MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT      true
MasonTagTypes.MASON_COMPONENT_CALLED_WITH_CONTENT_END  true
MasonTagTypes.MASON_NAMED_BLOCK                        true
MasonTagTypes.MASON_NAMED_BLOCK_END                    true
Returns:
true if this tag type represents a server tag, otherwise false.

isValidPosition

protected boolean isValidPosition(Source source,
                                  int pos,
                                  int[] fullSequentialParseData)
Indicates whether a tag of this type is valid in the specified position of the specified source document.
(implementation assistance method)

This method is called immediately before constructTagAt(Source, int pos) to do a preliminary check on the validity of a tag of this type in the specified position.

This check is not performed as part of the constructTagAt(Source, int pos) call because the same validation is used for all the standard tag types, and is likely to be sufficient for all custom tag types. Having this check separated into a different method helps to isolate common code from the code that is unique to each tag type.

In theory, a server tag is valid in any position, but a non-server tag is not valid inside another non-server tag.

The common implementation of this method always returns true for server tags, but for non-server tags it behaves slightly differently depending upon whether or not a full sequential parse is being performed.

When this method is called during a full sequential parse, the fullSequentialParseData argument contains information allowing the exact theoretical check to be performed, rejecting a non-server tag if it is inside any other non-server tag. See below for further information about the fullSequentialParseData parameter.

When this method is called in parse on demand mode (not during a full sequential parse, fullSequentialParseData==null), practical constraints prevent the exact theoretical check from being carried out, and non-server tags are only rejected if they are found inside HTML comments or CDATA sections.
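The three cases described above can be summarised in a short sketch. It is a model of the documented behaviour, not the library's code: ValidPositionSketch is a hypothetical class, the enclosedByListedType argument stands in for the enclosure check performed in parse on demand mode, and the single array entry is interpreted as the end position of the last non-server tag, as explained in the section on the fullSequentialParseData parameter.

```java
// Hypothetical model of the default position check; not part of the library.
class ValidPositionSketch {
    static boolean isValidPosition(boolean serverTag, int pos,
                                   int[] fullSequentialParseData,
                                   boolean enclosedByListedType) {
        if (serverTag) return true;                // server tags are valid anywhere
        if (fullSequentialParseData != null) {
            // Full sequential parse: reject a tag that starts before the end
            // of the last non-server tag.
            return pos >= fullSequentialParseData[0];
        }
        // Parse on demand: reject only tags inside one of the listed types.
        return !enclosedByListedType;
    }

    public static void main(String[] args) {
        int[] data = {10};
        System.out.println(isValidPosition(false, 5, data, false));  // prints false
        System.out.println(isValidPosition(false, 10, data, false)); // prints true
    }
}
```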

This behaviour is configurable by manipulating the static TagTypesIgnoringEnclosedMarkup array to determine which tag types cannot contain non-server tags in parse on demand mode. The documentation of this property contains a more detailed analysis of the subject and explains why only the comment and CDATA section tag types are included by default.

See the documentation of the tag parsing process for more information about how this method fits into the whole tag parsing process.

This method can be overridden in custom tag types if the default implementation is unsuitable.

The fullSequentialParseData parameter:

In the current version of this library, the fullSequentialParseData argument is either null (in parse on demand mode) or an integer array containing only a single entry (if a full sequential parse is being performed).

The integer contained in the array is the end position of the last non-server tag, indicating that no other non-server tags should be recognised before that position. If no non-server tags have yet been encountered, the value of this integer is zero.

If the last non-server tag encountered was the start tag of a SCRIPT element, the value of this integer is Integer.MAX_VALUE, indicating that no other non-server elements should be recognised until the end tag of the SCRIPT element is found. This default implementation, however, allows any occurrence of the characters "</" to terminate this condition, without checking whether it is actually a SCRIPT end tag. This is consistent with the HTML 4.01 specification section 6.2 dealing with the special handling of CDATA within SCRIPT and STYLE elements.

Although STYLE elements should theoretically be treated in the same way as SCRIPT elements, the syntax of Cascading Style Sheets (CSS) does not contain any constructs that could be misinterpreted as HTML tags, so there is virtually no need to perform any special checks in this case.

The rationale behind using an integer array to hold this value, rather than a scalar int value, is to emulate passing the parameter by reference. This value needs to be shared amongst several internal methods during the full sequential parse process, and any one of those methods needs to be able to modify the value and pass it back to the calling method. This would normally be implemented by passing the parameter by reference, but because Java does not support this language construct, a container for a mutable integer must be passed instead. Because the standard Java library does not provide a class for holding a single mutable integer (the java.lang.Integer class is immutable), the easiest container to use, without creating a class especially for this purpose, is an integer array. The use of an array does not imply any intention to use more than a single array entry in subsequent versions.
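The difference between the two parameter styles can be seen in a few lines of plain Java. MutableIntSketch and its methods are hypothetical names used only to illustrate the language behaviour described above; they are not part of the library.

```java
// Hypothetical illustration of emulating pass-by-reference with an int array.
class MutableIntSketch {
    // The array reference is copied, but both copies refer to the same array,
    // so the write to element 0 is visible to the caller.
    static void setViaArray(int[] holder, int value) {
        holder[0] = value;
    }

    // The int itself is copied, so this assignment is lost when the method returns.
    static void setViaScalar(int copy, int value) {
        copy = value;
    }

    public static void main(String[] args) {
        int[] holder = {0};
        int scalar = 0;
        setViaArray(holder, 42);
        setViaScalar(scalar, 42);
        System.out.println(holder[0] + " " + scalar); // prints "42 0"
    }
}
```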

Parameters:
source - the Source document.
pos - the character position in the source document to check.
fullSequentialParseData - an integer array containing data allowing this method to implement a better algorithm when a full sequential parse is being performed, or null in parse on demand mode.
Returns:
true if a tag of this type is valid in the specified position of the specified source document, otherwise false.

register

public final void register()
Registers this tag type for recognition by the parser.
(registration related method)

The order of registration affects the precedence of the tag type when a potential tag is being parsed.

See Also:
deregister()

setTagTypesIgnoringEnclosedMarkup

public static final void setTagTypesIgnoringEnclosedMarkup(TagType[] tagTypes)
Sets the tag types inside which the parser ignores all other non-server tags.
(implementation assistance method)

See getTagTypesIgnoringEnclosedMarkup() for the documentation of this property.

Parameters:
tagTypes - an array of tag types.

tagEncloses

protected final boolean tagEncloses(Source source,
                                    int pos)
Indicates whether a tag of this type encloses the specified position of the specified source document.
(implementation assistance method)

This is logically equivalent to source.findEnclosingTag(pos,this)!=null, but is safe to use within other implementation methods without the risk of causing an infinite recursion.

This method is called from the default implementation of the isValidPosition(Source, int pos, int[] fullSequentialParseData) method.

Parameters:
source - the Source document.
pos - the character position in the source document to check.
Returns:
true if a tag of this type encloses the specified position of the specified source document, otherwise false.

toString

public String toString()
Returns a string representation of this object useful for debugging purposes.
Returns:
a string representation of this object useful for debugging purposes.