java.lang.Object
org.w3c.tidy.Lexer
public class Lexer
extends java.lang.Object
public static final short IGNORE_MARKUP
state: ignore markup.
- Field Value:
- 3
public static final short IGNORE_WHITESPACE
state: ignore whitespace.
- Field Value:
- 0
public static final short MIXED_CONTENT
state: mixed content.
- Field Value:
- 1
public static final short PREFORMATTED
state: preformatted.
- Field Value:
- 2
protected short badAccess
for accessibility errors.
protected short badChars
for bad char encodings.
protected boolean badDoctype
set if html or PUBLIC is missing.
protected short badForm
for mismatched/mispositioned form tags.
protected short badLayout
for bad style errors.
protected int columns
at start of current token.
protected int doctype
version as given by doctype (if any).
protected short errors
count of errors.
protected PrintWriter errout
error output stream.
protected boolean excludeBlocks
Netscape compatibility.
protected boolean exiled
true if moved out of table.
protected int insert
for inferring inline tags.
protected boolean insertspace
when space is moved after end tag.
protected Stack istack
stack.
protected int istackbase
start of frame.
protected boolean isvoyager
true if xmlns attribute on html element.
protected byte[] lexbuf
Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. lexsize must be reset for each file. Byte buffer of UTF-8 chars.
protected int lexlength
allocated.
protected int lexsize
used.
protected int lines
lines seen.
protected boolean pushed
true after token has been pushed back.
protected boolean seenEndBody
already seen end body tag?
protected boolean seenEndHtml
already seen end html tag?
protected short state
state of lexer's finite state machine.
protected int txtend
end of current node.
protected int txtstart
start of current node.
protected short versions
bit vector of HTML versions.
protected short warnings
count of warnings in this document.
protected boolean waswhite
used to collapse contiguous white space.
public Lexer(StreamIn in, Configuration configuration, Report report)
Instantiates a new Lexer.
- Parameters:
in
- StreamIn
configuration
- configuration instance
report
- report instance, for reporting errors
public void addByte(int c)
Adds a byte to lexer buffer.
- Parameters:
c
- byte to add
public void addCharToLexer(int c)
Store char c as UTF-8 encoded byte stream.
- Parameters:
c
- char to store
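As an illustration of what storing a char as a UTF-8 byte stream involves, here is a minimal sketch. It is not JTidy's implementation: the class `Utf8Sketch`, the method `addChar`, and the use of a `ByteArrayOutputStream` in place of the lexer buffer are all assumptions made for the example, and surrogate code points are not validated.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of org.w3c.tidy.
final class Utf8Sketch {
    // Appends code point c to buf as 1 to 4 UTF-8 bytes.
    static void addChar(ByteArrayOutputStream buf, int c) {
        if (c < 0x80) {
            buf.write(c);                          // 0xxxxxxx
        } else if (c < 0x800) {
            buf.write(0xC0 | (c >> 6));            // 110xxxxx
            buf.write(0x80 | (c & 0x3F));          // 10xxxxxx
        } else if (c < 0x10000) {
            buf.write(0xE0 | (c >> 12));           // 1110xxxx
            buf.write(0x80 | ((c >> 6) & 0x3F));
            buf.write(0x80 | (c & 0x3F));
        } else {
            buf.write(0xF0 | (c >> 18));           // 11110xxx
            buf.write(0x80 | ((c >> 12) & 0x3F));
            buf.write(0x80 | ((c >> 6) & 0x3F));
            buf.write(0x80 | (c & 0x3F));
        }
    }
}
```

Each branch emits one lead byte whose high bits announce the sequence length, followed by continuation bytes carrying six payload bits each.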
public boolean addGenerator(Node root)
Add meta element for Tidy. If the meta tag is already present, update release date.
- Parameters:
root
- root node
- Returns:
true
if the tag has been added
public void addStringLiteral(String str)
Calls addCharToLexer for each char in the string.
- Parameters:
str
- input String
public void addStringToLexer(String str)
Adds a string to lexer buffer.
- Parameters:
str
- String to add
public short apparentVersion()
Return the html version used in document.
- Returns:
- version code
public boolean canPrune(Node element)
Can the given element be removed?
- Parameters:
element
- node
- Returns:
true
if the element can be removed
public void changeChar(byte c)
Substitute the last char in buffer.
- Parameters:
c
- new char
public boolean checkDocTypeKeyWords(Node doctype)
Check system keywords (keywords should be uppercase).
- Parameters:
doctype
- doctype node
- Returns:
- true if doctype keywords are all uppercase
public AttVal cloneAttributes(AttVal attrs)
Clones an attribute value and adds any ASP or PHP node to the node list.
- Parameters:
attrs
- original AttVal
- Returns:
- cloned AttVal
public Node cloneNode(Node node)
Clones a node and adds it to the node list.
- Parameters:
node
- Node
- Returns:
- cloned Node
public void deferDup()
Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
public boolean endOfInput()
Has end of input stream been reached?
- Returns:
true
if the end of the input stream has been reached
public short findGivenVersion(Node doctype)
Examine DOCTYPE to identify version.
- Parameters:
doctype
- doctype node
- Returns:
- version code
public boolean fixDocType(Node root)
Fixup doctype if missing.
- Parameters:
root
- root node
- Returns:
false
if current version has not been identified
public void fixHTMLNameSpace(Node root, String profile)
Fix xhtml namespace.
- Parameters:
root
- root Node
profile
- current profile
public void fixId(Node node)
Duplicates the name attribute as an id and checks whether id and name match.
- Parameters:
node
- Node to check for name/id attributes
public boolean fixXmlDecl(Node root)
Ensure XML document starts with <?XML version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.
- Parameters:
root
- root node
- Returns:
- always true
public Node getCDATA(Node container)
Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
- Parameters:
container
- container node
- Returns:
- cdata node
public Node getToken(short mode)
Gets a token.
- Parameters:
mode
- one of the following:
MixedContent
-- for elements which don't accept PCDATA
Preformatted
-- white space preserved as is
IgnoreMarkup
-- for CDATA elements such as script, style
- Returns:
- next Node
public short htmlVersion()
Choose what version to use for new doctype.
- Returns:
- html version constant
public String htmlVersionName()
Choose what version to use for new doctype.
- Returns:
- html version name
public Node inferredTag(String name)
Generates and inserts a new node.
- Parameters:
name
- tag name
- Returns:
- generated node
public int inlineDup(Node node)
This has the effect of inserting "missing" inline elements around the contents of block-level elements such as P, TD, TH, DIV, PRE etc. This procedure is called at the start of ParseBlock when the inline stack is not empty, as will be the case in: <i><h1>italic heading</h1></i>
which is then treated as equivalent to <h1><i>italic heading</i></h1>
This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
- Parameters:
node
- original node
- Returns:
- stack size
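The re-nesting described above can be pictured with a small sketch. It is a simplified model and not JTidy's code: `InlineStackSketch`, its string-based stack, and the `wrap` method are hypothetical names, and real tokens are Node objects rather than tag names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of an inline stack that is replayed inside a block element.
final class InlineStackSketch {
    private final Deque<String> istack = new ArrayDeque<>();

    // Records an open inline tag such as "i" or "em".
    void pushInline(String tag) {
        istack.push(tag);
    }

    // Returns the open inline tags, outermost first, to be replayed inside a block.
    List<String> inlineDup() {
        List<String> replay = new ArrayList<>();
        Iterator<String> it = istack.descendingIterator();
        while (it.hasNext()) {
            replay.add(it.next());
        }
        return replay;
    }

    // Renders a block whose content is wrapped in copies of the open inline tags.
    String wrap(String block, String text) {
        List<String> tags = inlineDup();
        StringBuilder sb = new StringBuilder("<" + block + ">");
        for (String t : tags) {
            sb.append('<').append(t).append('>');
        }
        sb.append(text);
        for (int i = tags.size() - 1; i >= 0; i--) {
            sb.append("</").append(tags.get(i)).append('>');
        }
        return sb.append("</").append(block).append('>').toString();
    }
}
```

With `i` pushed, `wrap("h1", "italic heading")` yields `<h1><i>italic heading</i></h1>`, the equivalent form given in the description.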
public static boolean isCSS1Selector(String buf)
In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). The backslash followed by at most four hexadecimal digits (0..9A..F) stands for the Unicode character with that number. Any character except a hexadecimal digit can be escaped to remove its special meaning, by putting a backslash in front.
- Parameters:
buf
- css selector name
- Returns:
true
if the given string is a valid css1 selector name
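The rules above can be approximated by a short checker. This is a sketch under assumptions, not the method's actual implementation: `Css1SelectorSketch` and `isSelector` are hypothetical names, letters are accepted in both cases, and an escape is read as a backslash followed either by one to four hexadecimal digits or by a single non-hexadecimal character.

```java
// Hypothetical checker that follows the rules described above.
final class Css1SelectorSketch {
    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isSelector(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        int i = 0;
        boolean first = true;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;
                if (i >= s.length()) {
                    return false;               // dangling backslash
                }
                if (isHex(s.charAt(i))) {
                    int n = 0;                  // numeric escape: at most four hex digits
                    while (i < s.length() && n < 4 && isHex(s.charAt(i))) {
                        i++;
                        n++;
                    }
                } else {
                    i++;                        // escaped literal character
                }
            } else if (isLetter(c) || (c >= 161 && c <= 255)) {
                i++;
            } else if (c == '-' || (c >= '0' && c <= '9')) {
                if (first) {
                    return false;               // may not start with a dash or digit
                }
                i++;
            } else {
                return false;                   // any other character is rejected
            }
            first = false;
        }
        return true;
    }
}
```

For example, `main-nav2` and `a\.b` pass, while `-x`, `2col`, and `a b` are rejected.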
public boolean isPushed(Node node)
Is the node in the stack?
- Parameters:
node
- Node
- Returns:
true
if the node is found in the stack
public static boolean isValidAttrName(String attr)
Check if attr is a valid name.
- Parameters:
attr
- String to check, must be non-null
- Returns:
true
if attr is a valid name.
public Node newLineNode()
Adds a new line node. Used for creating preformatted text from Word2000.
- Returns:
- new line node
public Node newNode(short type, byte[] textarray, int start, int end)
Creates a new node and adds it to the node list.
- Parameters:
type
- node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray
- array of bytes contained in the Node
start
- start position
end
- end position
- Returns:
- Node
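To show how a node can refer to its text through a shared byte array plus start and end positions, here is a minimal sketch. `SpanSketch` is a hypothetical stand-in for Node, and treating `end` as exclusive is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a node that spans part of a shared UTF-8 buffer.
final class SpanSketch {
    private final byte[] textarray;
    private final int start;
    private final int end;   // assumed exclusive

    SpanSketch(byte[] textarray, int start, int end) {
        this.textarray = textarray;
        this.start = start;
        this.end = end;
    }

    // Decodes only the bytes covered by this span; the buffer is not copied up front.
    String text() {
        return new String(textarray, start, end - start, StandardCharsets.UTF_8);
    }
}
```

Several spans can share one buffer, which is why a method such as updateNodeTextArrays is needed when the buffer is replaced by a new array.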
public Node newNode(short type, byte[] textarray, int start, int end, String element)
Creates a new node and adds it to the node list.
- Parameters:
type
- node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray
- array of bytes contained in the Node
start
- start position
end
- end position
element
- tag name
- Returns:
- Node
public Node parseAsp()
Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but it won't deal with the case where the ASP is used to tailor the attribute value. Here is an example of a workaround for using ASP in attribute values: href='<%=rsSchool.Fields("ID").Value%>'
where the ASP that generates the attribute value is masked from Tidy by the quote marks.
- Returns:
- parsed Node
public String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
Parses an attribute name. Consumes the '>' terminating start tags.
- Parameters:
isempty
- flag, passed as an array so it can be modified
asp
- asp Node, passed as an array so it can be modified
php
- php Node, passed as an array so it can be modified
- Returns:
- parsed attribute
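Passing a one-element array is a common Java idiom for giving a method an extra output besides its return value. The sketch below illustrates only that idiom. `OutParamSketch` and `firstWord` are hypothetical and do not reproduce the parsing logic of parseAttribute.

```java
// Hypothetical illustration of the array-as-out-parameter idiom.
final class OutParamSketch {
    // Returns the leading name in input and reports through isempty[0]
    // whether the input ends with "/>" (an empty tag).
    static String firstWord(String input, boolean[] isempty) {
        String trimmed = input.trim();
        isempty[0] = trimmed.endsWith("/>");
        int end = 0;
        while (end < trimmed.length() && Character.isLetterOrDigit(trimmed.charAt(end))) {
            end++;
        }
        return trimmed.substring(0, end);
    }
}
```

The caller allocates the array, passes it in, and reads index 0 afterwards, for example `boolean[] isempty = {false};` followed by `firstWord("checked/>", isempty)`.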
public AttVal parseAttrs(boolean[] isempty)
Parse tag attributes.
- Parameters:
isempty
- is tag empty?
- Returns:
- parsed attribute/value list
public void parseEntity(short mode)
Parse an html entity.
- Parameters:
mode
- mode
public Node parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
- Returns:
- parsed Node
public int parseServerInstruction()
Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
- Returns:
- delimiter
public char parseTagName()
Parses a tag name.
- Returns:
- first char after the tag name
public String parseValue(String name, boolean foldCase, boolean[] isempty, int[] pdelim)
Parse an attribute value.
- Parameters:
name
- attribute name
foldCase
- fold case?
isempty
- is attribute empty? Passed as an array reference to allow modification
pdelim
- delimiter, passed as an array reference to allow modification
- Returns:
- parsed value
public void popInline(Node node)
Pop a copy of an inline node from the stack.
- Parameters:
node
- Node to be popped
protected boolean preContent(Node node)
Is content acceptable for pre elements?
- Parameters:
node
- content
- Returns:
true
if node is acceptable in pre elements
public void pushInline(Node node)
Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance: <p><em> text <p><em> more text
shouldn't be mapped to <p><em> text </em></p><p><em><em> more text </em></em>
- Parameters:
node
- Node to be pushed
public boolean setXHTMLDocType(Node root)
Adds a new xhtml doctype to the document.
- Parameters:
root
- root node
- Returns:
true
if a doctype has been added
public void ungetToken()
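No description is given for ungetToken, but the pushed field is documented as "true after token has been pushed back". A one-token pushback of that kind is commonly implemented along the following lines. This is a generic sketch, not JTidy's code: `PushbackSketch` and its string tokens are hypothetical.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical one-token pushback reader.
final class PushbackSketch {
    private final Iterator<String> input;
    private String token;      // most recently returned token
    private boolean pushed;    // true after the token has been pushed back

    PushbackSketch(List<String> tokens) {
        this.input = tokens.iterator();
    }

    // Returns the pushed-back token if there is one, otherwise the next input token.
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = input.hasNext() ? input.next() : null;
        return token;
    }

    // Makes the next getToken call return the same token again.
    void ungetToken() {
        pushed = true;
    }
}
```

A parser can call getToken, decide that the token belongs to an enclosing element, call ungetToken, and return to its caller, which then sees the same token.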
protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
Update oldtextarray in the current nodes.
- Parameters:
oldtextarray
- previous text array
newtextarray
- new text array