tagsoup-0.6: Parsing and extracting information from (possibly malformed) HTML documents
Text.HTML.TagSoup
Portability: portable
Stability: unstable
Maintainer: http://www.cs.york.ac.uk/~ndm/
Contents
Data structures and parsing
Tag identification
Extraction
Utility
Combinators
Description

This module extracts information from unstructured HTML code, sometimes known as tag soup. It is intended for situations where the author of the HTML is not cooperating with the person trying to extract the information, but is also not trying to hide it.

The standard practice is to parse a String into a list of Tags using parseTags, then operate on that list to extract the necessary information.
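As a minimal sketch of that parse-then-extract workflow, the following uses only functions listed on this page. It assumes the tagsoup package is installed; pageText and linkTargets are hypothetical helper names chosen for illustration, not part of the library. The helpers avoid naming the Tag type in their signatures so that the sketch does not depend on how a particular release declares it.

```haskell
-- Assumes the tagsoup package is available.
-- pageText and linkTargets are hypothetical helpers, not library functions.
import Text.HTML.TagSoup (fromAttrib, innerText, isTagOpenName, parseTags)

-- All text content of a document, with the markup stripped.
pageText :: String -> String
pageText = innerText . parseTags

-- The href value of every <a> open tag, in document order.
-- fromAttrib yields "" for an <a> that carries no href attribute.
linkTargets :: String -> [String]
linkTargets html =
    [ fromAttrib "href" tag
    | tag <- parseTags html
    , isTagOpenName "a" tag
    ]
```

The name test here is an exact string comparison, so input with mixed-case tag names would probably need canonicalizeTags applied to the parsed list first.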

Synopsis
data Tag
    = TagOpen String [Attribute]
    | TagClose String
    | TagText String
    | TagComment String
    | TagWarning String
    | TagPosition !Row !Column
type Attribute = (String, String)
module Text.HTML.TagSoup.Parser
canonicalizeTags :: [Tag] -> [Tag]
isTagOpen :: Tag -> Bool
isTagClose :: Tag -> Bool
isTagText :: Tag -> Bool
isTagWarning :: Tag -> Bool
isTagOpenName :: String -> Tag -> Bool
isTagCloseName :: String -> Tag -> Bool
fromTagText :: Tag -> String
fromAttrib :: String -> Tag -> String
maybeTagText :: Tag -> Maybe String
maybeTagWarning :: Tag -> Maybe String
innerText :: [Tag] -> String
sections :: (a -> Bool) -> [a] -> [[a]]
partitions :: (a -> Bool) -> [a] -> [[a]]
class TagRep a
class IsChar a
(~==) :: TagRep t => Tag -> t -> Bool
(~/=) :: TagRep t => Tag -> t -> Bool
Data structures and parsing
data Tag
An HTML element; a document is a [Tag]. There is no requirement for TagOpen and TagClose to match.
Constructors
TagOpen String [Attribute]    An open tag with Attributes in their original order.
TagClose String               A closing tag.
TagText String                A text node, guaranteed not to be the empty string.
TagComment String             A comment.
TagWarning String             Meta: marks a syntax error in the input file.
TagPosition !Row !Column      Meta: the position of a parsed element.
type Attribute = (String, String)
An HTML attribute; for example, id="name" generates ("id","name").
module Text.HTML.TagSoup.Parser
canonicalizeTags :: [Tag] -> [Tag]
Turns all tag names to lower case and converts DOCTYPE to upper case.
Tag identification
isTagOpen :: Tag -> Bool
Test if a Tag is a TagOpen.
isTagClose :: Tag -> Bool
Test if a Tag is a TagClose.
isTagText :: Tag -> Bool
Test if a Tag is a TagText.
isTagWarning :: Tag -> Bool
Test if a Tag is a TagWarning.
isTagOpenName :: String -> Tag -> Bool
Returns True if the Tag is a TagOpen and matches the given name.
isTagCloseName :: String -> Tag -> Bool
Returns True if the Tag is a TagClose and matches the given name.
Extraction
fromTagText :: Tag -> String
Extract the string from within a TagText; crashes if the Tag is not a TagText.
fromAttrib :: String -> Tag -> String
Extract the value of an attribute; crashes if the Tag is not a TagOpen. Returns "" if the attribute is not present.
maybeTagText :: Tag -> Maybe String
Extract the string from within a TagText, otherwise Nothing.
maybeTagWarning :: Tag -> Maybe String
Extract the string from within a TagWarning, otherwise Nothing.
innerText :: [Tag] -> String
Extract all text content from tags (similar to Verbatim found in HaXml).
Utility
sections :: (a -> Bool) -> [a] -> [[a]]
Takes a list and returns all suffixes whose first item matches the predicate.
partitions :: (a -> Bool) -> [a] -> [[a]]
Similar to sections, but splits the list so that no element appears in more than one partition.
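The difference between the two is easiest to see on a small list. The sketch below re-implements both functions from their descriptions using only base, so it runs without the library; sectionsSketch and partitionsSketch are hypothetical names, and the library's own definitions may differ in detail.

```haskell
import Data.List (groupBy, tails)

-- Stand-in for sections: every suffix whose first item satisfies p.
sectionsSketch :: (a -> Bool) -> [a] -> [[a]]
sectionsSketch p xs = [s | s@(x:_) <- tails xs, p x]

-- Stand-in for partitions: drop the items before the first match, then
-- start a new group at each matching item so that groups never overlap.
partitionsSketch :: (a -> Bool) -> [a] -> [[a]]
partitionsSketch p = groupBy (\_ x -> not (p x)) . dropWhile (not . p)

example :: [Int]
example = [1, 2, 3, 4, 5, 6]

-- sectionsSketch even example   == [[2,3,4,5,6],[4,5,6],[6]]
-- partitionsSketch even example == [[2,3],[4,5],[6]]
```

In this sketch, items that come before the first match (the leading 1) appear in no result of either function.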
Combinators
class TagRep a
A class that allows Strings or Tags to be used as match patterns.
class IsChar a
(~==) :: TagRep t => Tag -> t -> Bool

Performs an inexact match; the first item should be the thing to match. A blank string in the second item is considered to match anything. For example:

 (TagText "test" ~== TagText ""    ) == True
 (TagText "test" ~== TagText "test") == True
 (TagText "test" ~== TagText "soup") == False

For TagOpen, the item on the right may omit attributes that are present on the left.
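The matching rules above can be modelled in a few lines. This is a sketch only: it declares a cut-down local copy of the Tag type so that it needs nothing beyond base, and matchesSketch is a hypothetical name, not the library's definition of (~==). Applying the blank-string rule to tag names as well as text is an assumption made here for illustration.

```haskell
-- A cut-down local mirror of the Tag type (assumption: only three
-- constructors are needed to illustrate the matching rules).
type Attribute = (String, String)

data Tag
    = TagOpen String [Attribute]
    | TagClose String
    | TagText String
    deriving (Eq, Show)

-- A blank pattern string matches anything; otherwise require equality.
blankOr :: String -> String -> Bool
blankOr pat actual = null pat || pat == actual

-- Hypothetical model of the inexact match: first argument is the tag
-- under test, second argument is the pattern.
matchesSketch :: Tag -> Tag -> Bool
matchesSketch (TagText t) (TagText pat) = blankOr pat t
matchesSketch (TagClose n) (TagClose pat) = blankOr pat n
matchesSketch (TagOpen n attrs) (TagOpen pat wanted) =
    -- Every attribute listed in the pattern must be present on the tag;
    -- the tag may carry extra attributes that the pattern omits.
    blankOr pat n && all (`elem` attrs) wanted
matchesSketch _ _ = False
```

Under this model, a tag such as TagOpen "a" [("href","/x"),("id","k")] matches the pattern TagOpen "a" [("id","k")], whereas a pattern that lists an attribute the tag lacks does not match.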

(~/=) :: TagRep t => Tag -> t -> Bool
Negation of ~==.