Module Pxp_types


module Pxp_types: sig .. end
This module defines and exports all the types listed in Pxp_core_types_type.CORE_TYPES:

  • type ext_id
  • type private_id
  • val allocate_private_id
  • type resolver_id
  • val resolver_id_of_ext_id
  • type dtd_id
  • type content_model_type
  • type mixed_spec
  • type regexp_spec
  • type att_type
  • type att_default
  • type att_value
  • class type collect_warnings
  • class drop_warnings
  • class type symbolic_warnings
  • type warning
  • type encoding
  • type rep_encoding
  • exception Validation_error
  • exception WF_error
  • exception Namespace_error
  • exception Error
  • exception Character_not_supported
  • exception At
  • exception Undeclared
  • exception Method_not_applicable
  • exception Namespace_method_not_applicable
  • val string_of_exn
  • type output_stream
  • val write
  • type pool
  • val make_probabilistic_pool
  • val pool_string

See the file pxp_core_types_type.mli for the exact definitions of these types/values.


include Pxp_core_types_type.CORE_TYPES

type config = {
   warner : collect_warnings; (*An object that collects warnings.*)
   swarner : symbolic_warnings option; (*Another object getting warnings expressed as polymorphic variants. This is especially useful to turn warnings into errors. If defined, the swarner gets the warning first before it is sent to the classic warner.*)
   enable_pinstr_nodes : bool;
   enable_super_root_node : bool;
   enable_comment_nodes : bool; (*When enabled, comments are represented as nodes with type = T_comment. To access the contents of comments, use the method "comment" for the comment nodes. These nodes behave like elements; however, they are normally empty and do not have attributes. Note that it is possible to add children to comment nodes and to set attributes, but it is strongly recommended not to do so. There are no checks on such abnormal use, because they would cost too much time, even when no comment nodes are generated at all.

Comment nodes should be disabled unless you must parse a third-party XML text which uses comments as another data container.

The nodes of type T_comment are created from the comment exemplars in your spec.

Event-based parser: This flag controls whether E_comment events are generated.

*)
   drop_ignorable_whitespace : bool;
   encoding : rep_encoding; (*Specifies the encoding used for the *internal* representation of any character data. Note that the default is still Enc_iso88591.*)
   recognize_standalone_declaration : bool; (*Whether the "standalone" declaration is recognized or not. This option does not have an effect on well-formedness parsing: in this case such declarations are never recognized.

Recognizing the "standalone" declaration means that the value of the declaration is scanned and passed to the DTD, and that the "standalone-check" is performed.

Standalone-check: If a document is flagged standalone='yes' some additional constraints apply. The idea is that a parser without access to any external document subsets can still parse the document, and will still return the same values as the parser with such access. For example, if the DTD is external and if there are attributes with default values, it is checked that there is no element instance where these attributes are omitted - the parser would return the default value but this requires access to the external DTD subset.

Event-based parser: The option has an effect if the `Parse_xml_decl entry flag is set. In this case, it is passed to the DTD whether there is a standalone declaration, ... and the rest is unclear.

*)
   store_element_positions : bool; (*Whether the file name, the line and the column of the beginning of elements are stored in the element nodes. This option may be useful to generate error messages.

Positions are only stored for:

  • Elements
  • Wrapped processing instructions (see enable_pinstr_nodes)

For all other node types, no position is stored. You can access positions by the method "position" of nodes.

Event-based parser: If true, the E_position events will be generated.

*)
   idref_pass : bool; (*Whether the parser does a second pass and checks that all IDREF and IDREFS attributes contain valid references. This option works only if an ID index is available. To create an ID index, pass an index object as id_index argument to the parsing functions (such as parse_document_entity; see below).

"Second pass" does not mean that the XML text is again parsed; only the existing document tree is traversed, and the check on bad IDREF/IDREFS attributes is performed for every node.

Event-based parser: this option is ignored.

*)
   validate_by_dfa : bool; (*If true, and if DFAs are available for validation, the DFAs will actually be used for validation. If false, or if no DFAs are available, the standard backtracking algorithm will be used. DFA = deterministic finite automaton.

DFAs are only available if accept_only_deterministic_models is "true" (because in this case, it is relatively cheap to construct the DFAs). DFAs are a data structure which ensures that validation can always be performed in linear time.

I strongly recommend using DFAs; however, there are examples for which validation by backtracking is faster.

Event-based parser: this option is ignored.

*)
   accept_only_deterministic_models : bool; (*Whether only deterministic content models are accepted in DTDs.

Event-based parser: this option is ignored.

*)
   disable_content_validation : bool; (*When set to 'true', content validation is disabled; however, other validation checks remain activated. This option is intended to save time when a validated document is parsed and it can be assumed that it is valid.

Do not forget to set accept_only_deterministic_models to false to save maximum time (or DFAs will be computed which is rather expensive).

Event-based parser: this option is ignored.

*)
   name_pool : Pxp_core_types.pool;
   enable_name_pool_for_element_types : bool;
   enable_name_pool_for_attribute_names : bool;
   enable_name_pool_for_attribute_values : bool; (*enable_name_pool_for_notation_names : bool;*)
   enable_name_pool_for_pinstr_targets : bool; (*The name pool maps strings to pool strings such that strings with the same value share the same block of memory. Enabling the name pool saves memory, but makes the parser slower.

Event-based parser: As far as I remember, some of the pool options are honoured, but not all.

*)
   enable_namespace_processing : Pxp_dtd.namespace_manager option; (*Setting this option to a namespace_manager enables namespace processing. This works only if the namespace-aware implementation namespace_element_impl of element nodes is used in the spec; otherwise you will get error messages complaining about missing methods.

Note that PXP uses a technique called "prefix normalization" to implement namespaces on top of the plain document model. This means that the namespace prefixes of elements and attributes are changed to unique prefixes if they are ambiguous, and that these "normprefixes" are actually stored in the document tree. Furthermore, the normprefixes are used for validation.

Every normprefix corresponds uniquely to a namespace URI, and this mapping is controlled by the namespace_manager. It is possible to fill the namespace_manager before parsing starts such that the programmer knows which normprefix is used for which namespace URI. Example:

  let mng = new namespace_manager in
  mng # add_namespace "html" "http://www.w3.org/1999/xhtml";
  ...

This forces that elements with the mentioned URI are rewritten to a form using the normprefix "html". For instance, "html:table" always refers to the HTML table construct, independently of the prefix used in the parsed XML text.

By default, namespace processing is turned off.

Event-based parser: If namespace processing is enabled, the events E_ns_start_tag and E_ns_end_tag are generated instead of E_start_tag and E_end_tag, respectively.

*)
   escape_contents : (Pxp_lexer_types.token -> Pxp_entity_manager.entity_manager -> string) option;
   escape_attributes : (Pxp_lexer_types.token -> int -> Pxp_entity_manager.entity_manager -> string) option;
   debugging_mode : bool;
}
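
The name pool options above rely on string interning: equal strings are mapped to one shared copy. The following is only a minimal sketch of that idea using a plain Hashtbl from the standard library. The names make_pool and intern are hypothetical and are not part of PXP; the library's own pool (see make_probabilistic_pool and pool_string) may work differently.

```ocaml
(* Hypothetical interning table: maps a string value to its shared copy.
   This is NOT PXP's pool implementation, only an illustration. *)
let make_pool () : (string, string) Hashtbl.t = Hashtbl.create 64

(* Return the shared copy of [s], registering [s] itself on first use. *)
let intern pool s =
  match Hashtbl.find_opt pool s with
  | Some shared -> shared
  | None -> Hashtbl.add pool s s; s

let () =
  let pool = make_pool () in
  (* Two separately allocated strings with the same contents. *)
  let a = intern pool (String.concat "" [ "ta"; "ble" ]) in
  let b = intern pool (String.concat "" [ "tab"; "le" ]) in
  (* Physical equality: both names now refer to the same memory block. *)
  assert (a == b)
```

The lookup on every name is the cost side of the trade-off described above: memory is saved, but each string passes through the table once more.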

val default_config : config
- Warnings are thrown away
val default_namespace_config : config
Same as default_config, but namespace processing is turned on
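
A configuration is usually derived from default_config with OCaml's record update syntax. The sketch below assumes the PXP library is installed and uses only field names from the config record above; the particular choice of options is just an illustration, and `Enc_utf8 is assumed to be one of the rep_encoding values.

```ocaml
(* Requires the PXP library; this is a sketch, not a recommended setup. *)
open Pxp_types

let my_config : config =
  { default_config with
    encoding = `Enc_utf8;            (* internal representation in UTF-8 *)
    store_element_positions = true;  (* keep file/line/column for error messages *)
    enable_comment_nodes = false;    (* comments are not represented as nodes *)
  }

(* Namespace processing with a pre-filled manager, following the
   add_namespace example given for enable_namespace_processing. *)
let my_ns_config : config =
  let mng = new Pxp_dtd.namespace_manager in
  mng # add_namespace "html" "http://www.w3.org/1999/xhtml";
  { default_config with enable_namespace_processing = Some mng }
```

Fields that are not mentioned keep the values of default_config.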

sources

Sources specify where the XML text to parse comes from. The type source is often not used directly, but sources are constructed with the help of the functions from_channel, from_obj_channel, from_file, and from_string (see below).

The type source is an abstraction on top of resolver (defined in module Pxp_reader). The resolver is a configurable object that knows how to access files that are

A source is a resolver that is applied to a certain ID that should be initially opened.
type source = Pxp_dtd.source = 
| Entity of ((Pxp_dtd.dtd -> Pxp_entity.entity) * Pxp_reader.resolver)
| ExtID of (Pxp_core_types.ext_id * Pxp_reader.resolver)
| XExtID of (Pxp_core_types.ext_id * string option * Pxp_reader.resolver)

The three basic flavours of sources:
val from_channel : ?alt:Pxp_reader.resolver list ->
?system_id:string ->
?fixenc:encoding ->
?id:ext_id ->
?system_encoding:encoding -> Pervasives.in_channel -> source
This function creates a source that reads the XML text from the passed in_channel. By default, this source is not able to read XML text from any other location (you cannot read from files etc.). The optional arguments allow you to modify this behaviour.

Keep the following in mind:

  • ~alt: A list of further resolvers. For example, you can pass new Pxp_reader.resolve_as_file() to enable resolving of file names found in SYSTEM IDs.
  • ~system_id: By default, the XML text found in the in_channel does not have any ID (to be exact, the in_channel has a private ID, but this is hidden). Because of this, it is not possible to open a second file by using a relative SYSTEM ID. The parameter ~system_id assigns the channel a SYSTEM ID that is only used to resolve further relative SYSTEM IDs. This parameter must be encoded as a UTF-8 string.
  • ~fixenc: By default, the character encoding of the XML text is determined by looking at the XML declaration. Setting ~fixenc forces a certain character encoding. Useful if you can assume that the XML text has been recoded by the transmission media.

THE FOLLOWING OPTIONS ARE DEPRECATED:

  • ~id: This parameter assigns the channel an arbitrary ID (like ~system_id, but PUBLIC, anonymous, and private IDs are also possible, although not reasonable). Furthermore, setting ~id also enables resolving of file names. ~id has higher precedence than ~system_id.
  • ~system_encoding: (Only useful together with ~id.) The character encoding used for file names. (UTF-8 by default.)

val from_obj_channel : ?alt:Pxp_reader.resolver list ->
?system_id:string ->
?fixenc:encoding ->
?id:ext_id ->
?system_encoding:encoding -> Netchannels.in_obj_channel -> source
Similar to from_channel, but reads from a netchannel instead.
val from_string : ?alt:Pxp_reader.resolver list ->
?system_id:string -> ?fixenc:encoding -> string -> source
Similar to from_channel, but reads from a string.

Of course, it is possible to parse this source several times, unlike the channel-based sources.

val from_file : ?alt:Pxp_reader.resolver list ->
?system_encoding:encoding -> ?enc:encoding -> string -> source
The source is the file whose name is passed as string argument. The filename must be UTF-8-encoded (so it can be correctly rewritten into a URL).

This source can open further files by default, and relative URLs work.

  • ~alt: A list of further resolvers, especially useful to open non-SYSTEM IDs, and non-file entities.
  • ~system_encoding: The character encoding the system uses to represent filenames. By default, UTF-8 is assumed.
  • ~enc: The character encoding of the string argument. As mentioned, this is UTF-8 by default.
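
The source constructors can be used roughly as follows. This is a sketch that assumes the PXP library is installed; the XML text, the SYSTEM ID and the file path are made up for illustration, and `Enc_utf8 is assumed to be a valid encoding value.

```ocaml
(* Requires the PXP library. Paths and IDs below are hypothetical. *)
open Pxp_types

let xml_text = "<?xml version='1.0'?><greeting>hello</greeting>"

(* A string source; unlike channel sources it can be parsed several times. *)
let s1 : source = from_string xml_text

(* Assign a SYSTEM ID so that relative SYSTEM IDs in the text can be
   resolved, and force the encoding instead of reading the XML declaration. *)
let s2 : source =
  from_string ~system_id:"file:///tmp/greeting.xml" ~fixenc:`Enc_utf8 xml_text

(* A file source; it can open further files and relative URLs by default. *)
let s3 : source = from_file "/tmp/greeting.xml"

(* A channel source that may also resolve file names found in SYSTEM IDs,
   using the resolver mentioned for the ~alt argument. *)
let source_of_channel ic : source =
  from_channel ~alt:[ new Pxp_reader.resolve_as_file () ] ic
```

None of these definitions parses anything yet; the source is only consumed when it is passed to one of the parsing functions.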

val open_source : config ->
source ->
bool -> Pxp_dtd.dtd -> Pxp_reader.resolver * Pxp_entity.entity
Returns the resolver and the entity for a source. The boolean arg determines whether a document entity (true) or a normal external entity (false) will be returned.
type entry = [ `Entry_content of [ `Dummy ] list
| `Entry_declarations of [ `Extend_dtd_fully | `Val_mode_dtd ] list
| `Entry_document of
[ `Extend_dtd_fully | `Parse_xml_decl | `Val_mode_dtd ] list
| `Entry_expr of [ `Dummy ] list ]
Entry points for the parser (used to call process_entity). The entry points have a list of flags. Note that `Dummy is ignored and only present because O'Caml does not allow empty variants. For `Entry_document and `Entry_declarations, the flags determine the kind of DTD object that is generated. Without flags, the DTD object is configured for well-formedness mode:

The flags affecting the DTD have the following meaning:

There is another option regarding the XML declaration:



type event =
| E_start_doc of (string * Pxp_dtd.dtd)
| E_end_doc of string
| E_start_tag of (string * (string * string) list * Pxp_dtd.namespace_scope option *
Pxp_lexer_types.entity_id)
| E_end_tag of (string * Pxp_lexer_types.entity_id)
| E_char_data of string
| E_pinstr of (string * string * Pxp_lexer_types.entity_id)
| E_pinstr_member of (string * string * Pxp_lexer_types.entity_id)
| E_comment of string
| E_start_super
| E_end_super
| E_position of (string * int * int)
| E_error of exn
| E_end_of_stream (*may be extended in the future*)

The type of XML events:

E_start_doc (xmlversion,dtd)

E_end_doc lit_name: lit_name is the literal name of the root element

E_start_tag (name, attlist, scope_opt, entid): <name attlist> scope_opt is None in non-namespace mode, and the namespace scope object in namespace mode.

E_end_tag (name, entid): </name>

E_char_data data: The parser usually generates several E_char_data events for a longer section of character data.

E_pinstr (target,value): <?target value?> as node

E_pinstr_member (target,value): <?target value?> as member of the parent element (add_pinstr)

E_comment value: <!--value-->

E_start_super, E_end_super: Indicates where the "super root node" is. Only generated when enable_super_root_node is on.

E_position(entity,line,col): these events are only created if the next event will be E_start_tag, E_pinstr, or E_comment, and if the configuration option store_element_positions is true.

E_end_of_stream: this last event indicates that the parser has terminated without error

E_error(exn): this last event indicates that the parser has terminated with error
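
A consumer of the event stream typically pattern-matches on these constructors. The sketch below shows one way to do that. To stay runnable without PXP, it declares a cut-down local stand-in for the event type (the entity_id and namespace scope arguments and several constructors are left out); with the real library you would match on Pxp_types.event instead. The function collect_text is a hypothetical helper.

```ocaml
(* Simplified local stand-in for Pxp_types.event, for illustration only. *)
type event =
  | E_start_tag of string * (string * string) list
  | E_end_tag of string
  | E_char_data of string
  | E_comment of string
  | E_error of exn
  | E_end_of_stream

(* Concatenate all character data. The parser may split one text section
   into several E_char_data events, so the pieces are joined in a buffer.
   E_end_of_stream means success, E_error means the parser failed. *)
let collect_text (events : event list) : (string, exn) result =
  let buf = Buffer.create 64 in
  let rec loop = function
    | [] | E_end_of_stream :: _ -> Ok (Buffer.contents buf)
    | E_error e :: _ -> Error e
    | E_char_data s :: rest -> Buffer.add_string buf s; loop rest
    | (E_start_tag _ | E_end_tag _ | E_comment _) :: rest -> loop rest
  in
  loop events

let sample =
  [ E_start_tag ("p", [ ("class", "note") ]);
    E_char_data "Hello, ";
    E_char_data "world";
    E_end_tag "p";
    E_end_of_stream ]

let () =
  match collect_text sample with
  | Ok text -> print_endline text
  | Error e -> prerr_endline (Printexc.to_string e)
```

Treating E_end_of_stream and E_error as the two terminal cases mirrors the description above: exactly one of them ends the stream.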

Note Pxp_lexer_types.entity_id: currently, this is just < >, i.e. the class type without properties. It is planned, however, that one can at least query the base URI of the entity. The best way of dealing with this parameter for now: