Warning: This page contains preliminary results that are likely to change as development teams review and polish their parsers and common methodology. The current intent of this page is not to provide reliable results but to facilitate production of future high-quality results. For example, performance of parsers is likely to improve significantly, and new parsers may be added to the set. Any help in that direction is more than welcome!

This page compares Hapy performance with the performance of other parsers.

Scope

Our goal is to measure and document parser performance. Performance is a factor when selecting a parser and when planning how to use it. There are many other important factors (e.g., programming language, parser interface, licensing, and support) that are outside the scope of this page. The following performance metrics are measured:

	- compilation time
	- compilation memory (RAM) usage
	- executable size
	- parsing speed
	- parsing greed (parsing memory usage relative to input size)

All of the above metrics may depend on the test environment. For example, operating system, compiler version, and parser configuration often have drastic effects on some measurements. When possible, we provide results for different environments.

Results

The results in this section are based on the tests and methodology documented later in this document. The results are given separately for each test. You may want to pay more attention to the tests and environments that are relevant to your use cases.

In the tables, compilation memory usage is based on the output of the Unix time tool. Executable size is measured after stripping debugging symbols from the binary. Parsing speed and greed figures are averages over all corresponding tests (same parser, same environment, various input sizes). Parsing speed is the total input size divided by the total parsing time (higher speed is better). Parsing greed is the ratio of the memory required to parse an input to the input size (lower greed is better); for example, a parser that needs 180 MB of RAM to parse a 10 MB input has a greed of 18. Both derived measurements are largely independent of input size, which makes them good invariants for summaries.
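To make the two derived metrics concrete, here is a minimal C++ sketch that computes them from raw measurements; the numbers are made up for illustration and do not come from the tables below.

        #include <iostream>

        int main() {
            // hypothetical totals aggregated over all runs of one parser
            // in one environment (various input sizes)
            const double totalInputKB = 16384;  // total size of parsed inputs
            const double totalParseSec = 61.0;  // total time spent parsing
            const double parseRamKB = 294912;   // RAM needed to parse the inputs

            // speed: total input size over total parsing time (higher is better)
            const double speed = totalInputKB / totalParseSec;

            // greed: parsing RAM over input size (lower is better)
            const double greed = parseRamKB / totalInputKB;

            std::cout << "speed: " << speed << " KB/sec; "
                      << "greed: " << greed << " (RAM/input)" << std::endl;
            return 0;
        }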

Simple XML

The graphs below compare Hapy and Spirit parser performance when generating a parsing tree for simple XML input of increasing size.

And here is a summary table with compile-time measurements and average parsing performance. Complete results are available elsewhere.

Parser         Environment   Compiling                      Parsing
                             time     RAM      exe          speed      greed
                             (sec)    (MB)     (KB)         (KB/sec)   (RAM/input)

Hapy 0.0.3     BSD1          2        10       70           268        18
Spirit 1.6.1   BSD1          32       121      348          50         100

Tests

This section describes the tests used to benchmark parsers. For each test, we define a grammar, a valid-input generation method, and an interpretation task. Parser correctness can be checked using invalid input: parsers that accept (instead of rejecting with a syntax error) input that does not match the grammar are disqualified.

Simple XML

The Simple XML test requires that the parser create a parsing tree for an XML input based on a drastically simplified XML grammar. The parser must accept any valid input and must reject any invalid input. This is not a "validating" parser test; input validity is defined by the grammar.

	grammar = node*;
	node = pi | element | text;
	element = openElement node* closeElement | closedElement;

	pi = "<?" name (CHARACTER - "?>")* "?>";

	openElement = "<" name attr* ">";
	closeElement = "</" name ">";
	closedElement = "<" name attr* "/>";

	text = (CHARACTER - '<')+;

	attr = name '=' value;
	name = ALPHA (ALPHA | DIGIT | '_' | ':')*;
	value = '"' (CHARACTER - '"')* '"';

	- Grammar terminals can be separated by any amount of whitespace.
	- Grammar terminals are "text", "name", "value", and all literals.
	

The grammar for this test recognizes only three kinds of XML nodes: text, elements, and processing instructions. Element attributes and nested elements are recognized. Entities, comments, and CDATA sections are not. Many (most?) XML documents or messages produced by machines are limited to the XML features recognized by this grammar. The complexity of the grammar is comparable to that of simple configuration languages and protocol messages.
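For illustration, here is a small input (made up here, not taken from the test corpus) that this grammar accepts:

        <?xml version="1.0"?>
        <note priority="high">
            Remember to <b>strip</b> the binary.
            <hr/>
        </note>

An input with an unterminated or mismatched element, such as "<note>text" with no closing tag, must be rejected with a syntax error.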

The input for this test is generated by the tests/xmlgen tool. The original generator was developed by the XMark project. We have slightly modified the original tool to fix a bug and to generate output of a given size (these changes were sent back to the xmlgen authors).

Full XML

The Full XML test requires that the parser create a parsing tree for an XML input based on a full XML grammar. The parser must accept any valid input and must reject any invalid input. This is not a "validating" parser test; input validity is defined by the grammar.

The grammar for this test is based on the XML EBNF extracted from the W3C XML 1.0 specification. All XML constructs must be correctly recognized. Many (most?) XML documents produced by humans use a nearly full set of XML "features". The complexity of the grammar is comparable to that of scripting languages and of protocol messages of moderate-to-high complexity.
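For contrast with the Simple XML test, a Full XML parser must handle constructs such as comments, entities, single-quoted attribute values, and CDATA sections. A fragment like the following (made up for illustration) would be rejected by the simplified grammar but must be accepted here:

        <!-- a comment -->
        <p class='x'>fish &amp; chips <![CDATA[raw <markup> here]]></p>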

We are looking for sources of input data for this test. Suggestions are welcome.

IP Packets

The IP Packets test requires that the parser/interpreter examine each IP packet within a stream of IPv4 and IPv6 packets. The number of packets matching some simple criteria (e.g., an invalid checksum) must be returned as the result of the test.
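While the test details are still open, the invalid-checksum criterion hints at the kind of per-packet work involved. Below is a minimal, hypothetical C++ sketch of IPv4 header checksum verification as defined in RFC 791; packet framing, IPv6 handling, and the match counter are left out.

        #include <cstddef>
        #include <cstdint>

        // An IPv4 header is valid if the ones' complement sum of its
        // 16-bit words (checksum field included) folds to 0xFFFF (RFC 791).
        bool hasValidIpv4Checksum(const uint8_t *hdr, size_t hdrLen) {
            uint32_t sum = 0;
            for (size_t i = 0; i + 1 < hdrLen; i += 2)
                sum += (uint32_t(hdr[i]) << 8) | hdr[i + 1]; // network byte order
            while (sum >> 16)
                sum = (sum & 0xFFFF) + (sum >> 16); // fold carries back in
            return sum == 0xFFFF;
        }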

We are working on the details of this test. Suggestions are welcome.

Methodology

This section outlines our testing methodology. The overall intent behind our rules is to produce performance results meaningful to an average user who is either choosing among several parsers or doing capacity planning for a given parser. That average user is expected to write their own parser for an unknown but similar grammar and use case. Consequently, benchmark specials of any kind and guru-level optimizations or tricks are not allowed.

For tests using the xmlgen XML generator, please use the tests/xmlgen tool that comes with Hapy. XML input must be generated with a scale factor of 1 (which is also the default) and variable "slice" sizes. For example,

        ./xmlgen -f 1 -s    1 | ./parser >> test.log
        ./xmlgen -f 1 -s    2 | ./parser >> test.log
        ./xmlgen -f 1 -s    4 | ./parser >> test.log
        ...
        ./xmlgen -f 1 -s 1024 | ./parser >> test.log
        ./xmlgen -f 1 -s 2048 | ./parser >> test.log
        ./xmlgen -f 1 -s 4096 | ./parser >> test.log
        ./xmlgen -f 1 -s 8192 | ./parser >> test.log
	

If you want to submit your own results, please look at the tests/SpeedTest.sh script for the test harness we use and modify it to suit your needs (in most cases, only OS portability modifications should be necessary).

If you want to write and test your own parser, please look at tests/SpeedTest.cc and tests/SpiritSpeedTest.cc for examples that can be used as templates. Those examples include code to produce the statistics used to populate the tables and graphs provided here.
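The actual templates are in the files named above. For orientation only, here is a hypothetical sketch of the overall shape of such a test program, with parse() standing in for the parser under test:

        #include <chrono>
        #include <iostream>
        #include <iterator>
        #include <string>

        // stand-in for the parser under test; a real test would build
        // the grammar and a parsing tree here
        static bool parse(const std::string &input) {
            return !input.empty();
        }

        int main() {
            // read the generated XML from stdin in one go
            const std::string input((std::istreambuf_iterator<char>(std::cin)),
                                    std::istreambuf_iterator<char>());

            const auto start = std::chrono::steady_clock::now();
            const bool ok = parse(input);
            const std::chrono::duration<double> sec =
                std::chrono::steady_clock::now() - start;

            // one log line per run; a harness can aggregate these
            std::cout << "size: " << input.size() / 1024.0 << " KB"
                      << " time: " << sec.count() << " sec"
                      << " result: " << (ok ? "accepted" : "rejected") << std::endl;
            return ok ? 0 : 1;
        }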

When in doubt, please ask before investing a lot of effort into something that may end up incompatible with existing rules or results.

Fairness

Ideally, all tests would be performed by an independent, unbiased 3rd party, with all parser development teams providing input on methodology and rules. While we wait for such a 3rd party to appear, we want to provide our users with meaningful performance results and comparisons, which implies obtaining and publishing results for parsers other than Hapy. We try to ensure basic fairness despite the obvious conflict of interest.

If you have suggestions on how to improve the quality and fairness of these tests, please let us know.

Contributions

Besides general feedback, we do solicit test result submissions from users and other parser development teams. When running tests, please follow our testing methodology closely so that results are comparable. When submitting results, please disclose all necessary details, including:

  1. submitter contact information
  2. submitter affiliation with any of the parser development teams
  3. test log
  4. number of CPUs and CPU speed
  5. available RAM (MB)
  6. anything else that is not included in the test logs and that may have affected the results or may affect result reproducibility

Submitters are responsible for the accuracy and quality of the results they submit. Accepted submissions are acknowledged in, and become a part of, the Hapy documentation.