File formats

Building a Poliqarp corpus involves converting between several data representations and file formats. It’s easy to get confused by all the acronyms, so this section provides a quick reference to the formats used in the Poliqarp for DjVu pipeline.

DjVu

DjVu is a file format designed for efficient representation of scanned documents. It’s the source format for building Poliqarp for DjVu corpora. DjVu’s key feature from the point of view of Poliqarp for DjVu is that documents can contain a hidden text layer, so OCR-ed text can be stored alongside the graphical representation of the original document. Poliqarp corpora are built from the text layer. Because the items in the text layer are mapped to the corresponding regions of the graphical layer (that is, the scanned image of the source document), Poliqarp queries can return graphical concordances.

hOCR

hOCR is a format for representing OCR output, which combines layout and recognition confidence information with the recognized text. All this data is encoded in a standard HTML document, which makes it easy to process with existing tools.
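Because hOCR is plain HTML, the text and its page coordinates can be read with a general-purpose HTML parser. The sketch below is only an illustration and is not part of the Poliqarp for DjVu tools. It assumes that words are marked with the ocrx_word class and carry a "bbox x0 y0 x1 y1" property in their title attribute, as the hOCR specification describes. The names HocrWordParser, parse_bbox, extract_words and the sample markup are made up for this example, and word elements are assumed not to contain nested span elements.

```python
from html.parser import HTMLParser

# Hypothetical sample: one page, one line, two words.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 2480 3508'>
  <span class='ocr_line' title='bbox 300 200 900 260'>
    <span class='ocrx_word' title='bbox 300 200 520 260; x_wconf 93'>Poliqarp</span>
    <span class='ocrx_word' title='bbox 540 200 900 260; x_wconf 88'>corpus</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from elements with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans contain no nested <span> elements.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    """Return a list of (word, bbox) tuples found in hOCR markup."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word, bbox in extract_words(SAMPLE):
        print(word, bbox)
```

Each word is paired with its bounding box on the page image, which is the same kind of mapping that makes graphical concordances possible.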

In Poliqarp for DjVu, hOCR is the intermediate format between the source DjVu files and XCES. It can be annotated with information about the structure of the source document, which can later be used when querying the corpus to limit the results, for example, to the front matter or to a particular chapter.

XCES

XCES is an XML-based format for encoding text corpora. In Poliqarp for DjVu, it’s the input from which bpng generates the binary corpus.

Corpus configuration file

The corpus configuration file is used by the Poliqarp daemon. First and foremost, it describes the tagset of the corpus: the categories to which segments can belong and the attributes that each category can take.

See also

Daniel Janus’s MSc thesis (in Polish)

bpng configuration file

bpng is the Poliqarp corpus builder. Its configuration file controls which files are included in the corpus and defines the metadata fields.

See also

bpng man pages

Corpus header

There are only two requirements for the header file:

  • It should be a well-formed XML document.
  • It should include the metadata fields defined in the bpng configuration file.

A simple header file, with just two metadata fields, can look like this:

<meta>
   <year>1990</year>
   <creator>John Doe</creator>
</meta>
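Both requirements can be checked mechanically before building a corpus. The following is a minimal sketch, not a tool shipped with Poliqarp. It assumes that the metadata fields are direct children of the root element, as in the example above, and that the list of required field names is copied by hand from the bpng configuration file. The function name check_header is made up for this example.

```python
import xml.etree.ElementTree as ET

# The header from the example above.
HEADER = """<meta>
   <year>1990</year>
   <creator>John Doe</creator>
</meta>"""


def check_header(xml_text, required_fields):
    """Return a list of problems; an empty list means the header passes."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Requirement 1: the header must be well-formed XML.
        return [f"not well-formed XML: {exc}"]
    # Requirement 2: every field named in the bpng configuration is present.
    # Assumption: fields are direct children of the root element.
    present = {child.tag for child in root}
    return [
        f"missing metadata field: {name}"
        for name in required_fields
        if name not in present
    ]


if __name__ == "__main__":
    for problem in check_header(HEADER, ["year", "creator"]):
        print(problem)
```

An empty result means that the header satisfies both requirements for the given list of fields.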

Structure file

The structure file can be used to mark up the corpus with information about the physical layout of the source scanned document. This allows queries to be limited to particular parts of the document, such as the back matter.

The structure file is a plain-text document which must follow a specific syntax. In short, the file:

  • must refer to specific page ranges in the DjVu document;
  • must give the sequential number of each DjVu document that it describes;
  • can use any names for the individual sections;
  • can describe a tree-like structure, i.e. sections can be nested;
  • can include comments marked with “#”.

An example is shown below. The file describes the structure of a 200-page document with front matter, body, and back matter.

1                         # DjVu document number
front,1,20
acknowledgement,5,20      # Nested section
body,21,180
back,181,200
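To make the syntax more concrete, here is a rough sketch of how such a file could be read. It is not the parser used by Poliqarp for DjVu and rests on several assumptions drawn only from the example above: a line holding a single integer starts a new document, every other line has the form name,first_page,last_page, and a section is treated as nested when its page range lies inside the range of the section before it. The function name parse_structure is made up for this example.

```python
EXAMPLE = """\
1                         # DjVu document number
front,1,20
acknowledgement,5,20      # Nested section
body,21,180
back,181,200
"""


def parse_structure(text):
    """Parse structure-file text into {document_number: [section, ...]}.

    Each section is a dict with the keys name, first, last and children.
    """
    docs = {}
    top = None    # top-level sections of the current document
    stack = []    # currently open sections, outermost first
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        if line.isdigit():                    # start of a document block
            top = docs.setdefault(int(line), [])
            stack = []
            continue
        if top is None:
            raise ValueError(f"line {lineno}: section before document number")
        name, first, last = (part.strip() for part in line.split(","))
        section = {"name": name, "first": int(first),
                   "last": int(last), "children": []}
        if section["first"] > section["last"]:
            raise ValueError(f"line {lineno}: empty page range")
        # Assumption: nesting is inferred from page-range containment.
        while stack and not (stack[-1]["first"] <= section["first"]
                             and section["last"] <= stack[-1]["last"]):
            stack.pop()
        (stack[-1]["children"] if stack else top).append(section)
        stack.append(section)
    return docs


if __name__ == "__main__":
    for number, sections in parse_structure(EXAMPLE).items():
        print(number, [s["name"] for s in sections])
```

With the example file, front, body and back become top-level sections of document 1, and acknowledgement becomes a child of front.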

The section identifiers can be used in corpus queries with the within clause to limit the query results to a particular section.
