Concordances - Unitex - Manuel d'utilisation

10.6.1 The concord.ind file

Theconcord.indfile is the index of the occurrences found by the programLocateduring the application of a grammar. It is a text file that contains the starting and ending position of each occurrence, possibly accompanied by a sequence of letters if the construction of the concordance took into account the possible transducer outputs of the grammar. Here is an example of such a file:

#M¶

59 63 the[ADJ= greater] part¶ 67 71 the beautiful hills¶ 87 91 the pleasant town¶

123 127 the noble seats¶ 157 161 the fabulous Dragon¶

189 193 the Civil Wars¶

455 459 the feeble interference¶ 463 467 the English Council¶

568 572 the national convulsions¶ 592 596 the inferior gentry¶ 628 632 the English constitution¶

698 702 the petty kings¶ 815 819 the certain hazard¶ 898 902 the great Barons¶

940 944 the very edge¶

The first line indicates in which transduction mode the concordance has been constructed.

The three possible values are:

10.6. CONCORDANCES 171

• #I: transducer outputs have been ignored;

• #M: transducer outputs have been inserted before the corresponding inputs (MERGE mode);

• #R: transducer outputs have replaced the recognized sequences (REPLACE mode)).

Each occurrence is described in one line. The lines start with the start and end position of the occurrence. These positions are given in lexical units.

If the file has the heading line#I, the end position of each occurrence is immediately followed by a newline. Otherwise, it is followed by a space and a sequence of characters.

In REPLACE mode, that sequence corresponds to the output produced for the recognized sequence. In MERGE mode, it represents the recognized sequences into which the outputs have been inserted. In MERGE or REPLACE mode, this sequence is displayed in the con-cordance. If the outputs have been ignored, the contents of the occurrence is extracted from the text file.

10.6.2 The concord.txt file

The concord.txtfile is a text file that represents a concordance. Each occurrence is en-coded in a line that is composed of three character sequences separated by a tab, represent-ing the left context, the occurrence (possibly modified by transducer outputs) and the right context.

10.6.3 The concord.html file

Theconcord.htmlfile is anHTMLfile that represents a concordance. This file is encoded in UTF-8.

The title of the page is the number of occurrences it describes. The lines of the concor-dance are encoded as lines where the occurrences are considered to be hypertext lines. The reference associated to each of these lines has the following form: <a href="X Y Z">.

X and Y represent the start and end position of the occurrence in characters in the file name_of_text.snt. Zrepresents the number of the phrase in which this occurrence ap-pears.

All spaces that are at the left and right edges of lines are encoded by a non breaking space ( in HTML), which allows the preservation of the alignment of the utterances even if one of them has a left context with spaces.

NOTE: if the concordance has been constructed with the glossanet parameter, the HTML file obtains the same structure, except for the links. In these concordances, the oc-currences are real links pointing at the web server of the GlossaNet application. For more information on GlossaNet, consult the link on the Unitex web site.

Here is an example of a file:

172 CHAPTER 10. FILE FORMATS

on, there <a href="116 124 2">extended</a> i  ¶

extended <a href="125 127 2">in</a> ancient  ¶

Scott {S}<a href="32 34 2">IN</a> THAT PL  ¶

STRICT of <a href="61 66 2">merry</a> Engl  ¶ S}IN THAT <a href="40 48 2">PLEASANT</a> D  ¶

which is <a href="84 91 2">watered</a> by  ¶

¶

</td></table></body>¶

</html>¶

Figure10.2shows the page that corresponds to the file below.

Figure 10.2: Example of a concordance

10.6.4 The diff.html file

Thediff.htmlfile is anHTMLfile that presents the differences between two concordances.

This file is encoded in UTF-8. Here is an example of file (new lines have been introduced for presentation convenience):

10.7. DICTIONARIES 173

Red: similar but different sequences

Green: sequences that occur in only one of the two concordances

<table border="1" cellpadding="0" style="font-family: Courier new;

font-size: 12">

<tr><td width="450">ed in ancient times

a large forest, covering the greater par</td>

<td width="450">ed in ancient times

a largeforest, covering the greater par</td>

</tr>

<tr><td width="450">ge forest, covering

the greater part of the beautiful hills

</td>

The compression of the DELAF dictionaries by theCompressprogram produces two files:

a.binfile that represents the minimal automaton of the inflected forms of the dictionaries, and a .inf file that contains the compressed forms required for the construction of the dictionaries from the inflected forms. This section describes the format of these two file types, as well as the format of the CHECK_DIC.TXTfile, which contains the result of the verification of a dictionary.

10.7.1 The .bin files

A.binfile is a binary file that represents an automaton. The first 4 bytes of the file represent an integer that indicates the size of the file in bytes. The states of the automaton are encoded in the following way:

• the first two bytes indicate if the state is final as well as the number of its outgoing transitions. The highest bit is 0 if the state is final, 1 if not. The other 15 bits encode the number of transitions. Example: a non-final state with 17 transitions is encoded by the

Dans le document Unitex - Manuel d'utilisation (Page 171-174)