Foreign language documents can be recognised and displayed separately
The recognition of foreign language texts is based on the following
strategy:
- In normal English documents, FindWord recognises a certain number of words
using its dictionary and parser. These are shown in the word list with an
asterisk "*" prepended. For every file shown in the file window, the number of
words recognised in it is shown as a percentage of the total number of words
it contains (in the column on the right).
- In foreign texts and special texts (address books, computer programs and
such), only a few words will be recognised, and their percentage of the total
number of words will be very low.
- FindWord makes use of this phenomenon by allowing you to define a minimum
percentage of recognised words, below which a document is classed as not
containing normal English text. Within a project you can thus choose to
display
- all documents,
- only English documents, or
- only foreign language or special documents.
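The strategy above can be sketched in a few lines of Python. This is only an illustration, not FindWord's actual implementation: the names `DisplayMode`, `recognised_percentage` and `is_shown` are hypothetical, the dictionary is a toy word set, and recognition is reduced to a plain dictionary lookup (the parser is left out).

```python
from enum import Enum


class DisplayMode(Enum):
    """The three display choices described above (hypothetical names)."""
    ALL = "all documents"
    ENGLISH_ONLY = "only English documents"
    FOREIGN_ONLY = "only foreign language or special documents"


def recognised_percentage(words, dictionary):
    """Return the share of words found in the dictionary, in percent."""
    if not words:
        return 0.0  # an empty document has no recognised words
    hits = sum(1 for word in words if word.lower() in dictionary)
    return 100.0 * hits / len(words)


def is_shown(percentage, minimum_percentage, mode):
    """Decide whether a document is displayed under the given mode."""
    # Assumption: a document exactly at the minimum counts as English.
    is_english = percentage >= minimum_percentage
    if mode is DisplayMode.ALL:
        return True
    if mode is DisplayMode.ENGLISH_ONLY:
        return is_english
    return not is_english


# Toy data, invented for this sketch.
dictionary = {"the", "patent", "was", "granted", "in"}
english_words = "The patent was granted in 1998".split()
german_words = "Das Patent wurde im Jahr 1998 erteilt".split()

english_pct = recognised_percentage(english_words, dictionary)
german_pct = recognised_percentage(german_words, dictionary)

print(is_shown(english_pct, 20, DisplayMode.FOREIGN_ONLY))
print(is_shown(german_pct, 20, DisplayMode.FOREIGN_ONLY))
```

With a minimum of 20%, the English sentence is hidden and the German sentence is shown when only foreign language or special documents are requested.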
An example:
A project has 13 documents, all containing the word "patent".
The "Recognition quotient" specifies, as a percentage for the current
project, how many words must be recognised from the dictionary or by the parser
for a document to be classified as containing readable English. On average we
observe that
- more than 30% of words are recognised in an English text, but
- less than 15% of words are recognised in foreign or special
texts.
We therefore recommend a recognition quotient of about 20%.
With that set and "Only foreign language files" selected, we
immediately see that 3 files, with recognition percentages of 6%, 12% and 8%
respectively, lie below the 20% mark.
Conversely, if we select "Only english language files", we see the remaining
13 - 3 = 10 files, each with a recognition percentage of at least 20%.
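The arithmetic of this example can be checked with a short Python sketch. Only the values 6, 12 and 8 and the total of 13 files come from the example. The file names and the ten remaining percentages are invented for illustration, and the code is not part of FindWord.

```python
THRESHOLD = 20  # recommended recognition quotient, in percent

# Percentage of recognised words per file.
# Only 6, 12 and 8 are taken from the example; everything else is made up.
files = {
    "doc01.txt": 6,
    "doc02.txt": 12,
    "doc03.txt": 8,
    "doc04.txt": 31,
    "doc05.txt": 34,
    "doc06.txt": 38,
    "doc07.txt": 42,
    "doc08.txt": 45,
    "doc09.txt": 33,
    "doc10.txt": 36,
    "doc11.txt": 40,
    "doc12.txt": 47,
    "doc13.txt": 20,  # exactly at the mark, so it counts as English
}

# Files below the mark are treated as foreign language or special files.
foreign = {name: pct for name, pct in files.items() if pct < THRESHOLD}
# Files at or above the mark are treated as English files.
english = {name: pct for name, pct in files.items() if pct >= THRESHOLD}

print(len(files), len(foreign), len(english))  # prints "13 3 10"
```

The two selections never overlap and together always cover all 13 files, whatever threshold is chosen.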