Diffstat (limited to 'libraries/libexttextcat/README')
-rw-r--r-- | libraries/libexttextcat/README | 20 |
1 files changed, 20 insertions, 0 deletions
diff --git a/libraries/libexttextcat/README b/libraries/libexttextcat/README
new file mode 100644
index 0000000000000..3b9743c04a4dd
--- /dev/null
+++ b/libraries/libexttextcat/README
@@ -0,0 +1,20 @@
+Libtextcat is a library with functions that implement the
+classification technique described in Cavnar & Trenkle, "N-Gram-Based
+Text Categorization". It was primarily developed for language
+guessing, a task on which it is known to perform with near-perfect
+accuracy.
+
+The central idea of the Cavnar & Trenkle technique is to calculate a
+"fingerprint" of a document with an unknown category, and compare this
+with the fingerprints of a number of documents of which the categories
+are known. The categories of the closest matches are output as the
+classification. A fingerprint is a list of the most frequent n-grams
+occurring in a document, ordered by frequency. Fingerprints are
+compared with a simple out-of-place metric. See the article for more
+details.
+
+Considerable effort went into making this implementation fast and
+efficient. The language guesser processes over 100 documents/second on
+a simple PC, which makes it practical for many uses. It was developed
+for use in our webcrawler and search engine software, in which it
+handles millions of documents a day.
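
The fingerprint and out-of-place comparison described in the README text can be illustrated with a minimal Python sketch. This is not libtextcat's code or API: the function names (`fingerprint`, `out_of_place`, `classify`), the underscore padding, the n-gram lengths of 1 to 5, the cutoff of 300 n-grams, and the tiny sample texts are all assumptions chosen for illustration.

```python
from collections import Counter


def fingerprint(text, max_n=5, top=300):
    """Return the `top` most frequent character n-grams, most frequent first.

    max_n and top are illustrative defaults, not values taken from libtextcat.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (an assumption of this sketch)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Order by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, of how far each one's rank is
    from its rank in the category fingerprint. N-grams missing from the
    category fingerprint get a fixed maximum penalty."""
    max_penalty = len(cat_fp)
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )


def classify(text, profiles):
    """Return the category whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text)
    return min(profiles, key=lambda name: out_of_place(doc_fp, profiles[name]))


if __name__ == "__main__":
    # Hypothetical one-sentence training samples; real profiles would be
    # built from much larger texts.
    profiles = {
        "english": fingerprint("the quick brown fox jumps over the lazy dog and the cat"),
        "dutch": fingerprint("de snelle bruine vos springt over de luie hond en de kat"),
    }
    print(classify("the dog and the fox", profiles))
```

A lower out-of-place score means a closer match, so `classify` simply picks the category with the minimum score. With samples this small the result is only a demonstration; the accuracy claimed in the README depends on fingerprints built from substantial training text.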