Monday, December 7, 2015

Dictionaries and PostgreSQL FTS

Introduction

There are several text search operators in PostgreSQL: LIKE (~~), ILIKE (~~*), ~, ~* and SIMILAR TO. But they have an important disadvantage: they do not provide linguistic support. For example, you will not find the string "great abilities" if you execute the following query:

CREATE TABLE table1 (col1 varchar);
INSERT INTO table1 VALUES ('great abilities');
SELECT col1 FROM table1 WHERE col1 LIKE '%ability%';

To avoid this drawback, use the tsvector and tsquery data types. These data types store lexemes. Moreover, tsvector can store the positions and weights of lexemes.
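For example, a full-text search does find the row inserted above, because both "abilities" and "ability" are normalized to the same lexeme (a sketch using the table1 table created earlier):

```sql
-- Both "abilities" and "ability" are stemmed to the lexeme 'abil',
-- so the full-text match succeeds where LIKE '%ability%' failed:
SELECT col1 FROM table1
 WHERE to_tsvector('english', col1) @@ to_tsquery('english', 'ability');
-- Returns the row 'great abilities'.
```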

The process of extracting lexemes from a document consists of the following steps:

  • parsing the document into tokens. There are various kinds of tokens: word, url, int, email, etc.
  • converting the tokens into lexemes. To do this, PostgreSQL uses Full-Text Search (FTS) configurations and sets of dictionaries that normalize tokens to their basic form.

I will omit the description of the tsvector and tsquery data types and the FTS operators. In this topic I want to describe the various kinds of dictionaries and how to use them.

Dictionaries

All PostgreSQL dictionaries can be combined into configurations. FTS comes with a set of default configurations. To see which configurations are available to you, use the following command:

\dF

With the following command you can ask for more detailed information about the english configuration:

\dF+ english

It returns a correspondence table between token kinds and dictionaries.

Configurations are used in functions such as to_tsvector() and to_tsquery(). For example, the following query uses the english configuration and the english_stem dictionary:

SELECT to_tsvector('english', 'The blue whale is the largest animal');

It returns the following result:

               to_tsvector               
-----------------------------------------
 'anim':7 'blue':2 'largest':6 'whale':3
(1 row)

In this example you can see that the words "The", "is" and "the" were removed, because they are defined in a stop word list. For the english configuration, the stop word list is defined in the file $SHAREDIR/tsearch_data/english.stop (use the pg_config --sharedir command to determine $SHAREDIR). A stop word file is just a plain text file with one word on each line.

Also you can see that the word "animal" was converted to the lexeme "anim", which is not a correct English word. It was converted using the Snowball stemming rules.

There are various kinds of predefined dictionary templates.

Simple dictionary

The simple dictionary does two things:

  • removes stop words
  • converts a word to lowercase if it was not recognized as a stop word.
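You can test a dictionary directly with the ts_lexize() function. For example, a simple dictionary created with the english stop word list behaves as follows (the dictionary name simple_en here is made up):

```sql
-- A simple dictionary that uses the english stop word list:
CREATE TEXT SEARCH DICTIONARY simple_en (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);

SELECT ts_lexize('simple_en', 'The');    -- {}        (stop word, removed)
SELECT ts_lexize('simple_en', 'Whales'); -- {whales}  (lowercased, not stemmed)
```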

Synonym dictionary

This dictionary is used to replace words with their synonyms. It relies on a file which contains the synonym conversions. This file must be stored in the $SHAREDIR/tsearch_data directory and have the .syn extension.
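For example, assuming the file $SHAREDIR/tsearch_data/my_synonyms.syn contains the line "postgres pgsql", such a dictionary can be created and tested like this (the dictionary and file names are made up):

```sql
CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

SELECT ts_lexize('my_synonym', 'postgres'); -- {pgsql}
```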

Thesaurus dictionary

The thesaurus dictionary converts phrases into lexemes with the same meaning, using a file containing the phrase conversions. The file has the .ths extension.

This dictionary also uses a subdictionary to normalize the thesaurus text.
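For example, assuming the file $SHAREDIR/tsearch_data/my_thesaurus.ths contains the line "supernovae stars : sn", a thesaurus dictionary with english_stem as the subdictionary could be created like this (the names are made up):

```sql
CREATE TEXT SEARCH DICTIONARY my_thesaurus (
    TEMPLATE = thesaurus,
    DictFile = my_thesaurus,
    Dictionary = pg_catalog.english_stem
);
```

Note that ts_lexize() is not very useful for testing a thesaurus dictionary, because it processes its input as a single token rather than a phrase; test it through to_tsvector() with a configuration instead.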

Snowball dictionary

The snowball dictionary uses stemming algorithms to reduce words to their basic form. PostgreSQL 9.4 contains stemmers for 16 languages. The stemming algorithms are predefined in PostgreSQL and you cannot change them. The dictionary recognizes everything, so it is good practice to place it at the end of the dictionary stack.
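Again, ts_lexize() shows what the stemmer does; for example, with the built-in english_stem dictionary:

```sql
SELECT ts_lexize('english_stem', 'relational'); -- {relat}
SELECT ts_lexize('english_stem', 'the');        -- {}  (stop word)
```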

Ispell dictionary

Finally, the ispell dictionary. This dictionary uses morphological rules to get the basic form of words. These rules are defined in two files with the .affix and .dict extensions. PostgreSQL supports three formats of these files: Ispell, MySpell and Hunspell. They are used by the spell checker programs of the same names.

Let’s talk about the dictionary formats.

Ispell

The .affix file of Ispell has the following structure:

prefixes
flag *A:
    .           >   RE      # As in enter > reenter
suffixes
flag T:
    E           >   ST      # As in late > latest
    [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
    [AEIOU]Y    >   EST     # As in gray > grayest
    [^EY]       >   EST     # As in small > smallest

And the .dict file has the following structure:

lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS

The format of the .dict file is simple:

basic_form/affix_class_name

In the .affix file every affix flag is described in the following format:

condition > [-stripping_letters,] adding_affix

Here, condition has a format similar to the format of regular expressions. It can use groupings [...] and [^...]. For example, [AEIOU]Y means that the last letter of the word is "y" and the penultimate letter is "a", "e", "i", "o" or "u". [^EY] means that the last letter is neither "e" nor "y".

Hunspell

MySpell is very similar to Hunspell, so I will not describe MySpell in this topic.

The .affix file of Hunspell has the following structure:

PFX A Y 1
PFX A   0     re         .
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]

The first line of an affix class is the header. The fields of an affix rule are listed after the header:

  • parameter name (PFX or SFX)
  • flag (name of the affix class)
  • characters stripped from the beginning (for a prefix) or end (for a suffix) of the word
  • adding affix
  • condition, which has a format similar to that of regular expressions.

The .dict file looks like the .dict file of Ispell:

larder/M
lardy/RT
large/RSPMYT
largehearted

In order to use Ispell or Hunspell dictionaries you need to perform these steps:

  • download the dictionary files. You can download Ispell dictionary files for some languages from the link. Hunspell dictionary files for many languages can be downloaded from the OpenOffice extensions page. OpenOffice extensions have the .oxt extension; you need to extract the .aff and .dic files and change their extensions to .affix and .dict. Some dictionary files also need to be converted to the UTF-8 encoding with commands like the following (here, for the Norwegian dictionary):
    iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
    iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
    
  • copy files to the $SHAREDIR/tsearch_data/ directory
  • load files into PostgreSQL with the following command:
    CREATE TEXT SEARCH DICTIONARY en_hunspell (
     Template = ispell,
     DictFile = en_us,
     AffFile = en_us,
     Stopwords = english);
    
  • create text search configuration:
    CREATE TEXT SEARCH CONFIGURATION en_hunspell(parser = default);
    ALTER TEXT SEARCH CONFIGURATION en_hunspell
     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
      word, hword, hword_part
     WITH en_hunspell;
    

After these steps you can execute the following query:

SELECT to_tsvector('en_hunspell', 'The blue whale is the largest animal');

It returns the following result:

               to_tsvector               
-----------------------------------------
 'animal':7 'blue':2 'large':6 'whale':3
(1 row)

Here we can see that the word "animal" was converted to the lexeme "animal", which is a correct English word.

How does PostgreSQL get the basic form of a word? All affixes from the .affix file and words from the .dict file are stored in prefix trees after they have been loaded. To find the basic form, PostgreSQL goes through the following steps:

  • search for all affixes that match the desired word, using the prefix tree
  • get a new word by applying a found affix to the desired word
  • check whether the new word satisfies the affix rule condition
  • search for lexemes in the .dict file, using the prefix tree, that match the found affix flag.

Let’s examine these steps using the word "largest":

  • PostgreSQL finds the affixes:
    • SFX T 0 st e
    • SFX T 0 est [aeiou]y
    • SFX T 0 est [^ey]
  • gets the new words:
    • "large"
    • "larg"
    • "larg"
  • checks whether each new word satisfies its affix rule condition and is present in the .dict file:
    • "large" satisfies "e" and is found in the .dict file with the flag T
    • "larg" does not satisfy "[aeiou]y"
    • "larg" satisfies "[^ey]", but is not found in the .dict file.

As a result we get the lexeme "large".
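You can check this result with ts_lexize(), assuming the en_hunspell dictionary created above:

```sql
SELECT ts_lexize('en_hunspell', 'largest'); -- {large}
```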

Compound words

A compound word is a word that consists of two or more other words. Many languages have compound words. For example, there is the word "Fotballklubber" in Norwegian. Using this word as a query together with a Hunspell dictionary for Norwegian, you can find a document containing the word "Klubbfotballen".

But first, you need to download a Norwegian Hunspell dictionary and load it as described above, using the following commands:

CREATE TEXT SEARCH DICTIONARY no_hunspell(
 Template = ispell,
 DictFile = nn_no,
 AffFile = nn_no,
 Stopwords = norwegian);
CREATE TEXT SEARCH CONFIGURATION no_hunspell(parser = default);
ALTER TEXT SEARCH CONFIGURATION no_hunspell
 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
  word, hword, hword_part
 WITH no_hunspell;

After this you will get the following tsvectors for the words "Fotballklubber" and "Klubbfotballen":

SELECT to_tsvector('no_hunspell', 'Fotballklubber');
              to_tsvector               
----------------------------------------
 'ball':1 'fot':1 'fotball':1 'klubb':1
(1 row)

SELECT to_tsvector('no_hunspell', 'Klubbfotballen');
                            to_tsvector                            
-------------------------------------------------------------------
 'ball':1 'balle':1 'fot':1 'fotball':1 'klubb':1 'klubbfotball':1
(1 row)

You can see that the words were split into several variants of lexemes. For example, for the word "Fotballklubber" we get two variants:

  1. fot + ball + klubb
  2. fotball + klubb.

And we can test text search for compound words:

SELECT to_tsvector('no_hunspell', 'Klubbfotballen') @@ to_tsquery('no_hunspell', 'Fotballklubber');
 ?column? 
----------
 t
(1 row)

Conclusion

In this topic I have tried to explain the various kinds of dictionary templates and the formats of the Ispell dictionaries. I have covered only the basic options of the Ispell and Hunspell formats. There are other Hunspell options that PostgreSQL does not support; the AF and FLAG options are among them.

The FLAG option sets the affix flag type. The default type is a single extended ASCII (8-bit) character. The value "long" sets the two-character extended ASCII flag type, and "num" sets the decimal number flag type. For example, the French dictionary uses the two-character flag type:

SFX S. Y 2
SFX S.   0   0   .
SFX S.   0   s   [^sxz]

The AF option sets aliases for affix flag sets. The French dictionary uses this option too:

AF 273
AF S.()
AF S*()

The first line of the option is the header, which states the number of aliases.

References

  1. Tsearch2 - full text extension for PostgreSQL
  2. PostgreSQL: A Full Text Search engine
  3. Ispell documentation
  4. Hunspell documentation
  5. PostgreSQL documentation. Dictionaries
