Tuesday, December 22, 2015

Dictionaries and PostgreSQL FTS. Part 2

This is the second part of the series on using Ispell and Hunspell dictionaries with PostgreSQL. In this part I describe the FLAG and AF parameters of Hunspell and a patch that lets PostgreSQL load dictionaries that use these parameters.

FLAG and AF parameters of Hunspell

Let’s look at these parameters using the French dictionary as an example. This is the .affix file fragment:

FLAG long
AF 273
AF S.()  #1
AF S*()  #2
AF F.()  #3
AF a0p+()  #4
AF F*()  #5
AF L'D'Q'  #6
AF W.()  #7
AF n'q'l'm't's' #8
...
SFX S. Y 2
SFX S.   0    0    .
SFX S.   0    s    [^sxz]
...
SFX F. Y 72
SFX F.   0    0    .
SFX F.   0    s    [eë]
SFX F.   e    0    [éiï]e
SFX F.   e    s    [éiï]e
SFX F.   rice eur  [dt]rice
SFX F.   rice eurs [dt]rice
SFX F.   de   d    de
SFX F.   de   ds   de
SFX F.   fe   f    fe
SFX F.   fe   fs   fe
...
SFX a0 N 102
SFX a0   er   er    er
SFX a0   er   ant   [^cg]er
SFX a0   cer  çant  cer
SFX a0   ger  geant ger
SFX a0   er   e     [^y]er
SFX a0   yer  ye    [^ou]yer

This is the .dict file fragment:

amodiatrice/3
argumentatrice/3
babillarde/3
banlieusarde/3

Here the .dict file has the following format:

basic_form/alias_number

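The AF parameter defines numbered aliases for flag sets. If it is used in an .affix file, then entries in the .dict file must refer to alias numbers, not to affix class names. In the fragment above, alias 3 stands for the flag set F.(), so amodiatrice/3 belongs to the affix class F.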

The French dictionary also uses the FLAG long parameter. With it every flag consists of two extended ASCII characters, which allows a much larger number of affix flags than the default single-character flags.
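
To make the interplay of AF and FLAG long concrete, here is a minimal Python sketch of how a .dict entry could be resolved into its flags. It is only an illustration, not the code used by Hunspell or PostgreSQL; the names AF_TABLE, split_long_flags and resolve_entry are invented for this example, and the table holds only the eight aliases shown in the fragment above.

```python
# Aliases copied from the AF lines of the fragment above (the real file has 273).
AF_TABLE = ["S.()", "S*()", "F.()", "a0p+()", "F*()", "L'D'Q'", "W.()", "n'q'l'm't's'"]


def split_long_flags(flags):
    """Split a FLAG long flag string into two-character flags."""
    if len(flags) % 2 != 0:
        raise ValueError("FLAG long expects an even number of characters: %r" % flags)
    return [flags[i:i + 2] for i in range(0, len(flags), 2)]


def resolve_entry(line, af_table):
    """Turn a .dict line such as 'amodiatrice/3' into (word, [flags])."""
    word, _, alias = line.partition("/")
    if not alias:
        return word, []
    # Alias numbers are 1-based: alias 3 is the third AF line.
    return word, split_long_flags(af_table[int(alias) - 1])


print(resolve_entry("amodiatrice/3", AF_TABLE))  # ('amodiatrice', ['F.', '()'])
```

So the entry amodiatrice/3 carries two flags, F. and (), and the first of them selects the SFX F. rules.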

With this French dictionary we should get the following results (how to load dictionaries is described here):

SELECT ts_lexize('fr_hunspell', 'amodiateur');
   ts_lexize   
---------------
 {amodiatrice}
(1 row)
SELECT ts_lexize('fr_hunspell', 'argumentateur');
    ts_lexize     
------------------
 {argumentatrice}
(1 row)
SELECT ts_lexize('fr_hunspell', 'babillard');
  ts_lexize   
--------------
 {babillarde}
(1 row)
SELECT ts_lexize('fr_hunspell', 'banlieusard');
   ts_lexize    
----------------
 {banlieusarde}
(1 row)
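
These expected results follow from the SFX F. rules. The Python sketch below applies the ten rules shown in the fragment (the real class has 72) to a basic form and lists the word forms they produce. It goes from the basic form to its word forms only to show the mapping, while ts_lexize performs the reverse lookup. The names SFX_F and apply_suffix_rules are invented for this example.

```python
import re

# (strip, add, condition) triples copied from the SFX F. fragment above;
# "0" means "nothing to strip" or "nothing to add".
SFX_F = [
    ("0", "0", "."),
    ("0", "s", "[eë]"),
    ("e", "0", "[éiï]e"),
    ("e", "s", "[éiï]e"),
    ("rice", "eur", "[dt]rice"),
    ("rice", "eurs", "[dt]rice"),
    ("de", "d", "de"),
    ("de", "ds", "de"),
    ("fe", "f", "fe"),
    ("fe", "fs", "fe"),
]


def apply_suffix_rules(stem, rules):
    """Return the word forms produced from stem by every matching suffix rule."""
    forms = []
    for strip, add, cond in rules:
        strip = "" if strip == "0" else strip
        add = "" if add == "0" else add
        # The condition is matched against the end of the basic form.
        if not re.search("(?:" + cond + ")$", stem):
            continue
        if not stem.endswith(strip):
            continue
        forms.append(stem[:len(stem) - len(strip)] + add)
    return forms


print(apply_suffix_rules("amodiatrice", SFX_F))
# ['amodiatrice', 'amodiatrices', 'amodiateur', 'amodiateurs']
print(apply_suffix_rules("babillarde", SFX_F))
# ['babillarde', 'babillardes', 'babillard', 'babillards']
```

The form amodiateur comes from the rule that strips rice and adds eur, and babillard from the rule that strips de and adds d.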

But instead we get the following error:

ERROR:  Ispell dictionary supports only default flag value
CONTEXT:  line 161 of configuration file "/home/artur/progs/pgsqlpro/share/tsearch_data/fr.affix": "FLAG long"

This happens because PostgreSQL does not support the FLAG parameter. PostgreSQL does not support the AF parameter either, but in that case no error is raised. For example, you can load this Hungarian dictionary and test it.

Let’s look at this Danish dictionary (or this one). This is the .affix file fragment:

FLAG num
SFX 6 Y 4
SFX 6   0   de/148,944   e
SFX 6   0   ede/944,148  [^e]
SFX 6   0   et/944,148   [^e]
SFX 6   0   t/148,944    e
...
SFX 841 Y 20
SFX 841   0   be/70   b
SFX 841   0   ce/70   c
SFX 841   0   de/70   d
SFX 841   0   fe/70   f

And this is the .dict file fragment:

abonnere/6,143,148
absolvere/6,143,148
aller/699,55
alminde/699,55

Here the .dict file has the following format:

basic_form/flag,flag,...

Here the FLAG num parameter is used. With it flags are decimal numbers separated by commas, which also allows a large number of affix flags.
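
As a small illustration of the FLAG num format, the Python sketch below splits such a field into numeric flags. The function name parse_num_field is invented for this example; the same field format appears both in the .dict entries and after the slash in the affix rules shown above.

```python
def parse_num_field(entry):
    """Split 'text/flag,flag,...' into (text, [numeric flags])."""
    text, _, field = entry.partition("/")
    flags = [int(f) for f in field.split(",")] if field else []
    return text, flags


print(parse_num_field("abonnere/6,143,148"))  # ('abonnere', [6, 143, 148])
print(parse_num_field("de/148,944"))          # ('de', [148, 944])
```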

Improvements

With some fixes PostgreSQL can support these parameters. You can download a patch from this thread (or use the direct link to the patch).

This patch adds support for the FLAG long, FLAG num and AF parameters.

To apply this patch you need to perform these steps:

  • download the PostgreSQL source, version 9.5 or higher (from here), and extract it.
  • download the patch.
  • execute the following command from the root of the extracted source tree:
    patch -p1 < ../patches/hunspell_dict.patch
  • build and install PostgreSQL from the patched source. You can use this documentation.

Further improvements

You may have noticed that the .affix file of the Danish dictionary has an unusual format:

SFX 841   0   be/70   b

Here 70 is a reference to another flag, which is defined as follows:

SFX 70 Y 2
SFX 70 0 s/944 [^sxz]
SFX 70 0 '/944 [sxz]

This means that the suffix s can be appended after the suffix be (but not the suffix ', since the ending be does not satisfy the condition [sxz]).
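
The chaining of suffixes can be sketched in Python as follows. This is only an illustration of the idea, not the algorithm used by Hunspell; the names RULES and expand are invented for this example, and the stem klub is a hypothetical input that is not taken from the Danish dictionary.

```python
import re

# flag -> list of (strip, add, continuation flags, condition),
# copied from the SFX 841 and SFX 70 fragments above ("0" strip shown as "").
RULES = {
    841: [("", "be", [70], "b"), ("", "ce", [70], "c"),
          ("", "de", [70], "d"), ("", "fe", [70], "f")],
    70: [("", "s", [944], "[^sxz]"), ("", "'", [944], "[sxz]")],
}


def expand(stem, flags, rules, depth=2):
    """Apply suffix rules, then the rules named by their continuation flags."""
    forms = []
    if depth == 0:
        return forms
    for flag in flags:
        for strip, add, cont, cond in rules.get(flag, []):
            # The condition is matched against the end of the current form.
            if not re.search("(?:" + cond + ")$", stem):
                continue
            if not stem.endswith(strip):
                continue
            form = stem[:len(stem) - len(strip)] + add
            forms.append(form)
            # The new form may take one more suffix from the referenced flags.
            forms.extend(expand(form, cont, rules, depth - 1))
    return forms


# "klub" is a hypothetical stem assumed to carry flag 841.
print(expand("klub", [841], RULES))  # ['klubbe', 'klubbes']
```

The first step adds be because the stem ends with b, and the second step adds s because klubbe ends with a character other than s, x or z.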

Without this feature some dictionaries will not work correctly. But this feature is not supported by PostgreSQL yet.
