воскресенье, 27 декабря 2015 г.

Fuzzy substring searching with the pg_trgm extension

Introduction

In the previous topic I mentioned that we have some operators for the pattern matching (LIKE (~~), ILIKE (~~*), ~, ~* and SIMILAR TO) and an operator for the text search (@@). If you know exactly what you want to find this operators are good choise.

But if you know a search query only approximately then a trigram matching can help you. A trigram matching is provided by the pg_trgm module in PostgreSQL. For example, you can execute the following query:

CREATE TABLE clients(name varchar);
INSERT INTO clients VALUES('Devereaux'), ('Devereux');
SELECT * FROM clients WHERE name % 'Devereux';

And you will get the following people with the Devereux surname:

   name  
-----------
 Devereaux
 Devereux
(2 rows)

pg_trgm

The pg_trgm module provides the following functions and operators:

  • functions:
    • similarity - returns a number (from 0 to 1) that indicates how similar two strings
    • show_trgm - returns an array of all trigrams of the given argument
    • set_limit - sets the current similarity threshold that is used by the % operator (from 0 to 1, default is 0.3)
    • show_limit - returns the current similarity threshold that is used by the % operator
  • operators:
    • % - returns true if its arguments have a similarity that is greater than the current similarity threshold
    • <-> - returns one minus the similarity() value.

How it works?

Firstly, both a query and a document are divided into unique and sorted trigrams. For example, "Devereux" is divided into this trigrams:

SELECT show_trgm('Devereux');
                  show_trgm                  
---------------------------------------------
 {"  d"," de",dev,ere,eux,eve,reu,"ux ",ver}
(1 row)

And "Devereaux" is divided into this trigrams:

SELECT show_trgm('Devereaux');
                    show_trgm                    
-------------------------------------------------
 {"  d"," de",aux,dev,eau,ere,eve,rea,"ux ",ver}
(1 row)

Secondly, similarity of both strings is calculated using this formula:

count is a number of equal trigrams of both strings, len1 is a number of unique trigrams of the first string and len2 is a number of unique trigrams of the second string.

Lastly, if the calculated similarity greater than the current similarity threshold (default 0.3) then the string "Devereaux" will be returned by SELECT query.

pg_trgm also provides GIST and GIN indexes support for the % similarity operator and for the pattern matching operators LIKE (~~), ILIKE (~~*), ~ and ~*.

Fuzzy substring searching

Both the similarity function and the % operator match two strings expecting that they are wholly similar. But they give bad results if we want to find documents by a query which is a substring of a desired document. For example, try to find the document "General Ricardo Wall y Devereux" using the query "Devereaux".

For similar purposes we need a new operator.

Here you can download a new version of the pg_trgm module. It can be installed into PostgreSQL 9.4 and above. To install it you may follow these steps:

  • download source of a new version of module and extract it to pg_trgm folder
  • execute the command:
    cd pg_trgm
    make USE_PGXS=1 install && make USE_PGXS=1 installcheck 
    
  • execute the query (if the module is not installed in your base yet):
    CREATE EXTENSION pg_trgm
    

This module provides the following additional functions and operators:

  • functions:
    • substring_similarity(text, text) - returns a number (from 0 to 1) that indicates how similar the first string to the substring of the second string.
    • show_substring_limit() - returns the current substring similarity threshold that is used by the <% operator.
    • set_substring_limit(real) - sets the current substring similarity threshold that is used by the <% operator. The threshold must be between 0 and 1 (default is 0.6).
  • operators:
    • <% - returns true if its first argument have a similar substring in the second argument.

The substring_similarity function use the formula described above. The function calculates the maximum similarity for a substring within the second string. For example, we have the following unsorted trigrams for the "Devereaux":

{"  d"," de",dev,eve,ver,ere,rea,eau,aux,"ux "}

We have the following unsorted trigrams for the "General Ricardo Wall y Devereux":

{"  g"," ge",gen,ene,ner,era,ral,"al ","  r"," ri",ric,ica,car,ard,rdo,"do ","  w"," wa",wal,all,"ll ","  y"," y ","  d"," de",dev,eve,ver,ere,reu,eux,"ux "}

Selected trigrams give the maximum similarity with the string "Devereaux". This similarity will be equal to 0.6:

SELECT substring_similarity('Devereaux', 'General Ricardo Wall y Devereux');
 substring_similarity
----------------------
                  0.6
(1 row)

Let’s execute the following example query:

INSERT INTO clients VALUES('General Ricardo Wall y Devereux, Spanish General and Prime Minister'),
 ('George Devereux, Austro-Hungarian ethnologist and psychoanalyst'),
 ('Helena Devereux, American educator and founder of the Devereux Foundation'),
 ('James Devereux, U.S. Marine Corps General and politician');
SELECT * FROM clients WHERE 'Devereaux' <% name;

Then we will get the following results:

                                   name                                    
---------------------------------------------------------------------------
 Devereaux
 Devereux
 General Ricardo Wall y Devereux, Spanish General and Prime Minister
 George Devereux, Austro-Hungarian ethnologist and psychoanalyst
 Helena Devereux, American educator and founder of the Devereux Foundation
 James Devereux, U.S. Marine Corps General and politician
(6 rows)

Conclusion

In this topic I described new possibility of fuzzy string searching within another string. Tests show the following results:

Tests have been done using a table that contains 10254520 records. This table contains texts that have trigram count between 2 and 121. 214 queries were done to this table using GIN and GIST indexes and same queries were done without indexes.

Two graphs ("GIN - index time" and "GIST - index time") show the dependence of the actual time of bitmap index scan from the trigram count of a searched text.

Third graph ("Sequential scan") shows the dependence of the actual time of sequential scan from the trigram count of a searched text.

As you can see performance of the new operator is comparable with the % operator. But I have not done tests to show performance of the search in large texts.

References

  1. Code in github

Authors

Alexander Korotkov a.korotkov@postgrespro.ru

Artur Zakirov a.zakirov@postgrespro.ru

Комментариев нет:

Отправить комментарий