Full-text searching in Django with PostgreSQL and tsearch2

Posted by Ross Poulton on Wed 28 January 2009 #fulltext #searching #django #tutorial

Recently, I have wanted to improve the searching ability in a number of my projects. I've often added a search box that simply does an icontains filter, which works fine for small sites but doesn't scale due to the inefficiency of matching text with basic SQL queries. The plus side of an icontains filter is that it works on all of Django's database backends (with the exception of SQLite, which works but in a case-sensitive way).

All of my projects run on PostgreSQL, an open-source database package that runs beautifully everywhere I've tried it. If you are familiar with MySQL, Postgres shouldn't be a problem for you.

When you want to do full-text searching from within your Django projects, you have a few different choices available to you, mostly depending on which database backend you're using:

  • MySQL has built-in fulltext indexing which can be used in Django with some basic ORM extensions.
  • PostgreSQL prior to 8.3 has a 'contrib' module known as tsearch2, which allows full-text indexing within your SQL queries. This is what I am going to detail below.
  • PostgreSQL 8.3 and later has tsearch2 built-in, which is usable in Django if you follow these directions from Barry Pederson
  • You can install Sphinx to index your database content (on either MySQL or Postgres), then use django-sphinx to get access to your Sphinx indexes within the Django ORM. Since this can work over a network, it may be a better option for massive datasets as you can run your database and index on separate machines.
  • Similar to Sphinx is solr, which can be used with a Django ORM addon as documented in this blog post. I believe solr works with both MySQL and PostgreSQL, but I haven't investigated it further.

I'm using PostgreSQL 8.1, and decided not to rely on external software to maintain my index due to memory limitations. Therefore my choice was simple: I utilised the tsearch2 contrib application for PostgreSQL.

Note: All of the instructions in this blog post relate to my setup on a basic Debian 4.0 system with a relatively default PostgreSQL configuration. PostgreSQL and Django were already installed and running, and I already had my models in place. My models also had data in them already as I was upgrading an existing website. YMMV.

The first thing I had to do was install the postgresql-contrib-8.1 package, which provides a number of third party addons to PostgreSQL including type extensions for ISBNs, benchmarking, various full-text indexing tools, and XML manipulation tools. The only element out of these that we'll be using is tsearch2, described as "Searchable text data type supporting indexed access." The PostgreSQL contribs are similar to those in Django: If you don't actually use them, they'll just sit there taking up nothing but hard-drive space. This means the extra contribs that I am not using shouldn't cause any compatibility or runtime errors.

sudo apt-get install postgresql-contrib-8.1

Once the package is installed, on Debian I was given a bunch of new files under /usr/share/postgresql/8.1/contrib/, including a file called tsearch2.sql. You can take a look at it if you like: it provides a number of new tables, functions and data types once it's imported into a database.

The next step is to do just that: Import the tsearch2.sql file into my database. In this case I'm working on the DjangoSites database, so within my djangosites folder I ran ./manage.py dbshell.

$ cd djangosites
$ ./manage.py dbshell
djangosites=> \i /usr/share/postgresql/8.1/contrib/tsearch2.sql
djangosites=> \q

Running this command as my restricted djangosites user threw up an error immediately, and because the script utilises transactions the rest of the changes were rolled back. To get around this I had to run the import as the postgres superuser. Alternatively, I could give the djangosites user a higher level of access.

$ psql -U postgres djangosites
djangosites=# \i /usr/share/postgresql/8.1/contrib/tsearch2.sql
SET
BEGIN
CREATE TABLE
CREATE FUNCTION
... (a few hundred CREATE TABLE, CREATE FUNCTION, and INSERT commands)
CREATE OPERATOR
CREATE OPERATOR CLASS
COMMIT
djangosites=# \q

My djangosites database is now able to recognise the commands used by tsearch2, and a handful of new database tables have been created:

$ ./manage.py dbshell
djangosites=> \dt pg*
            List of relations
 Schema |     Name     | Type  |  Owner
--------+--------------+-------+----------
 public | pg_ts_cfg    | table | postgres
 public | pg_ts_cfgmap | table | postgres
 public | pg_ts_dict   | table | postgres
 public | pg_ts_parser | table | postgres
(4 rows)
djangosites=> \q

I now need to create a new field on the table I wish to index and search to hold the full-text search keywords and index. This is pretty straightforward, and needs to be done once for each table you'll be searching on:

$ ./manage.py dbshell
djangosites=> ALTER TABLE websites_website ADD search_tsv TSVECTOR;
ALTER TABLE
djangosites=> CREATE INDEX search_index ON websites_website USING gist(search_tsv);
CREATE INDEX
djangosites=> \q

This simple command added a new field called search_tsv to my websites_website table and created an index on that field. The type, TSVECTOR, is only available if we successfully imported the tsearch2.sql file earlier. The name search_tsv is arbitary, I chose that name so when inspecting the schema later I'll be reminded it's a tsvector field and not for general use.

In my example I want to include four fields in the search index: the website Title, URL, Description and Tags. Because my table is already populated with plenty of data, I need to update the index for each existing row. If you are working with a fresh database, this won't be needed.

This first example will build the search index out of a single column, if you only have one column to index:

$ ./manage.py dbshell
djangosites=> UPDATE websites_website SET search_tsv=to_tsvector('default', title);
UPDATE 1994
djangosites=> \q

If, like me, you have multiple columns you want to match on, then you need to use the COALESCE command while combining them to ensure columns with no data are handled without errors:

$ ./manage.py dbshell
djangosites=> UPDATE websites_website SET search_tsv=to_tsvector('default', COALESCE(title, '') ||' '|| COALESCE(url, '') ||' '|| COALESCE(description, '') ||' '|| COALESCE(tags, ''));
UPDATE 1994
djangosites=> VACUUM FULL ANALYZE;
VACUUM
djangosites=> \q

Running this command gave me an error, as my djangosites user did not have permission to read the pg_ts_* tables created during the import earlier. To fix this, I jumped into the console as the postgres user again and granted SELECT access to the relevant tables. I could immediately run the UPDATE command above after doing this.

$ psql -U postgres djangosites
djangosites=# GRANT SELECT ON pg_ts_cfg, pg_ts_cfgmap, pg_ts_dict, pg_ts_parser TO djangosites;
GRANT
djangosites=# \q

The UPDATE command populated my search_tsv field with a mass of data taken from the columns I fed into it. If you run a SELECT on this column now you'll see that search data:

$ ./manage.py dbshell
djangosites=> SELECT title, search_tsv FROM websites_website WHERE slug = 'www-rossp-org';
         title          |                   search_tsv 
------------------------+--------------------------------------------------------------
 Ross Poulton's Website | 'wed':20 'blog':10,27 'info':17 'ross':1,6 'howto':15,28 ....
(1 row)
djangosites=> \q

So we have built our initial search index and loaded it up with data from the existing database records. But what about when somebody comes along and inserts a new row, or modifies an existing row? We need to make sure the index is updated so that the search remains useful. To do this, we create a database trigger.

$ ./manage.py dbshell
djangosites=> CREATE TRIGGER websites_website_tsvectorupdate BEFORE INSERT OR UPDATE ON websites_website FOR EACH ROW EXECUTE PROCEDURE tsearch2('search_tsv', 'title', 'url', 'description', 'tags');
CREATE TRIGGER
djangosites=> \q

You can now try inserting data into your table with a simple INSERT query, then SELECT that row to ensure the search_tsv column is populated. If you get an error on the insert stating ERROR: could not find tsearch config by locale, then you need to update the tsearch2 configuration to match the locale that your system is using.

$ psql -U postgres djangosites
djangosites=# SHOW lc_collate;
 lc_collate
-------------
 en_US.UTF-8
(1 row)

djangosites=# UPDATE pg_ts_cfg SET locale='en_US.UTF-8' WHERE ts_name='default';
UPDATE 1
djangosites=# \q

We've now reached a point where all existing data is indexed, and all new or updated records are indexed automatically. This data isn't very useful as it is, though. We can now make use of a handful of nifty commands to actually run our full-text search against our data set:

$ ./manage.py dbshell
djangosites=> SELECT title FROM websites_website WHERE search_tsv @@ to_tsquery('melbourne');
         title
------------------------
 FC Melbourne
 Ross Poulton's Website
 Jutda - Show The Way
(3 rows)

djangosites=> SELECT HEADLINE(title, q), RANK(search_tsv, q) AS rank FROM websites_website, to_tsquery('hosting') AS q WHERE search_tsv @@ q ORDER BY rank DESC;
                        headline                         |   rank
---------------------------------------------------------+-----------
 Libre <b>Hosting</b>                                    | 0.0865452
 TopHost (<b>hosting</b> service provider)               | 0.0865452
 Fileducky - easily upload to multiple file <b>hosts</b> | 0.0865452
 Compare VPS <b>hosting</b> plans                        | 0.0827456
 Bitbucket.org - Mercurial <b>Hosting</b>                | 0.0827456
 Chide.it - We Make Internet Applications                | 0.0759909
 Kyron                                                   | 0.0759909
 ElasticHosts                                            | 0.0759909
...
 excess.org                                              | 0.0607927
 GetHotShots                                             | 0.0607927
(34 rows)

djangosites=> \q

As you can see, the HEADLINE() and RANK() commands come in quite handy. HEADLINE() will highlight the matching words with HTML <b> tags, and RANK() will provide a ranking that can be used to sort the items from most to least relevant.

Something interesting to note in the above hosting example is that a search for hosting also matched the word host and hosts. This fuzzy searching is configurable in tsearch2, but is beyond the scope of this blog post (and, to be honest, the defaults work perfectly for me in my English-based websites).

The last piece of this puzzle is to make use of our search within our Django view. Assuming we have a variable called q with our search terms, we can use the extra() command in Django's ORM. Note that we turn the space-separated keyword list into a pipe-separated list, so all searches will OR words together by default:

keywords = "|".join(q.split())
results = Website.objects.select_related().extra(
    select={
        'snippet': "headline(websites_website.description, to_tsquery(%s))",
        'rank': "rank(search_tsv, %s)",
    },
    where=["search_tsv @@ to_tsquery(%s)"],
    params=[keywords],
    select_params=[keywords, keywords],
    order_by=('-rank',),
    )

We're now left with results, a standard Django QuerySet for the Website model with two new fields: snippet contains a section of the description column with the search terms highlighted (avoiding XSS during template display is an exercise for the reader) and rank, a numeric ranking of the usefulness of each row returned. We've sorted it by the rank, so that the most likely search results are presented at the top of the list. You may want to add this code into a custom manager to provide quicker access to it, although it will work fine in a view as it is.

You can pass results into your template, or even into a generic view that makes use of pagination. Your website search will be more accurate so it'll be more useful to your visitors, and it'll place less load on your server. Enjoy!

Benchmarking

To get an idea of how efficient this method of searching is, take a look at these two examples. The first is utilising the LIKE command, which Django defaults to for it's icontainssearches. The second is utilising the tsearch2 index explained above. There are less than 2000 rows in this table, and the difference in both query time and results returned is staggering.

$ ./manage.py dbshell
djangosites=> \timing
Timing is on.
djangosites=> SELECT title,url FROM websites_website WHERE title LIKE '%hosting%' OR description LIKE '%hosting%' OR url LIKE '%hosting%' OR tags LIKE '%hosting%';
....
(19 rows)

Time: 13.892 ms
djangosites=> SELECT title,url FROM websites_website WHERE search_tsv @@ to_tsquery('hosting');
....
(34 rows)

Time: 5.921 ms
djangosites=> \q

If I run each command twice in quick succession, the LIKE query drops to about 11ms and the tsearch2 query drops to 0.9ms. Given most searches don't happen twice in quick succession, I'm seeing a 50% improvement in query times. As the number of rows grows, I would expect the LIKE query to take longer whilst the tsearch2 query should stay relatively stable.

References

I made use of the following excellent resources whilst setting this up for myself and writing this blog post. They're strongly recommended as further reading:

  • Barry Pederson's blog post PostgreSQL full text search with Django, which assumes PostgreSQL 8.3 is being used, is very useful. The Django code in my blog post is from Barry's post.
  • The DevX article Implementing Full Text Indexing with PostgreSQL is a non-Django-specific article covering the installation of tsearch2, configuration and use in an SQL context.
  • The official tsearch2 documentation:
  • The tsearch-v2-intro document covers generalised use and explains how various parts of tsearch2 work under the covers
  • The tsearch2 guide goes into more detail and covers configuration of items such as the fuzzy searching, weighting the rankings, and more. It's essential reading if you're not working in English or if you have complex models to index.