Simple Keyword Relevance

January 29, 2005   bookmark

In a recent thread at sitepointforums.com, someone wanted to search a database table column for a keyword and rank the results based on the number of times the keyword occurs, divided by the total number of words, or the total length of the original column value.

Keyword searching can be done in two ways -- using a built-in fulltext search capability, and using the LIKE operator. The problem with fulltext searching is that it is comparatively difficult to set up, isn't available in all databases, and doesn't tell you the number of occurrences (using, instead, its own formula for relevance).

So how is using the LIKE operator a better solution? Well, by itself it can only detect a match. Fortunately, with a few additional SQL expressions, you can also obtain the count of occurrences and a rudimentary calculation of relevance.

Here's the general strategy. Let's say your search query uses a WHERE condition like this:

... where columnname like '%keyword%'

This WHERE condition ensures that each column value in the result set will contain the keyword at least once.

Now for the calculations. Use the REPLACE function to change each occurrence of the keyword found in the column value to an empty string, then compare the length of the column value before and after the replacement, divide by the number of characters in the keyword, and that tells you how many occurrences of the keyword you replaced. Neat, eh?

Next, calculate the number of words in the column value. This is done by simply counting the number of spaces, and adding 1. Counting the number of spaces is done with the same REPLACE trick, comparing the length of the string before and afterwards. Note that this does not give accurate results if there happen to be consecutive spaces in the column value, but that happens infrequently, doesn't it.

To demonstrate the SQL, let's use the following test data. Note that this is MySQL syntax, and slight changes will be necessary for other databases.

create table keyword_relevance
( id tinyint not null primary key auto_increment
, Description varchar(99)
);
insert into keyword_relevance (Description) values
 ('I like to play games')
,('One game, two games, eh')
,('There''s no keyword')
,('The games keyword is present only once')
,('The games keyword is present only once in a longer sentence')
,('The games keyword is present only once in a really really really really really really long sentence')
,('Games games')
,('Games beautiful games')
,('')
,(null)
,('Games')
,('Games     ')
,('     Games');

Here's the query:

select id
  , Description
  , length(Description) as len
  , length(replace(Description,'games','')) as lrep

  , ( length(Description)
    - length(replace(Description,'games','')) )
    / length('games') as occ

  , length(Description) + 1
   -length(replace(Description,' ','')) as wds

  , (( length(Description)
     - length(replace(Description,'games','')) )
     / length('games') )
     /(length(Description) + 1
     - length(replace(Description,' ','')) ) as occwds

  , ( length(Description)
    - length(replace(Description,'games','')) )
    / length(Description) as occlen

  from keyword_relevance
 where Description like '%games%'

The calculations in the query are explained as follows:

len
length of the original column value
lrep
length of the column value after the keywords have been removed
occ
number of occurrences of the keyword, calculated as len - lrep divided by the length of the keyword
wds
number of words in the original column values, calculated as the number of spaces plus 1
occwds
relevance based on occurrences of keyword to number of words in the original column value
occlen
relevance based on occurrences of keyword to length of the original column value in characters

Here are the results:

Keyword Relevance
idDescriptionlenlrepoccwdsoccwdsocclen
1I like to play games2015150.20.25
2One game, two games, eh2318150.20.22
4The games keyword is present only once3833170.14290.13
5The games keyword is present only once in a longer sentence59541110.09090.08
6The games keyword is present only once in a really really really really really really long sentence99941170.05880.05
7Games games1112210.91
8Games beautiful games2111230.66670.48
11Games501111
12Games501111
13     Games105160.16670.5

Usage

So how do we make use of the relevance calculations? Like this:

... order by occwds desc

In MySQL, you can use a column alias in the ORDER BY clause. This assumes that you have the actual expression in the SELECT list.

Which relevance calculation is better, the one based on number of words in the original value, or the number of characters? Perhaps the difference is not really that significant.

Additional comments

Test data should always include a NULL value. Rows 3, 9 and 10 were eliminated by the WHERE clause.

MySQL strips trailing spaces from VARCHARs, but not leading spaces. Hence the difference between the values stored in rows 12 and 13. Note the incorrect calculation of the number of words in row 13.