Ignore INNODB_FT_DEFAULT_STOPWORD for ngram indexes

Bug #1679135 reported by Sveta Smirnova
This bug affects 1 person
Affects                Status         Importance   Assigned to    Milestone
MySQL Server           Unknown        Unknown
Percona Server (moved to https://jira.percona.com/projects/PS), status tracked in 5.7
  5.5                  Invalid        Undecided    Unassigned
  5.6                  Invalid        Undecided    Unassigned
  5.7                  Fix Released   Medium       Yura Sorokin

Bug Description

Originally reported at https://bugs.mysql.com/bug.php?id=84420

[5 Jan 11:19] Miguel Angel Nieto

Description:
Ngram indexes also check the stopword list to see whether any indexed token *contains* one of the words on that list. That is reasonable behaviour in general, but I don't think the default stopword table is suitable for use with ngram.

For example, any token that contains 'a' or 'i' is ignored. So if you have the word "east", you cannot search for "ea", because that token was dropped at indexing time.

Ngram should have a different default list of stopwords, or an empty list.
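
The behaviour described above can be modelled with a small C++ sketch. It is not InnoDB code: it assumes ASCII input, a token size of 2, and a filter that drops any n-gram containing a stopword as a substring, which is how the report describes the check. The function names (make_ngrams, contains_stopword, indexed_ngrams) are made up for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Simplified model of the reported behaviour; not taken from InnoDB.
// Assumes ASCII input (one byte per character).

// Split a word into overlapping n-grams: "east", n = 2 -> "ea", "as", "st".
std::vector<std::string> make_ngrams(const std::string& word, std::size_t n) {
    std::vector<std::string> grams;
    if (n == 0 || word.size() < n) return grams;
    for (std::size_t i = 0; i + n <= word.size(); ++i) {
        grams.push_back(word.substr(i, n));
    }
    return grams;
}

// True if any contiguous substring of `gram` is in the stopword set.
bool contains_stopword(const std::string& gram,
                       const std::set<std::string>& stopwords) {
    for (std::size_t len = 1; len <= gram.size(); ++len) {
        for (std::size_t pos = 0; pos + len <= gram.size(); ++pos) {
            if (stopwords.count(gram.substr(pos, len)) > 0) return true;
        }
    }
    return false;
}

// Return only the n-grams that survive the stopword filter.
std::vector<std::string> indexed_ngrams(const std::string& word, std::size_t n,
                                        const std::set<std::string>& stopwords) {
    std::vector<std::string> kept;
    for (const std::string& gram : make_ngrams(word, n)) {
        if (!contains_stopword(gram, stopwords)) kept.push_back(gram);
    }
    return kept;
}
```

With a stopword set of just {"a", "i", "as"}, indexed_ngrams("east", 2, ...) keeps only "st", so a search for "ea" has nothing to match. With an empty set, all three n-grams are kept.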

How to repeat:
mysql> CREATE TABLE `articles` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`body` text,
PRIMARY KEY (`id`),
FULLTEXT KEY `ftx` (`body`) /*!50100 WITH PARSER `ngram` */
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

mysql> insert into articles (body) values ('east');
mysql> insert into articles (body) values ('east area');
mysql> insert into articles (body) values ('east job');
mysql> insert into articles (body) values ('eastnation');
mysql> insert into articles (body) values ('eastway, try try');

mysql> SELECT * FROM articles WHERE MATCH(body) AGAINST('ea' IN BOOLEAN MODE);
Empty set (0.00 sec)

====

There is a workaround for this bug: create a custom stopword table to replace INNODB_FT_DEFAULT_STOPWORD for ngram indexes. The issue with this workaround is that the same table is then used by other fulltext indexes too, such as mecab.

Suggested fix: either have a special INNODB_FT_DEFAULT_STOPWORD table for ngram indexes, or ignore the stopword list for them altogether.

There is also code in fts_check_token:

bool
fts_check_token(
	const fts_string_t*	token,
	const ib_rbt_t*		stopwords,
	bool			is_ngram,
	const CHARSET_INFO*	cs)
{
	ut_ad(cs != NULL || stopwords == NULL);

	if (!is_ngram) {
		ib_rbt_bound_t	parent;

		if (token->f_n_char < fts_min_token_size
		    || token->f_n_char > fts_max_token_size
		    || (stopwords != NULL
			&& rbt_search(stopwords, &parent, token) == 0)) {
			return(false);
		} else {
			return(true);
		}
	}

	/* Check token for ngram. */
	DBUG_EXECUTE_IF(
		"fts_instrument_ignore_ngram_check",
		return(true);
	);

So the only job is to replace DBUG_EXECUTE_IF with some new option.
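
A rough sketch of that idea, under stated assumptions: the debug-only shortcut becomes a runtime flag that makes the ngram branch accept every token. This is not the actual patch. FtsCheckOptions, ignore_stopwords_for_ngram and check_token are hypothetical names, std::string and std::set stand in for fts_string_t and ib_rbt_t, input is assumed to be ASCII, and the ngram branch only models the "contains a stopword" rule described in this report.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical options struct; none of these are real InnoDB symbols.
struct FtsCheckOptions {
    std::size_t min_token_size = 3;   // documented default of innodb_ft_min_token_size
    std::size_t max_token_size = 84;  // documented default of innodb_ft_max_token_size
    bool ignore_stopwords_for_ngram = false;  // the proposed new option
};

// Returns true if the token should be indexed.
bool check_token(const std::string& token,
                 const std::set<std::string>* stopwords,
                 bool is_ngram,
                 const FtsCheckOptions& opts) {
    if (!is_ngram) {
        // Non-ngram path: size limits plus exact stopword match,
        // mirroring the branch quoted above.
        if (token.size() < opts.min_token_size ||
            token.size() > opts.max_token_size) {
            return false;
        }
        return stopwords == nullptr || stopwords->count(token) == 0;
    }

    // Ngram path: the option replaces the debug-only early return.
    if (opts.ignore_stopwords_for_ngram || stopwords == nullptr) {
        return true;
    }

    // Otherwise reject the token if any substring of it is a stopword.
    for (std::size_t len = 1; len <= token.size(); ++len) {
        for (std::size_t pos = 0; pos + len <= token.size(); ++pos) {
            if (stopwords->count(token.substr(pos, len)) > 0) return false;
        }
    }
    return true;
}
```

With stopwords {"a", "i"} and default options, check_token("ea", &stopwords, true, opts) rejects the token. Setting opts.ignore_stopwords_for_ngram = true accepts it, while the non-ngram path is unaffected by the flag.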

tags: added: upstream
tags: added: sfr-78
Yura Sorokin (yura-sorokin) wrote :

Fixed by implementing bp:innodb-fts-ngram-ignore-stopword-list
"A new InnoDB variable to control whether InnoDB FTS should ignore stopword list" (https://blueprints.launchpad.net/percona-server/+spec/innodb-fts-ngram-ignore-stopword-list).

https://github.com/percona/percona-server/pull/1988

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PS-1802
