Did You Mean optimization fails for some data sets

Bug #1931162 reported by Mike Rylander
This bug affects 8 people
Affects: Evergreen
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Evergreen Versions: 3.7+
PostgreSQL Versions: all supported
OpenSRF versions: n/a

For some data sets and some queries, the Did You Mean search suggestion logic can be much too slow. This happens mainly when a "misspelled" word that is sufficiently longer than the symspell prefix length is checked against many short prefixes, each of which has many long suggestions attached to it.

A branch with a drop-in update to the search.symspell_lookup() function that optimizes against this situation is forthcoming.

Mike Rylander (mrylander) wrote :

Branch is available here for testing:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1931162-DYM-optimization / user/miker/lp-1931162-DYM-optimization

From the commit:

For some data sets and some queries the Did You Mean search suggestion logic can be much too slow. This is mainly in cases where a "misspelled" word of sufficient length greater than the symspell prefix length is checked against many short prefixes that have many long suggestions attached to them.

This commit optimizes for that case in particular by testing the length of suggestions and prefix keys against the user input to avoid unnecessary tests. Further, it captures the edit distance of suggestions that pass that test in-line, avoiding expensive retesting, and caches the short-cutoff edit distance when in low-verbosity mode to avoid future different-but-not-too-different suggestions coming from the same prefix key.

It additionally provides a general optimization by batching the capture of suggest counts to avoid per-suggestion secondary lookups, and a micro-optimization of ordering suggestions by length at distance cache time.
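
As an illustration only (this is not the code in search.symspell_lookup(), which is a database function), the length test and in-line distance capture described above can be sketched in Python. The function names, the sample words, and the max_distance parameter are assumptions made for this sketch. It relies on the fact that the edit distance between two strings is never smaller than the difference in their lengths, so a candidate that is too long or too short can be skipped without computing a distance at all.

```python
from typing import Dict, Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def filter_suggestions(user_input: str,
                       candidates: Iterable[str],
                       max_distance: int) -> List[Tuple[str, int]]:
    """Hypothetical helper: keep candidates within max_distance of user_input.

    Candidates whose length differs from the input by more than
    max_distance are rejected before any edit distance is computed.
    Distances that are computed are stored once and reused for ordering.
    """
    distances: Dict[str, int] = {}
    for candidate in candidates:
        if candidate in distances:
            continue  # already measured; avoid retesting
        if abs(len(candidate) - len(user_input)) > max_distance:
            continue  # length alone proves the distance is too large
        distance = levenshtein(user_input, candidate)
        if distance <= max_distance:
            distances[candidate] = distance
    # Order by distance, then by length, then alphabetically.
    return sorted(distances.items(),
                  key=lambda item: (item[1], len(item[0]), item[0]))


if __name__ == "__main__":
    words = ["harry", "hairy", "harrypotterandthegobletoffire", "harr", "carry"]
    for suggestion, distance in filter_suggestions("hary", words, 2):
        print(suggestion, distance)
```

The saving comes from the length check: when a short prefix key has many long suggestions attached, most of them are rejected by a cheap integer comparison rather than a full distance calculation.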

tags: added: pullrequest
Changed in evergreen:
assignee: Mike Rylander (mrylander) → nobody
importance: Undecided → High
Changed in evergreen:
status: New → Confirmed
Changed in evergreen:
milestone: 3.7.1 → 3.7.2
Shula Link (slink-g) wrote :

What would be a suggested mode to test this on the Concerto dataset?

Mike Rylander (mrylander) wrote :

Hi Shula,

Unfortunately for testing purposes, Concerto is not big enough to trigger the optimization issues that this commit addresses. We have tested it locally on a large consortial data set, and it does what it says on the tin for us, without any additional code or configuration tuning.

If you can independently confirm that it does not /break/ anything, that would be a big help.

Thanks!

Shula Link (slink-g) wrote :

Tested multiple searches across Concerto and nothing broke.

I sign off on this patch with my name, Shula Link, and my email, <email address hidden>.

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=commit;h=96f05bd08b1ce77dc748b2d20e8340682bf89087

tags: added: signedoff
Jason Boyer (jboyer) wrote :

Works well for me also, pushed to master and rel_3_7. Thanks Mike and Shula!

Changed in evergreen:
status: Confirmed → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released
tags: added: didyoumean