stemming language setting problematic
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Tracker | Expired | Wishlist | | |
tracker (Ubuntu) | Confirmed | Undecided | Unassigned | |
Bug Description
Binary package hint: tracker
System - Preferences - Indexing preferences - General tab - Stemming has a short list of languages ... but you have to pick one. This is not a realistic scenario in many locales; for example, I routinely handle stuff in three languages (English; my home language, Swedish; and the majority language of the country where I live, Finnish) and so does everyone in my family, including soon enough my daughter, who just started school.
Not only is the absence of stemming for, say, Finnish, problematic, but being forced to choose English or Swedish stemming for Finnish documents is likely to produce a large amount of false-positive stems, making searches for Finnish words return what seem like completely haphazard matches in many cases -- enough to make it useless at least in some scenarios.
What happens if you later change this setting? Does it throw away or redo all the stemming it has done so far?
What happens if your primary locale preferences indicate a language which is not on the list; would that be a workaround for disabling stemming?
I do realize that coming up with a good fix for this is hard. At a minimum, indexing without any stemming should be possible. Further out in wishlist territory, it would be nice if at some point the indexer could try to establish the language of each document (ignoring for now the can of worms that is multilingual documents -- don't let any philologists hear about this) and use an appropriate stemmer only if the language can be established with reasonable certainty. (Debian has a package "mguesser" for stand-alone language identification, which is also available as a library which ships with the mnogosearch search engine; google for TextCat for some more suggestions. Or ask me again and be prepared for a veritable flood of bookmarks on the topic.)
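To make the per-document idea concrete, here is a minimal sketch of "guess the language, and stem only if reasonably certain". It is not how Tracker, mguesser, or TextCat work: it swaps in a crude stopword-overlap heuristic, and the names `STOPWORDS`, `guess_language`, `min_hits` and `min_margin` are all made up for illustration. The word lists are tiny and would need to be much larger in practice.

```python
import re

# Tiny illustrative stopword lists (an assumption for this sketch);
# a real language identifier would use far more data, e.g. n-gram profiles.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it", "for", "with"},
    "swedish": {"och", "att", "det", "som", "är", "på", "för", "med", "inte", "jag"},
    "finnish": {"ja", "on", "ei", "että", "se", "oli", "mutta", "kun", "niin", "hän"},
}


def guess_language(text, min_hits=3, min_margin=2.0):
    """Return a language name, or None when the evidence is too weak.

    None means "do not stem this document at all".
    """
    words = re.findall(r"\w+", text.lower())
    # Count how many words of the text appear in each language's stopword list.
    hits = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_hits), (_, second_hits) = ranked[0], ranked[1]
    # Too little evidence overall: refuse to guess.
    if best_hits < min_hits:
        return None
    # The winner must clearly beat the runner-up, otherwise the text is
    # probably multilingual or too short to call.
    if best_hits < min_margin * max(second_hits, 1):
        return None
    return best


if __name__ == "__main__":
    samples = [
        "The cat is in the garden and it is looking for the bird.",
        "Jag tror att det är en bok som inte är på bordet.",
        "Hän sanoi että se ei ole totta, mutta minä olin niin väsynyt.",
        "README 2009 draft",
    ]
    for sample in samples:
        print(guess_language(sample), "<-", sample)
```

The point of the two thresholds is that a wrong stemmer is worse than none: a document with few recognisable words, or one that mixes languages, falls through to `None` and would be indexed unstemmed.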
Changed in tracker:
status: Unknown → In Progress
Changed in tracker:
status: New → Confirmed
Changed in tracker:
importance: Unknown → Wishlist
status: In Progress → Expired
Thanks for the pointer to the upstream bug; I added some comments there.