ISBN searching - mixed results..

Bug #833045 reported by George Duimovich
This bug affects 1 person
Affects: Evergreen
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

EG 2.0.8
OpenSRF 2.0.1

For our bib record id: 7798179 we have ISBN:
020 . ‡a0-8412-1632-0

We can find this ISBN successfully these ways:

0-84121632-0 works
0841216320 works

But not these ways:
0-8412 1632-0
0 8412 1632 0

It's not uncommon for the space delimited format to appear on book jackets without dashes: "0 8412 1632 0" (and of course for mixed data entry practices by both cataloguers and searchers - dash vs. no dash, etc.).

And don't get me started on the text strings that sometimes follow ISBNs in our data. But these ISBNs seem to be perfectly findable.
E.g.
<datafield tag="020" ind1=" " ind2=" "><subfield code="a">0803109288 (soft)</subfield>
<datafield tag="020" ind1=" " ind2=" "><subfield code="a">3540656049 (softcover : alk. paper)</subfield>

Tags: search
Mike Rylander (mrylander) wrote :

Does adding a "remove spaces" normalizer, in addition to the "remove dashes" one, help? If so, it's just configuration ... which might be worth adding to the stock data.

George Duimovich (george-duimovich) wrote :

re: normalizer - I would think so.
FWIW, robot librarian had some normalizer code posted somewhere (as well as a good read on isbn data http://robotlibrarian.billdueber.com/?s=isbn)...

Dan Scott (denials) wrote : Re: [Bug 833045] [NEW] ISBN searching - mixed results..

Wow. I have not run into ISBNs with spaces before - either in MARC
records, or in search queries. I guess it's possible. But as George
suggests, simply adding a "remove space" normalizer would screw up the
"### (pbk.)" values in the MARC records. (If people enter "####
(pbk.)" into a search query then they deserve what they get).

I suppose we could add a 3rd normalizer, which, after all spaces and
hyphens have been removed, would then try the following matches in
order and take the first successful match:

a. 13 digits (ISBN-13)
b. 10 digits (ISBN-10)
c. 9 digits followed by an X (ISBN-10 with an X check digit)

*sigh*
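
A rough sketch of such a normalizer in Python, for illustration only. The `normalize_isbn` helper is hypothetical and is not Evergreen's actual normalizer code. It assumes an ISBN-13 is 13 digits and an ISBN-10 is 9 digits plus a check character that is either a digit or X, and it makes no attempt to validate check digits.

```python
import re

# Patterns tried in order; the first one that matches at the start wins.
# (Hypothetical helper for illustration, not Evergreen's normalizer.)
_ISBN_PATTERNS = [
    re.compile(r"^\d{13}"),     # ISBN-13
    re.compile(r"^\d{10}"),     # ISBN-10 with a numeric check digit
    re.compile(r"^\d{9}[Xx]"),  # ISBN-10 with an X check digit
]


def normalize_isbn(value):
    """Strip spaces and hyphens, then return the leading ISBN-like token.

    Returns None when the value does not start with something ISBN-shaped.
    """
    compact = re.sub(r"[\s\-]", "", value)
    for pattern in _ISBN_PATTERNS:
        match = pattern.match(compact)
        if match:
            return match.group(0).upper()
    return None


for raw in ["0-8412 1632-0", "0 8412 1632 0", "0803109288 (soft)", "0852981937 :"]:
    print(repr(raw), "->", normalize_isbn(raw))
```

Because the match is anchored at the start and stops after the ISBN-shaped prefix, trailing text such as "(soft)" or " :" is dropped rather than mangled. One caveat: if digits immediately follow the ISBN once spaces are removed, they could be absorbed into a longer match.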

George Duimovich (george-duimovich) wrote :

I guess at some point a line has to be drawn as to how much "bad data" can be anticipated / accommodated for versus library shops just fixing their data.

Here's another example from just poking around a bit more. I found over 5400 MARC 020's with data in this format (i.e. trailing colon), that might (?) present a problem with remove spaces:

<subfield code="a">0852981937 :</subfield>

This ISBN is perfectly findable in EG right now, but definitely a good / easy target for cleanup I think.... But wait, and Grrr - looking at sample records, it's clear that in many/most cases the embedded ":" is there for display purposes!

020 . ‡a1551050420 : ‡c24.95

But the big data boss in the sky doesn't like that, so he commands that we change the standard, even if the standard won't change itself. That colon, IMHO, should be moved to display only in our shop, so if |c is present, add the colon, etc.
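
Something like this, as a sketch of the display-only idea in Python. The `display_020` function and its parameter names are made up for illustration, and it assumes the stored |a may still carry a trailing " :" that should be ignored:

```python
def display_020(isbn, terms=None):
    """Build a display string for an 020 field from |a and an optional |c.

    Hypothetical helper: the colon is added at display time only when |c
    is present, instead of being stored at the end of |a.
    """
    # Drop any trailing spaces/colons already stored in |a.
    isbn = isbn.rstrip(" :")
    if terms:
        return f"{isbn} : {terms}"
    return isbn


print(display_020("1551050420 :", "24.95"))
print(display_020("0852981937 :"))
```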

Also, only a small number of our ISBNs have spaces instead of dashes FWIW.

thx

Jason Stephenson (jstephenson) wrote :

I am setting this to incomplete because I am not certain if this is a bug report at this point.

Changed in evergreen:
status: New → Incomplete
Jane Sandberg (sandbergja) wrote :

I moved this to Wishlist, since it's a feature request.

Changed in evergreen:
status: Incomplete → Confirmed
importance: Undecided → Wishlist
tags: added: search