remote z39.50 search returns no results for terms with diacritics

Bug #1346518 reported by Jason Stephenson
This bug affects 4 people
Affects: Evergreen
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Evergreen version: Master as of 2014-07-21
OpenSRF version: Master as of 2014-07-14 (2.3.0)
Postgres version: 9.1.[something]

When searching via Z39.50, I get no remote results for terms containing diacritics. For example, a search for Slavoj Žižek returns 0 results if I search only remote sources, but returns 4 from the local catalog. The same is true if I search for his name as Slavoj Zizek.

In the attached screenshots, I have limited the searches to remotes that are also running Evergreen, but I get identical results if I choose Library of Congress or biblios.net.

I have also tested against the other Evergreen catalogs and my own using SRU. If I search for the author's last name, with or without diacritics, I get results that way.

I assume this has something to do with Z39.50 going through YAZ. It would seem to me that the characters are being double encoded, converted to an ISO8859-1 string, or similar. Note the way the author's name appears in the screen shot for the remote search by title.
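To make the suspected failure modes concrete, here is a minimal Python sketch of what each one would do to the search term. It is only an illustration of the encoding mechanics, not Evergreen or YAZ code, and it does not establish which of these (if any) actually happens in the Z39.50 path.

```python
# Illustration only: how the search term could be damaged in transit.
name = "Žižek"

# Correct UTF-8 bytes for the term.
utf8_bytes = name.encode("utf-8")

# Failure mode 1: UTF-8 bytes misread as ISO-8859-1 (classic mojibake).
misread = utf8_bytes.decode("iso-8859-1")

# Failure mode 2: double encoding, i.e. the misread text is encoded
# as UTF-8 a second time before being sent on.
double_encoded = misread.encode("utf-8")

# Failure mode 3: lossy conversion to ISO-8859-1, which has no code
# points for Ž or ž, so they are replaced with "?".
lossy = name.encode("iso-8859-1", errors="replace")

print(misread)         # Å½iÅ¾ek
print(double_encoded)  # b'\xc3\x85\xc2\xbdi\xc3\x85\xc2\xbeek'
print(lossy)           # b'?i?ek'
```

In each case the bytes that reach the remote server no longer match the indexed form of the name, which would be consistent with a search returning 0 results.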

Jason Stephenson (jstephenson) wrote :
tags: added: diacritics search z3950
Jason Stephenson (jstephenson) wrote :
Jason Stephenson (jstephenson) wrote :

I did this search because I knew it would return results from the NOBLE catalog.

Note how the author's name appears in the results list. This suggests some kind of encoding problem as the data passes through Evergreen, yaz, and back.

Jason Stephenson (jstephenson) wrote :

OK, what I said about LOC above is not totally true.

The searches with diacritics still fail, but if I search by the English title, I get results, similar to that from NOBLE.

Note that in this screenshot, the characters of the author's name are not "mangled." They appear as they should in UTF-8.

Jason Stephenson (jstephenson) wrote :

Additionally, I have discovered that certain UTF-8 characters will cause a record not to display when retrieved. Having done some experimentation with yaz-client, it looks like these characters are getting encoded/decoded improperly at that level.

At this time, I believe the fix for these problems may be the same, i.e. tell YAZ to use UTF-8. However, if the solution proves to be different for these two problems, I'll create a new bug.

John Yorio (jyorio) wrote :

Came upon the same issue while investigating why yaz-client returned garbled characters for diacritics in records from Evergreen. In this case, the records all had e acute in their 65X fields.

Using Evergreen client on a separate system to retrieve the same records with diacritics via z39.50 from the 1st system returned no results. Other records returned as expected. SRU retrieved records but diacritics still mangled.

Tried to force yaz-client to treat records as UTF-8 using 'marccharset UTF8', but that showed field values with diacritics as blanks. The MarcEdit Z39.50 client also returned improperly encoded/decoded diacritics, whether forcing UTF-8 or not. Tried using the MarcEdit Z39.50 client to create a file of records to import into the originating system, to see if Evergreen would translate the characters back to the proper diacritics, but it did not.

Confirmed that the record leaders do contain 'a' in position 09 to indicate Unicode.
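For anyone repeating this check, a minimal Python sketch of the leader test follows. The function name and the two sample leaders are made up for illustration and are not taken from the affected records.

```python
def leader_declares_unicode(leader: str) -> bool:
    """Return True if MARC 21 leader position 09 is 'a' (UCS/Unicode)."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    # Positions are counted from 0, so position 09 is index 9.
    return leader[9] == "a"


# Hypothetical leaders, for illustration only.
unicode_leader = "00714cam a2200205 a 4500"  # position 09 is 'a'
marc8_leader = "00714cam  2200205 a 4500"    # position 09 is blank (MARC-8)

print(leader_declares_unicode(unicode_leader))
print(leader_declares_unicode(marc8_leader))
```

Note that the leader only declares the encoding. It does not guarantee that the bytes a client receives are still valid UTF-8 after passing through the server.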

Editing the configuration files for Simple2ZOOM to use UTF-8 instead of MARC-8 (see http://evergreen-ils.org/dokuwiki/doku.php?id=evergreen-admin:sru_and_z39.50 for the MARC-8 set up) and restarting Simple2ZOOM yielded same results.

Changed in evergreen:
status: New → Confirmed
Eva Cerninakova (ece) wrote :

I wonder whether the problem with retrieving records is related to the remote server rather than to Evergreen. We use diacritics when retrieving records via Z39.50 from Czech remote servers on a daily basis and always get the existing record. I have attached a screenshot of a Slavoj Žižek search using the Union Catalog of the Czech Republic.
My settings for the Union Catalog:
Source: SKC
Host: aleph.nkp.cz
Port: 9991
DB: SKC-UTF
Record format: FI
Transmission format: usmarc
Auth: False
Z39.50 Attributes for SKC are almost identical to OCLC, except for the truncation set to "1"
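As a rough sketch of how such attribute settings could end up in a query, the Python below builds a PQF (prefix query format) string of the kind YAZ accepts. It assumes the truncation setting corresponds to Bib-1 attribute type 5 (where 1 means right truncation) and uses Bib-1 use attribute 1003 (author). The helper name is hypothetical, and this is not Evergreen's actual query builder.

```python
from typing import Optional


def build_pqf(use_attr: int, term: str, truncation: Optional[int] = None) -> str:
    """Build a simple PQF query string (hypothetical helper)."""
    parts = [f"@attr 1={use_attr}"]
    if truncation is not None:
        # Assumption: truncation maps to Bib-1 attribute type 5.
        parts.append(f"@attr 5={truncation}")
    # Escape backslashes and double quotes so the term stays one token.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    parts.append(f'"{escaped}"')
    return " ".join(parts)


query = build_pqf(1003, "Žižek", truncation=1)
print(query)  # @attr 1=1003 @attr 5=1 "Žižek"
```

The term stays a Python (Unicode) string here. The encoding question in this bug is about what bytes are produced when such a query is handed to the Z39.50 connection.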

However, the discussion partially reminds me of the problem of Z39.50 server query encoding; see http://list.georgialibraries.org/pipermail/open-ils-general/2016-February/012781.html
