biblio.extract_located_uris always returns the URL for "link_text"

Bug #797304 reported by Dan Scott
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
High
Dan Scott

Bug Description

 * Evergreen 2.0.6
 * PostgreSQL 9.0.4
 * Debian Squeeze (database server)

For a record with the following 856:

856 4 0 ‡uhttp://dsp-psd.pwgsc.gc.ca/Collection-R/LoPBdP/PRB-e/PRB0220-e.pdf ‡yTo view Windsor's electronic resource click here. ‡9OWA

One would expect asset.uri to be populated with:

href = http://dsp-psd.pwgsc.gc.ca/Collection-R/LoPBdP/PRB-e/PRB0220-e.pdf
link_text = To view Windsor's electronic resource click here.
use_restriction = NULL

However, what gets populated is:

href = http://dsp-psd.pwgsc.gc.ca/Collection-R/LoPBdP/PRB-e/PRB0220-e.pdf
link_text = http://dsp-psd.pwgsc.gc.ca/Collection-R/LoPBdP/PRB-e/PRB0220-e.pdf
use_restriction = NULL

Checking the definition of biblio.extract_located_uris() shows that link_text is pulled via:

(oils_xpath('//*[@code="y"]/text()|//*[@code="3"]/text()|//*[@code="u"]/text()',uri_xml))[1]

However, running that against this record results in the URI being returned rather than the link text:

SELECT (oils_xpath('//*[@code="y"]/text()|//*[@code="3"]/text()|//*[@code="u"]/text()', marc))[1] FROM biblio.record_entry WHERE id = 2390555;
                             oils_xpath
--------------------------------------------------------------------
 http://dsp-psd.pwgsc.gc.ca/Collection-R/LoPBdP/PRB-e/PRB0220-e.pdf
(1 row)

 it looks like the XPath results are not necessarily returned in the order they are listed in the XPath expression. http://www.postgresql.org/docs/9.0/static/functions-xml.html makes no mention of the ordering of results in the array returned by the XPATH() function; I tried swapping the positions of "u" and "y" in the XPath expression and offset 1 was still "u".

So I suspect we'll have to grab the complete array of results and iterate through it to find the first match that does not match uri_href inside the function and assign that to uri_label.

Revision history for this message
Dan Scott (denials) wrote :

Easier than iterating through the array (although it does work) is to not pull from subfield "u" in the uri_label XPath expression and just assign the value of uri_href to uri_label if uri_label is NULL - as that's effectively what we're doing anyway. Then we can remove the test for uri_label being NULL because it will all come down to uri_href, as it should.

Revision history for this message
Dan Scott (denials) wrote :

Fix in progress at user/dbs/lp797304_lp797307 (see related asset.uri bug 797307).

Changed in evergreen:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Dan Scott (denials)
Revision history for this message
Dan Wells (dbw2) wrote :

This isn't directly related to the direction you are going now, but in answer to your original problem, oils_xpath returns your query results in the order they appear in the record, not the order of the query (as you discovered). So, if your uri_xml is:

<datafield tag="856" ind1="4" ind2="1"><subfield code="u">UUUUU</subfield><subfield code="y">YYYYY</subfield><subfield code="3">33333</subfield></datafield>

then oils_xpath('//*[@code="y"]/text()|//*[@code="3"]/text()|//*[@code="u"]/text()',uri_xml) will return:

{UUUUU,YYYYY,33333}

because subfield 'u' came first in the actual text, then 'y', then '3'. I am not sure if this linear search order is spelled out or guaranteed, but that has been my experience.

To achieve what you were trying to do, you could perhaps do separate searches and concatenate the results, something like (using your current uri_label as an example):

uri_label := (oils_xpath('//*[@code="y"]/text()', uri_xml) || oils_xpath('//*[@code="3"]/text()', uri_xml))[1];

I am not necessarily suggesting you do that, but merely throwing the idea out there as another option.

Dan

Revision history for this message
Dan Scott (denials) wrote : Re: [Bug 797304] Re: biblio.extract_located_uris always returns the URL for "link_text"

Thanks Dan, but your assertion about xpath's result ordering didn't match our experiences on our environment - changing the subfield order in the offending record was one of the first things I tried as a test.

Revision history for this message
Dan Wells (dbw2) wrote :

Thanks for follow-up, Dan. Hmmm, I find that to be surprising and concerning. I imagine other parts of the code are relying upon those nodes to be returned in document order. For instance, it looks like the in-DB fingerprinting uses xpaths like this, and it wouldn't make for much of a fingerprint if the return order wasn't fixed in some way.

I checked the XPath spec, and 1.0 doesn't define the node order from such expressions, leaving it up to the implementation, so that isn't much help. In the very recent 2.0 spec, this has been fixed, and now specifies document order.

Perhaps it is PG version dependent? I am testing on 8.4, and I think you are on 9.0. When I do the following:

select (oils_xpath('//*[@tag="245"]/*[@code="b"]/text()|//*[@tag="245"]/*[@code="c"]/text()|//*[@tag="245"]/*[@code="a"]/text()', marc)) from biblio.record_entry where id > 10000 and id < 10050;

the arrays come back in the expected order, a -> b -> c. Do you get something different?

Revision history for this message
Dan Scott (denials) wrote :

Yes, it's surprising and concerning. It's what I thought I saw. (And yes, as the bug report says at the very top, we're on 9.0).

However, your test returns a/b/c in the expected order. But of course there are probably no 245 fields where subfields c or b come before a, so to also remove the hypothesis that ordering may somehow be alphabetic based on the matching attribute name, a better test is:

select (oils_xpath('//*[@tag="245"]/*[@code="b"]/text()|//*[@tag="245"]/*[@code="c"]/text()|//*[@tag="245"]/*[@code="a"]/text()',
'<marc><data tag="245"><sub code="c">foo</sub><sub code="a">bar</sub><sub code="b">baz</sub></data></marc>'));

This returns:

  oils_xpath
---------------
 {foo,bar,baz}
(1 row)

... suggesting that you're right, it is document node order and thus compliant with XPath 2.0.

So we can put that as a notch in our collective wisdom about oils_xpath() / the native XPATH() function on which it is based, and at some point turn our attention to other areas where it may be a problem. Thanks for the clarification.

Now, back to the problem at hand: I believe my branch does fix the problem reported here (along with the problem reported in 797307). Your approach looks like it would solve this problem and may be more efficient, but I honestly don't know how the cost of invoking oils_xpath() multiple times compares to iterating through the contents of the array and doing comparisons on the array contents against uri_href (and I don't know whether there would be any performance difference between 9.0 and 8.4).

Revision history for this message
Dan Wells (dbw2) wrote :

Thank you again, Dan, for helping verify this behavior. I wasn't sure about any of this before I looked into it, so it helps me personally to work through it, and I do hope it helps the collective wisdom.

At any rate, I think your fix is both more clear (to me, at least), and probably more efficient to boot (though perhaps that is not such a factor here). My plate looks pretty full for today, but I'll try to get this tested soon.

Revision history for this message
Mike Rylander (mrylander) wrote :

Instead of grabbing the array of results, perhaps it would be better to run each XPATH() query in turn (or use oils_xpath_string()) and take the first that returns a non-NULL and non-empty value? That way we preserve precedence ordering regardless of subfield order in the record. Thoughts?

Revision history for this message
Dan Scott (denials) wrote :

Perhaps. In the absence of a good benchmark, I don't know. That's basically the same suggestion dbwells made in 797307 I believe.

Revision history for this message
Dan Scott (denials) wrote :

Sorry, actually, it's what Dan Wells suggested back in https://bugs.launchpad.net/evergreen/+bug/797304/comments/3 - I answered too quickly from my phone sans context.

Performance is really only a concern for bulk operations such as massive updates (such as at migration time, or importing large sets of electronic resources, or as will ensue once whatever version of biblio.extract_located_uris() gets pushed into whatever release it eventually lands in).

I don't particularly care about precedence ordering for our own data, as the number of 856 fields that we have that have both $3 and $y approaches 0. We have 6, to be precise, out of over 360,000 records containing a total of 500,000 856 fields that have a subfield 'u' - assuming that the following queries are in the ballpark:

SELECT COUNT(*) FROM (SELECT DISTINCT m1.record FROM metabib.full_rec m1 INNER JOIN metabib.full_rec m2 ON m2.record = m1.record WHERE m2.tag = '856' AND m2.subfield ='3' AND m1.subfield='y' AND m1.tag = '856') AS foo;

SELECT COUNT(*) FROM metabib.full_rec WHERE tag = '856' AND subfield = 'u';

SELECT COUNT(*) FROM (SELECT DISTINCT record FROM metabib.full_rec WHERE tag = '856' AND subfield = 'u') AS foo;

I just want working located URIs. I have applied the current branch to our production server, so we have that for our Evergreen instance.

Revision history for this message
Mike Rylander (mrylander) wrote :

To be clear, Dan (Scott), what I suggest is different from what Dan (Wells) said in #3. To be clear, something like:

label_text := oils_xpath_string('//*[@code="y"]', uri_xml);
IF label_text IS NULL || label_text = '' THEN
    oils_xpath('//*[@code="3"]', uri_xml);
-- nest more here, if needed
END IF;

That said, if it's not natural to see both subfields in one field then this doesn't matter, which seems to be what you're saying is the case for your dataset. In which case I'm for getting that pulled into master and 2.1 as a bug fix. I'm traveling today, so may not have time to do the sign-off dance myself until the weekend.

Revision history for this message
Mike Rylander (mrylander) wrote :

Fixed as part of the mentioned branch, now merged.

Changed in evergreen:
status: Confirmed → Fix Committed
Ben Shum (bshum)
Changed in evergreen:
milestone: none → 2.0.7
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.