Extremely slow batch marc loading in 2.2 Vandelay

Bug #1024095 reported by Sharon Herbert
This bug affects 5 people
Affects     Status         Importance   Assigned to   Milestone
Evergreen   Fix Released   Undecided    Unassigned
2.2         Fix Released   Undecided    Unassigned
2.3         Fix Released   Undecided    Unassigned

Bug Description

Using match sets with more than one variable in Vandelay generally results in extremely slow load times against our production database of approx 1.4m bibs. Results of our most recent testing are below. Note that C/W Mars at MassLNC shared similar results with us; their bibs are approx 2.1 m. Lebbeous has already weighed in on this problem in our ESI ticket # 19674.

It definitely seems that it's the 020a (ISBN) matching that causes the huge slowdown. Our two test files are a 6 record file and a 70 record file.

1) Record Match Set 035a (matching on 035 subfield a)
6 record MARC file = under a second (found 1 match)
70 record MARC file = 5-6 seconds (found 0 matches)
Performance is comparable to 2.0. 035a is a local control number and we don't seem to use
it much, so I don't think this is a useful matchpoint for us. I only tested it to compare
it to CWMARS results.

2) Record Match Set 010a (matching on 010 subfield a -- this is LC Number)
6 record MARC file = under a second (found 3 matches)
70 record MARC file = 5-6 seconds (found 40 matches)
Performance is good and comparable to 2.0. We didn't match on this field in 2.0, but we've recommended this to our sites in the interim, until we can sort out the extreme slowness in ISBN matching.

3) Record Match Set 020a (matching on 020 subfield a -- ISBN)
6 record MARC file = 3-4 minutes (found 6 matches)
70 record MARC file = 35 minutes (found 53 matches)
Loading is abysmally slow at about 2 records per minute.

4) Record Match Set 020a035a010a (matching on 020 subfield a OR 035 subfield a OR 010
subfield a)
6 record MARC file = 4 minutes - 1.5 records/minute (found 6 matches)
70 record MARC file = not attempted due to time; at this rate it would take roughly 45-50 minutes

Tags: pullrequest
Kathy Lussier (klussier) wrote :

More testing results from C/W MARS, with 2.1 million bibs in the database.

A batch import with 45 records:

010 only: less than 1 minute
020 only: less than 30 seconds
035 only: less than 30 seconds
010, 020, and 035: 8 minutes - only 8 records per minute

To see if quality metrics impacted the queuing, I queued the same file with a Record Match Set that included quality metrics. It did not:
020, 035, and 022 WITH quality metrics: 8.5 minutes.

A batch import of 219 records (no attached items), matching on 020, 022 and 035 WITH quality metrics, took 28 minutes to process and 2 minutes to load, for a total of 30 minutes: 7 records per minute.

C/W MARS has added an index for matching on the 020, 022 and 024 fields. Imports with a match set matching on 020 have improved since then. However, C/W MARS is still seeing slow load times when matching on more than one field at a time.

Lebbeous Fogle-Weekley (lebbeous) wrote :

Thanks for posting your numbers here Sharon and Kathy.

As you know, in order to perform its function the Record Match Sets feature is comparing each of your incoming records to each of your existing records on all of the match points. The way this works today has proven, as you've demonstrated, not to scale very well with large numbers of bibs already in the catalog.

Match sets are made up of match points and boolean operators (AND and OR). Those match points can be either tag/subfield combinations or record attributes. The stored procedure vandelay.match_set_test_marcxml(INTEGER, TEXT) is responsible for building the query that compares a given MARC record to all your existing MARC records on your specified match points.
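
For anyone who wants to poke at this outside of the Vandelay UI, the procedure can be exercised directly from psql. Here is a hypothetical invocation, just a sketch of the call shape: the match set ID and the MARCXML record below are placeholders, not output from a real system.

-- Hypothetical example: test an incoming record against match set 1.
-- The match set ID and the record content are placeholders.
SELECT *
  FROM vandelay.match_set_test_marcxml(
    1,
    $$<record xmlns="http://www.loc.gov/MARC21/slim">
        <datafield tag="020" ind1=" " ind2=" ">
          <subfield code="a">9780313345807</subfield>
        </datafield>
      </record>$$
  );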

Here's an example of what the generated query looks like for a match set with three match points, all OR'd together (901‡a OR 022‡a OR 020‡a): http://pastie.org/4196393

That query leads to a sequential scan on biblio.record_entry (put 'EXPLAIN' in front of this query and run it to see what I mean) and you wind up evaluating too many rows.
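
For illustration, here is a much-simplified stand-in for that kind of generated query; the tag, subfield, and ISBN are placeholders, the actual generated SQL is at the pastie link above, and whether you see the sequential scan depends on your data and indexes.

-- Simplified illustration only; not the actual generated query.
-- Prefixing with EXPLAIN shows which scans the planner chooses.
EXPLAIN
SELECT bre.id
  FROM biblio.record_entry bre
  JOIN metabib.full_rec mfr ON (mfr.record = bre.id)
 WHERE bre.deleted IS FALSE
   AND mfr.tag = '020' AND mfr.subfield = 'a'
   AND LOWER(mfr.value) LIKE '9780313345807%';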

If we refactor vandelay.match_set_test_marcxml() to produce a query that looks like this instead (for the same match set): http://pastie.org/4268113

then we get a query plan (again, see 'EXPLAIN ...') that does not contain any sequential scans, evaluates fewer rows, and runs about 20% faster in practice.

Now, 20% isn't much of a speedup when set against the kinds of waits that Sharon and Kathy have cited, so here's where the potential for even better speedups lies.

Construct your match sets strictly out of record attributes, and don't use any explicit tag/subfield combos. There are no stock record attributes that correlate to 020‡a, 022‡a, or 901‡a, so you would have to create all the record attributes that you are going to use (they go in the config.record_attribute_definition table) and reingest your bibs first.

As the code stands today, using record attributes doesn't give you any better performance because the existing version of vandelay.match_set_test_marcxml() creates a query that still looks about like the first one linked above: i.e., it uses biblio.record_entry as the core table and has a separate join to metabib.record_attr for every single match point.

But if your match points are *all* record attributes (again, no tag/subfield combos), the code could be made smarter so that it instead produces queries that look more like:

SELECT id
FROM metabib.record_attr mra
JOIN biblio.record_entry bre ON (bre.id = mra.id)
WHERE (mra.attrs->'upc' = 'blah' OR mra.attrs->'issn' = 'blah' OR mra.attrs->'nineohone' = 'blah')
    AND bre.deleted IS FALSE;

Queries like this should be *much* faster than what vandelay.match_set_test_marcxml() generates today, mainly thanks to the GiST index on the mra.attrs HSTORE column, but again I stress that you can only do this if none of your match points are tag/subfield combos (for which we have no HSTORE or other fast lookup mechanism).
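
For reference, the kind of index being leaned on there is simply a GiST index over the HSTORE column; the index name below is illustrative only, since an equivalent index is already part of the schema as described above.

-- Illustrative index name; an equivalent GiST index over the HSTORE column
-- is described above as already existing in the stock schema.
CREATE INDEX record_attr_attrs_gist_idx
    ON metabib.record_attr USING GIST (attrs);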

The above by itself should be a big improvement. Mike Rylander has told me that he sees how to optimize the above example even further by using certain operators meant for fast HSTORE searching, but what h...

Lebbeous Fogle-Weekley (lebbeous) wrote :

Oh, and I didn't state something obvious, yet probably more complex: a parallelized solution could do this better still, but I haven't come up with one yet myself.

Dan Wells (dbw2) wrote :

Hello Lebbeous,

We are also facing this problem. I did some testing of the refactored query above (second pastie), but unfortunately, it doesn't quite work. If you have a record which meets the latter two criteria but not the first, you end up with multiple rows for that record (with, obviously, un-summed quality values). This happens because n25 joins with n24, and n26 joins with n24, but n26 doesn't join with n25 (which only matters in cases where n25 didn't already join with n24). It would be nice if we could OR the join condition, but apparently this isn't allowed for FULL OUTER JOINs (which makes sense). We ideally need to establish some sort of hierarchical join condition for n26, but I don't know of an obvious way to do that off-hand.

I did toy with making the JOINs and subqueries themselves hierarchical, and it worked, but it was quite ugly and not remarkably faster (for my test data) than the original.

I am going to continue testing tomorrow (we have a few batches of 300-600 records which we really need to overlay in a more reasonable timeframe). I'm having some success with adding more specific indexes, but need to verify that the improvements in testing actually apply to my real world data.

Thanks,
Dan

Kathy Lussier (klussier) wrote :

Thanks to Lebbeous for thinking up potential solutions and to Dan for continued testing!

Lebbeous, I have a question/concern about the suggestion to move to record attributes. IIRC, we had previously explored using record attributes for 020 matching when first testing this development, but it wasn't a workable solution because it only matched when there was a single 020 field. Most of our records would have multiple 020 fields. Is this still the case, or has any thought been given to providing functionality so that record attributes can match on repeatable fields? I understand that the ability to match on repeatable 020 fields would probably add some cost to the upload times, but there really is no way around the fact that these records have multiple 020 fields, and that the order of the 020 fields in the incoming record does not necessarily match the order in the existing record.

Thanks again for giving some thought to this!

Dan Wells (dbw2) wrote :

It looks like we can solve the query problems I mentioned, without overly complicating the query, by using USING. I am getting promising results with something like the following:

SELECT
        record,
        (COALESCE(n24.quality,0) + COALESCE(n25.quality,0) + COALESCE(n26.quality,0)) AS quality
FROM (
        SELECT *, 300 AS quality
        FROM metabib.full_rec mfr
        WHERE mfr.tag = '022' AND mfr.subfield = 'a' AND
                LOWER(mfr.value) LIKE '2%'
) n24
FULL OUTER JOIN (
        SELECT *, 400 AS quality
        FROM metabib.full_rec mfr
        WHERE mfr.tag = '901' AND mfr.subfield = 'c' AND
        (mfr.value = '1553709')
) n25 USING (record)
FULL OUTER JOIN (
        SELECT *, 500 AS quality
        FROM metabib.full_rec mfr
        WHERE mfr.tag = '020' AND mfr.subfield = 'a' AND
        LOWER(mfr.value) LIKE '2%'
) n26 USING (record)
JOIN biblio.record_entry bre
        ON (bre.id = record)
WHERE bre.deleted IS FALSE;

Also, the biggest index problem I have found is that we have both a 'lower()' index and a separate 'substring()' index for the 02x fields, but this query really needs a 'lower(substring())' index to function optimally. I am seeing good speedups in my artificial testing with this new index, and if it holds up, I will try to post some real-world stats later on today.
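
For concreteness, the shape I have in mind is roughly the following. This is a sketch only: the index name, column list, and partial-index predicate are illustrative guesses, not the exact definition I'm testing, and the substring length assumes the 1024-character cap used by the metabib.full_rec view.

-- Illustrative only: names and details are guesses at the general shape.
-- text_pattern_ops lets LIKE prefix matches use the index in non-C locales.
CREATE INDEX metabib_full_rec_02x_isxn_idx
    ON metabib.real_full_rec (
        tag,
        subfield,
        LOWER(SUBSTRING(value FROM 1 FOR 1024)) text_pattern_ops
    )
    WHERE tag LIKE '02%';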

Lebbeous Fogle-Weekley (lebbeous) wrote :

Dan, thanks for looking at that! I'm interested in seeing the stats and finding out how close we may be to a faster solution.

Kathy, I think you raise a good point. That sounds right to me, but I'd want Mike Rylander to look at this and confirm. I believe Mike wrote a draft of an idea for extending record attributes / single-value fields (SVF) to multiple-value fields (MVF), but if I'm remembering right, that would be even more additional development work than I originally was thinking. But since Dan may really be on to something with a non-SVF solution, I'm looking forward to seeing his next post.

Mike Rylander (mrylander) wrote :

Kathy, re multiple values in SVF entries, that is still a limitation, but one that could, conceivably, be worked around today if one is willing to accept an upper limit on the number of, say, ISBNs that are indexed for a record. In practical terms, three seems like a likely upper limit there. You'd simply define SVFs using XPath expressions that specify a position in the nodeset that returns all ISBN fields. Something like:

 * first_isbn = //*[@tag='020']/*[@code='a' or @code='z'][1]
 * second_isbn = //*[@tag='020']/*[@code='a' or @code='z'][2]
 * third_isbn = //*[@tag='020']/*[@code='a' or @code='z'][3]
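
To make that concrete, such definitions might be loaded roughly like this. This is a sketch only: the attribute names come from the list above, the column list assumes the stock name/label/xpath columns of config.record_attribute_definition, and a reingest of existing bibs would still be required afterwards.

-- Sketch only: assumes the name/label/xpath columns; attribute names are
-- the hypothetical ones from the list above.
INSERT INTO config.record_attribute_definition (name, label, xpath)
VALUES
    ('first_isbn',  'First ISBN',  $$//*[@tag='020']/*[@code='a' or @code='z'][1]$$),
    ('second_isbn', 'Second ISBN', $$//*[@tag='020']/*[@code='a' or @code='z'][2]$$),
    ('third_isbn',  'Third ISBN',  $$//*[@tag='020']/*[@code='a' or @code='z'][3]$$);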

However, there are other possible options, all requiring more development than the plan Lebbeous proposed. The most performant, IMO, would be to create an MVF infrastructure, as Lebbeous mentioned, that records all values of repeatable subfields as a list of one or more strings. An index could certainly be constructed over an expansion of that to cover the queries we want to issue.

Another option, which I've been pondering for quite a while, is to simply have an "identifying values" table that would pull, based on configuration, specific fields out of every record specifically for the purpose of this sort of search. This is less generalized, but not specific to exactly this one use.

We could come up with other schemes as well, but IMO the fastest way forward to relieve the pressure is to refactor what we have now.

Mike Rylander (mrylander) wrote :

Also, the specific optimization I was thinking of was to combine all SVF tests into a single JOIN ... USING for each nesting level of the matchpoint set and construct the appropriate tests using the @> (left contains right, for ANDed conditions) and ? (overlaps, for ORed conditions) operators. This would be in combination with a change that gets rid of the single flat list of FULL OUTER JOINs -- we'd end up with a tree of subqueries that only join what's needed at that nesting level, and then wrap the upper-most nesting level in a query that joins to bre for deletedness checking.
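
To give a flavor of the kind of test involved (this is not the exact query shape Mike describes; the attribute names and values are made up), HSTORE containment lets ANDed conditions collapse into a single indexable predicate, while ORed conditions can be written as separate containment tests:

-- Made-up attribute names and values; illustrates HSTORE containment,
-- not the exact refactor proposed above.

-- ANDed match points: one containment test against the attrs column.
SELECT id
  FROM metabib.record_attr
 WHERE attrs @> 'isbn=>9780313345807, issn=>1534-5807'::HSTORE;

-- ORed match points: separate containment tests combined with OR.
SELECT id
  FROM metabib.record_attr
 WHERE attrs @> 'isbn=>9780313345807'::HSTORE
    OR attrs @> 'issn=>1534-5807'::HSTORE;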

Dan Wells (dbw2) wrote :

I spent a fair amount of time working on this late last week, but it ended up being mostly an exercise in frustration. Still, I am going to attempt to summarize what I tried and how things ended up.

First, I was never able to reproduce the wideness of the variance reported by others. Our "slow" loads take around 8-10 seconds per record to do the initial "Processing...", so a 6 record load processes for around a minute, not 3 or 4. (That said, I still think our loads are too slow.) It is also possible that the 6 records in the initial report are a worse case than I encountered. Sharon, if you are able, please post a marcxml file of those 6 records for further testing.

Second, I was successful in (apparently) speeding up the underlying query using a combination of Lebbeous's refactoring plan, bumping up ANALYZE statistics and reanalyzing metabib.real_full_rec, and adding a lower(substring()) index. However, the speedup was only partial: with these three changes most queries became essentially "instant", but queries with more than 6 or so ISBNs were still slow (~4s per query).

Ultimately, as one might expect by now, the problem seems to center on the query planner. In most cases it chooses to use the tag/subfield index, then filter by value, whereas in the case of identifiers (and assuming we have added a matching index, such as the lower(substring()) one), it would be far better off using the value index, then filtering by tag/subfield (which will very often be negligible, since the values themselves are generally unique, or close to it). Alternately, we could simply create a tag/subfield/lower(substring()) index for the 02x fields. The size increase would be small, though the savings from not filtering are also very small.

So, in the end how do we coax this query to use our value index over any other, for identifier fields, no matter what? I am really green in this area, but it seems like the tag/subfield/transformed_value index *might* be enough. Another thing perhaps worth trying would be a subquery on value alone, then an outer query for the tag/subfield.

Unfortunately, that is not the end of our problems. Even if we make a nice value index available to Vandelay, and convince it that it is a good idea, it currently will never use it, as we are employing a 'LIKE ANY()' syntax which apparently never uses indexes. See here:
http://postgresql.1045698.n5.nabble.com/Index-use-difference-betweer-LIKE-LIKE-ANY-td3400118.html

You can verify it for yourself like so:
EXPLAIN SELECT * from metabib.full_rec WHERE value LIKE ANY(ARRAY['9780313345807%']::TEXT[]);

That does a sequential scan on metabib.full_rec, and that is always a *bad* plan.

That issue looks like it might be fixed in PG9.2, here (if I understand what the commit message is saying):
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9e8da0f75731aaa7605cf4656c21ea09e84d2eb1

In the meantime, I am not aware of a reason we shouldn't refactor that ANY(...) into a 'LIKE ... OR LIKE ...' construct. Is there something I am not thinking of? That, plus the tag/subfield/lower(substring()) index for the 02x fields should greatly improve performance.
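
As a concrete sketch of that rewrite (ISBNs are placeholders and the predicates are simplified from the generated matcher SQL), the change is from the ANY() form to explicit ORed LIKEs that the planner can match against a suitable expression index:

-- Current form: the LIKE ANY() predicate is never served from an index.
SELECT record
  FROM metabib.full_rec
 WHERE tag = '020' AND subfield = 'a'
   AND LOWER(value) LIKE ANY (ARRAY['9780313345807%', '0313345807%']);

-- Proposed form: each LIKE prefix test can use a suitable expression index.
SELECT record
  FROM metabib.full_rec
 WHERE tag = '020' AND subfield = 'a'
   AND (LOWER(value) LIKE '9780313345807%' OR LOWER(value) LIKE '0313345807%');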

With all the movi...

Dan Wells (dbw2) wrote :

Ok, after a short furlough, I am back with a branch for testing:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbwells/vandelay_overlay_match_speed_up

This branch reorders the joins as originally suggested by Lebbeous, then adds a specific index to coax the query planner into making the best choices. The example index here only targets the 02x identifier fields, so other tag/subfield/substring(value) indexes will likely be necessary for maximum benefit when matching on other fields (e.g. 010 or 035). Check out the commit to better understand what I mean.

These changes build on the branch on LP#1018107, which replaces 'LIKE ANY' with 'LIKE (... OR ...)'. While each of the three changes on its own does help, the combination of all three (LIKE (... OR ...), join reorder, new index) has shown great promise in testing.

I've deployed these changes to our production server, so I'm confident there are no major bugs for our use, but this is highly configuration dependent, so more feedback is definitely desired.

Thanks,
Dan

Lebbeous Fogle-Weekley (lebbeous) wrote :

Hi Dan, and thanks a ton! I will be testing this very soon, but that's not to discourage testing by anyone else who can pick it up now. Thanks again.

Lebbeous

tags: added: pullrequest
Lebbeous Fogle-Weekley (lebbeous) wrote :

Hi Dan,

This code works in my testing. I don't have the large dataset I need to measure the speed improvement meaningfully, but I think large sites waiting to be able to use match sets effectively will want to jump on this.

I pushed a branch here with your commit signed off. My "like-any" commit is HEAD~1 in this branch, and if you can sign off on that, this can go into master, rel_2_3, and rel_2_2.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/senator/vandelay_overlay_match_speed_up_signoff

Thanks again!

Dan Wells (dbw2) wrote :

Just last night, I thought to myself (out of the blue), 'Dang, I should have signed off on that "like-any" commit'.

I added the missing sign-off and pushed to master, rel_2_3, and rel_2_2.

I suspect this might not be the last we hear of this issue, but we have certainly taken a few steps forward!

Changed in evergreen:
status: New → Fix Committed
Sharon Herbert (sherbert-u) wrote :

Thanks very much Dan and Lebbeous for all of your work on this! We hope to test this shortly against a copy of our production server, so will let you know how it goes. Thanks again,

Sharon

Sharon Herbert (sherbert-u) wrote :

Hi Dan and Lebbeous,
We've tested this fix on a copy of production and the results are vastly improved! Hooray! Here are Laurie's test results:
===
Great news after testing the fix on this -- the speed for queuing using a record match
set is tremendously improved. (Planning to deploy on production in the next day or so)

Using Record Match Set matching on 020a

MARC File of 10 records -- almost instantaneous (previously 5 minutes)
MARC File of 6 records -- almost instantaneous (previously 3 minutes)
MARC File of 70 records -- 12s (previously 35 minutes)

Using Record Match Set (matching on 901c or 020a or 010a or 022a or 035a)

6 records: 4s
70 records: 15s

Using Record Match set (020a AND 010a)

70 records: 13s
===
Great results, so big thanks and kudos!

Sharon

Lebbeous Fogle-Weekley (lebbeous) wrote :

Thank you Sharon for doing these timed tests and reporting your results here. It's great to have a solid understanding of the impact of these changes.

And I, too, would like to thank Dan one more time for identifying a path to such big performance gains and delivering working code in an area that is not particularly easy to work on or quick to test.

Ben Shum (bshum)
Changed in evergreen:
status: Fix Committed → Fix Released