Rows doubled or tripled in metabib.%_field_entry tables

Bug #1032208 reported by Bob Wicksall
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Medium
Unassigned
2.3
Fix Released
Medium
Unassigned
2.4
Fix Released
Medium
Unassigned

Bug Description

Evergreen 2.2.0

Rows in metabib.title_field_entry, subject_field_entry, series_field_entry, and author_field_entry are doubled or tripled due to bad logic in biblio.extract_metabib_field_entry. This results in these tables being 2 or more times their correct size.

This was introduced in 2.2.0 when the logic for browse_field and facet_field were added to biblio.extract_metabib_field_entry. 2.1 is not affected.

The duplicates are caused when biblio.extract_metabib_field_entry returns TRUE in the search_field column for all rows even if they should just be facet_field or browse_field.

Tags: pullrequest
Revision history for this message
Bob Wicksall (bwicksall) wrote :

I'm attaching a modified biblio.extract_metabib_field_entry that should resolve this issue. On our test server this has had a fairly significant impact on row counts and table sizes:

  author_field_entry 1786920 rows before and 494144 rows after
  series_field_entry 330018 rows before and 108101 rows after
  subject_field_entry 4506630 rows before and 971917 rows after
  title_field_entry 1032180 rows before and 528558 rows after

I'm not sure if this translates to faster searches but it "feels" faster to me.

Revision history for this message
Mike Rylander (mrylander) wrote :

This looks great to me. I'll get it into master, 2.3 and 2.2. Thanks, Bob!

Changed in evergreen:
milestone: none → 2.3.0-beta2
milestone: 2.3.0-beta2 → none
Changed in evergreen:
assignee: nobody → Mike Rylander (mrylander)
Revision history for this message
Mike Rylander (mrylander) wrote :

I've pushed collab/miker/reduce_index_bloat to the working repo for further testing. I reworked the patch a little to reduce the amount of change a little bit. Thanks, Bob!

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/reduce_index_bloat

tags: added: pullrequest
Changed in evergreen:
assignee: Mike Rylander (mrylander) → nobody
Revision history for this message
Dan Scott (denials) wrote :
Download full text (3.3 KiB)

A test with just the concerto sample set of 100 bibs shows the following difference:

Before the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
 count
-------
   448
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
 count
-------
   765
(1 row)

After the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
 count
-------
   174
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
 count
-------
   161
(1 row)

I started looking at adding a call to the upgrade script that would trim the duplicate fields from an existing 2.2 or 2.3 system that already had dupes, and came up with:

DELETE FROM metabib.title_field_entry
  WHERE id NOT IN (
    SELECT MAX(id) FROM metabib.title_field_entry
    GROUP BY source, field, value, index_vector
  );

But that only eliminated 132 rows - not the 274 I expected. Looking at the results, I see "duplicates" like:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
 id | source | field | value
-----+--------+-------+------------------
 173 | 45 | 5 | Concertos, piano
 178 | 47 | 5 | Concertos, piano
 265 | 63 | 5 | Concertos, piano
 313 | 72 | 5 | Concertos, piano
(4 rows)

So I thought - hey, don't we want to retain one row for each source? And then, when I recreate the database schema with the duplicate-reduction patch applied, and run the same query, I get:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
 id | source | field | value
-----+--------+-------+------------------
  68 | 45 | 5 | Concertos, piano
 103 | 63 | 5 | Concertos, piano
 120 | 72 | 5 | Concertos, piano
(3 rows)

.. which seemed _really_ weird. If we want to keep all unique source/field/value rows, shouldn't the "Concertos, piano" from bib #47 be in there too? Of course, the answer is that for bib #47, which has two 730 (uniform title) entries, there is the following entry:

71 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra |

... basically, in the function, the raw_text variable gets concatenated with all of the matching values for that metabib field definition, and the final result includes all of the matching text - which is supposed to be added in a single row. Here's what the old (bad) approach created:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE source = 47 and field = 5 ORDER BY id ASC;
 id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
 178 | 47 | 5 | Concertos, piano
 179 | 47 | 5 | Rapsodie sur un thème de Paganini, piano, orchestra
 180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)

So, refining our clean-up command a bit, we could do the following:

DELETE FROM metabib.title_field_entry
  WHERE id NOT IN (
    SELECT MAX(id) FROM metabib.title_field_entry
    GROUP BY source, field
  );

... and indeed, that del...

Read more...

Revision history for this message
Dan Scott (denials) wrote :

Note: I'm not 100% sure this is ready for prime time.

If the old, bad approach created:

 id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
 178 | 47 | 5 | Concertos, piano
 179 | 47 | 5 | Rapsodie sur un thème de Paganini, piano, orchestra
 180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)

and the new approach creates:

 id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
 180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(1 rows)

... then it seems like there is a loss of granularity for search purposes (e.g. searching for "starts with Rapsodie").

More testing needed?

Revision history for this message
Bob Wicksall (bwicksall) wrote :

I think the "old, bad approach" created some additional granularity as a side effect. Is that granularity valid? Is searching for "starts with Rapsodie" a valid search or should it be "starts with Concertos and contains Rapsodie "?

The old approach also created a lot of the following:

field_class | field | source | value

title | 6 | 427639 | Persuasion /
title | 6 | 427639 | Persuasion /
subject | 11 | 427639 | England
subject | 11 | 427639 | England
subject | 11 | 427639 | England
subject | 14 | 427639 | Man-woman relationships
subject | 14 | 427639 | Man-woman relationships
subject | 14 | 427639 | Man-woman relationships

Changed in evergreen:
milestone: none → 2.4.0-alpha
Changed in evergreen:
status: New → Confirmed
Ben Shum (bshum)
Changed in evergreen:
milestone: 2.4.0-alpha1 → 2.4.0-beta
Ben Shum (bshum)
Changed in evergreen:
milestone: 2.4.0-beta → 2.4.0-rc
importance: Undecided → Medium
Ben Shum (bshum)
Changed in evergreen:
milestone: 2.4.0-rc → none
Revision history for this message
Dan Scott (denials) wrote :

Bob: I _do_ think the granularity has value; the "starts with" in the cases I was looking at referred to individual movements in a symphony, and it's absolutely reasonable to expect that researchers want to search for those songs by name.

That is distinct from the entirely duplicated entries for a single record, which offers no value to researchers and simply bloats the table and indexes.

Revision history for this message
Dan Scott (denials) wrote :

Over in bug 1187433, I offered up http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/another_metabib_ingest_fix which I believe provides a nice balance between reducing index bloat while maintaining index granularity.

Ben Shum (bshum)
no longer affects: evergreen/2.2
Changed in evergreen:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.