Bug #1032208 “Rows doubled or tripled in metabib.%_field_entry t...” : Bugs : Evergreen

Revision history for this message

Bob Wicksall (bwicksall) wrote on 2012-08-07:

#1

biblio.extract_metabib_field_entry_new.sql Edit (4.9 KiB, text/plain)

I'm attaching a modified biblio.extract_metabib_field_entry that should resolve this issue. On our test server this has had a fairly significant impact on row counts and table sizes:

  author_field_entry 1786920 rows before and 494144 rows after
  series_field_entry 330018 rows before and 108101 rows after
  subject_field_entry 4506630 rows before and 971917 rows after
  title_field_entry 1032180 rows before and 528558 rows after

I'm not sure if this translates to faster searches but it "feels" faster to me.

Revision history for this message

Mike Rylander (mrylander) wrote on 2012-08-07:

#2

This looks great to me. I'll get it into master, 2.3 and 2.2. Thanks, Bob!

Changed in evergreen:
milestone:	none → 2.3.0-beta2
milestone:	2.3.0-beta2 → none

Mike Rylander (mrylander) on 2012-08-07

Changed in evergreen:
assignee:	nobody → Mike Rylander (mrylander)

Revision history for this message

Mike Rylander (mrylander) wrote on 2012-08-10:

#3

I've pushed collab/miker/reduce_index_bloat to the working repo for further testing. I reworked the patch a little to reduce the amount of change a little bit. Thanks, Bob!

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/reduce_index_bloat

tags:	added: pullrequest
Changed in evergreen:
assignee:	Mike Rylander (mrylander) → nobody

Revision history for this message

Dan Scott (denials) wrote on 2012-08-16:

#4

Download full text (3.3 KiB)

A test with just the concerto sample set of 100 bibs shows the following difference:

Before the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
count
-------
448
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
count
-------
765
(1 row)

After the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
count
-------
174
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
count
-------
161
(1 row)

I started looking at adding a call to the upgrade script that would trim the duplicate fields from an existing 2.2 or 2.3 system that already had dupes, and came up with:

DELETE FROM metabib.title_field_entry
  WHERE id NOT IN (
    SELECT MAX(id) FROM metabib.title_field_entry
    GROUP BY source, field, value, index_vector
  );

But that only eliminated 132 rows - not the 274 I expected. Looking at the results, I see "duplicates" like:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
id | source | field | value
-----+--------+-------+------------------
173 | 45 | 5 | Concertos, piano
178 | 47 | 5 | Concertos, piano
265 | 63 | 5 | Concertos, piano
313 | 72 | 5 | Concertos, piano
(4 rows)

So I thought - hey, don't we want to retain one row for each source? And then, when I recreate the database schema with the duplicate-reduction patch applied, and run the same query, I get:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
id | source | field | value
-----+--------+-------+------------------
68 | 45 | 5 | Concertos, piano
103 | 63 | 5 | Concertos, piano
120 | 72 | 5 | Concertos, piano
(3 rows)

.. which seemed _really_ weird. If we want to keep all unique source/field/value rows, shouldn't the "Concertos, piano" from bib #47 be in there too? Of course, the answer is that for bib #47, which has two 730 (uniform title) entries, there is the following entry:

71 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra |

... basically, in the function, the raw_text variable gets concatenated with all of the matching values for that metabib field definition, and the final result includes all of the matching text - which is supposed to be added in a single row. Here's what the old (bad) approach created:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE source = 47 and field = 5 ORDER BY id ASC;
id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
178 | 47 | 5 | Concertos, piano
179 | 47 | 5 | Rapsodie sur un thème de Paganini, piano, orchestra
180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)

So, refining our clean-up command a bit, we could do the following:

DELETE FROM metabib.title_field_entry
  WHERE id NOT IN (
    SELECT MAX(id) FROM metabib.title_field_entry
    GROUP BY source, field
  );

... and indeed, that del...

A test with just the concerto sample set of 100 bibs shows the following difference:

Before the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
 count 
-------
   448
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
 count 
-------
   765
(1 row)

After the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
 count 
-------
   174
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
 count 
-------
   161
(1 row)

I started looking at adding a call to the upgrade script that would trim the duplicate fields from an existing 2.2 or 2.3 system that already had dupes, and came up with:

DELETE FROM metabib.title_field_entry
  WHERE id NOT IN (
    SELECT MAX(id) FROM metabib.title_field_entry
    GROUP BY source, field, value, index_vector
  );

But that only eliminated 132 rows - not the 274 I expected. Looking at the results, I see "duplicates" like:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
 id  | source | field |      value       
-----+--------+-------+------------------
 173 |     45 |     5 | Concertos, piano
 178 |     47 |     5 | Concertos, piano
 265 |     63 |     5 | Concertos, piano
 313 |     72 |     5 | Concertos, piano
(4 rows)

So I thought - hey, don't we want to retain one row for each source? And then, when I recreate the database schema with the duplicate-reduction patch applied, and run the same query, I get:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
 id  | source | field |      value       
-----+--------+-------+------------------
  68 |     45 |     5 | Concertos, piano
 103 |     63 |     5 | Concertos, piano
 120 |     72 |     5 | Concertos, piano
(3 rows)

.. which seemed _really_ weird. If we want to keep all unique source/field/value rows, shouldn't the "Concertos, piano" from bib #47 be in there too? Of course, the answer is that for bib #47, which has two 730 (uniform title) entries, there is the following entry:

71 |     47 |     5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra |

... basically, in the function, the raw_text variable gets concatenated with all of the matching values for that metabib field definition, and the final result includes all of the matching text - which is supposed to be added in a single row. Here's what the old (bad) approach created:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE source = 47 and field = 5 ORDER BY id ASC;
 id  | source | field |                                value                                 
-----+--------+-------+----------------------------------------------------------------------
 178 |     47 |     5 | Concertos, piano
 179 |     47 |     5 | Rapsodie sur un thème de Paganini, piano, orchestra
 180 |     47 |     5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)

So, refining our clean-up command a bit, we could do the following:

DELETE FROM metabib.title_field_entry
  WHERE id NOT IN (
    SELECT MAX(id) FROM metabib.title_field_entry
    GROUP BY source, field
  );

... and indeed, that deletes 274 rows.

So, after all that learning -- I've signed off on the initial commit, and added another commit that includes index debloating at user/dbs/reduce_index_bloat in working. If that's good, then let's push it!

Revision history for this message

Dan Scott (denials) wrote on 2012-08-31:

#5

Note: I'm not 100% sure this is ready for prime time.

If the old, bad approach created:

id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
178 | 47 | 5 | Concertos, piano
179 | 47 | 5 | Rapsodie sur un thème de Paganini, piano, orchestra
180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)

and the new approach creates:

id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(1 rows)

... then it seems like there is a loss of granularity for search purposes (e.g. searching for "starts with Rapsodie").

More testing needed?

Revision history for this message

Bob Wicksall (bwicksall) wrote on 2012-09-04:

#6

I think the "old, bad approach" created some additional granularity as a side effect. Is that granularity valid? Is searching for "starts with Rapsodie" a valid search or should it be "starts with Concertos and contains Rapsodie "?

The old approach also created a lot of the following:

field_class | field | source | value

title | 6 | 427639 | Persuasion /
title | 6 | 427639 | Persuasion /
subject | 11 | 427639 | England
subject | 11 | 427639 | England
subject | 11 | 427639 | England
subject | 14 | 427639 | Man-woman relationships
subject | 14 | 427639 | Man-woman relationships
subject | 14 | 427639 | Man-woman relationships

Jason Stephenson (jstephenson) on 2012-11-08

Changed in evergreen:
milestone:	none → 2.4.0-alpha

Jason Stephenson (jstephenson) on 2012-12-13

Changed in evergreen:
status:	New → Confirmed

Ben Shum (bshum) on 2013-03-03

Changed in evergreen:
milestone:	2.4.0-alpha1 → 2.4.0-beta

Ben Shum (bshum) on 2013-03-17

Changed in evergreen:
milestone:	2.4.0-beta → 2.4.0-rc
importance:	Undecided → Medium

Ben Shum (bshum) on 2013-04-27

Changed in evergreen:
milestone:	2.4.0-rc → none

Revision history for this message

Dan Scott (denials) wrote on 2013-06-13:

#7

Bob: I _do_ think the granularity has value; the "starts with" in the cases I was looking at referred to individual movements in a symphony, and it's absolutely reasonable to expect that researchers want to search for those songs by name.

That is distinct from the entirely duplicated entries for a single record, which offers no value to researchers and simply bloats the table and indexes.

Revision history for this message

Dan Scott (denials) wrote on 2013-06-13:

#8

Over in bug 1187433, I offered up http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/another_metabib_ingest_fix which I believe provides a nice balance between reducing index bloat while maintaining index granularity.

Ben Shum (bshum) on 2013-08-22

no longer affects:

evergreen/2.2

Mike Rylander (mrylander) on 2013-08-30

Changed in evergreen:
status:	Confirmed → Fix Released

Evergreen

Rows doubled or tripled in metabib.%_field_entry tables

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

	Status	Importance	Assigned to	Milestone
Evergreen	Fix Released	Medium	Unassigned
2.3	Fix Released	Medium	Unassigned
2.4	Fix Released	Medium	Unassigned	Evergreen 2.4.1