A test with just the concerto sample set of 100 bibs shows the following difference:

Before the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
 count
-------
   448
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
 count
-------
   765
(1 row)

After the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
 count
-------
   174
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
 count
-------
   161
(1 row)

I started looking at adding a call to the upgrade script that would trim the duplicate fields from an existing 2.2 or 2.3 system that already had dupes, and came up with:

DELETE FROM metabib.title_field_entry
 WHERE id NOT IN (
       SELECT MAX(id)
         FROM metabib.title_field_entry
        GROUP BY source, field, value, index_vector
 );

But that only eliminated 132 rows - not the 274 I expected (448 - 174 = 274). Looking at the results, I see "duplicates" like:

SELECT id, source, field, value
  FROM metabib.title_field_entry
 WHERE value = 'Concertos, piano';

 id  | source | field |      value
-----+--------+-------+------------------
 173 |     45 |     5 | Concertos, piano
 178 |     47 |     5 | Concertos, piano
 265 |     63 |     5 | Concertos, piano
 313 |     72 |     5 | Concertos, piano
(4 rows)

So I thought - hey, don't we want to retain one row for each source? And then, when I recreate the database schema with the duplicate-reduction patch applied and run the same query, I get:

SELECT id, source, field, value
  FROM metabib.title_field_entry
 WHERE value = 'Concertos, piano';

 id  | source | field |      value
-----+--------+-------+------------------
  68 |     45 |     5 | Concertos, piano
 103 |     63 |     5 | Concertos, piano
 120 |     72 |     5 | Concertos, piano
(3 rows)

... which seemed _really_ weird. If we want to keep all unique source/field/value rows, shouldn't the "Concertos, piano" from bib #47 be in there too?

Of course, the answer is that for bib #47, which has two 730 (uniform title) entries, there is the following entry:

 71 |     47 |     5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra | ...

Basically, in the function, the raw_text variable gets concatenated with all of the matching values for that metabib field definition, and the final result includes all of the matching text - which is supposed to be added in a single row. Here's what the old (bad) approach created:

SELECT id, source, field, value
  FROM metabib.title_field_entry
 WHERE source = 47 AND field = 5
 ORDER BY id ASC;

 id  | source | field |                                value
-----+--------+-------+----------------------------------------------------------------------
 178 |     47 |     5 | Concertos, piano
 179 |     47 |     5 | Rapsodie sur un thème de Paganini, piano, orchestra
 180 |     47 |     5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)

So, refining our clean-up command a bit, we could do the following:

DELETE FROM metabib.title_field_entry
 WHERE id NOT IN (
       SELECT MAX(id)
         FROM metabib.title_field_entry
        GROUP BY source, field
 );

... and indeed, that deletes 274 rows. Grouping by (source, field) alone keeps just the highest-id row for each bib/field pair - and since the old code inserted the full concatenated string last, that is exactly the row worth keeping.

So, after all that learning -- I've signed off on the initial commit, and added another commit that includes the index debloating at user/dbs/reduce_index_bloat in working. If that's good, then let's push it!
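One possible follow-up thought (not part of the branch above, just a sketch): the author_field_entry counts show the same bloat as title_field_entry, so the clean-up presumably needs to run against every metabib entry table, not just title. Assuming the usual set of metabib.*_field_entry tables - the list below is my guess, so verify it against your actual schema first - and PostgreSQL 9.1+ for format() and FOREACH, a DO block like this would apply the same DELETE everywhere:

DO $$
DECLARE
    tbl text;
BEGIN
    -- Hypothetical table list; adjust to match the metabib.*_field_entry
    -- tables actually present in your schema.
    FOREACH tbl IN ARRAY ARRAY['title_field_entry', 'author_field_entry',
                               'subject_field_entry', 'keyword_field_entry',
                               'series_field_entry', 'identifier_field_entry']
    LOOP
        -- Same dedup logic as the refined clean-up above: keep only the
        -- highest-id (fully concatenated) row per (source, field) pair.
        EXECUTE format(
            'DELETE FROM metabib.%I
              WHERE id NOT IN (
                    SELECT MAX(id)
                      FROM metabib.%I
                     GROUP BY source, field
              )',
            tbl, tbl);
    END LOOP;
END;
$$;

Running that inside an explicit BEGIN/COMMIT makes it cheap to inspect the results and ROLLBACK if the numbers look off.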
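And for a quick after-the-fact sanity check, the planner statistics give approximate row counts for all of the entry tables at once (n_live_tup is only an estimate, so use COUNT(*) where you need exact figures like the ones above):

SELECT relname, n_live_tup
  FROM pg_stat_user_tables
 WHERE schemaname = 'metabib'
   AND relname LIKE '%_field_entry'
 ORDER BY relname;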