Enhanced MARC importer script electronic_marc_import.pl

Bug #1947898 reported by Blake GH
This bug affects 6 people
Affects | Status | Importance | Assigned to | Milestone
Evergreen | Confirmed | Wishlist | Unassigned |

Bug Description

I would like to submit my work to the Evergreen repository, making it easier to share. We've created a Swiss Army Knife Perl script which imports MARC records from several supported sources:

Local Folder
FTP (recursive or not)
CloudLibrary
MARCive https

Files are downloaded only when the filename matches the configured filename fragment. Each file is likewise designated as "adds", "deletes", or "authority control" by a configured filename fragment.
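
As a rough illustration of the fragment matching described above (sketched in Python for brevity; the importer itself is Perl). The configuration keys, the fragments, and the "unknown" fallback are all hypothetical, not the script's real option names or behavior:

```python
from typing import Optional

# Hypothetical configuration: key names and fragments are illustrative only.
CONFIG = {
    "download_fragment": "springfield",
    "kind_fragments": {
        "deletes": "delete",
        "authority control": "auth",
        "adds": "add",
    },
}


def classify(filename: str, config: dict = CONFIG) -> Optional[str]:
    """Return the designation for a file, or None if it should not be downloaded."""
    name = filename.lower()
    # Files that do not contain the download fragment are skipped entirely.
    if config["download_fragment"].lower() not in name:
        return None
    # The first kind whose fragment appears in the filename wins.
    for kind, fragment in config["kind_fragments"].items():
        if fragment.lower() in name:
            return kind
    # Assumption: a downloaded file with no kind fragment is flagged, not guessed.
    return "unknown"
```

With this sketch, a file such as Kanopy_MARC_Records__additions__springfieldlibrary.mrc would be downloaded and treated as "adds", while a file lacking the download fragment would be ignored.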

Beyond fetching these MARC records, the tool can be configured to edit them as they are imported. It automates work that would otherwise be manual:

- Appending $9 to 856's (when configured)

- Editing/removing/replacing any MARC field(s) (when configured)

- Processing imported Authority Control records

- Setting up and continuing to import into a configured "Bib Source"

- Matching incoming records onto the same record that was previously imported by this tool. This is accomplished with a special $7 subfield on the 856s

- Deletes remove only the related 856; the whole bib is removed only when zero 856s remain.

- Allows matching against records other than those previously imported by this tool (usually configured for the first run)

- Sends an email when it begins and an email with the result summary upon completion

- The results are also logged to a CSV file

- The results are also stored in its database schema

- Includes another script to "sync" the $9s to all of the already-imported bibs if and when the participating libraries change.
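
The delete behavior in the list above can be sketched roughly as follows (Python for brevity; the importer itself is Perl). This is a minimal model under stated assumptions: the `Field` and `Bib` classes are hypothetical stand-ins for real MARC objects, and identifying the "related" 856 by comparing its $7 to a marker value is a guess at the mechanism, not a description of the actual code:

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    tag: str
    subfields: dict = field(default_factory=dict)


@dataclass
class Bib:
    fields: list = field(default_factory=list)


def apply_delete(bib: Bib, marker: str) -> bool:
    """Drop 856 fields whose $7 equals marker.

    Returns True when the whole bib should be removed, i.e. at least one
    856 was dropped and no 856 fields remain afterwards.
    """
    before = len(bib.fields)
    bib.fields = [
        f for f in bib.fields
        if not (f.tag == "856" and f.subfields.get("7") == marker)
    ]
    removed = before - len(bib.fields)
    has_856 = any(f.tag == "856" for f in bib.fields)
    return removed > 0 and not has_856
```

A bib carrying 856s from two sources would survive a delete from one source and be flagged for removal only after the second source's 856 is gone.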

This tool requires its own schema in the database. The script creates the schema when it's used for the first time, but I'll be refactoring it to integrate that component into the Evergreen/OpenSRF/src/sql/Pg folder, making it easier to upgrade from one version to another when changes to the schema might be required.

Branch incoming.

Blake GH (bmagic)
tags: added: supportscripts
Blake GH (bmagic) wrote :

Stubbing the files in the working repo. Merged all of the external Perl Module functionality into the single script.

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/blake/enhanced_marc_importer_lp1947898

Changed in evergreen:
importance: Undecided → Wishlist
Michele Morgan (mmorgan)
tags: added: pullrequest
Blake GH (bmagic) wrote :

Alright then! I've got the code to a place where I feel like it's time to invite others to review. I've confirmed that Evergreen can build and run with this patch. The only risky change is to fm_IDL.xml, which seems to be sound. I've run some reports in the interface against the new schema.

Changed in evergreen:
status: New → Confirmed
assignee: nobody → Jason Stephenson (jstephenson)
Blake GH (bmagic) wrote :

Updated the branch with a small patch: the 901c match was failing when the script was run with the --match_901c argument.

Blake GH (bmagic) wrote :

Added a bug fix for deep matching on 001 when the incoming 001 is too short (fewer than 6 characters) or non-existent. Also fixed a bug in MARC editing for non-matching fields.

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Jason Stephenson (jstephenson) wrote :

These are my "first impression" comments from having looked at the code and the documentation, but not having run it, yet. These are intended as constructive/actionable suggestions to make it fit better with the existing Evergreen code base and way of doing things. These are my opinions, so take them for what they're worth. Nothing here is meant to be negative.

This code adds the following dependencies to Evergreen:

* DateTime::Format::Duration
* Digest::SHA1
* Email::Sender::Simple
* pQuery

Of the above, the functionality of Digest::SHA1 can be replaced by the sha1 functions of Digest::SHA, which this new code also uses. I suggest refactoring the code to use Digest::SHA and dropping the Digest::SHA1 dependency.

Evergreen currently uses Email::Send to deliver email. Since Email::Send is "deprecated" by the maintainer (see: https://metacpan.org/pod/Email::Send#WAIT!-ACHTUNG!), we could do well to replace Email::Send in other code with Email::Sender::Simple. That's a totally different bug, of course.

Whatever modules stay, they should be added to the prerequisites installation.

libdatetime-format-duration-perl is available as a deb package on all of the Debian and Ubuntu releases currently supported by the Evergreen community. The others do not have packages and need to be installed via CPAN.

Why is the documentation under the development category? Shouldn't it be under cataloging or possibly admin?

Change git://git.esilibrary.com/migration-tools.git to https://github.com/EquinoxOpenLibraryInitiative/migration-tools.git. This is the new home of the tools.

The crontab examples should be changed to remove the sourcing of .bashrc and to not cd to /openils/bin. We should encourage people to set up proper crontab files with the environment set appropriately. This means that we should also fix our other example crontab files as well. (That's another bug that I've been meaning to open for some time.)

The instructions should tell the user how to install the program to /openils/bin, or it should be installed as part of make install. I favor installing it automatically.

I also think that the script and configuration file should be moved out of the Open-ILS/src/support-scripts/bib_magic_importer subdirectory. The script should go in the parent directory (Open-ILS/src/support-scripts) and the configuration example should be moved to Open-ILS/examples. The example configuration should also be installed to SYSCONFDIR as part of the make install process.

To fit better with MARC/LoC nomenclature, change "marc_edit_standard" to "marc_edit_data" as the proper name for these is "Data Field." There is no "Standard Field" in the MARC documentation.

I'll follow up in a few days once I've had a chance to try the code out.

Jason Stephenson (jstephenson) wrote :

The module named in the error below should be loaded only when needed, i.e. when the code that uses it is actually invoked.

Can't locate REST/Client.pm in @INC (you may need to install the REST::Client module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.34.0 /usr/local/share/perl/5.34.0 /usr/lib/x86_64-linux-gnu/perl5/5.34 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl-base /usr/lib/x86_64-linux-gnu/perl/5.34 /usr/share/perl/5.34 /usr/local/lib/site_perl) at /openils/bin/bib_magic_importer line 42.
BEGIN failed--compilation aborted at /openils/bin/bib_magic_importer line 42.

I missed it in the documentation, but I suppose it needs to be added to the prerequisite installation as well.
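
To illustrate the load-on-demand idea (sketched in Python for brevity; in Perl the equivalent is a runtime `require` inside the code path that needs the module, rather than a compile-time `use` at the top of the script). The helper name and the error wording here are hypothetical:

```python
import importlib


def load_optional(module_name: str, feature: str):
    """Import a module only when a feature that needs it is actually used."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Fail with a message naming the feature, instead of aborting at startup.
        raise RuntimeError(
            f"{feature} requires the '{module_name}' module, which is not installed"
        ) from exc
```

With this pattern, a site that never configures the source needing the optional module never hits the missing-module error.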

Jason Stephenson (jstephenson) wrote :

Follow up comments after trying to run 3 batches of records through it:

Configuration example suggestions:

* The example configuration file should be renamed to bib_magic_importer.conf.example.

* Delete the comment about installing the prerequisite modules.

* Provide default directories that match sample Evergreen layout: /openils/var/log/, etc.

* Boolean options should accept Boolean values: 0, 1, true, false, yes, no. They shouldn't need to be commented out to be set to false.

* Can we dispense with the need for the _{N} on the field edits?
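
A minimal sketch of the Boolean handling suggested above (Python for brevity; the importer itself is Perl). The function name is hypothetical, and treating a missing option as the default while rejecting unrecognized values is a design assumption:

```python
_TRUE = {"1", "true", "yes"}
_FALSE = {"0", "false", "no"}


def parse_bool(value, default=False):
    """Interpret a configuration value as a Boolean.

    Accepts 0, 1, true, false, yes, no (case-insensitive). A missing
    option (None) yields the default instead of requiring that the
    line be commented out.
    """
    if value is None:
        return default
    text = str(value).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise ValueError(f"not a Boolean value: {value!r}")
```

Raising on anything else makes a typo in the configuration visible instead of silently reading as false.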

The documentation should warn the user that a recursive descent is done through the incomingmarcfolder, so any subdirectories are scanned as well. This was not obvious to me, and I only found out during my first test run.

It should not try to send email if no email addresses are configured. I commented out the erroremaillist, successemaillist, and alwaysemail options and got the following error at the end of the run:

Use of uninitialized value for string at /usr/share/perl5/Email/MIME/Enc[7/1838]
ine 67.
no recipients

Trace begun at /usr/local/share/perl/5.34.0/Email/Sender/Simple.pm line 116

Email::Sender::Simple::send_email('Email::Sender::Simple', 'Email::Abstract=ARRAY(0x565547895338)', 'HASH(0x565547ad1f48)') called at /usr/local/share/perl/5.34.0/Email/Sender/Role/CommonSending.pm line 45

Email::Sender::Role::CommonSending::try {...} at /usr/share/perl5/Try/Tiny.pm line 102

eval {...} at /usr/share/perl5/Try/Tiny.pm line 93

Try::Tiny::try('CODE(0x56554788f7a8)', 'Try::Tiny::Catch=REF(0x565547896f80)') called at /usr/local/share/perl/5.34.0/Email/Sender/Role/CommonSending.pm line 58

Email::Sender::Role::CommonSending::send('Email::Sender::Simple', 'Email::MIME=HASH(0x565547af3330)') called at /usr/share/perl5/Sub/Exporter/Util.pm line 57

Sub::Exporter::Util::__ANON__('Email::MIME=HASH(0x565547af3330)') called at /openils/bin/bib_magic_importer line 4183

main::email_send('HASH(0x5655478bd150)', 'Evergreen Electronic Import Summary - custom_electronic Import Report Job # 3 WINDING UP', 'Hello,^J^JJust letting you know that I have begun processing the provided files:^JKanopy_MARC_Records__additions__springfieldlibrary.mrc^M^J^JThis software is configured to perform deep search matches against the database. This is slow but thorough.^JDepending on the number of records, it could be days before you receive the finished message. FYI.^JI\'ll send a follow-up email when I\'m done.^J^JYours Truly,^JThe friendly server^J') called at /openils/bin/bib_magic_importer line 1103

main::sendWelcomeMessage('ARRAY(0x5655472a5c70)') called at /openils/bin/bib_magic_importer line 252

I am running this in a test environment that is not set up to send email.

I think the above may have blown up my attempts to process records because nothing seemed to happen after that and no records were loaded.
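
A rough sketch of the guard being asked for (Python for brevity; the importer itself is Perl). The option names erroremaillist and successemaillist come from the configuration file, but treating them as comma-separated address lists, and the injectable `send` callable, are assumptions made for illustration:

```python
def collect_recipients(config: dict) -> list:
    """Gather addresses from the email options, ignoring unset or empty ones."""
    recipients = []
    for key in ("erroremaillist", "successemaillist"):
        raw = config.get(key) or ""
        recipients.extend(a.strip() for a in raw.split(",") if a.strip())
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(recipients))


def send_summary(config: dict, subject: str, body: str, send) -> bool:
    """Send only when someone is configured to receive mail.

    Returns False (and does nothing) when no recipients are configured,
    so the rest of the import can carry on.
    """
    recipients = collect_recipients(config)
    if not recipients:
        return False
    send(recipients, subject, body)
    return True
```

Returning early keeps a missing email configuration from being fatal, which matters if the failed send is what stopped the record load.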

Looking at the data in bib_magic.import_status, t...


tags: added: needswork
removed: pullrequest
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Jason Stephenson (jstephenson) wrote :

Another suggestion regarding the configuration file: Have you considered using YAML or XML or some other format? There are Perl modules for parsing those formats, which would eliminate the need for the parsing code in the program itself. Other parts of Evergreen/OpenSRF use XML, but YAML or JSON are more compact alternatives.
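
To make the idea concrete, here is a small sketch of a JSON-based configuration (Python for brevity; the importer itself is Perl). Apart from incomingmarcfolder and successemaillist, every key, value, and path below is hypothetical, including the shape of the field edits:

```python
import json

# Hypothetical example configuration; keys other than incomingmarcfolder
# and successemaillist are invented for illustration.
EXAMPLE = """
{
  "incomingmarcfolder": "/openils/var/data/incoming_marc",
  "recursive": false,
  "successemaillist": ["cataloger@example.org"],
  "field_edits": [
    {"tag": "856", "action": "append_subfield", "code": "9", "value": "BR1"}
  ]
}
"""

REQUIRED = ("incomingmarcfolder",)


def load_config(text: str) -> dict:
    """Parse the configuration and check that required keys are present."""
    config = json.loads(text)
    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError(f"missing required options: {', '.join(missing)}")
    return config
```

A structured format gives real Booleans and lists for free, and an array of edits would not need a numbered suffix on each option name.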

Blake GH (bmagic) wrote :

Committed a change to the matching logic, introducing 250a, 245n, and 245p match points to prevent bibs that differ on these data points from being matched and merged.

Jason - This is great feedback! Sorry for the delayed response. Thank you for reviewing the code. I agree with everything. I'll get back to this as soon as I can!
