Wednesday, January 20, 2010

Drupal Actions: extending biblio to extract full text

The Bibliography (biblio) module for Drupal provides a convenient way to harvest records from other repositories and catalogues. One project required searching across the full text of digitally stored books, which other catalogues do not always hold. The most direct approach was to grab a copy of the digital object (usually in PDF format), run a system-level tool to extract the text and update the Biblio record to contain it. As with most things Drupal, the trick was to find the right place to hook this functionality in, and in this case I used Actions.

Rather than modify the biblio module directly, I wanted to extend it separately if possible, to minimise the need to revisit the code every time biblio is updated. To this end I created the rather inelegantly named biblio_full_text module, starting with its .info file:

; $Id$
name = Extract Full Text
description = For a given PDF - either as URL or attachment - extract the text to the Biblio node's Full Text (Body) field.
package = Biblio
core = 6.x
dependencies[] = biblio

Drupal actions are triggered by events, such as a user logging out, content being modified, or cron being run. An action can be bound to one or more event types and to one or more operations. The biblio content type is a type of node, so the event needed was nodeapi. The full text had to be grabbed, extracted and saved with the node, which ruled out the insert and update operations as they fire after the node has been saved, leaving presave as the best candidate.

With this, I could now implement hook_action_info() to tie my new module's behaviour to the nodeapi+presave trigger (comments have been stripped for brevity):

function biblio_full_text_action_info() {
    $info['biblio_full_text_action_extract'] = array(
        'type' => 'node',
        'description' => t('Include URL/attachment contents as full text'),
        'configurable' => FALSE,
        'hooks' => array(
            'nodeapi' => array('presave'),
            ),
        );
    return $info;
}

The description will appear in the /admin/settings/actions and /admin/build/trigger/* admin screens. The key biblio_full_text_action_extract is the name of the function to be called when nodeapi+presave is triggered. This receives the object currently being operated on, which in this case will always be a node, and in this example simply checks that it has the right content type before passing it off to the worker code:

function biblio_full_text_action_extract($object) {
    if ( $object->type == 'biblio' ) {
        biblio_full_text_extract($object);
    }
}
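
The worker can change the node without returning anything because, in PHP 5 and later, an object argument is passed as a handle to the same object rather than as a deep copy. A minimal sketch of that behaviour, using hypothetical stand-in names (demo_action, demo_extract) and plain stdClass objects instead of real Drupal nodes:

```php
<?php
// Hypothetical stand-in for the worker: it only sets the body.
function demo_extract($node) {
    $node->body = 'extracted text';
}

// Hypothetical stand-in for the action callback, with the same type check.
function demo_action($object) {
    if ( $object->type == 'biblio' ) {
        demo_extract($object);
    }
}

// Two fake nodes: one biblio record and one ordinary page.
$biblio = new stdClass();
$biblio->type = 'biblio';
$biblio->body = '';

$page = new stdClass();
$page->type = 'page';
$page->body = '';

demo_action($biblio);
demo_action($page);
// Only the biblio object was modified; the page body is still empty.
```

This is only an illustration of the PHP object semantics the action relies on, not code from the module itself.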

One thing that I didn't initially get was that node objects are a wrapper around the full content type, not a mapping to the node table in the database. I assumed that I'd have to use the node details to pull the biblio details out. Thankfully Drupal is smarter than I gave it credit for; the $object passed to the worker function includes all of the data pertaining to the biblio type. With this, it's a simple case of using any URL provided to grab the PDF, performing the text extraction with the shell tool pdftotext and stashing it back on the biblio object before it continues through to update the database. (Some irrelevant helper functions are left out below but the meaning should be clear):

function biblio_full_text_extract(&$biblio) {
    $message = 'FAILURE: no message set';
    $url = $biblio->biblio_url;
    $pdffile = tempnam(TEMPPATH, 'pdf-in');
    $txtfile = tempnam(TEMPPATH, 'pdf-out');
    if ( $url ) {
        $result = download_pdf($url, $pdffile);
        if ( is_success($result) ) {
            $result = convert_pdf_to_text($pdffile, $txtfile);
            if ( is_success($result) ) {
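                $biblio->body = $result;
                // Replace the default so a successful run is not logged as a failure.
                $message = 'SUCCESS: full text extracted';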
            } else {
                $message = $result;
            }
        } else {
            $message = $result;
        }
    } else {
        $message = 'WARNING: no URL provided';
    }
    cleanup_temp_file($pdffile);
    cleanup_temp_file($txtfile);
    _log($message);
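}

function convert_pdf_to_text($pdffile, $txtfile) {
    // Escape both paths; 2>&1 captures pdftotext's error output as well.
    $command = 'pdftotext ' . escapeshellarg($pdffile) . ' ' . escapeshellarg($txtfile) . ' 2>&1';
    exec($command, $output, $status);
    if ( $status === 0 ) { // pdftotext exits with 0 on a successful run
        $message = file_get_contents($txtfile);
    } else {
        // Include whatever pdftotext printed in the failure message.
        $message = 'FAILURE: pdftotext error [' . implode(' ', $output) . ']';
    }
    return $message;
}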
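
The helpers left out of the listing above are not shown in the original post. Purely as an illustration, here is a minimal sketch of three of them in plain PHP. The function bodies and the 'FAILURE:'/'WARNING:' prefix convention are assumptions inferred from how the helpers are called, and this download_pdf relies on allow_url_fopen being enabled rather than on any Drupal API:

```php
<?php
// Hypothetical helpers: the bodies below are assumptions that merely fit
// the way these functions are called in the listing.

// Treat any string starting with 'FAILURE:' or 'WARNING:' as an error message.
function is_success($result) {
    return is_string($result)
        && strpos($result, 'FAILURE:') !== 0
        && strpos($result, 'WARNING:') !== 0;
}

// Fetch $url into $pdffile and return a status string in the same convention.
function download_pdf($url, $pdffile) {
    $data = @file_get_contents($url);
    if ( $data === FALSE ) {
        return 'FAILURE: could not download ' . $url;
    }
    if ( file_put_contents($pdffile, $data) === FALSE ) {
        return 'FAILURE: could not write ' . $pdffile;
    }
    return 'SUCCESS: downloaded ' . $url;
}

// Remove a temporary file if it still exists.
function cleanup_temp_file($file) {
    if ( $file && file_exists($file) ) {
        unlink($file);
    }
}
```

One weakness of a prefix convention like this is that extracted text which happens to begin with 'FAILURE:' would be misread as an error, so returning a separate status flag may be the safer design.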

With the code all in place and the module enabled, there are still a couple of administrative tasks that need to be done. First, the action 'Include URL/attachment contents as full text' has to be enabled on the Actions screen (/admin/settings/actions). Then it has to be assigned to the correct trigger on the node Triggers screen (/admin/build/trigger/node). Once done, the action is called every time a biblio node is edited, between saving on the form and updating the database.
