Extracting data from an XML file

Eric Lease Morgan emorgan at nd.edu
Mon Jan 5 15:54:09 EST 2004


Can you suggest a fast, efficient way to use Perl to extract selected data
from an XML file?

I am in the process of re-writing my Alex Catalogue of Electronic Texts. In
this re-write I will be marking up items in the collection as TEI/XML files.
These files will then become my archival copies of the data much like the
TIFF files of image databases. I will repurpose the TEI files to create
plain text files, HTML files, Palm documents, PDF files, as well as provide
the means for full-text, fielded, and concordance indexing and searching.
Much of this work is already done for a small subset of data, and you can
see the work in progress here:

  http://infomotions.com/alex2/

To create my HTML files with rich metadata, I need to extract bits and
pieces of information from the teiHeader of my originals. The snippet of
code below illustrates how I am currently doing this with XML::LibXML:

  # require the necessary module
  use XML::LibXML;
  
  # initialize
  my $parser = XML::LibXML->new;
  my $file   = '/foo/bar.xml';
  
  # do the work
  my $doc    = $parser->parse_file($file);
  my $root   = $doc->getDocumentElement;
  my @header = $root->findnodes('teiHeader');
  my $author = $header[0]->findvalue('fileDesc/titleStmt/author');
  my $title  = $header[0]->findvalue('fileDesc/titleStmt/title');
  my $id     = $header[0]->findvalue('fileDesc/publicationStmt/idno');
  
  # output the results
  print " author: $author\n title: $title\n id: $id\n\n";

The code works, but is really slow. Can you suggest a way to improve my code
or use some other technique for extracting things like author, title, and id
from my XML?
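
To help narrow down where the time goes, here is a minimal sketch that times the parse step and the XPath lookups separately. It rests on some assumptions: the inline TEI fragment and its values are made-up stand-ins for a real file, and the load_ext_dtd and no_network parser options are only a guess that external DTD loading might be part of the cost. Whether that applies to these files is untested.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use XML::LibXML;

# Hypothetical stand-in for one TEI file; element values are invented.
my $xml = <<'XML';
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample Title</title>
        <author>Sample Author</author>
      </titleStmt>
      <publicationStmt>
        <idno>sample-001</idno>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Body text.</p></body></text>
</TEI.2>
XML

# Assumption: skipping external DTDs and network access may save time.
my $parser = XML::LibXML->new(load_ext_dtd => 0, no_network => 1);

# Time the parse on its own.
my $t0  = time;
my $doc = $parser->parse_string($xml);
my $t1  = time;

# Time the header lookups on their own, relative to the teiHeader node.
my ($header) = $doc->documentElement->findnodes('teiHeader');
my %meta = (
    author => $header->findvalue('fileDesc/titleStmt/author'),
    title  => $header->findvalue('fileDesc/titleStmt/title'),
    id     => $header->findvalue('fileDesc/publicationStmt/idno'),
);
my $t2 = time;

# Report both timings, then the extracted values.
printf "parse: %.4fs  xpath: %.4fs\n", $t1 - $t0, $t2 - $t1;
print " author: $meta{author}\n title: $meta{title}\n id: $meta{id}\n\n";
```

With a real file, parse_string would be swapped for parse_file. If nearly all of the time lands in the parse step, the XPath expressions are probably not the bottleneck.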

-- 
Eric Lease Morgan
University Libraries of Notre Dame

(574) 631-8604



