Extracting data from an XML file

Eric Lease Morgan emorgan at nd.edu
Mon Jan 5 15:54:09 EST 2004

Can you suggest a fast, efficient way to use Perl to extract selected data
from an XML file?

I am in the process of re-writing my Alex Catalogue of Electronic Texts. In
this re-write I will be marking up items in the collection as TEI/XML files.
These files will then become my archival copies of the data much like the
TIFF files of image databases. I will repurpose the TEI files to create
plain text files, HTML files, Palm documents, PDF files, as well as provide
the means for full-text, fielded, and concordance indexing and searching.
Much of this work is already done for a small subset of data, and you can
see the work in progress here:


To create my HTML files with rich metadata, I need to extract bits and
pieces of information from the teiHeader of my originals. The snippet of
code below illustrates how I am currently doing this with XML::LibXML:

  # require the necessary module
  use XML::LibXML;
  # initialize
  my $parser = XML::LibXML->new;
  my $file   = '/foo/bar.xml';
  # do the work
  my $doc    = $parser->parse_file($file);
  my $root   = $doc->getDocumentElement;
  my @header = $root->findnodes('teiHeader');
  my $author = $header[0]->findvalue('fileDesc/titleStmt/author');
  my $title  = $header[0]->findvalue('fileDesc/titleStmt/title');
  my $id     = $header[0]->findvalue('fileDesc/publicationStmt/idno');
  # output the results
  print " author: $author\n title: $title\n id: $id\n\n";

The code works, but is really slow. Can you suggest a way to improve my code
or use some other technique for extracting things like author, title, and id
from my XML?
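
One variation that may be worth timing is sketched below. It is not a measured fix. It assumes, without evidence, that part of the cost comes from the parser fetching an external DTD for each file, so it turns that off with load_ext_dtd(0). Files that rely on entities declared in an external DTD may need different handling. The inline sample document, its values, and the %meta hash are hypothetical and exist only to make the sketch self-contained; a real run would call parse_file on a TEI file instead.

```perl
use strict;
use warnings;
use XML::LibXML;

# hypothetical sample standing in for a TEI file (values are made up)
my $xml = <<'XML';
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Walden</title>
        <author>Thoreau, Henry David</author>
      </titleStmt>
      <publicationStmt>
        <idno>sample-0001</idno>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Sample text.</p></body></text>
</TEI.2>
XML

# initialize; assumption: skipping the external DTD is acceptable for these files
my $parser = XML::LibXML->new;
$parser->load_ext_dtd(0);

# parse once and keep a single reference to the header element
my $doc      = $parser->parse_string($xml);
my ($header) = $doc->documentElement->findnodes('teiHeader');

# same relative XPath expressions as in the snippet above
my %meta = (
  author => $header->findvalue('fileDesc/titleStmt/author'),
  title  => $header->findvalue('fileDesc/titleStmt/title'),
  id     => $header->findvalue('fileDesc/publicationStmt/idno'),
);

# output the results
print " author: $meta{author}\n title: $meta{title}\n id: $meta{id}\n\n";
```

Wrapping the parse in Time::HiRes calls with and without the load_ext_dtd line would show whether DTD loading matters at all for a given collection.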

Eric Lease Morgan
University Libraries of Notre Dame

(574) 631-8604
