[XML4LIB] Extracting data from an XML file

Tony Lavender tlavender at promediaco.com
Mon Jan 5 19:54:43 EST 2004

Maybe it'd be faster to not use XPath, but stick to the DOM and drill down
using getElementsByTagName instead?
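
Something along these lines is what I have in mind. It is only a sketch: the
sample document and its values are made up, it parses a string rather than
your file so that it runs on its own, and I haven't timed it against the
XPath version.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample TEI fragment, made up for illustration
my $xml = <<'XML';
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Walden</title>
<author>Thoreau, Henry David</author>
</titleStmt>
<publicationStmt>
<idno>thoreau-walden-186</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
<text><body><p>...</p></body></text>
</TEI.2>
XML

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_string($xml);   # parse_file($file) for a real file
my $root   = $doc->getDocumentElement;

# drill down one element at a time, taking the first match at each step
my ($header)    = $root->getElementsByTagName('teiHeader');
my ($fileDesc)  = $header->getElementsByTagName('fileDesc');
my ($titleStmt) = $fileDesc->getElementsByTagName('titleStmt');
my ($pubStmt)   = $fileDesc->getElementsByTagName('publicationStmt');

# pull the text out of the leaf elements
my ($author_node) = $titleStmt->getElementsByTagName('author');
my ($title_node)  = $titleStmt->getElementsByTagName('title');
my ($id_node)     = $pubStmt->getElementsByTagName('idno');

my $author = $author_node->textContent;
my $title  = $title_node->textContent;
my $id     = $id_node->textContent;

print " author: $author\n title: $title\n id: $id\n\n";
```

Note that getElementsByTagName searches all descendants, not just direct
children, so it would be worth benchmarking before committing to it.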

Or, use SAX...
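
A rough sketch of the SAX route, assuming the XML::SAX distribution is
installed. HeaderHandler is a name I made up, and the sample document is
invented for illustration. The handler never builds a tree; it just buffers
the text of the three elements as the events go by.

```perl
use strict;
use warnings;

# HeaderHandler is a hypothetical name; it collects the text of the
# first author, title, and idno elements found inside teiHeader.
package HeaderHandler;
use base 'XML::SAX::Base';

my %wanted = map { $_ => 1 } qw(author title idno);

sub start_element {
    my ($self, $el) = @_;
    my $name = $el->{LocalName};
    $self->{in_header} = 1 if $name eq 'teiHeader';
    # begin buffering text when a wanted element opens inside the header
    if ($self->{in_header} && $wanted{$name} && !exists $self->{found}{$name}) {
        $self->{current} = $name;
        $self->{buffer}  = '';
    }
}

sub characters {
    my ($self, $chars) = @_;
    # append character data only while inside a wanted element
    $self->{buffer} .= $chars->{Data} if defined $self->{current};
}

sub end_element {
    my ($self, $el) = @_;
    my $name = $el->{LocalName};
    # store the buffered text when the wanted element closes
    if (defined $self->{current} && $self->{current} eq $name) {
        $self->{found}{$name} = $self->{buffer};
        $self->{current} = undef;
    }
    # ignore everything after the header has closed
    $self->{in_header} = 0 if $name eq 'teiHeader';
}

package main;
use XML::SAX::ParserFactory;

# Sample TEI fragment, made up for illustration
my $sax_xml = <<'XML';
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Walden</title>
<author>Thoreau, Henry David</author>
</titleStmt>
<publicationStmt>
<idno>thoreau-walden-186</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
<text><body><p>...</p></body></text>
</TEI.2>
XML

my $handler    = HeaderHandler->new;
my $sax_parser = XML::SAX::ParserFactory->parser(Handler => $handler);
$sax_parser->parse_string($sax_xml);   # parse_uri($file) for a real file

my $found = $handler->{found};
print " author: $found->{author}\n title: $found->{title}\n id: $found->{idno}\n\n";
```

This version still reads the whole file; it only skips the work of building
a DOM for it.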

Or (warning: desperate Perl hacker response follows) if you're fairly rigid
in how the XML is formatted (no arbitrary line breaks, etc) you could just
read the XML file line-by-line and use regex matching to grab the elements
you want, watching for "</titleStmt>" and "</publicationStmt>" to know when
to stop.
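
Roughly like this. It is a sketch only: extract_header_fields is a made-up
helper name, the sample data is invented, and it assumes every element you
want opens and closes on a single line and that titleStmt comes before
publicationStmt, so it only needs to watch for the latter.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a TEI file line by line and pull out author,
# title, and idno. Assumes each element opens and closes on one line.
sub extract_header_fields {
    my ($fh) = @_;
    my %found;
    while (my $line = <$fh>) {
        # keep the text of the first author, title, and idno elements seen
        if ($line =~ m{<(author|title|idno)\b[^>]*>(.*?)</\1>}) {
            $found{$1} = $2 unless exists $found{$1};
        }
        # stop reading once the publication statement has closed
        last if $line =~ m{</publicationStmt>};
    }
    return \%found;
}

# Sample data standing in for a real file (made up for illustration)
my $sample = <<'XML';
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Walden</title>
<author>Thoreau, Henry David</author>
</titleStmt>
<publicationStmt>
<idno>thoreau-walden-186</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
<text><body><p>...</p></body></text>
</TEI.2>
XML

# read from an in-memory string; use open my $fh, '<', $file for a real file
open my $fh, '<', \$sample or die "Cannot open sample: $!";
my $fields = extract_header_fields($fh);
close $fh;

print " author: $fields->{author}\n title: $fields->{title}\n id: $fields->{idno}\n\n";
```

The body of the text is never read, which is where most of the savings
would come from on a long file. It breaks as soon as an element wraps onto a
second line, so it is only safe if you control how the files are written.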

Tony Lavender
Promedia Inc

-----Original Message-----
From: xml4lib at sunsite.berkeley.edu
[mailto:xml4lib at sunsite.berkeley.edu]On Behalf Of Eric Lease Morgan
Sent: Monday, January 05, 2004 1:57 PM
To: Multiple recipients of list
Subject: [XML4LIB] Extracting data from an XML file

Can you suggest a fast, efficient way to use Perl to extract selected data
from an XML file?

I am in the process of re-writing my Alex Catalogue of Electronic Texts. In
this re-write I will be marking up items in the collection as TEI/XML files.
These files will then become my archival copies of the data much like the
TIFF files of image databases. I will repurpose the TEI files to create
plain text files, HTML files, Palm documents, PDF files, as well as provide
the means for full-text, fielded, and concordance indexing and searching.
Much of this work is already done for a small subset of data, and you can
see the work in progress here:


To create my HTML files with rich meta data, I need to extract bits and
pieces of information from the teiHeader of my originals. The snippet of
code below illustrates how I am currently doing this with XML::LibXML:

  # require the necessary modules
  use strict;
  use warnings;
  use XML::LibXML;

  # initialize
  my $parser = XML::LibXML->new;
  my $file   = '/foo/bar.xml';

  # do the work
  my $doc    = $parser->parse_file($file);
  my $root   = $doc->getDocumentElement;
  my @header = $root->findnodes('teiHeader');
  my $author = $header[0]->findvalue('fileDesc/titleStmt/author');
  my $title  = $header[0]->findvalue('fileDesc/titleStmt/title');
  my $id     = $header[0]->findvalue('fileDesc/publicationStmt/idno');

  # output the results
  print " author: $author\n title: $title\n id: $id\n\n";

The code works, but is really slow. Can you suggest a way to improve my code
or use some other technique for extracting things like author, title, and id
from my XML?

Eric Lease Morgan
University Libraries of Notre Dame

(574) 631-8604
