[XML4LIB] Extracting data from an XML file

Tony Lavender tlavender at promediaco.com
Mon Jan 5 19:54:43 EST 2004


Maybe it'd be faster to not use XPath, but stick to the DOM and drill down
using getElementsByTagName instead?
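
Something along these lines, as a rough sketch rather than tested advice.
It assumes XML::LibXML (as in your snippet) and the same element names;
first_text and header_fields_dom are names I made up for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: text of the first descendant of $node named $tag,
# or an empty string when no such element exists.
sub first_text {
    my ($node, $tag) = @_;
    my ($hit) = $node->getElementsByTagName($tag);
    return defined $hit ? $hit->textContent : '';
}

# Hypothetical helper: pull author, title, and idno out of a parsed
# document by drilling down through the DOM, with no XPath involved.
sub header_fields_dom {
    my ($doc) = @_;
    my %field = (author => '', title => '', id => '');

    my ($header) = $doc->getElementsByTagName('teiHeader');
    return \%field unless defined $header;

    my ($title_stmt) = $header->getElementsByTagName('titleStmt');
    my ($pub_stmt)   = $header->getElementsByTagName('publicationStmt');

    if (defined $title_stmt) {
        $field{author} = first_text($title_stmt, 'author');
        $field{title}  = first_text($title_stmt, 'title');
    }
    $field{id} = first_text($pub_stmt, 'idno') if defined $pub_stmt;

    return \%field;
}

# Usage, with the file name from the original snippet:
#   my $doc    = XML::LibXML->new->parse_file('/foo/bar.xml');
#   my $fields = header_fields_dom($doc);
#   print " author: $fields->{author}\n title: $fields->{title}\n";
```

Note that this still parses the whole file into memory, so if the parse
itself is the slow part, it may not buy you much.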

Or, use SAX...
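
A rough sketch of what I mean, assuming the XML::SAX distribution and one
of its parser backends are installed. HeaderHandler, header_fields_sax, and
the HEADER_DONE marker are names I made up. The idea is that no tree gets
built, and the parse is abandoned as soon as the header closes by throwing
an exception from the handler:

```perl
use strict;
use warnings;

package HeaderHandler;    # hypothetical handler class
use base 'XML::SAX::Base';

my %WANTED = map { $_ => 1 } qw(author title idno);

sub start_element {
    my ($self, $el) = @_;
    my $name = $el->{LocalName};
    # Collect text only for the first occurrence of each wanted element.
    if ($WANTED{$name} && !exists $self->{found}{$name}) {
        $self->{current}      = $name;
        $self->{found}{$name} = '';
    }
}

sub characters {
    my ($self, $data) = @_;
    return unless defined $self->{current};
    $self->{found}{ $self->{current} } .= $data->{Data};
}

sub end_element {
    my ($self, $el) = @_;
    my $name = $el->{LocalName};
    if (defined $self->{current} && $self->{current} eq $name) {
        $self->{current} = undef;
    }
    # Abort the parse once the whole header has been seen.
    die "HEADER_DONE\n" if $name eq 'teiHeader';
}

package main;
use XML::SAX::ParserFactory;

# Hypothetical wrapper: returns a hash ref keyed by element name.
sub header_fields_sax {
    my ($xml) = @_;
    my $handler = HeaderHandler->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    eval { $parser->parse_string($xml); 1 } or do {
        # Re-throw anything that is not our own early-exit marker.
        die $@ unless $@ =~ /HEADER_DONE/;
    };
    return $handler->{found} || {};
}
```

For a file on disk, parse_uri($file) would take the place of parse_string.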

Or (warning: desperate Perl hacker response follows) if you're fairly rigid
in how the XML is formatted (no arbitrary line breaks, etc) you could just
read the XML file line-by-line and use regex matching to grab the elements
you want, watching for "</titleStmt>" and "</publicationStmt>" to know when
to stop.
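
For what it's worth, a sketch of that hack. It assumes every element you
want opens and closes on a single line and that nothing is tucked inside
comments or CDATA; extract_header_fields is a made-up name:

```perl
use strict;
use warnings;

# Hypothetical helper: scan an open filehandle line by line, keep the
# first author, title, and idno found, and stop reading once the
# publicationStmt has closed.
sub extract_header_fields {
    my ($fh) = @_;
    my %field;
    while (my $line = <$fh>) {
        # \b keeps <title> from matching <titleStmt>.
        while ($line =~ m{<(author|title|idno)\b[^>]*>(.*?)</\1>}g) {
            my ($name, $text) = ($1, $2);
            $field{$name} = $text unless exists $field{$name};
        }
        last if $line =~ m{</publicationStmt>};
    }
    return \%field;
}

# Usage, with the file name from the original snippet:
#   open my $fh, '<', '/foo/bar.xml' or die "Cannot open file: $!";
#   my $fields = extract_header_fields($fh);
#   close $fh;
```

It never touches the body of the text, which is where the speed comes
from, but it will fall over the moment the markup stops being that tidy.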

Tony Lavender
Promedia Inc

-----Original Message-----
From: xml4lib at sunsite.berkeley.edu
[mailto:xml4lib at sunsite.berkeley.edu]On Behalf Of Eric Lease Morgan
Sent: Monday, January 05, 2004 1:57 PM
To: Multiple recipients of list
Subject: [XML4LIB] Extracting data from an XML file



Can you suggest a fast, efficient way to use Perl to extract selected data
from an XML file?

I am in the process of re-writing my Alex Catalogue of Electronic Texts. In
this re-write I will be marking up items in the collection as TEI/XML files.
These files will then become my archival copies of the data much like the
TIFF files of image databases. I will repurpose the TEI files to create
plain text files, HTML files, Palm documents, PDF files, as well as provide
the means for full-text, fielded, and concordance indexing and searching.
Much of this work is already done for a small subset of data, and you can
see the work in progress here:

  http://infomotions.com/alex2/

To create my HTML files with rich metadata, I need to extract bits and
pieces of information from the teiHeader of my originals. The snippet of
code below illustrates how I am currently doing this with XML::LibXML:

  # require the necessary module
  use XML::LibXML;

  # initialize
  my $parser = XML::LibXML->new;
  my $file   = '/foo/bar.xml';

  # do the work
  my $doc    = $parser->parse_file($file);
  my $root   = $doc->getDocumentElement;
  my @header = $root->findnodes('teiHeader');
  my $author = $header[0]->findvalue('fileDesc/titleStmt/author');
  my $title  = $header[0]->findvalue('fileDesc/titleStmt/title');
  my $id     = $header[0]->findvalue('fileDesc/publicationStmt/idno');

  # output the results
  print " author: $author\n title: $title\n id: $id\n\n";

The code works, but is really slow. Can you suggest a way to improve my code
or use some other technique for extracting things like author, title, and id
from my XML?

--
Eric Lease Morgan
University Libraries of Notre Dame

(574) 631-8604



