Extracting data from an XML file

Eric Lease Morgan emorgan at nd.edu
Mon Jan 5 22:27:39 EST 2004

I wrote:

> Can you suggest a fast, efficient way to use Perl to extract selected
> data from an XML file?...

First of all, thank you everyone who promptly replied to my query.

Second, I was not quite clear in my question. Many people said I should
write an XSLT style sheet to transform my XML document into HTML. This is in
fact what I already do. Besides transforming each of my documents, I also
need to create author and title indexes to my collection, and therefore I
need to extract bits of data from each of my original XML files.
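
For illustration only, extracting an author and a title for such indexes
might look roughly like the sketch below. It assumes XML::LibXML (the
parse_file call later in this message is consistent with that module, but
the message does not name it) and a made-up document layout: the element
names, the XPath expressions, the sample record, and the extract_metadata
helper are all hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document layout; the real element names are not given.
my $xml = <<'XML';
<document>
  <header>
    <author>Twain, Mark</author>
    <title>Adventures of Huckleberry Finn</title>
  </header>
  <body>...</body>
</document>
XML

# Return the author and title of one parsed document as a key/value list.
sub extract_metadata {
    my ($doc) = @_;
    return (
        author => $doc->findvalue('/document/header/author'),
        title  => $doc->findvalue('/document/header/title'),
    );
}

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_string($xml);  # parse_file($file) for files on disk
my %meta   = extract_metadata($doc);

# Accumulate the two indexes: titles by author, and authors by title.
my (%by_author, %by_title);
push @{ $by_author{ $meta{author} } }, $meta{title};
push @{ $by_title{ $meta{title} } },   $meta{author};

print "$meta{author}: $meta{title}\n";
```

Because the same parsed $doc can be handed to the style sheet afterwards,
each file only needs to be parsed once for both the indexes and the HTML.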

Third, most of the replies fell into two categories: 1) use an XSLT style
sheet as a sort of "subroutine", and 2) use XML::Twig.

Fourth, I tried both of these approaches plus my own, and timed them. I had
to process 1.5 MB of data in nineteen files. Tiny. Ironically, my original
code was the fastest at 96 seconds. The XSLT implementation came in second
at 101 seconds, and the XML::Twig implementation, while straightforward,
came in last at 141 seconds. (See the attached code snippets.)
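
A comparison of this kind can be timed with the core Time::HiRes module.
The sketch below is not the harness used for the numbers above: time_it is
a hypothetical helper, and the three subroutines are placeholder workloads
standing in for the real implementations.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run a code reference once and return the elapsed wall-clock seconds.
sub time_it {
    my ($code) = @_;
    my $start = time;
    $code->();
    return time - $start;
}

# Placeholder workloads; replace each body with a call to the real code.
my %approaches = (
    original => sub { my $n = 0; $n += $_ for 1 .. 100_000; return $n },
    xslt     => sub { my $n = 0; $n += $_ for 1 .. 200_000; return $n },
    twig     => sub { my $n = 0; $n += $_ for 1 .. 300_000; return $n },
);

# Report one line per approach, in alphabetical order.
for my $name (sort keys %approaches) {
    printf "%-10s %.3f seconds\n", $name, time_it($approaches{$name});
}
```

Wall-clock time is the figure that matters for a batch job like this one,
which is why the sketch measures elapsed time and not CPU time.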

Since my original implementation is still the fastest, and the newer
implementations do not improve the speed of the application, I must
assume that the process is slow because of the XSLT transformations
themselves. These transformations are straightforward:

  # transform the document and save it as HTML
  my $doc       = $parser->parse_file($file);
  my $results   = $stylesheet->transform($doc);
  my $html_file = "$HTML_DIR/$id.html";
  open my $html_out, '>', $html_file or die "Can't write $html_file: $!";
  print {$html_out} $stylesheet->output_string($results);
  close $html_out or die "Can't close $html_file: $!";

  # convert the HTML to plain text and save it
  my $html      = parse_htmlfile($html_file);
  my $text_file = "$TEXT_DIR/$id.txt";
  open my $text_out, '>', $text_file or die "Can't write $text_file: $!";
  print {$text_out} $formatter->format($html);
  close $text_out or die "Can't close $text_file: $!";

When my collection grows big I will have to figure out a better way to batch
transform my documents. I might even have to break down and write a shell
script to call xsltproc directly. (Blasphemy!)
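
Short of a shell script, the same idea can be sketched in Perl by handing
each file to xsltproc with system(). This is only an outline under stated
assumptions: the style sheet name, the output directory, the
xsltproc_command helper, and deriving the id from the file name are all
hypothetical, and xsltproc must be installed and on the PATH.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical locations; adjust to the real collection.
my $STYLESHEET = 'stylesheet.xsl';
my $HTML_DIR   = 'html';

# Build the xsltproc argument list for one input file. The output name is
# the input's base name with .xml replaced by .html (an assumption).
sub xsltproc_command {
    my ($stylesheet, $html_dir, $file) = @_;
    my ($id) = fileparse($file, qr/\.xml\z/i);
    return ('xsltproc', '-o', "$html_dir/$id.html", $stylesheet, $file);
}

# Transform every file named on the command line, one process per file.
for my $file (@ARGV) {
    my @cmd = xsltproc_command($STYLESHEET, $HTML_DIR, $file);
    system(@cmd) == 0
        or warn "xsltproc failed for $file (exit status $?)\n";
}
```

Passing the command to system() as a list avoids the shell entirely, so
file names containing spaces or other special characters need no quoting.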

Eric Lease Morgan
University Libraries of Notre Dame
