Extracting data from an XML file

William Wueppelmann william.wueppelmann at nlc-bnc.ca
Tue Jan 6 09:56:09 EST 2004


Eric Lease Morgan wrote:

> Since my original implementation is still the fastest, and the newer
> implementations do not improve the speed of the application, then I must
> assume that the process is slow because of the XSLT transformations
> themselves. These transformations are straight-forward:
> 
>   # transform the document and save it
>   my $doc       = $parser->parse_file($file);
>   my $results   = $stylesheet->transform($doc);
>   my $html_file = "$HTML_DIR/$id.html";
>   open OUT, "> $html_file";
>   print OUT $stylesheet->output_string($results);
>   close OUT;
>   
>   # convert the HTML to plain text and save it
>   my $html      = parse_htmlfile($html_file);
>   my $text_file = "$TEXT_DIR/$id.txt";
>   open OUT, "> $text_file";
>   print OUT $formatter->format($html);
>   close OUT;

Can you save some time by not re-parsing the HTML file? I haven't used 
the parse HTML feature of LibXML, but doesn't it produce the exact same 
kind of XML Document object? If so, you already have a copy in $results 
in the first part of the code, so you shouldn't need to go back and 
re-parse the file you just created, since $html should be identical, or 
at least functionally identical, to $results.

I don't know whether or not you are already doing this, but you might be 
able to save a lot of time if you don't re-parse documents and 
stylesheets, re-instantiate XML parsers, and so forth. Ideally, you 
would call XML::LibXML->new and XML::LibXSLT->new once at the beginning 
of the script, immediately followed by creating a $stylesheet that 
contains the parsed stylesheet that you can then apply to each document 
in the batch. You can then parse each source XML document once and 
perform all of your operations on it in one go. Your script could then 
look something vaguely like:

use XML::LibXML;
use XML::LibXSLT;

my $xml_parser  = XML::LibXML->new;
my $xslt_parser = XML::LibXSLT->new;

my $xslt_doc   = $xml_parser->parse_file ('stylesheet.xsl');
my $stylesheet = $xslt_parser->parse_stylesheet ($xslt_doc);

foreach my $file (@files_to_process) {
	# Parse the document
	my $original_doc = $xml_parser->parse_file ($file);

	# Transform to HTML
	my $html_doc     = $stylesheet->transform ($original_doc);
	my $html_file    = my_filenaming_algorithm ($file);
	# output_file applies the stylesheet's xsl:output settings
	$stylesheet->output_file ($html_doc, $html_file);

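	# Convert the newly-transformed HTML (or XHTML) to plain text.
	# my_text_filenaming_algorithm is a placeholder, like my_filenaming_algorithm;
	# this assumes $formatter can take the document object directly.
	my $text_file = my_text_filenaming_algorithm ($file);
	open my $text_out, '>', $text_file or die "Cannot write $text_file: $!";
	print $text_out $formatter->format ($html_doc);
	close $text_out;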

	# Grab selected information from the TEI header of the source document
	my ($header) = $original_doc->findnodes('//teiHeader');
	my $author   = $header->findvalue('fileDesc/titleStmt/author');
	my $title    = $header->findvalue('fileDesc/titleStmt/title');
	my $id       = $header->findvalue('fileDesc/publicationStmt/idno');
	do_something_with_my_data ($author, $title, $id);
}

That way, you only instantiate a parser once, you only parse the 
XML->HTML stylesheet once, and you only parse each XML document once. I 
don't know how much of this you are doing already, but eliminating 
unnecessary parsing could speed things up a fair bit.
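
Before restructuring anything, it may be worth checking where the time actually goes by timing each stage separately. The following is only a rough sketch using the core Time::HiRes module. The timed helper and the stage labels are invented for illustration, and the anonymous subs are stand-ins for the real parse_file, transform and format calls:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my %elapsed;    # stage label => accumulated wall-clock seconds

# Run a code block, add its elapsed time to %elapsed, return its result.
sub timed {
	my ($label, $code) = @_;
	my $start  = time;
	my @result = $code->();
	$elapsed{$label} += time - $start;
	return wantarray ? @result : $result[0];
}

# Stand-ins for the real stages; replace the sub bodies with the
# actual parse_file, transform and format calls.
for my $file (qw(a.xml b.xml)) {
	my $doc  = timed('parse',     sub { "parsed $file" });
	my $html = timed('transform', sub { "html from $doc" });
	my $text = timed('format',    sub { "text from $html" });
}

# Report the accumulated time per stage.
printf "%-10s %.6f s\n", $_, $elapsed{$_} for sort keys %elapsed;
```

If one stage accounts for most of the total, that is the one to look at first.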

I think the speed of the XSLT process depends a lot on how complex the 
stylesheet is. I have a script that parses XML documents and creates 
secondary XML documents which contain a small subset of the original 
data (with some fields amalgamated and otherwise massaged) and it takes 
maybe 20-40 minutes to batch about 4 or 5 gigabytes of data in about 
8500 files. My original documents are quite large and numerous, but the 
derived documents are only about 1 KB or so, and the structure of the 
original is reasonably simple. The stylesheet itself is only about 100 
lines, though the stylesheet rules do seem to include a lot of 'or' 
clauses in them. I don't know how complex your input files and 
transformations are compared to mine, or how fast your computer is, but 
96 seconds to process 1.5 MB does seem a little slow compared to what I 
am getting.
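
To put rough numbers on that comparison, the figures quoted above can be turned into throughput rates. This is only back-of-the-envelope arithmetic: it takes 1 GB as 1024 MB, the variable names are made up, and the two runs use different hardware, stylesheets and documents, so the ratio is indicative at best:

```perl
use strict;
use warnings;

# Figures quoted in the thread; 1 GB is taken as 1024 MB here.
my $slow_mb_per_s = 1.5 / 96;                 # 1.5 MB in 96 seconds
my $low_mb_per_s  = (4 * 1024) / (40 * 60);   # 4 GB in 40 minutes
my $high_mb_per_s = (5 * 1024) / (20 * 60);   # 5 GB in 20 minutes

printf "reported run:   %.4f MB/s\n", $slow_mb_per_s;
printf "comparison run: %.2f to %.2f MB/s\n", $low_mb_per_s, $high_mb_per_s;
printf "ratio:          roughly %.0fx to %.0fx\n",
	$low_mb_per_s / $slow_mb_per_s, $high_mb_per_s / $slow_mb_per_s;
```

On these figures the gap is about two orders of magnitude, which is more than a difference in machine speed alone would usually explain.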

I hope some of that helps.

-- 
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org


