[XML4LIB] Extracting data from an XML file

Chuck Bearden cbearden at hal-pc.org
Mon Jan 5 22:56:10 EST 2004


On Mon, Jan 05, 2004 at 12:56:21PM -0800, Eric Lease Morgan wrote:
> 
> Can you suggest a fast, efficient way to use Perl to extract selected data
> from an XML file?

I'm going to second Ed Summers's and Tony Lavender's recommendation of
SAX.  Because SAX processes XML as a stream, biting off only as much as
it can chew at one time, it plows through XML relatively quickly.  I 
do my SAX programming in Python, but the concepts should apply to the 
Perl implementation of the SAX interface as well.

Basically, you instantiate a SAX parser and provide it with an instance
of a content handler class.  Then, you use the parser to parse the
document--iterate over a bunch of them if you like.
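
To make that concrete, here is a minimal sketch using Python's standard
xml.sax module.  The ElementCounter class, the count_elements function,
and the sample documents are made up for illustration; a real handler
would do something more useful than count elements.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler that just counts start-element events."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_elements(xml_bytes):
    # Instantiate a parser, hand it a content handler, then parse.
    handler = ElementCounter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))  # a filename would also work here
    return handler.count


# Iterate over a bunch of documents (in-memory stand-ins for files).
docs = [b"<a><b/><c/></a>", b"<a/>"]
counts = [count_elements(doc) for doc in docs]
```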

The hard work will be done by the content handler, a class for which you
define methods to handle start element events, end element events, and
character events (there are other events that you can handle, but I
doubt they are relevant to your task).  If handling one part of the
header does not depend on content that comes before or after it, and you
do not need to accumulate all the data for processing at one time, your
task is pretty simple:

  1. When the start handler encounters the start of a wanted element, 
     set a flag to tell the character event handler to accumulate 
     character data.  

  2. When the character handler is invoked while the "accumulate" 
     flag is set, make it append the data to the accumulator variable.  
     Otherwise, have it do nothing.

  3. When the end element handler is invoked while the "accumulate" 
     flag is set, have it unset the flag, normalize and output the 
     accumulated data (stdout, file, database, whatever), and reset 
     the accumulator variable.
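
The three steps above might look something like this in Python.  It is
only a sketch: the element names in WANTED and the sample header are
assumptions, and it assumes wanted elements are never nested inside one
another.  Results go into a list rather than to stdout or a database.

```python
import xml.sax

WANTED = {"title", "author"}  # hypothetical element names


class WantedTextHandler(xml.sax.ContentHandler):
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.accumulate = False   # the "accumulate" flag
        self.buffer = []          # the accumulator variable
        self.results = []

    def startElement(self, name, attrs):
        # Step 1: start of a wanted element sets the flag.
        if name in self.wanted:
            self.accumulate = True

    def characters(self, content):
        # Step 2: append character data only while the flag is set.
        if self.accumulate:
            self.buffer.append(content)

    def endElement(self, name):
        # Step 3: unset the flag, normalize and output, reset.
        if self.accumulate and name in self.wanted:
            text = " ".join("".join(self.buffer).split())
            self.results.append((name, text))
            self.accumulate = False
            self.buffer = []


SAMPLE = b"""<teiHeader><fileDesc><titleStmt>
<title>A  Sample
  Title</title><author>Jane Doe</author>
</titleStmt><publicationStmt><p>ignored</p></publicationStmt>
</fileDesc></teiHeader>"""

handler = WantedTextHandler(WANTED)
xml.sax.parseString(SAMPLE, handler)
```

The character handler may be called several times for one element, which
is why the data is collected in a list and joined at the end.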

If the body of your TEI doc is large and you don't want SAX to have to
wade through it when you just want the header, you can cheat a little 
bit by having the end element handler raise a special error when it
encounters the end-of-header tag, causing the parser to exit.  Catch 
the error outside of the parser, and go on to the next file.  This is 
simple in Python, which has built-in exception handling; in Perl it
should be doable by calling die in the handler and wrapping the parse
call in an eval block.
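
Here is a rough Python sketch of that early-exit trick.  The HeaderDone
exception, the handler, and the sample document are invented for the
example, and I am assuming the header element is named teiHeader.

```python
import xml.sax


class HeaderDone(Exception):
    """Hypothetical signal: the end of the header has been reached."""


class HeaderOnlyHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)

    def endElement(self, name):
        # Bail out as soon as the header closes.
        if name == "teiHeader":
            raise HeaderDone


def header_elements(xml_bytes):
    handler = HeaderOnlyHandler()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except HeaderDone:
        pass  # expected: we stopped on purpose, go on to the next file
    return handler.seen


SAMPLE = (b"<TEI><teiHeader><title>T</title></teiHeader>"
          b"<text><body><p>a very long body</p></body></text></TEI>")
seen = header_elements(SAMPLE)
```

Nothing from the body shows up in seen, because the parser never gets
that far.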

If you need to keep all the wanted data until the end of the header 
in order to process it all at once (perhaps loading via a single SQL 
statement, or perhaps the processing of one element depends on the
content of another), you can easily define a more complex accumulator
with a hash ref, keys being the element name or some other label, and
values being a scalar, a list, or even another hash ref.  Process and
output the whole accumulator when the end-of-header event is
encountered.
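
In Python the same idea would use a dict in place of a hash ref.  A
minimal sketch, with made-up element names and a list standing in for
the real output step (SQL or otherwise):

```python
import xml.sax
from collections import defaultdict


class HeaderCollector(xml.sax.ContentHandler):
    def __init__(self, wanted, header="teiHeader"):
        super().__init__()
        self.wanted = wanted          # hypothetical wanted element names
        self.header = header          # assumed name of the header element
        self.current = None           # wanted element being collected
        self.buffer = []
        self.record = defaultdict(list)  # element name -> list of values
        self.finished = []            # one dict per header processed

    def startElement(self, name, attrs):
        if name in self.wanted:
            self.current = name
            self.buffer = []

    def characters(self, content):
        if self.current is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.current:
            text = " ".join("".join(self.buffer).split())
            self.record[name].append(text)
            self.current = None
        elif name == self.header:
            # End of header: process the whole accumulator at once.
            self.finished.append(dict(self.record))
            self.record = defaultdict(list)


SAMPLE = (b"<TEI><teiHeader><title>T</title><author>A One</author>"
          b"<author>B Two</author></teiHeader><text/></TEI>")

collector = HeaderCollector({"title", "author"})
xml.sax.parseString(SAMPLE, collector)
```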

If your processing is contingent upon more complex contextual info (e.g.
a wanted element can be the child of two or more different parents, but
is wanted only when the child of one of those parents), you will need to
accumulate contextual info as well.  The use of boolean flags becomes
unwieldy rather quickly.  I've started to develop a SAX approach in 
Python that goes some way toward letting you select data by vaguely 
XPath-like criteria, though it is not even close to a complete XPath
implementation.  Let me know if you need something like this and I can
provide details.
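
To give a feel for the simplest version of tracking context, here is a
sketch that keeps a stack of open element names and collects text only
when the tail of the stack matches a wanted parent/child sequence.  This
is an illustration written for this message, not the approach mentioned
above, and the class name and element names are assumptions.

```python
import xml.sax


class PathHandler(xml.sax.ContentHandler):
    def __init__(self, wanted_path):
        super().__init__()
        self.wanted = tuple(wanted_path)  # e.g. ("titleStmt", "title")
        self.stack = []                   # names of currently open elements
        self.buffer = None                # None means "not accumulating"
        self.results = []

    def _at_wanted(self):
        return tuple(self.stack[-len(self.wanted):]) == self.wanted

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self._at_wanted():
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        # Check the context before popping the element that is closing.
        if self.buffer is not None and self._at_wanted():
            self.results.append(" ".join("".join(self.buffer).split()))
            self.buffer = None
        self.stack.pop()


SAMPLE = (b"<teiHeader><titleStmt><title>Main</title></titleStmt>"
          b"<seriesStmt><title>Series</title></seriesStmt></teiHeader>")

handler = PathHandler(("titleStmt", "title"))
xml.sax.parseString(SAMPLE, handler)
```

Only the title under titleStmt is collected; the one under seriesStmt is
skipped because its parent does not match.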

Good luck and have fun.  I hope you enjoyed the tenor of this little
missive.

Chuck Bearden


