Posted to dev@lucene.apache.org by "Ogren, Philip V." <Og...@mayo.edu> on 2001/12/01 20:05:03 UTC

Indexing XML Demo

Well, I got one response so I went ahead with it.  This is a very simple
demo that shows how to use XSL Transforms and SAX parsing to index
XML documents using Lucene.  The code has been compiled and tested.  
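
For readers without the attachment, here is a minimal sketch of the SAX side.
It is not the code from the jar.  It assumes the <document>/<field> format
described in the quoted message below, and to stay runnable without Lucene on
the classpath it collects each field into a hypothetical FieldSpec holder.
FieldHandlerSketch and FieldSpec are invented names for illustration; a real
handler would presumably build an org.apache.lucene.document.Field at the
point where this one builds a FieldSpec.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class FieldHandlerSketch extends DefaultHandler {

    /** Hypothetical stand-in for a Lucene Field: one parsed <field> element. */
    static final class FieldSpec {
        final String name;
        final String value;
        final boolean store, index, token;

        FieldSpec(String name, String value, boolean store, boolean index, boolean token) {
            this.name = name;
            this.value = value;
            this.store = store;
            this.index = index;
            this.token = token;
        }
    }

    private final List<FieldSpec> fields = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String name;
    private boolean store, index, token, inField;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("field".equals(qName)) {
            inField = true;
            text.setLength(0);
            name = atts.getValue("name");
            // A missing attribute parses as false.
            store = Boolean.parseBoolean(atts.getValue("store"));
            index = Boolean.parseBoolean(atts.getValue("index"));
            token = Boolean.parseBoolean(atts.getValue("token"));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so accumulate.
        if (inField) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("field".equals(qName)) {
            // This is where a Lucene Field would be created and added to a Document.
            fields.add(new FieldSpec(name, text.toString().trim(), store, index, token));
            inField = false;
        }
    }

    /** Parses one <document> string and returns its fields in document order. */
    static List<FieldSpec> parse(String xml) throws Exception {
        FieldHandlerSketch handler = new FieldHandlerSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<document>"
                + "<field name=\"myfield1\" store=\"true\" index=\"true\" token=\"true\">a small field</field>"
                + "<field name=\"myfield2\" store=\"false\" index=\"true\" token=\"true\">a large field</field>"
                + "</document>";
        for (FieldSpec f : parse(xml)) {
            System.out.println(f.name + " store=" + f.store + " index=" + f.index
                    + " token=" + f.token + ": " + f.value);
        }
    }
}
```

The handler never looks at the original document format, only at the three
flags and the field name, which is why the Java side can stay fixed.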

This demo takes several seconds to run but that is due to the cost of
instantiating objects and opening and closing the index.  I have used this
technique to index about 1M documents and have found it reasonably fast
(~120 seconds/1000 documents).  I know that statistic was completely
meaningless, but heh...  I think the small amount of code that it took to
put this demo together is a powerful testimony to the SAX technology - I'm
pretty high on XSL right now :)  The main thing, as I mentioned below, is
that I do not need to change any Java code if I decide I need to change the
way I index my XML documents. 
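
To make the style sheet step concrete, here is a rough sketch using the
standard javax.xml.transform API.  It is not taken from the attached demo.
The <article> source format, its title and body elements, the style sheet
itself, and the class name XsltToFieldsSketch are all made up for
illustration; only the <document>/<field> target format comes from the
quoted message below.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltToFieldsSketch {

    // Hypothetical style sheet: maps an invented <article> format onto
    // the generic <document>/<field> format. Changing how a document is
    // indexed means editing this text, not the Java code.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/article'>"
        + "<document>"
        + "<field name='title' store='true' index='true' token='true'>"
        + "<xsl:value-of select='title'/></field>"
        + "<field name='body' store='false' index='true' token='true'>"
        + "<xsl:value-of select='body'/></field>"
        + "</document>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the given style sheet to the source XML and returns the result as a string. */
    static String transform(String sourceXml, String stylesheet) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(sourceXml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String article = "<article><title>Indexing XML</title>"
                + "<body>SAX and XSLT together.</body></article>";
        // Prints the <document> XML that a SAX handler would then consume.
        System.out.println(transform(article, STYLESHEET));
    }
}
```

In practice the style sheet would live in its own file, one per source
format, and the transform output could be fed straight to the SAX handler
instead of being buffered in a string.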

Let me know if you find this useful or if for some reason it doesn't work.

Directions: extract the attached jar file and open readme.txt

Regards,
Philip Ogren

-----Original Message-----
From: Yiyi Sun [mailto:yiyisun@yahoo.com]
Sent: Thursday, November 29, 2001 10:46 AM
To: Lucene Developers List
Subject: Re: parsing XML


Hi,

Thanks a lot. I would like to have your XML package and
demo.

Cheers!

Yiyi

--- "Ogren, Philip V." <Og...@mayo.edu> wrote:
> 
> I didn't pore through the archive to make sure no
> one had done this yet
> but...
> I have a generic way of indexing XML that I think is
> really useful.
> Basically, I implement the DefaultHandler (in SAX)
> that handles XML
> documents that look something like this:
> <document>
> 	<field name="myfield1" store="true" index="true" token="true">a small field</field>
> 	<field name="myfield2" store="false" index="true" token="true">a large field</field>
> </document>
> 
> I haven't actually written a DTD or schema because I
> haven't needed one
> yet.*  I create an org.apache.lucene.document.Field
> for each 'field' tag that
> is processed.  The way I get an XML document that
> conforms to this very
> simplistic schema is through XSLT.  You simply
> create a style sheet that
> transforms your specific XML document into XML that
> conforms to the above
> tags.  It's proven very useful on our project
> because changing the way an
> XML document is indexed requires no change in the
> code - I simply change my
> style sheet and reindex.  
> 
> I would be willing to cut a version of this code
> that would be suitable for
> a demonstration - along with a demo -  if there is
> any interest.  
> 
> Regards,
> Philip Ogren
> 
> *I originally had a 'datefield' tag as well but I
> found the DateField class
> to be useless for my application as it doesn't
> handle dates before 1970.
> 
> > Philip V. Ogren
> > Medical Information Resources
> > Mayo Clinic Rochester
> > (507) 538-0167
> > ogren@mayo.edu
> > 
> 
> --
> To unsubscribe, e-mail:  
> <ma...@jakarta.apache.org>
> For additional commands, e-mail:
> <ma...@jakarta.apache.org>
> 

