Posted to java-user@lucene.apache.org by Malcolm Clark <ma...@btinternet.com> on 2005/10/25 19:21:08 UTC

Lucene and SAX

Hi again,
I am desperately asking for help!

I have used the sandbox demo to parse the INEX collection. The problem is
that it points to a volume file which references 50 other XML articles, and
Lucene treats the whole thing as one document. Is there a method I'm
overlooking that halts after each reference?
Could somebody please help? I won't post again until I have something
useful to submit.

The code is:
public class XMLDocumentHandlerSAX
    extends HandlerBase
{
    /** A buffer for each XML element */
    private StringBuffer elementBuffer = new StringBuffer();

    private Document mDocument;

    // constructor
    public XMLDocumentHandlerSAX(File xmlFile)
        throws ParserConfigurationException, SAXException, IOException
    {
        SAXParserFactory spf = SAXParserFactory.newInstance();

        SAXParser parser = spf.newSAXParser();
        parser.parse(xmlFile, this);
    }

    // call at document start
    public void startDocument()
    {
        mDocument = new Document();
        //mDocument = new Document();
        elementBuffer.setLength(0);
    }

    // call at element start
    public void startElement(String localName, AttributeList atts)
        throws SAXException
    {
        if (localName.equals("article")) {
            elementBuffer.setLength(0);
        }
    }

    // call when cdata found
    public void characters(char[] text, int start, int length)
    {
        elementBuffer.append(text, start, length);
    }

    // call at element end
    public void endElement(String localName)
        throws SAXException
    {
        if (localName.equals("article")) {
            System.out.println("Article: " + elementBuffer.length());
            elementBuffer.setLength(0);
        }

        mDocument.add(Field.Text(localName, elementBuffer.toString()));
        System.out.println("EB: " + elementBuffer);
        elementBuffer.setLength(0);
    }

    public Document getDocument()
    {
        return mDocument;
    }

    public static void main(String[] args)
        throws Exception
    {
        try
        {
            Date start = new Date();
            String indexDir = "C:\\LuceneDemo\\index";
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
            indexDocs(writer, new File("C:\\1995\\volume.xml"));

            writer.optimize();
            writer.close();

            Date end = new Date();
        }
        catch (Exception e)
        {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
            throw e;
        }
    }

    public static void indexDocs(IndexWriter writer, File file)
        throws Exception
    {
        if (file.isDirectory())
        {
            String[] files = file.list();
            for (int i = 0; i < files.length; i++)
                indexDocs(writer, new File(file, files[i]));
        }
        else
        {
            System.out.println("adding " + file);

            XMLDocumentHandlerSAX hdlr = new XMLDocumentHandlerSAX(file);
            StandardAnalyzer anal = new StandardAnalyzer();
            writer.addDocument(hdlr.getDocument(), anal);
            System.out.println("Documents added to Index: " + writer.docCount());
        }
    }
}
Thanks very much again.
MC 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene and SAX

Posted by Malcolm <ma...@btinternet.com>.

Hi Grant,
A heavily shortened version of the volume file looks like this:

<?xml version="1.0" ?>
<!DOCTYPE books PUBLIC "-//LBIN//DTD IEEE Mag//EN" "xmlarticle.dtd"
[
<!ENTITY A1003 SYSTEM "a1003.xml">
<!ENTITY A1004 SYSTEM "a1004.xml">
<!ENTITY A1006 SYSTEM "a1006.xml">
]>
<books>
<journal>
<title>IEEE Annals of the History of Computing</title>
<issue>Spring 1995 (Vol. 17, No. 1)</issue>
<publisher>Published by the IEEE Computer Society</publisher>
<!--<graphicc filename="cs_cpy.tif"></graphicc>-->
<sec1>
<title>About this Issue</title>
</sec1>
&A1003;
<sec1>
<title>Comments, Queries, and Debate</title>
</sec1>
&A1004;
<sec1>
<title>Articles</title>
</sec1>
&A1006;
</journal>
</books>

Re: Lucene and SAX

Posted by Grant Ingersoll <gs...@syr.edu>.

Sounds like you need to make your articles well-formed XML or stop trying
to use an XML parser to process the file, whichever is easier for you.  I
don't think your issues are Lucene related.  As I suggested on your Digester
thread before, I would separate the handling of the XML from Lucene until
you have a better handle on the XML processing.  I would also write myself
some unit tests using some very small Volume examples.  Can you write a
small app that simply reads in the Volume file, then loads an article and
prints it out?

-Grant
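
A Lucene-free test app along these lines could be a handler that opens a fresh record at each <article> start tag and closes it at the matching end tag. This is only a sketch: the class name ArticleSplitter and its split() helper are made up for illustration and are not from the original code. It assumes the default (non-namespace-aware) JAXP parser, so the element name arrives in qName, and that the parser expands the volume's entity references inline, which SAX parsers such as the one bundled with the JDK do by default.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the text of each <article> element as its own record (no Lucene involved). */
class ArticleSplitter extends DefaultHandler {
    private final List<String> articles = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an <article>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("article".equals(qName)) {
            current = new StringBuilder(); // a new article begins here
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length); // text outside articles is ignored
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("article".equals(qName)) {
            articles.add(current.toString().trim()); // one record per article
            current = null;
        }
    }

    /** Parses the given XML text and returns one string per <article>. */
    static List<String> split(String xml) {
        try {
            ArticleSplitter handler = new ArticleSplitter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.articles;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Feeding split() a tiny hand-written volume string makes for an easy unit test. The same two hooks (start and end of <article>) are where a new Lucene Document could later be created and handed to the IndexWriter.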


Re: Lucene and SAX

Posted by Malcolm <ma...@btinternet.com>.

I'm not in any way an expert, in fact far from it, but when I try to
reference each article separately it complains about entities, as the XML
articles are not well-formed.
Thanks,
MC 


Re: Lucene and SAX

Posted by Grant Ingersoll <gs...@syr.edu>.

From what I can see, you are only passing volume.xml to your parser.
If I understand your code and questions correctly, the Volume file 
simply points to the actual articles that you want to parse.  Seems like 
you need to parse the Volume file, get the name/location of the article 
file and then parse that for entry in Lucene.  Or am I mistaken?  What 
does the volume file look like?
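
If the volume file declares its articles as external entities, one way to get the name and location of each article is SAX's DeclHandler, which is called for every <!ENTITY name SYSTEM "..."> declaration. The following is a minimal sketch, assuming the parser supports the standard declaration-handler property (the JDK's bundled parser does); VolumeEntities and list() are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Lists the external entities (name -> system id) declared in a volume file. */
class VolumeEntities extends DefaultHandler2 {
    private final Map<String, String> entities = new LinkedHashMap<>();

    @Override
    public void externalEntityDecl(String name, String publicId, String systemId) {
        // Called once per <!ENTITY name SYSTEM "systemId"> declaration.
        entities.put(name, systemId);
    }

    /** Parses the volume and returns the declared entities in document order. */
    static Map<String, String> list(InputSource volume) {
        try {
            VolumeEntities handler = new VolumeEntities();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // DeclHandler is a SAX2 extension and is registered as a property.
            reader.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
            reader.parse(volume);
            return handler.entities;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse volume", e);
        }
    }
}
```

The parser may report the system id as an absolute URI rather than the literal "a1003.xml", so compare on the file name rather than the whole string. Note that the article files may still fail to parse on their own if they rely on entities defined in the volume's DTD.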

Re: Lucene and SAX

Posted by Malcolm <ma...@btinternet.com>.

It's XML like this. There are about 120 volumes with references to 12,107 articles, each of which looks like the one below:
<article>
<fno>A1003</fno>
<doi>10.1041/A1003s-1995</doi>
<fm><hdr><hdr1><ti>IEEE Annals of the History of Computing</ti>
<crt><issn>1058-6180</issn>/95/$4.00 <cci><onm>&copy; 1995 IEEE</onm></cci></crt></hdr1>
<hdr2><obi><volno>Vol. 17</volno>, <issno>No. 1</issno></obi>
<pdt><mo>Spring</mo><yr>1995</yr></pdt>
<pp>pp. 3-3</pp></hdr2></hdr>
<tig>
<atl>About this Issue</atl><pn>pp. 3-3</pn></tig>
<au sequence="first"><fnm>J.A.N.</fnm><snm>Lee</snm><role>Editor&hyphen;in&hyphen;Chief</role></au>
</fm>
<bdy>
<p>The first issue of our 17th volume is as diverse in topics as any nontheme issue that we have tried to present over the past many years. However, it still represents the work of the English&hyphen;speaking world of the North Atlantic rather than a broader picture of computing in the whole world. The Editorial Board and the article editors of the <it>Annals</it>
are doing their best to bring the history of the whole world of computing to our readers, but it does require authors in other countries to offer their manuscripts for our consideration. Please take this as an open invitation to authors in other parts of the world to submit papers to the <it>Annals</it>
for review and help us to follow the lead of our parent organization in being the &ldquo;The World&rsquo;s Computer Society.&rdquo;</p>
<p>The five major articles in this issue represent several manuscripts that have been in our files for some time, and we are grateful to the authors for having &ldquo;stuck with us&rdquo; while we reviewed, re&hyphen;reviewed, and reworked their papers. Articles in the field of history do not always present the work of the authors themselves (though we welcome pioneers to give us their own stories, as in the case of the 1935 article by John McPherson in this issue); thus, answering the question &ldquo;is it accurate?&rdquo; is not always easy. In fact, we ask our referees to answer the following questions about each manuscript, and their responses determine whether we accept the manuscript &ldquo;as is&rdquo; or whether we ask the author(s) to revise the material:</p>
<l2>
<li>
<p>Are the issues addressed in the paper stated clearly enough?</p></li>
<li>
</bdy></article>

Re: Lucene and SAX

Posted by Grant Ingersoll <gs...@syr.edu>.

I am not familiar with the INEX collection. Could you post a sample?


Re: Lucene and SAX

Posted by MALCOLM CLARK <ma...@btinternet.com>.

Grant,

Thanks for your tips. I have considered DOM processing, but it seemed to take a hell of a long time to process all the documents (12,125).

Re: Lucene and SAX

Posted by Karl Øie <ka...@gan.no>.

Hi there Malcolm!

I can't see any place in your source where you add the document id of
the document you are parsing. startDocument() should at least add a
sys-id field for the XML document being parsed:

public void startDocument() {
	mDocument = new Document();
	// getMySystemId() is a placeholder for however you obtain the id
	mDocument.add(new Field("sys-id", "" + getMySystemId(), false, true, false));
}

You should also consider this:

SAX parsing is event based; if you build an object graph internally you
might just as well use DOM parsing.

If you use a SAX parser, get and set up the XMLReader yourself. That way
you can add a ContentHandler, LexicalHandler and DTDHandler yourself for
even more precise indexing.

Happy Hacking!

Karl
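
The XMLReader setup mentioned above could look roughly like the sketch below. It registers one handler object as ContentHandler, DTDHandler and LexicalHandler, and uses the LexicalHandler's startEntity()/endEntity() callbacks to notice where each entity reference (such as &A1003;) begins and ends. The names EntityTracker and parse() are hypothetical, and the sketch assumes the parser supports the standard lexical-handler property and reports entity boundaries, as the JDK's bundled parser is expected to.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Groups character data by the general entity (e.g. A1003) it was expanded from. */
class EntityTracker extends DefaultHandler2 {
    private final Map<String, StringBuilder> textByEntity = new LinkedHashMap<>();
    private final Deque<String> openEntities = new ArrayDeque<>();

    @Override
    public void startEntity(String name) {
        // Ignore the DTD pseudo-entity "[dtd]" and parameter entities ("%name").
        if (!name.startsWith("[") && !name.startsWith("%")) {
            openEntities.push(name);
            textByEntity.putIfAbsent(name, new StringBuilder());
        }
    }

    @Override
    public void endEntity(String name) {
        if (name.equals(openEntities.peek())) {
            openEntities.pop();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String entity = openEntities.peek(); // null while outside any entity
        if (entity != null) {
            textByEntity.get(entity).append(ch, start, length);
        }
    }

    /** Parses the source and returns the trimmed text found inside each entity. */
    static Map<String, String> parse(InputSource source) {
        try {
            EntityTracker handler = new EntityTracker();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setDTDHandler(handler);
            // LexicalHandler is a SAX2 extension and is registered as a property.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(source);
            Map<String, String> result = new LinkedHashMap<>();
            handler.textByEntity.forEach((name, text) -> result.put(name, text.toString().trim()));
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

With this in place, endEntity() is a natural point to finish one Lucene Document and start the next, and the entity name can serve as the sys-id.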


