You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by syedfa <fa...@gmail.com> on 2008/04/21 00:24:07 UTC

Searching through a single XML document

Dear Fellow Java/Lucene developers:

I am trying to write a search engine application for shakespeare's "Hamlet"
which I have in xml format.  For demonstration purposes, I am using an xml
file of some quotes from the play only, which is listed below:

<PLAY>
<TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
<SPEECH>
<SPEAKER>LORD POLONIUS</SPEAKER>
<LINES>Yet here, Laertes! aboard, aboard, for shame!
The wind sits in the shoulder of your sail,
And you are stay'd for. There; my blessing with thee!
And these few precepts in thy memory
See thou character. Give thy thoughts no tongue,
Nor any unproportioned thought his act.
Be thou familiar, but by no means vulgar.
Those friends thou hast, and their adoption tried,
Grapple them to thy soul with hoops of steel;
But do not dull thy palm with entertainment
Of each new-hatch'd, unfledged comrade. Beware
Of entrance to a quarrel, but being in,
Bear't that the opposed may beware of thee.
Give every man thy ear, but few thy voice;
Take each man's censure, but reserve thy judgment.
Costly thy habit as thy purse can buy,
But not express'd in fancy; rich, not gaudy;
For the apparel oft proclaims the man,
And they in France of the best rank and station
Are of a most select and generous chief in that.
Neither a borrower nor a lender be;
For loan oft loses both itself and friend,
And borrowing dulls the edge of husbandry.
This above all: to thine ownself be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.
Farewell: my blessing season this in thee!</LINES>
</SPEECH>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINES>To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.</LINES>
</SPEECH>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINES>To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.</LINES>
</SPEECH>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINES>To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.</LINES>
</SPEECH>
</PLAY>

I have created an application to create an index from the above xml file
which I am listing here:

import java.io.InputStream;
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import java.util.HashMap;
 
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
 
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;
 
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;
 
public class HamletHandler extends DefaultHandler implements DocumentHandler
{
 
    
	private StringBuffer elementBuffer=new StringBuffer();
	private HashMap attributeMap;
	private Document doc;
	
	
	
	public Document getDocument(InputStream is) throws DocumentHandlerException
{
		// TODO Auto-generated method stub
		SAXParserFactory spf=SAXParserFactory.newInstance();
		
		try{
			SAXParser parser=spf.newSAXParser();
			parser.parse(is, this);
		}
		catch(IOException e){
			throw new DocumentHandlerException("Cannot parse XML document", e);
		}
		
		catch(ParserConfigurationException e){
			throw new DocumentHandlerException("Cannot parse XML document", e);
		}
		
		catch(SAXException e){
			throw new DocumentHandlerException("Cannot parse XML document", e);
		}
		
		return doc;
	}
 
	public void startDocument(){
		doc=new Document();
	}
	
	public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException{
		
		elementBuffer.setLength(0);
		if(atts.getLength()>0){
			attributeMap=new HashMap();
			for(int i=0; i<atts.getLength(); i++){
				attributeMap.put(atts.getQName(i), atts.getValue(i));
			}
		}
	}
	public void characters(char[] text, int start, int length){
		elementBuffer.append(text, start, length);
		
	}
	
	public void endElement(String uri, String localName, String qName) throws
SAXException{
		
		if(qName.equals("SPEAKER")){
			Field speaker = new Field(qName, elementBuffer.toString(),
Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
	    	speaker.setBoost(2.0f);
			doc.add(speaker);
		}
		else if(qName.equals("LINES")){
			Field lines = new Field(qName, elementBuffer.toString(), Field.Store.YES,
Field.Index.TOKENIZED, Field.TermVector.YES);
	    	lines.setBoost(1.0f);
			doc.add(lines);
		}
		
		else{
			return;
		}
	}
	/**
	 * @param args
	 */
	public static void main(String[] args) throws Exception{
		File index=new File("c:\\index");
		Directory fsDirectory = FSDirectory.getDirectory(index);
    	Analyzer  analyzer    = new StandardAnalyzer();
    	IndexWriter indexWriter = new IndexWriter(fsDirectory, analyzer, true);
		HamletHandler handler=new HamletHandler();
		Document doc=handler.getDocument(new FileInputStream(new File(args[0])));
		System.out.println("test" + doc);
		indexWriter.addDocument(doc);
		int numIndexed=indexWriter.docCount();
		System.out.println(numIndexed);
		indexWriter.optimize();
    	indexWriter.close();
 
	}
 
}

The elements that I am interested in searching against are <SPEAKER> and
<LINES>.  If I do a search for HAMLET, I expect to get the number of
occurances of the element in the xml file (in this case 3).  However,
instead of 3, I get a value of 1.  My other concern is that I am unable to
search through the text contained within any of the <LINES> elements.  So if
I search for the word "the", I am unable to get a single hit.  Why is that? 
My code that I am using to search is listed here:

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;
import java.io.IOException;
 
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer ;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query ;
import org.apache.lucene.search.Hits;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
/**
 *
 * 
 */
public class Searcher {
    
    /** Creates a new instance of Searcher */
    
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception{ 
        
        File indexDir=new File("c:\\index");
        String q="HAMLET";
        
        if(!indexDir.exists() || !indexDir.isDirectory()){
            throw new Exception(indexDir + "does not exist of is not a
directory."); 
        }
        
        search(indexDir, q);
        
        // TODO code application logic here
    }
    
    public static void search(File indexDir, String q) throws Exception {
         
        Directory fsDir=FSDirectory.getDirectory(indexDir);
        IndexSearcher is=new IndexSearcher(fsDir);
        
        Query parser=new QueryParser("SPEAKER", new
StandardAnalyzer()).parse(q); 
        long start=new Date().getTime();
        Hits hits=is.search(parser);
        long end=new Date().getTime();
        
        System.err.println("Found " + hits.length() + " document(s)(in" +
(end-start) + " milliseconds) that matched query '" + q + "':"); 
        
        for(int i=0; i<hits.length(); i++){
            Document doc=hits.doc(i);
            System.out.println(doc.get("filename"));
        }
        
    }
    
}

What am I doing wrong?  Any help would be greatly appreciated.  

Thanks to all who reply.
Sincerely;
Fayyaz
-- 
View this message in context: http://www.nabble.com/Searching-through-a-single-XML-document-tp16799601p16799601.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Searching through a single XML document

Posted by Ulf Dittmer <ud...@yahoo.com>.
The search can't return more than one document,
because only a single document is ever added to the
index. You might want to think about structuring the
index differently, e.g. by creating one Document for
each SPEECH element.

The search for "the" in particular won't find
anything, because that's a stopword that will be
removed if you use a StandardAnalyzer. (See the
comments about StopFilter in the StandardAnalyzer
javadocs).

You'll probably want to set the size of elementBuffer
to 0 after you created a Field that contains the
current text. Otherwise you'll be adding the same text
over and over again.

Also, think about using Field.Store.NO instead of
Field.Store.YES for the LINES. Unless you need to
retrieve the full text of each speech from the index,
this just makes the index very large for no good
reason.

Ulf



      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org