Posted to users@cocoon.apache.org by icewind <ic...@yahoo.com> on 2002/07/12 17:48:14 UTC

Lucene, CocoonIndexer

	I have created an index of some XML documents but I'm
not thrilled with the way the index is built. Text
appears to get indexed with the innermost XML tag it
is found in. For example, if I had a fragment like the
following:

<title>
   <person>Alice's</person> guide to the great novels
	    of the <date>1800's</date>
</title>

and I then used the following search term:
"title:Alice" or "title:1800", I would not get a
match. I would need to search for "person:Alice" or
"date:1800" respectively.
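
To make the behaviour described above concrete, here is a minimal
standalone sketch. It is not Cocoon's code: it assumes a plain JDK SAX
parser, and the class name InnermostOnlyHandler and its "fields" map are
hypothetical names chosen for illustration. It files each run of text
under the innermost open element only, which is how the index appears to
behave.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for an indexer that files text under the innermost element only. */
final class InnermostOnlyHandler extends DefaultHandler {
    // Names of the elements currently open, innermost on top.
    private final Deque<String> open = new ArrayDeque<>();
    // Field name -> text collected for that field.
    final Map<String, StringBuilder> fields = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the innermost element receives the text.
        if (!open.isEmpty()) {
            fields.computeIfAbsent(open.peek(), k -> new StringBuilder())
                  .append(ch, start, length);
        }
    }

    /** Parses the XML string and returns the collected fields. */
    static Map<String, StringBuilder> index(String xml) {
        try {
            InnermostOnlyHandler handler = new InnermostOnlyHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.fields;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }
}
```

With the fragment above as input, the "title" field of this sketch holds
only the text that sits directly inside <title>, so a search limited to
that field cannot see the words wrapped in <person> or <date>.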
	Since all the tags within the <title> tag contain
text that is clearly part of the title, I want a user
who is searching through the collection to be able to
do title specific searches that match any word within
the title tag, regardless of whether it has other XML
tags wrapped around it.
	Has anyone run into this issue? I'm not sure how to
go about implementing what I want. Is this something I
could do in Cocoon, or would I have to modify
something in the LuceneXMLIndexer component?

	Suggestions appreciated. I imagine someone has run
into this and has already come up with a workable
solution.



---------------------------------------------------------------------
Please check that your question  has not already been answered in the
FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>

To unsubscribe, e-mail:     <co...@xml.apache.org>
For additional commands, e-mail:   <co...@xml.apache.org>


RE: Lucene, CocoonIndexer

Posted by Martin Sévigny <se...@ajlsm.com>.
Hello all,

> -----Original Message-----
> From: icewind [mailto:icewind0@yahoo.com]
> Sent: Friday, 12 July 2002 17:48

> <title>
>    <person>Alice's</person> guide to the great novels
> 	    of the <date>1800's</date>
> </title>
> 

-cut-

> 	Has anyone run into this issue? I'm not sure how to
> go about implementing what I want. Is this something I
> could do in Cocoon, or would I have to modify
> something in the LuceneXMLIndexer component?

For specific needs like that, we implemented an XML search engine using
both Lucene and Cocoon. The difference is that indexing is completely
configurable. You'll find more information at http://sdx.culture.fr (in
French).

You define fields with some properties, and you populate them using an
XSLT transformation or a SAX filter applied to your XML document. Then
you get an XSP logicsheet for querying, paging, viewing results, etc.

Version 1.1 runs on Cocoon 1.8 and Lucene 1; version 2 (in development
but usable) runs on Cocoon 2 and Lucene 1.2. The license is GPL.

Have fun,

Martin Sévigny




RE: Lucene, CocoonIndexer

Posted by Vadim Gritsenko <va...@verizon.net>.
> From: icewind [mailto:icewind0@yahoo.com]
> 
> --- Vadim Gritsenko <va...@verizon.net>
> wrote:
> 
> Vadim,
> 
> I looked at the characters method. I want to be able
> to see what it does when it executes this, but I
> assume I just can't stick System.out.println() calls in
> there

You can. The messages will show up in the console window of the servlet
engine. I usually use Tomcat (under *nix the console output goes to the
catalina.out file) or Resin.

The other option is to run the servlet engine with debugging enabled and
connect to it from the IDE debugger.
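
As a rough illustration of the println approach, here is a standalone
sketch that can be run outside Cocoon. It assumes a plain JDK SAX
parser, and the class TracingHandler is a hypothetical name, not part of
Cocoon or Lucene. It prints every characters() call together with the
element that is open at that moment, which is the kind of output one
would look for in the servlet engine console.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that only traces SAX events to System.out. */
final class TracingHandler extends DefaultHandler {
    // Names of the elements currently open, innermost on top.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        open.push(qName);
        System.out.println("startElement <" + qName + ">, depth " + open.size());
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Show which element is innermost and exactly what text arrived.
        System.out.println("characters() in <" + open.peek() + ">: ["
                + new String(ch, start, length) + "]");
    }

    /** Parses the XML string and prints the trace. */
    static void trace(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new TracingHandler());
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }
}
```

Note that a SAX parser may deliver one run of text in several
characters() calls, so the trace can show more lines than there are text
nodes.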


> to see what the values of various variables are
> when it runs. Is there some way I could produce some
> debugging information so I could watch this method
> run? Is there a way to output to the cocoon log?

Usually you use getLogger().debug("message") to put some debug info into
the log, but LuceneIndexContentHandler is not log enabled.


Vadim


> 
> 
> >
> > Look into LuceneIndexContentHandler, characters()
> > method.
> >
> > Ok, I see that it appends text only to bodyText and
> > current tag...
> > Simple solution would be to add text to every field
> > in stack (in
> > characters(), for(;;) instead of if()), but better
> > solution is to have
> > not stack of StringBuffers (see this.elementStack),
> > but stack of indexes
> > in single string buffer (this.bodyText). This
> > solution will utilize
> > memory more efficiently.
> >
> >
> > Vadim
> >
> >
> > > 	Suggestions appreciated. I imagine someone has
> > > run
> > > into this and has already come up with a workable
> > > solution.
> >




RE: Lucene, CocoonIndexer

Posted by icewind <ic...@yahoo.com>.
--- Vadim Gritsenko <va...@verizon.net>
wrote:

Vadim, 

I looked at the characters method. I want to be able
to see what it does when it executes this, but I
assume I just can't stick System.out.println() calls in
there to see what the values of various variables are
when it runs. Is there some way I could produce some
debugging information so I could watch this method
run? Is there a way to output to the cocoon log?



> 
> Look into LuceneIndexContentHandler, characters()
> method.
> 
> Ok, I see that it appends text only to bodyText and
> current tag...
> Simple solution would be to add text to every field
> in stack (in
> characters(), for(;;) instead of if()), but better
> solution is to have
> not stack of StringBuffers (see this.elementStack),
> but stack of indexes
> in single string buffer (this.bodyText). This
> solution will utilize
> memory more efficiently.
> 
> 
> Vadim
>  
> 
> > 	Suggestions appreciated. I imagine someone has
> run
> > into this and has already come up with a workable
> > solution.




RE: Lucene, CocoonIndexer

Posted by Vadim Gritsenko <va...@verizon.net>.
> From: icewind [mailto:icewind0@yahoo.com]
> 
> 	I have created an index of some XML documents but I'm
> not thrilled with the way the index is built. Text
> appears to get indexed with the innermost XML tag it
> is found in. For example, if I had a fragment like the
> following:
> 
> <title>
>    <person>Alice's</person> guide to the great novels
> 	    of the <date>1800's</date>
> </title>
> 
> and I then used the following search term:
> "title:Alice" or "title:1800", I would not get a
> match. I would need to search for "person:Alice" or
> "date:1800" respectively.
> 	Since all the tags within the <title> tag contain
> text that is clearly part of the title, I want a user
> who is searching through the collection to be able to
> do title specific searches that match any word within
> the title tag, regardless of whether it has other XML
> tags wrapped around it.
> 	Has anyone run into this issue? I'm not sure how to
> go about implementing what I want. Is this something I
> could do in Cocoon, or would I have to modify
> something in the LuceneXMLIndexer component?

Look into LuceneIndexContentHandler, characters() method.

OK, I see that it appends text only to bodyText and to the current tag.
A simple solution would be to add the text to every field on the stack
(in characters(), a for(;;) loop instead of the if()). A better solution
is to keep not a stack of StringBuffers (see this.elementStack) but a
stack of indexes into a single string buffer (this.bodyText). That
solution uses memory more efficiently.
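
The second idea can be sketched roughly as follows. This is a standalone
illustration under stated assumptions, not a patch to
LuceneIndexContentHandler: it uses a plain JDK SAX parser, and the class
NestedFieldHandler, the startOffsets stack and the "fields" map are
hypothetical names. All text goes into one buffer; each open element
only remembers the offset at which it started, and on endElement() the
slice from that offset to the end of the buffer becomes the text of the
field. (The simpler variant would keep one buffer per open element and
append to all of them in characters().)

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: every element's field includes the text of its descendants. */
final class NestedFieldHandler extends DefaultHandler {
    // Single buffer holding all text seen so far.
    private final StringBuilder bodyText = new StringBuilder();
    // For each open element, the offset in bodyText where its text begins.
    private final Deque<Integer> startOffsets = new ArrayDeque<>();
    // Field name -> text collected for that field.
    final Map<String, StringBuilder> fields = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        startOffsets.push(bodyText.length());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        bodyText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        int from = startOffsets.pop();
        StringBuilder field = fields.computeIfAbsent(qName, k -> new StringBuilder());
        if (field.length() > 0) {
            field.append(' '); // separate repeated elements with the same name
        }
        // Everything appended since this element opened belongs to it.
        field.append(bodyText, from, bodyText.length());
    }

    /** Parses the XML string and returns the collected fields. */
    static Map<String, StringBuilder> index(String xml) {
        try {
            NestedFieldHandler handler = new NestedFieldHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.fields;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }
}
```

In this sketch the "title" field of the fragment from the question
contains both "Alice's" and "1800's", while "person" and "date" still
get their own fields. In a real indexer each entry of the map would
become one field of the Lucene document.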


Vadim
 

> 	Suggestions appreciated. I imagine someone has run
> into this and has already come up with a workable
> solution.

