Posted to dev@jackrabbit.apache.org by Ard Schrijvers <a....@hippo.nl> on 2007/08/08 17:33:29 UTC

IndexingConfiguration jr 1.4 release, analyzing, searching and synonymprovider

Hello, 

and sorry for spamming, but I just want to share my findings/impressions, and what I am posting I am willing to implement and port to the JackRabbit trunk (so if you bother to read it, and are positive about it, I will implement it :-) )

(if you make it to the end of this mail, I also describe how simple it would become to add functionality like the SynonymProvider that was just created in the trunk....)

First of all, the IndexingConfiguration, very promising! Exactly what we need for better indexing, and, consequently better search results. Because, in the end, what good is a repository when customers can't find the results they are looking for? Storing, versioning, workflow, all very important, but no good when nobody can find their content (duhh, obviously).

So, one part that bothers me is multilinguality (with language-specific stopwords, stemming, synonyms). Many customers these days want multilingual sites, and want to search them accordingly. And, obviously, lucene has quite some code for exactly this: see contrib/analyzers/src/java.

Obviously, lucene has many more analyzers, and you can easily add your own. AFAIU, there is a single configuration place where I can define the overall JackRabbit analyzer that is used within one workspace: 

in repository.xml :

<param name="analyzer" value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>

but what I want is a per-property definable analyzer (I would give body_fr a French analyzer, body_de a German one, and some properties, like zipcodes, I might want indexed with keyword analyzers). The best place for this, IMO, is the IndexingConfiguration: then, if you do not configure it, nothing changes for you.
 
So, for example, the first index rule at http://wiki.apache.org/jackrabbit/IndexingConfiguration would change to:

<index-rule nodeType="nt:unstructured"
              boost="2.0">
    <property analyzer="org.apache.lucene.analysis.de.GermanAnalyzer">text_de</property>
</index-rule>

and during loading, we construct a Map of {jr-property,analyzer} (call it propertyAnalyzerMap). Then, all we need to add is one jackrabbit global analyzer, that looks like:

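import java.io.Reader;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

class JRAnalyzer extends Analyzer {
	// fallback for properties without a configured analyzer
	private final Analyzer defaultAnalyzer = new StandardAnalyzer();
	// filled from the IndexingConfiguration during loading
	private final Map propertyAnalyzerMap;

	JRAnalyzer(Map propertyAnalyzerMap) {
		this.propertyAnalyzerMap = propertyAnalyzerMap;
	}

	public TokenStream tokenStream(String fieldName, Reader reader) {
		Analyzer analyzer = (Analyzer) propertyAnalyzerMap.get(fieldName);
		if (analyzer != null) {
			return analyzer.tokenStream(fieldName, reader);
		}
		return defaultAnalyzer.tokenStream(fieldName, reader);
	}
}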

This very same JRAnalyzer is also used for the QueryParser in LuceneQueryBuilder, so this will also work for searching, IIUC. So, WDYT? I can implement it and send a patch, but if the community is reluctant, I will have to do it for myself in a way that does not intrude on the jr code.
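
To make the "during loading" step a bit more concrete, here is a minimal sketch of how the property-to-analyzer mapping could be read from the configuration. The class PropertyAnalyzerConfig and its load method are hypothetical and not part of Jackrabbit, the analyzer attribute is only the one proposed in this mail, and namespace handling of property names is ignored. It only collects class names; a real implementation would still have to instantiate them (for example via Class.forName) and put the instances into the propertyAnalyzerMap.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not Jackrabbit code): maps property name -> analyzer class name.
class PropertyAnalyzerConfig {

    static Map<String, String> load(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse indexing configuration", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            // getAttribute returns "" when the (proposed) analyzer attribute is absent
            String analyzer = prop.getAttribute("analyzer");
            if (!analyzer.isEmpty()) {
                result.put(prop.getTextContent().trim(), analyzer);
            }
        }
        return result;
    }
}
```

Properties without an analyzer attribute are simply left out of the map, so they fall through to the default analyzer and nothing changes for existing configurations.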

Example of the SynonymProvider mentioned at the top:

If my suggested changes are accepted, things like a SynonymProvider become superfluous, since they are very easy to add on the fly:

Suppose I always want full-text searching with Dutch synonyms on the "body" property of my nodes. This boils down to adding an analyzer for this property that extends the DutchAnalyzer in lucene and adds synonym functionality (there is a very simple example in the "Lucene in Action" book). I think it is better to do synonyms during analyzing (as opposed to the SynonymProvider in jr trunk), and simply use an analyzer for it. Of course, a difference in usage would be that with the current SynonymProvider you specifically have to define that you do a synonym search (~term), while with an analyzer, you define which properties should be indexed with a synonym analyzer, and they are searched accordingly (without having to specify it).
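
As a rough illustration of synonyms added during analysis, the sketch below uses a plain Java stand-in instead of a real Lucene TokenFilter. The class SynonymExpander, its analyze method and the sample synonym are all made up for this example. The idea it shows is that each token position carries the original term plus its synonyms, so a search for any of them hits the same position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Stand-in for an analyzer with a synonym filter; all names here are hypothetical.
class SynonymExpander {
    private final Map<String, List<String>> synonyms;

    SynonymExpander(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    // Returns one list per token position: the original term followed by its synonyms.
    List<List<String>> analyze(String text) {
        List<List<String>> positions = new ArrayList<>();
        // split on anything that is not a letter or digit (keeps non-ASCII letters intact)
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            List<String> atPosition = new ArrayList<>();
            atPosition.add(raw);
            atPosition.addAll(synonyms.getOrDefault(raw, Collections.emptyList()));
            positions.add(atPosition);
        }
        return positions;
    }
}
```

Because the synonyms end up in the index itself, the same expander has to be used at index time and the content must be re-indexed when the synonym list changes.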

So WDYT? Again, sorry for mailing so much, just trying to sell my ideas :-) 

 
-- 

Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-------------------------------------------------------------
a.schrijvers@hippo.nl / ard@apache.org / http://www.hippo.nl
-------------------------------------------------------------- 

Re: IndexingConfiguration jr 1.4 release, analyzing, searching and synonymprovider

Posted by Marcel Reutegger <ma...@gmx.net>.
Ard Schrijvers wrote:
> and sorry for spamming, but I just want to share my findings/impressions, and
> what I am posting I am willing to implement and port to the JackRabbit trunk
> (so if you bother to read it, and are positive about it, I will implement it
> :-) )

you don't have to feel sorry, your input is very welcome!

[...]

> So, one part that bothers me, is multilinguality (with lang specific
> stopwords, stemming, synonyms). Many customers these days want multilingual
> sites, and search them accordingly. And, obviously, lucene has quite some
> code for exactly this : see contrib/analyzers/src/java.
> 
> Obviously, lucene has many more analyzers, and you can easily add your own.
> AFAIU, there is a single configuration place where I can define the overall
> JackRabbit analyzer that is used within one workspace:
> 
> in repository.xml :
> 
> <param name="analyzer"
> value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> 
> but, what I want, is a per property definable analyzer (I would give body_fr
> a french analyzer, body_de a german, some properties i might want to be
> indexed with keyword analyzers, like zipcodes). The best place for this IMO,
> is the IndexingConfiguration: then, if you do not configure it, nothing
> changes for you.
> 
> So, for example the first index rule at
> http://wiki.apache.org/jackrabbit/IndexingConfiguration would change to:
> 
> <index-rule nodeType="nt:unstructured" boost="2.0">
>     <property analyzer="org.apache.lucene.analysis.de.GermanAnalyzer">text_de</property>
> </index-rule>
> 
> and during loading, we construct a Map of {jr-property,analyzer} (call it
> propertyAnalyzerMap). Then, all we need to add is one jackrabbit global
> analyzer, that looks like:
> 
> class JRAnalyzer extends Analyzer {
>     Analyzer defaultAnalyzer = new StandardAnalyzer();
>
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         Analyzer analyzer = (Analyzer)propertyAnalyzerMap.get(fieldName);
>         if(analyzer!=null){
>             return analyzer.tokenStream(fieldName, reader);
>         }else{
>             return this.defaultAnalyzer.tokenStream(fieldName, reader);
>         }
>     }
> }
> 
> This very same JRAnalyzer is also used for the QueryParser in
> LuceneQueryBuilder, so this will work also for searching IIUC. So, WDYT? I
> can implement it and send a patch, but if the community is reluctant to it, I
> will have to do it for myself in a non jr code intrusive way.

This would work quite well for jcr:contains functions that operate on a 
property. However I'm not sure what to do with this:

//*[jcr:contains(., 'hägar')]

the node scope does not indicate which analyzer to use for the query statement. 
Would we just run the statement through all analyzers and combine them in an OR 
query?
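
For illustration only, the "run it through all analyzers" option could look roughly like the sketch below. Each analyzer is modelled as a plain function from query text to one normalized term, which is a big simplification of a real Lucene Analyzer. NodeScopeQuery and orOverAnalyzers are hypothetical names, not existing Jackrabbit code.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: combine the output of every configured analyzer in one OR query.
class NodeScopeQuery {

    static String orOverAnalyzers(String text, List<Function<String, String>> analyzers) {
        // a set drops duplicates when several analyzers produce the same term
        Set<String> variants = new LinkedHashSet<>();
        for (Function<String, String> analyzer : analyzers) {
            variants.add(analyzer.apply(text));
        }
        return String.join(" OR ", variants);
    }
}
```

With many configured analyzers the OR query grows accordingly, which is one cost of this approach.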

> Example of the SynonymProvider mentioned at the top:
> 
> If my suggested changes are accepted, things like a SynonymProvider become
> superfluous, and very easy to add on the fly:
> 
> suppose, I want on the "body" property of my nodes always full searching with
> dutch synonyms. This boils down to adding an analyzer for this property, that
> extends the DutchAnalyzer in lucene, and that adds synonym functionality
> (very simple example in "lucene in action" book). I think it is better to do
> synonyms during analyzing (as opposed to the SynonymProvider in jr trunk),
> and simply use an analyzer for it. Of course, a difference in usage would
> be that with the current SynonymProvider you specifically have to define that
> you do a synonym search (~term), while with an analyzer, you define which
> properties should be indexed with a synonym analyzer, and searched
> accordingly (without having to specify it).

well, those are actually the reasons why I implemented it the other way. If you 
go the analyzer way to expand synonyms you have to re-index the complete content 
if you want to add a single synonym. I also wanted the user to decide if 
synonyms should be considered. Again this would not be possible if the analyzer 
automatically adds synonyms.

but fortunately, with jackrabbit both are possible ;) if one prefers to expand 
terms at index time, just use an appropriate analyzer and don't configure a 
SynonymProvider.
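
To illustrate the query-time alternative, here is a small sketch of expanding a term only when the user asks for it with the ~ prefix. QueryTimeSynonyms and expand are hypothetical names, and the map is just a stand-in for whatever a synonym provider would return. It is not the actual Jackrabbit implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of query-time synonym expansion.
class QueryTimeSynonyms {

    // Expands "~term" to (term OR syn1 OR ...); plain terms are returned untouched.
    static String expand(String queryTerm, Map<String, List<String>> synonyms) {
        if (!queryTerm.startsWith("~")) {
            return queryTerm;
        }
        String term = queryTerm.substring(1);
        List<String> alternatives = new ArrayList<>();
        alternatives.add(term);
        alternatives.addAll(synonyms.getOrDefault(term, Collections.emptyList()));
        return alternatives.size() == 1 ? term : "(" + String.join(" OR ", alternatives) + ")";
    }
}
```

Since the index is not touched, adding a synonym to the map takes effect on the next query without re-indexing, and the user decides per term whether synonyms are considered.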

> So WDYT? Again, sorry for mailing so much, just trying to sell my ideas :-)

again, your ideas are very welcome.

regards
  marcel