Posted to solr-user@lucene.apache.org by ksu wildcats <ks...@gmail.com> on 2012/08/20 19:19:35 UTC

Solr Custom Filter Factory - How to pass parameters?

We are using Solr and are in the process of adding a custom filter factory to
handle the processing of words/tokens to suit our needs.

Here is what our custom filter factory does:
1) Reads the tokens, does some analysis, and writes the result of the analysis
to a database.

We are using Embedded Solr with multi-core (a separate core for each index).

We have the custom filter factory configured in schema.xml.

The problem we are running into is that we are not able to pass parameters to
our custom filter factory.
We need to be able to pass some additional, index-specific information (it
would be different for each index) to our custom filter factory.

Can anyone please tell me if this is possible with Solr, or do we need to
switch back to using the Lucene APIs?

Thanks
-K



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Custom-Filter-Factory-How-to-pass-parameters-tp4002217.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Custom Filter Factory - How to pass parameters?

Posted by KnightRider <ks...@gmail.com>.
Can someone please point me to some samples on how to implement custom
SolrEventListeners?

What's the default behavior of Solr when no SolrEventListeners are configured
in solrconfig.xml?

I am trying to understand how a custom listener fits in with the default
listeners (if there are any).
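
For orientation, a hedged sketch of the listener mechanism follows. It does not use Solr classes: `CommitListener` is a hypothetical stand-in for the `postCommit()` callback of `org.apache.solr.core.SolrEventListener`, and `CommitNotifier` only imitates a core calling each registered listener after a commit. In Solr itself the registration is done with a `<listener event="postCommit" class="..."/>` entry in solrconfig.xml; with no such entries, a commit should simply complete without any callbacks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the postCommit() callback of SolrEventListener.
interface CommitListener {
    void postCommit();
}

// Imitates the core: runs every registered listener after each commit.
class CommitNotifier {
    private final List<CommitListener> listeners = new ArrayList<CommitListener>();

    void register(CommitListener listener) {
        listeners.add(listener);
    }

    void commit() {
        // ... the index commit itself would happen here ...
        for (CommitListener listener : listeners) {
            listener.postCommit();
        }
    }
}

// Example listener: counts commits; a real one could flush buffered data.
class CountingListener implements CommitListener {
    int commitsSeen = 0;

    public void postCommit() {
        commitsSeen++;
    }
}
```

With no listener registered, `commit()` in this sketch does nothing extra; each registered listener is called once per commit, in registration order.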

Thanks
-K'Rider




Re: Solr Custom Filter Factory - How to pass parameters?

Posted by ksu wildcats <ks...@gmail.com>.
Thanks Erick.
I tried to do it all in the filter, but the problem I am running into is
intercepting the final commit call. In other words, I am unable to figure out
when the final commit should happen so that I don't miss any data.
One option I tried is to increase the in-memory batch size and commit the
data from memory to the database in the "incrementToken" method, but this can
lose the in-memory data if the size of the last batch is less than the set
threshold.

I'll try using SolrEventListener and see if that can help resolve the issues
I am running into.




Re: Solr Custom Filter Factory - How to pass parameters?

Posted by Erick Erickson <er...@gmail.com>.
I'm reaching a bit here, haven't implemented one myself, but...

It seems like you're just dealing with some shared memory. So say
your filter recorded all the stuff you want to put into the DB. When
you put stuff into the shared memory, you probably have to figure
out when you should commit the batch (if you're indexing 100M docs,
you probably don't want to use up that much memory, but what do I know).
This is all done at the filter.

It seems like you could also create a SolrEventListener on
the postCommit event
(see: http://wiki.apache.org/solr/SolrPlugins#SolrEventListener)
to put whatever remained in your list into your DB.

Of course you'd have to do some synchronization so multiple threads
played nice with each other. And you'd have to be sure to fire a commit
at the end of your indexing process if you wanted some certainty that
everything was tidied up. If some delay isn't a problem and you have
autocommit configured, then your event listener would be called when
the next autocommit happened.
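
The batching idea above might look roughly like the following. This is only a sketch with no Solr dependency: `TokenBatchBuffer`, `BatchSink`, and the threshold are made-up names for illustration, and `flushRemaining()` is the method a post-commit hook would call to write whatever is left over.

```java
import java.util.HashMap;
import java.util.Map;

// Receives one batch of term counts; a real implementation would write to the DB.
interface BatchSink {
    void write(Map<String, Integer> batch);
}

// Shared by all filter instances of one core. Every method is synchronized
// so that concurrent indexing threads do not corrupt the map.
class TokenBatchBuffer {
    private final int threshold;
    private final BatchSink sink;
    private Map<String, Integer> counts = new HashMap<String, Integer>();

    TokenBatchBuffer(int threshold, BatchSink sink) {
        this.threshold = threshold;
        this.sink = sink;
    }

    // Called from the filter for each accepted term.
    synchronized void add(String term) {
        Integer old = counts.get(term);
        counts.put(term, old == null ? 1 : old + 1);
        if (counts.size() >= threshold) {
            flushRemaining();
        }
    }

    // Called when the threshold is reached and again after the final commit.
    synchronized void flushRemaining() {
        if (counts.isEmpty()) {
            return;
        }
        Map<String, Integer> batch = counts;
        counts = new HashMap<String, Integer>();
        sink.write(batch);
    }
}
```

The sink is called while the lock is held, which keeps the sketch simple but blocks other indexing threads during the database write; handing the batch to a separate writer thread would be one way around that.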

FWIW
Erick

On Tue, Aug 21, 2012 at 8:19 PM, ksu wildcats <ks...@gmail.com> wrote:
> [...]

Re: Solr Custom Filter Factory - How to pass parameters?

Posted by ksu wildcats <ks...@gmail.com>.
Jack

Reading through the documentation for UpdateRequestProcessor, my
understanding is that it's good for processing documents before
analysis.
Is it true that processAdd (where we can have custom logic) is invoked once
per document and is invoked before any of the analyzers get invoked?

I couldn't figure out how I can use UpdateRequestProcessor to access the
tokens stored in memory by CustomFilterFactory/CustomFilter.

Can you please provide more information on how I can use
UpdateRequestProcessor to handle any post-processing that needs to be done
after all documents are added to the index?

Also, does CustomFilterFactory/CustomFilter have any way to do
post-processing after all documents are added to the index?

Here is the code I have for CustomFilterFactory/CustomFilter. This might
help you understand what I am trying to do, and maybe there is a better way
to do this.
The main problem I have with this approach is that I am forced to write
results stored in memory (customMap) to the database per document, and if I
have 1 million documents then that's 1 million DB calls. I am trying to reduce
the number of calls made to the database by storing results in memory and
writing them to the database once for every X documents (say, every 10000
docs).

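// Imports below assume the Solr/Lucene 3.x APIs used in the original code.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class CustomFilterFactory extends BaseTokenFilterFactory {
    public CustomFilter create(TokenStream input) {
        // "paramname" is an attribute on the <filter> element in schema.xml
        String databaseName = getArgs().get("paramname");
        return new CustomFilter(input, databaseName);
    }
}

public class CustomFilter extends TokenFilter {
    // Flush threshold (number of map entries); the value is a placeholder.
    private static final int COMMIT_SIZE = 10000;

    private final TermAttribute termAtt;
    // Keyed by term text: the TermAttribute instance is reused for every token.
    private final Map<String, Integer> customMap = new HashMap<String, Integer>();
    private final String databaseName;

    protected CustomFilter(TokenStream input, String databaseName) {
        super(input);
        termAtt = addAttribute(TermAttribute.class);
        this.databaseName = databaseName;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            // End of this document's token stream
            writeResultsToDB();
            return false;
        }

        if (addWordToCustomMap()) {
            // do some analysis on term and then populate customMap
            // customMap.put(termAtt.term(), somevalue);
        }

        if (customMap.size() > COMMIT_SIZE) {
            writeResultsToDB();
        }
        return true;
    }

    boolean addWordToCustomMap() {
        // custom logic - some validation on term to determine if this should
        // be added to customMap
        return true; // placeholder so the sketch compiles
    }

    void writeResultsToDB() throws IOException {
        // custom logic that reads data from customMap, does some analysis and
        // writes them to database.
    }
}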






Re: Solr Custom Filter Factory - How to pass parameters?

Posted by Jack Krupansky <ja...@basetechnology.com>.
Read through the update processor stuff. Maybe that might suggest a good 
place to put processing that should occur after all input has been analyzed.

http://wiki.apache.org/solr/UpdateRequestProcessor
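
To make the update-processor idea a little more concrete, here is a small dependency-free sketch of a processor chain. `DocProcessor` and `RecordingProcessor` are hypothetical stand-ins, not Solr classes; the real `UpdateRequestProcessor` has `processAdd` and `processCommit` callbacks that take command objects rather than the plain string used here. The only point is that an add callback fires once per document while a commit callback fires once per commit request, which may be a natural place for end-of-batch work.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an update processor: one link in a chain.
abstract class DocProcessor {
    protected final DocProcessor next;

    DocProcessor(DocProcessor next) {
        this.next = next;
    }

    // Called once per added document; the default just passes it on.
    void processAdd(String doc) {
        if (next != null) {
            next.processAdd(doc);
        }
    }

    // Called when a commit request travels down the chain.
    void processCommit() {
        if (next != null) {
            next.processCommit();
        }
    }
}

// Records what it sees, then delegates to the rest of the chain.
class RecordingProcessor extends DocProcessor {
    final List<String> events = new ArrayList<String>();

    RecordingProcessor(DocProcessor next) {
        super(next);
    }

    @Override
    void processAdd(String doc) {
        events.add("add:" + doc);
        super.processAdd(doc);
    }

    @Override
    void processCommit() {
        events.add("commit");
        super.processCommit();
    }
}
```

Note that nothing in this sketch sees analyzed tokens: as far as the chain is concerned, a document is an opaque value handed from one link to the next.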

-- Jack Krupansky

-----Original Message----- 
From: ksu wildcats
Sent: Tuesday, August 21, 2012 2:02 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Custom Filter Factory - How to pass parameters?

[...]


RE: Solr Custom Filter Factory - How to pass parameters?

Posted by ksu wildcats <ks...@gmail.com>.
Thanks for your help. I was able to get it working using the parameters
from the field type definition in the config files.

I am now stuck on the next step.
Can you please tell me if there is a way to identify/intercept the last token
that gets added to the index (across all documents)?

Here is my scenario:
1) I have a custom implementation of the "incrementToken" method in
CustomFilter.
2) I am trying to collect all tokens from all documents, do some
analysis on those tokens, and then write the result to the database.
3) I have the results saved in memory and am writing them to the database
after the last token is parsed:
if (!input.incrementToken()) {
 // custom logic that writes the data from in-memory to database
}
4) I noticed that this approach results in too many DB calls (one per
document).
5) To avoid too many calls to the database, I tried to batch results from
multiple documents and then write them all at once, but what I
couldn't figure out is how to determine when to flush the results from
CustomFilter to the database.

Is there any method in the FilterFactory or Filter class that I can use to
know that indexing is completed?





RE: Solr Custom Filter Factory - How to pass parameters?

Posted by ksu wildcats <ks...@gmail.com>.
Thanks Markus.
Links are helpful. I will give it a try and see if that solves my problem.




RE: Solr Custom Filter Factory - How to pass parameters?

Posted by Markus Jelsma <ma...@openindex.io>.

 
 
-----Original message-----
> From:ksu wildcats <ks...@gmail.com>
> Sent: Mon 20-Aug-2012 20:28
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Custom Filter Factory - How to pass parameters?
> 
> Thanks Jack.
> 
> The information I want to pass is the "databasename" into which the analyzed
> data needs to be inserted.
> 
> As I was saying earlier, the setup we have is
> 1) we use embedded Solr server with multi cores  - embedded into our webapp
> 2) support one index for each client - each client has a separate database
> (rdbms) and separate index (core)
> 3) dynamically create the config files when client request comes into our
> service for first time.
>    config files (schema xml) are separate but content is identical for all
> cores.
> 
> The custom filter factory we want to add to chain of filters in schema.xml
> will process tokens and write them to the client's database.
> I am trying to figure out a way to retrieve the database name based on the
> information coming in request from client.
> 
> I hope this is clear.
> 
> Regarding your suggestion on ability to pass parameters in field type
> definitions. Can you please point me to documentation or example on how to
> retrieve these parameter values from within filter factory?

You extend a TokenFilterFactory:
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html

which extends AbstractAnalysisFactory:
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html

Use the get() method to get the parameters defined in the XML. Check how the stop filter retrieves its parameters:

http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopFilterFactory.java?view=markup
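
As a rough illustration of the above, here is a dependency-free sketch of what happens to attributes on the `<filter>` element: Solr hands them to the factory as a map of strings, and the factory reads them by name. `FilterArgsSketch`, `requireArg`, `intArg`, and the `databaseName` and `commitSize` attributes are hypothetical names for this example; in a real factory the map is supplied by Solr when the schema is loaded, not built by hand.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a token filter factory; no Solr classes are used here.
class FilterArgsSketch {
    private final Map<String, String> args;

    // In Solr the map would hold the attributes of the <filter .../> element, e.g.
    // <filter class="com.example.CustomFilterFactory" databaseName="client_a_db"/>
    FilterArgsSketch(Map<String, String> args) {
        this.args = new HashMap<String, String>(args);
    }

    // Fail fast when a required attribute is missing or blank.
    String requireArg(String name) {
        String value = args.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required attribute: " + name);
        }
        return value.trim();
    }

    // Optional numeric attribute with a default.
    int intArg(String name, int defaultValue) {
        String value = args.get(name);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }
}
```

Because each core has its own schema.xml, giving the attribute a different value in each generated schema is one way to get a per-index database name into the filter.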

> 
> Also I am not familiar with "update processor". Any link to additional
> information on how to provide "update processor" will be greatly helpful.

Re: Solr Custom Filter Factory - How to pass parameters?

Posted by ksu wildcats <ks...@gmail.com>.
Thanks Jack.

The information I want to pass is the "databasename" into which the analyzed
data needs to be inserted.

As I was saying earlier, the setup we have is:
1) we use an embedded Solr server with multiple cores, embedded into our webapp
2) we support one index for each client - each client has a separate database
(RDBMS) and a separate index (core)
3) we dynamically create the config files when a client request comes into our
service for the first time.
   The config files (schema xml) are separate but the content is identical for
all cores.

The custom filter factory we want to add to the chain of filters in schema.xml
will process tokens and write them to the client's database.
I am trying to figure out a way to retrieve the database name based on the
information coming in the request from the client.

I hope this is clear.

Regarding your suggestion on the ability to pass parameters in field type
definitions: can you please point me to documentation or an example on how to
retrieve these parameter values from within the filter factory?

Also, I am not familiar with "update processor". Any link to additional
information on how to provide an "update processor" would be greatly helpful.

   





Re: Solr Custom Filter Factory - How to pass parameters?

Posted by Jack Krupansky <ja...@basetechnology.com>.
First, the obvious question: What kind of information? Be specific.

Second, you can pass parameters to your filter factory in your field type 
definitions. You could have separate schemas or separate field types for the 
different indexes. Is there anything this doesn't cover?

You can also provide an "update processor" that could supply whatever 
parameters you want.

-- Jack Krupansky

-----Original Message----- 
From: ksu wildcats
Sent: Monday, August 20, 2012 1:19 PM
To: solr-user@lucene.apache.org
Subject: Solr Custom Filter Factory - How to pass parameters?

[...]