Posted to solr-user@lucene.apache.org by Walter Underwood <wu...@wunderwood.org> on 2018/04/06 21:46:39 UTC

Running an analyzer chain in an update request processor

Is there an easy way to define an analyzer chain in schema.xml then run it in an update request processor?

I want to run a chain ending in the minhash token filter, then take those minhashes, convert them to hex, and put them in a string field. I’d like the values stored.

It seems like this could all work in an update request processor. Grab the text from one field, run it through the chain, format the output tokens and add them to the field for hashes.
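
A rough sketch of the "format the output tokens" step, in JavaScript since a script update processor would run it. It assumes each token arrives as a plain string (for example from the term attribute's toString()) and encodes every UTF-16 code unit as four hex digits. The helper names tokenToHex and tokensToHex are made up for illustration, and how the minhash filter packs its hash bits into characters is not covered here.

```javascript
// Hypothetical helper: hex-encode one token, four hex digits per UTF-16 code unit.
// Assumes the token is a plain string; the minhash filter's exact character
// packing is an open question in this sketch.
function tokenToHex(token) {
  var hex = '';
  for (var i = 0; i < token.length; i++) {
    var unit = token.charCodeAt(i).toString(16);
    // Left-pad so every code unit contributes exactly four digits.
    hex += ('0000' + unit).slice(-4);
  }
  return hex;
}

// Hypothetical helper: hex-encode a list of tokens, one output string per token,
// ready to be added as values of a multi-valued string field.
function tokensToHex(tokens) {
  var out = [];
  for (var i = 0; i < tokens.length; i++) {
    out.push(tokenToHex(tokens[i]));
  }
  return out;
}
```

The fixed four-digit width keeps every hash string the same length for a given token length, which makes the stored values easier to compare by eye.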

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Running an analyzer chain in an update request processor

Posted by Walter Underwood <wu...@wunderwood.org>.
Thanks, I should have mentioned that I’m doing this in a script URP.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 6, 2018, at 3:06 PM, Steve Rowe <sa...@gmail.com> wrote:
> 
> Hi Walter,
> 
> I’ve seen Erik Hatcher recommend using the StatelessScriptUpdateProcessor for this purpose, e.g. on slides 10-11 of https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks .
> 
> More info at https://wiki.apache.org/solr/ScriptUpdateProcessor and https://lucene.apache.org/solr/7_3_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html 
> 
> --
> Steve
> www.lucidworks.com
> 


Re: Running an analyzer chain in an update request processor

Posted by Steve Rowe <sa...@gmail.com>.
Hi Walter,

I’ve seen Erik Hatcher recommend using the StatelessScriptUpdateProcessor for this purpose, e.g. on slides 10-11 of https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks .

More info at https://wiki.apache.org/solr/ScriptUpdateProcessor and https://lucene.apache.org/solr/7_3_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html 

--
Steve
www.lucidworks.com



Re: Running an analyzer chain in an update request processor

Posted by Steve Rowe <sa...@gmail.com>.
Hi Walter,

I haven’t seen this before, but it looks like https://bugs.java.com/view_bug.do?bug_id=8071775
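
The cast error is also consistent with Nashorn passing a type object (a StaticClass) to getAttribute(), which expects a java.lang.Class. In Nashorn, the .class property of a Java type object yields the java.lang.Class. A minimal sketch, assuming that is the cause here; getTermAttribute is a hypothetical helper name, and Java.type is Nashorn-specific:

```javascript
// Hypothetical helper for a Nashorn script update processor; not from the slides.
// Java.type() returns a Nashorn type object (a StaticClass), not a java.lang.Class.
// The .class property of that object is the java.lang.Class that
// getAttribute(Class) expects.
function getTermAttribute(tokenStream) {
  var CharTermAttribute = Java.type(
    'org.apache.lucene.analysis.tokenattributes.CharTermAttribute');
  return tokenStream.getAttribute(CharTermAttribute.class);
}
```

The same .class suffix should work on a Packages.org.apache... reference, so the smallest change to the quoted script would be appending .class to the argument on line 15.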

--
Steve
www.lucidworks.com

> On Apr 20, 2018, at 7:54 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> I’m back.
> 
> I think I’m following the steps in Erik Hatcher’s slides: https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks
> 
> With a few minor changes, like using getIndexAnalyzer() because getAnalyzer() is gone. And I’ve pulled the subroutine code into the main processAdd function.
> 
> Any ideas about the cause of this error?
> 
> java.lang.ClassCastException: Cannot cast jdk.internal.dynalink.beans.StaticClass to java.lang.Class
> 	at java.lang.invoke.MethodHandleImpl.newClassCastException(MethodHandleImpl.java:361)
> 	at java.lang.invoke.MethodHandleImpl.castReference(MethodHandleImpl.java:356)
> 	at jdk.nashorn.internal.scripts.Script$Recompilation$37$104A$\^eval\_.processAdd(<eval>:15)
> 
> This is the code up through line 15:
> 
>    // Generate minhashes using the "minhash" analyzer chain
>    var analyzer = req.getCore().getLatestSchema().getFieldTypeByName('minhash').getIndexAnalyzer();
>    var hashes = [];
>    var token_stream = analyzer.tokenStream(null, new java.io.StringReader(question));
>    var term_att = token_stream.getAttribute(Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 


Re: Running an analyzer chain in an update request processor

Posted by Walter Underwood <wu...@wunderwood.org>.
I’m back.

I think I’m following the steps in Erik Hatcher’s slides: https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks

With a few minor changes, like using getIndexAnalyzer() because getAnalyzer() is gone. And I’ve pulled the subroutine code into the main processAdd function.

Any ideas about the cause of this error?

java.lang.ClassCastException: Cannot cast jdk.internal.dynalink.beans.StaticClass to java.lang.Class
	at java.lang.invoke.MethodHandleImpl.newClassCastException(MethodHandleImpl.java:361)
	at java.lang.invoke.MethodHandleImpl.castReference(MethodHandleImpl.java:356)
	at jdk.nashorn.internal.scripts.Script$Recompilation$37$104A$\^eval\_.processAdd(<eval>:15)

This is the code up through line 15:

    // Generate minhashes using the "minhash" analyzer chain
    var analyzer = req.getCore().getLatestSchema().getFieldTypeByName('minhash').getIndexAnalyzer();
    var hashes = [];
    var token_stream = analyzer.tokenStream(null, new java.io.StringReader(question));
    var term_att = token_stream.getAttribute(Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 7, 2018, at 9:50 AM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> As I think more about this, we should have a signature processor that uses minhash. The MD5 signature processor was really easy to use.
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 


Re: Running an analyzer chain in an update request processor

Posted by Walter Underwood <wu...@wunderwood.org>.
As I think more about this, we should have a signature processor that uses minhash. The MD5 signature processor was really easy to use.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 7, 2018, at 4:55 AM, Emir Arnautović <em...@sematext.com> wrote:
> 
> Hi Walter,
> I wrote this sample processor to get doc values on an analysed field: https://github.com/od-bits/solr-multivaluefield-processor
> 
> (+ related blog: http://www.od-bits.com/2018/02/solr-docvalues-on-analysed-field.html)
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 


Re: Running an analyzer chain in an update request processor

Posted by Emir Arnautović <em...@sematext.com>.
Hi Walter,
I wrote this sample processor to get doc values on an analysed field: https://github.com/od-bits/solr-multivaluefield-processor

(+ related blog: http://www.od-bits.com/2018/02/solr-docvalues-on-analysed-field.html)

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


