Posted to solr-user@lucene.apache.org by Dileepa Jayakody <di...@gmail.com> on 2013/11/12 07:31:14 UTC

Indexing a token to a different field in a custom filter

Hi All,

In my custom filter, I need to index the processed token into a different
field. The processed token is a Stanbol enhancement response.

The solution I have found so far is to use a Solr client (SolrJ) to add a
new document with my processed field to Solr. Below is the sample code
segment:

// SolrJ 4.x client; add() and commit() can also throw java.io.IOException
SolrServer server = new HttpSolrServer("http://localhost:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("stanbolResponse", response);
try {
    server.add(doc1);
    server.commit();
} catch (SolrServerException | IOException e) {
    e.printStackTrace();
}


This mechanism requires a new HTTP call to the local Solr server for every
token I process for the stanbolRequest field, and I feel it's not very
efficient.

Is there an alternative way to invoke an update request that adds a new
field to the document being indexed from within the filter (without making
an explicit HTTP call using SolrJ)?

Thanks,
Dileepa

Re: Indexing a token to a different field in a custom filter

Posted by Dileepa Jayakody <di...@gmail.com>.

I need to index the processed token to a different field (e.g.
stanbolResponse) in the same document that's being indexed.

I am looking for a way to retrieve the document.id from the TokenStream so
that I can update the same document with new field values. (In my sample
code above I'm adding a new document instead of updating the same document.)
Any pointers, please?

Thanks,
Dileepa



Re: Indexing a token to a different field in a custom filter

Posted by Dileepa Jayakody <di...@gmail.com>.

Thanks all for your valuable inputs.

I looked at the suggested solutions and I too feel that a custom update
processor during indexing will be the best way to handle the content field,
by transforming its value and storing the result in another field.

Do I only need to change the request handler below to intercept all incoming
documents and perform my custom analysis during indexing? Or do I need to
change any other request handler as well?
 <requestHandler name="/update" class="solr.UpdateRequestHandler">

Thanks,
Dileepa



Re: Indexing a token to a different field in a custom filter

Posted by Jack Krupansky <ja...@basetechnology.com>.

Any kind of cross-field processing is best done in an update processor. 
There are a lot of built-in update processors as well as a JavaScript script 
update processor.

-- Jack Krupansky


Re: Indexing a token to a different field in a custom filter

Posted by Erick Erickson <er...@gmail.com>.

Whether what Alvaro outlined works for you or
not, do NOT commit after every document if you
use SolrJ. The commit will hurt performance much
more than the HTTP overhead.

And you can always batch up, say, 1,000 documents
and use the server.add(doclist) method.
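
The batching advice above could look roughly like the following sketch. It
deliberately avoids the SolrJ classes so that it runs on its own: BatchSink
is a hypothetical stand-in for a bulk call such as server.add(doclist),
documents are modelled as plain maps, and the batch size of 1,000 is the
figure from the message. Committing is left to the caller, once at the end,
rather than after every document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchingIndexer {

    /** Hypothetical stand-in for a bulk call such as server.add(doclist). */
    public interface BatchSink {
        void addAll(List<Map<String, Object>> docs);
    }

    private final BatchSink sink;
    private final int batchSize;
    private final List<Map<String, Object>> buffer = new ArrayList<>();

    public BatchingIndexer(BatchSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Buffers one document and sends the buffer once it is full. */
    public void add(Map<String, Object> doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; call once more after the last document. */
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.addAll(new ArrayList<>(buffer)); // copy so the sink owns its list
        buffer.clear();
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchingIndexer indexer =
                new BatchingIndexer(docs -> batchSizes.add(docs.size()), 1000);
        for (int i = 0; i < 2500; i++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", "id" + i);
            indexer.add(doc);
        }
        indexer.flush();
        // A single commit would go here, after all batches have been sent.
        System.out.println(batchSizes); // prints [1000, 1000, 500]
    }
}
```

With 2,500 documents this makes three bulk calls instead of 2,500 single
adds, and one commit instead of 2,500.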

Overall, worrying about HTTP overhead is usually a
red herring.

Best,
Erick



Re: Indexing a token to a different field in a custom filter

Posted by Alvaro Cabrerizo <to...@gmail.com>.

Hi,

Maybe the synonym filter
<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory>
is a model you can follow. You can start by creating a new field type in
your schema that is Stanbol-enhanced. To follow the parallel, in the case of
synonyms we could have this schema:

...
<fieldType name="synonymtext" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true" />
  </analyzer>
</fieldType>
...
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="description" type="synonymtext" indexed="true" stored="true"
       multiValued="true" />
...

In the case of Stanbol:

...
<fieldType name="stanboltext" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- add your Stanbol filter parameters as attributes of this element -->
    <filter class="StanbolFilterFactory" />
  </analyzer>
</fieldType>
...
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="description" type="stanboltext" indexed="true" stored="true"
       multiValued="true" />
...

Thus the StanbolFilterFactory is in charge of connecting to Stanbol and
enhancing the data coming from WhitespaceTokenizerFactory, creating output
that can be used by other filters.

How do you index your data, then?

Just send your doc:

id:your id
description:the data to be enhanced


Another path you can follow is to imitate the behaviour of CopyField
<http://wiki.apache.org/solr/SchemaXml#Copy_Fields> in a more sophisticated
fashion, i.e. copy, enhance and put in a new field.
Then you can have the following schema:

...
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
</fieldType>
...
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="description" type="text" indexed="true" stored="true"
       multiValued="true" />
<field name="enhancedDescription" type="text" indexed="true" stored="true"
       multiValued="true" />
<copyEnhanceField source="description" dest="enhancedDescription" />

The (custom) copyEnhanceField is then in charge of taking the original
field, sending it to Stanbol, getting the response and writing it to the
new field.

How do you index your data then?

Just send your doc:

id:your id
description:the original data

And you will get in Solr:

id:your id
description:the original data
enhancedDescription:the enhanced data
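
The copy-and-enhance step described above can be sketched in plain Java as
follows. This is only a model of the idea, not Solr code: the document is a
plain map, the enhancer function is a hypothetical placeholder for the call
to Stanbol, and the field names are the ones from the example. In Solr
itself this logic would more likely sit in a custom update processor, as
suggested elsewhere in this thread, since copyEnhanceField is not a built-in
schema element.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class CopyEnhance {

    /**
     * Returns a copy of doc in which destField holds the enhanced value of
     * sourceField. If the source field is absent, the copy is unchanged.
     */
    public static Map<String, Object> copyEnhance(Map<String, Object> doc,
                                                  String sourceField,
                                                  String destField,
                                                  Function<String, String> enhancer) {
        Map<String, Object> result = new LinkedHashMap<>(doc);
        Object source = doc.get(sourceField);
        if (source != null) {
            result.put(destField, enhancer.apply(source.toString()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical placeholder: a real enhancer would send the text to Stanbol.
        Function<String, String> enhance = text -> "enhanced(" + text + ")";

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "id1");
        doc.put("description", "the original data");

        Map<String, Object> indexed =
                copyEnhance(doc, "description", "enhancedDescription", enhance);
        System.out.println(indexed);
    }
}
```

Because the step sees the whole document, it can read one field and write
another, which a token filter working on a single field's stream cannot do.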


Regards