Posted to solr-user@lucene.apache.org by "Markus.Mirsberger" <ma...@gmx.de> on 2015/06/17 11:48:52 UTC
Dedupe in a SolrCloud
Hi,
I am trying to use the dedupe feature to detect and mark near-duplicate
content in my collections.
I don't want to prevent duplicate content. I would like to detect it and
keep it for further processing. That's why I'm using an extra field and
not the document's unique field.
Here is how I added it to solrconfig.xml:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">fill_signature</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="fill_signature" processor="signature">
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="signature">
  <bool name="enabled">true</bool>
  <str name="signatureField">signature</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">content</str>
  <str name="signatureClass">solr.processor.TextProfileSignature</str>
  <str name="quantRate">.2</str>
  <str name="minTokenLen">3</str>
</updateProcessor>
When I initially add the documents to the cloud, everything works as
expected: the documents are added and the signature is created and
added. Perfect :)
The problem occurs when I want to update an existing document. In that
case the update.chain=fill_signature parameter is of course set too,
and I get a bad request error.
I found this Solr issue: https://issues.apache.org/jira/browse/SOLR-3473
Is that the problem I am running into?
Is it somehow possible to add parameters or set a specific update
handler when I'm adding documents to the cloud using SolrJ?
In that case I could either set the update.chain manually and remove it
from the request handler, or write a second request handler which I only
use when I want to set the signature field.
I know I can do that manually when I'm using e.g. curl, but is it also
possible with SolrJ? :)
Thanks,
Markus
Re: Dedupe in a SolrCloud
Posted by Markus Mirsberger <ma...@gmx.de>.
Thanks :)
Exactly what I was looking for. As I only need to create the signature once, this works perfectly for me :)
Cheers,
Markus
> On 17.06.2015, at 20:32, Shalin Shekhar Mangar <sh...@gmail.com> wrote:
> [snip]
Re: Dedupe in a SolrCloud
Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Comments inline:
On Wed, Jun 17, 2015 at 3:18 PM, Markus.Mirsberger
<ma...@gmx.de> wrote:
> I found this Solr issue: https://issues.apache.org/jira/browse/SOLR-3473
>
> Is that the problem I am running into?
You haven't pasted the complete error response, so I am guessing a bit
here. It is possible that you are running into the same problem, i.e.
the signature is being calculated again and, because the signature
field is not multi-valued, the second value causes an error.
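To illustrate what "not multi-valued" means here, this is a sketch of how the signature field might be declared in schema.xml. The field type and attributes are assumptions, since the original message does not show the schema. A single-valued field like this rejects a document that ends up carrying two signature values.

```xml
<!-- Assumed declaration; the actual schema was not posted in this thread. -->
<field name="signature" type="string" indexed="true" stored="true" multiValued="false" />
```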
> Is it somehow possible to add parameters or set a specific update handler
> when I'm adding documents to the cloud using SolrJ?
Yes, any custom parameter can be added to a SolrJ request. There is a
setParam(String param, String value) method available in
AbstractUpdateRequest which can be used to set a custom update.chain
for each SolrJ request.
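If the chain is selected per request, the handler default is no longer needed. A minimal sketch of that variant, assuming the updateProcessor definition named "signature" from the original message stays unchanged: the /update handler drops its update.chain default, so the fill_signature chain should only run for requests that pass update.chain=fill_signature, for example via setParam("update.chain", "fill_signature") on the SolrJ update request.

```xml
<!-- No default update.chain here: requests that do not set the
     parameter fall back to Solr's default update chain. -->
<requestHandler name="/update" class="solr.UpdateRequestHandler" />

<!-- Same chain as in the original message; it is now used only when a
     request explicitly sets update.chain=fill_signature. -->
<updateRequestProcessorChain name="fill_signature" processor="signature">
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With this layout, the initial indexing requests would set the parameter and later updates of existing documents would leave it out, so the signature is computed only once.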
--
Regards,
Shalin Shekhar Mangar.