Posted to solr-user@lucene.apache.org by Marc Sturlese <ma...@gmail.com> on 2008/11/17 10:18:25 UTC

using deduplication with dataimporthandler

Hey there,

I posted about my situation before, but I think my explanation
was a bit confusing...
I am using DataImportHandler with delta-import and it's working perfectly. I
have also coded my own SqlEntityProcessor to delete expired rows from the
index and the database.

Now I need duplicate control at indexing time. In my old Lucene core
I wrote my own duplicate control, but it was very slow because it worked by
comparing strings. I have been investigating Solr deduplication
(http://wiki.apache.org/solr/Deduplication) and it looks promising, as it works
with hashes instead of strings.

I have learned how to use deduplication with the /update requestHandler, as
the wiki says:
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

But I want to use it with the /dataimport requestHandler
(the one used by DataImportHandler). I don't know whether there is an XML
configuration that adds deduplication to DataImportHandler, or whether I
should code a plugin... and in that case, I don't know exactly where.

I hope my explanation is clearer now.
Thanks in advance!


-- 
View this message in context: http://www.nabble.com/using-deduplication-with-dataimporthandler-tp20536053p20536053.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: using deduplication with dataimporthandler

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Mon, Nov 17, 2008 at 5:18 PM, Marc Sturlese <ma...@gmail.com> wrote:

>
> Thank you so much. I have it sorted.
> I am wondering now if there is any more stable way to use deduplication
> than adding this patch to the Solr source project:
>
> https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> (SOLR-799.patch, 2008-11-12 05:10 PM, this one exactly).
>
> I have downloaded the latest nightly-build source code and couldn't see the
> needed classes in there.
> Does anyone know something? Should I ask this in the developers forum?
>

The issue is still open, but I don't think it will remain open for long.
Most likely, it will be released with the next Solr version.

-- 
Regards,
Shalin Shekhar Mangar.

Re: using deduplication with dataimporthandler

Posted by Marc Sturlese <ma...@gmail.com>.


Marc Sturlese wrote:
> 
> The thing is, I can't find the class
> org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory
> anywhere...
> 
> Thanks in advance
> 
> 


Re: using deduplication with dataimporthandler

Posted by Marc Sturlese <ma...@gmail.com>.
Thank you so much. I have it sorted.
I am wondering now if there is any more stable way to use deduplication than
adding this patch to the Solr source project:
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
(SOLR-799.patch, 2008-11-12 05:10 PM, this one exactly).

I have downloaded the latest nightly-build source code and couldn't see the
needed classes in there.
Does anyone know something? Should I ask this in the developers forum?

Thanks in advance





Re: using deduplication with dataimporthandler

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
Any update processor can be used with DIH. First, register your dedupe
update processor as you do now. Then you can either pass update.processor
as a request parameter, or keep it in the 'defaults' of the dataimport
handler:

 <str name="update.processor">dedupe</str>
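
A minimal sketch of the second option, keeping the processor in the handler's
defaults. It assumes the DataImportHandler is registered at /dataimport, that
its configuration lives in a file named data-config.xml (a placeholder name),
and that an update processor chain named dedupe is already defined elsewhere
in solrconfig.xml, as in the /update setup quoted in this thread:

```xml
<!-- Sketch only: handler name and config file name are assumptions -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- DIH data source and entity definitions (placeholder file name) -->
    <str name="config">data-config.xml</str>
    <!-- Route every document imported by DIH through the dedupe chain -->
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>
```

For the first option, the same parameter would instead be appended to the
import request, for example
/dataimport?command=delta-import&update.processor=dedupe (the host and core
path depend on the installation).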




-- 
--Noble Paul