Posted to solr-user@lucene.apache.org by Derek Poh <dp...@globalsources.com> on 2015/03/18 10:52:09 UTC

index duplicate records from data source into 1 document

Hi

If I have duplicate records in my source data (DB or delimited files),
for simplicity's sake they are of the following nature:

Product Id    Business Type
-----------------------------------
12345         Exporter
12345         Agent
12366         Manufacturer
12377         Exporter
12377         Distributor

There are other fields with multiple values as well.

How do I index the duplicate records into 1 document? E.g. Product Id
12345 will be 1 document, 12366 as 1 document and 12377 as 1 document.

-Derek

Re: index duplicate records from data source into 1 document

Posted by Erick Erickson <er...@gmail.com>.
bq: Am I right to say we need to combine the duplicate records
into 1 before feeding them to Solr to index?

That's what I'd do. As Shawn says, if you simply fire them both at
Solr the more recent one will replace the older one.
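The pre-combining step Erick describes can be sketched as follows. This is a minimal illustration, not code from the thread; the field names (`productId`, `businessType`) are hypothetical, chosen to match the example table, and the combined docs would then be sent to Solr by whatever client you use (SolrJ, etc.):

```python
import csv
import io

def combine_rows(tsv_text):
    """Group delimited rows by Product Id: duplicate ids collapse into
    one doc whose businessType becomes a multi-valued list."""
    docs = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        doc = docs.setdefault(row["productId"],
                              {"productId": row["productId"], "businessType": []})
        if row["businessType"] not in doc["businessType"]:
            doc["businessType"].append(row["businessType"])
    return list(docs.values())

rows = ("productId\tbusinessType\n"
        "12345\tExporter\n12345\tAgent\n"
        "12366\tManufacturer\n"
        "12377\tExporter\n12377\tDistributor\n")
for doc in combine_rows(rows):
    print(doc)
```

With the table from the original question this yields three documents, with 12345 carrying `["Exporter", "Agent"]`, which is the "1 document per Product Id" shape Derek asked for.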

Best,
Erick

On Thu, Mar 19, 2015 at 7:44 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 3/19/2015 2:09 AM, Derek Poh wrote:
>> Am I right to say we need to combine the duplicate records into 1
>> before feeding them to Solr to index?
>>
>> I am coming from Endeca, which supports combining duplicate records
>> into 1 record during indexing. Was wondering if Solr supports this.
>
> If you index multiple documents with the same uniqueKey field value, Solr
> will delete the previous document and index the new one.  The data in
> the previous document is never seen.
>
> You could in theory write a custom UpdateRequestProcessor that looks for
> the previous document and merges it in whatever way you desire, so the
> combined information is what will be indexed, and configure Solr to use
> that update processor ...but this capability is not available out of the
> box.
>
> An update processor that does this should probably be included with
> Solr, but it would either need to be highly configurable, or everyone
> would need to agree on exactly what rules should be followed when
> combining duplicate records.
>
> Thanks,
> Shawn
>

Re: index duplicate records from data source into 1 document

Posted by Derek Poh <dp...@globalsources.com>.
Oh that is how Solr works...

On 3/19/2015 10:44 PM, Shawn Heisey wrote:
> On 3/19/2015 2:09 AM, Derek Poh wrote:
>> Am I right to say we need to combine the duplicate records into 1
>> before feeding them to Solr to index?
>>
>> I am coming from Endeca, which supports combining duplicate records
>> into 1 record during indexing. Was wondering if Solr supports this.
> If you index multiple documents with the same uniqueKey field value, Solr
> will delete the previous document and index the new one.  The data in
> the previous document is never seen.
>
> You could in theory write a custom UpdateRequestProcessor that looks for
> the previous document and merges it in whatever way you desire, so the
> combined information is what will be indexed, and configure Solr to use
> that update processor ...but this capability is not available out of the
> box.
>
> An update processor that does this should probably be included with
> Solr, but it would either need to be highly configurable, or everyone
> would need to agree on exactly what rules should be followed when
> combining duplicate records.
>
> Thanks,
> Shawn
>
>


Re: index duplicate records from data source into 1 document

Posted by Shawn Heisey <el...@elyograg.org>.
On 3/20/2015 4:03 AM, Toke Eskildsen wrote:
> On Thu, 2015-03-19 at 15:44 +0100, Shawn Heisey wrote:
>> You could in theory write a custom UpdateRequestProcessor that looks for
>> the previous document and merges it in whatever way you desire, so the
>> combined information is what will be indexed, and configure Solr to use
>> that update processor ...but this capability is not available out of the
>> box.
> 
> I have not tried it at all, but I thought 
> https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
> was doing exactly what you describe?

Atomic Updates are a special syntax, not the same thing as simply
indexing different versions of a document and expecting the search
engine to combine them into a single document.
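The "special syntax" Shawn refers to marks each field with a modifier such as "set" or "add" rather than a bare value. A minimal sketch of such a request body (field names hypothetical, matching the thread's example; the document is selected by its uniqueKey field):

```python
import json

# Atomic Update body for Solr's /update handler: fields carry a
# modifier ("set", "add", "remove", "inc") instead of a plain value.
# "add" appends to a multi-valued field on the already-indexed doc.
update = [{
    "productId": "12345",              # uniqueKey: selects the existing document
    "businessType": {"add": "Agent"},  # append to the stored multi-valued field
}]
print(json.dumps(update))
```

Note that this only works when the schema meets the Atomic Update requirements (e.g. fields stored or recoverable), which is exactly the caveat Shawn raises below.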

Assuming the schema meets the Atomic Update requirements, an update
processor could be included with Solr to combine a new document with an
existing document, but I doubt everyone would agree on exactly how to
combine the info.  Within a single document, you might want different
fields to behave in different ways ... some fields might replace the
existing value with the new value, others might add the new value(s) to
a multi-valued field, others might only use the new value if there is no
value in the old document, and so on.
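The per-field rules Shawn lists can be sketched as a merge driven by a small policy table. This is an illustration of the idea only, not an UpdateRequestProcessor; field names and rule names are made up for the example:

```python
def merge(old, new, policy):
    """Merge a new doc into an existing one, one rule per field:
    'replace' takes the new value, 'append' accumulates a list, and
    'keep_old' uses the new value only when the old doc has none."""
    merged = dict(old)
    for field, value in new.items():
        rule = policy.get(field, "replace")
        if rule == "replace":
            merged[field] = value
        elif rule == "append":
            existing = merged.get(field, [])
            if not isinstance(existing, list):
                existing = [existing]
            merged[field] = existing + [value]
        elif rule == "keep_old":
            merged.setdefault(field, value)
    return merged

old = {"productId": "12377", "businessType": ["Exporter"], "name": "Widget"}
new = {"productId": "12377", "businessType": "Distributor", "name": "Widget v2"}
policy = {"businessType": "append", "name": "keep_old"}
print(merge(old, new, policy))
```

The point of the sketch is Shawn's: the merge itself is easy, but the policy table is site-specific, which is why a one-size-fits-all processor is hard to ship.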

Thanks,
Shawn


Re: index duplicate records from data source into 1 document

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2015-03-19 at 15:44 +0100, Shawn Heisey wrote:
> You could in theory write a custom UpdateRequestProcessor that looks for
> the previous document and merges it in whatever way you desire, so the
> combined information is what will be indexed, and configure Solr to use
> that update processor ...but this capability is not available out of the
> box.

I have not tried it at all, but I thought 
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
was doing exactly what you describe?


- Toke Eskildsen, State and University Library, Denmark



Re: index duplicate records from data source into 1 document

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/19/2015 2:09 AM, Derek Poh wrote:
> Am I right to say we need to combine the duplicate records into 1
> before feeding them to Solr to index?
>
> I am coming from Endeca, which supports combining duplicate records
> into 1 record during indexing. Was wondering if Solr supports this.

If you index multiple documents with the same uniqueKey field value, Solr
will delete the previous document and index the new one.  The data in
the previous document is never seen.

You could in theory write a custom UpdateRequestProcessor that looks for
the previous document and merges it in whatever way you desire, so the
combined information is what will be indexed, and configure Solr to use
that update processor ...but this capability is not available out of the
box.

An update processor that does this should probably be included with
Solr, but it would either need to be highly configurable, or everyone
would need to agree on exactly what rules should be followed when
combining duplicate records.

Thanks,
Shawn


Re: index duplicate records from data source into 1 document

Posted by Derek Poh <dp...@globalsources.com>.
Hi Erick

Am I right to say we need to combine the duplicate records into 1
before feeding them to Solr to index?

I am coming from Endeca, which supports combining duplicate records
into 1 record during indexing. Was wondering if Solr supports this.

-Derek

On 3/18/2015 11:21 PM, Erick Erickson wrote:
> I'd use SolrJ, pull the docs by productId order and combine records
> with the same product ID into a single doc.
>
> Here's a starter set for indexing from a DB with SolrJ. It has Tika
> processing in it as well, but you can pull that out pretty easily.
>
> https://lucidworks.com/blog/indexing-with-solrj/
>
> Best,
> Erick
>
> On Wed, Mar 18, 2015 at 2:52 AM, Derek Poh <dp...@globalsources.com> wrote:
>> Hi
>>
>> If I have duplicate records in my source data (DB or delimited files),
>> for simplicity's sake they are of the following nature:
>>
>> Product Id    Business Type
>> -----------------------------------
>> 12345         Exporter
>> 12345         Agent
>> 12366         Manufacturer
>> 12377         Exporter
>> 12377         Distributor
>>
>> There are other fields with multiple values as well.
>>
>> How do I index the duplicate records into 1 document? E.g. Product Id
>> 12345 will be 1 document, 12366 as 1 document and 12377 as 1 document.
>>
>> -Derek
>


Re: index duplicate records from data source into 1 document

Posted by Erick Erickson <er...@gmail.com>.
I'd use SolrJ, pull the docs by productId order and combine records
with the same product ID into a single doc.

Here's a starter set for indexing from a DB with SolrJ. It has Tika
processing in it as well, but you can pull that out pretty easily.

https://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, Mar 18, 2015 at 2:52 AM, Derek Poh <dp...@globalsources.com> wrote:
> Hi
>
> If I have duplicate records in my source data (DB or delimited files),
> for simplicity's sake they are of the following nature:
>
> Product Id    Business Type
> -----------------------------------
> 12345         Exporter
> 12345         Agent
> 12366         Manufacturer
> 12377         Exporter
> 12377         Distributor
>
> There are other fields with multiple values as well.
>
> How do I index the duplicate records into 1 document? E.g. Product Id
> 12345 will be 1 document, 12366 as 1 document and 12377 as 1 document.
>
> -Derek