Posted to solr-user@lucene.apache.org by Sunil <su...@truesparrow.com> on 2008/07/15 09:45:41 UTC

Duplicate content

Hi All,

I want to change the duplicate content behavior in solr. What I want to
do is:

1) I don't want duplicate content.
2) I don't want to overwrite old content with new one. 

That is, if I add content to Solr that already exists in the index, the
existing content should not be overwritten.

Can anyone suggest how to achieve it?


Thanks,
Sunil



RE: Duplicate content

Posted by Sunil <su...@truesparrow.com>.
Thanks guys.


-----Original Message-----
From: Norberto Meijome [mailto:freebsd@meijome.net] 
Sent: Tuesday, July 15, 2008 2:35 PM
To: solr-user@lucene.apache.org
Subject: Re: Duplicate content

On Tue, 15 Jul 2008 10:48:14 +0200
Jarek Zgoda <ja...@redefine.pl> wrote:

> >> 2) I don't want to overwrite old content with new one. 
> >>
> >> Means, if I add duplicate content in solr and the content already
> >> exists, the old content should not be overwritten.  
> > 
> > before inserting a new document, query the index - if you get a result back,
> > then don't insert. I don't know of any other way.  
> 
> This operation is not atomic, so you get a race condition here. Other
> than that, it seems fine. ;)

of course - but i am not sure you can control atomicity at the SOLR level
(yet? ;) ) for /update handler - so it'd have to either be a custom handler, or
your app being the only one accessing and controlling write access to it that
way. It definitely gets more interesting if you start adding shards ;)

_________________________
{Beto|Norberto|Numard} Meijome

"All parts should go together without forcing. You must remember that the parts
you are reassembling were disassembled by you. Therefore, if you can't get them
together again, there must be a reason. By all means, do not use hammer." IBM
maintenance manual, 1975

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.



Re: Duplicate content

Posted by Norberto Meijome <fr...@meijome.net>.
On Tue, 15 Jul 2008 10:48:14 +0200
Jarek Zgoda <ja...@redefine.pl> wrote:

> >> 2) I don't want to overwrite old content with new one. 
> >>
> >> Means, if I add duplicate content in solr and the content already
> >> exists, the old content should not be overwritten.  
> > 
> > before inserting a new document, query the index - if you get a result back,
> > then don't insert. I don't know of any other way.  
> 
> This operation is not atomic, so you get a race condition here. Other
> than that, it seems fine. ;)

of course - but i am not sure you can control atomicity at the SOLR level
(yet? ;) ) for /update handler - so it'd have to either be a custom handler, or
your app being the only one accessing and controlling write access to it that
way. It definitely gets more interesting if you start adding shards ;)

_________________________
{Beto|Norberto|Numard} Meijome

"All parts should go together without forcing. You must remember that the parts
you are reassembling were disassembled by you. Therefore, if you can't get them
together again, there must be a reason. By all means, do not use hammer." IBM
maintenance manual, 1975

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

Re: Duplicate content

Posted by Jarek Zgoda <ja...@redefine.pl>.
Norberto Meijome wrote:

>> 2) I don't want to overwrite old content with new one. 
>>
>> Means, if I add duplicate content in solr and the content already
>> exists, the old content should not be overwritten.
> 
> before inserting a new document, query the index - if you get a result back,
> then don't insert. I don't know of any other way.

This operation is not atomic, so you get a race condition here. Other
than that, it seems fine. ;)
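
To make the race concrete: two writers can both run the "does it exist?" query,
both get "no", and both insert. If a single application process is the only
writer, one way around it is to hold a lock across the check and the insert.
A minimal sketch, assuming a plain dict as a stand-in for the index (the
SingleWriter class is hypothetical, not a Solr API, and a lock like this does
nothing for writers in other processes or on other machines):

```python
import threading


class SingleWriter:
    """Serializes check-then-insert inside one process.

    The dict is only a stand-in for the real index; the point is that the
    existence check and the insert happen under the same lock.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}

    def add_if_absent(self, doc):
        """Insert doc unless its id is already present. Returns True if inserted."""
        with self._lock:
            if doc["id"] in self._docs:
                return False  # keep the old document untouched
            self._docs[doc["id"]] = doc
            return True

    def get(self, doc_id):
        with self._lock:
            return self._docs.get(doc_id)
```

Without the lock, the check and the insert are two separate steps and another
thread can slip in between them.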

-- 
We read Knuth so you don't have to. -- Tim Peters

Jarek Zgoda
re:define

Re: Duplicate content

Posted by Norberto Meijome <fr...@meijome.net>.
On Tue, 15 Jul 2008 13:15:41 +0530
"Sunil" <su...@truesparrow.com> wrote:

> 1) I don't want duplicate content.

SOLR uses the field you define as the unique field to determine whether a
document should be replaced or added. The rest of the fields are in your hands.
You could devise a setup whereby the document id is generated by hashing all
the other fields in your schema, thereby ensuring that a unique document id
means unique content (of course, for a meaning of 'uniqueness' that is
"different bytes" ;) )

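A rough sketch of the hashing idea, assuming the documents are available as
plain dicts before they are sent to the index. The function name, the choice
of SHA-1 and the field names are only illustrative, not something SOLR
prescribes:

```python
import hashlib


def content_id(doc, id_field="id"):
    """Derive a document id from the content of every field except the id itself."""
    digest = hashlib.sha1()
    # Sort field names so the same content always hashes the same way,
    # regardless of the order in which the fields were supplied.
    for name in sorted(doc):
        if name == id_field:
            continue
        # Length-prefix each part so that ("ab", "c") and ("a", "bc")
        # cannot collapse into the same byte stream.
        for part in (name, str(doc[name])):
            data = part.encode("utf-8")
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


doc = {"title": "Duplicate content", "body": "some text"}
doc["id"] = content_id(doc)
```

Two documents with byte-identical fields get the same id; any difference at
all, even whitespace, produces a different id.
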
> 2) I don't want to overwrite old content with new one. 
> 
> Means, if I add duplicate content in solr and the content already
> exists, the old content should not be overwritten.

before inserting a new document, query the index - if you get a result back,
then don't insert. I don't know of any other way.
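
In application code the check could look roughly like this. InMemoryIndex is a
hypothetical stand-in for whatever client you use to talk to the index; only
the order of operations (look up by unique key first, add only on a miss) is
the point here:

```python
class InMemoryIndex:
    """Hypothetical stand-in for the index, keyed by the unique field."""

    def __init__(self):
        self._docs = {}

    def exists(self, doc_id):
        return doc_id in self._docs

    def add(self, doc):
        # Like a default add: a document with the same id replaces the old one.
        self._docs[doc["id"]] = doc

    def get(self, doc_id):
        return self._docs.get(doc_id)


def add_if_absent(index, doc):
    """Query first; insert only when nothing with that id is indexed yet."""
    if index.exists(doc["id"]):
        return False  # the old content stays as it is
    index.add(doc)
    return True


index = InMemoryIndex()
add_if_absent(index, {"id": "1", "body": "old"})
add_if_absent(index, {"id": "1", "body": "new"})  # skipped, "old" is kept
```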

b
_________________________
{Beto|Norberto|Numard} Meijome

"The real voyage of discovery consists not in seeking new landscapes, but in
having new eyes." Marcel Proust

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

Re: Duplicate content

Posted by Chris Hostetter <ho...@fucit.org>.
: > Is <uniqueKey> really unique if we allow duplicates? I had similar
: > problem...
: > 
: 
: if you allowDups, then uniqueKey may not be unique...

allowDups is one of those features where Solr not only gives you enough 
rope to hang yourself, but Solr also ties the rope into a knot, cuts some 
lumber, and builds you a gallows -- but it's a feature if what you really 
want to do is hang yourself.


-Hoss


Re: Duplicate content

Posted by Ryan McKinley <ry...@gmail.com>.
On Jul 15, 2008, at 10:31 AM, Fuad Efendi wrote:

> Thanks Ryan,
>
> Is <uniqueKey> really unique if we allow duplicates? I had similar  
> problem...
>

if you allowDups, then uniqueKey may not be unique...

however, it is still used as the key for many items.


>
> Quoting Ryan McKinley <ry...@gmail.com>:
>
>>
>> On Jul 15, 2008, at 2:45 AM, Sunil wrote:
>>
>>> Hi All,
>>>
>>> I want to change the duplicate content behavior in solr. What I  
>>> want to
>>> do is:
>>>
>>> 1) I don't want duplicate content.
>>> 2) I don't want to overwrite old content with new one.
>>>
>>> Means, if I add duplicate content in solr and the content already
>>> exists, the old content should not be overwritten.
>>>
>>> Can anyone suggest how to achieve it?
>>>
>>
>> Check the "allowDups" options for <add>
>> http://wiki.apache.org/solr/UpdateXmlMessages#head-3dfbf90fbc69f168ab6f3389daf68571ad614bef
>>
>>
>>
>>>
>>> Thanks,
>>> Sunil
>>>
>>>
>
>
>


Re: Duplicate content

Posted by Fuad Efendi <fu...@efendi.ca>.
Thanks Ryan,

Is <uniqueKey> really unique if we allow duplicates? I had similar problem...


Quoting Ryan McKinley <ry...@gmail.com>:

>
> On Jul 15, 2008, at 2:45 AM, Sunil wrote:
>
>> Hi All,
>>
>> I want to change the duplicate content behavior in solr. What I want to
>> do is:
>>
>> 1) I don't want duplicate content.
>> 2) I don't want to overwrite old content with new one.
>>
>> Means, if I add duplicate content in solr and the content already
>> exists, the old content should not be overwritten.
>>
>> Can anyone suggest how to achieve it?
>>
>
> Check the "allowDups" options for <add>
> http://wiki.apache.org/solr/UpdateXmlMessages#head-3dfbf90fbc69f168ab6f3389daf68571ad614bef
>
>
>
>>
>> Thanks,
>> Sunil
>>
>>




Re: Duplicate content

Posted by Ryan McKinley <ry...@gmail.com>.
On Jul 15, 2008, at 2:45 AM, Sunil wrote:

> Hi All,
>
> I want to change the duplicate content behavior in solr. What I want  
> to
> do is:
>
> 1) I don't want duplicate content.
> 2) I don't want to overwrite old content with new one.
>
> Means, if I add duplicate content in solr and the content already
> exists, the old content should not be overwritten.
>
> Can anyone suggest how to achieve it?
>

Check the "allowDups" options for <add>
http://wiki.apache.org/solr/UpdateXmlMessages#head-3dfbf90fbc69f168ab6f3389daf68571ad614bef
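
For illustration, a small sketch that builds such an <add> message from Python.
Only the allowDups attribute name comes from the wiki page above; the helper
name, field names and values are made up, and what the attribute does in your
Solr version should be checked against that page:

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, allow_dups):
    """Build an <add> update message with the allowDups attribute set."""
    add = ET.Element("add", {"allowDups": "true" if allow_dups else "false"})
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(add, encoding="unicode")


print(build_add_message({"id": "doc-1", "title": "hello"}, allow_dups=False))
```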



>
> Thanks,
> Sunil
>
>


Re: Duplicate content

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
You must do a check before adding documents

On Tue, Jul 15, 2008 at 1:15 PM, Sunil <su...@truesparrow.com> wrote:
> Hi All,
>
> I want to change the duplicate content behavior in solr. What I want to
> do is:
>
> 1) I don't want duplicate content.
> 2) I don't want to overwrite old content with new one.
>
> Means, if I add duplicate content in solr and the content already
> exists, the old content should not be overwritten.
>
> Can anyone suggest how to achieve it?
>
>
> Thanks,
> Sunil
>
>
>



-- 
--Noble Paul