Posted to user@lucy.apache.org by Lee Goddard <le...@gmail.com> on 2012/08/15 09:49:32 UTC

[lucy-user] Avoid duplicate docs in hits?

Hi

Just started playing with Lucy, but I can't
find a way to prevent duplicate hits
being returned.

Please help.

Thanks
lee

Re: [lucy-user] Avoid duplicate docs in hits?

Posted by Lee <le...@gmail.com>.
On 16/08/2012 16:08, Nick Wellnhofer wrote:
> On 15/08/2012 20:27, Lee wrote:
>>
>> On 15/08/2012 17:41, Peter Karman wrote:
>>> On 8/15/12 2:49 AM, Lee Goddard wrote:
>>>> Just started playing with Lucy, but I can't
>>>> find a way to prevent duplicate hits
>>>> being returned.
>>>
>>> Lucy won't return duplicate hits. But it also won't prevent you from
>>> inserting duplicate documents, for some value of "duplicate".
>>>
>>> A small, reproducible example is best if you are looking for help.
>> Thanks, Peter.
>>
>> Turned out I solved the problem by removing the index directory before
>> re-creating it. I had assumed the 'create' flag would discard any old
>> index in the same location.
>
> That's what the "truncate" option is for:
>
> my $indexer = Lucy::Index::Indexer->new(
>     index    => $SSS::XXX::Config::LUCY_IDX_PATH,
>     create   => 1,
>     truncate => 1,
>     schema   => $schema,
> );
>
> See https://metacpan.org/module/Lucy::Index::Indexer#new-labeled-params-

Yes, thanks. It would be helpful to get a warning when one attempts to
create and the directory already exists, though that might be
considered redundant: it depends on how friendly the system should be.

Thanks
Lee


Re: [lucy-user] Avoid duplicate docs in hits?

Posted by Nick Wellnhofer <we...@aevum.de>.
On 15/08/2012 20:27, Lee wrote:
>
> On 15/08/2012 17:41, Peter Karman wrote:
>> On 8/15/12 2:49 AM, Lee Goddard wrote:
>>> Just started playing with Lucy, but I can't
>>> find a way to prevent duplicate hits
>>> being returned.
>>
>> Lucy won't return duplicate hits. But it also won't prevent you from
>> inserting duplicate documents, for some value of "duplicate".
>>
>> A small, reproducible example is best if you are looking for help.
> Thanks, Peter.
>
> Turned out I solved the problem by removing the index directory before
> re-creating it. I had assumed the 'create' flag would discard any old
> index in the same location.

That's what the "truncate" option is for:

my $indexer = Lucy::Index::Indexer->new(
     index    => $SSS::XXX::Config::LUCY_IDX_PATH,
     create   => 1,
     truncate => 1,
     schema   => $schema,
);

See https://metacpan.org/module/Lucy::Index::Indexer#new-labeled-params-
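
A minimal sketch of the difference, not taken from the thread: it indexes one document three times into a temporary directory and counts the documents after each run. The one-field schema, the index_once helper and the temporary path are made up for illustration; create, truncate, add_doc and commit are the calls discussed above.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use Lucy;    # loads Lucy::Index::*, Lucy::Plan::* and Lucy::Search::*

# Hypothetical one-field schema, only for this demonstration.
my $schema = Lucy::Plan::Schema->new;
$schema->spec_field( name => 'title', type => Lucy::Plan::StringType->new );

# Throwaway index location (assumption: any writable path would do).
my $path = File::Spec->catdir( tempdir( CLEANUP => 1 ), 'index' );

# Add one document in its own indexing session, then return the number
# of documents the committed index holds.
sub index_once {
    my %extra   = @_;
    my $indexer = Lucy::Index::Indexer->new(
        index  => $path,
        schema => $schema,
        create => 1,
        %extra,
    );
    $indexer->add_doc( { title => 'hello' } );
    $indexer->commit;

    my $searcher = Lucy::Search::IndexSearcher->new( index => $path );
    my $hits     = $searcher->hits( query => Lucy::Search::MatchAllQuery->new );
    return $hits->total_hits;
}

my $first  = index_once();                   # new index
my $second = index_once();                   # create alone on an existing index
my $third  = index_once( truncate => 1 );    # create plus truncate

print "$first $second $third\n";
```

If create behaves as described in this thread, the second count should be 2 (the old document is still there) and the third should drop back to 1.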

Nick


Re: [lucy-user] Avoid duplicate docs in hits?

Posted by Lee <le...@gmail.com>.
On 15/08/2012 17:41, Peter Karman wrote:
> On 8/15/12 2:49 AM, Lee Goddard wrote:
>> Just started playing with Lucy, but I can't
>> find a way to prevent duplicate hits
>> being returned.
>
> Lucy won't return duplicate hits. But it also won't prevent you from 
> inserting duplicate documents, for some value of "duplicate".
>
> A small, reproducible example is best if you are looking for help.
Thanks, Peter.

Turned out I solved the problem by removing the index directory before 
re-creating it. I had assumed the 'create' flag would discard any old 
index in the same location.

     my $indexer = Lucy::Index::Indexer->new(
         index   => $SSS::XXX::Config::LUCY_IDX_PATH,
         create  => 1,
         schema  => $schema,
     );

Took me four hours to work this out....

Cheers
Lee

Re: [lucy-user] Avoid duplicate docs in hits?

Posted by Peter Karman <pe...@peknet.com>.
Desilets, Alain wrote on 8/28/12 2:47 PM:
> When I started working with Lucy, I expected it to work like a kind of
> relational DB table, where certain fields of an index acted like "unique
> keys" for the records (which in turn would guarantee that there can be only
> one record with a given key). But that's not how Lucy is designed.
> 
> So in the end, we implemented our own class LucyIndex, which adds this kind of
> functionality. When defining the schema for the index, you indicate which
> field will act as the key. From then on, if you add a record whose key value
> is the same as that of an existing record, then the class will erase the
> existing record, and replace it by the one you provide. It wasn't hard to
> implement, but I am surprised this kind of functionality is not standard in
> Lucy.
> 

Alain,

I think you've answered the question in your comments: it wasn't hard to
implement on top of the Lucy core functionality. That's why it isn't in core.
Core aims to do the hard things.

You're right that Lucy doesn't have the concept of a primary key built in. I
expect that's because there are so many app-specific ways to define a PK, it's
not worth trying to build that functionality into core. (I think Marvin might
say the same about QueryParser.)

Instead, methods like delete_by_term() and delete_by_query() make it simple to
add app-specific constraints.

E.g., here's what I do in SWISH::Prog::Lucy::Indexer, which uses 'swishdocpath'
(the URI) as the unique term for each doc:

    # make sure we delete any existing doc with same URI
    $self->{lucy}->delete_by_term(
        field => 'swishdocpath',
        term  => $doc{swishdocpath}
    );

    $self->{lucy}->add_doc( \%doc );

All that said, I doubt anyone would be opposed to adding PK functionality into
core, were someone to care enough about the feature to work on it. I imagine a
specific FieldType would be the way to go about it, and then some logic in
add_doc() that checks the field types in %doc and does... what? croak?
delete_by_term (as in your code and my example above)?

Alternately, it might be worth sharing your LucyIndex class on CPAN in the
LucyX::* namespace. Something to consider.
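
For illustration only, here is a rough sketch of what such a wrapper could look like. It is not Alain's LucyIndex class (which is not shown in this thread); the package name My::KeyedIndexer, the key_field parameter and the uri/title fields are all hypothetical. It applies the delete_by_term()-then-add_doc() pattern from the snippet above, with each add committed in its own session, as when re-indexing; duplicate keys inside a single uncommitted session are not covered here.

```perl
use strict;
use warnings;
use Lucy;    # loads Lucy::Index::*, Lucy::Plan::* and Lucy::Search::*

{
    package My::KeyedIndexer;    # hypothetical name, not part of Lucy
    use Carp qw(croak);

    # Takes the usual Lucy::Index::Indexer->new params plus 'key_field'.
    sub new {
        my ( $class, %args ) = @_;
        my $key_field = delete $args{key_field}
            or croak "key_field is required";
        return bless {
            key_field => $key_field,
            indexer   => Lucy::Index::Indexer->new(%args),
        }, $class;
    }

    # Replace-on-conflict: delete any doc with the same key, then add.
    sub add_doc {
        my ( $self, $doc ) = @_;
        my $field = $self->{key_field};
        croak "doc is missing key field '$field'"
            unless defined $doc->{$field};
        $self->{indexer}->delete_by_term(
            field => $field,
            term  => $doc->{$field},
        );
        $self->{indexer}->add_doc($doc);
    }

    sub commit { $_[0]{indexer}->commit }
}

package main;
use File::Temp qw(tempdir);

my $schema = Lucy::Plan::Schema->new;
$schema->spec_field( name => 'uri',   type => Lucy::Plan::StringType->new );
$schema->spec_field( name => 'title', type => Lucy::Plan::StringType->new );
my $path = tempdir( CLEANUP => 1 );

# Index the same URI twice, in two separate committed sessions.
for my $version ( 'first version', 'second version' ) {
    my $indexer = My::KeyedIndexer->new(
        key_field => 'uri',
        index     => $path,
        schema    => $schema,
        create    => 1,
    );
    $indexer->add_doc( { uri => '/a.html', title => $version } );
    $indexer->commit;
}

my $searcher = Lucy::Search::IndexSearcher->new( index => $path );
my $hits     = $searcher->hits(
    query => Lucy::Search::TermQuery->new( field => 'uri', term => '/a.html' ),
);
my $total = $hits->total_hits;
my $title = $hits->next->{title};
print "$total hit(s), title: $title\n";
```

The key field is a StringType here so that delete_by_term() matches the whole value exactly; croaking on a conflict instead of replacing would need a lookup against the committed index first, which this sketch leaves out.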


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

RE: [lucy-user] Avoid duplicate docs in hits?

Posted by "Desilets, Alain" <Al...@nrc-cnrc.gc.ca>.
When I started working with Lucy, I expected it to work like a kind of relational DB table, where certain fields of an index acted like "unique keys" for the records (which in turn would guarantee that there can be only one record with a given key). But that's not how Lucy is designed.

So in the end, we implemented our own class LucyIndex, which adds this kind of functionality. When defining the schema for the index, you indicate which field will act as the key. From then on, if you add a record whose key value is the same as that of an existing record, then the class will erase the existing record, and replace it by the one you provide. It wasn't hard to implement, but I am surprised this kind of functionality is not standard in Lucy.

Alain


Re: [lucy-user] Avoid duplicate docs in hits?

Posted by Peter Karman <pe...@peknet.com>.
On 8/15/12 2:49 AM, Lee Goddard wrote:
> Hi
>
> Just started playing with Lucy, but I can't
> find a way to prevent duplicate hits
> being returned.

Lucy won't return duplicate hits. But it also won't prevent you from 
inserting duplicate documents, for some value of "duplicate".

A small, reproducible example is best if you are looking for help.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com