Posted to user@lucy.apache.org by Serkan Mulayim <se...@gmail.com> on 2016/11/16 19:17:33 UTC

[lucy-user] Re: Regarding document Ids

Hi guys,

I think I need to simplify my question. After reading it one more time, I
realized I touched on many things, and it seems confusing.

It seems that if we index the same document twice, a new document is
created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, "If
you truly need a primary key field, you must define it and populate it
yourself". How can we do this? Are there any examples of this? Should I
search for the document by its primary key before indexing and, if it
exists, skip indexing it?

Thanks,
Serkan

On Tue, Nov 15, 2016 at 2:22 PM, Serkan Mulayim <se...@gmail.com>
wrote:

> Hi,
>
> As far as I can see, if we add the same document twice, it creates a new
> document. As per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, "If
> you truly need a primary key field, you must define it and populate it
> yourself". Can you please elaborate on this? Does it mean choosing a
> field to be the primary key, deleting the document with that primary key,
> and re-adding it? If so, the document might not have been created until we
> commit, so deletion would not be possible, right? Performance would be
> another issue.
>
> Another solution might be hashing the "primary key" and using it as the
> document ID (but the referred page also says that doc IDs are ephemeral). If
> the ephemerality of the doc ID is not a problem, my concern is
> collisions, considering that I might need to have many documents in the
> same index. This boils down to the birthday problem, and we might not be
> safe in the range of an integer.
>
> Do you have any suggestions about this one?
>
> Thanks,
> Serkan
>
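
The birthday-problem concern quoted above can be put into rough numbers. The following is only a sketch under stated assumptions: the hash is assumed to spread keys uniformly, "the range of an integer" is assumed to mean 31 or 32 bits, and the 40,000 figure is the index size mentioned elsewhere in this thread. The function name is hypothetical and nothing here is part of Lucy.

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_keys hash to the same value.

    Assumes a uniform hash over 2**hash_bits values and uses the standard
    birthday approximation 1 - exp(-n(n-1) / (2d)).
    """
    space = 2 ** hash_bits
    exponent = -num_keys * (num_keys - 1) / (2 * space)
    # -expm1(x) equals 1 - exp(x) but stays accurate for tiny probabilities.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # 40,000 documents is the size mentioned in the thread; the bit widths
    # are assumptions made for illustration.
    for bits in (31, 32, 64):
        p = collision_probability(40_000, bits)
        print(f"{bits}-bit hash, 40,000 keys: p(collision) ~ {p:.3g}")
```

Under these assumptions, a 32-bit hash gives a noticeable chance of at least one collision at 40,000 keys, which supports the worry that an integer-sized hash is not a safe substitute for a real primary key field.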

Re: [lucy-user] Re: Regarding document Ids

Posted by Serkan Mulayim <se...@gmail.com>.

Thank you Peter for your comments. Regards...

On Wed, Nov 16, 2016 at 3:05 PM, Peter Karman <pe...@peknet.com> wrote:

> Serkan Mulayim wrote on 11/16/16, 2:21 PM:
>
>> Thank you Peter for your quick response.
>>
>> As I understand it, before adding new documents to the index, you delete by
>> query (using your primary key). How is the performance on your end,
>> then? Since delete by query will search through all segments in the index
>> for the deletion, I feel like the performance would be affected. Roughly,
>> how many documents do you have in your index, and what is the document
>> size?
>>
>> BTW, my document sizes are very small, and I think I will have around 40K
>> documents.
>>
>>
> Performance is fast enough for me. I have 1MM+ docs but not much churn
> (not updating docs constantly). IME the bottleneck is not the search. It's
> a search engine; it's pretty fast. The bottleneck is updating the index.
> That's true whether you delete first or not.
>
>
>
> --
> Peter Karman  .  http://peknet.com/  .  peter@peknet.com
>

Re: [lucy-user] Re: Regarding document Ids

Posted by Peter Karman <pe...@peknet.com>.

Serkan Mulayim wrote on 11/16/16, 2:21 PM:
> Thank you Peter for your quick response.
>
> As I understand it, before adding new documents to the index, you delete by
> query (using your primary key). How is the performance on your end,
> then? Since delete by query will search through all segments in the index
> for the deletion, I feel like the performance would be affected. Roughly,
> how many documents do you have in your index, and what is the document size?
>
> BTW, my document sizes are very small, and I think I will have around 40K
> documents.
>

Performance is fast enough for me. I have 1MM+ docs but not much churn (not
updating docs constantly). IME the bottleneck is not the search. It's a search
engine; it's pretty fast. The bottleneck is updating the index. That's true
whether you delete first or not.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Re: Regarding document Ids

Posted by Serkan Mulayim <se...@gmail.com>.

Thank you Peter for your quick response.

As I understand it, before adding new documents to the index, you delete by
query (using your primary key). How is the performance on your end,
then? Since delete by query will search through all segments in the index
for the deletion, I feel like the performance would be affected. Roughly,
how many documents do you have in your index, and what is the document size?

BTW, my document sizes are very small, and I think I will have around 40K
documents.

Thanks,
Serkan

On Wed, Nov 16, 2016 at 11:25 AM, Peter Karman <pe...@peknet.com> wrote:

> Serkan Mulayim wrote on 11/16/16, 1:17 PM:
>
>> Hi guys,
>>
>> I think I need to simplify my question. After reading it one more time, I
>> realized I touched on many things, and it seems confusing.
>>
>> It seems that if we index the same document twice, a new document is
>> created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html,
>> "If you truly need a primary key field, you must define it and populate it
>> yourself". How can we do this? Are there any examples of this? Should I
>> search for the document by its primary key before indexing and, if it
>> exists, skip indexing it?
>>
>
> What I do in all my apps is use delete_by_term
> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Index/Indexer.pod#delete_by_term
>
> I have my own primary key system that varies based on the application.
> Sometimes it is a URI, sometimes a db PK. I maintain the document integrity
> myself.
>
> One example from how Dezi solves this more generally:
>
> https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Indexer.pm#L451
>
> Lucy isn't an RDBMS. It just tokenizes the fields you shove into it, and
> retrieves them very quickly.
>
>
> --
> Peter Karman  .  http://peknet.com/  .  peter@peknet.com
>

Re: [lucy-user] Re: Regarding document Ids

Posted by Peter Karman <pe...@peknet.com>.

Serkan Mulayim wrote on 11/16/16, 1:17 PM:
> Hi guys,
>
> I think I need to simplify my question. After reading it one more time, I
> realized I touched on many things, and it seems confusing.
>
> It seems that if we index the same document twice, a new document is
> created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, "If
> you truly need a primary key field, you must define it and populate it
> yourself". How can we do this? Are there any examples of this? Should I
> search for the document by its primary key before indexing and, if it
> exists, skip indexing it?

What I do in all my apps is use delete_by_term
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Index/Indexer.pod#delete_by_term

I have my own primary key system that varies based on the application. Sometimes 
it is a URI, sometimes a db PK. I maintain the document integrity myself.

One example from how Dezi solves this more generally:

https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Indexer.pm#L451

Lucy isn't an RDBMS. It just tokenizes the fields you shove into it, and
retrieves them very quickly.
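
The delete-then-add pattern described above might be sketched as follows. This is a toy in-memory model written in Python, not Lucy's API: the ToyIndex class, the upsert helper, and the "uri" key field are all hypothetical, and only the method name delete_by_term is borrowed from the Lucy Indexer documentation linked above. In a real index the key field would presumably need to be indexed as an exact, untokenized value for the match to work.

```python
class ToyIndex:
    """In-memory stand-in for a search index (NOT Lucy's API)."""

    def __init__(self):
        self._docs = {}    # internal doc id -> stored fields
        self._next_id = 1  # internal ids are assigned by the index, not the caller

    def add_doc(self, fields):
        # Adding never checks for duplicates, so the same document can be
        # stored twice under two different internal ids.
        self._docs[self._next_id] = dict(fields)
        self._next_id += 1

    def delete_by_term(self, field, term):
        # Drop every document whose field exactly matches term.
        self._docs = {
            doc_id: doc
            for doc_id, doc in self._docs.items()
            if doc.get(field) != term
        }

    def count(self, field, term):
        return sum(1 for doc in self._docs.values() if doc.get(field) == term)


def upsert(index, key_field, fields):
    """Remove any document sharing the primary key, then add the new version."""
    index.delete_by_term(key_field, fields[key_field])
    index.add_doc(fields)


if __name__ == "__main__":
    index = ToyIndex()
    upsert(index, "uri", {"uri": "/a", "title": "first version"})
    upsert(index, "uri", {"uri": "/a", "title": "second version"})
    print(index.count("uri", "/a"))
```

The point of the sketch is that uniqueness is enforced by the application, which always deletes by its own key before adding, rather than by the index's internal document ids.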


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com