Posted to user@lucy.apache.org by Aleksandar Radovanovic <Al...@Radovanovic.com> on 2014/01/14 10:03:35 UTC

[lucy-user] Get doc_id during indexing?

Hi there,

Is it possible to get the doc_id during the indexing process, or can I
simply assume that the doc_id starts from 0 and increments with each
record added?

Basically, I need something like this SQL statement:
INSERT INTO tbl (name) VALUES ('John') RETURNING id
so that after each INSERT I can extend the list of document IDs in which
the name John appears.

For example, I want to build a hash that maps people's names to lists of
internal doc_ids:

my %keyword_to_doc_id;
while (...) {
    my $content = ...;    # get a document
    my $keyword = ...;    # get a person's name

    $indexer->add_doc( { doc_content => $content, ... } );

    # <doc_id> is the value I would like to obtain here
    push @{ $keyword_to_doc_id{$keyword} }, <doc_id>
        if index( $content, $keyword ) >= 0;    # keyword occurs in content
}
$indexer->commit;
...
make another index of keywords appearing in the indexed documents, without
a time-consuming search of the previously created index for millions of
predefined keywords

For text-mining purposes, I can later analyze only the index of
predefined keywords (metadata), and extend the search to the much bigger
document index only when needed.

Alex

Re: [lucy-user] Get doc_id during indexing?

Posted by Aleksandar Radovanovic <Al...@Radovanovic.com>.

Good to know.
I wish to thank you again for the amazing work on Lucy and other CPAN 
modules.

Alex

On 2014-1-14, 9:52 PM, Peter Karman wrote:
> On 1/14/14 11:44 AM, Aleksandar Radovanovic wrote:
>> Thank you Peter.
>>
>> Actually, I am using the method you suggested. I was thinking that
>> having another field for the record identification is an overhead since
>> the doc_id  is the minimal and the fastest (if I am not mistaken)
>> possible way to retrieve records.
>
>
> The doc_id is ephemeral. It can change whenever an index changes 
> (segments getting merged, etc.).
>
>
>

Re: [lucy-user] Get doc_id during indexing?

Posted by Peter Karman <pe...@peknet.com>.

On 1/14/14 11:44 AM, Aleksandar Radovanovic wrote:
> Thank you Peter.
>
> Actually, I am using the method you suggested. I was thinking that
> having another field for the record identification is an overhead since
> the doc_id  is the minimal and the fastest (if I am not mistaken)
> possible way to retrieve records.


The doc_id is ephemeral. It can change whenever an index changes 
(segments getting merged, etc.).



-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Get doc_id during indexing?

Posted by Aleksandar Radovanovic <Al...@Radovanovic.com>.

Thank you Peter.

Actually, I am already using the method you suggested. I was thinking
that having another field for record identification adds overhead, since
the doc_id is the smallest and fastest (if I am not mistaken) possible
way to retrieve records.

Regards,
Alex

On 2014-1-14, 6:18 PM, Peter Karman wrote:
> On 1/14/14 3:03 AM, Aleksandar Radovanovic wrote:
>> Hi there,
>>
>> I was wondering is it possible to get doc_id during the indexing
>> process, or can I simply assume that doc_id starts from 0 and increments
>> with each record added?
>>
>>
>
> Even if you could, I would not recommend that approach for solving 
> your problem. The doc_id is an internal implementation detail.
>
> Instead, why not assign a unique term (like a URI) to each document in 
> your index, and reference that externally?
>
> You could also, post indexing, iterate over the Lexicons in an index 
> and create a new index based on your keyword identification. Note that 
> 'keyword' might be a misnomer depending on what Analysis classes you 
> apply to your documents: i.e., you might have phrases, etc., not just 
> single terms.
>
>

Re: [lucy-user] Get doc_id during indexing?

Posted by Peter Karman <pe...@peknet.com>.

On 1/14/14 3:03 AM, Aleksandar Radovanovic wrote:
> Hi there,
>
> I was wondering is it possible to get doc_id during the indexing
> process, or can I simply assume that doc_id starts from 0 and increments
> with each record added?
>
>

Even if you could, I would not recommend that approach for solving your 
problem. The doc_id is an internal implementation detail.

Instead, why not assign a unique term (like a URI) to each document in 
your index, and reference that externally?
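
A minimal sketch of that idea, not a definitive implementation: each
document gets its own stable key in a StringType field, and the external
hash stores that key instead of the doc_id. The field name 'uri', the
'doc:N' keys, the sample documents, the keyword list and the helper
fetch_by_uri are all made up for illustration; a RAMFolder is used only
so the example is self-contained.

```perl
use strict;
use warnings;
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Plan::StringType;
use Lucy::Analysis::EasyAnalyzer;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;
use Lucy::Search::TermQuery;
use Lucy::Store::RAMFolder;

# Hypothetical sample data: [ stable key, content ].
my @docs = (
    [ 'doc:1', 'John met Mary in Paris.' ],
    [ 'doc:2', 'Mary wrote to Peter.' ],
    [ 'doc:3', 'John and Peter went fishing.' ],
);
my @keywords = qw( John Mary Peter );

my $schema = Lucy::Plan::Schema->new;
$schema->spec_field(
    name => 'doc_content',
    type => Lucy::Plan::FullTextType->new(
        analyzer => Lucy::Analysis::EasyAnalyzer->new( language => 'en' ),
    ),
);

# A StringType field is indexed as one un-analyzed term, so it can act
# as an exact-match key.
$schema->spec_field( name => 'uri', type => Lucy::Plan::StringType->new );

# In-memory index, assumed here only to keep the example self-contained.
my $folder  = Lucy::Store::RAMFolder->new;
my $indexer = Lucy::Index::Indexer->new(
    index  => $folder,
    schema => $schema,
    create => 1,
);

# Map each keyword to the stable keys of the documents containing it.
my %keyword_to_uri;
for my $doc (@docs) {
    my ( $uri, $content ) = @$doc;
    $indexer->add_doc( { uri => $uri, doc_content => $content } );
    for my $keyword (@keywords) {
        push @{ $keyword_to_uri{$keyword} }, $uri
            if index( $content, $keyword ) >= 0;
    }
}
$indexer->commit;

# Later: fetch a document by its stable key rather than by doc_id.
my $searcher = Lucy::Search::IndexSearcher->new( index => $folder );

sub fetch_by_uri {
    my ($uri) = @_;
    my $hits = $searcher->hits(
        query      => Lucy::Search::TermQuery->new( field => 'uri', term => $uri ),
        num_wanted => 1,
    );
    my $hit = $hits->next or return;
    return $hit->{doc_content};
}

for my $uri ( @{ $keyword_to_uri{John} } ) {
    print "$uri: ", fetch_by_uri($uri), "\n";
}
```

The extra field costs one short term per document, and in exchange the
external hash stays valid when the index is updated or its segments are
merged.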

You could also, post indexing, iterate over the Lexicons in an index and 
create a new index based on your keyword identification. Note that 
'keyword' might be a misnomer depending on what Analysis classes you 
apply to your documents: i.e., you might have phrases, etc., not just 
single terms.
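
A rough sketch of walking the Lexicons after indexing, under stated
assumptions: the field name 'doc_content', the sample documents and the
in-memory RAMFolder are made up for illustration. It only collects each
term together with its document frequency; building the second index
from that is left out.

```perl
use strict;
use warnings;
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::EasyAnalyzer;
use Lucy::Index::Indexer;
use Lucy::Index::IndexReader;
use Lucy::Store::RAMFolder;

# Build a tiny in-memory index (hypothetical sample data).
my $schema = Lucy::Plan::Schema->new;
$schema->spec_field(
    name => 'doc_content',
    type => Lucy::Plan::FullTextType->new(
        analyzer => Lucy::Analysis::EasyAnalyzer->new( language => 'en' ),
    ),
);
my $folder  = Lucy::Store::RAMFolder->new;
my $indexer = Lucy::Index::Indexer->new(
    index  => $folder,
    schema => $schema,
    create => 1,
);
$indexer->add_doc( { doc_content => $_ } ) for (
    'John met Mary in Paris.',
    'Mary wrote to Peter.',
    'John and Peter went fishing.',
);
$indexer->commit;

# Walk the Lexicon of every segment and total the document frequency
# of each term in the 'doc_content' field.
my $reader = Lucy::Index::IndexReader->open( index => $folder );
my %doc_freq;
for my $seg_reader ( @{ $reader->seg_readers } ) {
    my $lex_reader = $seg_reader->obtain('Lucy::Index::LexiconReader');

    # Skip segments that have no Lexicon for this field.
    my $lexicon = $lex_reader->lexicon( field => 'doc_content' ) or next;
    while ( $lexicon->next ) {
        my $term = $lexicon->get_term;
        $doc_freq{$term} += $lex_reader->doc_freq(
            field => 'doc_content',
            term  => $term,
        );
    }
}

# The terms are whatever the analyzer produced (case-folded and stemmed
# here), not the original surface forms.
printf "%-10s %d\n", $_, $doc_freq{$_} for sort keys %doc_freq;
```

The terms that come out are the analyzer's output, so a predefined
keyword such as "John" would have to be passed through the same
analysis before it can be matched against this list.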

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com