Posted to java-user@lucene.apache.org by Igor Shalyminov <is...@yandex-team.ru> on 2013/05/22 16:11:52 UTC

Getting position increments directly from the index

Hello!

I'm storing sentence bounds in the index as position increments of 1000.
I want to get the total number of sentences in the index, i.e. the number of "1000" increment values.
Is there a way to do that other than loading each document and extracting position increments with a custom Analyzer?

-- 
Best Regards,
Igor Shalyminov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Getting position increments directly from the index

Posted by Jack Krupansky <ja...@basetechnology.com>.

Take a look at the Term Vectors Component:
http://wiki.apache.org/solr/TermVectorComponent

-- Jack Krupansky


Re: Getting position increments directly from the index

Posted by Jack Krupansky <ja...@basetechnology.com>.

If you add a special "end of document" term, some of these calculations
might be easier.

And give that special term a payload holding the sentence count.

While you're at it, insert "end of sentence" terms, each of which could
carry the sentence number as a payload.
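
The marker-term idea above can be modelled in plain Java, without any Lucene classes. This is only a sketch under stated assumptions: the marker strings "__eos__" and "__eod__", the Token class, and the 1-based sentence numbering are all made up for illustration, and a real implementation would attach the numbers as payloads in a custom token filter rather than as an int field.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of marker terms that carry a number as their "payload". */
final class PayloadMarkerModel {
    // Hypothetical marker terms; any strings that cannot occur as real words would do.
    static final String END_OF_SENTENCE = "__eos__";
    static final String END_OF_DOCUMENT = "__eod__";

    /** A term plus an integer payload; ordinary words use -1 to mean "no payload". */
    static final class Token {
        final String term;
        final int payload;
        Token(String term, int payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /** Emits the words of each sentence, a numbered end-of-sentence marker, and a final end-of-document marker. */
    static List<Token> tokenize(List<List<String>> sentences) {
        List<Token> out = new ArrayList<>();
        int sentenceNumber = 0;
        for (List<String> sentence : sentences) {
            for (String word : sentence) {
                out.add(new Token(word, -1));
            }
            sentenceNumber++; // assumption: sentences are numbered from 1
            out.add(new Token(END_OF_SENTENCE, sentenceNumber));
        }
        // The end-of-document marker carries the sentence count of the whole document.
        out.add(new Token(END_OF_DOCUMENT, sentenceNumber));
        return out;
    }

    /** Reads the sentence count back from the end-of-document marker alone. */
    static int sentenceCount(List<Token> tokens) {
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("no tokens");
        }
        Token last = tokens.get(tokens.size() - 1);
        if (!END_OF_DOCUMENT.equals(last.term)) {
            throw new IllegalArgumentException("missing end-of-document marker");
        }
        return last.payload;
    }
}
```

With this layout, the per-document count comes from a single marker per document, so the total for the index would be the sum of those payloads.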

-- Jack Krupansky

Re: Getting position increments directly from the index

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, May 23, 2013 at 9:54 AM, Igor Shalyminov
<is...@yandex-team.ru> wrote:

> But, just to clarify, is there a way to get, let's say, a vector of position increments directly from the index, without re-parsing document contents?

Term vectors (as Jack suggested) are one option, but they are very
heavy: they slow down indexing, take a lot of disk space, and are slow
to load at search time (one seek per document).

You can enumerate all positions for each term/doc pair in the postings,
but you'd then need to collate by document to get the max position (last
term) for that document.  I guess an int[maxDoc] would do the trick,
then walk that array dividing each maxPosition by 1000.  Or index the
sentence token :)
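
The collation step described above can be sketched in plain Java. No Lucene classes are used here: the Posting type and both method names are hypothetical stand-ins for whatever a real postings walk would produce. One caveat, assuming each boundary adds exactly 1000 and every other token adds 1: the integer division gives the exact number of boundaries only while the ordinary one-step increments in a document sum to less than 1000; longer documents would be overcounted.

```java
import java.util.Arrays;

/** Plain-Java model of collating postings by document; no Lucene classes are used. */
final class MaxPositionCollator {
    /** One (document, position) pair, as a walk over the postings would yield it. Hypothetical type. */
    static final class Posting {
        final int docId;
        final int position;
        Posting(int docId, int position) {
            this.docId = docId;
            this.position = position;
        }
    }

    /** Returns the largest position seen per document, or -1 for documents with no postings. */
    static int[] maxPositions(Iterable<Posting> postings, int maxDoc) {
        int[] max = new int[maxDoc];
        Arrays.fill(max, -1);
        // Postings arrive grouped by term, not by document, so every pair must be checked.
        for (Posting p : postings) {
            if (p.position > max[p.docId]) {
                max[p.docId] = p.position;
            }
        }
        return max;
    }

    /** Sums maxPosition / boundaryIncrement over all documents that have at least one posting. */
    static long countBoundaries(int[] maxPositions, int boundaryIncrement) {
        long total = 0;
        for (int m : maxPositions) {
            if (m >= 0) {
                total += m / boundaryIncrement;
            }
        }
        return total;
    }
}
```

The result counts boundaries, not sentences; if boundaries sit only between sentences, each non-empty document holds one more sentence than it has boundaries.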

Mike McCandless

http://blog.mikemccandless.com


Re: Getting position increments directly from the index

Posted by Igor Shalyminov <is...@yandex-team.ru>.

Thanks, Mike and Jack!

Those are really good options.
But, just to clarify, is there a way to get, let's say, a vector of position increments directly from the index, without re-parsing document contents?

-- 
Best Regards,
Igor


Re: Getting position increments directly from the index

Posted by Jack Krupansky <ja...@basetechnology.com>.

It would be nice to be able to ask for the largest position for a field in a
document. Is that information kept anywhere? Not that I know of, although I
suppose it can be calculated at runtime by running through all the terms of
the field. Then he could just divide by 1000.

-- Jack Krupansky


Re: Getting position increments directly from the index

Posted by Michael McCandless <lu...@mikemccandless.com>.

Do you actually index the sentence boundary as a token?  If so, you
could just get the totalTermFreq of that token?
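
To illustrate why the total frequency of a sentence token equals the sentence count, here is a plain-Java model with no Lucene classes. The marker string "__eos__", the naive punctuation-based splitting, and the method names are assumptions made for this sketch; in a real index the marker would be emitted by the analysis chain, and the sum computed here is the figure that a call such as IndexReader.totalTermFreq(Term) reports for that term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java model of "index a sentence token, then read its total frequency". */
final class SentenceMarkerModel {
    /** Hypothetical marker term; any string that cannot occur as a real word would do. */
    static final String MARKER = "__eos__";

    /** Splits on '.', '!' and '?' and appends MARKER after the words of each non-empty sentence. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String sentence : text.split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            for (String word : trimmed.split("\\s+")) {
                tokens.add(word.toLowerCase(Locale.ROOT));
            }
            tokens.add(MARKER);
        }
        return tokens;
    }

    /** Builds a map from each term to its total number of occurrences across all documents. */
    static Map<String, Long> totalTermFreqs(List<String> docs) {
        Map<String, Long> freqs = new HashMap<>();
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                freqs.merge(token, 1L, Long::sum);
            }
        }
        return freqs;
    }
}
```

Looking up MARKER in the resulting map gives the number of sentences across all documents, with no per-document pass at query time.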


Mike McCandless

http://blog.mikemccandless.com

