You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Dawid Weiss <da...@gmail.com> on 2011/06/14 12:31:58 UTC

External strings sort and case folding.

Hi. While I was playing with automata recently, I had a use case
scenario when I could really use an external sort of a large list of
unicode strings. I know I could simply emulate this by creating
synthetic documents, index, etc., but is there a more "direct" way of
achieving this using Lucene's internals?

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: External strings sort and case folding.

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Jun 14, 2011 at 8:35 AM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
>> Merging FSTs sounds cool!
>
> Yep, it would be quite neat and relatively simple too. I looked at the
> FST API again, after doing some work on those compounds and there are
> several things that just hurt my eyes... I'll see if I can figure out
> a nicer API... how emotionally attached are you to that simulation of
> terminal states, Mike? :)

I'm not at all!  Fix away :)  FST is very new and it needs some good
iterating...

Just make sure the *FSTEnum work ok -- I think that's why I added the END_LABEL.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: External strings sort and case folding.

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Merging FSTs sounds cool!

Yep, it would be quite neat and relatively simple too. I looked at the
FST API again, after doing some work on those compounds and there are
several things that just hurt my eyes... I'll see if I can figure out
a nicer API... how emotionally attached are you to that simulation of
terminal states, Mike? :)

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: External strings sort and case folding.

Posted by Michael McCandless <lu...@mikemccandless.com>.
In theory, you could use the codec API directly, adding "chunks" of
pre-sorted terms, and then fake up a SegmentInfo to make it look like
some kind of degenerate segment, and then merge them?

But it's gonna be a lot of work to do that :)

Merging FSTs sounds cool!

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 14, 2011 at 8:18 AM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
>> So actually it would work if you just enum'd the terms yourself, after
>> indexing and optimizing.  And this does amount to an external sort, I
>> think!
>
> Yep. I was just curious if there's a way to do it without the overhead
> of creating fields, documents, etc. If I have a spare minute I'll try
> to write a merge sort from disk splits. It'd be neat to write FST
> merging too (so that, given to FSTs you could merge them into one by
> creating a new FST and adding sequences in order from one or the other
> source).
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: External strings sort and case folding.

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> So actually it would work if you just enum'd the terms yourself, after
> indexing and optimizing.  And this does amount to an external sort, I
> think!

Yep. I was just curious if there's a way to do it without the overhead
of creating fields, documents, etc. If I have a spare minute I'll try
to write a merge sort from disk splits. It'd be neat to write FST
merging too (so that, given to FSTs you could merge them into one by
creating a new FST and adding sequences in order from one or the other
source).

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: External strings sort and case folding.

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Jun 14, 2011 at 7:06 AM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
>> And, if you create & index such Lucene documents, and then do a
>> MatchAllDocsQuery sorting by your field, this is (unfortunately) not
>
> I was thinking about an optimized segment -- then the terms enum on a
> given field should be sorted, right?

Ahh, right.

So actually it would work if you just enum'd the terms yourself, after
indexing and optimizing.  And this does amount to an external sort, I
think!

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: External strings sort and case folding.

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> And, if you create & index such Lucene documents, and then do a
> MatchAllDocsQuery sorting by your field, this is (unfortunately) not

I was thinking about an optimized segment -- then the terms enum on a
given field should be sorted, right?


Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: External strings sort and case folding.

Posted by Michael McCandless <lu...@mikemccandless.com>.
Not that I know of.

And, if you create & index such Lucene documents, and then do a
MatchAllDocsQuery sorting by your field, this is (unfortunately) not
an external sort!  Ie, Lucene loads all terms data in RAM as packed
byte[], for merging the per-segment results.

It even does this, unnecessarily, for an optimized segment, even
though we only need ords in that case (there's an issue open for
this).

Doing a sort-by-String-field without loading the String data even when
there are multiple segments in the index would be a nice addition :)

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 14, 2011 at 6:31 AM, Dawid Weiss <da...@gmail.com> wrote:
> Hi. While I was playing with automata recently, I had a use case
> scenario when I could really use an external sort of a large list of
> unicode strings. I know I could simply emulate this by creating
> synthetic documents, index, etc., but is there a more "direct" way of
> achieving this using Lucene's internals?
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org