Posted to dev@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2011/06/12 15:39:22 UTC

Memory consumption in trunk for sorting and faceting

OK, I know lots of great work has been done to reduce the memory
footprint for sorting and faceting, but what I'm seeing is drastic
enough that I want to see if I'm missing something and to ask what
finer-grained tools people are using to answer the question "How much
more memory efficient is the new way of doing things"?

Setup:

I've indexed 1.9M Wikipedia articles. I fire up a fresh Solr, send it
a relatively insane query while monitoring in JConsole, then force a
GC from JConsole and look at the memory used by Solr.
Crude, but I'm trying to get a flavor of what's going on here.
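
For a slightly less manual version of the same check, here is a minimal
sketch that requests a GC and reads used heap through the standard
java.lang.management API. It measures the JVM it runs in, so it would
have to run inside Solr's JVM (or be adapted to a remote JMX connection)
to be comparable. The class name HeapProbe and the int[] workload are
placeholders for illustration, not anything from Solr.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapProbe {
    /** Requests a GC, then returns the heap bytes the JVM reports as used. */
    public static long usedHeapAfterGc() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        memory.gc(); // effectively System.gc(); a request, not a guarantee
        return memory.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        long before = usedHeapAfterGc();

        // Hypothetical workload standing in for the sort/facet caches:
        // one int per document in the 1.9M-document index.
        int[] ords = new int[1_917_727];

        long after = usedHeapAfterGc();
        System.out.printf("before=%d KB, after=%d KB, delta=%d KB%n",
                before / 1024, after / 1024, (after - before) / 1024);

        // Touch the array after measuring so it stays reachable until here.
        if (ords.length == 0) {
            throw new IllegalStateException("unreachable");
        }
    }
}
```

The numbers are as crude as the JConsole ones: a single reading after a
requested GC, not a per-structure breakdown.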

Field       Unique values   Type
id          1,917,727       string
user_sort   62,123          string
text        57,759          text (1.4.1 flavor for all three Solr versions)
user_id     62,122          int

http://localhost:8983/solr/select/?q=*:*&version=2.2&start=0&rows=10&indent=on&sort=user_sort asc, id desc&facet=on&facet.field=text&facet.field=user_id&facet.field=id
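
The URL above is written the way you'd paste it into a browser, with raw
spaces in the sort clause. A small sketch of building the same request
with the values URL-encoded, using only java.net.URLEncoder (the class
and method names here are made up for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FacetQueryUrl {
    /** URL-encodes one parameter value as UTF-8. */
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Builds the sort-plus-facet query from the post against the given base URL. */
    public static String build(String base) {
        return base
                + "?q=" + enc("*:*")
                + "&version=2.2&start=0&rows=10&indent=on"
                + "&sort=" + enc("user_sort asc, id desc")
                + "&facet=on"
                + "&facet.field=text"
                + "&facet.field=user_id"
                + "&facet.field=id";
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr/select/"));
    }
}
```

With this encoding the sort clause comes out as
sort=user_sort+asc%2C+id+desc and the query as q=*%3A*.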

Yeah, yeah, yeah, faceting and sorting by a unique ID is silly. But it
*does* stress memory.

Anyway, here are the numbers I'm seeing:

Version   Memory used (after GC)
1.4.1     328 M
3.2       328 M
trunk      90 M

And it's even more impressive than that when you consider that 20M or
so is just to get in the door.....

Is it fair to say that the two big innovations that have reduced the
memory footprint are:
1> going to byte arrays for string storage
2> the FST work?

Final question. It looks like the FST work is back-ported to the
current 3_x code branch, is that true? Anything else back-ported
there? I'll check that branch out and give it a whirl for kicks.

Thanks,
Erick

A novice programmer gets a program to compile and says "I'm sure it'll
run fine now"
A veteran programmer runs a program for the first time, gets the
expected results and says "I must have done something wrong, that
can't *really* be working".

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Memory consumption in trunk for sorting and faceting

Posted by Michael McCandless <lu...@mikemccandless.com>.

The FST lib is back-ported to 3.x, but nothing makes use of it there (yet!).

Ie, in 3.x the terms index is still several objects per unique term
(TermInfo, Term, String).

For 3.x it's both the object overhead and the fact that all chars are 2
bytes (UTF16), and also that each indexed term takes a full long, vs
the FST, which compresses shared outputs so that each term uses far
less than that on average.
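
To make the "several objects per term vs. one byte[]" point concrete,
here is a minimal sketch that packs terms as UTF-8 into a single byte[]
plus an int[] of offsets. This is NOT the FST and not Lucene code: an
FST goes further by sharing structure between terms and compressing
outputs. The class PackedTerms is hypothetical and only shows the
"a couple of big arrays instead of many small objects" layout.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class PackedTerms {
    private final byte[] bytes;  // all terms as UTF-8, stored back to back
    private final int[] offsets; // term i occupies bytes[offsets[i] .. offsets[i + 1])

    public PackedTerms(String[] terms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[terms.length + 1];
        for (int i = 0; i < terms.length; i++) {
            offsets[i] = out.size();
            byte[] utf8 = terms[i].getBytes(StandardCharsets.UTF_8);
            out.write(utf8, 0, utf8.length);
        }
        offsets[terms.length] = out.size();
        bytes = out.toByteArray();
    }

    public int size() {
        return offsets.length - 1;
    }

    public int totalBytes() {
        return bytes.length;
    }

    /** Decodes term i on demand; no per-term object is kept around. */
    public String get(int i) {
        return new String(bytes, offsets[i], offsets[i + 1] - offsets[i],
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        PackedTerms terms = new PackedTerms(new String[] {"apache", "lucene", "solr"});
        // prints "3 terms in 16 bytes; term 1 = lucene"
        System.out.println(terms.size() + " terms in " + terms.totalBytes()
                + " bytes; term 1 = " + terms.get(1));
    }
}
```

However many terms go in, the heap holds two arrays rather than a
TermInfo, a Term and a String per term.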

Mike McCandless

http://blog.mikemccandless.com

On Sun, Jun 12, 2011 at 10:49 AM, Erick Erickson
<er...@gmail.com> wrote:
> Yep, thanks. If I'm reading the JIRA right, the FST
> stuff is on 3x, and I just tested that the same way
> and got a memory footprint comparable to 1.4.1...
>
> So this is pretty much all in the ByteRefs, right?
>
> And my crude tests hit the worst case, unique strings..
>
> Thanks
> Erick
>
> On Sun, Jun 12, 2011 at 10:06 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> Right, not using objects is a huge win, especially on 64 bit JRE.
>>
>> Cutting over to UTF8 bytes is also a big drop in certain cases, since
>> it's UTF8 vs UTF16 for 3.x.
>>
>> Ie, simple ascii fields take half the storage vs 3.x.
>>
>> Similarly, the terms index in 3.x uses multiple objects per indexed
>> Term, and no objects in trunk (since it's just a single byte[] holding
>> the FST), and also uses UTF8 to hold the term data, instead of UTF16.
>>
>> FST has been backported to 3.x but it's not used yet I think;
>> back-porting the terms index improvements would be a biggish change...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sun, Jun 12, 2011 at 9:51 AM, Dawid Weiss
>> <da...@cs.put.poznan.pl> wrote:
>>>> Is it fair to say that the two big innovations that have reduced the
>>>> memory footprint are:
>>>> 1> going to byte arrays for string storage
>>>> 2> the FST work?
>>>>
>>>> Final question. It looks like the FST work is back-ported to the
>>>> current 3_x code branch, is that true? Anything else back-ported
>>>> there? I'll check that branch out and give it a whirl for kicks.
>>>
>>> I'm guessing it's going from Strings to ByteRefs (objects have
>>> considerable overhead, really). This used to be my favorite showcase
>>> for students -- manipulate a large array of Integer[] vs. manipulate
>>> the same size array of int[]. A similar thing applies to String
>>> instances vs. ByteRefs (utf16 vs. utf8 encoding, object header
>>> overhead, etc).
>>>
>>> Dawid

Re: Memory consumption in trunk for sorting and faceting

Posted by Erick Erickson <er...@gmail.com>.

Yep, thanks. If I'm reading the JIRA right, the FST
stuff is on 3x, and I just tested that the same way
and got a memory footprint comparable to 1.4.1...

So this is pretty much all in the ByteRefs, right?

And my crude tests hit the worst case, unique strings..

Thanks
Erick

On Sun, Jun 12, 2011 at 10:06 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Right, not using objects is a huge win, especially on 64 bit JRE.
>
> Cutting over to UTF8 bytes is also a big drop in certain cases, since
> it's UTF8 vs UTF16 for 3.x.
>
> Ie, simple ascii fields take half the storage vs 3.x.
>
> Similarly, the terms index in 3.x uses multiple objects per indexed
> Term, and no objects in trunk (since it's just a single byte[] holding
> the FST), and also uses UTF8 to hold the term data, instead of UTF16.
>
> FST has been backported to 3.x but it's not used yet I think;
> back-porting the terms index improvements would be a biggish change...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sun, Jun 12, 2011 at 9:51 AM, Dawid Weiss
> <da...@cs.put.poznan.pl> wrote:
>>> Is it fair to say that the two big innovations that have reduced the
>>> memory footprint are:
>>> 1> going to byte arrays for string storage
>>> 2> the FST work?
>>>
>>> Final question. It looks like the FST work is back-ported to the
>>> current 3_x code branch, is that true? Anything else back-ported
>>> there? I'll check that branch out and give it a whirl for kicks.
>>
>> I'm guessing it's going from Strings to ByteRefs (objects have
>> considerable overhead, really). This used to be my favorite showcase
>> for students -- manipulate a large array of Integer[] vs. manipulate
>> the same size array of int[]. A similar thing applies to String
>> instances vs. ByteRefs (utf16 vs. utf8 encoding, object header
>> overhead, etc).
>>
>> Dawid

Re: Memory consumption in trunk for sorting and faceting

Posted by Michael McCandless <lu...@mikemccandless.com>.

Right, not using objects is a huge win, especially on 64 bit JRE.

Cutting over to UTF8 bytes is also a big drop in certain cases, since
it's UTF8 vs UTF16 for 3.x.

Ie, simple ascii fields take half the storage vs 3.x.
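
A quick way to see the "half the storage" effect is to compare the
encoded sizes directly. The sketch below assumes the 2011-era layout in
which a String holds 2 bytes per char; the class name and sample values
are just for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    /** Bytes needed to hold the text as UTF-8. */
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed at 2 bytes per Java char (UTF-16 code unit). */
    public static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        // Pure ASCII: 9 bytes as UTF-8 vs 18 as UTF-16.
        System.out.println(utf8Bytes("user_sort") + " vs " + utf16Bytes("user_sort"));
        // One non-ASCII letter (U+0144): 7 bytes as UTF-8 vs 12 as UTF-16.
        System.out.println(utf8Bytes("Pozna\u0144") + " vs " + utf16Bytes("Pozna\u0144"));
    }
}
```

This counts character data only; the per-object overhead of each String
comes on top of it.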

Similarly, the terms index in 3.x uses multiple objects per indexed
Term, and no objects in trunk (since it's just a single byte[] holding
the FST), and also uses UTF8 to hold the term data, instead of UTF16.

FST has been backported to 3.x but it's not used yet I think;
back-porting the terms index improvements would be a biggish change...

Mike McCandless

http://blog.mikemccandless.com

On Sun, Jun 12, 2011 at 9:51 AM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
>> Is it fair to say that the two big innovations that have reduced the
>> memory footprint are:
>> 1> going to byte arrays for string storage
>> 2> the FST work?
>>
>> Final question. It looks like the FST work is back-ported to the
>> current 3_x code branch, is that true? Anything else back-ported
>> there? I'll check that branch out and give it a whirl for kicks.
>
> I'm guessing it's going from Strings to ByteRefs (objects have
> considerable overhead, really). This used to be my favorite showcase
> for students -- manipulate a large array of Integer[] vs. manipulate
> the same size array of int[]. A similar thing applies to String
> instances vs. ByteRefs (utf16 vs. utf8 encoding, object header
> overhead, etc).
>
> Dawid

Re: Memory consumption in trunk for sorting and faceting

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> Is it fair to say that the two big innovations that have reduced the
> memory footprint are:
> 1> going to byte arrays for string storage
> 2> the FST work?
>
> Final question. It looks like the FST work is back-ported to the
> current 3_x code branch, is that true? Anything else back-ported
> there? I'll check that branch out and give it a whirl for kicks.

I'm guessing it's going from Strings to ByteRefs (objects have
considerable overhead, really). This used to be my favorite showcase
for students -- manipulate a large array of Integer[] vs. manipulate
the same size array of int[]. A similar thing applies to String
instances vs. ByteRefs (utf16 vs. utf8 encoding, object header
overhead, etc).
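
A back-of-the-envelope version of that showcase, as a sketch. The sizes
below are ASSUMPTIONS typical of a 64-bit HotSpot JVM with compressed
references (16-byte array header, 4-byte reference, 16-byte Integer
object); other JVMs and settings will differ, alignment padding is
ignored, and every Integer is assumed to be a distinct object.

```java
public class BoxedVsPrimitive {
    // Assumed sizes, not measured values (see the note above).
    static final long ARRAY_HEADER = 16;   // header of any array
    static final long REFERENCE = 4;       // one slot in an Integer[]
    static final long INTEGER_OBJECT = 16; // object header plus the int field
    static final long INT = 4;             // one slot in an int[]

    /** Estimated bytes for an int[] of n elements. */
    public static long intArrayBytes(long n) {
        return ARRAY_HEADER + n * INT;
    }

    /** Estimated bytes for an Integer[] of n elements, each a separate object. */
    public static long integerArrayBytes(long n) {
        return ARRAY_HEADER + n * (REFERENCE + INTEGER_OBJECT);
    }

    public static void main(String[] args) {
        long n = 1_917_727; // one value per document, as in the test index
        long primitive = intArrayBytes(n);
        long boxed = integerArrayBytes(n);
        System.out.printf("int[]:     %,d bytes%n", primitive);
        System.out.printf("Integer[]: %,d bytes%n", boxed);
        System.out.printf("ratio:     %.1fx%n", (double) boxed / primitive);
    }
}
```

Under these assumptions the boxed array costs roughly five times the
primitive one, before any difference in the data itself.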

Dawid
