Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2014/01/30 23:42:56 UTC

[lucy-dev] Re: [lucy-user] 32 bit CentOS Indexing Question

On Jan 29, 2014, at 02:59 , Marvin Humphrey <ma...@rectangular.com> wrote:

> On Tue, Jan 28, 2014 at 11:26 AM, Nick D. <nd...@globaldataguard.com> wrote:
>> Why do I get a memory allocation error on a 32 bit OS and not a 64 bit OS?
> 
> It's probably a known architectural flaw in SortWriter which makes it consume
> too much RAM.

This issue should be resolved in the sortfieldwriter branch. The following two commits are the crucial ones:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=0e49ac6f6ca45860d5598060b89bdac3fbfed2db

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=a5aa40a93d0b2542dc04afd387619b015cf273b5

It would be interesting to know whether they make a difference in Nick D.’s case. If they solve his problem, we should consider backporting the fix to the 0.3 branch.

Nick W.


Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
Nick Wellnhofer wrote
> On Jan 29, 2014, at 02:59 , Marvin Humphrey <marvin@> wrote:
> 
>> On Tue, Jan 28, 2014 at 11:26 AM, Nick D. <ndwyer@> wrote:
>>> Why do I get a memory allocation error on a 32 bit OS and not a 64 bit
>>> OS?
>> 
>> It's probably a known architectural flaw in SortWriter which makes it
>> consume too much RAM.
> 
> 
> It would be interesting to know whether they make a difference in Nick
> D.’s case. If they solve his problem, we should consider backporting the
> fix to the 0.3 branch.
> 
> Nick W.

The branch did not fix the issue, but committing every 50k records or so
(syslog-style records) did, and it sped up indexing a bit. I was able to
install the sortfieldwriter branch; unfortunately, set_default_mem_thresh
did not speed up indexing (possibly the I/O device is the bottleneck).




--
View this message in context: http://lucene.472066.n3.nabble.com/32-bit-CentOS-Indexing-Question-tp4114036p4117201.html
Sent from the lucy-user mailing list archive at Nabble.com.

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Nick Wellnhofer <we...@aevum.de>.
On 05/02/2014 00:10, Nick D. wrote:
> I've checked out the sortfieldwriter branch like so:
> git clone https://git-wip-us.apache.org/repos/asf/lucy.git
> cd lucy
> git checkout -b test origin/sortfieldwriter
>
> And when I do a `git log` I see the commits:

Looks good.

> But when I look at the commit you replied with
> "https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=0e49ac6f6ca45860d5598060b89bdac3fbfed2db"
> and at the file SortFieldWriter.c here:
> https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/SortFieldWriter.c;h=6ae42d10a3e62f6e99058d668cb0b90fd91b53b1;hb=0e49ac6f6ca45860d5598060b89bdac3fbfed2db
>
> I don't see the function "S_compare_doc_ids_by_ord_rev" in my branch. I've
> made sure to do a `git remote update` and merge, but it was up to date. Is
> there something extra that I need to do?

That's OK. The function S_compare_doc_ids_by_ord_rev was removed in a later 
commit in the branch.

Nick

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
Nick Wellnhofer wrote
> On Jan 31, 2014, at 21:43 , Nick D. <ndwyer@> wrote:
> 
> If you check out the sortfieldwriter branch, you’ll get all these commits.
> If you’re using the 0.3 branch, you have to apply them one-by-one. There’s
> a good chance that this will work without conflicts.
> 
> Nick

I've checked out the sortfieldwriter branch like so:
git clone https://git-wip-us.apache.org/repos/asf/lucy.git
cd lucy
git checkout -b test origin/sortfieldwriter

And when I do a `git log` I see the commits:

> commit ad178f10692659b4ed8b170ebfa42d13fd3eed20
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 19:43:42 2013 +0200
> 
>     Use counting sort to sort doc_ids in SortFieldWriter#Refill
>     
>     Since we already have the ordinals for each doc_id, we can use a
>     counting sort. This uses a temporary array of size run_cardinality but
>     runs in linear time.
> 
> commit 98a960ed16c601569ab1b78e5a3e1e9302065180
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 02:28:04 2013 +0200
> 
>     Free sorted_ids in SortFieldWriter a little earlier
> 
> commit 393723d354d8ce44841cd006a26d03894315088d
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 02:13:49 2013 +0200
> 
>     Initialize SortFieldWriter#run_tick to 1
>     
>     Make sure we never use a run_tick of 0.
> 
> commit a5aa40a93d0b2542dc04afd387619b015cf273b5
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 02:04:42 2013 +0200
> 
>     Make SortFieldWriter#Refill obey the memory limit
>     
>     The old logic was broken.
> 
> commit 0e49ac6f6ca45860d5598060b89bdac3fbfed2db
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 01:25:52 2013 +0200
> 
>     Don't sort documents twice in SortFieldWriter#Refill
>     
>     The doc_ids are already sorted in S_lazy_init_sorted_ids. We only have
>     to make sure that S_lazy_init_sorted_ids uses the doc_id as secondary
>     sort key.

But when I look at the commit you replied with
"https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=0e49ac6f6ca45860d5598060b89bdac3fbfed2db"
and at the file SortFieldWriter.c here:
https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/SortFieldWriter.c;h=6ae42d10a3e62f6e99058d668cb0b90fd91b53b1;hb=0e49ac6f6ca45860d5598060b89bdac3fbfed2db

I don't see the function "S_compare_doc_ids_by_ord_rev" in my branch. I've
made sure to do a `git remote update` and merge, but it was up to date. Is
there something extra that I need to do?

Attached is my SortFieldWriter.c file:
<http://lucene.472066.n3.nabble.com/file/n4115359/SortFieldWriter.c>

Any help is always appreciated.




Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
If I have the current Lucy-0.3.3 version that is on CPAN, how do I go about
getting those two commits mentioned earlier into the source that I have?




Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Jan 31, 2014 at 4:33 PM, Nick Wellnhofer <we...@aevum.de> wrote:
>> Are there any downsides to increasing this threshold to say 40MB?
>
> No, if you have enough memory, you can probably use a much higher value.
> Maybe Marvin can give some additional details.

It's hard to say; it might depend on CPU cache behavior.

The primary reason that global setting exists is not performance tweakery,
it's testing.

From perl/lib/Lucy/Test.pm:

    # Set the default memory threshold for PostingListWriter to a low number
    # so that we simulate large indexes by performing a lot of PostingPool
    # flushes.
    Lucy::Index::PostingListWriter::set_default_mem_thresh(0x1000);

Marvin Humphrey

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Nick Wellnhofer <we...@aevum.de>.
On Jan 31, 2014, at 23:18 , Nick D. <nd...@globaldataguard.com> wrote:

> Does the Lucy::Index::SortWriter::set_default_mem_thresh($bytes); function
> exist in the latest public 0.3.3 version of Lucy?

Yes, but it’s ineffective due to a bug which the sortfieldwriter branch should fix.

> Is there a function like this for SegWriter (I'm assuming this is used for
> writing segments that are not sortable)? If so, what is the default?

Yes, there’s

    Lucy::Index::PostingListWriter::set_default_mem_thresh($bytes);

with a default of 16MB. This affects segment merging for indexed fields.

(A segment contains data for all the fields of your schema. PostingListWriter creates the posting lists for indexed fields. SortWriter creates the sort cache for sortable fields. Both posting lists and sort caches are contained in a segment.)

> Are there any downsides to increasing this threshold to say 40MB?

No, if you have enough memory, you can probably use a much higher value. Maybe Marvin can give some additional details.

Nick



Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
Does the Lucy::Index::SortWriter::set_default_mem_thresh($bytes); function
exist in the latest public 0.3.3 version of Lucy?

Is there a function like this for SegWriter (I'm assuming this is used for
writing segments that are not sortable)? If so, what is the default?

Are there any downsides to increasing this threshold to say 40MB?




Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Nick Wellnhofer <we...@aevum.de>.
On Jan 31, 2014, at 21:43 , Nick D. <nd...@globaldataguard.com> wrote:

> Thanks Nick (cool name by the way). If I continue to have problems with this
> I will get those 2 commits and see if there is a difference.
> 
> Would these commits help with the speed of indexing? Mainly the add_doc and
> commit functions that write/re-write segments?

That’s hard to tell. The first commit should make things a bit faster. The second commit helps with memory usage when indexing many documents with sortable fields. This should actually make things slower but there’s a tunable which might help:

    Lucy::Index::SortWriter::set_default_mem_thresh($bytes);

The default is 4MB (0x400000). Larger values should speed up indexing at the expense of memory.
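
(Editorial sketch, not Lucy's code: a toy model in C of why a larger threshold can help. It assumes, purely for illustration, that a writer buffers items in memory and flushes a run to disk whenever the buffered size reaches the threshold. The function name count_flushed_runs and its parameters are hypothetical.)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, NOT Lucy's implementation: count how many runs
 * would be flushed to disk if items are buffered in memory until their
 * combined size reaches mem_thresh bytes. */
size_t
count_flushed_runs(const size_t *item_sizes, size_t num_items,
                   size_t mem_thresh) {
    assert(item_sizes != NULL || num_items == 0);

    size_t buffered = 0;  /* bytes currently held in memory */
    size_t runs     = 0;  /* runs written out so far */

    for (size_t i = 0; i < num_items; i++) {
        buffered += item_sizes[i];
        if (buffered >= mem_thresh) {
            runs++;        /* threshold reached: flush one run */
            buffered = 0;
        }
    }
    if (buffered > 0) {
        runs++;            /* flush whatever is left at the end */
    }
    return runs;
}
```

In this model, doubling the threshold roughly halves the number of runs that have to be written and later merged, while peak memory use grows with the threshold. Whether that pays off in practice depends on the workload and on where the real bottleneck is.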

The sortfieldwriter branch also contains another commit which might improve performance noticeably:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=ad178f10692659b4ed8b170ebfa42d13fd3eed20
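
(Editorial sketch: that commit's message describes using a counting sort on doc_ids because the ordinal of each doc_id is already known. The C below is a generic counting sort under that assumption, not the code from the commit. The function name sort_doc_ids_by_ord and its signature are hypothetical.)

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, NOT taken from Lucy: order doc_ids by ordinal,
 * with ties broken by ascending doc_id.
 *   ords[doc_id]  ordinal of that document's value, in [0, cardinality)
 *   sorted_ids    output array with room for num_docs entries
 * Returns 0 on success, -1 if the temporary array cannot be allocated. */
int
sort_doc_ids_by_ord(const int32_t *ords, size_t num_docs,
                    size_t cardinality, int32_t *sorted_ids) {
    /* Temporary array sized by the number of distinct ordinals. */
    size_t *starts = calloc(cardinality + 1, sizeof(size_t));
    if (starts == NULL) { return -1; }

    /* Pass 1: count how many documents carry each ordinal. */
    for (size_t doc_id = 0; doc_id < num_docs; doc_id++) {
        assert(ords[doc_id] >= 0 && (size_t)ords[doc_id] < cardinality);
        starts[ords[doc_id] + 1]++;
    }

    /* Pass 2: prefix sums turn counts into each ordinal's first slot. */
    for (size_t k = 1; k <= cardinality; k++) {
        starts[k] += starts[k - 1];
    }

    /* Pass 3: place doc_ids in ascending order, which keeps ties stable. */
    for (size_t doc_id = 0; doc_id < num_docs; doc_id++) {
        sorted_ids[starts[ords[doc_id]]++] = (int32_t)doc_id;
    }

    free(starts);
    return 0;
}
```

Each pass is a single loop over the documents or over the ordinals, so the total work is linear in num_docs plus cardinality, at the cost of one temporary array sized by the cardinality. A comparison sort would need O(n log n) comparisons for the same job.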

If you check out the sortfieldwriter branch, you’ll get all these commits. If you’re using the 0.3 branch, you have to apply them one-by-one. There’s a good chance that this will work without conflicts.

Nick


Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
Thanks Nick (cool name by the way). If I continue to have problems with this
I will get those 2 commits and see if there is a difference.

Would these commits help with the speed of indexing? Mainly the add_doc and
commit functions that write/re-write segments?


