Posted to user@lucy.apache.org by "Nick D." <nd...@globaldataguard.com> on 2014/01/28 20:26:28 UTC

[lucy-user] 32 bit CentOS Indexing Question

Hi all,

I am having trouble indexing large files. The input is a syslog-formatted
file that is pretty large, around 4.4 GB. I create one Lucy document per line
of the log file, adding docs to the index as I go, and commit once at the
very end. During indexing the index grows to a relatively enormous size
(around 14 GB), and (I'm guessing) during the commit it uses huge amounts of
RAM, slowing the machine to a crawl. Once the commit finishes, the index
shrinks to 4.1 GB on a 64-bit system; on a 32-bit system I instead get a
malloc error saying it can't allocate more space. Both boxes have the same
amount of RAM and run the same OS; the only difference is that one is 32-bit
and the other is 64-bit.

Questions:

Why do I get a memory allocation error on a 32 bit OS and not a 64 bit OS?
Are there any 32 bit limitations of Lucy?
Why does the index grow so large and then shrink after the commit is done?
Should I commit more often?
Would committing often slow down the indexing process?
Would committing often make the overgrowth of the index go away?

Any help would be greatly appreciated,

Nick D.


Code Snippet:
# Create Schema.
my $schema = Lucy::Plan::Schema->new;
my $case_folder  = Lucy::Analysis::CaseFolder->new;
my $tokenizer    = Lucy::Analysis::RegexTokenizer->new; # purposely leave out the Stemmer
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [ $case_folder, $tokenizer ],
);
my $unstored_full_text_type = Lucy::Plan::FullTextType->new(
    analyzer => $polyanalyzer,
    stored   => 0,
);
my $unindexed_int_type = Lucy::Plan::Int64Type->new(
    indexed  => 0,
    sortable => 1,
);
my $unindexed_string_type = Lucy::Plan::StringType->new(
    indexed  => 0,
    sortable => 1,
);

$schema->spec_field( name => 'line',     type => $unstored_full_text_type );
$schema->spec_field( name => 'offset',   type => $unindexed_int_type );
$schema->spec_field( name => 'time_sec', type => $unindexed_string_type );

.........................

open(my $fh, '<', $filename ) or die "Can't open '$filename': $!";
my $offset = 0;
my $time = 0;
while ( my $line = <$fh> ) {

    # Guard against lines that don't match, so we never use stale captures.
    unless ( $line =~ /^\w+\s+\d+\s+(\d+):(\d+):(\d+)/ ) {
        $offset = tell($fh);
        next;
    }

    $time = ( $1 * 60 * 60 ) + ( $2 * 60 ) + $3;

    my %doc = (
        line     => $line,
        offset   => $offset,
        time_sec => sprintf( "%05d", $time ),
    );

    #print Dumper(\%doc);
    $indexer->add_doc( \%doc );    # ta-da!
    $offset = tell($fh);
}

$indexer->commit;
------------------------------- end of snippet -------------------------------

Example format of file to be indexed

Mar 12 12:27:00 server3 named[32172]: lame server resolving
'jakarta5.wasantara.net.id' (in 'wasantara.net.id'?): 202.159.65.171#53 
Mar 12 12:27:03 server3 named[32173]: lame server resolving
'jakarta5.wasantara.net.id' (in 'wasantara.net.id'?): 202.159.65.171#
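The timestamp arithmetic in the snippet (hours*3600 + minutes*60 + seconds) can be factored into a small defensive helper. This is just a sketch, not part of the original script, and the function name `syslog_time_to_secs` is made up for illustration:

```perl
use strict;
use warnings;

# Convert the HH:MM:SS portion of a syslog line into seconds since
# midnight. Returns undef when the line doesn't match, so callers can
# skip malformed lines instead of reusing stale $1/$2/$3 captures.
sub syslog_time_to_secs {
    my ($line) = @_;
    return undef unless $line =~ /^\w+\s+\d+\s+(\d+):(\d+):(\d+)/;
    return $1 * 3600 + $2 * 60 + $3;
}

# "12:27:00" -> 12*3600 + 27*60 + 0 = 44820
print syslog_time_to_secs("Mar 12 12:27:00 server3 named[32172]: lame server"), "\n";
```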



--
View this message in context: http://lucene.472066.n3.nabble.com/32-bit-CentOS-Indexing-Question-tp4114036.html
Sent from the lucy-user mailing list archive at Nabble.com.

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
Nick Wellnhofer wrote:
> On Jan 29, 2014, at 02:59, Marvin Humphrey <ma...@rectangular.com> wrote:
>
>> On Tue, Jan 28, 2014 at 11:26 AM, Nick D. <nd...@globaldataguard.com> wrote:
>>> Why do I get a memory allocation error on a 32 bit OS and not a 64 bit
>>> OS?
>>
>> It's probably a known architectural flaw in SortWriter which makes it
>> consume too much RAM.
>
> It would be interesting to know whether they make a difference in Nick
> D.'s case. If they solve his problem, we should consider backporting the
> fix to the 0.3 branch.
>
> Nick W.

The branch did not fix the issue, but committing every 50k records or so
(syslog-style records) did fix it and sped up indexing a bit. I was able to
install the SortWriter branch, but unfortunately set_mem_threshold did not
speed up indexing (possibly the I/O device is the bottleneck).
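For the archives, the periodic-commit workaround looks roughly like this. It is a sketch, not the poster's exact code: it assumes `$index_path`, `$schema`, and `$fh` are already set up as in the earlier snippet, and the 50_000 batch size is the one mentioned above:

```perl
use strict;
use warnings;
use Lucy::Index::Indexer;

my $batch_size = 50_000;    # commit roughly every 50k syslog records
my $count      = 0;

my $indexer = Lucy::Index::Indexer->new(
    index  => $index_path,
    schema => $schema,
    create => 1,
);

while ( my $line = <$fh> ) {
    $indexer->add_doc( { line => $line } );
    if ( ++$count % $batch_size == 0 ) {
        # Flush the batch to disk and open a fresh indexer session,
        # keeping the in-memory sort/posting structures small.
        $indexer->commit;
        $indexer = Lucy::Index::Indexer->new(
            index  => $index_path,
            schema => $schema,
        );
    }
}
$indexer->commit;    # final partial batch
```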




--
View this message in context: http://lucene.472066.n3.nabble.com/32-bit-CentOS-Indexing-Question-tp4114036p4117201.html

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Nick Wellnhofer <we...@aevum.de>.
On 05/02/2014 00:10, Nick D. wrote:
> I've checked out the sortfieldwriter branch like so:
> git clone https://git-wip-us.apache.org/repos/asf/lucy.git
> git checkout -b test  origin/sortfieldwriter
>
> And when I do a `git log` I see the commits:

Looks good.

> But when I look at the commit you replied with
> "https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=0e49ac6f6ca45860d5598060b89bdac3fbfed2db"
> and looking at the file SortFieldWriter.c here:
> https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/SortFieldWriter.c;h=6ae42d10a3e62f6e99058d668cb0b90fd91b53b1;hb=0e49ac6f6ca45860d5598060b89bdac3fbfed2db
>
> I don't see the function "S_compare_doc_ids_by_ord_rev" in my branch. I've
> made sure to do a `git remote update` and merge, but it was up to date. Is
> there something extra that I need to do?

That's OK. The function S_compare_doc_ids_by_ord_rev was removed in a later 
commit in the branch.

Nick

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
Nick Wellnhofer wrote:
> On Jan 31, 2014, at 21:43, Nick D. <nd...@globaldataguard.com> wrote:
>
> If you check out the sortfieldwriter branch, you'll get all these commits.
> If you're using the 0.3 branch, you have to apply them one-by-one. There's
> a good chance that this will work without conflicts.
>
> Nick

I've checked out the sortfieldwriter branch like so:
git clone https://git-wip-us.apache.org/repos/asf/lucy.git
git checkout -b test  origin/sortfieldwriter

And when I do a `git log` I see the commits:

> commit ad178f10692659b4ed8b170ebfa42d13fd3eed20
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 19:43:42 2013 +0200
> 
>     Use counting sort to sort doc_ids in SortFieldWriter#Refill
>     
>     Since we already have the ordinals for each doc_id, we can use a
>     counting sort. This uses a temporary array of size run_cardinality but
>     runs in linear time.
> 
> commit 98a960ed16c601569ab1b78e5a3e1e9302065180
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 02:28:04 2013 +0200
> 
>     Free sorted_ids in SortFieldWriter a little earlier
> 
> commit 393723d354d8ce44841cd006a26d03894315088d
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 02:13:49 2013 +0200
> 
>     Initialize SortFieldWriter#run_tick to 1
>     
>     Make sure we never use a run_tick of 0.
> 
> commit a5aa40a93d0b2542dc04afd387619b015cf273b5
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 02:04:42 2013 +0200
> 
>     Make SortFieldWriter#Refill obey the memory limit
>     
>     The old logic was broken.
> 
> commit 0e49ac6f6ca45860d5598060b89bdac3fbfed2db
> Author: Nick Wellnhofer <wellnhofer@aevum.de>
> Date:   Thu Sep 26 01:25:52 2013 +0200
> 
>     Don't sort documents twice in SortFieldWriter#Refill
>     
>     The doc_ids are already sorted in S_lazy_init_sorted_ids. We only have
>     to make sure that S_lazy_init_sorted_ids uses the doc_id as secondary
>     sort key.

But when I look at the commit you replied with,
"https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=0e49ac6f6ca45860d5598060b89bdac3fbfed2db"
and look at the file SortFieldWriter.c here:
https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/SortFieldWriter.c;h=6ae42d10a3e62f6e99058d668cb0b90fd91b53b1;hb=0e49ac6f6ca45860d5598060b89bdac3fbfed2db

I don't see the function "S_compare_doc_ids_by_ord_rev" in my branch. I've
made sure to do a `git remote update` and merge, but it was up to date. Is
there something extra that I need to do?

Attached is my SortFieldWriter.c file:
<http://lucene.472066.n3.nabble.com/file/n4115359/SortFieldWriter.c>

Any help is always appreciated.



--
View this message in context: http://lucene.472066.n3.nabble.com/32-bit-CentOS-Indexing-Question-tp4114036p4115359.html

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
If I have the current Lucy-0.3.3 version from CPAN, how do I go about
getting those two commits mentioned earlier into the source that I have?



--
View this message in context: http://lucene.472066.n3.nabble.com/32-bit-CentOS-Indexing-Question-tp4114036p4115120.html

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Jan 31, 2014 at 4:33 PM, Nick Wellnhofer <we...@aevum.de> wrote:
>> Are there any downsides to increasing this threshold to say 40MB?
>
> No, if you have enough memory, you can probably use a much higher value.
> Maybe Marvin can give some additional details.

It's hard to say; it might depend on CPU cache behavior.

The primary reason that global setting exists is not performance tweakery;
it's testing.

From perl/lib/Lucy/Test.pm:

    # Set the default memory threshold for PostingListWriter to a low number
    # so that we simulate large indexes by performing a lot of PostingPool
    # flushes.
    Lucy::Index::PostingListWriter::set_default_mem_thresh(0x1000);

Marvin Humphrey

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Nick Wellnhofer <we...@aevum.de>.
On Jan 31, 2014, at 23:18 , Nick D. <nd...@globaldataguard.com> wrote:

> Does the Lucy::Index::SortWriter::set_default_mem_thresh($bytes); function
> exist in the latest public 0.3.3 version of lucy?

Yes, but it’s ineffective due to a bug which the sortfieldwriter branch should fix.

> Is there a function like this for SegWriter (I'm assuming this is used for
> writing segments that are not sortable)? If so, what is the default?

Yes, there’s

    Lucy::Index::PostingListWriter::set_default_mem_thresh($bytes);

with a default of 16MB. This affects segment merging for indexed fields.

(A segment contains data for all the fields of your schema. PostingListWriter creates the posting lists for indexed fields. SortWriter creates the sort cache for sortable fields. Both posting lists and sort caches are contained in a segment.)

> Are there any downsides to increasing this threshold to say 40MB?

No, if you have enough memory, you can probably use a much higher value. Maybe Marvin can give some additional details.

Nick



Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
Does the Lucy::Index::SortWriter::set_default_mem_thresh($bytes); function
exist in the latest public 0.3.3 version of lucy? 

Is there a function like this for SegWriter (I'm assuming this is used for
writing segments that are not sortable)? If so, what is the default?

Are there any downsides to increasing this threshold to say 40MB?



--
View this message in context: http://lucene.472066.n3.nabble.com/32-bit-CentOS-Indexing-Question-tp4114036p4114766.html

Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Nick Wellnhofer <we...@aevum.de>.
On Jan 31, 2014, at 21:43 , Nick D. <nd...@globaldataguard.com> wrote:

> Thanks Nick (cool name by the way). If I continue to have problems with this
> I will get those 2 commits and see if there is a difference.
> 
> Would these commits help with the speed of indexing? Mainly the add_doc and
> commit functions that write/re-write segments?

That’s hard to tell. The first commit should make things a bit faster. The second commit helps with memory usage when indexing many documents with sortable fields. This should actually make things slower but there’s a tunable which might help:

    Lucy::Index::SortWriter::set_default_mem_thresh($bytes);

The default is 4MB (0x400000). Larger values should speed up indexing at the expense of memory.
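Concretely, the tunable is a package-level call made before opening the indexer. This fragment is only an illustration; the 16 MB value is an arbitrary example, not a recommendation from the thread:

```perl
use Lucy::Index::SortWriter;

# Raise SortWriter's flush threshold from the 4MB default (0x400000)
# to 16MB (0x1000000), trading memory for fewer sort-cache flushes.
Lucy::Index::SortWriter::set_default_mem_thresh(0x1000000);
```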

Then the sortfieldwriter branch contains another commit which might improve performance noticeably:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=ad178f10692659b4ed8b170ebfa42d13fd3eed20

If you check out the sortfieldwriter branch, you'll get all these commits. If you're using the 0.3 branch, you have to apply them one-by-one. There's a good chance that this will work without conflicts.

Nick


Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by "Nick D." <nd...@globaldataguard.com>.
Thanks Nick (cool name by the way). If I continue to have problems with this
I will get those 2 commits and see if there is a difference.

Would these commits help with the speed of indexing? Mainly the add_doc and
commit functions that write/re-write segments?



--
View this message in context: http://lucene.472066.n3.nabble.com/32-bit-CentOS-Indexing-Question-tp4114036p4114751.html

[lucy-dev] Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Nick Wellnhofer <we...@aevum.de>.
On Jan 29, 2014, at 02:59 , Marvin Humphrey <ma...@rectangular.com> wrote:

> On Tue, Jan 28, 2014 at 11:26 AM, Nick D. <nd...@globaldataguard.com> wrote:
>> Why do I get a memory allocation error on a 32 bit OS and not a 64 bit OS?
> 
> It's probably a known architectural flaw in SortWriter which makes it consume
> too much RAM.

This issue should be resolved in the sortfieldwriter branch. The following two commits are the crucial ones:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=0e49ac6f6ca45860d5598060b89bdac3fbfed2db

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=a5aa40a93d0b2542dc04afd387619b015cf273b5

It would be interesting to know whether they make a difference in Nick D.’s case. If they solve his problem, we should consider backporting the fix to the 0.3 branch.

Nick W.



Re: [lucy-user] 32 bit CentOS Indexing Question

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 28, 2014 at 11:26 AM, Nick D. <nd...@globaldataguard.com> wrote:
> Why do I get a memory allocation error on a 32 bit OS and not a 64 bit OS?

It's probably a known architectural flaw in SortWriter which makes it consume
too much RAM.

> Are there any 32 bit limitations of Lucy?

In theory, there should not be.  We have expended considerable effort to
provide compatibility with 32-bit systems, though our optimization target
remains 64-bit.

> Why does the index file grow so large and then shrinks after commit is done?

There is a lot of temporary data produced during indexing.  Before you can
search a large amount of material, you have to sort it.  That takes a lot of
space.

> Should I commit more often?

If you are only generating this index in a single shot, that should be an
adequate workaround for the SortWriter problem.  However, you must
also override IndexManager#recycle to return an empty arrayref.  Check out
Lucy::Docs::Cookbook::FastUpdates.
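A minimal sketch of that override, following the pattern in Lucy::Docs::Cookbook::FastUpdates. The class name `NoMergeManager` is invented for illustration, and `$schema` and the index path are assumed to come from the original script:

```perl
package NoMergeManager;
use base qw( Lucy::Index::IndexManager );

# Return an empty arrayref so that commits never schedule existing
# segments for merging; each batch is written out and left alone,
# which keeps per-commit memory use low during a one-shot bulk index.
sub recycle { return []; }

package main;
use Lucy::Index::Indexer;

my $indexer = Lucy::Index::Indexer->new(
    index   => '/path/to/index',
    schema  => $schema,
    manager => NoMergeManager->new,
);
```

Note that skipping merges leaves many small segments behind, so search-time performance may suffer until the index is optimized.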

> Would committing often slow down the indexing process?

I don't think the difference would be unreasonable.

> Would committing often make the overgrowth of the index go away?

If you override IndexManager#recycle, yes.

This is assuming you don't need to modify the index later, which I'm guessing
based on the script that you supplied.

Marvin Humphrey