Posted to user@mahout.apache.org by Paul Rudin <pa...@rudin.co.uk> on 2012/01/19 08:40:54 UTC

stackoverflow - recursion in LuceneIterator.computeNext

I have a large Lucene index from which I'm trying to extract term
vectors. I get a StackOverflowError, which I believe is caused by the
recursion in LuceneIterator.computeNext(). I could increase the stack
size, but with big enough data there could always be a problem.

I have a modified version that uses a loop instead of the recursion
which seems to work OK. Should I put a patch somewhere?
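
For what it's worth, the shape of the change is roughly this - a sketch
in the style of Guava's AbstractIterator rather than the actual
LuceneIterator internals; the helper method names are placeholders:

    // Sketch only: shows the recursive skip rewritten as a loop. The helper
    // methods (hasMoreDocs, nextDocId, extractVector) are placeholders, not
    // Mahout's real LuceneIterator API.
    import com.google.common.collect.AbstractIterator;
    import org.apache.mahout.math.Vector;

    abstract class LoopingVectorIterator extends AbstractIterator<Vector> {

      @Override
      protected Vector computeNext() {
        // Previously: "return computeNext();" whenever a document yielded no
        // vector, which adds a stack frame per skipped document. Loop instead.
        while (hasMoreDocs()) {
          Vector v = extractVector(nextDocId());
          if (v != null) {
            return v;           // found a usable term vector
          }
          // no term vector for this document: skip it and try the next one
        }
        return endOfData();     // AbstractIterator's end-of-iteration signal
      }

      protected abstract boolean hasMoreDocs();
      protected abstract int nextDocId();
      protected abstract Vector extractVector(int docId);
    }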

On a related note I'd quite like to be able to write the vectors
straight to s3 without writing to a local file first - is there a
practical way to do this?



Re: stackoverflow - recursion in LuceneIterator.computeNext

Posted by Paul Rudin <pa...@rudin.co.uk>.
Sean Owen <sr...@gmail.com> writes:

> That part is easy. On the job's Configuration, call:
>
>     set("fs.s3.awsAccessKeyId", "YourAccessKey");
>     set("fs.s3.awsSecretAccessKey", "YourSecretKey");
>
> If you use s3n:// URLs, do the same with ".s3n.".
>
> I also set fs.defaultFS and fs.default.name to "s3://mybucket".


Thanks, although I've yet to try that - I've just copied to S3 with the
Amazon web app (s3cmd doesn't seem to work with files over 5 GB).

I'm curious about the format of the output. First time round I used -x
95 -md 2 and got a ~15.7 GB vector file and around 6 million terms in
the dictionary file. Eyeballing the dictionary file suggests that a lot
of the low-frequency terms are essentially junk - artifacts of imperfect
extraction of text from e.g. Word files - so I upped -md to get rid of
some. I also brought -x down to 90. This did indeed reduce the number
of terms in the dictionary file, but the vector file is only about 1 GB
smaller.

Does this make sense? The number of documents is the same, but each
document vector should presumably now be a ~800,000-element vector,
whereas before each was a ~6,000,000-element vector, yet the size of the
data is about the same. I suppose the low-frequency terms don't
contribute much to the size of the data, but one might have expected
reducing -x to make more of a difference. Then again, maybe not - if we
ignore the low-frequency terms I guess we're only taking out about 5% of
the non-zero values, so perhaps it's in the right ballpark.


Re: stackoverflow - recursion in LuceneIterator.computeNext

Posted by Sean Owen <sr...@gmail.com>.
That part is easy. On the job's Configuration, call:

    set("fs.s3.awsAccessKeyId", "YourAccessKey");
    set("fs.s3.awsSecretAccessKey", "YourSecretKey");

If you use s3n:// URLs, do the same with ".s3n.".

I also set fs.defaultFS and fs.default.name to "s3://mybucket".
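
For completeness, one self-contained (untested) sketch of the above on a
plain Hadoop Configuration - the credentials and bucket name are
placeholders:

    // Sketch only: the same properties set programmatically. Credentials and
    // bucket name are placeholders.
    import org.apache.hadoop.conf.Configuration;

    public final class S3ConfFactory {
      public static Configuration withS3Credentials(String accessKey, String secretKey) {
        Configuration conf = new Configuration();
        conf.set("fs.s3.awsAccessKeyId", accessKey);
        conf.set("fs.s3.awsSecretAccessKey", secretKey);
        // For s3n:// URLs set the ".s3n." variants of the two keys instead.
        conf.set("fs.default.name", "s3://mybucket");  // and fs.defaultFS, as above
        return conf;
      }
    }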

On Fri, Jan 20, 2012 at 9:49 AM, Paul Rudin <pa...@rudin.co.uk> wrote:
> No I haven't tried. I'm not sure how the authentication works if I use
> s3://... as a path. I can't immediately find anything by googling, but
> I'll investigate further.

Re: stackoverflow - recursion in LuceneIterator.computeNext

Posted by Paul Rudin <pa...@rudin.co.uk>.
> Yes open a JIRA here with a patch: https://issues.apache.org/jira/browse/MAHOUT

OK - I put it here: 
<https://issues.apache.org/jira/browse/MAHOUT-951>

> If you're writing SequenceFiles, sure you can just write them straight
> to S3 by writing to a Path on s3:// -- is that something you've tried
> and doesn't work? should be that easy.

No I haven't tried. I'm not sure how the authentication works if I use
s3://... as a path. I can't immediately find anything by googling, but
I'll investigate further.


Re: stackoverflow - recursion in LuceneIterator.computeNext

Posted by Sean Owen <sr...@gmail.com>.
Yes open a JIRA here with a patch: https://issues.apache.org/jira/browse/MAHOUT

If you're writing SequenceFiles, sure you can just write them straight
to S3 by writing to a Path on s3:// -- is that something you've tried
and doesn't work? should be that easy.
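
As a concrete (untested) sketch of what that looks like with the old
SequenceFile API - the bucket, output path, key type, and credentials
below are placeholders:

    // Sketch only: writing VectorWritables straight to a SequenceFile on an
    // s3:// Path. Bucket, path and credentials are placeholders; the S3 keys
    // are set on the Configuration as described earlier in the thread.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.VectorWritable;

    public class S3SequenceFileWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3.awsAccessKeyId", "YourAccessKey");
        conf.set("fs.s3.awsSecretAccessKey", "YourSecretKey");

        Path out = new Path("s3://mybucket/vectors/part-00000");
        FileSystem fs = FileSystem.get(URI.create("s3://mybucket"), conf);

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
        try {
          // one (docId, vector) pair per document; a tiny dummy vector here
          VectorWritable value = new VectorWritable(new RandomAccessSparseVector(10));
          writer.append(new Text("doc-0"), value);
        } finally {
          writer.close();
        }
      }
    }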

On Thu, Jan 19, 2012 at 7:40 AM, Paul Rudin <pa...@rudin.co.uk> wrote:
>
> I have a large Lucene index from which I'm trying to extract term
> vectors. I get a StackOverflowError, which I believe is caused by the
> recursion in LuceneIterator.computeNext(). I could increase the stack
> size, but with big enough data there could always be a problem.
>
> I have a modified version that uses a loop instead of the recursion
> which seems to work OK. Should I put a patch somewhere?
>
> On a related note I'd quite like to be able to write the vectors
> straight to s3 without writing to a local file first - is there a
> practical way to do this?
>
>