Posted to user@lucenenet.apache.org by Tiberiu Motoc <ti...@gmail.com> on 2010/10/29 19:22:38 UTC

indexing doesn't seem to work in large files

Hi,

I'm using SvnQuery which is based on Lucene.Net to index and search my
SVN repositories. I've noticed that text doesn't get indexed in large
files. Actually the first 2000-2500 lines get indexed and the rest do
not. Is anyone aware of this problem? Is there a solution for it?

Thanks,
Tiberiu

Re: indexing doesn't seem to work in large files

Posted by Tiberiu Motoc <ti...@gmail.com>.
Thanks so much for all the suggestions!

I did try to change the MaxNumberOfTermsPerDocument to int.MaxValue
and that did not work. I also tried compiling SvnQuery with
Lucene.NET 2.9.2 but I ran into too much trouble. I started fixing
some of the incompatibilities, but gave up after a while. I'll
continue doing it on the side for a while, but unfortunately I'm
running out of time with my eval of SvnQuery. I received a reply in
the SvnQuery newsgroup saying this problem might be fixed in the next
version, so that's good news.

Thanks again for all the quick and helpful replies.
Tiberiu

On Fri, Oct 29, 2010 at 6:12 PM, Ben Martz <be...@gmail.com> wrote:
> I am using MaxFieldLength.UNLIMITED successfully in my own product running
> with Lucene.Net 2.9.2 and can definitely index huge documents without an
> issue (given enough RAM anyways).
>
> Regarding the possible SvnQuery-specific issues:
>
> 1. Have you verified that any portion of the document is actually being
> indexed? I noticed that SvnIndex.Indexer.FetchJob selects repository items
> based on a default maximum file size of 1MB (MaxDocumentSize).
>
> 2. Have you tried changing MaxNumberOfTermsPerDocument constant in
> SvnIndex\Indexer.cs from 50000 to IndexWriter.MaxFieldLength.UNLIMITED? I
> noticed that this MaxNumberOfTermsPerDocument and MaxDocumentSize were added
> in r237 and MaxNumberOfTermsPerDocument is used in two places in Indexer.
>
> 3. Is the test query that you're using simple enough to not result in a
> search failure because of an analyzer issue?
>
> If you get stuck and if the document in question can be disclosed, even
> privately, I would be happy to throw it at my Lucene.Net implementation
> (just straight Lucene, not SvnQuery) and run some queries for you if that
> would help.
>
> Cheers,
> Ben
>
> Aaron Powell wrote:
>>
>> I'd suggest upgrading it to work with 2.9.2 of Lucene.Net.
>>
>> What exactly are you indexing, code files or plain text documents?
>> Aaron Powell
>> Umbraco Ninja
>>
>> http://www.aaron-powell.com | http://twitter.com/slace | Skype:
>> aaron.l.powell | MSN: aazzap@hotmail.com
>>
>>
>> On Sat, Oct 30, 2010 at 8:21 AM, Tiberiu
>> Motoc <ti...@gmail.com> wrote:
>>
>>> Thanks Ben and Franklin,
>>>
>>> I tried it and unfortunately it didn't work. SvnQuery has it set to
>>> 50,000. I tried setting it to 60,000 and 500,000 and it still doesn't
>>> work: the text that I'm looking for is not indexed and not found.
>>> If I do a word-count in MS Word around the break-off point (where the
>>> indexing seems to stop) I get a count of 8,450 words and 75,000
>>> characters (with no spaces). It kinda makes me think that somehow the
>>> setting for the MaxFieldLength might not work.
>>> I also noticed that SvnQuery is using Lucene.NET v2.3.1.3. I see there
>>> are new versions of Lucene.NET available. Do you know of any bugs
>>> in v2.3.1.3 that would cause this? Should I recompile SvnQuery with
>>> the latest version of Lucene.net?
>>>
>>> Thanks,
>>> Tiberiu
>>>
>>> On Fri, Oct 29, 2010 at 10:31 AM, Franklin Simmons
>>> <fs...@sccmediaserver.com>  wrote:
>>>>
>>>> Tiberiu,
>>>>
>>>> Check your IndexWriter's MaxFieldLength. The default is 10000.
>>>>
>>>> -----Original Message-----
>>>> From: Tiberiu Motoc [mailto:tiberiu.motoc@gmail.com]
>>>> Sent: Friday, October 29, 2010 1:23 PM
>>>> To: lucene-net-user@lucene.apache.org
>>>> Subject: indexing doesn't seem to work in large files
>>>>
>>>> Hi,
>>>>
>>>> I'm using SvnQuery which is based on Lucene.Net to index and search my
>>>> SVN repositories. I've noticed that text doesn't get indexed in large
>>>> files. Actually the first 2000-2500 lines get indexed and the rest do
>>>> not. Is anyone aware of this problem? Is there a solution for it?
>>>>
>>>> Thanks,
>>>> Tiberiu
>>>>
>>
>

Re: indexing doesn't seem to work in large files

Posted by Ben Martz <be...@gmail.com>.
I am using MaxFieldLength.UNLIMITED successfully in my own product running with Lucene.Net 2.9.2 and can definitely index huge documents without an issue (given enough RAM anyways).

Regarding the possible SvnQuery-specific issues:

1. Have you verified that any portion of the document is actually being indexed? I noticed that SvnIndex.Indexer.FetchJob selects repository items based on a default maximum file size of 1MB (MaxDocumentSize).

2. Have you tried changing MaxNumberOfTermsPerDocument constant in SvnIndex\Indexer.cs from 50000 to IndexWriter.MaxFieldLength.UNLIMITED? I noticed that this MaxNumberOfTermsPerDocument and MaxDocumentSize were added in r237 and MaxNumberOfTermsPerDocument is used in two places in Indexer.

3. Is the test query that you're using simple enough to not result in a search failure because of an analyzer issue?
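
The three checks above can be sketched as a small triage script. This is only an illustration in Python, not SvnQuery or Lucene.Net code: the function name `diagnose`, the whitespace tokenizer, the exact byte value used for "1MB", and the assumption that items over the size limit are skipped entirely are all hypothetical. Only the two limit values (1MB and 50000) come from the points above.

```python
# Values mentioned in the checks above; the exact byte count for "1MB" is an assumption.
MAX_DOCUMENT_SIZE = 1024 * 1024
MAX_TERMS_PER_DOCUMENT = 50_000


def diagnose(content: bytes, needle: str) -> str:
    """Report which of the three checks a document would trip (toy model)."""
    # Check 1: assumed behaviour, files above the size limit are never fetched.
    if len(content) > MAX_DOCUMENT_SIZE:
        return "over size limit"

    # Whitespace splitting stands in for the real analyzer (assumption).
    tokens = content.decode("utf-8", errors="replace").lower().split()

    # Check 3: the search word may not survive tokenization as a standalone term.
    try:
        position = tokens.index(needle.lower())
    except ValueError:
        return "term not in tokens"

    # Check 2: terms at or beyond the cap are dropped.
    if position >= MAX_TERMS_PER_DOCUMENT:
        return "beyond term limit"
    return "should be indexed"


print(diagnose(b"alpha beta", "beta"))
print(diagnose(("a " * 50_000 + "needle").encode(), "needle"))
print(diagnose(b"foo.bar()", "bar"))
```

Running the checks in this order separates "the file was never fetched" from "the file was fetched but truncated" and from "the query term does not match what the tokenizer produced", which are the three cases the list distinguishes.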

If you get stuck and if the document in question can be disclosed, even privately, I would be happy to throw it at my Lucene.Net implementation (just straight Lucene, not SvnQuery) and run some queries for you if that would help.

Cheers,
Ben

Aaron Powell wrote:
> I'd suggest upgrading it to work with 2.9.2 of Lucene.Net.
>
> What exactly are you indexing, code files or plain text documents?
> Aaron Powell
> Umbraco Ninja
>
> http://www.aaron-powell.com | http://twitter.com/slace | Skype:
> aaron.l.powell | MSN: aazzap@hotmail.com
>
>
> On Sat, Oct 30, 2010 at 8:21 AM, Tiberiu Motoc <ti...@gmail.com> wrote:
>
>> Thanks Ben and Franklin,
>>
>> I tried it and unfortunately it didn't work. SvnQuery has it set to
>> 50,000. I tried setting it to 60,000 and 500,000 and it still doesn't
>> work: the text that I'm looking for is not indexed and not found.
>> If I do a word-count in MS Word around the break-off point (where the
>> indexing seems to stop) I get a count of 8,450 words and 75,000
>> characters (with no spaces). It kinda makes me think that somehow the
>> setting for the MaxFieldLength might not work.
>> I also noticed that SvnQuery is using Lucene.NET v2.3.1.3. I see there
>> are new versions of Lucene.NET available. Do you know of any bugs
>> in v2.3.1.3 that would cause this? Should I recompile SvnQuery with
>> the latest version of Lucene.net?
>>
>> Thanks,
>> Tiberiu
>>
>> On Fri, Oct 29, 2010 at 10:31 AM, Franklin Simmons
>> <fs...@sccmediaserver.com>  wrote:
>>> Tiberiu,
>>>
>>> Check your IndexWriter's MaxFieldLength. The default is 10000.
>>>
>>> -----Original Message-----
>>> From: Tiberiu Motoc [mailto:tiberiu.motoc@gmail.com]
>>> Sent: Friday, October 29, 2010 1:23 PM
>>> To: lucene-net-user@lucene.apache.org
>>> Subject: indexing doesn't seem to work in large files
>>>
>>> Hi,
>>>
>>> I'm using SvnQuery which is based on Lucene.Net to index and search my
>>> SVN repositories. I've noticed that text doesn't get indexed in large
>>> files. Actually the first 2000-2500 lines get indexed and the rest do
>>> not. Is anyone aware of this problem? Is there a solution for it?
>>>
>>> Thanks,
>>> Tiberiu
>>>
>

Re: indexing doesn't seem to work in large files

Posted by Aaron Powell <me...@aaron-powell.com>.
I'd suggest upgrading it to work with 2.9.2 of Lucene.Net.

What exactly are you indexing, code files or plain text documents?
Aaron Powell
Umbraco Ninja

http://www.aaron-powell.com | http://twitter.com/slace | Skype:
aaron.l.powell | MSN: aazzap@hotmail.com


On Sat, Oct 30, 2010 at 8:21 AM, Tiberiu Motoc <ti...@gmail.com> wrote:

> Thanks Ben and Franklin,
>
> I tried it and unfortunately it didn't work. SvnQuery has it set to
> 50,000. I tried setting it to 60,000 and 500,000 and it still doesn't
> work: the text that I'm looking for is not indexed and not found.
> If I do a word-count in MS Word around the break-off point (where the
> indexing seems to stop) I get a count of 8,450 words and 75,000
> characters (with no spaces). It kinda makes me think that somehow the
> setting for the MaxFieldLength might not work.
> I also noticed that SvnQuery is using Lucene.NET v2.3.1.3. I see there
> are new versions of Lucene.NET available. Do you know of any bugs
> in v2.3.1.3 that would cause this? Should I recompile SvnQuery with
> the latest version of Lucene.net?
>
> Thanks,
> Tiberiu
>
> On Fri, Oct 29, 2010 at 10:31 AM, Franklin Simmons
> <fs...@sccmediaserver.com> wrote:
> > Tiberiu,
> >
> > Check your IndexWriter's MaxFieldLength. The default is 10000.
> >
> > -----Original Message-----
> > From: Tiberiu Motoc [mailto:tiberiu.motoc@gmail.com]
> > Sent: Friday, October 29, 2010 1:23 PM
> > To: lucene-net-user@lucene.apache.org
> > Subject: indexing doesn't seem to work in large files
> >
> > Hi,
> >
> > I'm using SvnQuery which is based on Lucene.Net to index and search my
> > SVN repositories. I've noticed that text doesn't get indexed in large
> > files. Actually the first 2000-2500 lines get indexed and the rest do
> > not. Is anyone aware of this problem? Is there a solution for it?
> >
> > Thanks,
> > Tiberiu
> >
>

Re: indexing doesn't seem to work in large files

Posted by Tiberiu Motoc <ti...@gmail.com>.
Thanks Ben and Franklin,

I tried it and unfortunately it didn't work. SvnQuery has it set to
50,000. I tried setting it to 60,000 and 500,000 and it still doesn't
work: the text that I'm looking for is not indexed and not found.
If I do a word-count in MS Word around the break-off point (where the
indexing seems to stop) I get a count of 8,450 words and 75,000
characters (with no spaces). It kinda makes me think that somehow the
setting for the MaxFieldLength might not work.
I also noticed that SvnQuery is using Lucene.NET v2.3.1.3. I see there
are new versions of Lucene.NET available. Do you know of any bugs
in v2.3.1.3 that would cause this? Should I recompile SvnQuery with
the latest version of Lucene.net?

Thanks,
Tiberiu
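
A word count from a word processor may not match the number of terms an analyzer produces, especially for source code, where punctuation can split one "word" into several terms. A rough way to check the break-off point against a term limit is to count tokens up to a marker string. The sketch below is a hypothetical Python helper: the name `terms_before` and the regular expression used as a tokenizer are assumptions, and SvnQuery's real analyzer may split text differently.

```python
import re


def terms_before(text, marker, token_pattern=r"[A-Za-z0-9_]+"):
    """Count tokens that precede the first occurrence of marker.

    Returns None if marker does not occur in text.
    """
    offset = text.find(marker)
    if offset < 0:
        return None
    return len(re.findall(token_pattern, text[:offset]))


# Three lines of code-like text followed by the last line known to be searchable.
sample = "int x = foo.bar(1, 2);\n" * 3 + "LAST_INDEXED_LINE"

print(terms_before(sample, "LAST_INDEXED_LINE"))      # tokens under the regex tokenizer
print(len(sample[: sample.find("LAST_INDEXED_LINE")].split()))  # whitespace-separated "words"
```

On this sample the regex tokenizer yields 18 terms where whitespace splitting yields 15 words, so the two counts can diverge even on short inputs. If the term count at the break-off point sits near a configured limit, the limit is the likely cause; if it is far below, the cause is probably elsewhere.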

On Fri, Oct 29, 2010 at 10:31 AM, Franklin Simmons
<fs...@sccmediaserver.com> wrote:
> Tiberiu,
>
> Check your IndexWriter's MaxFieldLength. The default is 10000.
>
> -----Original Message-----
> From: Tiberiu Motoc [mailto:tiberiu.motoc@gmail.com]
> Sent: Friday, October 29, 2010 1:23 PM
> To: lucene-net-user@lucene.apache.org
> Subject: indexing doesn't seem to work in large files
>
> Hi,
>
> I'm using SvnQuery which is based on Lucene.Net to index and search my
> SVN repositories. I've noticed that text doesn't get indexed in large
> files. Actually the first 2000-2500 lines get indexed and the rest do
> not. Is anyone aware of this problem? Is there a solution for it?
>
> Thanks,
> Tiberiu
>

RE: indexing doesn't seem to work in large files

Posted by Franklin Simmons <fs...@sccmediaserver.com>.
Tiberiu, 

Check your IndexWriter's MaxFieldLength. The default is 10000.

-----Original Message-----
From: Tiberiu Motoc [mailto:tiberiu.motoc@gmail.com] 
Sent: Friday, October 29, 2010 1:23 PM
To: lucene-net-user@lucene.apache.org
Subject: indexing doesn't seem to work in large files

Hi,

I'm using SvnQuery which is based on Lucene.Net to index and search my
SVN repositories. I've noticed that text doesn't get indexed in large
files. Actually the first 2000-2500 lines get indexed and the rest do
not. Is anyone aware of this problem? Is there a solution for it?

Thanks,
Tiberiu

Re: indexing doesn't seem to work in large files

Posted by Ben Martz <be...@gmail.com>.
Is it possible that you haven't overridden the default max field length setting?

http://wiki.apache.org/lucene-java/LuceneFAQ

"Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors. See IndexWriter.setMaxFieldLength(int) <http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength%28int%29>."
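
The effect of such a cap can be illustrated with a toy inverted index. This is a minimal Python sketch of the behaviour the FAQ describes, not Lucene's implementation: the function `index_terms`, its `max_field_length` parameter, and the whitespace tokenizer are hypothetical names and simplifications chosen for the example.

```python
def index_terms(text, max_field_length=10_000):
    """Toy model of a per-field term cap.

    Only the first max_field_length tokens are added to the index;
    pass None to disable the cap.
    """
    index = {}
    for position, token in enumerate(text.lower().split()):
        if max_field_length is not None and position >= max_field_length:
            break  # everything after the cap is silently dropped
        index.setdefault(token, []).append(position)
    return index


# 10,000 filler terms followed by one distinctive term at position 10,000.
document = "filler " * 10_000 + "needle"

capped = index_terms(document)                           # default cap of 10,000
uncapped = index_terms(document, max_field_length=None)  # no cap

print("needle" in capped)    # False: the term lies just past the cap
print("needle" in uncapped)  # True
```

No error is raised when the cap is reached, which matches the reported symptom: the start of a large file is searchable and the rest is simply absent from the index.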

Tiberiu Motoc wrote:
> Hi,
>
> I'm using SvnQuery which is based on Lucene.Net to index and search my
> SVN repositories. I've noticed that text doesn't get indexed in large
> files. Actually the first 2000-2500 lines get indexed and the rest do
> not. Is anyone aware of this problem? Is there a solution for it?
>
> Thanks,
> Tiberiu