You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by aslam bari <ia...@yahoo.co.in> on 2006/09/15 11:09:11 UTC

Big Ducument Indexing Limit?

Dear all,
  I am trying to index a Xml file which has 6MB size. Does lucene support the big document size. What is the limit of lucene Max file size to index. 
  Because when i check and trying to search in the indexed file. I am not able to get all the results. It gives me some results but not others. I think Lucene has done partially indexed and left the remaining part which it could not processed. IS IT RIGHT?. If not , then what will be the problem.
  Thanks...

 				
---------------------------------
 Find out what India is talking about on  - Yahoo! Answers India 
 Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW

Re: Big Ducument Indexing Limit?

Posted by Erick Erickson <er...@gmail.com>.
First, You really must undestand analyzers and what they do. If you haven't
seen the book Lucene in Action, I highly recommend it.

Second, get a copy of Luke (google luke and lucene). It is a graphical tool
that lets you examine an index and fire queries at it. It'll show you
exactly what was indexed with the particular analyzer you eventually use. It
will also show you exactly what the result of a query is after it has been
processed by the query parser (again, using different analyzers). I
guarantee that the hour or two you spend understanding Luke will be time
*very* well spent.

About the content you indicated... what do you want to search? Are you going
to search on "\A1" "Frank" "\Paul" "\A1;Frank\Paul"? That is, would you
expect searching on \A1 to produce a hit or not?

You have to decide what parts of this text are to be searchable (i.e. how to
tokenize the input) and then use an analyzer/tokenizer that breaks the
stream up as you wish (for both indexing and querying). If none of the
supplied analyzers do what you want, Lucene in Action presents an example of
making your own (under synonym injection) that you can use as a model for
making your own.

It would be much more helpful if you could give an example of what searches
you would expect to generate hits on a given example input.... Again, using
Luke to look at your index after indexing would help you a lot to understand
what's possible.

Hope this helps
Erick

On 9/15/06, aslam bari <ia...@yahoo.co.in> wrote:
>
> Thanks for response,
>   I have again a small problem.
>   I have some text in a xml tag like
>   <Contents>\A1;Frank\PPaul</Contents>
>
>   Does lucene can not index it using SimpleAnalyzer or
> TextContentExtractor.
>   Thanks...
>
> Catalin Mititelu <ca...@yahoo.com> wrote:
>   One more hint for 2) and 3): use SimpleAnalyzer on your xml (give up at
> XmlContentExtractor). In this manner you can index all "words" from xml file
> at lower case (tag name, attribute name, attribute value and content).
> Of course, you should use the same analyzer for searching.
>
> Simon Willnauer wrote: On 9/15/06, aslam bari wrote:
> > Dear Mititelu,
> > Thanks for reply. Can you help me on some samll issue related to it.
> > 1) I am new to Lucene. Can you tell me where is this
> DEFAULT_MAX_FIELD_LENGTH variable available and how to set it and for my
> purpose like 6-10MB file, how much i should set.
> DEFAULT_MAX_FIELD_LENGTH is a constant of IndexWriter.
> to set your own value use IndexWriter.html#setMaxFieldLength(int)
>
> JavaDoc:
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
>
> > 2) how can i index all the words of XML file as Case Insensitive means
> either in lowercase or in Uppercase. So that i can search case insensitive.
> How your text is processed depends on the analyzer you use for
> indexing a certain field. A lot of the available analyzers do
> lowercase processing using a lowercasefilter like the
> (whitespaceanalyzer)
> see (Analysis) at:
> http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
>
>
> You should have a look at
> > 3) I am using XmlContentExtractor. Actually it extract only Content of
> the xml tag. If i want every thing (including XmlTag or contents or
> properties - attributes ) of xml file to be indexed, where should i do
> change the code.
>
> I'm not 100% sure but isn't XmlContentExtractor a part or jakarta
> slide? If so please contact the slide mailing list. But there are many
> xml apis available to extract every content of your document. If you
> really want to index the whole xml file just read the file using java
> io anyway I would not suggest doing that at all.
>
> best regards simon
> >
> > Thanks...
> > Catalin Mititelu wrote:
> > Yes. The default max limit for indexing tokens is 10,000.
> > Look here
> http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
> >
> > aslam bari wrote: Dear all,
> > I am trying to index a Xml file which has 6MB size. Does lucene support
> the big document size. What is the limit of lucene Max file size to index.
> > Because when i check and trying to search in the indexed file. I am not
> able to get all the results. It gives me some results but not others. I
> think Lucene has done partially indexed and left the remaining part which it
> could not processed. IS IT RIGHT?. If not , then what will be the problem.
> > Thanks...
> >
> >
> > ---------------------------------
> > Find out what India is talking about on - Yahoo! Answers India
> > Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8.
> Get it NOW
> >
> >
> > ---------------------------------
> > Stay in the know. Pulse on the new Yahoo.com. Check it out.
> >
> >
> > ---------------------------------
> > Find out what India is talking about on - Yahoo! Answers India
> > Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8.
> Get it NOW
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
> ---------------------------------
> How low will we go? Check out Yahoo! Messenger's low PC-to-Phone call
> rates.
>
>
> ---------------------------------
> Find out what India is talking about on  - Yahoo! Answers India
> Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get
> it NOW
>

Re: Big Ducument Indexing Limit?

Posted by aslam bari <ia...@yahoo.co.in>.
Thanks for response,
  I have again a small problem.
  I have some text in a xml tag like
  <Contents>\A1;Frank\PPaul</Contents>
   
  Does lucene can not index it using SimpleAnalyzer or TextContentExtractor.
  Thanks...

Catalin Mititelu <ca...@yahoo.com> wrote:
  One more hint for 2) and 3): use SimpleAnalyzer on your xml (give up at XmlContentExtractor). In this manner you can index all "words" from xml file at lower case (tag name, attribute name, attribute value and content). 
Of course, you should use the same analyzer for searching.

Simon Willnauer wrote: On 9/15/06, aslam bari wrote:
> Dear Mititelu,
> Thanks for reply. Can you help me on some samll issue related to it.
> 1) I am new to Lucene. Can you tell me where is this DEFAULT_MAX_FIELD_LENGTH variable available and how to set it and for my purpose like 6-10MB file, how much i should set.
DEFAULT_MAX_FIELD_LENGTH is a constant of IndexWriter.
to set your own value use IndexWriter.html#setMaxFieldLength(int)

JavaDoc:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)

> 2) how can i index all the words of XML file as Case Insensitive means either in lowercase or in Uppercase. So that i can search case insensitive.
How your text is processed depends on the analyzer you use for
indexing a certain field. A lot of the available analyzers do
lowercase processing using a lowercasefilter like the
(whitespaceanalyzer)
see (Analysis) at:
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html


You should have a look at
> 3) I am using XmlContentExtractor. Actually it extract only Content of the xml tag. If i want every thing (including XmlTag or contents or properties - attributes ) of xml file to be indexed, where should i do change the code.

I'm not 100% sure but isn't XmlContentExtractor a part or jakarta
slide? If so please contact the slide mailing list. But there are many
xml apis available to extract every content of your document. If you
really want to index the whole xml file just read the file using java
io anyway I would not suggest doing that at all.

best regards simon
>
> Thanks...
> Catalin Mititelu wrote:
> Yes. The default max limit for indexing tokens is 10,000.
> Look here http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
>
> aslam bari wrote: Dear all,
> I am trying to index a Xml file which has 6MB size. Does lucene support the big document size. What is the limit of lucene Max file size to index.
> Because when i check and trying to search in the indexed file. I am not able to get all the results. It gives me some results but not others. I think Lucene has done partially indexed and left the remaining part which it could not processed. IS IT RIGHT?. If not , then what will be the problem.
> Thanks...
>
>
> ---------------------------------
> Find out what India is talking about on - Yahoo! Answers India
> Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW
>
>
> ---------------------------------
> Stay in the know. Pulse on the new Yahoo.com. Check it out.
>
>
> ---------------------------------
> Find out what India is talking about on - Yahoo! Answers India
> Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




---------------------------------
How low will we go? Check out Yahoo! Messenger’s low PC-to-Phone call rates.

 				
---------------------------------
 Find out what India is talking about on  - Yahoo! Answers India 
 Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW

Re: Big Ducument Indexing Limit?

Posted by Catalin Mititelu <ca...@yahoo.com>.
One more hint for 2) and 3): use SimpleAnalyzer on your xml (give up at XmlContentExtractor). In this manner you can index all "words" from xml file at lower case (tag name, attribute name, attribute value and content). 
Of course, you should use the same analyzer for searching.

Simon Willnauer <si...@googlemail.com> wrote: On 9/15/06, aslam bari  wrote:
> Dear Mititelu,
>   Thanks for reply. Can you help me on some samll issue related to it.
>   1) I am new to Lucene. Can you tell me where is this DEFAULT_MAX_FIELD_LENGTH variable available and how to set it and for my purpose like 6-10MB file, how much i should set.
DEFAULT_MAX_FIELD_LENGTH is a constant of IndexWriter.
to set your own value use IndexWriter.html#setMaxFieldLength(int)

JavaDoc:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)

>   2) how can i index all the words of XML file as Case Insensitive means either in lowercase or in Uppercase. So that i can search case insensitive.
How your text is processed depends on the analyzer you use for
indexing a certain field. A lot of the available analyzers do
lowercase processing using a lowercasefilter like the
(whitespaceanalyzer)
see (Analysis) at:
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html


You should have a look at
>   3) I am using XmlContentExtractor. Actually it extract only Content of the xml tag. If i want every thing (including XmlTag or contents or properties - attributes ) of xml file to be indexed, where should i do change the code.

I'm not 100% sure but isn't XmlContentExtractor a part or jakarta
slide? If so please contact the slide mailing list. But there are many
xml apis available to extract every content of your document. If you
really want to index the whole xml file just read the file using java
io anyway I would not suggest doing that at all.

best regards simon
>
>   Thanks...
> Catalin Mititelu  wrote:
>   Yes. The default max limit for indexing tokens is 10,000.
> Look here http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
>
> aslam bari wrote: Dear all,
> I am trying to index a Xml file which has 6MB size. Does lucene support the big document size. What is the limit of lucene Max file size to index.
> Because when i check and trying to search in the indexed file. I am not able to get all the results. It gives me some results but not others. I think Lucene has done partially indexed and left the remaining part which it could not processed. IS IT RIGHT?. If not , then what will be the problem.
> Thanks...
>
>
> ---------------------------------
> Find out what India is talking about on - Yahoo! Answers India
> Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW
>
>
> ---------------------------------
> Stay in the know. Pulse on the new Yahoo.com. Check it out.
>
>
> ---------------------------------
>  Find out what India is talking about on  - Yahoo! Answers India
>  Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



 		
---------------------------------
How low will we go? Check out Yahoo! Messenger’s low  PC-to-Phone call rates.

Re: Big Ducument Indexing Limit?

Posted by Simon Willnauer <si...@googlemail.com>.
On 9/15/06, aslam bari <ia...@yahoo.co.in> wrote:
> Dear Mititelu,
>   Thanks for reply. Can you help me on some samll issue related to it.
>   1) I am new to Lucene. Can you tell me where is this DEFAULT_MAX_FIELD_LENGTH variable available and how to set it and for my purpose like 6-10MB file, how much i should set.
DEFAULT_MAX_FIELD_LENGTH is a constant of IndexWriter.
to set your own value use IndexWriter.html#setMaxFieldLength(int)

JavaDoc:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)

>   2) how can i index all the words of XML file as Case Insensitive means either in lowercase or in Uppercase. So that i can search case insensitive.
How your text is processed depends on the analyzer you use for
indexing a certain field. A lot of the available analyzers do
lowercase processing using a lowercasefilter like the
(whitespaceanalyzer)
see (Analysis) at:
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html


You should have a look at
>   3) I am using XmlContentExtractor. Actually it extract only Content of the xml tag. If i want every thing (including XmlTag or contents or properties - attributes ) of xml file to be indexed, where should i do change the code.

I'm not 100% sure but isn't XmlContentExtractor a part or jakarta
slide? If so please contact the slide mailing list. But there are many
xml apis available to extract every content of your document. If you
really want to index the whole xml file just read the file using java
io anyway I would not suggest doing that at all.

best regards simon
>
>   Thanks...
> Catalin Mititelu <ca...@yahoo.com> wrote:
>   Yes. The default max limit for indexing tokens is 10,000.
> Look here http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
>
> aslam bari wrote: Dear all,
> I am trying to index a Xml file which has 6MB size. Does lucene support the big document size. What is the limit of lucene Max file size to index.
> Because when i check and trying to search in the indexed file. I am not able to get all the results. It gives me some results but not others. I think Lucene has done partially indexed and left the remaining part which it could not processed. IS IT RIGHT?. If not , then what will be the problem.
> Thanks...
>
>
> ---------------------------------
> Find out what India is talking about on - Yahoo! Answers India
> Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW
>
>
> ---------------------------------
> Stay in the know. Pulse on the new Yahoo.com. Check it out.
>
>
> ---------------------------------
>  Find out what India is talking about on  - Yahoo! Answers India
>  Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Big Ducument Indexing Limit?

Posted by aslam bari <ia...@yahoo.co.in>.
Dear Mititelu,
  Thanks for reply. Can you help me on some samll issue related to it.
  1) I am new to Lucene. Can you tell me where is this DEFAULT_MAX_FIELD_LENGTH variable available and how to set it and for my purpose like 6-10MB file, how much i should set.
  2) how can i index all the words of XML file as Case Insensitive means either in lowercase or in Uppercase. So that i can search case insensitive.
  3) I am using XmlContentExtractor. Actually it extract only Content of the xml tag. If i want every thing (including XmlTag or contents or properties - attributes ) of xml file to be indexed, where should i do change the code.

  Thanks...
Catalin Mititelu <ca...@yahoo.com> wrote:
  Yes. The default max limit for indexing tokens is 10,000. 
Look here http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH

aslam bari wrote: Dear all,
I am trying to index a Xml file which has 6MB size. Does lucene support the big document size. What is the limit of lucene Max file size to index. 
Because when i check and trying to search in the indexed file. I am not able to get all the results. It gives me some results but not others. I think Lucene has done partially indexed and left the remaining part which it could not processed. IS IT RIGHT?. If not , then what will be the problem.
Thanks...


---------------------------------
Find out what India is talking about on - Yahoo! Answers India 
Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW


---------------------------------
Stay in the know. Pulse on the new Yahoo.com. Check it out. 

 				
---------------------------------
 Find out what India is talking about on  - Yahoo! Answers India 
 Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW

Re: Big Ducument Indexing Limit?

Posted by Catalin Mititelu <ca...@yahoo.com>.
Yes. The default max limit for indexing tokens is 10,000. 
Look here http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH

aslam bari <ia...@yahoo.co.in> wrote: Dear all,
  I am trying to index a Xml file which has 6MB size. Does lucene support the big document size. What is the limit of lucene Max file size to index. 
  Because when i check and trying to search in the indexed file. I am not able to get all the results. It gives me some results but not others. I think Lucene has done partially indexed and left the remaining part which it could not processed. IS IT RIGHT?. If not , then what will be the problem.
  Thanks...

     
---------------------------------
 Find out what India is talking about on  - Yahoo! Answers India 
 Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW

 		
---------------------------------
Stay in the know. Pulse on the new Yahoo.com.  Check it out.