Posted to solr-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2011/06/08 00:59:09 UTC

400 MB Fields

Hello,

What are the biggest document fields that you've ever indexed in Solr or that 
you've heard of?  Ah, it must be Tom's Hathi trust. :)

I'm asking because I just heard of a case of an index where some documents
have a field that can be around 400 MB in size!  I'm curious if anyone has any
experience with such monster fields?
Crazy?  Yes, sure.
Doable?

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: 400 MB Fields

Posted by Lance Norskog <go...@gmail.com>.
The Salesforce book is 2800 pages of PDF, last I looked.

What can you do with a field that big? Can you get all of the snippets?
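
Just to make that concrete, here is a rough SolrJ sketch (3.x-era client; the
core URL and the "body" field name are made up, and the exact defaults may
differ).  The highlighter only analyzes the leading chunk of each field
(hl.maxAnalyzedChars), so snippets from deep inside a 400 MB field would never
show up unless that limit is raised:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HugeFieldSnippets {
        public static void main(String[] args) throws Exception {
            // Hypothetical local Solr instance; adjust the URL for your setup.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery q = new SolrQuery("body:lucene");
            q.setHighlight(true);
            q.setParam("hl.fl", "body");
            // Without raising this limit, only the leading part of the field
            // is considered for snippets; the rest of a 400 MB value is ignored.
            q.setParam("hl.maxAnalyzedChars", "500000000");

            QueryResponse rsp = server.query(q);
            System.out.println(rsp.getHighlighting());
        }
    }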

On Tue, Jun 7, 2011 at 5:33 PM, Fuad Efendi <fu...@efendi.ca> wrote:
> Hi Otis,
>
>
> I am reminded of the "pagination" issue, which is still unresolved (with
> the default scoring implementation): even with small documents, retrieving
> results 1 to 10 can take close to 0 milliseconds, but retrieving results
> 100,000 to 100,010 can take a few minutes (I saw this with the trunk
> version 6 months ago, with very small documents and about 100 million docs
> in total); it is advisable to restrict search results to the top 1,000 in
> any case (as Google does)...
>
>
>
> I believe things can go wrong; yes, most plain text extracted from books
> should be about 2 KB per page, so 500 pages => roughly 1,000,000 bytes (or
> double that for UTF-8).
>
> Theoretically, it doesn't make sense to index a BIG document containing
> every term in the dictionary without any "term frequency" calculations,
> but even with them... I can't imagine we should index thousands of docs
> where each one is just a (different) version of the whole of Wikipedia;
> that must be wrong design...
>
> OK, use case: index a single HUGE document. What will we do? Create an
> index with _the_only_ document? Then every search will return the same
> result (or nothing)? Paginate it; split it into pages. I am pragmatic...
>
>
> Fuad
>
>
>
> On 11-06-07 8:04 PM, "Otis Gospodnetic" <ot...@yahoo.com> wrote:
>
>>Hi,
>>
>>
>>> I think the question is strange... May be you are wondering about
>>>possible
>>> OOM exceptions?
>>
>>No, that's an easier one. I was more wondering whether with 400 MB Fields
>>(indexed, not stored) it becomes incredibly slow to:
>>* analyze
>>* commit / write to disk
>>* search
>>
>>> I think we can pass to Lucene single document  containing
>>> comma separated list of "term, term, ..." (few billion times)...  Except
>>> "stored" and "TermVectorComponent"...
>
>
>



-- 
Lance Norskog
goksron@gmail.com

Re: 400 MB Fields

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Otis,


I am reminded of the "pagination" issue, which is still unresolved (with the
default scoring implementation): even with small documents, retrieving
results 1 to 10 can take close to 0 milliseconds, but retrieving results
100,000 to 100,010 can take a few minutes (I saw this with the trunk version
6 months ago, with very small documents and about 100 million docs in
total); it is advisable to restrict search results to the top 1,000 in any
case (as Google does)...
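
To illustrate what I mean, a quick SolrJ sketch (assuming a 3.x-era client;
the URL and the "body" field name are made up).  The deep page forces Solr
to collect and sort 100,010 hits just to return 10 of them:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DeepPagingCost {
        public static void main(String[] args) throws Exception {
            // Hypothetical local Solr instance; adjust for your setup.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            int[] starts = { 0, 100000 };   // page 1 vs. a deep page
            for (int start : starts) {
                SolrQuery q = new SolrQuery("body:lucene");
                q.setStart(start);
                q.setRows(10);

                long t0 = System.currentTimeMillis();
                QueryResponse rsp = server.query(q);
                System.out.println("start=" + start + " took "
                        + (System.currentTimeMillis() - t0) + " ms, numFound="
                        + rsp.getResults().getNumFound());
            }
        }
    }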



I believe things can go wrong; yes, most plain text extracted from books
should be about 2 KB per page, so 500 pages => roughly 1,000,000 bytes (or
double that for UTF-8).

Theoretically, it doesn't make sense to index a BIG document containing
every term in the dictionary without any "term frequency" calculations, but
even with them... I can't imagine we should index thousands of docs where
each one is just a (different) version of the whole of Wikipedia; that must
be wrong design...

OK, use case: index a single HUGE document. What will we do? Create an index
with _the_only_ document? Then every search will return the same result (or
nothing)? Paginate it; split it into pages. I am pragmatic...
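
Something like the following, a minimal Lucene 3.x-style sketch (the field
names are just examples): each page becomes its own small document linked
back to the book, instead of one 400 MB field.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class PageSplitter {

        // Split one huge text into page-sized documents (e.g. ~2 KB each).
        public static List<Document> split(String bookId, String fullText,
                                           int pageChars) {
            List<Document> pages = new ArrayList<Document>();
            for (int start = 0, page = 1; start < fullText.length();
                    start += pageChars, page++) {
                int end = Math.min(start + pageChars, fullText.length());

                Document doc = new Document();
                doc.add(new Field("book_id", bookId,
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("page", Integer.toString(page),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("body", fullText.substring(start, end),
                        Field.Store.NO, Field.Index.ANALYZED));
                pages.add(doc);
            }
            return pages;
        }
    }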


Fuad



On 11-06-07 8:04 PM, "Otis Gospodnetic" <ot...@yahoo.com> wrote:

>Hi,
>
>
>> I think the question is strange... May be you are wondering about
>>possible
>> OOM exceptions? 
>
>No, that's an easier one. I was more wondering whether with 400 MB Fields
>(indexed, not stored) it becomes incredibly slow to:
>* analyze
>* commit / write to disk
>* search
>
>> I think we can pass to Lucene single document  containing
>> comma separated list of "term, term, ..." (few billion times)...  Except
>> "stored" and "TermVectorComponent"...



Re: 400 MB Fields

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,


> I think the question is strange... Maybe you are wondering about possible
> OOM exceptions?

No, that's an easier one. I was more wondering whether with 400 MB Fields 
(indexed, not stored) it becomes incredibly slow to:
* analyze
* commit / write to disk
* search
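
To make the first bullet concrete, this is the kind of micro-benchmark I
have in mind: a Lucene 3.x-style sketch that just runs an analyzer over one
synthetic monster field and times it (the field name, token mix and sizes
are arbitrary; scale the loop up to approach 400 MB):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class AnalyzeTiming {
        public static void main(String[] args) throws Exception {
            // Build a big synthetic field value (roughly 10 MB here).
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1000000; i++) {
                sb.append("term").append(i % 50000).append(' ');
            }
            String hugeValue = sb.toString();

            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

            long t0 = System.currentTimeMillis();
            TokenStream ts = analyzer.tokenStream("body", new StringReader(hugeValue));
            long tokens = 0;
            ts.reset();
            while (ts.incrementToken()) {
                tokens++;
            }
            ts.end();
            ts.close();

            System.out.println(tokens + " tokens analyzed in "
                    + (System.currentTimeMillis() - t0) + " ms");
        }
    }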

> I think we can pass Lucene a single document containing a
> comma-separated list of "term, term, ..." (a few billion times)... except
> for "stored" fields and the "TermVectorComponent"...

Oh, I know it can be done, but I'm wondering how bad things (like the ones 
above) get.

> I believe thousands of companies have already indexed millions of
> documents with an average size of a few hundred megabytes... There should
> not be any limits (except

Which ones are you thinking about?  What sort of documents?

> 100,000 _unique_ terms vs. a single document containing 100,000,000,000,000
> non-unique terms (and trying to store offsets)
>
> Personally, I have indexed only small (up to 1000 bytes) document fields,
> but I believe 500 MB is a very common use case with PDFs (which vendors use

Nah, PDF files may be big, but I think the text in them is often not *that* big, 
unless those are PDFs of very big books.

Thanks,
Otis


> On 11-06-07 7:02 PM, "Erick Erickson" <er...@gmail.com> wrote:
> 
> >From older (2.4) Lucene days, I once indexed the 23 volume "Encyclopedia
> >of Michigan Civil War Volunteers" in a single document/field, so it's
> >probably
> >within the realm of possibility at least <G>...
> >
> >Erick
> >
> >On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic
> ><ot...@yahoo.com> wrote:
> >> Hello,
> >>
> >> What are the biggest document fields that you've ever indexed in Solr
> >>or that
> >> you've heard of?  Ah, it must be Tom's Hathi trust. :)
> >>
> >> I'm asking because I just heard of a case of an index where some
> >>documents
> >> have a field that can be around 400 MB in size!  I'm curious if
> >>anyone has any
> >> experience with such monster fields?
> >> Crazy?  Yes, sure.
> >> Doable?
> >>
> >> Otis
> >> ----
> >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> >> Lucene ecosystem search :: http://search-lucene.com/
> >>
> >>
> 
> 
> > 


Re: 400 MB Fields

Posted by Fuad Efendi <fu...@efendi.ca>.
I think the question is strange... Maybe you are wondering about possible
OOM exceptions? I think we can pass Lucene a single document containing a
comma-separated list of "term, term, ..." (a few billion times)... except
for "stored" fields and the "TermVectorComponent"...

I believe thousands of companies have already indexed millions of documents
with an average size of a few hundred megabytes... There should not be any
limits (except InputSource vs. ByteArray).

100,000 _unique_ terms vs. a single document containing 100,000,000,000,000
non-unique terms (and trying to store offsets).

What about the "Spell Checker" feature? Has anyone tried to index a single
terabyte-sized document?

Personally, I have indexed only small (up to 1000 bytes) document fields,
but I believe 500 MB is a very common use case with PDFs (which vendors use
Lucene already? Eclipse, to index the Eclipse Help files? Even Microsoft
uses Lucene...)


Fuad




On 11-06-07 7:02 PM, "Erick Erickson" <er...@gmail.com> wrote:

>From older (2.4) Lucene days, I once indexed the 23 volume "Encyclopedia
>of Michigan Civil War Volunteers" in a single document/field, so it's
>probably
>within the realm of possibility at least <G>...
>
>Erick
>
>On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic
><ot...@yahoo.com> wrote:
>> Hello,
>>
>> What are the biggest document fields that you've ever indexed in Solr
>>or that
>> you've heard of?  Ah, it must be Tom's Hathi trust. :)
>>
>> I'm asking because I just heard of a case of an index where some
>>documents
>> have a field that can be around 400 MB in size!  I'm curious if
>>anyone has any
>> experience with such monster fields?
>> Crazy?  Yes, sure.
>> Doable?
>>
>> Otis
>> ----
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>>



Re: 400 MB Fields

Posted by Erick Erickson <er...@gmail.com>.
From older (2.4) Lucene days, I once indexed the 23 volume "Encyclopedia
of Michigan Civil War Volunteers" in a single document/field, so it's probably
within the realm of possibility at least <G>...

Erick

On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Hello,
>
> What are the biggest document fields that you've ever indexed in Solr or that
> you've heard of?  Ah, it must be Tom's Hathi trust. :)
>
> I'm asking because I just heard of a case of an index where some documents
> have a field that can be around 400 MB in size!  I'm curious if anyone has any
> experience with such monster fields?
> Crazy?  Yes, sure.
> Doable?
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>

Re: 400 MB Fields

Posted by Alexander Kanarsky <ka...@gmail.com>.
Otis,

Not sure about Solr, but with Lucene it was certainly doable. I have seen
fields way bigger than 400 MB indexed, sometimes with a large set of unique
terms as well (think something like a log file full of alphanumeric tokens,
a couple of gigabytes in size). When indexing and querying such things, the
I/O, naturally, could easily become a bottleneck.

-Alexander

RE: 400 MB Fields

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Otis, 

Our OCR fields average around 800 KB.  My guess is that the largest docs we
index (in a single OCR field) are somewhere between 2 and 10 MB.  We have had
issues where the in-memory representation of the document (the in-memory
index structures being built) is several times the size of the text, so I
suspect that even with the largest ramBufferSizeMB you might run into
problems.  (This is with the 3.x branch.  Trunk might not have this problem,
since it's much more memory efficient when indexing.)
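
At the raw Lucene level (3.1+ API) the knob I mean looks roughly like the
sketch below; in Solr it is the ramBufferSizeMB setting in solrconfig.xml.
The 1024 value is just an arbitrary example, not a recommendation:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class BigDocWriter {
        public static IndexWriter open(File indexDir) throws Exception {
            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                    new StandardAnalyzer(Version.LUCENE_36));
            // Let more of the in-memory index structures accumulate before a
            // flush; for huge fields they can be several times the raw text size.
            cfg.setRAMBufferSizeMB(1024.0);
            return new IndexWriter(FSDirectory.open(indexDir), cfg);
        }
    }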

Tom Burton-West
www.hathitrust.org/blogs
________________________________________
From: Otis Gospodnetic [otis_gospodnetic@yahoo.com]
Sent: Tuesday, June 07, 2011 6:59 PM
To: solr-user@lucene.apache.org
Subject: 400 MB Fields

Hello,

What are the biggest document fields that you've ever indexed in Solr or that
you've heard of?  Ah, it must be Tom's Hathi trust. :)

I'm asking because I just heard of a case of an index where some documents
have a field that can be around 400 MB in size!  I'm curious if anyone has any
experience with such monster fields?
Crazy?  Yes, sure.
Doable?

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/