Posted to solr-user@lucene.apache.org by Avishai Ish-Shalom <av...@fewbytes.com> on 2014/12/01 23:10:39 UTC

Large fields storage

Hi all,

I have very large documents (as big as 1GB) which I'm indexing and planning
to store in Solr in order to use highlighting snippets. I am concerned
about possible performance issues with such large fields - does storing the
fields require additional RAM beyond what is required to index/fetch/search?
I'm assuming Solr reads only the required data by offset from the storage
and not the entire field. Am I correct in this assumption?

Does anyone on this list have experience to share with such large documents?

Thanks,
Avishai

Re: Large fields storage

Posted by Shawn Heisey <ap...@elyograg.org>.
On 12/1/2014 3:10 PM, Avishai Ish-Shalom wrote:
> I have very large documents (as big as 1GB) which I'm indexing and planning
> to store in Solr in order to use highlighting snippets. I am concerned
> about possible performance issues with such large fields - does storing the
> fields require additional RAM beyond what is required to index/fetch/search?
> I'm assuming Solr reads only the required data by offset from the storage
> and not the entire field. Am I correct in this assumption?
>
> Does anyone on this list have experience to share with such large documents?

You've gotten some excellent replies already; I just wanted to mention
compression.

Short answer to the question about RAM: You might need a fair amount of
extra memory for the Java heap.  Because the index can get very large,
you'll also want a large amount of memory beyond the heap for the
operating system's disk cache.

More detailed info:

If the fl parameter includes the field with that large data, the response
that Solr builds to send to the user will need enough memory to hold that
data for as many documents as the "rows" parameter requests.  If it's a
distributed index, some of that data might cross the network twice -- once
from the shard that stores it to the node aggregating the response, and
again from that node to the client.
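One way to avoid that cost is to leave the large field out of "fl" entirely
and let highlighting return only the snippets.  A minimal sketch of request
defaults in solrconfig.xml, assuming the big field is named "content" (the
field and handler names here are just placeholders, not from this thread):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- return only small fields; never ship the huge stored field -->
      <str name="fl">id,title,score</str>
      <!-- highlight against the large field instead -->
      <str name="hl">true</str>
      <str name="hl.fl">content</str>
      <str name="hl.snippets">3</str>
      <str name="hl.fragsize">200</str>
    </lst>
  </requestHandler>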

In Solr 4.1 and later, stored fields are compressed, with no way to turn
compression off.  With very large stored fields, there may be
performance and memory implications for both indexing (compression) and
queries (decompression). Termvectors (which Michael Sokolov mentioned in
his reply) have been compressed since version 4.2.

More memory will probably be required for "ramBufferSizeMB" -- a
temporary storage area in RAM used during indexing.  That defaults to
100MB in recent Solr versions.  This is normally enough for several
hundred or several thousand typical documents, but just one of your
documents may not fit.  This will increase your heap requirements.
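That setting lives in <indexConfig> in solrconfig.xml.  A sketch of raising
it enough that a single huge document fits in the buffer (the 2048 figure is
only an illustration, not a tested recommendation):

  <indexConfig>
    <!-- the default is 100; a 1GB document will not fit in that -->
    <ramBufferSizeMB>2048</ramBufferSizeMB>
  </indexConfig>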

As for whether there is a way to retrieve specific data from the
compressed information without decompressing all of it, I do not know.
The compression is handled by the Lucene layer, not by Solr itself.

https://issues.apache.org/jira/browse/LUCENE-4226

Thanks,
Shawn


Re: Large fields storage

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
There's no appreciable RAM cost during querying, faceting, sorting of 
search results and so on.  Stored fields are separate from the inverted 
index.  There is some cost in additional disk space required and I/O 
during merging, but I think you'll find these are not significant.  The 
main cost we've observed from handling very large texts is highlighting. 
The default highlighter essentially re-scans the entire document, so 
it's necessary to limit its scope to get decent performance. 
FastVectorHighlighter is better, but also has some scaling issues with 
large documents, and it requires term vectors, which are expensive in 
their own right.  We've gotten the best performance from
PostingsHighlighter, but it doesn't handle phrase-sensitive
highlighting, and I will say that I haven't tried it on documents as
large as that: I believe it builds a mini-index on the fly in order
to score highlighting passages, and that could get expensive with 1GB docs.
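To make those trade-offs concrete, the per-highlighter requirements map
roughly onto schema and request settings like the following.  This is only
a sketch: the field name "content", the field type, and the limits shown
are assumptions, and you would normally pick one highlighter rather than
enabling everything at once:

  <!-- schema.xml: term vectors feed the FastVectorHighlighter, and
       storeOffsetsWithPositions feeds the PostingsHighlighter; both
       add noticeable index size for a field this big -->
  <field name="content" type="text_general" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"
         storeOffsetsWithPositions="true"/>

  <!-- solrconfig.xml, inside a request handler's defaults: cap how much
       of each document the default highlighter scans, or switch to the
       FastVectorHighlighter -->
  <lst name="defaults">
    <str name="hl.maxAnalyzedChars">51200</str>
    <str name="hl.useFastVectorHighlighter">true</str>
  </lst>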

You might find in the end that you are better off splitting these very 
large documents into smaller pieces and rolling those up using 
parent/child document indexing or grouping or something, primarily 
because of the highlighting.
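A sketch of what that splitting could look like, using the nested-document
support in Solr's XML update format (available from Solr 4.5; every field
name here is made up for illustration):

  <add>
    <doc>
      <field name="id">book-1</field>
      <field name="doc_type">parent</field>
      <!-- each child carries one manageable slice of the original text -->
      <doc>
        <field name="id">book-1-chunk-0001</field>
        <field name="doc_type">chunk</field>
        <field name="content">...first slice of the large text...</field>
      </doc>
      <doc>
        <field name="id">book-1-chunk-0002</field>
        <field name="doc_type">chunk</field>
        <field name="content">...next slice...</field>
      </doc>
    </doc>
  </add>

You would then search and highlight against the small "content" fields of
the chunks and roll results back up to the parent with the block join
query parser or with grouping.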

-Mike

On 12/3/14 4:56 PM, Avishai Ish-Shalom wrote:
> The use case is not for PDFs or documents with images, but very large text
> documents. My question is: does storing the documents degrade performance
> more than just indexing without storing? I will only return highlighted
> text of limited length and probably never download the entire document.


Re: Large fields storage

Posted by Avishai Ish-Shalom <av...@fewbytes.com>.
The use case is not for PDFs or documents with images, but very large text
documents. My question is: does storing the documents degrade performance
more than just indexing without storing? I will only return highlighted
text of limited length and probably never download the entire document.

On Tue, Dec 2, 2014 at 2:15 AM, Jack Krupansky <ja...@basetechnology.com>
wrote:

> In particular, if they are image-intensive, all the images go away. And
> the formatting as well.
>
> -- Jack Krupansky

Re: Large fields storage

Posted by Jack Krupansky <ja...@basetechnology.com>.
In particular, if they are image-intensive, all the images go away. And the 
formatting as well.

-- Jack Krupansky

-----Original Message----- 
From: Ahmet Arslan
Sent: Monday, December 1, 2014 6:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Large fields storage

Hi Avi,

> I assume your documents are rich documents like PDF or Word, am I correct?
> When you extract textual content from them, their size will shrink.

Ahmet


Re: Large fields storage

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Avi,

I assume your documents are rich documents like PDF or Word, am I correct?
When you extract textual content from them, their size will shrink.
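If the sources ever are rich documents, the usual route for that extraction
is Solr Cell (the ExtractingRequestHandler, which needs the solr-cell
contrib jars).  A minimal sketch (the target field name and the "ignored_"
prefix are assumptions, not something from this thread):

  <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <!-- put the extracted body text into the "content" field -->
      <str name="fmap.content">content</str>
      <!-- lowercase incoming field names and shunt unknown ones into an
           ignored_* dynamic field (assumed to exist in the schema) -->
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>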

Ahmet



On Tuesday, December 2, 2014 12:11 AM, Avishai Ish-Shalom <av...@fewbytes.com> wrote:
Hi all,

I have very large documents (as big as 1GB) which I'm indexing and planning
to store in Solr in order to use highlighting snippets. I am concerned
about possible performance issues with such large fields - does storing the
fields require additional RAM beyond what is required to index/fetch/search?
I'm assuming Solr reads only the required data by offset from the storage
and not the entire field. Am I correct in this assumption?

Does anyone on this list have experience to share with such large documents?

Thanks,
Avishai

Re: Large fields storage

Posted by Erick Erickson <er...@gmail.com>.
I really have to question the utility of this. The doc will
match a _lot_ of queries, but I'd guess it will be scored
quite low due to length normalization.

And even if the user does decide to click on the document,
are they then going to download a document bigger than 1GB?

All in all, your concerns about performance are well founded,
but I have to wonder whether this is an XY problem....

Best,
Erick

On Mon, Dec 1, 2014 at 2:10 PM, Avishai Ish-Shalom <av...@fewbytes.com> wrote:
> Hi all,
>
> I have very large documents (as big as 1GB) which I'm indexing and planning
> to store in Solr in order to use highlighting snippets. I am concerned
> about possible performance issues with such large fields - does storing the
> fields require additional RAM beyond what is required to index/fetch/search?
> I'm assuming Solr reads only the required data by offset from the storage
> and not the entire field. Am I correct in this assumption?
>
> Does anyone on this list have experience to share with such large documents?
>
> Thanks,
> Avishai