Posted to solr-user@lucene.apache.org by Roman Chyla <ro...@gmail.com> on 2018/02/20 15:27:50 UTC

storing large text fields in a database? (instead of inside index)

Hello,

We have a use case of a very large index (master-slave; for unrelated
reasons the search cannot run in SolrCloud mode) - one of the fields is a
very large text field, stored mostly for highlighting. To cut down the
index size (for purposes of replication/scaling) I thought I could try to
save it in a database - and not in the index.

Lucene has pluggable codecs - one of their methods covers 'stored fields' -
so that seems like a natural path to me.
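
For concreteness, here is a minimal sketch of where that hook would live if
the codec route works out. The class and codec name below are my own
placeholders: the actual DB-backed StoredFieldsFormat, the SPI registration
(META-INF/services/org.apache.lucene.codecs.Codec) and a custom CodecFactory
on the Solr side would all still need to be written.

  import org.apache.lucene.codecs.Codec;
  import org.apache.lucene.codecs.FilterCodec;
  import org.apache.lucene.codecs.StoredFieldsFormat;

  // Hypothetical skeleton: keep the default codec and swap only the
  // stored-fields format. Returning the delegate's format keeps this
  // compilable; a real implementation would return a DB-backed format here.
  public class DbStoredFieldsCodec extends FilterCodec {

      public DbStoredFieldsCodec() {
          // register under a new name, delegate everything else to the default codec
          super("DbStoredFieldsCodec", Codec.getDefault());
      }

      @Override
      public StoredFieldsFormat storedFieldsFormat() {
          // TODO (hypothetical): return a StoredFieldsFormat whose writer pushes
          // the large stored field to the database and whose reader fetches it
          // back by document id
          return delegate.storedFieldsFormat();
      }
  }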

However, I'd expect that somebody else has run into a similar problem
before. I googled and couldn't find any solutions. Using the codecs seems
like a really good fit for this particular problem - am I missing
something? Is there a better way to cut down on index size (besides
SolrCloud/sharding and compression)?

Thank you,

   Roman

Re: storing large text fields in a database? (instead of inside index)

Posted by Roman Chyla <ro...@gmail.com>.
Hi and thanks, Emir! FieldType might indeed be another layer where the
logic could live.


Re: storing large text fields in a database? (instead of inside index)

Posted by Emir Arnautović <em...@sematext.com>.
Hi,
Maybe you could use the external field type as an example of how to hook up values from a DB: https://lucene.apache.org/solr/guide/6_6/working-with-external-files-and-processes.html

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/





Re: storing large text fields in a database? (instead of inside index)

Posted by Roman Chyla <ro...@gmail.com>.
Say there is high load and I'd like to bring up a new machine and let it
replicate the index: if 100GB or more can be shaved off, it will have a
significant impact on how quickly the new searcher is ready and added to
the cluster. The impact on search speed is likely minimal.

We are investigating the idea of two clusters, but I have to say it seems
more complex to me than storing/loading a field from an external source.
Having said that, I wonder why this was not done before (maybe it was) and
what the cons are, besides the obvious ones: maintenance, and the database
being a potential point of failure - though in that case I'd only lose
highlights, and I can live with that...


Re: storing large text fields in a database? (instead of inside index)

Posted by David Hastings <ha...@gmail.com>.
It really depends on what you consider too large, and why the size is a
big issue, since most replication will go at about 100MB/second give or
take, and replicating a 300GB index takes only an hour or two.  What I do
for this purpose is store my text in a separate index altogether, and call
on that core for highlighting.  So for my use case, the primary index with
no stored text is around 300GB and replicates as needed, and the full-text
indexes with stored text total around 500GB and are replicating non-stop.
All searching goes against the primary index, and for highlighting I call
on the full-text indexes, which have a stupid-simple schema.  This has
worked pretty well for me, at least.
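
(As a quick sanity check on those numbers: at roughly 100MB/second, a 300GB
index is about 300,000MB / 100MB/s = 3,000 seconds, i.e. around 50 minutes
of raw transfer, which lines up with "an hour or two" once overhead is
included.)

A rough SolrJ sketch of the two-core pattern above - the core names, field
names, URL and query are made-up placeholders for illustration, not details
of any real setup:

  import java.util.List;
  import java.util.Map;

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class TwoCoreHighlightSketch {
      public static void main(String[] args) throws Exception {
          // assumed layout: "primary" indexes but does not store the big text
          // field, "fulltext" both indexes and stores it
          SolrClient primary  = new HttpSolrClient.Builder("http://localhost:8983/solr/primary").build();
          SolrClient fulltext = new HttpSolrClient.Builder("http://localhost:8983/solr/fulltext").build();

          String userQuery = "body:(solr AND replication)";  // hypothetical user query

          // 1) run the search against the small index, fetching only ids
          SolrQuery search = new SolrQuery(userQuery);
          search.setFields("id");
          search.setRows(10);
          QueryResponse found = primary.query(search);

          for (SolrDocument doc : found.getResults()) {
              String id = String.valueOf(doc.getFieldValue("id"));

              // 2) re-run the query against the stored-text core, restricted to
              //    this id, asking only for highlight snippets of the big field
              SolrQuery hl = new SolrQuery(userQuery);
              hl.addFilterQuery("id:" + id);
              hl.setRows(1);
              hl.setHighlight(true);
              hl.addHighlightField("body");
              QueryResponse hlResponse = fulltext.query(hl);

              Map<String, List<String>> snippets = hlResponse.getHighlighting().get(id);
              System.out.println(id + " -> " + snippets);
          }

          primary.close();
          fulltext.close();
      }
  }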
