Posted to dev@nutch.apache.org by Doğacan Güney <do...@gmail.com> on 2010/07/03 10:00:46 UTC

Minimizing the number of stored fields for Solr

Hey everyone,

This is not really a proposition but rather something I have been wondering
for a while so I wanted to see what everyone is
thinking.

Currently in our solr backend, we have "stored=true indexed=false" fields
and "stored=true indexed=true" fields. The former
class of fields are mostly used for storing digest, caching information etc.
I suggest that we get rid of all "indexed=false" fields and
read all such data from storage backend.

For the latter class of fields (i.e., stored=true indexed=true), I suggest
that we set them to stored=false for everything but the "id" field. As an
example, title is currently stored and indexed in Solr while text is only
indexed (and thus needs to be fetched from the storage backend). But for the
HBase backend, title and text are already stored close together (in the same
column family), so the performance hit of reading just text or reading both
will likely be the same. And removing storage from Solr may lead to better
caching of indexed fields and may lead to better performance.
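
The effect of the two suggestions above on a schema can be sketched roughly as follows. This is only an illustration, not Nutch or Solr code: the `Field` class, the `apply_proposal` helper and the four-field schema are hypothetical, with field names borrowed from this message.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    name: str
    indexed: bool
    stored: bool


def apply_proposal(fields):
    """Return the fields Solr would keep under the proposal in this message."""
    kept = []
    for f in fields:
        if not f.indexed:
            # Stored-only fields (digest, caching information) would be
            # read from the storage backend instead, so drop them from Solr.
            continue
        # Indexed fields stay searchable; only "id" keeps its stored copy.
        kept.append(replace(f, stored=(f.name == "id")))
    return kept


# Hypothetical schema, loosely following the fields named in this thread.
current = [
    Field("id", indexed=True, stored=True),
    Field("title", indexed=True, stored=True),
    Field("text", indexed=True, stored=False),
    Field("digest", indexed=False, stored=True),
]

proposed = apply_proposal(current)
for f in proposed:
    print(f.name, "indexed=%s" % f.indexed, "stored=%s" % f.stored)
```

Under these assumptions "digest" disappears from Solr entirely, and "title" stays indexed but loses its stored copy.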

What does everyone think?

-- 
Doğacan Güney

Re: Minimizing the number of stored fields for Solr

Posted by Doğacan Güney <do...@gmail.com>.

On Sat, Jul 3, 2010 at 14:14, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 2010-07-03 12:41, Doğacan Güney wrote:
>
>>> Some fields of course are not useful for display, but are used for
>>> searching only (e.g. anchors). These should be indexed but not stored in
>>> Solr. And it's ok to get them from non-solr storage if requested, because
>>> it's a rare event. The same goes for the full raw content, if you want to
>>> offer a "cached" view - this should not be stored in Solr but instead it
>>> should come from a separate layer (note that sometimes cached view might not
>>> be in the original format - pdf, office, etc - and instead an html
>>> representation may be more suitable, so in general the cached view shouldn't
>>> automatically equal the original raw content).
>>>
>>>
>> I am also talking about fields like digest. For the most part, I think we
>> can get rid of all indexed=false fields.
>>
>
> Yes, digest doesn't have to be stored.
>
>
>
>> Are you sure about this part: " And it's ok to get them from non-solr
>> storage if requested, because it's a rare event." Your assumption seems to
>> be that random reads in solr will be faster (I am talking about reading
>> stored fields) than random reads in, say, hbase. The reason why I started
>>
>
> I'm pretty sure that's the case - are you aware of any benchmarks that
> would prove otherwise?
>
>
>> this discussion was that I think random reads in hbase can actually end up
>> being faster. Though you are right that this would be a premature
>> optimization at this point, I think it may be worthwhile to look into it at
>> some time in future.
>>
>
> Certainly - but at this point if you insist on keeping every non-indexed
> bit in external storage it will complicate and slow down the most common use
> case, which is just a plain search.
>
>
I am not really insisting on anything. I guess I am wrong but I thought we
do not really display any non-indexed field for plain search (it is really
just URL, title and text, no?)


>
>>> But for other fields I would argue that for now they should remain stored
>>> in Solr, *even the full text*, until we figure out how they affect the
>>> ability and performance of common search operations. E.g. if we remove the
>>> stored "title" field then we need to reach to the storage layer in order to
>>> display each page of results... not to mention issues like highlighting,
>>> faceting, function queries and a host of other functionalities that Solr can
>>> offer just because a field is stored in its index.
>>>
>>> So I'm -0 to this proposal - of course we should review our schema, and of
>>> course we should have a mechanism to get data from the storage layer, but
>>> what you propose is IMHO a premature optimization at this point.
>>>
>>>
>> You obviously make good points. Am I correct in assuming that you agree that
>> our current schema needs change? If we want to make use of solr's awesome
>>
>
> Yes, it needs to change, but not as much as you propose. :)
>
>
>> features like faceting, then it makes sense that everything (I mean,
>> everything that is returned in a typical search query) is stored in solr.
>> But currently, title is stored in solr while content is not. Thus, we have
>>
>
> That's why I wrote that we should store the content in Solr as well. We
> should store as much (and not more) data as we need to present a typical
> page of search results.
>
>
>> to hit the storage anyway. My proposition was that we remove all storage
>> from Solr, but keeping everything in Solr also makes sense if it is actually
>> everything.
>>
>
> Not everything, just enough to present a typical page of results without
> hitting external storage. For other use cases (cached view, anchors, etc)
> it's ok to use external storage because such use cases are relatively
> infrequent.
>
>
I already clarified my "everything" a couple lines above ("everything (I mean,
everything that is returned in a typical search query)") :)


>
>
>> But, IMHO, our hybrid approach may need to change.
>>
>
> FWIW, there are some discussions about implementing a hybrid storage
> directly in Solr (using column stores for stored fields), but that's
> something that will be completely transparent to us, so I think it doesn't
> bear on this discussion (especially since it's still a vaporware at this
> point).
>
>
Anyway, I am, for now, convinced that storing content is the better way to
go (compared to my proposal of removing all). I will be dropping this for
now. If, in the future, random reads in hbase are faster than solr, I'll
bring it up again.


>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney

Re: Minimizing the number of stored fields for Solr

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-07-03 12:41, Doğacan Güney wrote:

>> Some fields of course are not useful for display, but are used for
>> searching only (e.g. anchors). These should be indexed but not stored in
>> Solr. And it's ok to get them from non-solr storage if requested, because
>> it's a rare event. The same goes for the full raw content, if you want to
>> offer a "cached" view - this should not be stored in Solr but instead it
>> should come from a separate layer (note that sometimes cached view might not
>> be in the original format - pdf, office, etc - and instead an html
>> representation may be more suitable, so in general the cached view shouldn't
>> automatically equal the original raw content).
>>
>>
> I am also talking about fields like digest. For the most part, I think we
> can get rid of all indexed=false fields.

Yes, digest doesn't have to be stored.

>
> Are you sure about this part: " And it's ok to get them from non-solr
> storage if requested, because it's a rare event." Your assumption seems to
> be that random reads in solr will be faster (I am talking about reading
> stored fields) than random reads in, say, hbase. The reason why I started

I'm pretty sure that's the case - are you aware of any benchmarks that 
would prove otherwise?

> this discussion was that I think random reads in hbase can actually end up
> being faster. Though you are right that this would be a premature
> optimization at this point, I think it may be worthwhile to look into it at
> some time in future.

Certainly - but at this point if you insist on keeping every non-indexed 
bit in external storage it will complicate and slow down the most common 
use case, which is just a plain search.

>> But for other fields I would argue that for now they should remain stored
>> in Solr, *even the full text*, until we figure out how they affect the
>> ability and performance of common search operations. E.g. if we remove the
>> stored "title" field then we need to reach to the storage layer in order to
>> display each page of results... not to mention issues like highlighting,
>> faceting, function queries and a host of other functionalities that Solr can
>> offer just because a field is stored in its index.
>>
>> So I'm -0 to this proposal - of course we should review our schema, and of
>> course we should have a mechanism to get data from the storage layer, but
>> what you propose is IMHO a premature optimization at this point.
>>
>>
> You obviously make good points. Am I correct in assuming that you agree that
> our current schema needs change? If we want to make use of solr's awesome

Yes, it needs to change, but not as much as you propose. :)

> features like faceting, then it makes sense that everything (I mean,
> everything that is returned in a typical search query) is stored in solr.
> But currently, title is stored in solr while content is not. Thus, we have

That's why I wrote that we should store the content in Solr as well. We 
should store as much (and not more) data as we need to present a typical 
page of search results.

> to hit the storage anyway. My proposition was that we remove all storage
> from Solr, but keeping everything in Solr also makes sense if it is actually
> everything.

Not everything, just enough to present a typical page of results without 
hitting external storage. For other use cases (cached view, anchors, 
etc) it's ok to use external storage because such use cases are 
relatively infrequent.
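
A rough sketch of that split, assuming a results page built only from fields stored in Solr and a cached view served from a separate layer. Everything here is hypothetical: `ExternalStore` is an in-memory stand-in for the storage backend, and the field names "url", "title", "content" and "raw_content" are chosen for illustration only.

```python
class ExternalStore:
    """In-memory stand-in for the storage backend; counts every read."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def get(self, doc_id, field):
        self.reads += 1
        return self.rows[doc_id][field]


def render_results(hits):
    """Build a results page from the stored fields carried by each hit."""
    return ["%s - %s" % (h["title"], h["url"]) for h in hits]


def cached_view(doc_id, store):
    """Serve the infrequent cached view from the external layer."""
    return store.get(doc_id, "raw_content")


# Hypothetical hits as they might come back from Solr with stored fields.
hits = [
    {"id": "doc1", "url": "http://example.org/a", "title": "Page A", "content": "..."},
    {"id": "doc2", "url": "http://example.org/b", "title": "Page B", "content": "..."},
]
store = ExternalStore({
    "doc1": {"raw_content": "<html>A</html>"},
    "doc2": {"raw_content": "<html>B</html>"},
})

page = render_results(hits)
reads_after_page = store.reads      # the common case never touches the store
cached = cached_view("doc1", store)
reads_after_cached = store.reads    # the rare case costs one external read
print(page, reads_after_page, reads_after_cached)
```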


> But, IMHO, our hybrid approach may need to change.

FWIW, there are some discussions about implementing a hybrid storage 
directly in Solr (using column stores for stored fields), but that's 
something that will be completely transparent to us, so I think it 
doesn't bear on this discussion (especially since it's still a vaporware 
at this point).

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Minimizing the number of stored fields for Solr

Posted by Doğacan Güney <do...@gmail.com>.

Hi,

On Sat, Jul 3, 2010 at 13:12, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 2010-07-03 10:00, Doğacan Güney wrote:
>
>> Hey everyone,
>>
>> This is not really a proposition but rather something I have been
>> wondering
>> for a while so I wanted to see what everyone is
>> thinking.
>>
>> Currently in our solr backend, we have "stored=true indexed=false" fields
>> and "stored=true indexed=true" fields. The former
>> class of fields are mostly used for storing digest, caching information
>> etc.
>> I suggest that we get rid of all "indexed=false" fields and
>> read all such data from storage backend.
>>
>> For the latter class of fields (i.e., stored=true indexed=true), I suggest
>> that we set them to stored=false for everything but "id" field. As an
>> example currently title is stored/indexed in solr while text is only
>> indexed
>> (thus, will need to be fetched from storage backend). But for hbase
>> backend, title and text are already stored close together (in the same
>> column family) so performance hit of reading just text or reading both
>> will likely be same. And removing storage from solr may lead to better
>> caching of indexed fields and may lead to better example.
>>
>> What does everyone think?
>>
>>
> The issue is not as simple as it looks. If you want to have a good
> performance for searching & snippet generation then you still need to store
> some data in stored fields - at least url, title, and plain text (not to
> mention the option to use term vectors in order to speed up the snippet
> generation). Solr functionality can be also impaired by a lack of data
> available directly from Lucene storage (field cache, faceting, term vector
> highlighting).
>
> Some fields of course are not useful for display, but are used for
> searching only (e.g. anchors). These should be indexed but not stored in
> Solr. And it's ok to get them from non-solr storage if requested, because
> it's a rare event. The same goes for the full raw content, if you want to
> offer a "cached" view - this should not be stored in Solr but instead it
> should come from a separate layer (note that sometimes cached view might not
> be in the original format - pdf, office, etc - and instead an html
> representation may be more suitable, so in general the cached view shouldn't
> automatically equal the original raw content).
>
>
I am also talking about fields like digest. For the most part, I think we
can get rid of all indexed=false fields.

Are you sure about this part: " And it's ok to get them from non-solr
storage if requested, because it's a rare event." Your assumption seems to
be that random reads in solr will be faster (I am talking about reading
stored fields) than random reads in, say, hbase. The reason why I started
this discussion was that I think random reads in hbase can actually end up
being faster. Though you are right that this would be a premature
optimization at this point, I think it may be worthwhile to look into it at
some time in future.


> But for other fields I would argue that for now they should remain stored
> in Solr, *even the full text*, until we figure out how they affect the
> ability and performance of common search operations. E.g. if we remove the
> stored "title" field then we need to reach to the storage layer in order to
> display each page of results... not to mention issues like highlighting,
> faceting, function queries and a host of other functionalities that Solr can
> offer just because a field is stored in its index.
>
> So I'm -0 to this proposal - of course we should review our schema, and of
> course we should have a mechanism to get data from the storage layer, but
> what you propose is IMHO a premature optimization at this point.
>
>
You obviously make good points. Am I correct in assuming that you agree that
our current schema needs change? If we want to make use of solr's awesome
features like faceting, then it makes sense that everything (I mean,
everything that is returned in a typical search query) is stored in solr.
But currently, title is stored in solr while content is not. Thus, we have
to hit the storage anyway. My proposition was that we remove all storage
from Solr, but keeping everything in Solr also makes sense if it is actually
everything. But, IMHO, our hybrid approach may need to change.


> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney

Re: Minimizing the number of stored fields for Solr

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-07-03 10:00, Doğacan Güney wrote:
> Hey everyone,
>
> This is not really a proposition but rather something I have been wondering
> for a while so I wanted to see what everyone is
> thinking.
>
> Currently in our solr backend, we have "stored=true indexed=false" fields
> and "stored=true indexed=true" fields. The former
> class of fields are mostly used for storing digest, caching information etc.
> I suggest that we get rid of all "indexed=false" fields and
> read all such data from storage backend.
>
> For the latter class of fields (i.e., stored=true indexed=true), I suggest
> that we set them to stored=false for everything but "id" field. As an
> example currently title is stored/indexed in solr while text is only indexed
> (thus, will need to be fetched from storage backend). But for hbase
> backend, title and text are already stored close together (in the same
> column family) so performance hit of reading just text or reading both
> will likely be same. And removing storage from solr may lead to better
> caching of indexed fields and may lead to better example.
>
> What does everyone think?
>

The issue is not as simple as it looks. If you want to have a good 
performance for searching & snippet generation then you still need to 
store some data in stored fields - at least url, title, and plain text 
(not to mention the option to use term vectors in order to speed up the 
snippet generation). Solr functionality can be also impaired by a lack 
of data available directly from Lucene storage (field cache, faceting, 
term vector highlighting).

Some fields of course are not useful for display, but are used for 
searching only (e.g. anchors). These should be indexed but not stored in 
Solr. And it's ok to get them from non-solr storage if requested, 
because it's a rare event. The same goes for the full raw content, if 
you want to offer a "cached" view - this should not be stored in Solr 
but instead it should come from a separate layer (note that sometimes 
cached view might not be in the original format - pdf, office, etc - and 
instead an html representation may be more suitable, so in general the 
cached view shouldn't automatically equal the original raw content).

But for other fields I would argue that for now they should remain 
stored in Solr, *even the full text*, until we figure out how they 
affect the ability and performance of common search operations. E.g. if 
we remove the stored "title" field then we need to reach to the storage 
layer in order to display each page of results... not to mention issues 
like highlighting, faceting, function queries and a host of other 
functionalities that Solr can offer just because a field is stored in 
its index.
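
The cost described above can be put in rough numbers. The sketch below is a toy model, not a benchmark: `external_reads_per_page` is a hypothetical helper, the display field names are illustrative, and it assumes that one backend read per hit returns every missing field at once (for example because they share a column family).

```python
def external_reads_per_page(hits_per_page, display_fields, stored_in_solr):
    """Count storage-layer reads needed to render one page of results.

    Assumption: a single backend read per hit fetches all fields that
    are missing from Solr's stored fields.
    """
    missing = set(display_fields) - set(stored_in_solr)
    return hits_per_page if missing else 0


# Hypothetical fields shown for each hit on a results page.
display = ["url", "title", "content"]

# Title stored but content not: every hit still needs a backend read.
hybrid = external_reads_per_page(10, display, {"url", "title"})
# Everything needed for display is stored in Solr: no backend reads.
all_stored = external_reads_per_page(10, display, {"url", "title", "content"})
# Only "id" stored: again one backend read per hit.
id_only = external_reads_per_page(10, display, {"id"})

print(hybrid, all_stored, id_only)
```

In this model, storing only some of the display fields saves no backend reads compared with storing none of them; only storing all of them removes the per-hit lookup.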

So I'm -0 to this proposal - of course we should review our schema, and 
of course we should have a mechanism to get data from the storage layer, 
but what you propose is IMHO a premature optimization at this point.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com