Posted to dev@lucene.apache.org by Adrien Grand <jp...@gmail.com> on 2016/10/26 13:05:11 UTC

Future of FieldCache in Solr

Hi all,

I'm sending this email as there seem to be different expectations about the
future of FieldCache depending on who you are asking. To me FieldCache (and
uninverting in general) is a legacy feature that has been superseded by doc
values since version 4.0, when we introduced them. But we seem to still
care a lot about uninverting, which does not make sense to me since
everybody should have moved to doc values already?

For the record, doc values have many benefits over FieldCache: better
compression (table compression, gcd compression, prefix compression of the
terms dicts, etc.), faster reopens, and off-heap storage. In master they
also have better support for sparse fields, which is something that
FieldCache cannot do because it is built in random doc ID order (or it
would have to generate more garbage and slow down reopens even more).
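
To make one of the compression techniques named above concrete, here is a
minimal sketch of GCD compression in Java. It is an illustration only, not
Lucene's codec code: the class GcdCompressionSketch and its methods are
hypothetical names, and it assumes that the differences between values do not
overflow a long. The idea is that when all values share a common divisor
(timestamps with day resolution stored as milliseconds, for example), only the
small quotients need to be stored, along with the minimum and the divisor.

```java
import java.util.Arrays;

/** Toy illustration of GCD compression; hypothetical code, not Lucene's codec. */
final class GcdCompressionSketch {

    /** Greatest common divisor of two non-negative longs (gcd(0, x) == x). */
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** Number of bits needed to represent every value in [0, maxValue]. */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    /** Encoded form: the original value of doc i is min + gcd * quotients[i]. */
    static final class Encoded {
        final long min;
        final long gcd;
        final long[] quotients;

        Encoded(long min, long gcd, long[] quotients) {
            this.min = min;
            this.gcd = gcd;
            this.quotients = quotients;
        }

        /** Decodes the value stored for the given doc ID. */
        long get(int doc) {
            return min + gcd * quotients[doc];
        }

        /** Bits per value a bit-packed array of the quotients would need. */
        int bitsPerValue() {
            long max = 0;
            for (long q : quotients) {
                max = Math.max(max, q);
            }
            return bitsRequired(max);
        }
    }

    /** Assumes (value - min) never overflows a long. */
    static Encoded encode(long[] values) {
        long min = Arrays.stream(values).min().orElse(0L);
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min);
        }
        if (g == 0) {
            g = 1; // all values are equal (or there are none)
        }
        long[] quotients = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            quotients[i] = (values[i] - min) / g;
        }
        return new Encoded(min, g, quotients);
    }

    public static void main(String[] args) {
        long day = 86_400_000L; // milliseconds per day
        long base = 1_477_440_000_000L;
        long[] values = {base, base + 3 * day, base + 10 * day};
        Encoded e = encode(values);
        System.out.println("raw bits per value: " + bitsRequired(base + 10 * day));
        System.out.println("gcd=" + e.gcd + ", packed bits per value: " + e.bitsPerValue());
    }
}
```

In this example each raw value would need 41 bits, while the quotients
(0, 3 and 10) fit in 4 bits.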

The documentation (
https://cwiki.apache.org/confluence/display/solr/DocValues) and some online
resources (e.g.
https://support.lucidworks.com/hc/en-us/articles/201839163-When-to-use-DocValues-in-Solr)
already recommend using doc values for sorting, faceting and
function queries. I think it's time to schedule the entire removal of
FieldCache from Solr?

Re: Future of FieldCache in Solr

Posted by David Smiley <da...@gmail.com>.
I'm +1 to phase the FieldCache (UninvertedField) out in some release ahead,
like Solr 7.  The upgrade process is to switch to DV in a 6x release first.

On Wed, Oct 26, 2016 at 10:52 AM Adrien Grand <jp...@gmail.com> wrote:

> On Wed, Oct 26, 2016 at 4:23 PM, Yonik Seeley <ys...@gmail.com> wrote:
>
> Docvalues' benefits are the reason we recommend them by default (and
> non-text fields now do have docvalues by default).
> They do have some drawbacks however:
>  - Require reindexing
>
>
> I don't think that one is an issue if the schema examples enable doc
> values by default.
>
>
>  - Take up more index space
>
>
> If doc values are using X GB of disk space, then it means FieldCache would
> use *at least* as much *memory*. It sounds pretty weird to me to not be
> willing to put on disk something that would reside in memory otherwise.
>
>  - Slower than FieldCache
>
>
> It depends what we are talking about. While facets on a static index might
> be slightly faster, FieldCache makes reopens much slower.
>
>
> So although the majority will be better served by docvalues, I don't
> think there should be a rush to remove the option of using the
> FieldCache.
>
>
> Doc values have been out for more than 4 years, I don't think I am rushing
> anything. FieldCache has existed for a very long time, so it does not look
> too terrible, but when you think about it, wouldn't you think it is crazy
> if we decided to build an inverted index in memory from stored fields on
> the first time that a field is searched on?
>
> Finally something that annoys me too is that it makes points harder to
> integrate since it is expected that a field that is indexed with points
> instead of the inverted index should be uninvertable too.
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: Future of FieldCache in Solr

Posted by Adrien Grand <jp...@gmail.com>.
On Thu, Oct 27, 2016 at 8:45 PM, Yonik Seeley <ys...@gmail.com> wrote:

> One might as well complain that it took until 2016 for Lucene to get
> proper numeric index support.
> This is volunteer development, and Tomas has been the only person to
> find time to work on Points support.
>

I agree about the volunteer aspect. I wish this was the reason why points
are not integrated already but to me the problem is also the fact that old
features never go away, which makes new features harder to integrate than
they should.


> We have many users that depend on us,
> and we've already made it hard enough for people to move, and too many
> people are stuck back on v4.x.
>

I agree that there are times when we could have provided a better story
around backward compatibility with little effort. But it is a general issue
that users want innovation and backward compatibility, which are
conflicting requirements. If we decide to never remove the old features
then we will be stuck at some point. I am concerned we are slowly moving
in that direction. We do not need to remove them now, but we should at
least schedule the removal of some of them for 8.0.

Re: Future of FieldCache in Solr

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Oct 27, 2016 at 6:21 PM, Yonik Seeley <ys...@gmail.com> wrote:

> I feel like I helped develop NumericField (although Uwe was the primary
> author)!

Sorry, you are correct: thank you for that!  I had indeed forgotten that
you helped improve on Uwe's numeric fields, originally.

> all I remember is an honest technical opinion about if it should be baked
> into the index format

Yes, that is exactly what I am referring to.  Your comment stated that we
either commit numerics in a buggy state (so users don't get back a
NumericField when they load their document at search time), or we don't even
add a NumericField at all (an even worse API for direct Lucene users).  Both
options made Lucene's numerics harder to use.

So of course we compromised, and Uwe's numeric fields did go into core,
in the buggy state.

Fortunately we finally managed to fix that bug but iirc that took several
years.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Future of FieldCache in Solr

Posted by Michael McCandless <lu...@mikemccandless.com>.
Well said Mark, that is exactly the design of the Apache model, and I
agree in general it's healthy: it means only conservative-ish changes
happen in a project.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Oct 27, 2016 at 10:02 PM, Mark Miller <ma...@gmail.com> wrote:
> Apache is not designed to handle accusations of a history of behavior of
> poor opinions when driving code forward in any meaningful way.
>
> Instead we have technical discussions per issue and the power of the veto.
> The threat that we should just work together rather than attacking one
> another.
>
> Some people may want to plow forward in any given area at any given time.
> And it's great when progress happens. But we have given dozens of people the
> power of veto, and that's pretty much the rules. If it acts as a brake
> sometimes, IMO, that is exactly the design. A lot of people here like to
> think they know what should happen despite opposing views. I think our
> system is designed with the understanding the truth is often in the middle.
>
> Discussion and veto power are not attached to activity either. If someone
> wants to participate on a JIRA issue, they are in the club, regardless of
> how they choose to develop.
>
> It's like a political system. Choose deadlock or consensus, and stop
> worrying about opposing conspiracy theories. True or not means little in how
> things are decided.
>
> I can nitpick on a lot of the choices and motivations of a lot of people
> here. But it would be useless for forward progress (detrimental even) and
> perpetuate what has been a huge culture decline in these projects.
>
> - Mark
>
> On Thu, Oct 27, 2016 at 6:22 PM Yonik Seeley <ys...@gmail.com> wrote:
>>
>> (splitting this off)
>>
>> > Your threat to veto the original addition of Uwe's NumericFields to
>> > Lucene's core stands out in my (long) memory as another.
>>
>> ??? I seriously question that long memory.  Or perhaps just the color
>> of the glasses you're viewing the world through.
>>
>> I feel like I helped develop NumericField (although Uwe was the primary
>> author)! IIRC, I wrote the first draft of the code that enabled
>> variable precision steps.
>>
>>
>> https://issues.apache.org/jira/browse/LUCENE-1470?focusedCommentId=12671495&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12671495
>>
>> http://markmail.org/message/vcwwxwciwf7ztrfg
>>
>> And this is the JIRA issue to actually move it to core... all I
>> remember is an honest technical opinion about if it should be baked
>> into the index format (and certainly no vetoes or even opinions
>> against it being in "core"):
>> https://issues.apache.org/jira/browse/LUCENE-1673
>>
>>
>> Luckily, I'm in good company... I'm not the only person to be accused
>> of nefariously obstructing Lucene and only participating in Lucene
>> issues to slow it down or make it harder to use.
>> If one looks hard enough for something, they will start seeing it.
>>
>> -Yonik
>>
> --
> - Mark
> about.me/markrmiller

Re: Future of FieldCache in Solr

Posted by Mark Miller <ma...@gmail.com>.
Apache is not designed to handle accusations of a history of behavior of
poor opinions when driving code forward in any meaningful way.

Instead we have technical discussions per issue and the power of the veto.
The threat that we should just work together rather than attacking one
another.

Some people may want to plow forward in any given area at any given time.
And it's great when progress happens. But we have given dozens of people
the power of veto, and that's pretty much the rules. If it acts as a brake
sometimes, IMO, that is exactly the design. A lot of people here like to
think they know what should happen despite opposing views. I think our
system is designed with the understanding the truth is often in the middle.

Discussion and veto power are not attached to activity either. If someone
wants to participate on a JIRA issue, they are in the club, regardless of
how they choose to develop.

It's like a political system. Choose deadlock or consensus, and stop
worrying about opposing conspiracy theories. True or not means little in
how things are decided.

I can nitpick on a lot of the choices and motivations of a lot of people
here. But it would be useless for forward progress (detrimental even) and
perpetuate what has been a huge culture decline in these projects.

- Mark

On Thu, Oct 27, 2016 at 6:22 PM Yonik Seeley <ys...@gmail.com> wrote:

> (splitting this off)
>
> > Your threat to veto the original addition of Uwe's NumericFields to
> > Lucene's core stands out in my (long) memory as another.
>
> ??? I seriously question that long memory.  Or perhaps just the color
> of the glasses you're viewing the world through.
>
> I feel like I helped develop NumericField (although Uwe was the primary
> author)! IIRC, I wrote the first draft of the code that enabled
> variable precision steps.
>
>
> https://issues.apache.org/jira/browse/LUCENE-1470?focusedCommentId=12671495&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12671495
>
> http://markmail.org/message/vcwwxwciwf7ztrfg
>
> And this is the JIRA issue to actually move it to core... all I
> remember is an honest technical opinion about if it should be baked
> into the index format (and certainly no vetoes or even opinions
> against it being in "core"):
> https://issues.apache.org/jira/browse/LUCENE-1673
>
>
> Luckily, I'm in good company... I'm not the only person to be accused
> of nefariously obstructing Lucene and only participating in Lucene
> issues to slow it down or make it harder to use.
> If one looks hard enough for something, they will start seeing it.
>
> -Yonik
>
> --
- Mark
about.me/markrmiller

Re: Future of FieldCache in Solr

Posted by Yonik Seeley <ys...@gmail.com>.
(splitting this off)

> Your threat to veto the original addition of Uwe's NumericFields to
> Lucene's core stands out in my (long) memory as another.

??? I seriously question that long memory.  Or perhaps just the color
of the glasses you're viewing the world through.

I feel like I helped develop NumericField (although Uwe was the primary
author)! IIRC, I wrote the first draft of the code that enabled
variable precision steps.

https://issues.apache.org/jira/browse/LUCENE-1470?focusedCommentId=12671495&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12671495

http://markmail.org/message/vcwwxwciwf7ztrfg

And this is the JIRA issue to actually move it to core... all I
remember is an honest technical opinion about if it should be baked
into the index format (and certainly no vetoes or even opinions
against it being in "core"):
https://issues.apache.org/jira/browse/LUCENE-1673


Luckily, I'm in good company... I'm not the only person to be accused
of nefariously obstructing Lucene and only participating in Lucene
issues to slow it down or make it harder to use.
If one looks hard enough for something, they will start seeing it.

-Yonik

Re: Future of FieldCache in Solr

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Oct 27, 2016 at 2:45 PM, Yonik Seeley <ys...@gmail.com> wrote:
> On Thu, Oct 27, 2016 at 5:10 AM, Adrien Grand <jp...@gmail.com> wrote:
>> [...] the discrepancy between the best practices as recommended by Lucene
>
> Different discrepancies have different reasons... one shouldn't
> attempt to paint them all with the same brush.
> - For some issues, it may simply be a lack of volunteers (see "Why
> Solr doesn't use Point fields" below).
> - For some issues, there may not actually be consensus (and I've been
> a Lucene committer before Solr even saw the light of day, so hopefully
> my opinions would be included in the "recommended by Lucene").

Yes, you have near infinite Apache Lucene/Solr merit, Yonik, but for
the past several years your interactions with Lucene have largely been
obstructionist vs. helping to improve things.

Your comments on https://issues.apache.org/jira/browse/LUCENE-7407 are
a fresh example.

Your comments on https://issues.apache.org/jira/browse/LUCENE-7521,
which triggered Adrien sending this email, are another fresh example.

Your threat to veto the original addition of Uwe's NumericFields to
Lucene's core stands out in my (long) memory as another.

Being against volunteers offering to improve Solr's javadocs a few
years ago stands out as yet another.

> - For some issues, there may be consensus to "recommend X", but that
> doesn't imply a consensus to prohibit all alternatives.

Software moves forward and we cannot continue to support poor past
decisions like FieldCache just because a few users may prefer it.
Carrying forward such ancient legacy code forever has a non-trivial
cost to future development.  When would you ever remove something?

> With respect to "using doc values", I think most agree on that default
> recommendation. It doesn't necessarily follow that everyone agrees
> that FieldCache should be prohibited or should go away entirely.

Yonik, if you really feel so strongly that Solr users should still
have a FieldCache option (I strongly disagree) then why don't you step
up?:

Go help Adrien today, on
https://issues.apache.org/jira/browse/LUCENE-7521?  Put a patch up
that moves the over-specialized packedints code, no longer useful to
Lucene but apparently required by Solr's FieldCache, in your opinion,
over to Solr's sources?

Failing that, I think Lucene should just move on: Adrien's patch there
really is a good step forward.

> The fact that it was removed from Lucene so quietly under a JIRA
> originally entitled "Move SlowCompositeReaderWrapper to solr sources"
> (LUCENE-7283) was definitely surprising to many.

There have been many references to FieldCache going away over the
years, on various Jira issues, on the dev list, etc., so for you to
claim it's a "surprise to many" is ridiculous: such "many" must not
really follow Lucene's development and therefore should not feign
surprise.

> Why Solr doesn't use Point fields for numerics (yet):
>
> Some of the same Lucene/Solr committers that worked on Points in
> Lucene also added support in Elasticsearch as well as "Lucene Server"
> ( https://github.com/mikemccand/luceneserver )
> One might as well complain that it took until 2016 for Lucene to get
> proper numeric index support.
> This is volunteer development, and Tomas has been the only person to
> find time to work on Points support.

That's fine: it's an open source project, though I do think Solr's
gradual demise has only been accelerated by you aggressively fighting
for preserving all "old" ways of doing things.  Solr hasn't even fully
adopted near-real-time search yet.

Such an attitude ("nothing can ever be removed because a few users may
want it") only makes it harder for volunteers like Tomas to e.g. add
dimensional points support to Solr.

But again, why not step up and help Tomas out?  Take his latest patch/branch
and iterate some, getting it closer to a committable state?

> Why doesn't Solr use SortedNumericDocValues (yet):
>
> Same as above.  Some Lucene/Solr committers added support to
> Elasticsearch and to luceneserver, but none have gotten around to
> adding support to Solr yet.  I don't think it's anyone's fault.  No
> one has argued against using SortedNumericDocValues (or Points for
> that matter).  But neither has anyone stepped up and contributed the
> work.

Right: it's nobody's fault, it's a volunteer effort.  This is simply how
open-source works.

> Finally, Solr is not just an "example" of how to use Lucene.  While
> Lucene seems to rapidly change APIs every major release (and some back
> incompatible changes are  just because someone likes a name better),
> Solr does not have that luxury.  We have many users that depend on us,
> and we've already made it hard enough for people to move, and too many
> people are stuck back on v4.x.

Sorry, but Solr has become a poor example of how to use Lucene at this
point, in my opinion.  I think Elasticsearch is also a poor example,
in other ways.

These are exactly the reasons why I made the example "Lucene Server",
to be as thin a wrapper around Lucene as possible, a good/simple
example of how one could make a basic yet very functional single-node
server on top of Lucene.

Mike McCandless

http://blog.mikemccandless.com

Re: Future of FieldCache in Solr

Posted by Yonik Seeley <ys...@gmail.com>.
On Thu, Oct 27, 2016 at 5:10 AM, Adrien Grand <jp...@gmail.com> wrote:
> [...] the discrepancy between the best practices as recommended by Lucene

Different discrepancies have different reasons... one shouldn't
attempt to paint them all with the same brush.
- For some issues, it may simply be a lack of volunteers (see "Why
Solr doesn't use Point fields" below).
- For some issues, there may not actually be consensus (and I've been
a Lucene committer before Solr even saw the light of day, so hopefully
my opinions would be included in the "recommended by Lucene").
- For some issues, there may be consensus to "recommend X", but that
doesn't imply a consensus to prohibit all alternatives.

With respect to "using doc values", I think most agree on that default
recommendation. It doesn't necessarily follow that everyone agrees
that FieldCache should be prohibited or should go away entirely.  The
fact that it was removed from Lucene so quietly under a JIRA
originally entitled "Move SlowCompositeReaderWrapper to solr sources"
(LUCENE-7283) was definitely surprising to many.

Why Solr doesn't use Point fields for numerics (yet):

Some of the same Lucene/Solr committers that worked on Points in
Lucene also added support in Elasticsearch as well as "Lucene Server"
( https://github.com/mikemccand/luceneserver )
One might as well complain that it took until 2016 for Lucene to get
proper numeric index support.
This is volunteer development, and Tomas has been the only person to
find time to work on Points support.

Why doesn't Solr use SortedNumericDocValues (yet):

Same as above.  Some Lucene/Solr committers added support to
Elasticsearch and to luceneserver, but none have gotten around to
adding support to Solr yet.  I don't think it's anyone's fault.  No
one has argued against using SortedNumericDocValues (or Points for
that matter).  But neither has anyone stepped up and contributed the
work.

Finally, Solr is not just an "example" of how to use Lucene.  While
Lucene seems to rapidly change APIs every major release (and some back
incompatible changes are  just because someone likes a name better),
Solr does not have that luxury.  We have many users that depend on us,
and we've already made it hard enough for people to move, and too many
people are stuck back on v4.x.

-Yonik



> such as using doc values, consuming readers per segment and using points for
> numerics and the fact that Solr, which should be the greatest example of how
> to use Lucene, is not totally following them. I agree that things are
> changing fast sometimes and changing is not easy, but some of the things we
> are talking about here are more than 4 years old. I am also concerned that
> these things are accumulating, for instance I can already see how the
> integration of points is made harder by the fact that fields are supposed to
> support uninverting. And even if we decide that the new point fields do not
> need to support uninverting, then it will give users the feeling that these
> new point fields are not feature-complete, which would be a pity. On a
> similar note, I hope that we can prevent new indices from using legacy
> numerics as of Solr 7 so that the implementation can be completely removed
> in Solr 8 and points will be the only way to index numerics.
>
> If we decide that Solr should keep supporting uninverting, I will accept it,
> but I genuinely think it would be better for Solr to eventually drop this
> feature and require doc values for sorting, facets and functions.
>
> On Wed, Oct 26, 2016 at 5:46 PM, Yonik Seeley <ys...@gmail.com> wrote:
>>
>> I understand that the existence of the FieldCache meant additional
>> work when changing the DocValues APIs.
>> Those APIs are core enough however, that hopefully they won't change
>> too much in the future!
>> In the event that they do though, I'll try and help out with any
>> required transition.
>>
>> -Yonik
>>
>>
>> On Wed, Oct 26, 2016 at 11:34 AM, Adrien Grand <jp...@gmail.com> wrote:
>> > I hear you that FieldCache has different trade-offs. That said, doc values
>> > look superior to me for a vast majority of users so I wish that we spent
>> > energy on improving doc values, making IndexSearcher aware of index sorting,
>> > or integrating points into Solr, which are more interesting ways to make
>> > Solr faster to me than spending energy keeping FieldCache and uninverting
>> > alive.
>>
>

Re: Future of FieldCache in Solr

Posted by Adrien Grand <jp...@gmail.com>.
Thanks for proposing your help. To me the issue is not that much about the
additional work it required when changing the doc values API, but rather
about the discrepancy between the best practices as recommended by Lucene
such as using doc values, consuming readers per segment and using points
for numerics and the fact that Solr, which should be the greatest example
of how to use Lucene, is not totally following them. I agree that things
are changing fast sometimes and changing is not easy, but some of the
things we are talking about here are more than 4 years old. I am also
concerned that these things are accumulating, for instance I can already
see how the integration of points is made harder by the fact that fields
are supposed to support uninverting. And even if we decide that the new
point fields do not need to support uninverting, then it will give users
the feeling that these new point fields are not feature-complete, which
would be a pity. On a similar note, I hope that we can prevent new indices
from using legacy numerics as of Solr 7 so that the implementation can be
completely removed in Solr 8 and points will be the only way to index
numerics.

If we decide that Solr should keep supporting uninverting, I will accept
it, but I genuinely think it would be better for Solr to eventually drop
this feature and require doc values for sorting, facets and functions.

On Wed, Oct 26, 2016 at 5:46 PM, Yonik Seeley <ys...@gmail.com> wrote:

> I understand that the existence of the FieldCache meant additional
> work when changing the DocValues APIs.
> Those APIs are core enough however, that hopefully they won't change
> too much in the future!
> In the event that they do though, I'll try and help out with any
> required transition.
>
> -Yonik
>
>
> On Wed, Oct 26, 2016 at 11:34 AM, Adrien Grand <jp...@gmail.com> wrote:
> > I hear you that FieldCache has different trade-offs. That said, doc values
> > look superior to me for a vast majority of users so I wish that we spent
> > energy on improving doc values, making IndexSearcher aware of index sorting,
> > or integrating points into Solr, which are more interesting ways to make
> > Solr faster to me than spending energy keeping FieldCache and uninverting
> > alive.
>

Re: Future of FieldCache in Solr

Posted by Yonik Seeley <ys...@gmail.com>.
I understand that the existence of the FieldCache meant additional
work when changing the DocValues APIs.
Those APIs are core enough however, that hopefully they won't change
too much in the future!
In the event that they do though, I'll try and help out with any
required transition.

-Yonik


On Wed, Oct 26, 2016 at 11:34 AM, Adrien Grand <jp...@gmail.com> wrote:
> I hear you that FieldCache has different trade-offs. That said, doc values
> look superior to me for a vast majority of users so I wish that we spent
> energy on improving doc values, making IndexSearcher aware of index sorting,
> or integrating points into Solr, which are more interesting ways to make
> Solr faster to me than spending energy keeping FieldCache and uninverting
> alive.

Re: Future of FieldCache in Solr

Posted by Adrien Grand <jp...@gmail.com>.
I hear you that FieldCache has different trade-offs. That said, doc values
look superior to me for a vast majority of users so I wish that we spent
energy on improving doc values, making IndexSearcher aware of index
sorting, or integrating points into Solr, which are more interesting ways
to make Solr faster to me than spending energy keeping FieldCache and
uninverting alive.

Re: Future of FieldCache in Solr

Posted by Yonik Seeley <ys...@gmail.com>.
On Wed, Oct 26, 2016 at 10:51 AM, Adrien Grand <jp...@gmail.com> wrote:
> On Wed, Oct 26, 2016 at 4:23 PM, Yonik Seeley <ys...@gmail.com> wrote:
>>
>> Docvalues' benefits are the reason we recommend them by default (and
>> non-text fields now do have docvalues by default).
>> They do have some drawbacks however:
>>  - Require reindexing
>
>
> I don't think that one is an issue if the schema examples enable doc values
> by default.

It is an issue for people upgrading and using their old schema, for example.

>>  - Take up more index space
>
>
> If doc values are using X GB of disk space, then it means FieldCache would
> use *at least* as much *memory*. It sounds pretty weird to me to not be
> willing to put on disk something that would reside in memory otherwise.

Sure, and things like that are why docvalues are default.
But it doesn't follow that it won't be an issue for anyone though.
And if one has a workload where writes are much more important than
reads (again a minority), and one is IO bound indexing, that larger
index space is going to directly translate to worse indexing
throughput.

>>  - Slower than FieldCache
>
>
> It depends what we are talking about. While facets on a static index might
> be slightly faster, FieldCache makes reopens much slower.

Right.  Again, the FieldCache advantages aren't across the board
(which is why it's not default).
For some though, reopens aren't an issue, and maximum throughput on
static indexes is (and more than slightly faster I think, but I'd have
to benchmark again...).

Just like UnInvertedField (which is arguably part of "uninverted"...),
it's *very* expensive to reopen, but some people have decided it's
still worth the cost.

Additionally, some of the weaknesses in the FieldCache can always be
improved in the future.  Limitations aren't permanent.

-Yonik

>> So although the majority will be better served by docvalues, I don't
>> think there should be a rush to remove the option of using the
>> FieldCache.
>
>
> Doc values have been out for more than 4 years, I don't think I am rushing
> anything. FieldCache has existed for a very long time, so it does not look
> too terrible, but when you think about it, wouldn't you think it is crazy if
> we decided to build an inverted index in memory from stored fields on the
> first time that a field is searched on?
>
> Finally something that annoys me too is that it makes points harder to
> integrate since it is expected that a field that is indexed with points
> instead of the inverted index should be uninvertable too.

Re: Future of FieldCache in Solr

Posted by Adrien Grand <jp...@gmail.com>.
On Wed, Oct 26, 2016 at 4:23 PM, Yonik Seeley <ys...@gmail.com> wrote:

> Docvalues' benefits are the reason we recommend them by default (and
> non-text fields now do have docvalues by default).
> They do have some drawbacks however:
>  - Require reindexing
>

I don't think that one is an issue if the schema examples enable doc values
by default.


>  - Take up more index space
>

If doc values are using X GB of disk space, then it means FieldCache would
use *at least* as much *memory*. It sounds pretty weird to me to not be
willing to put on disk something that would reside in memory otherwise.

>  - Slower than FieldCache
>

It depends what we are talking about. While facets on a static index might
be slightly faster, FieldCache makes reopens much slower.


> So although the majority will be better served by docvalues, I don't
> think there should be a rush to remove the option of using the
> FieldCache.
>

Doc values have been out for more than 4 years, so I don't think I am
rushing anything. FieldCache has existed for a very long time, so it does
not look too terrible, but when you think about it, wouldn't it seem crazy
if we decided to build an inverted index in memory from stored fields the
first time that a field is searched on?

Finally, something else that annoys me is that it makes points harder to
integrate, since a field that is indexed with points instead of the
inverted index is expected to be uninvertable too.
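
For readers less familiar with the term, uninverting can be pictured
roughly as follows. This is a toy model in plain Java, not Lucene's
FieldCache code: the class name UninvertSketch, the method uninvert and the
use of a TreeMap as the "inverted index" are all assumptions made for
illustration. It walks the terms in sorted order and fills a per-document
array of term ordinals on the heap, which is why the column gets written in
effectively random doc ID order.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these names are Lucene or Solr APIs.
final class UninvertSketch {

    /**
     * Builds a per-document column of term ordinals from an inverted index.
     * postings maps each term to the IDs of the documents that contain it;
     * the TreeMap keeps terms sorted, so ordinals follow term order.
     * Documents without a value keep the ordinal -1.
     */
    static int[] uninvert(TreeMap<String, int[]> postings, int maxDoc) {
        int[] ordByDoc = new int[maxDoc]; // on-heap, one slot per document
        Arrays.fill(ordByDoc, -1);
        int ord = 0;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                // Iteration is term by term, so doc IDs arrive out of order.
                ordByDoc[docId] = ord;
            }
            ord++;
        }
        return ordByDoc;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("denmark", new int[] {1, 3});
        postings.put("france", new int[] {0});
        // "denmark" gets ordinal 0, "france" gets ordinal 1.
        System.out.println(Arrays.toString(uninvert(postings, 5)));
        // prints [1, 0, -1, 0, -1]
    }
}
```

In this picture, doc values amount to writing that column at index time
instead of rebuilding it in memory on first use.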

Re: Future of FieldCache in Solr

Posted by Yonik Seeley <ys...@gmail.com>.
The benefits of docvalues are the reason we recommend them by default (and
non-text fields now do have docvalues by default).
They do have some drawbacks however:
 - Require reindexing
 - Take up more index space
 - Slower than FieldCache

So although the majority will be better served by docvalues, I don't
think there should be a rush to remove the option of using the
FieldCache.

-Yonik

Re: Future of FieldCache in Solr

Posted by Ryan Josal <rj...@gmail.com>.
We use Toke's scenario in a couple of places too. We are capable of writing a
URP that does it, but it feels dirty, and replacing config with code seems
like something it makes sense to avoid.

Having top-level support for some kind of analysis in a URP or somewhere else
can also help in other situations where you want some analysis before a
value goes into a non-TextField.

Ryan

On Thu, Nov 24, 2016 at 00:53 Toke Eskildsen <te...@statsbiblioteket.dk> wrote:

> On Wed, 2016-11-23 at 13:23 +0000, David Smiley wrote:
> > This is supported at the Lucene level via SortedSetDocValues.  Solr
> > doesn't yet support this for its TextField
> > -- https://issues.apache.org/jira/browse/SOLR-8362
> >  however you could work around this with an URP or copyField
>
> copyField does not help here, as that copies the raw values. We need the
> normalised values for display when we do faceting.
>
> >  or perhaps subclassing TextField so that you can tokenize the text a
> > second time to generate a list of SortedSetDocValuesField.  Probably
> > least painful is to use another field.
>
> So to facet on the normalised (analyzed really) values on a Text field
> in a post-FieldCache Solr, I would need to write an URP or some other
> custom code. I can manage that or just do the normalisation as part of
> the pre-processing.
>
> The question is whether my scenario (using analyzers for facet terms) is
> widespread. If so, I find this increase in implementation requirements
> problematic.
>
>
> I don't care for FieldCache as such - SOLR-8362 would be a better
> solution for the scenario I describe. Or maybe an URP that makes it
> easy to provide a list of analyzers? I am simply looking for a way
> that a random end-user can easily do faceting on analyzed terms,
> leveraging all the nice built-in filters in Solr.
>
> - Toke Eskildsen, State and University Library, Denmark
>

Re: Future of FieldCache in Solr

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2016-11-23 at 13:23 +0000, David Smiley wrote:
> This is supported at the Lucene level via SortedSetDocValues.  Solr
> doesn't yet support this for its TextField
> -- https://issues.apache.org/jira/browse/SOLR-8362
>  however you could work around this with an URP or copyField

copyField does not help here, as that copies the raw values. We need the
normalised values for display when we do faceting.

>  or perhaps subclassing TextField so that you can tokenize the text a
> second time to generate a list of SortedSetDocValuesField.  Probably
> least painful is to use another field.

So to facet on the normalised (analyzed really) values on a Text field
in a post-FieldCache Solr, I would need to write an URP or some other
custom code. I can manage that or just do the normalisation as part of
the pre-processing.

The question is whether my scenario (using analyzers for facet terms) is
widespread. If so, I find this increase in implementation requirements
problematic.


I don't care for FieldCache as such - SOLR-8362 would be a better
solution for the scenario I describe. Or maybe an URP that makes it
easy to provide a list of analyzers? I am simply looking for a way
that a random end-user can easily do faceting on analyzed terms,
leveraging all the nice built-in filters in Solr.

- Toke Eskildsen, State and University Library, Denmark
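
The "normalisation as part of the pre-processing" option mentioned above
could look something like the sketch below. It is only an illustration
under stated assumptions: FacetValueNormalizer and normalize are
hypothetical names, not Solr classes, and the chosen steps (strip accents,
collapse whitespace, lowercase) merely stand in for whatever filters a real
KeywordTokenizer-based chain applies. The result would be sent to a
separate string field with doc values, next to the original text field.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Solr: treats the whole field value as a
// single token and normalises it before it is sent to a docValues field.
final class FacetValueNormalizer {

    static String normalize(String raw) {
        // Decompose accented characters, then drop the combining marks.
        String decomposed = Normalizer.normalize(raw, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Trim, collapse runs of whitespace, and lowercase.
        return withoutMarks.trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // \u00C5 is a capital A with a ring above.
        System.out.println(normalize("  \u00C5rhus   Universitet "));
        // prints arhus universitet
    }
}
```

The drawback raised in this thread still applies: the normalisation rules
then live in client code rather than in the schema.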

Re: Future of FieldCache in Solr

Posted by David Smiley <da...@gmail.com>.
This is supported at the Lucene level via SortedSetDocValues.  Solr doesn't
yet support this for its TextField; see
https://issues.apache.org/jira/browse/SOLR-8362.  However, you could work
around this with a URP or copyField, or perhaps by subclassing TextField so
that you can tokenize the text a second time to generate a list
of SortedSetDocValuesField.  Probably the least painful option is to use
another field.
~ David

On Wed, Nov 23, 2016 at 4:14 AM Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> On Wed, 2016-10-26 at 13:05 +0000, Adrien Grand wrote:
> > But we seem to still care a lot about uninverting, which does
> > not make sense to me since everybody should have moved to doc values
> > already?
>
> I might have missed something here, but doesn't such a switch mean that
> it will no longer be possible to facet on Text fields?
>
> We facet on 3 Text fields in our core index: Title, Author and
> Location. All of these fields use a KeywordTokenizer and multiple steps
> of normalising the input. It seems like quite an obvious setup, so I
> would guess that it is not uncommon.
>
> How would this scenario be supported if uninversion is removed?
>
>
> - Toke Eskildsen, State and University Library, Denmark
>
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: Future of FieldCache in Solr

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2016-10-26 at 13:05 +0000, Adrien Grand wrote:
> But we seem to still care a lot about uninverting, which does
> not make sense to me since everybody should have moved to doc values
> already?

I might have missed something here, but doesn't such a switch mean that
it will no longer be possible to facet on Text fields?

We facet on 3 Text fields in our core index: Title, Author and
Location. All of these fields use a KeywordTokenizer and multiple steps
of normalising the input. It seems like quite an obvious setup, so I
would guess that it is not uncommon.

How would this scenario be supported if uninversion is removed?


- Toke Eskildsen, State and University Library, Denmark
