You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Da Huang <dh...@gmail.com> on 2014/03/08 03:01:46 UTC

[GoSC] I'm interested in LUCENE-3333

Hello, everyone,

My name is Da Huang. I'm studying for my master degree of Computer Science
in Peking University. I have been using lucene for about half a year. It's
so elegent that I hope to have a chance to contribute some code for it.

Therefore, I have been scaned the jira GoSC 2014 Ideas page about lucene
for several days. I find "LUCENE-3333: Specialize DisjunctionScorer if all
clauses are TermQueries" more suitable for me to do. I have spent some time
to scan the revelant code, and the Issue "LUCENE-3328" which spinoff
"LUCENE-3333". I find the following questions confusing me.

1) I have checkout the code from "
http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk", but I
couldn't find the relevant code of the fixed Issue "LUCENE-3328". It seems
that the patch attached on the page is not on the trunk. Why?

2)  My intuitive idea of solving this issue is to make a class
"DisjunctionTermScorer" to do the all TermQueries clauses; then, judging
whether to use DisjunctionTermScorer in the method 'scorer' in class
BooleanQuery. Is this idea right?

Above are my questions about "LUCENE-3333". Besides, I would like to
propose the following issue which is about the QueryParser.

When we use QueryParser to parse a querystring like "science AND
(engineering AND technology)". The generated query would be
"+science (+engineering +technology)". I think it would be more efficient
for searching if the final query is "+science +engineering +technology". My
idea is to make the cascaded AND and cascaded OR flat. Do you agree? I hope
I have made my idea clear.

Thanks,
Da Huang




-- 
黄达（Da Huang）
Team of Search Engine & Web Mining
School of Electronic Engineering & Computer Science
Peking University, Beijing, 100871, P.R.China

Re: [GoSC] I'm interested in LUCENE-3333

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Sun, Mar 9, 2014 at 10:46 AM, Da Huang <dh...@gmail.com> wrote:
> Thanks a lot. That's very helpful.
>
> I think you get exactly what I mean about the LUCENE-4396.
> By grouping up the MUST clauses, the conjunctive query can be
> done specifiedly with easy way. Then, the original query would have
> no more than 1 MUST clause. I think in this situation, it's much more
> easier to judge whether to use BooleanScorer or BooleanScorer2. :)

Well, because we now have the DISI.cost() method, we can use this to
find the least-cost MUST clause (e.g. the one matching the fewest
documents) and then make a call up front on whether BS or BS2 is
appropriate.

But these would all be fun ideas to explore under a GSoC project, if
we can scope it appropriately.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [GoSC] I'm interested in LUCENE-3333

Posted by Da Huang <dh...@gmail.com>.

Thanks a lot. That's very helpful.

I think you get exactly what I mean about the LUCENE-4396.
By grouping up the MUST clauses, the conjunctive query can be
done specifiedly with easy way. Then, the original query would have
no more than 1 MUST clause. I think in this situation, it's much more
easier to judge whether to use BooleanScorer or BooleanScorer2. :)


Thanks,
Da Huang


-- 
黄达（Da Huang）
Team of Search Engine & Web Mining
School of Electronic Engineering & Computer Science
Peking University, Beijing, 100871, P.R.China

Re: [GoSC] I'm interested in LUCENE-3333

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Sun, Mar 9, 2014 at 9:55 AM, Da Huang <dh...@gmail.com> wrote:
> Hi, Mike.
>
> You're right. After having a look at the comments on LUCENE-1518, I find
> that my idea about that has many bugs. Sorry for that.

It's fine, it's a VERY hard fix :)  This is why it hasn't been done yet!

> Thus, I have checked some other suggestions you gave me to see whether
> relevant comments can be found in jira.
>
> I think I have some idea on "LUCENE-4396: BooleanScorer should sometimes be
> used for MUST clauses".
> Can we adjust the query to make the problem easier? For the query "+a b c +e
> +f" as an example, maybe we can
> turn it into "(+a +e +f) b c" which has only one MUST clause. Then, it would
> be easier to judge which scorer to use?

You mean create nesting when there wasn't before, by grouping all MUST
clauses together?  We could explore that ...

Or we could pass all the clauses (still flat) to BooleanScorer.  I
think this would only be faster when the MUST clauses are high cost
relative to all other clauses.  E.g. a super-rare MUST'd clause would
probably be faster with BooleanScorer2.

I think this could make a good GSoC project.

> Besides, I seems that the suggestion "we should pass a needsScorers boolean
> up-front to Weight.scorer"
> is not on jira. But it sounds that it can be done by adjusting some class
> methods' arguments and return value
> to pass the "needsScorers"? not sure.

I think it's this Jira: https://issues.apache.org/jira/browse/LUCENE-3331

(I just searched for "needs scores" on
http://jirasearch.mikemccandless.com and it was one of the
suggestions).

All that should be needed here is to add a "boolean needsScores" (or
something) to the Weight.scorer method, and fix the numerous places
where this method is invoked to pass the right value.  E.g.
ConstantScoreQuery would pass false, and this would mean e.g. if it
wraps a TermQuery, we could avoid decoding freq blocks from the
postings.

> At last, recently I find something strange in the code about heap. I find
> heap has been implemented duplicately
> for many times in the trunk, and a PriorityQueue is also implemented in the
> package org.apache.lucene.util.
> I remember java has already implemented the PriorityQueue. Why not use that?

Good question!  There is a fair amount of duplicated code, and we
should fix that over time.  Lucene has had its own PQ class forever,
and we do strange things like pre-filling the queue with a sentinel
value to avoid "if (queueIsNotFullYet)" checks in collect(int doc),
and we can replace the top value and re-heap ... but maybe these do
not in fact matter in practice and if so we should stop duplicating
code :)

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [GoSC] I'm interested in LUCENE-3333

Posted by Da Huang <dh...@gmail.com>.

Hi, Mike.

You're right. After having a look at the comments on LUCENE-1518, I find
that my idea about that has many bugs. Sorry for that.

Thus, I have checked some other suggestions you gave me to see whether
relevant comments can be found in jira.

I think I have some idea on "LUCENE-4396: BooleanScorer should sometimes be
used for MUST clauses".
Can we adjust the query to make the problem easier? For the query "+a b c
+e +f" as an example, maybe we can
turn it into "(+a +e +f) b c" which has only one MUST clause. Then, it
would be easier to judge which scorer to use?

Besides, I seems that the suggestion "we should pass a needsScorers boolean
up-front to Weight.scorer"
is not on jira. But it sounds that it can be done by adjusting some class
methods' arguments and return value
to pass the "needsScorers"? not sure.

At last, recently I find something strange in the code about heap. I find
heap has been implemented duplicately
for many times in the trunk, and a PriorityQueue is also implemented in the
package org.apache.lucene.util.
I remember java has already implemented the PriorityQueue. Why not use that?


Thanks,
Da Huang


-- 
黄达（Da Huang）
Team of Search Engine & Web Mining
School of Electronic Engineering & Computer Science
Peking University, Beijing, 100871, P.R.China

Re: [GoSC] I'm interested in LUCENE-3333

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hi, Da,

On Sun, Mar 9, 2014 at 1:30 AM, Da Huang <dh...@gmail.com> wrote:

> I have spent some time considering your suggestions in last mail. I find
> that I'm interested in the suggestion " Filter and Query should be more
> 'combined' ".

OK, cool, and ambitious; it might be safer to choose a less
ambitious/controversial change for a GSoC project.

Maybe, have a look at LUCENE-1518?  There was lots of discussion there.

> In my opinion, to implement this suggestion, a new class "FilterQuery",
> which is a subclass of "Query",  should be created. If "FilterQuery" is
> implemented, then it can be the query element of "BooleanClause", and the
> "BooleanQuery" can naturally add a "Filter" as a "BooleanClause". I think
> one of the most important things is to deal with the scores, as Filter does
> not contribute anything to score.

I feel like it should be the opposite?  Like, a Filter has less
functionality that a Query, because it does only matching?  So I would
think a Quey would subclass Filter and then add scoring onto it?  But
there was lots of discussion on the above issue that I don't
remember...

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [GoSC] I'm interested in LUCENE-3333

Posted by Da Huang <dh...@gmail.com>.

Hello, Mike.

I have spent some time considering your suggestions in last mail. I find
that I'm interested in the suggestion " Filter and Query should be more
'combined' ".

In my opinion, to implement this suggestion, a new class "FilterQuery",
which is a subclass of "Query",  should be created. If "FilterQuery" is
implemented, then it can be the query element of "BooleanClause", and the
"BooleanQuery" can naturally add a "Filter" as a "BooleanClause". I think
one of the most important things is to deal with the scores, as Filter does
not contribute anything to score.

Above is my intuitive idea about this suggestion. Do you think it makes
sense? I hope I have made my idea clear.


Thanks,
Da Huang


-- 
黄达（Da Huang）
Team of Search Engine & Web Mining
School of Electronic Engineering & Computer Science
Peking University, Beijing, 100871, P.R.China

Re: [GoSC] I'm interested in LUCENE-3333

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Fri, Mar 7, 2014 at 9:01 PM, Da Huang <dh...@gmail.com> wrote:
> Hello, everyone,
>
> My name is Da Huang. I'm studying for my master degree of Computer Science
> in Peking University. I have been using lucene for about half a year. It's
> so elegent that I hope to have a chance to contribute some code for it.

Welcome!

> Therefore, I have been scaned the jira GoSC 2014 Ideas page about lucene for
> several days. I find "LUCENE-3333: Specialize DisjunctionScorer if all
> clauses are TermQueries" more suitable for me to do. I have spent some time
> to scan the revelant code, and the Issue "LUCENE-3328" which spinoff
> "LUCENE-3333". I find the following questions confusing me.
>
> 1) I have checkout the code from
> "http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk", but I
> couldn't find the relevant code of the fixed Issue "LUCENE-3328". It seems
> that the patch attached on the page is not on the trunk. Why?

Well, some time after LUCENE-3328, we made further changes and
discovered that this code specialized scorer was not in fact [that
much?] faster.  I forget which issue removed it... but you could
probably find it with some svn archaeology.

Net/net the trend in Lucene has been against adding source code
specialization, since this is really code duplication to try to make
hotspot's life easier.  Unfortunately, it does sometimes work; e.g see
http://blog.mikemccandless.com/2013/06/screaming-fast-lucene-searches-using-c.html
though that's not a fair comparison since it was also a different
programming language!  So, while it's a nice tradeoff for performance,
it's a poor tradeoff for ongoing code management.  See all the
specialized collectors we have in TopFieldCollector!

So I'm not sure at this point if we should even pursue LUCENE-3333.
There are however tons of other things to fix on the search side;
maybe we could craft a good GSoC project from something else; e.g.:

  - we should pass a needsScorers boolean up-front to Weight.scorer
  - disjunctions now score during matching

  - BooleanScorer should sometimes be used for MUST clauses

  - We sort of duplicate code across BooleanQuery, FilteredQuery,
BooleanFilter, TermsFilter

  - Somehow, Filter and Query should be more "combined"; e.g. you
should be able to add a Filter as a clause onto a BooleanQuery

  - "Post filtering" is too hard to use today

  - ...

> 2)  My intuitive idea of solving this issue is to make a class
> "DisjunctionTermScorer" to do the all TermQueries clauses; then, judging
> whether to use DisjunctionTermScorer in the method 'scorer' in class
> BooleanQuery. Is this idea right?

Yes this would be the right idea.

> Above are my questions about "LUCENE-3333". Besides, I would like to propose
> the following issue which is about the QueryParser.
>
> When we use QueryParser to parse a querystring like "science AND
> (engineering AND technology)". The generated query would be "+science
> (+engineering +technology)". I think it would be more efficient for
> searching if the final query is "+science +engineering +technology". My idea
> is to make the cascaded AND and cascaded OR flat. Do you agree? I hope I
> have made my idea clear.

I think this would make tons of sense; the only "challenge" is that
this will change how scores are computed, when coord is enabled.  I'm
not sure how much that'd matter in practice; if it is important to
preserve that, then maybe we could still make a single 3-clause
BooleanQuery, but somehow remember the original structure for the sake
of coord scoring ... not sure.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org