Posted to dev@lucene.apache.org by Li Li <fa...@gmail.com> on 2012/06/01 09:36:36 UTC

looking for a BooleanMatcher instead of BooleanScorer

Hi all,
    I am looking for a 'BooleanMatcher' in Lucene. In many
applications we don't need to order matched documents by relevance
score; we just want the boolean query semantics. But
BooleanScorer/BooleanScorer2 are a little heavy when relevance
scoring is not needed.
    One use case: we have some fields with a very small number of
tokens (usually only one word), such as id, tag, or something similar.
    We need queries like this: id in (1,3,5.....). If we use a
BooleanQuery (id:1 id:3 id:5 ...), BooleanScorer can only be applied
to 31 terms, and BooleanScorer2 uses a priority queue to know how many
terms are matched (coord).
    Filters may help, but the query can be very complicated (or else
the filter itself still uses BooleanQuery, so the problem is
recursive).

    We could split the current BooleanScorer into a BooleanMatcher and
a Ranker. If we need to score the hit docs, we ask the BooleanScorer
not only for the hit ids but also for tf/idf, coord, or anything else
we need for ranking. But sometimes we only need docIds; then the
BooleanMatcher can optimize its implementation. For the case of many
disjunction terms, we can do it like a Filter or BooleanScorer instead
of BooleanScorer2.

    Is it possible?
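
To make the idea concrete, here is a minimal sketch of a match-only
disjunction, written in Python rather than against the Lucene API. The
POSTINGS dictionary and the match_any function are hypothetical names
invented for illustration; a Python int stands in for a bitset. It only
shows that "id in (...)" can be answered by OR-ing postings into a
bitset, with no scores and no per-document clause counting.

```python
from typing import Dict, Iterable, List

# Hypothetical in-memory postings: term -> doc ids containing that term.
POSTINGS: Dict[str, List[int]] = {
    "id:1": [0],
    "id:3": [2],
    "id:5": [4, 7],
}


def match_any(terms: Iterable[str], postings: Dict[str, List[int]]) -> List[int]:
    """Return sorted doc ids matching at least one term, without scoring."""
    bits = 0  # a Python int used as an arbitrary-length bitset
    for term in terms:
        # Unknown terms simply contribute no documents.
        for doc in postings.get(term, ()):
            bits |= 1 << doc
    # Read the set bits back out in increasing doc id order.
    return [doc for doc in range(bits.bit_length()) if (bits >> doc) & 1]
```

The cost here grows with the total length of the postings lists, and
nothing tracks how many terms matched each document, which is exactly
the information a ranker would need and a matcher would not.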

    Following are some user requests I found on the mailing lists. The
first one is my own requirement.

    1. https://github.com/neo4j/community/issues/494

    2. mail to lucene

qibaoyuan@126.com via lucene.apache.org, May 6, to lucene
Hi,
      I have a problem searching for many keywords in about
5,000,000 documents. For example, the query may be like "(a1 or a2 or a3
....a200) and (b1 or b2 or b3 or b4 ..... b400)". I found it takes a
very long time (40 seconds) to get the answer in only one field (the
title field), and the JVM throws an out-of-memory error with more
fields (title field plus content field). Any suggestions or good ideas
to solve this problem? Thanks in advance.


    3. mail to lucene
Chris Book chrisbook@gmail.com via lucene.apache.org, Apr 11, to solr-user
Hello, I have a solr index running that is working very well as a search.
 But I want to add the ability (if possible) to use it to do matching.  The
problem is that by default it is only looking for all the input terms to be
present, and it doesn't give me any indication as to how many terms in the
target field were not specified by the input.

For example, if I'm trying to match to the song title "dust in the wind",
I'm correctly getting a match if the input query is "dust in wind".  But I
don't want to get a match if the input is just "dust".  Although as a
search "dust" should return this result, I'm looking for some way to filter
this out based on some indication that the input isn't close enough to the
output.  Perhaps if I could get information that the number of input
terms is much less than the number of terms in the field.  Or something
else along those lines?

I realize that this isn't the typical use case for a search, but I'm just
looking for some suggestions as to how I could improve the above example a
bit.

Thanks,
Chris
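
For the second request above, "(a1 or ... a200) and (b1 or ... b400)",
a match-only evaluation could look roughly like the sketch below. It is
plain Python, not Lucene code; POSTINGS, union_bits and
match_all_groups are hypothetical names, and a Python int is used as a
bitset. Each OR group is collapsed into one bitset, and the groups are
then AND-ed together.

```python
from typing import Dict, Iterable, List, Optional, Sequence

# Hypothetical postings: term -> doc ids containing that term.
POSTINGS: Dict[str, List[int]] = {
    "a1": [1, 2],
    "a2": [3],
    "b1": [2, 3, 5],
    "b2": [8],
}


def union_bits(terms: Iterable[str], postings: Dict[str, List[int]]) -> int:
    """OR the postings of all terms in one group into a single bitset."""
    bits = 0
    for term in terms:
        for doc in postings.get(term, ()):
            bits |= 1 << doc
    return bits


def match_all_groups(groups: Sequence[Iterable[str]],
                     postings: Dict[str, List[int]]) -> List[int]:
    """Return doc ids that match at least one term in every group."""
    result: Optional[int] = None
    for group in groups:
        bits = union_bits(group, postings)
        result = bits if result is None else result & bits
        if result == 0:
            break  # no document can survive further intersections
    if not result:
        return []  # no groups given, or an empty intersection
    return [doc for doc in range(result.bit_length()) if (result >> doc) & 1]
```

Memory use is one bit per document per live bitset, independent of how
many terms are in each group.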

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: looking for a BooleanMatcher instead of BooleanScorer

Posted by Li Li <fa...@gmail.com>.
sorry, the first problem is not mine.

On Fri, Jun 1, 2012 at 4:58 PM, Tanguy Moal <ta...@gmail.com> wrote:
> Hello,
>
> I'm just sharing my thoughts; they might be off-topic...
>
> Take the first example quoted from GitHub: the user wants to find all
> nodes having their facebookId in a given, quite long list (a friends
> list; be aware that some Facebook users have 1500+ friends!).
>
> The application first had the facebookId for a user (say id=someId),
> requested the Facebook graph with that id, and got a quite long list of
> facebookIds back, right?
> At that point, I think the application should not try to enumerate its
> neo4j graph using an OR-ed facebookIds list.
> It should make sure that each neo4j node in the set of the friends list
> has a "friendOf" attribute, and ensure that this multivalued attribute
> contains the facebookId (someId) for each involved node. Trigger an
> update request for those updated nodes.
> You could make your application wait for that update to complete if it
> really needs to be synchronous with Facebook.
> That moves the problem to handling update requests smartly, which might
> be easier sometimes.
> Here you will eventually want to store a hash of the user's friends list
> somewhere in the user's node, so you know in advance whether that user's
> friends list has changed and whether you need to trigger the update
> process again (just thinking).
> When your user uses the application for the first time, or every time
> after she updates her friends list, an update job will be fired for that
> user. You may want to wait for the update request to complete only the
> first time (if you don't need your app to be 100% synchronized with
> Facebook), and have the subsequent jobs queued to something that handles
> these updates efficiently. That could stress the storage system with
> intensive writes from time to time, especially at the beginning, but it
> will converge to a mainly read-based application once most active users
> have used the application once. New friendships aren't that frequent
> (IMHO).
> Maybe NRT developments could be used in this scenario... I don't know
> much more. I don't know anything about how Neo4J works; I used it once,
> that's all.
> Anyway, if you hit write issues, congratulations, your application is
> being used widely, go buy SSD disks :)
>
> Finally, you will then enumerate your nodes with a very quick and
> efficient query friendOf:"someId" .
>
>
> What I mean is that if your application really needs to perform queries
> made of many, many, many, ... really many terms that are OR-ed, then
> there might exist (but it's not always true) a different design of your
> data model that could allow you to still fit the use case of a search
> engine.

I agree. Lucene/Solr may need to support many other types of queries
used in traditional databases.
For now, we usually store structured data in an RDBMS and full text in
Lucene/Solr, but synchronizing the data is a nightmare. We would like
to use one full-featured solution instead of integrating many
solutions.


>
> This applies to 1 and maybe to 2 too. ( :p 2-2-2 -- never mind )
>
> I don't really understand 3, which seems to be a MinShouldMatch issue.
>
> As I said in the beginning, I'm simply sharing my thoughts! I hope this
> helps...
>
> --
> Tanguy
>



Re: looking for a BooleanMatcher instead of BooleanScorer

Posted by Tanguy Moal <ta...@gmail.com>.
Hello,

I'm just sharing my thoughts; they might be off-topic...

Take the first example quoted from GitHub: the user wants to find all
nodes having their facebookId in a given, quite long list (a friends
list; be aware that some Facebook users have 1500+ friends!).

The application first had the facebookId for a user (say id=someId),
requested the Facebook graph with that id, and got a quite long list of
facebookIds back, right?
At that point, I think the application should not try to enumerate its
neo4j graph using an OR-ed facebookIds list.
It should make sure that each neo4j node in the set of the friends list
has a "friendOf" attribute, and ensure that this multivalued attribute
contains the facebookId (someId) for each involved node. Trigger an
update request for those updated nodes.
You could make your application wait for that update to complete if it
really needs to be synchronous with Facebook.
That moves the problem to handling update requests smartly, which might
be easier sometimes.
Here you will eventually want to store a hash of the user's friends list
somewhere in the user's node, so you know in advance whether that user's
friends list has changed and whether you need to trigger the update
process again (just thinking).
When your user uses the application for the first time, or every time
after she updates her friends list, an update job will be fired for that
user. You may want to wait for the update request to complete only the
first time (if you don't need your app to be 100% synchronized with
Facebook), and have the subsequent jobs queued to something that handles
these updates efficiently. That could stress the storage system with
intensive writes from time to time, especially at the beginning, but it
will converge to a mainly read-based application once most active users
have used the application once. New friendships aren't that frequent
(IMHO).
Maybe NRT developments could be used in this scenario... I don't know
much more. I don't know anything about how Neo4J works; I used it once,
that's all.
Anyway, if you hit write issues, congratulations, your application is
being used widely, go buy SSD disks :)

Finally, you will then enumerate your nodes with a very quick and efficient
query friendOf:"someId" .
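
Here is a rough sketch of that update-then-lookup flow, in plain Python
with an in-memory dictionary standing in for the index. Nothing in it is
Neo4J or Lucene API: FriendOfIndex, sync and query_friend_of are
hypothetical names, and SHA-1 is just one possible choice of hash.

```python
import hashlib
from typing import Dict, Iterable, List, Set


class FriendOfIndex:
    """Toy model of a multivalued "friendOf" attribute plus a list hash."""

    def __init__(self) -> None:
        # friendOf value (a user id) -> ids of the nodes carrying that value
        self._nodes_by_owner: Dict[str, Set[str]] = {}
        # user id -> hash of the friends list seen at the last sync
        self._list_hash: Dict[str, str] = {}

    def sync(self, user_id: str, friend_ids: Iterable[str]) -> bool:
        """Apply a friends list; return True only if an update was needed."""
        ids = sorted(set(friend_ids))
        digest = hashlib.sha1(",".join(ids).encode("utf-8")).hexdigest()
        if self._list_hash.get(user_id) == digest:
            return False  # list unchanged, skip the write-heavy update
        self._nodes_by_owner[user_id] = set(ids)
        self._list_hash[user_id] = digest
        return True

    def query_friend_of(self, user_id: str) -> List[str]:
        """Stand-in for the single-term query friendOf:"someId"."""
        return sorted(self._nodes_by_owner.get(user_id, ()))
```

The read side is a single lookup however long the friends list is; the
cost has moved to sync, which only does work when the hash changes.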


What I mean is that if your application really needs to perform queries
made of many, many, many, ... really many terms that are OR-ed, then
there might exist (but it's not always true) a different design of your
data model that could allow you to still fit the use case of a search
engine.

This applies to 1 and maybe to 2 too. ( :p 2-2-2 -- never mind )

I don't really understand 3, which seems to be a MinShouldMatch issue.
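
For what it's worth, the check asked for in 3 can be sketched outside
the search engine as a coverage ratio: the fraction of the field's
terms that also appear in the input. This is plain Python with
hypothetical names (field_coverage, is_close_match) and an arbitrary
0.75 threshold; it is not the same thing as a minimum-should-match
setting, which counts matched query clauses rather than covered field
terms.

```python
def field_coverage(query: str, field_value: str) -> float:
    """Fraction of distinct field terms that also occur in the query."""
    field_terms = set(field_value.lower().split())
    if not field_terms:
        return 0.0
    query_terms = set(query.lower().split())
    return len(field_terms & query_terms) / len(field_terms)


def is_close_match(query: str, field_value: str,
                   min_coverage: float = 0.75) -> bool:
    """Accept a hit only if the query covers enough of the field."""
    return field_coverage(query, field_value) >= min_coverage
```

With the song title from the example, "dust in wind" covers 3 of the 4
terms in "dust in the wind" and is accepted, while "dust" covers 1 of 4
and is rejected.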

As I said in the beginning, I'm simply sharing my thoughts! I hope this
helps...

--
Tanguy
