Posted to user@ignite.apache.org by Courtney Robinson <co...@hypi.io> on 2018/09/03 19:34:31 UTC

Fulltext matching

Hi,

We've got Ignite in production and decided to start using some fulltext
matching as well.
I've investigated and can't figure out why my queries are not matching.

I construct a query entity, e.g. new QueryEntity(keyClass, valueClass), and
in debug I can see it generates a list of fields, e.g. a, b, c.a, c.b.
I then expected to be able to match on those fields that are marked as
indexed. Everything is annotation driven. The appropriate fields have been
annotated and appear to be detected as such when I inspect what gets put
into the QueryEntityDescriptor, i.e. all expected indices and indexed
fields are present.

In GridLuceneIndex I see that the Lucene document is generated with fields
a, b (c.a and c.b are not included). Now a couple of questions arise:

1. Is there a way to get Ignite to index the nested fields as well so that
c.a and c.b end up in the doc?

2. If you use a composite object as a key, its fields are extracted into
the top level, so if you have Key.a and Value.a you cannot index both:
Key.a becomes a, which collides with Value.a. Can this be changed? Are
there any known reasons why it couldn't be? (I'm happy to send a PR
doing so, but I suspect the answer to this is linked to the answer to the
first question.)

3. The docs simply say you can use Lucene syntax. I presume that means the
syntax described in
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html is all valid.
Checking the code, that appears to be the case, as it uses
a MultiFieldQueryParser in GridLuceneIndex. However, when I try to run a
query such as a:<my-text>, none of the indexed documents match. In debug
mode I've enabled parser.setAllowLeadingWildcard(true); and if I do a
simple searcher.search * I get back the list of expected documents.

What's even more odd is that I tried querying each of the 6 indexed fields
found in idxdFields in GridLuceneIndex and only 1 of them matches. For the
others I'm typing the values exactly, and wildcards or other free-text
forms do not match either.

4. I couldn't see a way to provide a custom GridLuceneIndex. I found the
two places where it's constructed in the code base and it doesn't look like
I can inject instances. Is it OK to construct and use a custom
GridLuceneDirectory/IndexWriter/Searcher and so on, in the same way
GridLuceneIndex does, so I can write a custom IndexingSpi to change how
indexing happens?
There are a number of things I'd like to customise, and from looking at the
current impl. these things aren't injectable. I guess it's not considered a
prime use case.

The analyzer and a number of other things would be handy to change. Ideally
I'd also want to customise how a field is indexed, e.g. to be able to do
term matches with Lucene queries.

Looking at this impl, it also passes Integer.MAX_VALUE to the search and
pulls back all matches. That'll surely kill our nodes for some of the use
cases we're considering.
I'd also like to implement paging. The searcher API has a nice option to
pass through the last doc so it can continue from there, which could be used
to implement something like deep paging.

5. If I were to write a custom IndexingSpi to make all of this happen, how
do I get additional parameters through so that I could have paging params
passed?
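
For question 5, one possible convention: as far as I can tell,
IndexingSpi.query(...) receives the query arguments as a Collection<Object>
(the values given to SpiQuery.setArgs), so paging values could travel as
extra arguments after the query text. A minimal plain-Java sketch of
unpacking such arguments. PagedQueryArgs and the argument order (text, page
size, optional cursor) are assumptions made up for illustration, not an
Ignite API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;

/** Hypothetical holder for arguments unpacked inside a custom indexing SPI. */
public final class PagedQueryArgs {
    final String text;      // Lucene query text
    final int pageSize;     // maximum number of hits to return for this page
    final Long afterDocId;  // cursor from the previous page, or null for the first page

    private PagedQueryArgs(String text, int pageSize, Long afterDocId) {
        this.text = text;
        this.pageSize = pageSize;
        this.afterDocId = afterDocId;
    }

    /** Expects [text, pageSize] or [text, pageSize, afterDocId], in that order. */
    static PagedQueryArgs parse(Collection<Object> params) {
        if (params == null || params.size() < 2 || params.size() > 3)
            throw new IllegalArgumentException("expected [text, pageSize, (afterDocId)]");
        Iterator<Object> it = params.iterator();
        Object text = it.next();
        Object size = it.next();
        Object after = it.hasNext() ? it.next() : null;
        if (!(text instanceof String) || !(size instanceof Integer))
            throw new IllegalArgumentException("text must be a String and pageSize an Integer");
        if ((Integer) size <= 0)
            throw new IllegalArgumentException("pageSize must be positive");
        if (after != null && !(after instanceof Long))
            throw new IllegalArgumentException("afterDocId must be a Long");
        return new PagedQueryArgs((String) text, (Integer) size, (Long) after);
    }

    public static void main(String[] args) {
        // First page: no cursor. Next page: cursor taken from the last hit of the previous page.
        PagedQueryArgs first = parse(Arrays.<Object>asList("a:hello", 50));
        PagedQueryArgs next = parse(Arrays.<Object>asList("a:hello", 50, 1234L));
        System.out.println(first.text + " size=" + first.pageSize + " after=" + first.afterDocId);
        System.out.println(next.text + " size=" + next.pageSize + " after=" + next.afterDocId);
    }
}
```

Validating the shape up front keeps a malformed argument list from failing
somewhere deep inside the search code.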

Ideally I could customise the indexing, searching and paging through
standard Ignite means, but I can't find any way of doing that in the
current code. Short of writing a custom IndexingSpi, I think I've gone as
far as I can debugging and could do with a few pointers on how to go about
this.
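
On questions 1 and 2, a custom indexer would need some naming scheme that
includes nested values and keeps key and value fields apart. A minimal
plain-Java sketch of one such scheme, with objects modelled as maps. This is
not how Ignite builds its Lucene documents; FieldFlattener and the "key." /
"val." prefixes are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java sketch: flatten nested objects (modelled as maps) into dotted field names. */
public final class FieldFlattener {
    /** Adds every leaf under node to out, naming it by its dotted path below prefix. */
    static void flatten(String prefix, Map<String, ?> node, Map<String, Object> out) {
        for (Map.Entry<String, ?> e : node.entrySet()) {
            String name = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object v = e.getValue();
            if (v instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, ?> child = (Map<String, ?>) v;
                flatten(name, child, out);  // nested object: recurse so c.a and c.b get their own entries
            } else {
                out.put(name, v);           // leaf: becomes one field of the search document
            }
        }
    }

    /** Builds one document; the "key." and "val." prefixes keep Key.a and Value.a apart. */
    static Map<String, Object> toDocument(Map<String, ?> key, Map<String, ?> val) {
        Map<String, Object> doc = new LinkedHashMap<>();
        flatten("key", key, doc);
        flatten("val", val, doc);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> c = new LinkedHashMap<>();
        c.put("a", "nested a");
        c.put("b", "nested b");
        Map<String, Object> val = new LinkedHashMap<>();
        val.put("a", "value a");
        val.put("c", c);
        Map<String, Object> key = new LinkedHashMap<>();
        key.put("a", "key a");
        // Fields produced: key.a, val.a, val.c.a, val.c.b
        System.out.println(toDocument(key, val));
    }
}
```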

FYI, SQL isn't a great option for this part of the product. We're
generating and compiling Java classes at runtime, and generating SQL to do
the queries is an order of magnitude more work than indexing the relatively
few fields we need and then searching. But off the bat the paging would be
an issue, as there can be several million matches to a query. We can't have
Ignite pulling all of those into memory.
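
To make the paging point concrete: Lucene's
IndexSearcher.searchAfter(after, query, n) returns the next n hits after a
given hit, so a caller only ever holds one page. A minimal plain-Java sketch
of that cursor pattern over a list of doc ids sorted in ascending order.
CursorPager and its in-memory list are stand-ins for illustration, not
Lucene or Ignite code, and a real index would seek to the cursor rather
than scan from the start.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of cursor ("search after") paging over hits sorted by ascending id. */
public final class CursorPager {
    /**
     * Returns up to pageSize ids strictly greater than afterId.
     * Pass null as afterId to get the first page.
     */
    static List<Long> page(List<Long> sortedIds, Long afterId, int pageSize) {
        if (pageSize <= 0)
            throw new IllegalArgumentException("pageSize must be positive");
        List<Long> out = new ArrayList<>(pageSize);
        for (long id : sortedIds) {
            if (afterId != null && id <= afterId)
                continue;   // already returned on an earlier page
            out.add(id);
            if (out.size() == pageSize)
                break;      // stop early: never collect all matches
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> hits = Arrays.asList(3L, 7L, 9L, 12L, 20L);
        Long cursor = null;
        List<Long> p;
        // Walk all pages; the last id of each page is the cursor for the next one.
        while (!(p = page(hits, cursor, 2)).isEmpty()) {
            System.out.println(p);
            cursor = p.get(p.size() - 1);
        }
    }
}
```

Because the cursor is just the last id seen, the server keeps no per-client
state between pages.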

Thanks in advance

Courtney

Re: Fulltext matching

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

The only way to know if it will be accepted is to file those tickets and
pull requests (and then write about it on the developers list).

Regards,
-- 
Ilya Kasnacheev


On Tue, 11 Sep 2018 at 0:04, Courtney Robinson <co...@hypi.io> wrote:

> Hi,
> Thanks for the response.
> I went ahead and implemented a custom indexing SPI. Works like a charm. As
> long as Ignite doesn't drop support for the indexing SPI interface this is
> exactly what we need.
> I'm happy to create Jira issues and extract this into something more
> generic for upstream if it'll be accepted.
>
> Regards,
> Courtney Robinson
> CTO, Hypi
> Tel: +4402032870961 (GMT+0)
> https://hypi.io
>
>

Re: Fulltext matching

Posted by Courtney Robinson <co...@hypi.io>.

Hi,
Thanks for the response.
I went ahead and implemented a custom indexing SPI. Works like a charm. As
long as Ignite doesn't drop support for the indexing SPI interface this is
exactly what we need.
I'm happy to create Jira issues and extract this into something more
generic for upstream if it'll be accepted.

Regards,
Courtney Robinson
CTO, Hypi
Tel: +4402032870961 (GMT+0)
https://hypi.io


On Thu, Sep 6, 2018 at 4:09 PM Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> Unfortunately, fulltext doesn't seem to have much traction, so I recommend
> doing investigations on your side, possibly creating JIRA issues in the
> process.
>
> Regards,
> --
> Ilya Kasnacheev
>
>

Re: Fulltext matching

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

Unfortunately, fulltext doesn't seem to have much traction, so I recommend
doing investigations on your side, possibly creating JIRA issues in the
process.

Regards,
-- 
Ilya Kasnacheev

