Posted to dev@ignite.apache.org by Atri Sharma <at...@apache.org> on 2021/06/21 08:10:52 UTC

Text Queries Support

Hi All,

I have been looking into our text queries support and see that it has
limited community support.

Therefore, I volunteer to be the maintainer of the module and work on
enhancing it further.

First goal would be to move to Lucene 8.x, then work on sorted reduce
- merge across nodes. Fundamentally, this is doable since Lucene ranks
documents according to their score, and documents are returned in the
order of their score. Since the scoring function is homogeneous, this
means that across nodes, we can compare scores and merge sort.
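
As a rough illustration of the merge described above (a sketch only, not code
from Ignite or from any existing patch; ScoreMergeSketch, ScoredHit and
NodeCursor are hypothetical names), a k-way merge over per-node result lists
that are each already sorted by descending score could look like this:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical names throughout; this is not an Ignite API. */
final class ScoreMergeSketch {
    /** One hit as returned by a node: a cache key plus its score. */
    static final class ScoredHit {
        final String key;
        final float score;

        ScoredHit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Cursor over one node's results, assumed sorted by descending score. */
    private static final class NodeCursor {
        final Iterator<ScoredHit> it;
        ScoredHit head;

        NodeCursor(Iterator<ScoredHit> it) {
            this.it = it;
            this.head = it.next();
        }

        /** Moves to the next hit; returns false when the node is exhausted. */
        boolean advance() {
            if (!it.hasNext())
                return false;

            head = it.next();

            return true;
        }
    }

    /** K-way merge of per-node result lists, returning at most limit hits. */
    static List<ScoredHit> merge(List<List<ScoredHit>> perNode, int limit) {
        // Max-heap keyed on the current head of each node's stream.
        PriorityQueue<NodeCursor> heap = new PriorityQueue<>(
            Comparator.comparingDouble((NodeCursor c) -> c.head.score).reversed());

        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new NodeCursor(hits.iterator()));
        }

        List<ScoredHit> merged = new ArrayList<>();

        while (!heap.isEmpty() && merged.size() < limit) {
            NodeCursor top = heap.poll();

            merged.add(top.head);

            // Re-insert the cursor so its next hit competes with the other nodes.
            if (top.advance())
                heap.add(top);
        }

        return merged;
    }
}
```

The sketch relies on the assumption stated above, that scores from different
nodes are directly comparable. If per-node index statistics differ, that may
hold only approximately.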

Please let me know if I can take this up.

Atri

-- 
Regards,

Atri
Apache Concerted

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

Thank you, Maksim and Alexei!

To dive a bit deeper, what are our biggest issues with text queries
today? One is persistence, the other (IIUC) is the fact that we cannot
order results from different nodes (which PR 9081 seems to have
resolved?)

What else would be pending for text queries to become usable?

Atri

On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
<al...@gmail.com> wrote:
>
> Hi.
>
> One of the biggest issues with text queries is a lack of support for lucene
> indices persistence, which makes this functionality useless if a
> persistence is enabled.
>
> I would first take care of it.
>
> On Mon, Jun 21, 2021 at 12:16, Maksim Timonin <ti...@gmail.com> wrote:
>
> > Hi, Atri!
> >
> > You're right, Actually there is a lack of support for TextQueries. For the
> > last ticket I'm doing I see some obvious issues with them (no page size
> > support, for example). I'm glad that somebody wants to maintain this
> > functionality. Thanks a lot!
> >
> > For the MergeSort algorithm there is already a patch for that [1]. It's
> > currently on review. This patch introduces an abstract reducer for
> > CacheQueries with 2 implementations (unordered, merge-sort). Then TextQuery
> > leverages on MergeSort to order results from multiple nodes by score. This
> > patch also fixes the pageSize issue, I've mentioned before. Could you
> > please check if it fully matches your idea? Any issues or comments are
> > welcome.
> >
> > I've prepared this ticket, because I need the MergeSort algorithm for the
> > new type of queries I'm implementing (IndexQuery, it should also provide
> > ordered results over multiple nodes). Currently I'm not planning to go
> > further with TextQuery, so if you're going to support this it'll be a great
> > contribution, I think.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > [2] https://github.com/apache/ignite/pull/9081
> >
> >
> > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <at...@apache.org> wrote:
> >
> > > Hi All,
> > >
> > > I have been looking into our text queries support and see that it has
> > > limited community support.
> > >
> > > Therefore, I volunteer to be the maintainer of the module and work on
> > > enhancing it further.
> > >
> > > First goal would be to move to Lucene 8.x, then work on sorted reduce
> > > - merge across nodes. Fundamentally, this is doable since Lucene ranks
> > > documents according to their score, and documents are returned in the
> > > order of their score. Since the scoring function is homogeneous, this
> > > means that across nodes, we can compare scores and merge sort.
> > >
> > > Please let me know if I can take this up.
> > >
> > > Atri
> > >
> > > --
> > > Regards,
> > >
> > > Atri
> > > Apache Concerted
> > >
> >
>
>
> --
>
> Best regards,
> Alexei Scherbakov

-- 
Regards,

Atri
Apache Concerted

Re: Text Queries Support

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

Let me try to answer the questions below, since I did not see anybody do
that and thus not everybody may be on the same page.

Regards,

On Fri, Jul 23, 2021 at 13:56, Andrey Mashenkov <an...@gmail.com> wrote:

> Atri,
>
> As for now, the potential capabilities are not clear to me.
> At first glance, I see the next topics that must be covered at first:
>
> General questions
> * How Lucene index can be split among the nodes?
>
In the same fashion as SQL indexes: each node might hold the index only for
its primary partitions.


> * If we'll have a single index for all partitions on the particular node,
> then how index records will be aware of partitioning?
>
I'm not sure. How does our SQL deal with it? If there is a scenario where
some keys are no longer primary, we can perhaps filter them out of the
results and, in the meantime, exclude them from the index.
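
A minimal sketch of that filtering idea, assuming a toy affinity function
(hash modulo partition count) and a known set of primary partitions for the
local node. PrimaryFilterSketch and its methods are hypothetical; in Ignite
the partition mapping would presumably come from the cache affinity instead:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not how Ignite maps keys to partitions. */
final class PrimaryFilterSketch {
    /** Toy affinity function: maps a key to one of parts partitions. */
    static int partition(Object key, int parts) {
        return Math.floorMod(key.hashCode(), parts);
    }

    /** Keeps only the local hits whose partition is primary on this node. */
    static List<String> primaryOnly(List<String> localHits, Set<Integer> primaryParts, int parts) {
        List<String> res = new ArrayList<>();

        for (String key : localHits) {
            // Hits from backup partitions are dropped to avoid duplicates at reduce time.
            if (primaryParts.contains(partition(key, parts)))
                res.add(key);
        }

        return res;
    }
}
```

With such a filter applied on every node before results are sent to the
reducer, each key would be reported by at most one node, provided the
primary assignment does not change while the query runs.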


> This is important to filter out backup records from the results to avoid
> duplicates.
> * How results from several nodes can be merged on the Reduce stage?
>
This is actually the primary use case for Lucene/Solr; results are usually
merged by relevance/score.


> * Does Lucene support something like a JOIN operation or others that may
> require data from another partition or index?
> If so, then it looks like a multistep query with merging of results on
> intermediate stages and requires detailed investigation and design.
> It is ok if Ignite will have some limitations here, but we would like to
> know about them at the early stage.
>
Lucene has block-join, which allows it to store related data close together.
Lucene also has a regular join, but I don't see any use case for it since we
can do SQL joins as well.



> * How effectively map Lucene files to the page memory? Is it even possible?
> Otherwise, how to deal with potential OOM on large queries and memory
> capacity planning?
>
I think Lucene is pretty good here; it is a must for information retrieval,
since there is usually a lot of data.


>
> Persistence.
> * How and what consistency guarantees could we have/expect?
> Seems, we may not be able to write physical records for Lucene index to our
> WAL. What can we do with this?
>
I think we should be able to do it in the same fashion as we do with SQL
indexes: during WAL recovery, also update the Lucene index. On clean
shutdown, assume that the index is okay. If the Lucene index is removed, do
a full rebuild, as we do with index.bin.
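
The decision logic outlined above could be sketched roughly as follows. This
is an illustration under assumptions, not Ignite code: the class, the Action
enum and the map standing in for a Lucene index are all hypothetical, and WAL
updates are modeled as simple {key, text} pairs where a null text means
removal:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the start-up decision; not an Ignite API. */
final class TextIndexRecoverySketch {
    enum Action { TRUST_INDEX, REPLAY_WAL, FULL_REBUILD }

    /** Chooses what to do with the text index when the node starts. */
    static Action onStart(boolean indexFilesPresent, boolean cleanShutdown) {
        if (!indexFilesPresent)
            return Action.FULL_REBUILD;

        return cleanShutdown ? Action.TRUST_INDEX : Action.REPLAY_WAL;
    }

    /** Applies the chosen action to a toy index (key to indexed text). */
    static Map<String, String> recover(Action action, Map<String, String> index,
        Map<String, String> cacheData, List<String[]> walUpdates) {
        switch (action) {
            case FULL_REBUILD:
                // Index files are gone: re-index everything from the cache data.
                return new HashMap<>(cacheData);

            case REPLAY_WAL: {
                // Unclean stop: re-apply logical updates on top of the existing index.
                Map<String, String> res = new HashMap<>(index);

                for (String[] upd : walUpdates) {
                    if (upd[1] == null)
                        res.remove(upd[0]);
                    else
                        res.put(upd[0], upd[1]);
                }

                return res;
            }

            default:
                // Clean shutdown: assume the index on disk is consistent.
                return index;
        }
    }
}
```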


>
> Transactions.
> * Will we support transactions?
> * Should Lucene be aware of Transaction and track mvcc (or whatever)
> versions for the records?
> * What will be consistency guarantees?
>
I think the answer here is NO. Text search is not expected to be
transactionally up-to-date. It is expected to be eventually complete. So it
is OK if it takes a split second for entries to become searchable.

The traditional way to update text indexes is batching.
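
To illustrate what batching means here, below is a small sketch under
assumptions: BatchingIndexerSketch is a hypothetical class, and the
indexWriter callback stands in for whatever actually writes a batch to the
text index. Documents become searchable only once their batch is flushed,
which matches the "eventually" semantics described above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of batched index updates; not an Ignite or Lucene API. */
final class BatchingIndexerSketch {
    private final int batchSize;
    private final Consumer<List<String>> indexWriter;
    private final List<String> pending = new ArrayList<>();

    BatchingIndexerSketch(int batchSize, Consumer<List<String>> indexWriter) {
        this.batchSize = batchSize;
        this.indexWriter = indexWriter;
    }

    /** Buffers a document; it is written to the index with the next flush. */
    void add(String doc) {
        pending.add(doc);

        if (pending.size() >= batchSize)
            flush();
    }

    /** Writes all buffered documents to the index as one batch. */
    void flush() {
        if (pending.isEmpty())
            return;

        // Hand over a copy so the writer is not affected by the clear() below.
        indexWriter.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

A real implementation would probably also flush on a timer, so that a
partially filled batch does not stay unsearchable indefinitely.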


>
> UX
> * How to add FullText search queries syntax into Calcite?
> * AFAIK, the Lucene index has many properties for tuning. How will the user
> configure the index?
> * How and where to store the settings? Which are cluster-wide and which are
> local to a particular node?
> * Will all the settings be immutable? Can they be changed on the fly? After
> node/grid restart?
> * Any limitations on query syntax?
>
Solr and Elasticsearch spent a lot of time on this, and the field is huge
here. They have really extensive query languages. On the bright side, most
of the "settings" are dynamic and cached, so if you need a different
filtering of your data, all you need is to request it once. The ones that
aren't dynamic usually concern how data is prepared before being put into
the index (stemming, tokenizing, etc.); changing them will require an index
rebuild. I don't see why settings would not be shared across the cluster.


> SQL
> * Will we support FullText search in SQL?
> * How to integrate Lucene index into Calcite? What is the cost model?
> Splitting rules? Traits?
> * What about consistency with DDL operations, e.g. column rename?
> Ignite indices will operate column ID, so rename operation will not affect
> the index.
>

Regards,

-- 



>
>
> With all of this, you can go with the IEP (or even some short summary) and
> further POC and implementation.
> That's a big deal, so let's discuss what could be done here.
>
> On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org> wrote:
>
> > I am actually happy to drive the feature for Ignite 3. FTS is very
> > important for me and I think Ignite users will benefit from it
> > greatly.
> >
> > If it makes sense to be focusing on Ignite 3 for this capability, I am
> > eager to contribute there and lead the development.
> >
> > Please share your thoughts.
> >
> > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > <an...@gmail.com> wrote:
> > >
> > > Hi Atri,
> > >
> > > All the Jira tickets we have on the Full-text search (FTS) thing are
> > > targeted to Ignite 2.
> > >
> > > AFAIK, we want, but we have NOT committed to FTS support in Ignite 3,
> > yet.
> > > By the way, we are getting requests for this thing from the user side,
> > and
> > > definitely,
> > > FTS would be a valuable feature for Ignite.
> > >
> > > It will be great if the one wants to drive it, any help will be
> > appreciated.
> > >
> > >
> > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > An update, please. I am working through persistence of Lucene index
> > using
> > > > Ignite Dictionary, and will be asking some questions soon.
> > > >
> > > > I had one doubt - - where does this change go? Ignite 3?
> > > >
> > > > Also, I know we want to build native support for text searches in
> > Ignite 3.
> > > > Is the work I am proposing here part of that, or will that be a
> > separate
> > > > effort?
> > > >
> > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> ilya.kasnacheev@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hello!
> > > > >
> > > > > I think that number one is the most important one, then maybe it
> > will see
> > > > > more use and other deficiencies become more apparent, leading to
> more
> > > > > tickets and visibility.
> > > > >
> > > > > Maybe 2. and 3. will even use a different approach when persistence
> > is
> > > > > implemented.
> > > > >
> > > > > Regards,
> > > > > --
> > > > > Ilya Kasnacheev
> > > > >
> > > > >
> > > > > > On Mon, Jun 28, 2021 at 14:34, Atri Sharma <at...@apache.org> wrote:
> > > > >
> > > > > > Hello Again!
> > > > > >
> > > > > > I have been looking into the aforementioned and here are my
> follow
> > up
> > > > > > thoughts:
> > > > > >
> > > > > > 1. Support persistence of Lucene indexes.
> > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs
> > fixing of
> > > > > > moving partitions first)
> > > > > > 3. Figure out how to return scores from nodes and use them as
> sort
> > > > > > parameters on the coordinator node
> > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > >
> > > > > > Please let me know if this looks ok to make text queries
> > functional?
> > > > > >
> > > > > > Atri
> > > > > >
> > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > <al...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi.
> > > > > > >
> > > > > > > One of the biggest issues with text queries is a lack of
> support
> > for
> > > > > > lucene
> > > > > > > indices persistence, which makes this functionality useless if
> a
> > > > > > > persistence is enabled.
> > > > > > >
> > > > > > > I would first take care of it.
> > > > > > >
> > > > > > > > On Mon, Jun 21, 2021 at 12:16, Maksim Timonin <timonin.maxim@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi, Atri!
> > > > > > > >
> > > > > > > > You're right, Actually there is a lack of support for
> > TextQueries.
> > > > > For
> > > > > > the
> > > > > > > > last ticket I'm doing I see some obvious issues with them (no
> > page
> > > > > size
> > > > > > > > support, for example). I'm glad that somebody wants to
> maintain
> > > > this
> > > > > > > > functionality. Thanks a lot!
> > > > > > > >
> > > > > > > > For the MergeSort algorithm there is already a patch for that
> > [1].
> > > > > It's
> > > > > > > > currently on review. This patch introduces an abstract
> reducer
> > for
> > > > > > > > CacheQueries with 2 implementations (unordered, merge-sort).
> > Then
> > > > > > TextQuery
> > > > > > > > leverages on MergeSort to order results from multiple nodes
> by
> > > > score.
> > > > > > This
> > > > > > > > patch also fixes the pageSize issue, I've mentioned before.
> > Could
> > > > you
> > > > > > > > please check if it fully matches your idea? Any issues or
> > comments
> > > > > are
> > > > > > > > welcome.
> > > > > > > >
> > > > > > > > I've prepared this ticket, because I need the MergeSort
> > algorithm
> > > > for
> > > > > > the
> > > > > > > > new type of queries I'm implementing (IndexQuery, it should
> > also
> > > > > > provide
> > > > > > > > ordered results over multiple nodes). Currently I'm not
> > planning to
> > > > > go
> > > > > > > > further with TextQuery, so if you're going to support this
> > it'll
> > > > be a
> > > > > > great
> > > > > > > > contribution, I think.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> atri@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > > I have been looking into our text queries support and see
> > that it
> > > > > has
> > > > > > > > > limited community support.
> > > > > > > > >
> > > > > > > > > Therefore, I volunteer to be the maintainer of the module
> and
> > > > work
> > > > > on
> > > > > > > > > enhancing it further.
> > > > > > > > >
> > > > > > > > > First goal would be to move to Lucene 8.x, then work on
> > sorted
> > > > > reduce
> > > > > > > > > - merge across nodes. Fundamentally, this is doable since
> > Lucene
> > > > > > ranks
> > > > > > > > > documents according to their score, and documents are
> > returned in
> > > > > the
> > > > > > > > > order of their score. Since the scoring function is
> > homogeneous,
> > > > > this
> > > > > > > > > means that across nodes, we can compare scores and merge
> > sort.
> > > > > > > > >
> > > > > > > > > Please let me know if I can take this up.
> > > > > > > > >
> > > > > > > > > Atri
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Atri
> > > > > > > > > Apache Concerted
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Alexei Scherbakov
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Atri
> > > > > > Apache Concerted
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>

Re: Text Queries Support

Posted by Valentin Kulichenko <va...@gmail.com>.

It surely will not be supported out of the box, but in my understanding
Calcite provides enough flexibility for us to develop these features in a
way we would like to see them.

Andrey, please correct me if I am wrong here.

-Val

On Fri, Jul 23, 2021 at 10:58 AM Atri Sharma <at...@apache.org> wrote:

> The standard ways to deal with text based searches in SQL are the
> CONTAINS operator, the LIKE operator or specific functions
> (REGEXP_MATCHES, for eg). I do not think any of these are supported by
> Calcite at the moment.
>
> On Fri, Jul 23, 2021 at 11:20 PM Valentin Kulichenko
> <va...@gmail.com> wrote:
> >
> > In my experience, one of the biggest usability issues with the current
> > support of text queries is that they are completely decoupled from SQL.
> > I.e. you can either execute a SQL query OR a text query. Modern
> databases,
> > on the other hand, typically allow creating text-based indexes within
> > regular tables and then using those indexes within regular SQL queries.
> > Here is an example from Oracle:
> > https://docs.oracle.com/cd/B10501_01/text.920/a96517/cdefault.htm
> >
> > I believe this is something we can look into in the scope of Ignite 3.
> > Andrey, does Calcite have any support for this? What's your view on this?
> >
> > -Val
> >
> > On Fri, Jul 23, 2021 at 3:56 AM Andrey Mashenkov <
> andrey.mashenkov@gmail.com>
> > wrote:
> >
> > > Atri,
> > >
> > > First of all, I'd recommend going through the Ignite ticket to gather
> > > information about the current implementation issues and users' wants.
> > > Then look at a code to get a complete understanding of how things work
> now,
> > > which may help in future decisions.
> > >
> > > As we use the outdated Lucene version, some things may be irrelevant
> for
> > > the latest Lucene version.
> > > So, you will need expertise in the internals of modern Lucene version
> to
> > > understand what capabilities, guarantees, and limitations Lucene has
> and
> > > could bring to the Ignite.
> > > The expertise could be got from the Lucene project code or Lucene
> project
> > > dev-list.
> > >
> > >
> > > As for now, the potential capabilities are not clear to me.
> > > At first glance, I see the next topics that must be covered at first:
> > >
> > > General questions
> > > * How Lucene index can be split among the nodes?
> > > * If we'll have a single index for all partitions on the particular
> node,
> > > then how index records will be aware of partitioning?
> > > This is important to filter out backup records from the results to
> avoid
> > > duplicates.
> > > * How results from several nodes can be merged on the Reduce stage?
> > > * Does Lucene supports smth like JOIN operation or others that may
> require
> > > data from another partition or index?
> > > If so, then it likes to multistep query with merging results on
> > > intermediate stages and requires detailed investigation and design.
> > > It is ok if Ignite will have some limitations here, but we would like
> to
> > > know about them at the early stage.
> > > * How effectively map Lucene files to the page memory? Is it even
> possible?
> > > Otherwise, how to deal with potential OOM on large queries and memory
> > > capacity planning?
> > >
> > > Persistence.
> > > * How and what consistency guarantees could we have/expect?
> > > Seems, we may not be able to write physical records for Lucene index
> to our
> > > WAL. What can we do with this?
> > >
> > > Transactions.
> > > * Will we support transactions?
> > > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> > > versions for the records?
> > > * What will be consistency guarantees?
> > >
> > > UX
> > > * How to add FullText search queries syntax into Calcite?
> > > * AFAIK, the Lucene index has many properties for tuning. How will the
> user
> > > configure the index?
> > > * How and where to store the settings? What are cluster-wide and what a
> > > local to the particular node?
> > > * Will be all the settings immutable? Can be they changed on-fly? after
> > > node/grid restart?
> > > * Any limitations on query syntax?
> > >
> > > SQL
> > > * Will we support FullText search in SQL?
> > > * How to integrate Lucene index into Calcite? What is the cost model?
> > > Splitting rules? Traits?
> > > * What about consistency with DDL operations, e.g. column rename?
> > > Ignite indices will operate column ID, so rename operation will not
> affect
> > > the index.
> > >
> > >
> > > With all of this, you can go with the IEP (or even some short summary)
> and
> > > further POC and implementation.
> > > That's a big deal, so let's discuss what could be done here.
> > >
> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org> wrote:
> > >
> > > > I am actually happy to drive the feature for Ignite 3. FTS is very
> > > > important for me and I think Ignite users will benefit from it
> > > > greatly.
> > > >
> > > > If it makes sense to be focusing on Ignite 3 for this capability, I
> am
> > > > eager to contribute there and lead the development.
> > > >
> > > > Please share your thoughts.
> > > >
> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > > > <an...@gmail.com> wrote:
> > > > >
> > > > > Hi Atri,
> > > > >
> > > > > All the Jira tickets we have on the Full-text search (FTS) thing
> are
> > > > > targeted to Ignite 2.
> > > > >
> > > > > AFAIK, we want, but we have NOT committed to FTS support in Ignite
> 3,
> > > > yet.
> > > > > By the way, we are getting requests for this thing from the user
> side,
> > > > and
> > > > > definitely,
> > > > > FTS would be a valuable feature for Ignite.
> > > > >
> > > > > It will be great if the one wants to drive it, any help will be
> > > > appreciated.
> > > > >
> > > > >
> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org>
> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > An update, please. I am working through persistence of Lucene
> index
> > > > using
> > > > > > Ignite Dictionary, and will be asking some questions soon.
> > > > > >
> > > > > > I had one doubt - - where does this change go? Ignite 3?
> > > > > >
> > > > > > Also, I know we want to build native support for text searches in
> > > > Ignite 3.
> > > > > > Is the work I am proposing here part of that, or will that be a
> > > > separate
> > > > > > effort?
> > > > > >
> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> > > ilya.kasnacheev@gmail.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > I think that number one is the most important one, then maybe
> it
> > > > will see
> > > > > > > more use and other deficiencies become more apparent, leading
> to
> > > more
> > > > > > > tickets and visibility.
> > > > > > >
> > > > > > > Maybe 2. and 3. will even use a different approach when
> persistence
> > > > is
> > > > > > > implemented.
> > > > > > >
> > > > > > > Regards,
> > > > > > > --
> > > > > > > Ilya Kasnacheev
> > > > > > >
> > > > > > >
> > > > > > > > On Mon, Jun 28, 2021 at 14:34, Atri Sharma <at...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hello Again!
> > > > > > > >
> > > > > > > > I have been looking into the aforementioned and here are my
> > > follow
> > > > up
> > > > > > > > thoughts:
> > > > > > > >
> > > > > > > > 1. Support persistence of Lucene indexes.
> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs
> > > > fixing of
> > > > > > > > moving partitions first)
> > > > > > > > 3. Figure out how to return scores from nodes and use them as
> > > sort
> > > > > > > > parameters on the coordinator node
> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > > > >
> > > > > > > > Please let me know if this looks ok to make text queries
> > > > functional?
> > > > > > > >
> > > > > > > > Atri
> > > > > > > >
> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > > > <al...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi.
> > > > > > > > >
> > > > > > > > > One of the biggest issues with text queries is a lack of
> > > support
> > > > for
> > > > > > > > lucene
> > > > > > > > > indices persistence, which makes this functionality
> useless if
> > > a
> > > > > > > > > persistence is enabled.
> > > > > > > > >
> > > > > > > > > I would first take care of it.
> > > > > > > > >
> > > > > > > > > > On Mon, Jun 21, 2021 at 12:16, Maksim Timonin <timonin.maxim@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Atri!
> > > > > > > > > >
> > > > > > > > > > You're right, Actually there is a lack of support for
> > > > TextQueries.
> > > > > > > For
> > > > > > > > the
> > > > > > > > > > last ticket I'm doing I see some obvious issues with
> them (no
> > > > page
> > > > > > > size
> > > > > > > > > > support, for example). I'm glad that somebody wants to
> > > maintain
> > > > > > this
> > > > > > > > > > functionality. Thanks a lot!
> > > > > > > > > >
> > > > > > > > > > For the MergeSort algorithm there is already a patch for
> that
> > > > [1].
> > > > > > > It's
> > > > > > > > > > currently on review. This patch introduces an abstract
> > > reducer
> > > > for
> > > > > > > > > > CacheQueries with 2 implementations (unordered,
> merge-sort).
> > > > Then
> > > > > > > > TextQuery
> > > > > > > > > > leverages on MergeSort to order results from multiple
> nodes
> > > by
> > > > > > score.
> > > > > > > > This
> > > > > > > > > > patch also fixes the pageSize issue, I've mentioned
> before.
> > > > Could
> > > > > > you
> > > > > > > > > > please check if it fully matches your idea? Any issues or
> > > > comments
> > > > > > > are
> > > > > > > > > > welcome.
> > > > > > > > > >
> > > > > > > > > > I've prepared this ticket, because I need the MergeSort
> > > > algorithm
> > > > > > for
> > > > > > > > the
> > > > > > > > > > new type of queries I'm implementing (IndexQuery, it
> should
> > > > also
> > > > > > > > provide
> > > > > > > > > > ordered results over multiple nodes). Currently I'm not
> > > > planning to
> > > > > > > go
> > > > > > > > > > further with TextQuery, so if you're going to support
> this
> > > > it'll
> > > > > > be a
> > > > > > > > great
> > > > > > > > > > contribution, I think.
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> > > atri@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > > > I have been looking into our text queries support and
> see
> > > > that it
> > > > > > > has
> > > > > > > > > > > limited community support.
> > > > > > > > > > >
> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the
> module
> > > and
> > > > > > work
> > > > > > > on
> > > > > > > > > > > enhancing it further.
> > > > > > > > > > >
> > > > > > > > > > > First goal would be to move to Lucene 8.x, then work on
> > > > sorted
> > > > > > > reduce
> > > > > > > > > > > - merge across nodes. Fundamentally, this is doable
> since
> > > > Lucene
> > > > > > > > ranks
> > > > > > > > > > > documents according to their score, and documents are
> > > > returned in
> > > > > > > the
> > > > > > > > > > > order of their score. Since the scoring function is
> > > > homogeneous,
> > > > > > > this
> > > > > > > > > > > means that across nodes, we can compare scores and
> merge
> > > > sort.
> > > > > > > > > > >
> > > > > > > > > > > Please let me know if I can take this up.
> > > > > > > > > > >
> > > > > > > > > > > Atri
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Atri
> > > > > > > > > > > Apache Concerted
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Alexei Scherbakov
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Atri
> > > > > > > > Apache Concerted
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Andrey V. Mashenkov
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Atri
> > > > Apache Concerted
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> > >
>
> --
> Regards,
>
> Atri
> Apache Concerted
>

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

The standard ways to deal with text-based searches in SQL are the
CONTAINS operator, the LIKE operator, or specific functions
(REGEXP_MATCHES, for example). I do not think any of these are supported
by Calcite at the moment.

On Fri, Jul 23, 2021 at 11:20 PM Valentin Kulichenko
<va...@gmail.com> wrote:
>
> In my experience, one of the biggest usability issues with the current
> support of text queries is that they are completely decoupled from SQL.
> I.e. you can either execute a SQL query OR a text query. Modern databases,
> on the other hand, typically allow creating text-based indexes within
> regular tables and then using those indexes within regular SQL queries.
> Here is an example from Oracle:
> https://docs.oracle.com/cd/B10501_01/text.920/a96517/cdefault.htm
>
> I believe this is something we can look into in the scope of Ignite 3.
> Andrey, does Calcite have any support for this? What's your view on this?
>
> -Val
>
> On Fri, Jul 23, 2021 at 3:56 AM Andrey Mashenkov <an...@gmail.com>
> wrote:
>
> > Atri,
> >
> > First of all, I'd recommend going through the Ignite ticket to gather
> > information about the current implementation issues and users' wants.
> > Then look at a code to get a complete understanding of how things work now,
> > which may help in future decisions.
> >
> > As we use the outdated Lucene version, some things may be irrelevant for
> > the latest Lucene version.
> > So, you will need expertise in the internals of modern Lucene version to
> > understand what capabilities, guarantees, and limitations Lucene has and
> > could bring to the Ignite.
> > The expertise could be got from the Lucene project code or Lucene project
> > dev-list.
> >
> >
> > As for now, the potential capabilities are not clear to me.
> > At first glance, I see the next topics that must be covered at first:
> >
> > General questions
> > * How Lucene index can be split among the nodes?
> > * If we'll have a single index for all partitions on the particular node,
> > then how index records will be aware of partitioning?
> > This is important to filter out backup records from the results to avoid
> > duplicates.
> > * How results from several nodes can be merged on the Reduce stage?
> > * Does Lucene supports smth like JOIN operation or others that may require
> > data from another partition or index?
> > If so, then it likes to multistep query with merging results on
> > intermediate stages and requires detailed investigation and design.
> > It is ok if Ignite will have some limitations here, but we would like to
> > know about them at the early stage.
> > * How effectively map Lucene files to the page memory? Is it even possible?
> > Otherwise, how to deal with potential OOM on large queries and memory
> > capacity planning?
> >
> > Persistence.
> > * How and what consistency guarantees could we have/expect?
> > Seems, we may not be able to write physical records for Lucene index to our
> > WAL. What can we do with this?
> >
> > Transactions.
> > * Will we support transactions?
> > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> > versions for the records?
> > * What will be consistency guarantees?
> >
> > UX
> > * How to add FullText search queries syntax into Calcite?
> > * AFAIK, the Lucene index has many properties for tuning. How will the user
> > configure the index?
> > * How and where to store the settings? What are cluster-wide and what a
> > local to the particular node?
> > * Will be all the settings immutable? Can be they changed on-fly? after
> > node/grid restart?
> > * Any limitations on query syntax?
> >
> > SQL
> > * Will we support FullText search in SQL?
> > * How to integrate Lucene index into Calcite? What is the cost model?
> > Splitting rules? Traits?
> > * What about consistency with DDL operations, e.g. column rename?
> > Ignite indices will operate column ID, so rename operation will not affect
> > the index.
> >
> >
> > With all of this, you can go with the IEP (or even some short summary) and
> > further POC and implementation.
> > That's a big deal, so let's discuss what could be done here.
> >
> > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org> wrote:
> >
> > > I am actually happy to drive the feature for Ignite 3. FTS is very
> > > important for me and I think Ignite users will benefit from it
> > > greatly.
> > >
> > > If it makes sense to be focusing on Ignite 3 for this capability, I am
> > > eager to contribute there and lead the development.
> > >
> > > Please share your thoughts.
> > >
> > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > > <an...@gmail.com> wrote:
> > > >
> > > > Hi Atri,
> > > >
> > > > All the Jira tickets we have on the Full-text search (FTS) thing are
> > > > targeted to Ignite 2.
> > > >
> > > > AFAIK, we want, but we have NOT committed to FTS support in Ignite 3,
> > > yet.
> > > > By the way, we are getting requests for this thing from the user side,
> > > and
> > > > definitely,
> > > > FTS would be a valuable feature for Ignite.
> > > >
> > > > It will be great if the one wants to drive it, any help will be
> > > appreciated.
> > > >
> > > >
> > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > An update, please. I am working through persistence of Lucene index
> > > using
> > > > > Ignite Dictionary, and will be asking some questions soon.
> > > > >
> > > > > I had one doubt - - where does this change go? Ignite 3?
> > > > >
> > > > > Also, I know we want to build native support for text searches in
> > > Ignite 3.
> > > > > Is the work I am proposing here part of that, or will that be a
> > > separate
> > > > > effort?
> > > > >
> > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> > ilya.kasnacheev@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hello!
> > > > > >
> > > > > > I think that number one is the most important one, then maybe it
> > > will see
> > > > > > more use and other deficiencies become more apparent, leading to
> > more
> > > > > > tickets and visibility.
> > > > > >
> > > > > > Maybe 2. and 3. will even use a different approach when persistence
> > > is
> > > > > > implemented.
> > > > > >
> > > > > > Regards,
> > > > > > --
> > > > > > Ilya Kasnacheev
> > > > > >
> > > > > >
> > > > > > > On Mon, Jun 28, 2021 at 14:34, Atri Sharma <at...@apache.org> wrote:
> > > > > >
> > > > > > > Hello Again!
> > > > > > >
> > > > > > > I have been looking into the aforementioned and here are my
> > follow
> > > up
> > > > > > > thoughts:
> > > > > > >
> > > > > > > 1. Support persistence of Lucene indexes.
> > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs
> > > fixing of
> > > > > > > moving partitions first)
> > > > > > > 3. Figure out how to return scores from nodes and use them as
> > sort
> > > > > > > parameters on the coordinator node
> > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > > >
> > > > > > > Please let me know if this looks ok to make text queries
> > > functional?
> > > > > > >
> > > > > > > Atri
> > > > > > >
> > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > > <al...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi.
> > > > > > > >
> > > > > > > > One of the biggest issues with text queries is a lack of
> > support
> > > for
> > > > > > > lucene
> > > > > > > > indices persistence, which makes this functionality useless if
> > a
> > > > > > > > persistence is enabled.
> > > > > > > >
> > > > > > > > I would first take care of it.
> > > > > > > >
> > > > > > > > > On Mon, Jun 21, 2021 at 12:16, Maksim Timonin <timonin.maxim@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi, Atri!
> > > > > > > > >
> > > > > > > > > You're right, Actually there is a lack of support for
> > > TextQueries.
> > > > > > For
> > > > > > > the
> > > > > > > > > last ticket I'm doing I see some obvious issues with them (no
> > > page
> > > > > > size
> > > > > > > > > support, for example). I'm glad that somebody wants to
> > maintain
> > > > > this
> > > > > > > > > functionality. Thanks a lot!
> > > > > > > > >
> > > > > > > > > For the MergeSort algorithm there is already a patch for that
> > > [1].
> > > > > > It's
> > > > > > > > > currently on review. This patch introduces an abstract
> > reducer
> > > for
> > > > > > > > > CacheQueries with 2 implementations (unordered, merge-sort).
> > > Then
> > > > > > > TextQuery
> > > > > > > > > leverages on MergeSort to order results from multiple nodes
> > by
> > > > > score.
> > > > > > > This
> > > > > > > > > patch also fixes the pageSize issue, I've mentioned before.
> > > Could
> > > > > you
> > > > > > > > > please check if it fully matches your idea? Any issues or
> > > comments
> > > > > > are
> > > > > > > > > welcome.
> > > > > > > > >
> > > > > > > > > I've prepared this ticket, because I need the MergeSort
> > > algorithm
> > > > > for
> > > > > > > the
> > > > > > > > > new type of queries I'm implementing (IndexQuery, it should
> > > also
> > > > > > > provide
> > > > > > > > > ordered results over multiple nodes). Currently I'm not
> > > planning to
> > > > > > go
> > > > > > > > > further with TextQuery, so if you're going to support this
> > > it'll
> > > > > be a
> > > > > > > great
> > > > > > > > > contribution, I think.
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> > atri@apache.org>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > I have been looking into our text queries support and see
> > > that it
> > > > > > has
> > > > > > > > > > limited community support.
> > > > > > > > > >
> > > > > > > > > > Therefore, I volunteer to be the maintainer of the module
> > and
> > > > > work
> > > > > > on
> > > > > > > > > > enhancing it further.
> > > > > > > > > >
> > > > > > > > > > First goal would be to move to Lucene 8.x, then work on
> > > sorted
> > > > > > reduce
> > > > > > > > > > - merge across nodes. Fundamentally, this is doable since
> > > Lucene
> > > > > > > ranks
> > > > > > > > > > documents according to their score, and documents are
> > > returned in
> > > > > > the
> > > > > > > > > > order of their score. Since the scoring function is
> > > homogeneous,
> > > > > > this
> > > > > > > > > > means that across nodes, we can compare scores and merge
> > > sort.
> > > > > > > > > >
> > > > > > > > > > Please let me know if I can take this up.
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > > Apache Concerted
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Alexei Scherbakov
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > >
> > > > > > > Atri
> > > > > > > Apache Concerted
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Andrey V. Mashenkov
> > >
> > > --
> > > Regards,
> > >
> > > Atri
> > > Apache Concerted
> > >
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
> >

-- 
Regards,

Atri
Apache Concerted
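
The quoted thread above describes ordering results from multiple nodes by
score with a merge-sort reducer. As an illustration only, here is a minimal
sketch of such a reduce step in plain Java. It assumes, as the original post
does, that scores are comparable across nodes and that each node returns its
hits already sorted by descending score. ScoredHit and ScoreMergeReducer are
hypothetical names, not Ignite or Lucene classes, and this is not the code
from the patch under review.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** One scored hit as a node might return it (hypothetical type, not an Ignite or Lucene class). */
final class ScoredHit {
    final String key;
    final float score;

    ScoredHit(String key, float score) {
        this.key = key;
        this.score = score;
    }
}

/** Reduce-side k-way merge of per-node hit lists, each already sorted by descending score. */
final class ScoreMergeReducer {
    /** Read position within one node's sorted result list. */
    private static final class Cursor {
        final List<ScoredHit> hits;
        int pos;

        Cursor(List<ScoredHit> hits) {
            this.hits = hits;
        }

        float headScore() {
            return hits.get(pos).score;
        }
    }

    /** Returns at most {@code limit} hits from all nodes in descending score order. */
    static List<ScoredHit> merge(List<List<ScoredHit>> perNode, int limit) {
        // Max-heap keyed on the score of each cursor's current head.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Float.compare(b.headScore(), a.headScore()));

        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        List<ScoredHit> merged = new ArrayList<>();

        while (!heap.isEmpty() && merged.size() < limit) {
            Cursor c = heap.poll();
            merged.add(c.hits.get(c.pos));

            // Advance this node's cursor and put it back if it has more hits.
            if (++c.pos < c.hits.size())
                heap.add(c);
        }

        return merged;
    }
}
```

The heap holds one cursor per node, so the coordinator only ever compares the
current head of each node's list. That is what would let it stop after a page
of results instead of sorting everything.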

Re: Text Queries Support

Posted by Valentin Kulichenko <va...@gmail.com>.

In my experience, one of the biggest usability issues with the current
support of text queries is that they are completely decoupled from SQL.
I.e. you can either execute a SQL query OR a text query. Modern databases,
on the other hand, typically allow creating text-based indexes within
regular tables and then using those indexes within regular SQL queries.
Here is an example from Oracle:
https://docs.oracle.com/cd/B10501_01/text.920/a96517/cdefault.htm

I believe this is something we can look into in the scope of Ignite 3.
Andrey, does Calcite have any support for this? What's your view on this?

-Val

On Fri, Jul 23, 2021 at 3:56 AM Andrey Mashenkov <an...@gmail.com>
wrote:

> Atri,
>
> First of all, I'd recommend going through the Ignite tickets to gather
> information about the current implementation issues and what users want.
> Then look at the code to get a complete understanding of how things work now,
> which may help in future decisions.
>
> As we use an outdated Lucene version, some things may be irrelevant for
> the latest Lucene version.
> So, you will need expertise in the internals of a modern Lucene version to
> understand what capabilities, guarantees, and limitations Lucene has and
> could bring to Ignite.
> That expertise can be gained from the Lucene project code or the Lucene
> project dev list.
>
>
> As of now, the potential capabilities are not clear to me.
> At first glance, I see the following topics that must be covered first:
>
> General questions
> * How can a Lucene index be split among the nodes?
> * If we have a single index for all partitions on a particular node,
> then how will index records be aware of partitioning?
> This is important to filter out backup records from the results to avoid
> duplicates.
> * How can results from several nodes be merged at the Reduce stage?
> * Does Lucene support something like a JOIN operation, or other operations
> that may require data from another partition or index?
> If so, then it looks like a multistep query with merging of results at
> intermediate stages, and it requires detailed investigation and design.
> It is OK if Ignite has some limitations here, but we would like to
> know about them at an early stage.
> * How can Lucene files be mapped to the page memory effectively? Is it even
> possible? Otherwise, how do we deal with potential OOM on large queries and
> with memory capacity planning?
>
> Persistence.
> * What consistency guarantees could we have/expect, and how?
> It seems we may not be able to write physical records for the Lucene index
> to our WAL. What can we do about this?
>
> Transactions.
> * Will we support transactions?
> * Should Lucene be aware of transactions and track MVCC (or whatever)
> versions for the records?
> * What will the consistency guarantees be?
>
> UX
> * How do we add full-text search query syntax to Calcite?
> * AFAIK, the Lucene index has many properties for tuning. How will the user
> configure the index?
> * How and where do we store the settings? Which are cluster-wide and which
> are local to a particular node?
> * Will all the settings be immutable? Can they be changed on the fly? After
> a node/grid restart?
> * Any limitations on query syntax?
>
> SQL
> * Will we support full-text search in SQL?
> * How do we integrate the Lucene index into Calcite? What is the cost model?
> Splitting rules? Traits?
> * What about consistency with DDL operations, e.g. column rename?
> Ignite indices will operate on column IDs, so a rename operation will not
> affect the index.
>
>
> With all of this, you can go ahead with an IEP (or even a short summary),
> followed by a POC and implementation.
> That's a big deal, so let's discuss what could be done here.
>
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>
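
One of the general questions quoted above is how records in a single node-wide
index can be aware of partitioning, so that backup copies can be filtered out
of the results. A minimal sketch of one possible approach follows. It assumes
each indexed record carries the partition number of its key and that the node
knows which partitions it currently owns as primary. PartitionedHit,
PrimaryPartitionFilter and the toy partitionOf mapping are hypothetical names
used for illustration, not Ignite APIs, and nothing in the thread says this is
the chosen design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** A hit from a node-wide index, tagged with the partition of its key (hypothetical type). */
final class PartitionedHit {
    final String key;
    final int partition;

    PartitionedHit(String key, int partition) {
        this.key = key;
        this.partition = partition;
    }
}

/** Drops hits that belong to backup partitions before they are sent to the reducer. */
final class PrimaryPartitionFilter {
    /** Toy key-to-partition mapping; a stand-in for a real affinity function. */
    static int partitionOf(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Keeps only hits whose partition the local node owns as primary. */
    static List<PartitionedHit> primaryOnly(List<PartitionedHit> hits, Set<Integer> primaryPartitions) {
        List<PartitionedHit> res = new ArrayList<>();

        for (PartitionedHit hit : hits) {
            if (primaryPartitions.contains(hit.partition))
                res.add(hit);
        }

        return res;
    }
}
```

Filtering on the map side like this would mean the reducer never sees the
same key from both a primary and a backup copy. The trade-off is that the set
of primary partitions has to stay correct while partitions are being moved.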

Re: Text Queries Support

Posted by Kseniya Romanova <ks...@apache.org>.

I think we can invite them to our virtual meetup and share details. Your
thoughts?

On Thu, Oct 28, 2021 at 10:15, Ivan Pavlukhin <vo...@gmail.com> wrote:

> Hi Maximiliano,
>
> Thank you for pointing this out, it is rather interesting. Have you tried to
> communicate with the hawkore team? I doubt that anyone in the community
> knows the implementation details of the hawkore additions.
>
> 2021-10-22 19:58 GMT+03:00, Maximiliano Gazquez <maximiliano.628@gmail.com>:
> > Hello everyone!
> >
> > I wanted to add this to the discussion.
> > I've found this project, https://github.com/hawkore/ignite-hk, which promises
> > to solve most of the issues that are being discussed here, like pagination,
> > sorting and, most importantly, persisting the Lucene index.
> >
> > It does stuff like this to create indexes:
> >
> > CREATE INDEX PERSON_LUCENE_IDX ON "PUBLIC".PERSON(LUCENE)
> > FULLTEXT '{
> > ''refresh_seconds'':''60'',
> > ''directory_path'':'''',
> > ''ram_buffer_mb'':''10'',
> > ''max_cached_mb'':''-1'',
> > ''partitioner'':''{"type":"token","partitions":10}'',
> > ''optimizer_enabled'':''true'',
> > ''optimizer_schedule'':''0 1 * * *'',
> > ''version'':''0'',
> > ''schema'':''{
> >     "default_analyzer":"english",
> >     "analyzers":{"my_custom_analyzer":{"type":"snowball","language":"Spanish","stopwords":"el,la,lo,loas,las,a,ante,bajo,cabe,con,contra"}},
> >     "fields":{
> >       "duration":{"type":"date_range","from":"start_date","to":"stop_date","validated":false,"pattern":"yyyy/MM/dd"},
> >       "place":{"type":"geo_point","latitude":"latitude","longitude":"longitude"},
> >       "date":{"type":"date","validated":true,"pattern":"yyyy/MM/dd"},
> >       "number":{"type":"integer","validated":false,"boost":1.0},
> >       "gender":{"type":"string","validated":true,"case_sensitive":true},
> >       "bool":{"type":"boolean","validated":false},
> >       "phrase":{"type":"text","validated":false,"analyzer":"my_custom_analyzer"},
> >       "name":{"type":"string","validated":false,"case_sensitive":true},
> >       "animal":{"type":"string","validated":false,"case_sensitive":true},
> >       "age":{"type":"integer","validated":false,"boost":1.0},
> >       "food":{"type":"string","validated":false,"case_sensitive":true}
> >     }
> >   }''
> > }';
> >
> > And this to use that Lucene index from inside SQL:
> >
> > SELECT * FROM "test".user
> > WHERE lucene = '{ query : {
> >                               type : "boolean",
> >                               must : [{type : "wildcard", field : "name", value : "J*"},
> >                                       {type : "wildcard", field : "food", value : "tu*"}]}}';
> >
> > More examples here:
> > https://github.com/hawkore/examples-apache-ignite-extensions/tree/master/examples-advanced-ignite-indexing
> >
> > I don't have anything to do with that company, but it would be great to know
> > how they implemented this stuff.
> >
> >
> > On Mon, Aug 9, 2021 at 3:00 AM Ivan Pavlukhin <vo...@gmail.com> wrote:
> >
> >> Hi Atri,
> >>
> >> Sorry for the late answer.
> >>
> >> > I didn't quite understand. Are you proposing that Ignite should not
> >> > have FTS capabilities?
> >>
> >> It seems to be an option to me. IMHO it is better to have no FTS than
> >> something like the current Ignite TextQueries.
> >>
> >> 2021-08-03 12:45 GMT+03:00, Atri Sharma <at...@apache.org>:
> >> > Hi Ivan,
> >> >
> >> > I didn't quite understand. Are you proposing that Ignite should not
> >> > have FTS capabilities?
> >> >
> >> > Atri
> >> >
> >> > On Tue, Aug 3, 2021 at 2:57 PM Ivan Pavlukhin <vo...@gmail.com> wrote:
> >> >>
> >> >> Hi Atri,
> >> >>
> >> >> My main concern is non-maleficence. Every task has several solutions,
> >> >> e.g. straightforward ones:
> >> >> 1. Do not implement FTS.
> >> >> 2. Create our own implementation.
> >> >>
> >> >> Some of the strongest ones live without FTS [1].
> >> >>
> >> >> [1] https://github.com/cockroachdb/cockroach/issues/7821
> >> >>
> >> >> 2021-08-02 11:33 GMT+03:00, Atri Sharma <at...@apache.org>:
> >> >> > Hi Ivan,
> >> >> >
> >> >> > Would you like to propose an alternative to Lucene?
> >> >> >
> >> >> > Atri
> >> >> >
> >> >> > On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <vo...@gmail.com> wrote:
> >> >> >
> >> >> >> Folks,
> >> >> >>
> >> >> >> Sorry if I have not read the thread thoroughly enough, but do we
> >> >> >> consider Lucene the obviously right choice? In my understanding, Ignite
> >> >> >> history has shown clearly that the "fastest feature implementation" is
> >> >> >> usually not the best one, and text queries are one example of this.
> >> >> >> Aren't we trying to make the same mistake again? FTS is a huge feature;
> >> >> >> I do not believe there is an easy win for it.
> >> >> >>
> >> >> >> 2021-07-27 19:18 GMT+03:00, Atri Sharma <at...@apache.org>:
> >> >> >> > Andrey,
> >> >> >> >
> >> >> >> >> A per-partition Lucene index looks simple to implement, but it may
> >> >> >> >> require per-partition SQL to make full-text search expressions work
> >> >> >> >> correctly within the SQL query.
> >> >> >> > I think that as long as we follow the map-reduce process that we
> >> >> >> > already use for other queries, we should be fine.
> >> >> >> >
> >> >> >> >> A per-partition SQL index may kill the performance. We already tried
> >> >> >> >> to do that in Ignite 2. The QueryParallelism feature helps to speed up
> >> >> >> >> some data-intensive queries, but it hurts performance in simple cases,
> >> >> >> >> and at some point (e.g. when the number of segments exceeds the number
> >> >> >> >> of CPUs) performance degrades rapidly as the number of segments grows.
> >> >> >> >
> >> >> >> > Yeah, that is always the case, but a global index will be a nightmare
> >> >> >> > in terms of concurrency, and pessimistic concurrency control will
> >> >> >> > kill the benefits anyway, coupled with the metadata requirements.
> >> >> >> > What were the specific issues with the per-partition index?
> >> >> >> >>
> >> >> >> >> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
> >> >> >> >> Maybe the map-reduce technique underneath FTS expressions and some
> >> >> >> >> hacks will add a minimal overhead.
> >> >> >> >
> >> >> >> > Lucene uses many types of indices, but the point here is that
> >> >> >> > per-partition Lucene indices can return docIDs and we can merge them in
> >> >> >> > the reduce phase. So we are abstracted from the specifics of the
> >> >> >> > internal index being used to serve the query.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> >> >> >> > Lucene indices. The important thing here is to not treat Lucene
> >> >> >> >> > indices as the source of truth.
> >> >> >> >> To use the WAL, we should either relay Lucene files to our page memory
> >> >> >> >> or be aware of the Lucene file structure.
> >> >> >> >> The first looks tricky, as we would have to guarantee a contiguous
> >> >> >> >> address space in page memory to reflect a Lucene file. Maybe a separate
> >> >> >> >> managed memory segment with its own rules?
> >> >> >> >
> >> >> >> > Why not use Lucene's MMapDirectory and map it to our storage
> >> >> >> > classes?
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >> Transactions.
> >> >> >> >> >> * Will we support transactions?
> >> >> >> >> > Lucene has no concept of transactions.
> >> >> >> >> Yes, but we have.
> >> >> >> >> The Lucene index may be non-transactional, but users never expect to
> >> >> >> >> see uncommitted data.
> >> >> >> >> How does this connect with transactional SQL?
> >> >> >> > We could have the Lucene writes done as a part of the transaction and
> >> >> >> > ack back only when they succeed or fail. WDYT?
> >> >> >> >>
> >> >> >> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <at...@apache.org> wrote:
> >> >> >> >>
> >> >> >> >> > Sorry, I planned on creating a Wiki page for this, but it makes more
> >> >> >> >> > sense to reply here.
> >> >> >> >> >
> >> >> >> >> > > * How can a Lucene index be split among the nodes?
> >> >> >> >> >
> >> >> >> >> > We can have partition-level indices on each node.
> >> >> >> >> >
> >> >> >> >> > > * If we have a single index for all partitions on a particular node,
> >> >> >> >> > > then how will index records be aware of partitioning?
> >> >> >> >> >
> >> >> >> >> > Index records don't need to be aware of partitioning: each Lucene
> >> >> >> >> > index is independent.
> >> >> >> >> >
> >> >> >> >> > > This is important to filter out backup records from the results to
> >> >> >> >> > > avoid duplicates.
> >> >> >> >> >
> >> >> >> >> > We can merge documents from different nodes and remove duplicates as
> >> >> >> >> > long as docIDs are globally unique.
> >> >> >> >> >
> >> >> >> >> > > * How can results from several nodes be merged at the Reduce stage?
> >> >> >> >> >
> >> >> >> >> > As long as documents have a globally unique docID, Lucene has merge
> >> >> >> >> > functions that can combine multiple partial results.
> >> >> >> >> >
> >> >> >> >> > > * Does Lucene support something like a JOIN operation, or other
> >> >> >> >> > > operations that may require data from another partition or index?
> >> >> >> >> >
> >> >> >> >> > As illustrated by Ilya, Block-Join works for us.
> >> >> >> >> >
> >> >> >> >> > > If so, then it looks like a multistep query with merging of results
> >> >> >> >> > > at intermediate stages, and it requires detailed investigation and
> >> >> >> >> > > design.
> >> >> >> >> > > It is OK if Ignite has some limitations here, but we would like to
> >> >> >> >> > > know about them at an early stage.
> >> >> >> >> >
> >> >> >> >> > > * How can Lucene files be mapped to the page memory effectively? Is
> >> >> >> >> > > it even possible?
> >> >> >> >> >
> >> >> >> >> > Lucene has Directory implementations which allow storing Lucene
> >> >> >> >> > indices on different kinds of storage. It has an MMapDirectory that
> >> >> >> >> > we could use?
> >> >> >> >> >
> >> >> >> >> > > Otherwise, how do we deal with potential OOM on large queries and
> >> >> >> >> > > with memory capacity planning?
> >> >> >> >> >
> >> >> >> >> > We can use Lucene's MMapDirectory.
> >> >> >> >> >
> >> >> >> >> > >
> >> >> >> >> > > Persistence.
> >> >> >> >> > > * What consistency guarantees could we have/expect, and how?
> >> >> >> >> >
> >> >> >> >> > Lucene does not have a WAL, but it is append-only.
> >> >> >> >> >
> >> >> >> >> > > It seems we may not be able to write physical records for the Lucene
> >> >> >> >> > > index to our WAL. What can we do about this?
> >> >> >> >> >
> >> >> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> >> >> >> > Lucene indices. The important thing here is to not treat Lucene
> >> >> >> >> > indices as the source of truth.
> >> >> >> >> > >
> >> >> >> >> > > Transactions.
> >> >> >> >> > > * Will we support transactions?
> >> >> >> >> > Lucene has no concept of transactions.
> >> >> >> >> >
> >> >> >> >> > > * Should Lucene be aware of transactions and track MVCC (or whatever)
> >> >> >> >> > > versions for the records?
> >> >> >> >> > No.
> >> >> >> >> > > * What will the consistency guarantees be?
> >> >> >> >> > We can acknowledge writes only after the Lucene index is updated.
> >> >> >> >> > >
> >> >> >> >> > > UX
> >> >> >> >> > > * How do we add full-text search query syntax to Calcite?
> >> >> >> >> > Postgres's FTS functions are a good reference.
> >> >> >> >> > > * AFAIK, the Lucene index has many properties for tuning. How will
> >> >> >> >> > > the user configure the index?
> >> >> >> >> > Most of those properties can be cluster-level and exposed as a new
> >> >> >> >> > sub-config for Ignite.
> >> >> >> >> > > * How and where do we store the settings? Which are cluster-wide and
> >> >> >> >> > > which are local to a particular node?
> >> >> >> >> > All can be cluster-level.
> >> >> >> >> > > * Will all the settings be immutable? Can they be changed on the fly?
> >> >> >> >> > > After a node/grid restart?
> >> >> >> >> > They should be applied after a restart.
> >> >> >> >> >
> >> >> >> >> > > * Any limitations on query syntax?
> >> >> >> >> > It depends on how we model our queries for text search.
> >> >> >> >> >
> >> >> >> >> > >
> >> >> >> >> > > SQL
> >> >> >> >> > > * Will we support full-text search in SQL?
> >> >> >> >> > We need custom functions for it. See Postgres's FTS functions.
> >> >> >> >> > > * How do we integrate the Lucene index into Calcite? What is the cost
> >> >> >> >> > > model?
> >> >> >> >> > There cannot be any cost model, since there are no alternative paths
> >> >> >> >> > for a text query: if we see a text query, we have to use the Lucene
> >> >> >> >> > index or return an error. So we need to model text search as a set of
> >> >> >> >> > UDFs.
> >> >> >> >> >
> >> >> >> >> > > Splitting rules? Traits?
> >> >> >> >> > Please see my reply above.
> >> >> >> >> > >
> >> >> >> >> > >
> >> >> >> >> > > With all of this, you can go ahead with an IEP (or even a short
> >> >> >> >> > > summary), followed by a POC and implementation.
> >> >> >> >> > > That's a big deal, so let's discuss what could be done here.
> >> >> >> >> > >
> >> >> >> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma
> >> >> >> >> > > <atri@apache.org
> >> >
> >> >> >> wrote:
> >> >> >> >> > >
> >> >> >> >> > > > I am actually happy to drive the feature for Ignite 3.
> FTS
> >> is
> >> >> >> >> > > > very
> >> >> >> >> > > > important for me and I think Ignite users will benefit
> >> >> >> >> > > > from
> >> >> >> >> > > > it
> >> >> >> >> > > > greatly.
> >> >> >> >> > > >
> >> >> >> >> > > > If it makes sense to be focusing on Ignite 3 for this
> >> >> >> >> > > > capability,
> >> >> >> I
> >> >> >> >> > > > am
> >> >> >> >> > > > eager to contribute there and lead the development.
> >> >> >> >> > > >
> >> >> >> >> > > > Please share your thoughts.
> >> >> >> >> > > >
> >> >> >> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> >> >> >> >> > > > <an...@gmail.com> wrote:
> >> >> >> >> > > > >
> >> >> >> >> > > > > Hi Atri,
> >> >> >> >> > > > >
> >> >> >> >> > > > > All the Jira tickets we have on the Full-text search
> >> >> >> >> > > > > (FTS)
> >> >> >> >> > > > > thing
> >> >> >> >> > > > > are
> >> >> >> >> > > > > targeted to Ignite 2.
> >> >> >> >> > > > >
> >> >> >> >> > > > > AFAIK, we want, but we have NOT committed to FTS
> support
> >> in
> >> >> >> Ignite
> >> >> >> >> > > > > 3,
> >> >> >> >> > > > yet.
> >> >> >> >> > > > > By the way, we are getting requests for this thing from
> >> the
> >> >> >> >> > > > > user
> >> >> >> >> > side,
> >> >> >> >> > > > and
> >> >> >> >> > > > > definitely,
> >> >> >> >> > > > > FTS would be a valuable feature for Ignite.
> >> >> >> >> > > > >
> >> >> >> >> > > > > It will be great if the one wants to drive it, any help
> >> >> >> >> > > > > will
> >> >> >> >> > > > > be
> >> >> >> >> > > > appreciated.
> >> >> >> >> > > > >
> >> >> >> >> > > > >
> >> >> >> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma
> >> >> >> >> > > > > <at...@apache.org>
> >> >> >> >> > wrote:
> >> >> >> >> > > > >
> >> >> >> >> > > > > > Hello,
> >> >> >> >> > > > > >
> >> >> >> >> > > > > > An update, please. I am working through persistence
> of
> >> >> >> >> > > > > > Lucene
> >> >> >> >> > > > > > index
> >> >> >> >> > > > using
> >> >> >> >> > > > > > Ignite Dictionary, and will be asking some questions
> >> >> >> >> > > > > > soon.
> >> >> >> >> > > > > >
> >> >> >> >> > > > > > I had one doubt - - where does this change go? Ignite
> >> >> >> >> > > > > > 3?
> >> >> >> >> > > > > >
> >> >> >> >> > > > > > Also, I know we want to build native support for text
> >> >> >> >> > > > > > searches
> >> >> >> >> > > > > > in
> >> >> >> >> > > > Ignite 3.
> >> >> >> >> > > > > > Is the work I am proposing here part of that, or will
> >> >> >> >> > > > > > that
> >> >> >> >> > > > > > be
> >> >> >> a
> >> >> >> >> > > > separate
> >> >> >> >> > > > > > effort?
> >> >> >> >> > > > > >
> >> >> >> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> >> >> >> >> > ilya.kasnacheev@gmail.com
> >> >> >> >> > > > >
> >> >> >> >> > > > > > wrote:
> >> >> >> >> > > > > >
> >> >> >> >> > > > > > > Hello!
> >> >> >> >> > > > > > >
> >> >> >> >> > > > > > > I think that number one is the most important one,
> >> then
> >> >> >> maybe
> >> >> >> >> > > > > > > it
> >> >> >> >> > > > will see
> >> >> >> >> > > > > > > more use and other deficiencies become more
> >> >> >> >> > > > > > > apparent,
> >> >> >> leading
> >> >> >> >> > > > > > > to
> >> >> >> >> > more
> >> >> >> >> > > > > > > tickets and visibility.
> >> >> >> >> > > > > > >
> >> >> >> >> > > > > > > Maybe 2. and 3. will even use a different approach
> >> when
> >> >> >> >> > persistence
> >> >> >> >> > > > is
> >> >> >> >> > > > > > > implemented.
> >> >> >> >> > > > > > >
> >> >> >> >> > > > > > > Regards,
> >> >> >> >> > > > > > > --
> >> >> >> >> > > > > > > Ilya Kasnacheev
> >> >> >> >> > > > > > >
> >> >> >> >> > > > > > >
> >> >> >> >> > > > > > > пн, 28 июн. 2021 г. в 14:34, Atri Sharma
> >> >> >> >> > > > > > > <at...@apache.org>:
> >> >> >> >> > > > > > >
> >> >> >> >> > > > > > > > Hello Again!
> >> >> >> >> > > > > > > >
> >> >> >> >> > > > > > > > I have been looking into the aforementioned and
> >> >> >> >> > > > > > > > here
> >> >> >> >> > > > > > > > are
> >> >> >> my
> >> >> >> >> > follow
> >> >> >> >> > > > up
> >> >> >> >> > > > > > > > thoughts:
> >> >> >> >> > > > > > > >
> >> >> >> >> > > > > > > > 1. Support persistence of Lucene indexes.
> >> >> >> >> > > > > > > > 2.
> >> https://issues.apache.org/jira/browse/IGNITE-12401
> >> >> >> >> > > > > > > > (Needs
> >> >> >> >> > > > fixing of
> >> >> >> >> > > > > > > > moving partitions first)
> >> >> >> >> > > > > > > > 3. Figure out how to return scores from nodes and
> >> use
> >> >> >> >> > > > > > > > them
> >> >> >> >> > > > > > > > as
> >> >> >> >> > sort
> >> >> >> >> > > > > > > > parameters on the coordinator node
> >> >> >> >> > > > > > > > (
> https://issues.apache.org/jira/browse/IGNITE-12291
> >> )
> >> >> >> >> > > > > > > >
> >> >> >> >> > > > > > > > Please let me know if this looks ok to make text
> >> >> >> >> > > > > > > > queries
> >> >> >> >> > > > functional?
> >> >> >> >> > > > > > > >
> >> >> >> >> > > > > > > > Atri
> >> >> >> >> > > > > > > >
> >> >> >> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> >> >> >> >> > > > > > > > <al...@gmail.com> wrote:
> >> >> >> >> > > > > > > > >
> >> >> >> >> > > > > > > > > Hi.
> >> >> >> >> > > > > > > > >
> >> >> >> >> > > > > > > > > One of the biggest issues with text queries is
> a
> >> >> >> >> > > > > > > > > lack
> >> >> >> >> > > > > > > > > of
> >> >> >> >> > support
> >> >> >> >> > > > for
> >> >> >> >> > > > > > > > lucene
> >> >> >> >> > > > > > > > > indices persistence, which makes this
> >> functionality
> >> >> >> >> > > > > > > > > useless
> >> >> >> >> > if a
> >> >> >> >> > > > > > > > > persistence is enabled.
> >> >> >> >> > > > > > > > >
> >> >> >> >> > > > > > > > > I would first take care of it.
> >> >> >> >> > > > > > > > >
> >> >> >> >> > > > > > > > > пн, 21 июн. 2021 г. в 12:16, Maksim Timonin <
> >> >> >> >> > > > timonin.maxim@gmail.com
> >> >> >> >> > > > > > >:
> >> >> >> >> > > > > > > > >
> >> >> >> >> > > > > > > > > > Hi, Atri!
> >> >> >> >> > > > > > > > > >
> >> >> >> >> > > > > > > > > > You're right, Actually there is a lack of
> >> support
> >> >> >> >> > > > > > > > > > for
> >> >> >> >> > > > TextQueries.
> >> >> >> >> > > > > > > For
> >> >> >> >> > > > > > > > the
> >> >> >> >> > > > > > > > > > last ticket I'm doing I see some obvious
> >> >> >> >> > > > > > > > > > issues
> >> >> >> >> > > > > > > > > > with
> >> >> >> >> > > > > > > > > > them
> >> >> >> >> > (no
> >> >> >> >> > > > page
> >> >> >> >> > > > > > > size
> >> >> >> >> > > > > > > > > > support, for example). I'm glad that somebody
> >> >> >> >> > > > > > > > > > wants
> >> >> >> >> > > > > > > > > > to
> >> >> >> >> > maintain
> >> >> >> >> > > > > > this
> >> >> >> >> > > > > > > > > > functionality. Thanks a lot!
> >> >> >> >> > > > > > > > > >
> >> >> >> >> > > > > > > > > > For the MergeSort algorithm there is already
> a
> >> >> >> >> > > > > > > > > > patch
> >> >> >> >> > > > > > > > > > for
> >> >> >> >> > that
> >> >> >> >> > > > [1].
> >> >> >> >> > > > > > > It's
> >> >> >> >> > > > > > > > > > currently on review. This patch introduces an
> >> >> >> >> > > > > > > > > > abstract
> >> >> >> >> > reducer
> >> >> >> >> > > > for
> >> >> >> >> > > > > > > > > > CacheQueries with 2 implementations
> >> >> >> >> > > > > > > > > > (unordered,
> >> >> >> >> > merge-sort).
> >> >> >> >> > > > Then
> >> >> >> >> > > > > > > > TextQuery
> >> >> >> >> > > > > > > > > > leverages on MergeSort to order results from
> >> >> >> >> > > > > > > > > > multiple
> >> >> >> >> > nodes by
> >> >> >> >> > > > > > score.
> >> >> >> >> > > > > > > > This
> >> >> >> >> > > > > > > > > > patch also fixes the pageSize issue, I've
> >> >> >> >> > > > > > > > > > mentioned
> >> >> >> >> > > > > > > > > > before.
> >> >> >> >> > > > Could
> >> >> >> >> > > > > > you
> >> >> >> >> > > > > > > > > > please check if it fully matches your idea?
> >> >> >> >> > > > > > > > > > Any
> >> >> >> >> > > > > > > > > > issues
> >> >> >> >> > > > > > > > > > or
> >> >> >> >> > > > comments
> >> >> >> >> > > > > > > are
> >> >> >> >> > > > > > > > > > welcome.
> >> >> >> >> > > > > > > > > >
> >> >> >> >> > > > > > > > > > I've prepared this ticket, because I need the
> >> >> >> MergeSort
> >> >> >> >> > > > algorithm
> >> >> >> >> > > > > > for
> >> >> >> >> > > > > > > > the
> >> >> >> >> > > > > > > > > > new type of queries I'm implementing
> >> (IndexQuery,
> >> >> >> >> > > > > > > > > > it
> >> >> >> >> > > > > > > > > > should
> >> >> >> >> > > > also
> >> >> >> >> > > > > > > > provide
> >> >> >> >> > > > > > > > > > ordered results over multiple nodes).
> >> >> >> >> > > > > > > > > > Currently
> >> >> >> >> > > > > > > > > > I'm
> >> >> >> not
> >> >> >> >> > > > planning to
> >> >> >> >> > > > > > > go
> >> >> >> >> > > > > > > > > > further with TextQuery, so if you're going to
> >> >> >> >> > > > > > > > > > support
> >> >> >> >> > > > > > > > > > this
> >> >> >> >> > > > it'll
> >> >> >> >> > > > > > be a
> >> >> >> >> > > > > > > > great
> >> >> >> >> > > > > > > > > > contribution, I think.
> >> >> >> >> > > > > > > > > >
> >> >> >> >> > > > > > > > > > [1]
> >> >> >> https://issues.apache.org/jira/browse/IGNITE-14703
> >> >> >> >> > > > > > > > > > [2]
> https://github.com/apache/ignite/pull/9081
> >> >> >> >> > > > > > > > > >
> >> >> >> >> > > > > > > > > >
> >> >> >> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma
> <
> >> >> >> >> > atri@apache.org>
> >> >> >> >> > > > > > > wrote:
> >> >> >> >> > > > > > > > > >
> >> >> >> >> > > > > > > > > > > Hi All,
> >> >> >> >> > > > > > > > > > >
> >> >> >> >> > > > > > > > > > > I have been looking into our text queries
> >> >> >> >> > > > > > > > > > > support
> >> >> >> and
> >> >> >> >> > > > > > > > > > > see
> >> >> >> >> > > > that it
> >> >> >> >> > > > > > > has
> >> >> >> >> > > > > > > > > > > limited community support.
> >> >> >> >> > > > > > > > > > >
> >> >> >> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer
> >> >> >> >> > > > > > > > > > > of
> >> >> >> >> > > > > > > > > > > the
> >> >> >> >> > module and
> >> >> >> >> > > > > > work
> >> >> >> >> > > > > > > on
> >> >> >> >> > > > > > > > > > > enhancing it further.
> >> >> >> >> > > > > > > > > > >
> >> >> >> >> > > > > > > > > > > First goal would be to move to Lucene 8.x,
> >> then
> >> >> >> >> > > > > > > > > > > work
> >> >> >> >> > > > > > > > > > > on
> >> >> >> >> > > > sorted
> >> >> >> >> > > > > > > reduce
> >> >> >> >> > > > > > > > > > > - merge across nodes. Fundamentally, this
> is
> >> >> >> >> > > > > > > > > > > doable
> >> >> >> >> > > > > > > > > > > since
> >> >> >> >> > > > Lucene
> >> >> >> >> > > > > > > > ranks
> >> >> >> >> > > > > > > > > > > documents according to their score, and
> >> >> >> >> > > > > > > > > > > documents
> >> >> >> are
> >> >> >> >> > > > returned in
> >> >> >> >> > > > > > > the
> >> >> >> >> > > > > > > > > > > order of their score. Since the scoring
> >> >> >> >> > > > > > > > > > > function
> >> >> >> >> > > > > > > > > > > is
> >> >> >> >> > > > homogeneous,
> >> >> >> >> > > > > > > this
> >> >> >> >> > > > > > > > > > > means that across nodes, we can compare
> >> >> >> >> > > > > > > > > > > scores
> >> >> >> >> > > > > > > > > > > and
> >> >> >> >> > > > > > > > > > > merge
> >> >> >> >> > > > sort.
> >> >> >> >> > > > > > > > > > >
> >> >> >> >> > > > > > > > > > > Please let me know if I can take this up.
> >> >> >> >> > > > > > > > > > >
> >> >> >> >> > > > > > > > > > > Atri
> >> >> >> >> > > > > > > > > > >
> >> >> >> >> > > > > > > > > > > --
> >> >> >> >> > > > > > > > > > > Regards,
> >> >> >> >> > > > > > > > > > >
> >> >> >> >> > > > > > > > > > > Atri
> >> >> >> >> > > > > > > > > > > Apache Concerted
> >> >> >> >> > > > > > > > > > >
> >> >> >> >> > > > > > > > > >
> >> >> >> >> > > > > > > > >
> >> >> >> >> > > > > > > > >
> >> >> >> >> > > > > > > > > --
> >> >> >> >> > > > > > > > >
> >> >> >> >> > > > > > > > > Best regards,
> >> >> >> >> > > > > > > > > Alexei Scherbakov
> >> >> >> >> > > > > > > >
> >> >> >> >> > > > > > > > --
> >> >> >> >> > > > > > > > Regards,
> >> >> >> >> > > > > > > >
> >> >> >> >> > > > > > > > Atri
> >> >> >> >> > > > > > > > Apache Concerted
> >> >> >> >> > > > > > > >
> >> >> >> >> > > > > > >
> >> >> >> >> > > > > >
> >> >> >> >> > > > >
> >> >> >> >> > > > >
> >> >> >> >> > > > > --
> >> >> >> >> > > > > Best regards,
> >> >> >> >> > > > > Andrey V. Mashenkov
> >> >> >> >> > > >
> >> >> >> >> > > > --
> >> >> >> >> > > > Regards,
> >> >> >> >> > > >
> >> >> >> >> > > > Atri
> >> >> >> >> > > > Apache Concerted
> >> >> >> >> > > >
> >> >> >> >> > >
> >> >> >> >> > >
> >> >> >> >> > > --
> >> >> >> >> > > Best regards,
> >> >> >> >> > > Andrey V. Mashenkov
> >> >> >> >> >
> >> >> >> >> > --
> >> >> >> >> > Regards,
> >> >> >> >> >
> >> >> >> >> > Atri
> >> >> >> >> > Apache Concerted
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Best regards,
> >> >> >> >> Andrey V. Mashenkov
> >> >> >> >
> >> >> >> > --
> >> >> >> > Regards,
> >> >> >> >
> >> >> >> > Atri
> >> >> >> > Apache Concerted
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >>
> >> >> >> Best regards,
> >> >> >> Ivan Pavlukhin
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >>
> >> >> Best regards,
> >> >> Ivan Pavlukhin
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Atri
> >> > Apache Concerted
> >> >
> >>
> >>
> >> --
> >>
> >> Best regards,
> >> Ivan Pavlukhin
> >>
> >
>
>
> --
>
> Best regards,
> Ivan Pavlukhin
>
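
The score-ordered reduce discussed in the quoted thread above can be sketched roughly as follows. This is only an illustration, not the code from IGNITE-14703 or PR 9081: the class names ScoredHit and ScoreMergeSketch are hypothetical, each node is assumed to return its hits already sorted by descending score, and scores are assumed to be comparable across nodes, as the thread argues. Duplicate keys (for example a hit that also arrives from a backup copy) are dropped on the reducer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical holder for one hit returned by a node: a unique key plus its score. */
final class ScoredHit {
    final String key;
    final float score;

    ScoredHit(String key, float score) {
        this.key = key;
        this.score = score;
    }
}

/** Sketch of a reduce step: k-way merge of per-node lists sorted by descending score. */
final class ScoreMergeSketch {
    /** Cursor over one node's result list. */
    private static final class Cursor {
        final List<ScoredHit> hits;
        int pos;

        Cursor(List<ScoredHit> hits) {
            this.hits = hits;
        }

        ScoredHit head() {
            return hits.get(pos);
        }
    }

    static List<ScoredHit> merge(List<List<ScoredHit>> perNode, int limit) {
        // Max-heap ordered by the score of each cursor's current head.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Float.compare(b.head().score, a.head().score));

        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        List<ScoredHit> out = new ArrayList<>();
        Set<String> seen = new HashSet<>(); // Keys already emitted, used to drop duplicates.

        while (!heap.isEmpty() && out.size() < limit) {
            Cursor c = heap.poll();
            ScoredHit hit = c.head();

            if (seen.add(hit.key))
                out.add(hit);

            // Advance the cursor and re-insert it if the node has more hits.
            if (++c.pos < c.hits.size())
                heap.add(c);
        }

        return out;
    }
}
```

The heap holds one cursor per node, so the reducer only ever compares the current head of each node's list and can stop as soon as the requested limit is reached.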

Re: Text Queries Support

Posted by Ivan Pavlukhin <vo...@gmail.com>.

Hi Maximiliano,

Thank you for pointing this out, it is rather interesting. Have you tried to
contact the Hawkore team? I doubt that anyone in the community
knows the implementation details of the Hawkore additions.

2021-10-22 19:58 GMT+03:00, Maximiliano Gazquez <ma...@gmail.com>:
> Hello everyone!
>
> I wanted to add this to the discussion.
> I've found this project https://github.com/hawkore/ignite-hk which promises
> to solve most of the issues that are being discussed here, such as pagination,
> sorting and, most importantly, persisting the Lucene index.
>
> It does stuff like this to create indexes:
>
> CREATE INDEX PERSON_LUCENE_IDX ON "PUBLIC".PERSON(LUCENE)
> FULLTEXT '{
> ''refresh_seconds'':''60'',
> ''directory_path'':'''',
> ''ram_buffer_mb'':''10'',
> ''max_cached_mb'':''-1'',
> ''partitioner'':''{"type":"token","partitions":10}'',
> ''optimizer_enabled'':''true'',
> ''optimizer_schedule'':''0 1 * * *'',
> ''version'':''0'',
> ''schema'':''{
>     "default_analyzer":"english",
>     "analyzers":{"my_custom_analyzer":{"type":"snowball","language":"Spanish","stopwords":"el,la,lo,loas,las,a,ante,bajo,cabe,con,contra"}},
>     "fields":{
>       "duration":{"type":"date_range","from":"start_date","to":"stop_date","validated":false,"pattern":"yyyy/MM/dd"},
>       "place":{"type":"geo_point","latitude":"latitude","longitude":"longitude"},
>       "date":{"type":"date","validated":true,"pattern":"yyyy/MM/dd"},
>       "number":{"type":"integer","validated":false,"boost":1.0},
>       "gender":{"type":"string","validated":true,"case_sensitive":true},
>       "bool":{"type":"boolean","validated":false},
>       "phrase":{"type":"text","validated":false,"analyzer":"my_custom_analyzer"},
>       "name":{"type":"string","validated":false,"case_sensitive":true},
>       "animal":{"type":"string","validated":false,"case_sensitive":true},
>       "age":{"type":"integer","validated":false,"boost":1.0},
>       "food":{"type":"string","validated":false,"case_sensitive":true}
>     }
>   }''
> }';
>
> And this to use that lucene index from inside SQL:
>
> SELECT * FROM "test".user
> WHERE lucene = '{ query : {
>                     type : "boolean",
>                     must : [{type : "wildcard", field : "name", value : "J*"},
>                             {type : "wildcard", field : "food", value : "tu*"}]}}';
>
> More examples here
> https://github.com/hawkore/examples-apache-ignite-extensions/tree/master/examples-advanced-ignite-indexing
>
> I don't have anything to do with that company but it would be great to know
> how they implemented this stuff.
>
>
> On Mon, Aug 9, 2021 at 3:00 AM Ivan Pavlukhin <vo...@gmail.com> wrote:
>
>> Hi Atri,
>>
>> Sorry for a late answer.
>>
>> > I didn't quite understand. Are you proposing that Ignite should not
>> > have FTS capabilities?
>>
>> That seems like an option to me. IMHO it is better to have no FTS than
>> something like the current Ignite TextQueries.
>>
>> 2021-08-03 12:45 GMT+03:00, Atri Sharma <at...@apache.org>:
>> > Hi Ivan,
>> >
>> > I didn't quite understand. Are you proposing that Ignite should not
>> > have FTS capabilities?
>> >
>> > Atri
>> >
>> > On Tue, Aug 3, 2021 at 2:57 PM Ivan Pavlukhin <vo...@gmail.com>
>> wrote:
>> >>
>> >> Hi Atri,
>> >>
>> >> My main concern is non-maleficence. Every task has several solutions,
>> >> e.g. straightforward ones:
>> >> 1. Do not implement FTS.
>> >> 2. Create own implementation.
>> >>
>> >> Some of the strongest databases live without FTS [1].
>> >>
>> >> [1] https://github.com/cockroachdb/cockroach/issues/7821
>> >>
>> >> 2021-08-02 11:33 GMT+03:00, Atri Sharma <at...@apache.org>:
>> >> > Hi Ivan,
>> >> >
>> >> > Would you like to propose an alternative to Lucene?
>> >> >
>> >> > Atri
>> >> >
>> >> > On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <vo...@gmail.com>
>> wrote:
>> >> >
>> >> >> Folks,
>> >> >>
>> >> >> Sorry if I have not read the thread thoroughly enough, but do we
>> >> >> consider Lucene the obviously right choice? In my understanding, Ignite
>> >> >> history has clearly shown that the "fastest feature implementation" is
>> >> >> usually not the best, and text queries are one example of this. Are we
>> >> >> not about to make the same mistake again? FTS is a huge feature, and I
>> >> >> do not believe there is an easy win for it.
>> >> >>
>> >> >> 2021-07-27 19:18 GMT+03:00, Atri Sharma <at...@apache.org>:
>> >> >> > Andrey,
>> >> >> >
>> >> >> >> Per-partition Lucene index looks simple to implement, but it may
>> >> >> >> require per-partition SQL to make full-text search expressions work
>> >> >> >> correctly within the SQL query.
>> >> >> > I think that as long as we follow the map-reduce process that we
>> >> >> > already do for other queries, we should be fine.
>> >> >> >
>> >> >> >> Per-partition SQL index may kill the performance. We already tried
>> >> >> >> to do that in Ignite 2. The QueryParallelism feature helps to speed
>> >> >> >> up some data-intensive queries, but it hurts performance in simple
>> >> >> >> cases, and at some point (e.g. when the number of segments exceeds
>> >> >> >> the number of CPUs) performance degrades rapidly as the number of
>> >> >> >> segments increases.
>> >> >> >
>> >> >> > Yeah, that is always the case, but a global index will be a
>> >> >> > nightmare
>> >> >> > in terms of concurrency and pessimistic concurrency control will
>> >> >> > anyways kill the benefits, coupled with the metadata
>> >> >> > requirements.
>> >> >> > What were the specific issues with per partition index?
>> >> >> >>
>> >> >> >> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
>> >> >> >> Maybe the map-reduce technique underneath FTS expressions and some
>> >> >> >> hacks will add a minimal overhead.
>> >> >> >
>> >> >> > Lucene uses many types of indices but the aspect here is that per
>> >> >> > partition Lucene indices can return docIDs and we can merge them
>> >> >> > in
>> >> >> > reduce phase. So we are abstracted out from specifics of the
>> >> >> > internal
>> >> >> > index being used to serve the query.
>> >> >> >
>> >> >> >>
>> >> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
>> >> >> >> > Lucene indices. The important thing here is to not treat Lucene
>> >> >> >> > indices as source of truth.
>> >> >> >> To use WAL we should either relay Lucene files to our Page memory or
>> >> >> >> be aware of the Lucene file structure.
>> >> >> >> The first looks tricky, as we should guarantee a contiguous address
>> >> >> >> space in Page memory for reflecting a Lucene file. Maybe a separate
>> >> >> >> managed memory segment with its own rules?
>> >> >> >
>> >> >> > Why not use Lucene's MMapDirectory and map it to our storage
>> >> >> > classes?
>> >> >> >
>> >> >> >>
>> >> >> >> >> Transactions.
>> >> >> >> >> * Will we support transactions?
>> >> >> >> > Lucene has no concept of transactions.
>> >> >> >> Yes, but we have.
>> >> >> >> Lucene index may be non-transactional, but users never expect to see
>> >> >> >> uncommitted data.
>> >> >> >> How does this connect with transactional SQL?
>> >> >> > We could have the Lucene writes done as a part of transactions and
>> >> >> > ack back only when it succeeds/fails. WDYT?
>> >> >> >>
>> >> >> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <at...@apache.org>
>> >> >> >> wrote:
>> >> >> >>
>> >> >> >> > Sorry, I planned on creating a Wiki page for this, but it
>> >> >> >> > makes
>> >> >> >> > more
>> >> >> >> > sense to be replying here.
>> >> >> >> >
>> >> >> >> > > * How Lucene index can be split among the nodes?
>> >> >> >> >
>> >> >> >> > We can have partition level indices on each node.
>> >> >> >> >
>> >> >> >> > > * If we'll have a single index for all partitions on the
>> >> >> >> > > particular
>> >> >> >> > > node,
>> >> >> >> > > then how index records will be aware of partitioning?
>> >> >> >> >
>> >> >> >> > Index records don't need to be aware of partitioning -- each
>> >> >> >> > Lucene index is independent.
>> >> >> >> >
>> >> >> >> > > This is important to filter out backup records from the
>> results
>> >> >> >> > > to
>> >> >> >> > > avoid
>> >> >> >> > > duplicates.
>> >> >> >> >
>> >> >> >> > We can merge documents from different nodes and remove
>> duplicates
>> >> >> >> > as
>> >> >> >> > long as docIDs are globally unique.
>> >> >> >> >
>> >> >> >> > > * How results from several nodes can be merged on the Reduce
>> >> >> >> > > stage?
>> >> >> >> >
>> >> >> >> > As long as documents have a globally unique docID, Lucene has
>> >> >> >> > merge
>> >> >> >> > functions that can merge results from multiple partial
>> >> >> >> > results.
>> >> >> >> >
>> >> >> >> > > * Does Lucene support something like a JOIN operation or others
>> >> >> >> > > that may require data from another partition or index?
>> >> >> >> >
>> >> >> >> > As illustrated by Ilya, Block-Join works for us.
>> >> >> >> >
>> >> >> >> > > If so, then it likes to multistep query with merging results
>> on
>> >> >> >> > > intermediate stages and requires detailed investigation and
>> >> >> >> > > design.
>> >> >> >> > > It is ok if Ignite will have some limitations here, but we
>> >> >> >> > > would
>> >> >> like
>> >> >> >> > > to
>> >> >> >> > > know about them at the early stage.
>> >> >> >> >
>> >> >> >> > > * How effectively map Lucene files to the page memory? Is it
>> >> >> >> > > even
>> >> >> >> > possible?
>> >> >> >> >
>> >> >> >> > Lucene has Directory implementations which allow storing Lucene
>> >> >> >> > indices on different kinds of storage. It has an MMapDirectory
>> >> >> >> > that we could use?
>> >> >> >> >
>> >> >> >> > > Otherwise, how to deal with potential OOM on large queries
>> >> >> >> > > and
>> >> >> memory
>> >> >> >> > > capacity planning?
>> >> >> >> >
>> >> >> >> > We can use Lucene's MMapDirectory.
>> >> >> >> >
>> >> >> >> > >
>> >> >> >> > > Persistence.
>> >> >> >> > > * How and what consistency guarantees could we have/expect?
>> >> >> >> >
>> >> >> >> > Lucene does not have WAL logs but is append only
>> >> >> >> >
>> >> >> >> > > Seems, we may not be able to write physical records for
>> >> >> >> > > Lucene
>> >> >> >> > > index
>> >> >> >> > > to
>> >> >> >> > our
>> >> >> >> > > WAL. What can we do with this?
>> >> >> >> >
>> >> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to
>> >> >> >> > rebuild
>> >> >> >> > Lucene indices. The important thing here is to not treat
>> >> >> >> > Lucene
>> >> >> >> > indices as source of truth.
>> >> >> >> > >
>> >> >> >> > > Transactions.
>> >> >> >> > > * Will we support transactions?
>> >> >> >> > Lucene has no concept of transactions.
>> >> >> >> >
>> >> >> >> > > * Should Lucene be aware of Transaction and track mvcc (or
>> >> >> >> > > whatever)
>> >> >> >> > > versions for the records?
>> >> >> >> > No
>> >> >> >> > > * What will be consistency guarantees?
>> >> >> >> > We can acknowledge writes back only after Lucene index is
>> >> >> >> > updated.
>> >> >> >> > >
>> >> >> >> > > UX
>> >> >> >> > > * How to add FullText search queries syntax into Calcite?
>> >> >> >> > Postgres's FTS functions are a good reference.
>> >> >> >> > > * AFAIK, the Lucene index has many properties for tuning.
>> >> >> >> > > How
>> >> >> >> > > will
>> >> >> >> > > the
>> >> >> >> > user
>> >> >> >> > > configure the index?
>> >> >> >> > Most of those properties can be cluster level and exposed as a
>> >> >> >> > new
>> >> >> >> > sub
>> >> >> >> > config for ignite.
>> >> >> >> > > * How and where to store the settings? What are cluster-wide
>> >> >> >> > > and
>> >> >> what
>> >> >> >> > > a
>> >> >> >> > > local to the particular node?
>> >> >> >> > All can be cluster level.
>> >> >> >> > > * Will be all the settings immutable? Can be they changed
>> >> >> >> > > on-fly?
>> >> >> >> > > after
>> >> >> >> > > node/grid restart?
>> >> >> >> > They should be applied post restart.
>> >> >> >> >
>> >> >> >> > > * Any limitations on query syntax?
>> >> >> >> > It depends on how we model our queries for text search.
>> >> >> >> >
>> >> >> >> > >
>> >> >> >> > > SQL
>> >> >> >> > > * Will we support FullText search in SQL?
>> >> >> >> > We need custom functions for it. See Postgres's FTS functions.
>> >> >> >> > > * How to integrate Lucene index into Calcite? What is the
>> >> >> >> > > cost
>> >> >> model?
>> >> >> >> > There cannot be any cost model since there are no paths for a
>> >> >> >> > text
>> >> >> >> > query. If we see a text query, we have to use Lucene index or
>> >> >> >> > return
>> >> >> >> > an error. In this way, we need to model text search as a set
>> >> >> >> > of
>> >> >> >> > UDFs
>> >> >> >> >
>> >> >> >> > > Splitting rules? Traits?
>> >> >> >> > Please see my reply above.
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > > With all of this, you can go with the IEP (or even some
>> >> >> >> > > short
>> >> >> >> > > summary)
>> >> >> >> > and
>> >> >> >> > > further POC and implementation.
>> >> >> >> > > That's a big deal, so let's discuss what could be done here.
>> >> >> >> > >
>> >> >> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma
>> >> >> >> > > <atri@apache.org
>> >
>> >> >> wrote:
>> >> >> >> > >
>> >> >> >> > > > I am actually happy to drive the feature for Ignite 3. FTS
>> is
>> >> >> >> > > > very
>> >> >> >> > > > important for me and I think Ignite users will benefit
>> >> >> >> > > > from
>> >> >> >> > > > it
>> >> >> >> > > > greatly.
>> >> >> >> > > >
>> >> >> >> > > > If it makes sense to be focusing on Ignite 3 for this
>> >> >> >> > > > capability,
>> >> >> I
>> >> >> >> > > > am
>> >> >> >> > > > eager to contribute there and lead the development.
>> >> >> >> > > >
>> >> >> >> > > > Please share your thoughts.
>> >> >> >> > > >
>> >> >> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
>> >> >> >> > > > <an...@gmail.com> wrote:
>> >> >> >> > > > >
>> >> >> >> > > > > Hi Atri,
>> >> >> >> > > > >
>> >> >> >> > > > > All the Jira tickets we have on the Full-text search
>> >> >> >> > > > > (FTS)
>> >> >> >> > > > > thing
>> >> >> >> > > > > are
>> >> >> >> > > > > targeted to Ignite 2.
>> >> >> >> > > > >
>> >> >> >> > > > > AFAIK, we want, but we have NOT committed to FTS support
>> in
>> >> >> Ignite
>> >> >> >> > > > > 3,
>> >> >> >> > > > yet.
>> >> >> >> > > > > By the way, we are getting requests for this thing from
>> the
>> >> >> >> > > > > user
>> >> >> >> > side,
>> >> >> >> > > > and
>> >> >> >> > > > > definitely,
>> >> >> >> > > > > FTS would be a valuable feature for Ignite.
>> >> >> >> > > > >
>> >> >> >> > > > > It will be great if the one wants to drive it, any help
>> >> >> >> > > > > will
>> >> >> >> > > > > be
>> >> >> >> > > > appreciated.
>> >> >> >> > > > >
>> >> >> >> > > > >
>> >> >> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma
>> >> >> >> > > > > <at...@apache.org>
>> >> >> >> > wrote:
>> >> >> >> > > > >
>> >> >> >> > > > > > Hello,
>> >> >> >> > > > > >
>> >> >> >> > > > > > An update, please. I am working through persistence of
>> >> >> >> > > > > > Lucene
>> >> >> >> > > > > > index
>> >> >> >> > > > using
>> >> >> >> > > > > > Ignite Dictionary, and will be asking some questions
>> >> >> >> > > > > > soon.
>> >> >> >> > > > > >
>> >> >> >> > > > > > I had one doubt - - where does this change go? Ignite
>> >> >> >> > > > > > 3?
>> >> >> >> > > > > >
>> >> >> >> > > > > > Also, I know we want to build native support for text
>> >> >> >> > > > > > searches
>> >> >> >> > > > > > in
>> >> >> >> > > > Ignite 3.
>> >> >> >> > > > > > Is the work I am proposing here part of that, or will
>> >> >> >> > > > > > that
>> >> >> >> > > > > > be
>> >> >> a
>> >> >> >> > > > separate
>> >> >> >> > > > > > effort?
>> >> >> >> > > > > >
>> >> >> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
>> >> >> >> > ilya.kasnacheev@gmail.com
>> >> >> >> > > > >
>> >> >> >> > > > > > wrote:
>> >> >> >> > > > > >
>> >> >> >> > > > > > > Hello!
>> >> >> >> > > > > > >
>> >> >> >> > > > > > > I think that number one is the most important one,
>> then
>> >> >> maybe
>> >> >> >> > > > > > > it
>> >> >> >> > > > will see
>> >> >> >> > > > > > > more use and other deficiencies become more
>> >> >> >> > > > > > > apparent,
>> >> >> leading
>> >> >> >> > > > > > > to
>> >> >> >> > more
>> >> >> >> > > > > > > tickets and visibility.
>> >> >> >> > > > > > >
>> >> >> >> > > > > > > Maybe 2. and 3. will even use a different approach
>> when
>> >> >> >> > persistence
>> >> >> >> > > > is
>> >> >> >> > > > > > > implemented.
>> >> >> >> > > > > > >
>> >> >> >> > > > > > > Regards,
>> >> >> >> > > > > > > --
>> >> >> >> > > > > > > Ilya Kasnacheev
>> >> >> >> > > > > > >
>> >> >> >> > > > > > >
>> >> >> >> > > > > > > пн, 28 июн. 2021 г. в 14:34, Atri Sharma
>> >> >> >> > > > > > > <at...@apache.org>:
>> >> >> >> > > > > > >
>> >> >> >> > > > > > > > Hello Again!
>> >> >> >> > > > > > > >
>> >> >> >> > > > > > > > I have been looking into the aforementioned and
>> >> >> >> > > > > > > > here
>> >> >> >> > > > > > > > are
>> >> >> my
>> >> >> >> > follow
>> >> >> >> > > > up
>> >> >> >> > > > > > > > thoughts:
>> >> >> >> > > > > > > >
>> >> >> >> > > > > > > > 1. Support persistence of Lucene indexes.
>> >> >> >> > > > > > > > 2.
>> https://issues.apache.org/jira/browse/IGNITE-12401
>> >> >> >> > > > > > > > (Needs
>> >> >> >> > > > fixing of
>> >> >> >> > > > > > > > moving partitions first)
>> >> >> >> > > > > > > > 3. Figure out how to return scores from nodes and
>> use
>> >> >> >> > > > > > > > them
>> >> >> >> > > > > > > > as
>> >> >> >> > sort
>> >> >> >> > > > > > > > parameters on the coordinator node
>> >> >> >> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291
>> )
>> >> >> >> > > > > > > >
>> >> >> >> > > > > > > > Please let me know if this looks ok to make text
>> >> >> >> > > > > > > > queries
>> >> >> >> > > > functional?
>> >> >> >> > > > > > > >
>> >> >> >> > > > > > > > Atri
>> >> >> >> > > > > > > >
>> >> >> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
>> >> >> >> > > > > > > > <al...@gmail.com> wrote:
>> >> >> >> > > > > > > > >
>> >> >> >> > > > > > > > > Hi.
>> >> >> >> > > > > > > > >
>> >> >> >> > > > > > > > > One of the biggest issues with text queries is a lack of support for
>> >> >> >> > > > > > > > > Lucene index persistence, which makes this functionality useless if
>> >> >> >> > > > > > > > > persistence is enabled.
>> >> >> >> > > > > > > > >
>> >> >> >> > > > > > > > > I would first take care of it.
>> >> >> >> > > > > > > > >
>> >> >> >> > > > > > > > > On Mon, 21 Jun 2021 at 12:16, Maksim Timonin <timonin.maxim@gmail.com> wrote:
>> >> >> >> > > > > > > > >
>> >> >> >> > > > > > > > > > Hi, Atri!
>> >> >> >> > > > > > > > > >
>> >> >> >> > > > > > > > > > You're right, there is actually a lack of support for TextQueries. In the
>> >> >> >> > > > > > > > > > last ticket I'm doing I see some obvious issues with them (no page size
>> >> >> >> > > > > > > > > > support, for example). I'm glad that somebody wants to maintain this
>> >> >> >> > > > > > > > > > functionality. Thanks a lot!
>> >> >> >> > > > > > > > > >
>> >> >> >> > > > > > > > > > For the MergeSort algorithm there is already a patch for that [1]. It's
>> >> >> >> > > > > > > > > > currently on review. This patch introduces an abstract reducer for
>> >> >> >> > > > > > > > > > CacheQueries with 2 implementations (unordered, merge-sort). Then TextQuery
>> >> >> >> > > > > > > > > > leverages MergeSort to order results from multiple nodes by score. This
>> >> >> >> > > > > > > > > > patch also fixes the pageSize issue I mentioned before. Could you
>> >> >> >> > > > > > > > > > please check if it fully matches your idea? Any issues or comments are
>> >> >> >> > > > > > > > > > welcome.
>> >> >> >> > > > > > > > > >
>> >> >> >> > > > > > > > > > I've prepared this ticket because I need the MergeSort algorithm for the
>> >> >> >> > > > > > > > > > new type of queries I'm implementing (IndexQuery, it should also provide
>> >> >> >> > > > > > > > > > ordered results over multiple nodes). Currently I'm not planning to go
>> >> >> >> > > > > > > > > > further with TextQuery, so if you're going to support this it'll be a great
>> >> >> >> > > > > > > > > > contribution, I think.
>> >> >> >> > > > > > > > > >
>> >> >> >> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
>> >> >> >> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
>> >> >> >> > > > > > > > > >
>> >> >> >> > > > > > > > > >
>> >> >> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <atri@apache.org> wrote:
>> >> >> >> > > > > > > > > >
>> >> >> >> > > > > > > > > > > Hi All,
>> >> >> >> > > > > > > > > > >
>> >> >> >> > > > > > > > > > > I have been looking into our text queries support and see that it has
>> >> >> >> > > > > > > > > > > limited community support.
>> >> >> >> > > > > > > > > > >
>> >> >> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the module and work on
>> >> >> >> > > > > > > > > > > enhancing it further.
>> >> >> >> > > > > > > > > > >
>> >> >> >> > > > > > > > > > > First goal would be to move to Lucene 8.x, then work on sorted reduce
>> >> >> >> > > > > > > > > > > - merge across nodes. Fundamentally, this is doable since Lucene ranks
>> >> >> >> > > > > > > > > > > documents according to their score, and documents are returned in the
>> >> >> >> > > > > > > > > > > order of their score. Since the scoring function is homogeneous, this
>> >> >> >> > > > > > > > > > > means that across nodes, we can compare scores and merge sort.
>> >> >> >> > > > > > > > > > >
>> >> >> >> > > > > > > > > > > Please let me know if I can take this up.
>> >> >> >> > > > > > > > > > >
>> >> >> >> > > > > > > > > > > Atri
>> >> >> >> > > > > > > > > > >
>> >> >> >> > > > > > > > > > > --
>> >> >> >> > > > > > > > > > > Regards,
>> >> >> >> > > > > > > > > > >
>> >> >> >> > > > > > > > > > > Atri
>> >> >> >> > > > > > > > > > > Apache Concerted
>> >> >> >> > > > > > > > > > >
>> >> >> >> > > > > > > > > >
>> >> >> >> > > > > > > > >
>> >> >> >> > > > > > > > >
>> >> >> >> > > > > > > > > --
>> >> >> >> > > > > > > > >
>> >> >> >> > > > > > > > > Best regards,
>> >> >> >> > > > > > > > > Alexei Scherbakov
>> >> >> >> > > > > > > >
>> >> >> >> > > > > > > > --
>> >> >> >> > > > > > > > Regards,
>> >> >> >> > > > > > > >
>> >> >> >> > > > > > > > Atri
>> >> >> >> > > > > > > > Apache Concerted
>> >> >> >> > > > > > > >
>> >> >> >> > > > > > >
>> >> >> >> > > > > >
>> >> >> >> > > > >
>> >> >> >> > > > >
>> >> >> >> > > > > --
>> >> >> >> > > > > Best regards,
>> >> >> >> > > > > Andrey V. Mashenkov
>> >> >> >> > > >
>> >> >> >> > > > --
>> >> >> >> > > > Regards,
>> >> >> >> > > >
>> >> >> >> > > > Atri
>> >> >> >> > > > Apache Concerted
>> >> >> >> > > >
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > > --
>> >> >> >> > > Best regards,
>> >> >> >> > > Andrey V. Mashenkov
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Regards,
>> >> >> >> >
>> >> >> >> > Atri
>> >> >> >> > Apache Concerted
>> >> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Best regards,
>> >> >> >> Andrey V. Mashenkov
>> >> >> >
>> >> >> > --
>> >> >> > Regards,
>> >> >> >
>> >> >> > Atri
>> >> >> > Apache Concerted
>> >> >> >
>> >> >>
>> >> >>
>> >> >> --
>> >> >>
>> >> >> Best regards,
>> >> >> Ivan Pavlukhin
>> >> >>
>> >> >
>> >>
>> >>
>> >> --
>> >>
>> >> Best regards,
>> >> Ivan Pavlukhin
>> >
>> > --
>> > Regards,
>> >
>> > Atri
>> > Apache Concerted
>> >
>>
>>
>> --
>>
>> Best regards,
>> Ivan Pavlukhin
>>
>


-- 

Best regards,
Ivan Pavlukhin

Re: Text Queries Support

Posted by Maximiliano Gazquez <ma...@gmail.com>.

Hello everyone!

I wanted to add this to the discussion.
I've found this project https://github.com/hawkore/ignite-hk which promises
to solve most of the issues that are being discussed here, like pagination,
sorting and, most importantly, persisting the Lucene index.

It does stuff like this to create indexes:

CREATE INDEX PERSON_LUCENE_IDX ON "PUBLIC".PERSON(LUCENE)
FULLTEXT '{
''refresh_seconds'':''60'',
''directory_path'':'''',
''ram_buffer_mb'':''10'',
''max_cached_mb'':''-1'',
''partitioner'':''{"type":"token","partitions":10}'',
''optimizer_enabled'':''true'',
''optimizer_schedule'':''0 1 * * *'',
''version'':''0'',
''schema'':''{
    "default_analyzer":"english",
    "analyzers":{"my_custom_analyzer":{"type":"snowball","language":"Spanish","stopwords":"el,la,lo,loas,las,a,ante,bajo,cabe,con,contra"}},
    "fields":{
      "duration":{"type":"date_range","from":"start_date","to":"stop_date","validated":false,"pattern":"yyyy/MM/dd"},
      "place":{"type":"geo_point","latitude":"latitude","longitude":"longitude"},
      "date":{"type":"date","validated":true,"pattern":"yyyy/MM/dd"},
      "number":{"type":"integer","validated":false,"boost":1.0},
      "gender":{"type":"string","validated":true,"case_sensitive":true},
      "bool":{"type":"boolean","validated":false},
      "phrase":{"type":"text","validated":false,"analyzer":"my_custom_analyzer"},
      "name":{"type":"string","validated":false,"case_sensitive":true},
      "animal":{"type":"string","validated":false,"case_sensitive":true},
      "age":{"type":"integer","validated":false,"boost":1.0},
      "food":{"type":"string","validated":false,"case_sensitive":true}
    }
  }''
}';

And this to use that Lucene index from inside SQL:

SELECT * FROM "test".user
WHERE lucene = '{ query : {
                    type : "boolean",
                    must : [{type : "wildcard", field : "name", value : "J*"},
                            {type : "wildcard", field : "food", value : "tu*"}]}}';

More examples here:
https://github.com/hawkore/examples-apache-ignite-extensions/tree/master/examples-advanced-ignite-indexing

I don't have anything to do with that company but it would be great to know
how they implemented this stuff.
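
For reference on the cross-node sorting issue raised earlier in this thread,
here is a minimal sketch (plain Java, no Ignite or Lucene dependency) of how a
reducer could merge per-node results that are each already ordered by
descending score. ScoredHit, Cursor and merge are hypothetical names made up
for illustration. This is not how ignite-hk or the IGNITE-14703 patch does it,
only the general k-way merge idea, and it assumes scores from different nodes
are comparable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ScoreMergeSketch {
    /** Hypothetical hit: a cache key plus the relevance score reported by a node. */
    static final class ScoredHit {
        final String key;
        final float score;

        ScoredHit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Read position inside one node's result list (assumed sorted by descending score). */
    private static final class Cursor {
        final List<ScoredHit> hits;
        int pos;

        Cursor(List<ScoredHit> hits) {
            this.hits = hits;
        }

        ScoredHit head() {
            return hits.get(pos);
        }
    }

    /** K-way merge: returns up to limit hits ordered by descending score. */
    static List<ScoredHit> merge(List<List<ScoredHit>> perNode, int limit) {
        // Max-heap keyed on the score of each cursor's current head.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            Comparator.comparingDouble((Cursor c) -> c.head().score).reversed());

        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        List<ScoredHit> merged = new ArrayList<>();
        while (!heap.isEmpty() && merged.size() < limit) {
            Cursor c = heap.poll();       // cursor whose head has the best score
            merged.add(c.head());
            if (++c.pos < c.hits.size())  // re-insert so the heap sees the new head
                heap.add(c);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<ScoredHit> node1 = List.of(new ScoredHit("a", 0.9f), new ScoredHit("b", 0.4f));
        List<ScoredHit> node2 = List.of(new ScoredHit("c", 0.7f), new ScoredHit("d", 0.1f));

        for (ScoredHit hit : merge(List.of(node1, node2), 3))
            System.out.println(hit.key + " " + hit.score);
    }
}
```

The heap holds one entry per node, so a page of results needs only the head of
each node's list at any time, which is also what makes page-by-page fetching
possible.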


On Mon, Aug 9, 2021 at 3:00 AM Ivan Pavlukhin <vo...@gmail.com> wrote:

> Hi Atri,
>
> Sorry for a late answer.
>
> > I didn't quite understand. Are you proposing that Ignite should not have
> > FTS capabilities?
>
> It seems an option to me. IMHO it is better to have no FTS instead of
> something like current Ignite TextQueries.
>
> 2021-08-03 12:45 GMT+03:00, Atri Sharma <at...@apache.org>:
> > Hi Ivan,
> >
> > I didn't quite understand. Are you proposing that Ignite should not
> > have FTS capabilities?
> >
> > Atri
> >
> > On Tue, Aug 3, 2021 at 2:57 PM Ivan Pavlukhin <vo...@gmail.com> wrote:
> >>
> >> Hi Atri,
> >>
> >> My main concern is non-maleficence. Every task has several solutions,
> >> e.g. straightforward ones:
> >> 1. Do not implement FTS.
> >> 2. Create own implementation.
> >>
> >> Some of the strongest ones live without FTS [1].
> >>
> >> [1] https://github.com/cockroachdb/cockroach/issues/7821
> >>
> >> 2021-08-02 11:33 GMT+03:00, Atri Sharma <at...@apache.org>:
> >> > Hi Ivan,
> >> >
> >> > Would you like to propose an alternative to Lucene?
> >> >
> >> > Atri
> >> >
> >> > On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <vo...@gmail.com> wrote:
> >> >
> >> >> Folks,
> >> >>
> >> >> Sorry if I have not read the thread thoroughly enough, but do we consider
> >> >> Lucene the obviously right choice? In my understanding, Ignite history
> >> >> has shown clearly that the "fastest feature implementation" is not usually
> >> >> the best, and text queries are one example of this. Are we not trying
> >> >> to make the same mistake again? FTS is a huge feature; I do not believe
> >> >> there is an easy win for it.
> >> >>
> >> >> 2021-07-27 19:18 GMT+03:00, Atri Sharma <at...@apache.org>:
> >> >> > Andrey,
> >> >> >
> >> >> >> Per-partition Lucene index looks simple to implement, but it may require
> >> >> >> per-partition SQL to make full-text search expressions work correctly
> >> >> >> within the SQL query.
> >> >> > I think that as long as we follow the map - reduce process that we
> >> >> > already do for other queries, we should be fine.
> >> >> >
> >> >> >> Per-partition SQL index may kill the performance. We already tried to do
> >> >> >> that in Ignite 2. The QueryParallelism feature helps to speed up some
> >> >> >> data-intensive queries, but it hurts performance in simple cases, and at
> >> >> >> some point (e.g. when the number of segments exceeds the number of CPUs)
> >> >> >> performance degrades rapidly as the number of segments increases.
> >> >> >
> >> >> > Yeah, that is always the case, but a global index will be a nightmare
> >> >> > in terms of concurrency, and pessimistic concurrency control will
> >> >> > kill the benefits anyway, coupled with the metadata requirements.
> >> >> > What were the specific issues with the per-partition index?
> >> >> >>
> >> >> >> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
> >> >> >> Maybe the map-reduce technique underneath FTS expressions and some hacks
> >> >> >> will add a minimal overhead.
> >> >> >
> >> >> > Lucene uses many types of indices, but the point here is that per-partition
> >> >> > Lucene indices can return docIDs and we can merge them in the
> >> >> > reduce phase. So we are abstracted away from the specifics of the internal
> >> >> > index being used to serve the query.
> >> >> >
> >> >> >>
> >> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> >> >> > Lucene indices. The important thing here is to not treat Lucene
> >> >> >> > indices as source of truth.
> >> >> >> To use WAL we either should relay Lucene files to our Page memory or be
> >> >> >> aware of Lucene files structure.
> >> >> >> The first looks tricky, as we should guarantee a contiguous address space
> >> >> >> in Page memory for reflecting Lucene file. Maybe separate managed memory
> >> >> >> segment with its own rules?
> >> >> >
> >> >> > Why not use Lucene's MMapDirectory and map it to our storage classes?
> >> >> >
> >> >> >>
> >> >> >> >> Transactions.
> >> >> >> >> * Will we support transactions?
> >> >> >> > Lucene has no concept of transactions.
> >> >> >> Yes, but we have.
> >> >> >> Lucene index may be non-transactional, but users never expect to see
> >> >> >> uncommitted data.
> >> >> >> How does this connect with transactional SQL?
> >> >> > We could have the Lucene writes done as a part of transactions and ack
> >> >> > back only when it succeeds/fails. WDYT?
> >> >> >>
> >> >> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <at...@apache.org> wrote:
> >> >> >>
> >> >> >> > Sorry, I planned on creating a Wiki page for this, but it makes more
> >> >> >> > sense to be replying here.
> >> >> >> >
> >> >> >> > > * How Lucene index can be split among the nodes?
> >> >> >> >
> >> >> >> > We can have partition level indices on each node.
> >> >> >> >
> >> >> >> > > * If we'll have a single index for all partitions on the particular node,
> >> >> >> > > then how will index records be aware of partitioning?
> >> >> >> >
> >> >> >> > Index records don't need to be aware of partitioning -- each Lucene
> >> >> >> > index is independent.
> >> >> >> >
> >> >> >> > > This is important to filter out backup records from the results to avoid
> >> >> >> > > duplicates.
> >> >> >> >
> >> >> >> > We can merge documents from different nodes and remove duplicates as
> >> >> >> > long as docIDs are globally unique.
> >> >> >> >
> >> >> >> > > * How can results from several nodes be merged on the Reduce stage?
> >> >> >> >
> >> >> >> > As long as documents have a globally unique docID, Lucene has merge
> >> >> >> > functions that can merge results from multiple partial results.
> >> >> >> >
> >> >> >> > > * Does Lucene support something like a JOIN operation or others that may
> >> >> >> > > require data from another partition or index?
> >> >> >> >
> >> >> >> > As illustrated by Ilya, Block-Join works for us.
> >> >> >> >
> >> >> >> > > If so, then it looks like a multistep query with merging results on
> >> >> >> > > intermediate stages and requires detailed investigation and design.
> >> >> >> > > It is ok if Ignite will have some limitations here, but we would like to
> >> >> >> > > know about them at the early stage.
> >> >> >> >
> >> >> >> > > * How to effectively map Lucene files to the page memory? Is it even
> >> >> >> > > possible?
> >> >> >> >
> >> >> >> > Lucene has Directory implementations which allow storing Lucene
> >> >> >> > indices on different kinds of file structures. It has an
> >> >> >> > MMapDirectory that we could use?
> >> >> >> >
> >> >> >> > > Otherwise, how to deal with potential OOM on large queries and memory
> >> >> >> > > capacity planning?
> >> >> >> >
> >> >> >> > We can use Lucene's MMapDirectory.
> >> >> >> > >
> >> >> >> > > Persistence.
> >> >> >> > > * How and what consistency guarantees could we have/expect?
> >> >> >> >
> >> >> >> > Lucene does not have WAL logs but is append only
> >> >> >> >
> >> >> >> > > Seems, we may not be able to write physical records for Lucene index to
> >> >> >> > > our WAL. What can we do with this?
> >> >> >> >
> >> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> >> >> > Lucene indices. The important thing here is to not treat Lucene
> >> >> >> > indices as source of truth.
> >> >> >> > >
> >> >> >> > > Transactions.
> >> >> >> > > * Will we support transactions?
> >> >> >> > Lucene has no concept of transactions.
> >> >> >> >
> >> >> >> > > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> >> >> >> > > versions for the records?
> >> >> >> > No
> >> >> >> > > * What will be consistency guarantees?
> >> >> >> > We can acknowledge writes back only after Lucene index is
> >> >> >> > updated.
> >> >> >> > >
> >> >> >> > > UX
> >> >> >> > > * How to add FullText search queries syntax into Calcite?
> >> >> >> > Postgres's FTS functions are a good reference.
> >> >> >> > > * AFAIK, the Lucene index has many properties for tuning. How will the
> >> >> >> > > user configure the index?
> >> >> >> > Most of those properties can be cluster level and exposed as a new sub
> >> >> >> > config for Ignite.
> >> >> >> > > * How and where to store the settings? What is cluster-wide and what is
> >> >> >> > > local to the particular node?
> >> >> >> > All can be cluster level.
> >> >> >> > > * Will all the settings be immutable? Can they be changed on the fly?
> >> >> >> > > After node/grid restart?
> >> >> >> > They should be applied post restart.
> >> >> >> >
> >> >> >> > > * Any limitations on query syntax?
> >> >> >> > It depends on how we model our queries for text search.
> >> >> >> >
> >> >> >> > >
> >> >> >> > > SQL
> >> >> >> > > * Will we support FullText search in SQL?
> >> >> >> > We need custom functions for it. See Postgres's FTS functions.
> >> >> >> > > * How to integrate Lucene index into Calcite? What is the cost model?
> >> >> >> > There cannot be any cost model since there are no paths for a text
> >> >> >> > query. If we see a text query, we have to use Lucene index or return
> >> >> >> > an error. In this way, we need to model text search as a set of UDFs.
> >> >> >> >
> >> >> >> > > Splitting rules? Traits?
> >> >> >> > Please see my reply above.
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > With all of this, you can go with the IEP (or even some short summary)
> >> >> >> > > and further POC and implementation.
> >> >> >> > > That's a big deal, so let's discuss what could be done here.
> >> >> >> > >
> >> >> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <atri@apache.org> wrote:
> >> >> >> > >
> >> >> >> > > > I am actually happy to drive the feature for Ignite 3. FTS is very
> >> >> >> > > > important for me and I think Ignite users will benefit from it
> >> >> >> > > > greatly.
> >> >> >> > > >
> >> >> >> > > > If it makes sense to be focusing on Ignite 3 for this capability, I am
> >> >> >> > > > eager to contribute there and lead the development.
> >> >> >> > > >
> >> >> >> > > > Please share your thoughts.
> >> >> >> > > >
> >> >> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> >> >> >> > > > <an...@gmail.com> wrote:
> >> >> >> > > > >
> >> >> >> > > > > Hi Atri,
> >> >> >> > > > >
> >> >> >> > > > > All the Jira tickets we have on the Full-text search (FTS) thing are
> >> >> >> > > > > targeted to Ignite 2.
> >> >> >> > > > >
> >> >> >> > > > > AFAIK, we want, but we have NOT committed to FTS support in Ignite 3, yet.
> >> >> >> > > > > By the way, we are getting requests for this thing from the user side,
> >> >> >> > > > > and definitely, FTS would be a valuable feature for Ignite.
> >> >> >> > > > >
> >> >> >> > > > > It will be great if someone wants to drive it; any help will be
> >> >> >> > > > > appreciated.
> >> >> >> > > > >
> >> >> >> > > > >
> >> >> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org> wrote:
> >> >> >> > > > >
> >> >> >> > > > > > Hello,
> >> >> >> > > > > >
> >> >> >> > > > > > An update, please. I am working through persistence of Lucene index using
> >> >> >> > > > > > Ignite Dictionary, and will be asking some questions soon.
> >> >> >> > > > > >
> >> >> >> > > > > > I had one doubt - - where does this change go? Ignite 3?
> >> >> >> > > > > >
> >> >> >> > > > > > Also, I know we want to build native support for text searches in Ignite 3.
> >> >> >> > > > > > Is the work I am proposing here part of that, or will that be a separate
> >> >> >> > > > > > effort?
> >> >> >> > > > > >
> >> >> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <ilya.kasnacheev@gmail.com> wrote:
> >> >> >> > > > > > > Hello!
> >> >> >> > > > > > >
> >> >> >> > > > > > > I think that number one is the most important one, then maybe it will see
> >> >> >> > > > > > > more use and other deficiencies become more apparent, leading to more
> >> >> >> > > > > > > tickets and visibility.
> >> >> >> > > > > > >
> >> >> >> > > > > > > Maybe 2. and 3. will even use a different approach when persistence is
> >> >> >> > > > > > > implemented.
> >> >> >> > > > > > >
> >> >> >> > > > > > > Regards,
> >> >> >> > > > > > > --
> >> >> >> > > > > > > Ilya Kasnacheev
> >> >> >> > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > > > On Mon, 28 Jun 2021 at 14:34, Atri Sharma <at...@apache.org> wrote:
> >> >> >> > > > > > >
> >> >> >> > > > > > > > Hello Again!
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > I have been looking into the aforementioned and here are my follow-up
> >> >> >> > > > > > > > thoughts:
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > 1. Support persistence of Lucene indexes.
> >> >> >> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401
> >> >> >> > > > > > > >    (needs fixing of moving partitions first)
> >> >> >> > > > > > > > 3. Figure out how to return scores from nodes and use them as sort
> >> >> >> > > > > > > >    parameters on the coordinator node
> >> >> >> > > > > > > >    (https://issues.apache.org/jira/browse/IGNITE-12291)
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Please let me know if this looks OK to make text queries functional.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Atri
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> >> >> >> > > > > > > > <al...@gmail.com> wrote:
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Hi.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > One of the biggest issues with text queries is a lack of support for
> >> >> >> > > > > > > > > Lucene index persistence, which makes this functionality useless if
> >> >> >> > > > > > > > > persistence is enabled.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > I would first take care of it.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > On Mon, 21 Jun 2021 at 12:16, Maksim Timonin <timonin.maxim@gmail.com> wrote:
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > > Hi, Atri!
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > You're right, there is actually a lack of support for TextQueries. In the
> >> >> >> > > > > > > > > > last ticket I'm doing I see some obvious issues with them (no page size
> >> >> >> > > > > > > > > > support, for example). I'm glad that somebody wants to maintain this
> >> >> >> > > > > > > > > > functionality. Thanks a lot!
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > For the MergeSort algorithm there is already a patch for that [1]. It's
> >> >> >> > > > > > > > > > currently on review. This patch introduces an abstract reducer for
> >> >> >> > > > > > > > > > CacheQueries with 2 implementations (unordered, merge-sort). Then TextQuery
> >> >> >> > > > > > > > > > leverages MergeSort to order results from multiple nodes by score. This
> >> >> >> > > > > > > > > > patch also fixes the pageSize issue I mentioned before. Could you
> >> >> >> > > > > > > > > > please check if it fully matches your idea? Any issues or comments are
> >> >> >> > > > > > > > > > welcome.
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > I've prepared this ticket because I need the MergeSort algorithm for the
> >> >> >> > > > > > > > > > new type of queries I'm implementing (IndexQuery, it should also provide
> >> >> >> > > > > > > > > > ordered results over multiple nodes). Currently I'm not planning to go
> >> >> >> > > > > > > > > > further with TextQuery, so if you're going to support this it'll be a great
> >> >> >> > > > > > > > > > contribution, I think.
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> >> >> >> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <atri@apache.org> wrote:
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > > Hi All,
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > I have been looking into our text queries support and see that it has
> >> >> >> > > > > > > > > > > limited community support.
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the module and work on
> >> >> >> > > > > > > > > > > enhancing it further.
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > First goal would be to move to Lucene 8.x, then work on sorted reduce
> >> >> >> > > > > > > > > > > - merge across nodes. Fundamentally, this is doable since Lucene ranks
> >> >> >> > > > > > > > > > > documents according to their score, and documents are returned in the
> >> >> >> > > > > > > > > > > order of their score. Since the scoring function is homogeneous, this
> >> >> >> > > > > > > > > > > means that across nodes, we can compare scores and merge sort.
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > Please let me know if I can take this up.
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > Atri
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > --
> >> >> >> > > > > > > > > > > Regards,
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > Atri
> >> >> >> > > > > > > > > > > Apache Concerted
> >> >> >> > > > > > > > > > >
>
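
The score-based reduce described in the quoted proposal above (each node returns hits in descending score order, and the reducer merge-sorts them) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (ScoreMergeSketch, Hit, mergeByScore) are hypothetical and are not Ignite or Lucene API, and it assumes, as the thread does, that scores produced on different nodes are directly comparable. Lucene itself ships a TopDocs.merge helper for combining per-shard results; the sketch avoids any Lucene dependency so that it stays self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of a reduce-side merge; not Ignite or Lucene API. */
final class ScoreMergeSketch {
    /** One hit as a node might report it: a cache key plus a relevance score. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Cursor over one node's hits, which must already be sorted by descending score. */
    private static final class Cursor {
        final Iterator<Hit> it;
        Hit head;

        Cursor(List<Hit> hits) {
            it = hits.iterator();
            advance();
        }

        void advance() {
            head = it.hasNext() ? it.next() : null;
        }
    }

    /** K-way merge: repeatedly take the best remaining head among all nodes. */
    static List<Hit> mergeByScore(List<List<Hit>> perNode, int limit) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Cursor c) -> c.head.score).reversed());

        for (List<Hit> hits : perNode) {
            Cursor c = new Cursor(hits);
            if (c.head != null)
                queue.add(c);
        }

        List<Hit> merged = new ArrayList<>();
        while (!queue.isEmpty() && merged.size() < limit) {
            Cursor best = queue.poll();   // node whose current head has the highest score
            merged.add(best.head);
            best.advance();
            if (best.head != null)
                queue.add(best);          // re-insert with its next hit as the new head
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> nodeA = List.of(new Hit("a1", 0.9f), new Hit("a2", 0.4f));
        List<Hit> nodeB = List.of(new Hit("b1", 0.7f), new Hit("b2", 0.6f));
        for (Hit h : mergeByScore(List.of(nodeA, nodeB), 3))
            System.out.println(h.key + " " + h.score);
    }
}
```

Because only the current head of each node's list sits in the queue, the reducer can stop as soon as the requested number of results has been produced, which is also what makes a page size limit cheap to honour.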

Re: Text Queries Support

Posted by Ivan Pavlukhin <vo...@gmail.com>.

Hi Atri,

Sorry for the late answer.

> I didn't quite understand. Are you proposing that Ignite should not have FTS capabilities?

That seems like an option to me. IMHO it is better to have no FTS than
something like the current Ignite TextQueries.

2021-08-03 12:45 GMT+03:00, Atri Sharma <at...@apache.org>:
> Hi Ivan,
>
> I didn't quite understand. Are you proposing that Ignite should not
> have FTS capabilities?
>
> Atri
>
> On Tue, Aug 3, 2021 at 2:57 PM Ivan Pavlukhin <vo...@gmail.com> wrote:
>>
>> Hi Atri,
>>
>> My main concern is non-maleficence. Every task has several solutions,
>> e.g. straightforward ones:
>> 1. Do not implement FTS.
>> 2. Create our own implementation.
>>
>> Some of the strongest ones live without FTS [1].
>>
>> [1] https://github.com/cockroachdb/cockroach/issues/7821
>>
>> 2021-08-02 11:33 GMT+03:00, Atri Sharma <at...@apache.org>:
>> > Hi Ivan,
>> >
>> > Would you like to propose an alternative to Lucene?
>> >
>> > Atri
>> >
>> > On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <vo...@gmail.com> wrote:
>> >
>> >> Folks,
>> >>
>> >> Sorry if I have not read the thread thoroughly enough, but do we consider
>> >> Lucene the obviously right choice? In my understanding, Ignite's history
>> >> has clearly shown that the "fastest feature implementation" is usually not
>> >> the best, and text queries are one example of this. Are we not about to
>> >> make the same mistake again? FTS is a huge feature; I do not believe
>> >> there is an easy win for it.
>> >>
>> >> 2021-07-27 19:18 GMT+03:00, Atri Sharma <at...@apache.org>:
>> >> > Andrey,
>> >> >
>> >> >> Per-partition Lucene index looks simple to implement, but it may
>> >> >> require per-partition SQL to make full-text search expressions work
>> >> >> correctly within the SQL query.
>> >> > I think that as long as we follow the map - reduce process that we
>> >> > already do for other queries, we should be fine.
>> >> >
>> >> >> Per-partition SQL index may kill the performance. We already tried
>> >> >> that in Ignite 2: the QueryParallelism feature helps to speed up some
>> >> >> data-intensive queries, but hurts performance in simple cases, and at
>> >> >> some point (e.g. when the number of segments exceeds the number of
>> >> >> CPUs) performance degrades rapidly as the number of segments grows.
>> >> >
>> >> > Yeah, that is always the case, but a global index will be a
>> >> > nightmare
>> >> > in terms of concurrency and pessimistic concurrency control will
>> >> > anyways kill the benefits, coupled with the metadata requirements.
>> >> > What were the specific issues with per partition index?
>> >> >>
>> >> >> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
>> >> >> Maybe the map-reduce technique underneath FTS expressions, plus some
>> >> >> hacks, will add only minimal overhead.
>> >> >
>> >> > Lucene uses many types of indices, but the point here is that
>> >> > per-partition Lucene indices can return docIDs and we can merge them
>> >> > in the reduce phase. So we are abstracted away from the specifics of
>> >> > the internal index being used to serve the query.
>> >> >
>> >> >>
>> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to
>> >> >> > rebuild
>> >> >> > Lucene indices. The important thing here is to not treat Lucene
>> >> >> > indices as source of truth.
>> >> >> To use the WAL we should either relay Lucene files to our page memory
>> >> >> or be aware of the Lucene file structure.
>> >> >> The first looks tricky, as we would have to guarantee a contiguous
>> >> >> address space in page memory to mirror a Lucene file. Maybe a
>> >> >> separate managed memory segment with its own rules?
>> >> >
>> >> > Why not use Lucene's MMapDirectory and map it to our storage
>> >> > classes?
>> >> >
>> >> >>
>> >> >> >> Transactions.
>> >> >> >> * Will we support transactions?
>> >> >> > Lucene has no concept of transactions.
>> >> >> Yes, but we have.
>> >> >> Lucene index may be non-transactional, but users never expect to
>> >> >> see uncommitted data.
>> >> >> How does this connect with transactional SQL?
>> >> > We could have the Lucene writes done as a part of transactions and
>> >> > ack
>> >> > back only when it succeeds/fails. WDYT?
>> >> >>
>> >> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <at...@apache.org>
>> >> >> wrote:
>> >> >>
>> >> >> > Sorry, I planned on creating a Wiki page for this, but it makes
>> >> >> > more
>> >> >> > sense to be replying here.
>> >> >> >
>> >> >> > > * How Lucene index can be split among the nodes?
>> >> >> >
>> >> >> > We can have partition level indices on each node.
>> >> >> >
>> >> >> > > * If we'll have a single index for all partitions on the
>> >> >> > > particular
>> >> >> > > node,
>> >> >> > > then how index records will be aware of partitioning?
>> >> >> >
>> >> >> > Index records don't need to be aware of partitioning -- each
>> >> >> > Lucene index is independent.
>> >> >> >
>> >> >> > > This is important to filter out backup records from the results
>> >> >> > > to
>> >> >> > > avoid
>> >> >> > > duplicates.
>> >> >> >
>> >> >> > We can merge documents from different nodes and remove duplicates
>> >> >> > as
>> >> >> > long as docIDs are globally unique.
>> >> >> >
>> >> >> > > * How results from several nodes can be merged on the Reduce
>> >> >> > > stage?
>> >> >> >
>> >> >> > As long as documents have a globally unique docID, Lucene has
>> >> >> > merge
>> >> >> > functions that can merge results from multiple partial results.
>> >> >> >
>> >> >> > > * Does Lucene support something like a JOIN operation, or others
>> >> >> > > that may require data from another partition or index?
>> >> >> >
>> >> >> > As illustrated by Ilya, Block-Join works for us.
>> >> >> >
>> >> >> > > If so, then it looks like a multistep query with merging of results
>> >> >> > > at intermediate stages, and requires detailed investigation and
>> >> >> > > design. It is OK if Ignite has some limitations here, but we would
>> >> >> > > like to know about them at an early stage.
>> >> >> >
>> >> >> > > * How to effectively map Lucene files to the page memory? Is it
>> >> >> > > even possible?
>> >> >> >
>> >> >> > Lucene has Directory implementations which allow storing Lucene
>> >> >> > indices on different kinds of file structures. It has an
>> >> >> > MMapDirectory that we could use?
>> >> >> >
>> >> >> > > Otherwise, how to deal with potential OOM on large queries and
>> >> >> > > memory capacity planning?
>> >> >> >
>> >> >> > We can use Lucene's MMapDirectory.
>> >> >> >
>> >> >> > >
>> >> >> > > Persistence.
>> >> >> > > * How and what consistency guarantees could we have/expect?
>> >> >> >
>> >> >> > Lucene does not have WAL logs but is append only
>> >> >> >
>> >> >> > > It seems we may not be able to write physical records for the
>> >> >> > > Lucene index to our WAL. What can we do about this?
>> >> >> >
>> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to
>> >> >> > rebuild
>> >> >> > Lucene indices. The important thing here is to not treat Lucene
>> >> >> > indices as source of truth.
>> >> >> > >
>> >> >> > > Transactions.
>> >> >> > > * Will we support transactions?
>> >> >> > Lucene has no concept of transactions.
>> >> >> >
>> >> >> > > * Should Lucene be aware of Transaction and track mvcc (or
>> >> >> > > whatever)
>> >> >> > > versions for the records?
>> >> >> > No
>> >> >> > > * What will be consistency guarantees?
>> >> >> > We can acknowledge writes back only after Lucene index is
>> >> >> > updated.
>> >> >> > >
>> >> >> > > UX
>> >> >> > > * How to add FullText search queries syntax into Calcite?
>> >> >> > Postgres's FTS functions are a good reference.
>> >> >> > > * AFAIK, the Lucene index has many properties for tuning. How
>> >> >> > > will the user configure the index?
>> >> >> > Most of those properties can be cluster level and exposed as a
>> >> >> > new sub-config for Ignite.
>> >> >> > > * How and where to store the settings? Which are cluster-wide
>> >> >> > > and which are local to a particular node?
>> >> >> > All can be cluster level.
>> >> >> > > * Will all the settings be immutable? Can they be changed on
>> >> >> > > the fly, or only after a node/grid restart?
>> >> >> > They should be applied post restart.
>> >> >> >
>> >> >> > > * Any limitations on query syntax?
>> >> >> > It depends on how we model our queries for text search.
>> >> >> >
>> >> >> > >
>> >> >> > > SQL
>> >> >> > > * Will we support FullText search in SQL?
>> >> >> > We need custom functions for it. See Postgres's FTS functions.
>> >> >> > > * How to integrate the Lucene index into Calcite? What is the
>> >> >> > > cost model?
>> >> >> > There cannot be any cost model since there are no alternative
>> >> >> > paths for a text query. If we see a text query, we have to use the
>> >> >> > Lucene index or return an error. So we need to model text search
>> >> >> > as a set of UDFs.
>> >> >> >
>> >> >> > > Splitting rules? Traits?
>> >> >> > Please see my reply above.
>> >> >> > >
>> >> >> > >
>> >> >> > > With all of this, you can go with the IEP (or even some short
>> >> >> > > summary)
>> >> >> > and
>> >> >> > > further POC and implementation.
>> >> >> > > That's a big deal, so let's discuss what could be done here.
>> >> >> > >
>> >> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org>
>> >> wrote:
>> >> >> > >
>> >> >> > > > I am actually happy to drive the feature for Ignite 3. FTS is
>> >> >> > > > very
>> >> >> > > > important for me and I think Ignite users will benefit from
>> >> >> > > > it
>> >> >> > > > greatly.
>> >> >> > > >
>> >> >> > > > If it makes sense to be focusing on Ignite 3 for this
>> >> >> > > > capability,
>> >> I
>> >> >> > > > am
>> >> >> > > > eager to contribute there and lead the development.
>> >> >> > > >
>> >> >> > > > Please share your thoughts.
>> >> >> > > >
>> >> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
>> >> >> > > > <an...@gmail.com> wrote:
>> >> >> > > > >
>> >> >> > > > > Hi Atri,
>> >> >> > > > >
>> >> >> > > > > All the Jira tickets we have on the Full-text search (FTS)
>> >> >> > > > > thing
>> >> >> > > > > are
>> >> >> > > > > targeted to Ignite 2.
>> >> >> > > > >
>> >> >> > > > > AFAIK, we want, but we have NOT committed to FTS support in
>> >> Ignite
>> >> >> > > > > 3,
>> >> >> > > > yet.
>> >> >> > > > > By the way, we are getting requests for this thing from the
>> >> >> > > > > user
>> >> >> > side,
>> >> >> > > > and
>> >> >> > > > > definitely,
>> >> >> > > > > FTS would be a valuable feature for Ignite.
>> >> >> > > > >
>> >> >> > > > > It will be great if the one wants to drive it, any help
>> >> >> > > > > will
>> >> >> > > > > be
>> >> >> > > > appreciated.
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma
>> >> >> > > > > <at...@apache.org>
>> >> >> > wrote:
>> >> >> > > > >
>> >> >> > > > > > Hello,
>> >> >> > > > > >
>> >> >> > > > > > An update, please. I am working through persistence of
>> >> >> > > > > > Lucene
>> >> >> > > > > > index
>> >> >> > > > using
>> >> >> > > > > > Ignite Dictionary, and will be asking some questions
>> >> >> > > > > > soon.
>> >> >> > > > > >
>> >> >> > > > > > I had one doubt - - where does this change go? Ignite 3?
>> >> >> > > > > >
>> >> >> > > > > > Also, I know we want to build native support for text
>> >> >> > > > > > searches
>> >> >> > > > > > in
>> >> >> > > > Ignite 3.
>> >> >> > > > > > Is the work I am proposing here part of that, or will
>> >> >> > > > > > that
>> >> >> > > > > > be
>> >> a
>> >> >> > > > separate
>> >> >> > > > > > effort?
>> >> >> > > > > >
>> >> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
>> >> >> > ilya.kasnacheev@gmail.com
>> >> >> > > > >
>> >> >> > > > > > wrote:
>> >> >> > > > > >
>> >> >> > > > > > > Hello!
>> >> >> > > > > > >
>> >> >> > > > > > > I think that number one is the most important one, then
>> >> maybe
>> >> >> > > > > > > it
>> >> >> > > > will see
>> >> >> > > > > > > more use and other deficiencies become more apparent,
>> >> leading
>> >> >> > > > > > > to
>> >> >> > more
>> >> >> > > > > > > tickets and visibility.
>> >> >> > > > > > >
>> >> >> > > > > > > Maybe 2. and 3. will even use a different approach when
>> >> >> > persistence
>> >> >> > > > is
>> >> >> > > > > > > implemented.
>> >> >> > > > > > >
>> >> >> > > > > > > Regards,
>> >> >> > > > > > > --
>> >> >> > > > > > > Ilya Kasnacheev
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > > On Mon, 28 Jun 2021 at 14:34, Atri Sharma
>> >> >> > > > > > > <at...@apache.org> wrote:
>> >> >> > > > > > >
>> >> >> > > > > > > > Hello Again!
>> >> >> > > > > > > >
>> >> >> > > > > > > > I have been looking into the aforementioned and here
>> >> >> > > > > > > > are
>> >> my
>> >> >> > follow
>> >> >> > > > up
>> >> >> > > > > > > > thoughts:
>> >> >> > > > > > > >
>> >> >> > > > > > > > 1. Support persistence of Lucene indexes.
>> >> >> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401
>> >> >> > > > > > > > (Needs
>> >> >> > > > fixing of
>> >> >> > > > > > > > moving partitions first)
>> >> >> > > > > > > > 3. Figure out how to return scores from nodes and use
>> >> >> > > > > > > > them
>> >> >> > > > > > > > as
>> >> >> > sort
>> >> >> > > > > > > > parameters on the coordinator node
>> >> >> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
>> >> >> > > > > > > >
>> >> >> > > > > > > > Please let me know if this looks ok to make text
>> >> >> > > > > > > > queries
>> >> >> > > > functional?
>> >> >> > > > > > > >
>> >> >> > > > > > > > Atri
>> >> >> > > > > > > >
>> >> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
>> >> >> > > > > > > > <al...@gmail.com> wrote:
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Hi.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > One of the biggest issues with text queries is a
>> >> >> > > > > > > > > lack
>> >> >> > > > > > > > > of
>> >> >> > support
>> >> >> > > > for
>> >> >> > > > > > > > lucene
>> >> >> > > > > > > > > indices persistence, which makes this functionality
>> >> >> > > > > > > > > useless
>> >> >> > if a
>> >> >> > > > > > > > > persistence is enabled.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > I would first take care of it.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > On Mon, 21 Jun 2021 at 12:16, Maksim Timonin <timonin.maxim@gmail.com> wrote:
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > > Hi, Atri!
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > You're right, Actually there is a lack of support
>> >> >> > > > > > > > > > for
>> >> >> > > > TextQueries.
>> >> >> > > > > > > For
>> >> >> > > > > > > > the
>> >> >> > > > > > > > > > last ticket I'm doing I see some obvious issues
>> >> >> > > > > > > > > > with
>> >> >> > > > > > > > > > them
>> >> >> > (no
>> >> >> > > > page
>> >> >> > > > > > > size
>> >> >> > > > > > > > > > support, for example). I'm glad that somebody
>> >> >> > > > > > > > > > wants
>> >> >> > > > > > > > > > to
>> >> >> > maintain
>> >> >> > > > > > this
>> >> >> > > > > > > > > > functionality. Thanks a lot!
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > For the MergeSort algorithm there is already a
>> >> >> > > > > > > > > > patch
>> >> >> > > > > > > > > > for
>> >> >> > that
>> >> >> > > > [1].
>> >> >> > > > > > > It's
>> >> >> > > > > > > > > > currently on review. This patch introduces an
>> >> >> > > > > > > > > > abstract
>> >> >> > reducer
>> >> >> > > > for
>> >> >> > > > > > > > > > CacheQueries with 2 implementations (unordered,
>> >> >> > merge-sort).
>> >> >> > > > Then
>> >> >> > > > > > > > TextQuery
>> >> >> > > > > > > > > > leverages on MergeSort to order results from
>> >> >> > > > > > > > > > multiple
>> >> >> > nodes by
>> >> >> > > > > > score.
>> >> >> > > > > > > > This
>> >> >> > > > > > > > > > patch also fixes the pageSize issue, I've
>> >> >> > > > > > > > > > mentioned
>> >> >> > > > > > > > > > before.
>> >> >> > > > Could
>> >> >> > > > > > you
>> >> >> > > > > > > > > > please check if it fully matches your idea? Any
>> >> >> > > > > > > > > > issues
>> >> >> > > > > > > > > > or
>> >> >> > > > comments
>> >> >> > > > > > > are
>> >> >> > > > > > > > > > welcome.
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > I've prepared this ticket, because I need the
>> >> MergeSort
>> >> >> > > > algorithm
>> >> >> > > > > > for
>> >> >> > > > > > > > the
>> >> >> > > > > > > > > > new type of queries I'm implementing (IndexQuery,
>> >> >> > > > > > > > > > it
>> >> >> > > > > > > > > > should
>> >> >> > > > also
>> >> >> > > > > > > > provide
>> >> >> > > > > > > > > > ordered results over multiple nodes). Currently
>> >> >> > > > > > > > > > I'm
>> >> not
>> >> >> > > > planning to
>> >> >> > > > > > > go
>> >> >> > > > > > > > > > further with TextQuery, so if you're going to
>> >> >> > > > > > > > > > support
>> >> >> > > > > > > > > > this
>> >> >> > > > it'll
>> >> >> > > > > > be a
>> >> >> > > > > > > > great
>> >> >> > > > > > > > > > contribution, I think.
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > [1]
>> >> https://issues.apache.org/jira/browse/IGNITE-14703
>> >> >> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
>> >> >> > atri@apache.org>
>> >> >> > > > > > > wrote:
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > > Hi All,
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > I have been looking into our text queries support and see that it has
>> >> >> > > > > > > > > > > limited community support.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the module and work on
>> >> >> > > > > > > > > > > enhancing it further.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > First goal would be to move to Lucene 8.x, then work on sorted reduce
>> >> >> > > > > > > > > > > - merge across nodes. Fundamentally, this is doable since Lucene ranks
>> >> >> > > > > > > > > > > documents according to their score, and documents are returned in the
>> >> >> > > > > > > > > > > order of their score. Since the scoring function is homogeneous, this
>> >> >> > > > > > > > > > > means that across nodes, we can compare scores and merge sort.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Please let me know if I can take this up.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Atri
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > --
>> >> >> > > > > > > > > > > Regards,
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Atri
>> >> >> > > > > > > > > > > Apache Concerted
>> >> >> > > > > > > > > > >


-- 

Best regards,
Ivan Pavlukhin
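
The duplicate filtering discussed in the quoted replies above (merging hits from several nodes and dropping entries that also come from backup copies) can be sketched as follows. This is a minimal illustration with hypothetical names (DedupReduceSketch, Hit, reduce), not Ignite or Lucene API. It assumes every hit carries a cluster-wide unique identifier such as the cache key; Lucene's internal integer document IDs are only meaningful within a single index, so they cannot play that role on their own.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of reduce-side de-duplication; not Ignite or Lucene API. */
final class DedupReduceSketch {
    /** One hit: a cluster-wide unique key (assumed to be the cache key) and a score. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Keeps one hit per key (the best-scoring one) and orders the result by descending score. */
    static List<Hit> reduce(List<List<Hit>> perNode) {
        Map<String, Hit> best = new LinkedHashMap<>();

        for (List<Hit> hits : perNode) {
            for (Hit h : hits)
                best.merge(h.key, h, (x, y) -> x.score >= y.score ? x : y);
        }

        List<Hit> out = new ArrayList<>(best.values());
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Hit> primary = List.of(new Hit("k1", 0.9f), new Hit("k2", 0.5f));
        List<Hit> backup = List.of(new Hit("k2", 0.5f), new Hit("k3", 0.7f));
        for (Hit h : reduce(List.of(primary, backup)))
            System.out.println(h.key + " " + h.score);
    }
}
```

Filtering on the reducer is the simplest option to illustrate; skipping backup partitions on the map side would avoid shipping the duplicates at all, but needs the partition awareness that the thread leaves open.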

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

Hi Ivan,

I didn't quite understand. Are you proposing that Ignite should not
have FTS capabilities?

Atri

On Tue, Aug 3, 2021 at 2:57 PM Ivan Pavlukhin <vo...@gmail.com> wrote:
>
> Hi Atri,
>
> My main concern is non-maleficence. Every task has several solutions,
> e.g. straightforward ones:
> 1. Do not implement FTS.
> 2. Create our own implementation.
>
> Some of the strongest ones live without FTS [1].
>
> [1] https://github.com/cockroachdb/cockroach/issues/7821
>
> 2021-08-02 11:33 GMT+03:00, Atri Sharma <at...@apache.org>:
> > Hi Ivan,
> >
> > Would you like to propose an alternative to Lucene?
> >
> > Atri
> >
> > On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <vo...@gmail.com> wrote:
> >
> >> Folks,
> >>
> >> Sorry if I have not read the thread thoroughly enough, but do we consider
> >> Lucene the obviously right choice? In my understanding, Ignite's history
> >> has clearly shown that the "fastest feature implementation" is usually not
> >> the best, and text queries are one example of this. Are we not about to
> >> make the same mistake again? FTS is a huge feature; I do not believe
> >> there is an easy win for it.
> >>
> >> 2021-07-27 19:18 GMT+03:00, Atri Sharma <at...@apache.org>:
> >> > Andrey,
> >> >
> >> >> Per-partition Lucene index looks simple to implement, but it may
> >> >> require per-partition SQL to make full-text search expressions work
> >> >> correctly within the SQL query.
> >> > I think that as long as we follow the map - reduce process that we
> >> > already do for other queries, we should be fine.
> >> >
> >> >> Per-partition SQL index may kill the performance. We already tried
> >> >> that in Ignite 2: the QueryParallelism feature helps to speed up some
> >> >> data-intensive queries, but hurts performance in simple cases, and at
> >> >> some point (e.g. when the number of segments exceeds the number of
> >> >> CPUs) performance degrades rapidly as the number of segments grows.
> >> >
> >> > Yeah, that is always the case, but a global index will be a nightmare
> >> > in terms of concurrency and pessimistic concurrency control will
> >> > anyways kill the benefits, coupled with the metadata requirements.
> >> > What were the specific issues with per partition index?
> >> >>
> >> >> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
> >> >> Maybe the map-reduce technique underneath FTS expressions, plus some
> >> >> hacks, will add only minimal overhead.
> >> >
> >> > Lucene uses many types of indices, but the point here is that
> >> > per-partition Lucene indices can return docIDs and we can merge them
> >> > in the reduce phase. So we are abstracted away from the specifics of
> >> > the internal index being used to serve the query.
> >> >
> >> >>
> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> >> > Lucene indices. The important thing here is to not treat Lucene
> >> >> > indices as source of truth.
> >> >> To use the WAL we should either relay Lucene files to our page memory
> >> >> or be aware of the Lucene file structure.
> >> >> The first looks tricky, as we would have to guarantee a contiguous
> >> >> address space in page memory to mirror a Lucene file. Maybe a
> >> >> separate managed memory segment with its own rules?
> >> >
> >> > Why not use Lucene's MMapDirectory and map it to our storage
> >> > classes?
> >> >
> >> >>
> >> >> >> Transactions.
> >> >> >> * Will we support transactions?
> >> >> > Lucene has no concept of transactions.
> >> >> Yes, but we have.
> >> >> Lucene index may be non-transactional, but users never expect to see
> >> >> uncommitted data.
> >> >> How does this connect with transactional SQL?
> >> > We could have the Lucene writes done as a part of transactions and ack
> >> > back only when it succeeds/fails. WDYT?
> >> >>
> >> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <at...@apache.org> wrote:
> >> >>
> >> >> > Sorry, I planned on creating a Wiki page for this, but it makes more
> >> >> > sense to be replying here.
> >> >> >
> >> >> > > * How Lucene index can be split among the nodes?
> >> >> >
> >> >> > We can have partition level indices on each node.
> >> >> >
> >> >> > > * If we'll have a single index for all partitions on the
> >> >> > > particular
> >> >> > > node,
> >> >> > > then how index records will be aware of partitioning?
> >> >> >
> >> >> > Index records don't need to be aware of partitioning -- each Lucene
> >> >> > index is independent.
> >> >> >
> >> >> > > This is important to filter out backup records from the results to
> >> >> > > avoid
> >> >> > > duplicates.
> >> >> >
> >> >> > We can merge documents from different nodes and remove duplicates as
> >> >> > long as docIDs are globally unique.
> >> >> >
> >> >> > > * How results from several nodes can be merged on the Reduce
> >> >> > > stage?
> >> >> >
> >> >> > As long as documents have a globally unique docID, Lucene has merge
> >> >> > functions that can merge results from multiple partial results.
> >> >> >
> >> >> > > * Does Lucene support something like a JOIN operation, or others
> >> >> > > that may require data from another partition or index?
> >> >> >
> >> >> > As illustrated by Ilya, Block-Join works for us.
> >> >> >
> >> >> > > If so, then it looks like a multistep query with merging of results
> >> >> > > at intermediate stages, and requires detailed investigation and
> >> >> > > design. It is OK if Ignite has some limitations here, but we would
> >> >> > > like to know about them at an early stage.
> >> >> >
> >> >> > > * How to effectively map Lucene files to the page memory? Is it
> >> >> > > even possible?
> >> >> >
> >> >> > Lucene has Directory implementations which allow storing Lucene
> >> >> > indices on different kinds of file structures. It has an
> >> >> > MMapDirectory that we could use?
> >> >> >
> >> >> > > Otherwise, how to deal with potential OOM on large queries and
> >> >> > > memory capacity planning?
> >> >> >
> >> >> > We can use Lucene's MMapDirectory.
> >> >> >
> >> >> > >
> >> >> > > Persistence.
> >> >> > > * How and what consistency guarantees could we have/expect?
> >> >> >
> >> >> > Lucene does not have WAL logs but is append only
> >> >> >
> >> >> > > It seems we may not be able to write physical records for the
> >> >> > > Lucene index to our WAL. What can we do about this?
> >> >> >
> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> >> > Lucene indices. The important thing here is to not treat Lucene
> >> >> > indices as source of truth.
> >> >> > >
> >> >> > > Transactions.
> >> >> > > * Will we support transactions?
> >> >> > Lucene has no concept of transactions.
> >> >> >
> >> >> > > * Should Lucene be aware of Transaction and track mvcc (or
> >> >> > > whatever)
> >> >> > > versions for the records?
> >> >> > No
> >> >> > > * What will be consistency guarantees?
> >> >> > We can acknowledge writes back only after Lucene index is updated.
> >> >> > >
> >> >> > > UX
> >> >> > > * How to add FullText search queries syntax into Calcite?
> >> >> > Postgres's FTS functions are a good reference.
> >> >> > > * AFAIK, the Lucene index has many properties for tuning. How
> >> >> > > will the user configure the index?
> >> >> > Most of those properties can be cluster level and exposed as a new
> >> >> > sub-config for Ignite.
> >> >> > > * How and where to store the settings? Which are cluster-wide
> >> >> > > and which are local to a particular node?
> >> >> > All can be cluster level.
> >> >> > > * Will all the settings be immutable? Can they be changed on
> >> >> > > the fly, or only after a node/grid restart?
> >> >> > They should be applied post restart.
> >> >> >
> >> >> > > * Any limitations on query syntax?
> >> >> > It depends on how we model our queries for text search.
> >> >> >
> >> >> > >
> >> >> > > SQL
> >> >> > > * Will we support FullText search in SQL?
> >> >> > We need custom functions for it. See Postgres's FTS functions.
> >> >> > > * How to integrate the Lucene index into Calcite? What is the
> >> >> > > cost model?
> >> >> > There cannot be any cost model since there are no alternative paths
> >> >> > for a text query. If we see a text query, we have to use the Lucene
> >> >> > index or return an error. So we need to model text search as a set
> >> >> > of UDFs.
> >> >> >
> >> >> > > Splitting rules? Traits?
> >> >> > Please see my reply above.
> >> >> > >
> >> >> > >
> >> >> > > With all of this, you can go with the IEP (or even some short
> >> >> > > summary)
> >> >> > and
> >> >> > > further POC and implementation.
> >> >> > > That's a big deal, so let's discuss what could be done here.
> >> >> > >
> >> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org>
> >> wrote:
> >> >> > >
> >> >> > > > I am actually happy to drive the feature for Ignite 3. FTS is
> >> >> > > > very
> >> >> > > > important for me and I think Ignite users will benefit from it
> >> >> > > > greatly.
> >> >> > > >
> >> >> > > > If it makes sense to be focusing on Ignite 3 for this
> >> >> > > > capability,
> >> I
> >> >> > > > am
> >> >> > > > eager to contribute there and lead the development.
> >> >> > > >
> >> >> > > > Please share your thoughts.
> >> >> > > >
> >> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> >> >> > > > <an...@gmail.com> wrote:
> >> >> > > > >
> >> >> > > > > Hi Atri,
> >> >> > > > >
> >> >> > > > > All the Jira tickets we have on the Full-text search (FTS)
> >> >> > > > > thing
> >> >> > > > > are
> >> >> > > > > targeted to Ignite 2.
> >> >> > > > >
> >> >> > > > > AFAIK, we want, but we have NOT committed to FTS support in
> >> Ignite
> >> >> > > > > 3,
> >> >> > > > yet.
> >> >> > > > > By the way, we are getting requests for this thing from the
> >> >> > > > > user
> >> >> > side,
> >> >> > > > and
> >> >> > > > > definitely,
> >> >> > > > > FTS would be a valuable feature for Ignite.
> >> >> > > > >
> >> >> > > > > It will be great if the one wants to drive it, any help will
> >> >> > > > > be
> >> >> > > > appreciated.
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org>
> >> >> > wrote:
> >> >> > > > >
> >> >> > > > > > Hello,
> >> >> > > > > >
> >> >> > > > > > An update, please. I am working through persistence of
> >> >> > > > > > Lucene
> >> >> > > > > > index
> >> >> > > > using
> >> >> > > > > > Ignite Dictionary, and will be asking some questions soon.
> >> >> > > > > >
> >> >> > > > > > I had one question: where does this change go? Ignite 3?
> >> >> > > > > >
> >> >> > > > > > Also, I know we want to build native support for text
> >> >> > > > > > searches
> >> >> > > > > > in
> >> >> > > > Ignite 3.
> >> >> > > > > > Is the work I am proposing here part of that, or will that
> >> >> > > > > > be
> >> a
> >> >> > > > separate
> >> >> > > > > > effort?
> >> >> > > > > >
> >> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> >> >> > ilya.kasnacheev@gmail.com
> >> >> > > > >
> >> >> > > > > > wrote:
> >> >> > > > > >
> >> >> > > > > > > Hello!
> >> >> > > > > > >
> >> >> > > > > > > I think that number one is the most important one, then
> >> maybe
> >> >> > > > > > > it
> >> >> > > > will see
> >> >> > > > > > > more use and other deficiencies become more apparent,
> >> leading
> >> >> > > > > > > to
> >> >> > more
> >> >> > > > > > > tickets and visibility.
> >> >> > > > > > >
> >> >> > > > > > > Maybe 2. and 3. will even use a different approach when
> >> >> > persistence
> >> >> > > > is
> >> >> > > > > > > implemented.
> >> >> > > > > > >
> >> >> > > > > > > Regards,
> >> >> > > > > > > --
> >> >> > > > > > > Ilya Kasnacheev
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > > Mon, Jun 28, 2021 at 14:34, Atri Sharma
> >> >> > > > > > > <at...@apache.org>:
> >> >> > > > > > >
> >> >> > > > > > > > Hello Again!
> >> >> > > > > > > >
> >> >> > > > > > > > I have been looking into the aforementioned and here are
> >> my
> >> >> > follow
> >> >> > > > up
> >> >> > > > > > > > thoughts:
> >> >> > > > > > > >
> >> >> > > > > > > > 1. Support persistence of Lucene indexes.
> >> >> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401
> >> >> > > > > > > > (Needs
> >> >> > > > fixing of
> >> >> > > > > > > > moving partitions first)
> >> >> > > > > > > > 3. Figure out how to return scores from nodes and use
> >> >> > > > > > > > them
> >> >> > > > > > > > as
> >> >> > sort
> >> >> > > > > > > > parameters on the coordinator node
> >> >> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> >> >> > > > > > > >
> >> >> > > > > > > > Please let me know if this looks ok to make text queries
> >> >> > > > functional?
> >> >> > > > > > > >
> >> >> > > > > > > > Atri
> >> >> > > > > > > >
> >> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> >> >> > > > > > > > <al...@gmail.com> wrote:
> >> >> > > > > > > > >
> >> >> > > > > > > > > Hi.
> >> >> > > > > > > > >
> >> >> > > > > > > > > One of the biggest issues with text queries is a lack
> >> >> > > > > > > > > of
> >> >> > support
> >> >> > > > for
> >> >> > > > > > > > lucene
> >> >> > > > > > > > > indices persistence, which makes this functionality
> >> >> > > > > > > > > useless
> >> >> > if a
> >> >> > > > > > > > > persistence is enabled.
> >> >> > > > > > > > >
> >> >> > > > > > > > > I would first take care of it.
> >> >> > > > > > > > >
> >> >> > > > > > > > > Mon, Jun 21, 2021 at 12:16, Maksim Timonin <
> >> >> > > > timonin.maxim@gmail.com
> >> >> > > > > > >:
> >> >> > > > > > > > >
> >> >> > > > > > > > > > Hi, Atri!
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > You're right, Actually there is a lack of support
> >> >> > > > > > > > > > for
> >> >> > > > TextQueries.
> >> >> > > > > > > For
> >> >> > > > > > > > the
> >> >> > > > > > > > > > last ticket I'm doing I see some obvious issues with
> >> >> > > > > > > > > > them
> >> >> > (no
> >> >> > > > page
> >> >> > > > > > > size
> >> >> > > > > > > > > > support, for example). I'm glad that somebody wants
> >> >> > > > > > > > > > to
> >> >> > maintain
> >> >> > > > > > this
> >> >> > > > > > > > > > functionality. Thanks a lot!
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > For the MergeSort algorithm there is already a patch
> >> >> > > > > > > > > > for
> >> >> > that
> >> >> > > > [1].
> >> >> > > > > > > It's
> >> >> > > > > > > > > > currently on review. This patch introduces an
> >> >> > > > > > > > > > abstract
> >> >> > reducer
> >> >> > > > for
> >> >> > > > > > > > > > CacheQueries with 2 implementations (unordered,
> >> >> > merge-sort).
> >> >> > > > Then
> >> >> > > > > > > > TextQuery
> >> >> > > > > > > > > > leverages on MergeSort to order results from
> >> >> > > > > > > > > > multiple
> >> >> > nodes by
> >> >> > > > > > score.
> >> >> > > > > > > > This
> >> >> > > > > > > > > > patch also fixes the pageSize issue, I've mentioned
> >> >> > > > > > > > > > before.
> >> >> > > > Could
> >> >> > > > > > you
> >> >> > > > > > > > > > please check if it fully matches your idea? Any
> >> >> > > > > > > > > > issues
> >> >> > > > > > > > > > or
> >> >> > > > comments
> >> >> > > > > > > are
> >> >> > > > > > > > > > welcome.
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > I've prepared this ticket, because I need the
> >> MergeSort
> >> >> > > > algorithm
> >> >> > > > > > for
> >> >> > > > > > > > the
> >> >> > > > > > > > > > new type of queries I'm implementing (IndexQuery, it
> >> >> > > > > > > > > > should
> >> >> > > > also
> >> >> > > > > > > > provide
> >> >> > > > > > > > > > ordered results over multiple nodes). Currently I'm
> >> not
> >> >> > > > planning to
> >> >> > > > > > > go
> >> >> > > > > > > > > > further with TextQuery, so if you're going to
> >> >> > > > > > > > > > support
> >> >> > > > > > > > > > this
> >> >> > > > it'll
> >> >> > > > > > be a
> >> >> > > > > > > > great
> >> >> > > > > > > > > > contribution, I think.
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > [1]
> >> https://issues.apache.org/jira/browse/IGNITE-14703
> >> >> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> >> >> > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> >> >> > atri@apache.org>
> >> >> > > > > > > wrote:
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > > Hi All,
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > I have been looking into our text queries support
> >> and
> >> >> > > > > > > > > > > see
> >> >> > > > that it
> >> >> > > > > > > has
> >> >> > > > > > > > > > > limited community support.
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the
> >> >> > module and
> >> >> > > > > > work
> >> >> > > > > > > on
> >> >> > > > > > > > > > > enhancing it further.
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > First goal would be to move to Lucene 8.x, then
> >> >> > > > > > > > > > > work
> >> >> > > > > > > > > > > on
> >> >> > > > sorted
> >> >> > > > > > > reduce
> >> >> > > > > > > > > > > - merge across nodes. Fundamentally, this is
> >> >> > > > > > > > > > > doable
> >> >> > > > > > > > > > > since
> >> >> > > > Lucene
> >> >> > > > > > > > ranks
> >> >> > > > > > > > > > > documents according to their score, and documents
> >> are
> >> >> > > > returned in
> >> >> > > > > > > the
> >> >> > > > > > > > > > > order of their score. Since the scoring function
> >> >> > > > > > > > > > > is
> >> >> > > > homogeneous,
> >> >> > > > > > > this
> >> >> > > > > > > > > > > means that across nodes, we can compare scores and
> >> >> > > > > > > > > > > merge
> >> >> > > > sort.
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > Please let me know if I can take this up.
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > Atri
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > --
> >> >> > > > > > > > > > > Regards,
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > Atri
> >> >> > > > > > > > > > > Apache Concerted
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > > --
> >> >> > > > > > > > >
> >> >> > > > > > > > > Best regards,
> >> >> > > > > > > > > Alexei Scherbakov
> >> >> > > > > > > >
> >> >> > > > > > > > --
> >> >> > > > > > > > Regards,
> >> >> > > > > > > >
> >> >> > > > > > > > Atri
> >> >> > > > > > > > Apache Concerted
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > --
> >> >> > > > > Best regards,
> >> >> > > > > Andrey V. Mashenkov
> >> >> > > >
> >> >> > > > --
> >> >> > > > Regards,
> >> >> > > >
> >> >> > > > Atri
> >> >> > > > Apache Concerted
> >> >> > > >
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > > Best regards,
> >> >> > > Andrey V. Mashenkov
> >> >> >
> >> >> > --
> >> >> > Regards,
> >> >> >
> >> >> > Atri
> >> >> > Apache Concerted
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >> Best regards,
> >> >> Andrey V. Mashenkov
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Atri
> >> > Apache Concerted
> >> >
> >>
> >>
> >> --
> >>
> >> Best regards,
> >> Ivan Pavlukhin
> >>
> >
>
>
> --
>
> Best regards,
> Ivan Pavlukhin

-- 
Regards,

Atri
Apache Concerted
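
The "sorted reduce" idea quoted above (each node returns hits ordered by
score, and the reducer merge-sorts them) can be sketched as below. This is
only an illustration, not the reducer from IGNITE-14703: the Hit and
ScoreMergeSketch names are hypothetical, and it assumes that scores coming
from different nodes are directly comparable and that every entry has a
globally unique key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** One search hit as a node might report it (hypothetical type). */
final class Hit {
    final String key;   // assumed to be globally unique across nodes
    final float score;  // assumed to be comparable across nodes

    Hit(String key, float score) {
        this.key = key;
        this.score = score;
    }
}

final class ScoreMergeSketch {
    /** Cursor over one node's hits, which must already be sorted by descending score. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit head() {
            return hits.get(pos);
        }
    }

    /** Merges per-node hit lists into at most {@code limit} hits, best score first. */
    static List<Hit> merge(List<List<Hit>> perNode, int limit) {
        // Max-heap ordered by the score of each cursor's current head.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Float.compare(b.head().score, a.head().score));

        for (List<Hit> hits : perNode)
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));

        List<Hit> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();

        while (!heap.isEmpty() && out.size() < limit) {
            Cursor c = heap.poll();
            Hit h = c.head();

            // Drop a key that another node (e.g. a backup copy) already returned.
            if (seen.add(h.key))
                out.add(h);

            // Re-insert the cursor so it is ordered by its next hit.
            if (++c.pos < c.hits.size())
                heap.add(c);
        }

        return out;
    }

    public static void main(String[] args) {
        List<Hit> node1 = Arrays.asList(new Hit("a", 0.9f), new Hit("c", 0.4f));
        List<Hit> node2 = Arrays.asList(new Hit("a", 0.9f), new Hit("b", 0.7f), new Hit("d", 0.1f));

        for (Hit h : merge(Arrays.asList(node1, node2), 3))
            System.out.println(h.key + " " + h.score);
    }
}
```

Only one hit per node has to sit in the heap at a time, so the reducer can
consume results page by page. Lucene itself ships TopDocs.merge for
combining per-shard hits; whether that fits Ignite's reducer is not settled
in this thread.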

Re: Text Queries Support

Posted by Ivan Pavlukhin <vo...@gmail.com>.

Hi Atri,

My main concern is non-maleficence. Every task has several solutions,
including straightforward ones, e.g.:
1. Do not implement FTS.
2. Create our own implementation.

Some of the strongest databases live without FTS [1].

[1] https://github.com/cockroachdb/cockroach/issues/7821

2021-08-02 11:33 GMT+03:00, Atri Sharma <at...@apache.org>:
> Hi Ivan,
>
> Would you like to propose an alternative to Lucene?
>
> Atri
>
> On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <vo...@gmail.com> wrote:
>
>> Folks,
>>
>> Sorry if I have not read the thread thoroughly enough, but do we consider
>> Lucene the obviously right choice? In my understanding, Ignite's history
>> has clearly shown that the "fastest feature implementation" is usually not
>> the best one, and text queries are one example of this. Aren't we about
>> to make the same mistake again? FTS is a huge feature, and I do not believe
>> there is an easy win for it.
>>
>> 2021-07-27 19:18 GMT+03:00, Atri Sharma <at...@apache.org>:
>> > Andrey,
>> >
>> >> Per-partition Lucene index looks simple to implement, but it may
>> >> require
>> >> per-partition SQL to make full-text search expressions work correctly
>> >> within the SQL query.
>> > I think that as long as we follow the map - reduce process that we
>> > already do for other queries, we should be fine.
>> >
>> >> A per-partition SQL index may kill performance. We already tried
>> >> that in Ignite 2: the QueryParallelism feature helps to speed up some
>> >> data-intensive queries, but it hurts performance in simple cases, and
>> >> at some point (e.g. when there are more segments than CPUs) performance
>> >> degrades rapidly as the number of segments grows.
>> >
>> > Yeah, that is always the case, but a global index will be a nightmare
>> > in terms of concurrency and pessimistic concurrency control will
>> > anyways kill the benefits, coupled with the metadata requirements.
>> > What were the specific issues with per partition index?
>> >>
>> >> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
>> >> Maybe the map-reduce technique underneath FTS expressions and some
>> >> hacks will add only minimal overhead.
>> >
>> > Lucene uses many types of indices but the aspect here is that per
>> > partition Lucene indices can return docIDs and we can merge them in
>> > reduce phase. So we are abstracted out from specifics of the internal
>> > index being used to serve the query.
>> >
>> >>
>> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
>> >> > Lucene indices. The important thing here is to not treat Lucene
>> >> > indices as source of truth.
>> >> To use the WAL, we should either relay Lucene files to our page memory
>> >> or be aware of the Lucene file structure.
>> >> The first looks tricky, as we would have to guarantee a contiguous
>> >> address space in page memory to mirror a Lucene file. Maybe a separate
>> >> managed memory segment with its own rules?
>> >
>> > Why not use Lucene's MMapDirectory and map it to our storage
>> > classes?
>> >
>> >>
>> >> >> Transactions.
>> >> >> * Will we support transactions?
>> >> > Lucene has no concept of transactions.
>> >> Yes, but we have.
>> >> The Lucene index may be non-transactional, but users never expect to
>> >> see uncommitted data.
>> >> How does this connect with transactional SQL?
>> > We could have the Lucene writes done as a part of transactions and ack
>> > back only when it succeeds/fails. WDYT?
>> >>
>> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <at...@apache.org> wrote:
>> >>
>> >> > Sorry, I planned on creating a Wiki page for this, but it makes more
>> >> > sense to be replying here.
>> >> >
>> >> > > * How Lucene index can be split among the nodes?
>> >> >
>> >> > We can have partition level indices on each node.
>> >> >
>> >> > > * If we'll have a single index for all partitions on the
>> >> > > particular
>> >> > > node,
>> >> > > then how index records will be aware of partitioning?
>> >> >
>> >> > Index records don't need to be aware of partitioning -- each Lucene
>> >> > index is independent.
>> >> >
>> >> > > This is important to filter out backup records from the results to
>> >> > > avoid
>> >> > > duplicates.
>> >> >
>> >> > We can merge documents from different nodes and remove duplicates as
>> >> > long as docIDs are globally unique.
>> >> >
>> >> > > * How results from several nodes can be merged on the Reduce
>> >> > > stage?
>> >> >
>> >> > As long as documents have a globally unique docID, Lucene has merge
>> >> > functions that can merge results from multiple partial results.
>> >> >
>> >> > > * Does Lucene support something like a JOIN operation, or others
>> >> > > that may require data from another partition or index?
>> >> >
>> >> > As illustrated by Ilya, Block-Join works for us.
>> >> >
>> >> > > If so, then it looks like a multistep query that merges results at
>> >> > > intermediate stages and requires detailed investigation and design.
>> >> > > It is OK if Ignite has some limitations here, but we would like to
>> >> > > know about them at an early stage.
>> >> >
>> >> > > * How can we effectively map Lucene files to the page memory? Is it
>> >> > > even possible?
>> >> >
>> >> > Lucene has Directory implementations which allow storing Lucene
>> >> > indices on different kinds of file structures. It has an
>> >> > MMapDirectory that we could use?
>> >> >
>> >> > > Otherwise, how to deal with potential OOM on large queries and
>> memory
>> >> > > capacity planning?
>> >> >
>> >> > We can use Lucene's MMapDirectory.
>> >> >
>> >> > >
>> >> > > Persistence.
>> >> > > * How and what consistency guarantees could we have/expect?
>> >> >
>> >> > Lucene does not have a WAL, but it is append-only.
>> >> >
>> >> > > Seems, we may not be able to write physical records for Lucene
>> >> > > index
>> >> > > to
>> >> > our
>> >> > > WAL. What can we do with this?
>> >> >
>> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
>> >> > Lucene indices. The important thing here is to not treat Lucene
>> >> > indices as source of truth.
>> >> > >
>> >> > > Transactions.
>> >> > > * Will we support transactions?
>> >> > Lucene has no concept of transactions.
>> >> >
>> >> > > * Should Lucene be aware of Transaction and track mvcc (or
>> >> > > whatever)
>> >> > > versions for the records?
>> >> > No
>> >> > > * What will be consistency guarantees?
>> >> > We can acknowledge writes back only after Lucene index is updated.
>> >> > >
>> >> > > UX
>> >> > > * How to add FullText search queries syntax into Calcite?
>> >> > Postgres's FTS functions are a good reference.
>> >> > > * AFAIK, the Lucene index has many properties for tuning. How will
>> >> > > the
>> >> > user
>> >> > > configure the index?
>> >> > Most of those properties can be cluster level and exposed as a new
>> >> > sub
>> >> > config for ignite.
>> >> > > * How and where to store the settings? Which are cluster-wide and
>> >> > > which are local to a particular node?
>> >> > All can be cluster level.
>> >> > > * Will all the settings be immutable? Can they be changed on the
>> >> > > fly, or only after a node/grid restart?
>> >> > They should be applied post restart.
>> >> >
>> >> > > * Any limitations on query syntax?
>> >> > It depends on how we model our queries for text search.
>> >> >
>> >> > >
>> >> > > SQL
>> >> > > * Will we support FullText search in SQL?
>> >> > We need custom functions for it. See Postgres's FTS functions.
>> >> > > * How to integrate Lucene index into Calcite? What is the cost
>> model?
>> >> > There cannot be any cost model since there are no paths for a text
>> >> > query. If we see a text query, we have to use Lucene index or return
>> >> > an error. In this way, we need to model text search as a set of UDFs
>> >> >
>> >> > > Splitting rules? Traits?
>> >> > Please see my reply above.
>> >> > >
>> >> > >
>> >> > > With all of this, you can go with the IEP (or even some short
>> >> > > summary)
>> >> > and
>> >> > > further POC and implementation.
>> >> > > That's a big deal, so let's discuss what could be done here.
>> >> > >
>> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org>
>> wrote:
>> >> > >
>> >> > > > I am actually happy to drive the feature for Ignite 3. FTS is
>> >> > > > very
>> >> > > > important for me and I think Ignite users will benefit from it
>> >> > > > greatly.
>> >> > > >
>> >> > > > If it makes sense to be focusing on Ignite 3 for this
>> >> > > > capability,
>> I
>> >> > > > am
>> >> > > > eager to contribute there and lead the development.
>> >> > > >
>> >> > > > Please share your thoughts.
>> >> > > >
>> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
>> >> > > > <an...@gmail.com> wrote:
>> >> > > > >
>> >> > > > > Hi Atri,
>> >> > > > >
>> >> > > > > All the Jira tickets we have on the Full-text search (FTS)
>> >> > > > > thing
>> >> > > > > are
>> >> > > > > targeted to Ignite 2.
>> >> > > > >
>> >> > > > > AFAIK, we want, but we have NOT committed to FTS support in
>> Ignite
>> >> > > > > 3,
>> >> > > > yet.
>> >> > > > > By the way, we are getting requests for this thing from the
>> >> > > > > user
>> >> > side,
>> >> > > > and
>> >> > > > > definitely,
>> >> > > > > FTS would be a valuable feature for Ignite.
>> >> > > > >
>> >> > > > > It will be great if someone wants to drive it; any help will be
>> >> > > > > appreciated.
>> >> > > > >
>> >> > > > >
>> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org>
>> >> > wrote:
>> >> > > > >
>> >> > > > > > Hello,
>> >> > > > > >
>> >> > > > > > An update, please. I am working through persistence of
>> >> > > > > > Lucene
>> >> > > > > > index
>> >> > > > using
>> >> > > > > > Ignite Dictionary, and will be asking some questions soon.
>> >> > > > > >
>> >> > > > > > I had one question: where does this change go? Ignite 3?
>> >> > > > > >
>> >> > > > > > Also, I know we want to build native support for text
>> >> > > > > > searches
>> >> > > > > > in
>> >> > > > Ignite 3.
>> >> > > > > > Is the work I am proposing here part of that, or will that
>> >> > > > > > be
>> a
>> >> > > > separate
>> >> > > > > > effort?
>> >> > > > > >
>> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
>> >> > ilya.kasnacheev@gmail.com
>> >> > > > >
>> >> > > > > > wrote:
>> >> > > > > >
>> >> > > > > > > Hello!
>> >> > > > > > >
>> >> > > > > > > I think that number one is the most important one, then
>> maybe
>> >> > > > > > > it
>> >> > > > will see
>> >> > > > > > > more use and other deficiencies become more apparent,
>> leading
>> >> > > > > > > to
>> >> > more
>> >> > > > > > > tickets and visibility.
>> >> > > > > > >
>> >> > > > > > > Maybe 2. and 3. will even use a different approach when
>> >> > persistence
>> >> > > > is
>> >> > > > > > > implemented.
>> >> > > > > > >
>> >> > > > > > > Regards,
>> >> > > > > > > --
>> >> > > > > > > Ilya Kasnacheev
>> >> > > > > > >
>> >> > > > > > >
>> >> > > > > > > Mon, Jun 28, 2021 at 14:34, Atri Sharma
>> >> > > > > > > <at...@apache.org>:
>> >> > > > > > >
>> >> > > > > > > > Hello Again!
>> >> > > > > > > >
>> >> > > > > > > > I have been looking into the aforementioned and here are
>> my
>> >> > follow
>> >> > > > up
>> >> > > > > > > > thoughts:
>> >> > > > > > > >
>> >> > > > > > > > 1. Support persistence of Lucene indexes.
>> >> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401
>> >> > > > > > > > (Needs
>> >> > > > fixing of
>> >> > > > > > > > moving partitions first)
>> >> > > > > > > > 3. Figure out how to return scores from nodes and use
>> >> > > > > > > > them
>> >> > > > > > > > as
>> >> > sort
>> >> > > > > > > > parameters on the coordinator node
>> >> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
>> >> > > > > > > >
>> >> > > > > > > > Please let me know if this looks ok to make text queries
>> >> > > > functional?
>> >> > > > > > > >
>> >> > > > > > > > Atri
>> >> > > > > > > >
>> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
>> >> > > > > > > > <al...@gmail.com> wrote:
>> >> > > > > > > > >
>> >> > > > > > > > > Hi.
>> >> > > > > > > > >
>> >> > > > > > > > > One of the biggest issues with text queries is a lack
>> >> > > > > > > > > of
>> >> > support
>> >> > > > for
>> >> > > > > > > > lucene
>> >> > > > > > > > > indices persistence, which makes this functionality
>> >> > > > > > > > > useless
>> >> > if a
>> >> > > > > > > > > persistence is enabled.
>> >> > > > > > > > >
>> >> > > > > > > > > I would first take care of it.
>> >> > > > > > > > >
>> >> > > > > > > > > Mon, Jun 21, 2021 at 12:16, Maksim Timonin <
>> >> > > > timonin.maxim@gmail.com
>> >> > > > > > >:
>> >> > > > > > > > >
>> >> > > > > > > > > > Hi, Atri!
>> >> > > > > > > > > >
>> >> > > > > > > > > > You're right, Actually there is a lack of support
>> >> > > > > > > > > > for
>> >> > > > TextQueries.
>> >> > > > > > > For
>> >> > > > > > > > the
>> >> > > > > > > > > > last ticket I'm doing I see some obvious issues with
>> >> > > > > > > > > > them
>> >> > (no
>> >> > > > page
>> >> > > > > > > size
>> >> > > > > > > > > > support, for example). I'm glad that somebody wants
>> >> > > > > > > > > > to
>> >> > maintain
>> >> > > > > > this
>> >> > > > > > > > > > functionality. Thanks a lot!
>> >> > > > > > > > > >
>> >> > > > > > > > > > For the MergeSort algorithm there is already a patch
>> >> > > > > > > > > > for
>> >> > that
>> >> > > > [1].
>> >> > > > > > > It's
>> >> > > > > > > > > > currently on review. This patch introduces an
>> >> > > > > > > > > > abstract
>> >> > reducer
>> >> > > > for
>> >> > > > > > > > > > CacheQueries with 2 implementations (unordered,
>> >> > merge-sort).
>> >> > > > Then
>> >> > > > > > > > TextQuery
>> >> > > > > > > > > > leverages on MergeSort to order results from
>> >> > > > > > > > > > multiple
>> >> > nodes by
>> >> > > > > > score.
>> >> > > > > > > > This
>> >> > > > > > > > > > patch also fixes the pageSize issue, I've mentioned
>> >> > > > > > > > > > before.
>> >> > > > Could
>> >> > > > > > you
>> >> > > > > > > > > > please check if it fully matches your idea? Any
>> >> > > > > > > > > > issues
>> >> > > > > > > > > > or
>> >> > > > comments
>> >> > > > > > > are
>> >> > > > > > > > > > welcome.
>> >> > > > > > > > > >
>> >> > > > > > > > > > I've prepared this ticket, because I need the
>> MergeSort
>> >> > > > algorithm
>> >> > > > > > for
>> >> > > > > > > > the
>> >> > > > > > > > > > new type of queries I'm implementing (IndexQuery, it
>> >> > > > > > > > > > should
>> >> > > > also
>> >> > > > > > > > provide
>> >> > > > > > > > > > ordered results over multiple nodes). Currently I'm
>> not
>> >> > > > planning to
>> >> > > > > > > go
>> >> > > > > > > > > > further with TextQuery, so if you're going to
>> >> > > > > > > > > > support
>> >> > > > > > > > > > this
>> >> > > > it'll
>> >> > > > > > be a
>> >> > > > > > > > great
>> >> > > > > > > > > > contribution, I think.
>> >> > > > > > > > > >
>> >> > > > > > > > > > [1]
>> https://issues.apache.org/jira/browse/IGNITE-14703
>> >> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
>> >> > atri@apache.org>
>> >> > > > > > > wrote:
>> >> > > > > > > > > >
>> >> > > > > > > > > > > Hi All,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I have been looking into our text queries support
>> and
>> >> > > > > > > > > > > see
>> >> > > > that it
>> >> > > > > > > has
>> >> > > > > > > > > > > limited community support.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the
>> >> > module and
>> >> > > > > > work
>> >> > > > > > > on
>> >> > > > > > > > > > > enhancing it further.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > First goal would be to move to Lucene 8.x, then
>> >> > > > > > > > > > > work
>> >> > > > > > > > > > > on
>> >> > > > sorted
>> >> > > > > > > reduce
>> >> > > > > > > > > > > - merge across nodes. Fundamentally, this is
>> >> > > > > > > > > > > doable
>> >> > > > > > > > > > > since
>> >> > > > Lucene
>> >> > > > > > > > ranks
>> >> > > > > > > > > > > documents according to their score, and documents
>> are
>> >> > > > returned in
>> >> > > > > > > the
>> >> > > > > > > > > > > order of their score. Since the scoring function
>> >> > > > > > > > > > > is
>> >> > > > homogeneous,
>> >> > > > > > > this
>> >> > > > > > > > > > > means that across nodes, we can compare scores and
>> >> > > > > > > > > > > merge
>> >> > > > sort.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Please let me know if I can take this up.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Atri
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > --
>> >> > > > > > > > > > > Regards,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Atri
>> >> > > > > > > > > > > Apache Concerted
>> >> > > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > --
>> >> > > > > > > > >
>> >> > > > > > > > > Best regards,
>> >> > > > > > > > > Alexei Scherbakov
>> >> > > > > > > >
>> >> > > > > > > > --
>> >> > > > > > > > Regards,
>> >> > > > > > > >
>> >> > > > > > > > Atri
>> >> > > > > > > > Apache Concerted
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > > --
>> >> > > > > Best regards,
>> >> > > > > Andrey V. Mashenkov
>> >> > > >
>> >> > > > --
>> >> > > > Regards,
>> >> > > >
>> >> > > > Atri
>> >> > > > Apache Concerted
>> >> > > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > Best regards,
>> >> > > Andrey V. Mashenkov
>> >> >
>> >> > --
>> >> > Regards,
>> >> >
>> >> > Atri
>> >> > Apache Concerted
>> >> >
>> >>
>> >>
>> >> --
>> >> Best regards,
>> >> Andrey V. Mashenkov
>> >
>> > --
>> > Regards,
>> >
>> > Atri
>> > Apache Concerted
>> >
>>
>>
>> --
>>
>> Best regards,
>> Ivan Pavlukhin
>>
>


-- 

Best regards,
Ivan Pavlukhin
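
The point quoted above about not treating the text index as a source of
truth, and rebuilding it from logged updates instead, can be illustrated
with a toy example. This is a sketch under loud assumptions: LogRecord and
DerivedTextIndex are hypothetical names, the "log" is just an ordered list
of logical put/remove records rather than Ignite's real WAL format, and a
small in-memory inverted index stands in for Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** A logical update (hypothetical type): a put when text != null, a remove when text == null. */
final class LogRecord {
    final String key;
    final String text;

    LogRecord(String key, String text) {
        this.key = key;
        this.text = text;
    }
}

/** Toy inverted index that can always be recreated by replaying the log. */
final class DerivedTextIndex {
    private final Map<String, Set<String>> postings = new HashMap<>(); // term -> keys
    private final Map<String, String> indexed = new HashMap<>();       // key -> indexed text

    /** Applies one update: drops the old terms of the key, then indexes the new text. */
    void apply(LogRecord rec) {
        String old = indexed.remove(rec.key);

        if (old != null) {
            for (String term : terms(old)) {
                Set<String> keys = postings.get(term);

                if (keys != null) {
                    keys.remove(rec.key);

                    if (keys.isEmpty())
                        postings.remove(term);
                }
            }
        }

        if (rec.text != null) {
            indexed.put(rec.key, rec.text);

            for (String term : terms(rec.text))
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(rec.key);
        }
    }

    /** Returns the keys whose current text contains the term, in sorted order. */
    List<String> search(String term) {
        Set<String> keys = postings.get(term.toLowerCase(Locale.ROOT));

        return keys == null ? new ArrayList<>() : new ArrayList<>(keys);
    }

    /** Rebuilds the index from scratch by replaying every record in order. */
    static DerivedTextIndex rebuild(List<LogRecord> log) {
        DerivedTextIndex idx = new DerivedTextIndex();

        for (LogRecord rec : log)
            idx.apply(rec);

        return idx;
    }

    /** Splits text into distinct lower-case terms on non-word characters. */
    private static Set<String> terms(String text) {
        Set<String> out = new TreeSet<>();

        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+"))
            if (!t.isEmpty())
                out.add(t);

        return out;
    }

    public static void main(String[] args) {
        List<LogRecord> log = Arrays.asList(
            new LogRecord("k1", "Apache Ignite"),
            new LogRecord("k2", "Apache Lucene"),
            new LogRecord("k1", "Ignite text queries"),
            new LogRecord("k2", null));

        System.out.println(rebuild(log).search("ignite"));
    }
}
```

Because the index is derived, losing it costs only a replay; the harder
questions raised in the thread (transaction visibility, page memory
mapping) are not addressed by this sketch.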

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

Hi Ivan,

Would you like to propose an alternative to Lucene?

Atri

On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <vo...@gmail.com> wrote:

> Folks,
>
> Sorry if I have not read the thread thoroughly enough, but do we consider
> Lucene the obviously right choice? In my understanding, Ignite's history
> has clearly shown that the "fastest feature implementation" is usually not
> the best one, and text queries are one example of this. Aren't we about
> to make the same mistake again? FTS is a huge feature, and I do not believe
> there is an easy win for it.
>
> 2021-07-27 19:18 GMT+03:00, Atri Sharma <at...@apache.org>:
> > Andrey,
> >
> >> Per-partition Lucene index looks simple to implement, but it may require
> >> per-partition SQL to make full-text search expressions work correctly
> >> within the SQL query.
> > I think that as long as we follow the map - reduce process that we
> > already do for other queries, we should be fine.
> >
> >> A per-partition SQL index may kill performance. We already tried
> >> that in Ignite 2: the QueryParallelism feature helps to speed up some
> >> data-intensive queries, but it hurts performance in simple cases, and
> >> at some point (e.g. when there are more segments than CPUs) performance
> >> degrades rapidly as the number of segments grows.
> >
> > Yeah, that is always the case, but a global index will be a nightmare
> > in terms of concurrency and pessimistic concurrency control will
> > anyways kill the benefits, coupled with the metadata requirements.
> > What were the specific issues with per partition index?
> >>
> >> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
> >> Maybe the map-reduce technique underneath FTS expressions and some
> >> hacks will add only minimal overhead.
> >
> > Lucene uses many types of indices but the aspect here is that per
> > partition Lucene indices can return docIDs and we can merge them in
> > reduce phase. So we are abstracted out from specifics of the internal
> > index being used to serve the query.
> >
> >>
> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> > Lucene indices. The important thing here is to not treat Lucene
> >> > indices as source of truth.
> >> To use the WAL, we should either relay Lucene files to our page memory
> >> or be aware of the Lucene file structure.
> >> The first looks tricky, as we would have to guarantee a contiguous
> >> address space in page memory to mirror a Lucene file. Maybe a separate
> >> managed memory segment with its own rules?
> >
> > Why not use Lucene's MMapDirectory and map it to our storage classes?
> >
> >>
> >> >> Transactions.
> >> >> * Will we support transactions?
> >> > Lucene has no concept of transactions.
> >> Yes, but we have.
> >> The Lucene index may be non-transactional, but users never expect to
> >> see uncommitted data.
> >> How does this connect with transactional SQL?
> > We could have the Lucene writes done as a part of transactions and ack
> > back only when it succeeds/fails. WDYT?
> >>
> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <at...@apache.org> wrote:
> >>
> >> > Sorry, I planned on creating a Wiki page for this, but it makes more
> >> > sense to be replying here.
> >> >
> >> > > * How Lucene index can be split among the nodes?
> >> >
> >> > We can have partition level indices on each node.
> >> >
> >> > > * If we'll have a single index for all partitions on the particular
> >> > > node,
> >> > > then how index records will be aware of partitioning?
> >> >
> >> > Index records don't need to be aware of partitioning -- each Lucene
> >> > index is independent.
> >> >
> >> > > This is important to filter out backup records from the results to
> >> > > avoid
> >> > > duplicates.
> >> >
> >> > We can merge documents from different nodes and remove duplicates as
> >> > long as docIDs are globally unique.
> >> >
> >> > > * How results from several nodes can be merged on the Reduce stage?
> >> >
> >> > As long as documents have a globally unique docID, Lucene has merge
> >> > functions that can merge results from multiple partial results.
> >> >
> >> > > * Does Lucene support something like a JOIN operation, or others
> >> > > that may require data from another partition or index?
> >> >
> >> > As illustrated by Ilya, Block-Join works for us.
> >> >
> >> > > If so, then it looks like a multistep query that merges results at
> >> > > intermediate stages and requires detailed investigation and design.
> >> > > It is OK if Ignite has some limitations here, but we would like to
> >> > > know about them at an early stage.
> >> >
> >> > > * How can we effectively map Lucene files to the page memory? Is it
> >> > > even possible?
> >> >
> >> > Lucene has Directory implementations which allow storing Lucene
> >> > indices on different kinds of file structures. It has an
> >> > MMapDirectory that we could use?
> >> >
> >> > > Otherwise, how to deal with potential OOM on large queries and
> memory
> >> > > capacity planning?
> >> >
> >> > We can use Lucene's MMapped directory.
> >> >
> >> > >
> >> > > Persistence.
> >> > > * How and what consistency guarantees could we have/expect?
> >> >
> >> > Lucene does not have WAL logs but is append only
> >> >
> >> > > Seems, we may not be able to write physical records for Lucene index
> >> > > to
> >> > our
> >> > > WAL. What can we do with this?
> >> >
> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> > Lucene indices. The important thing here is to not treat Lucene
> >> > indices as source of truth.
> >> > >
> >> > > Transactions.
> >> > > * Will we support transactions?
> >> > Lucene has no concept of transactions.
> >> >
> >> > > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> >> > > versions for the records?
> >> > No
> >> > > * What will be consistency guarantees?
> >> > We can acknowledge writes back only after Lucene index is updated.
> >> > >
> >> > > UX
> >> > > * How to add FullText search queries syntax into Calcite?
> >> > Postgres's FTS functions are a good reference.
> >> > > * AFAIK, the Lucene index has many properties for tuning. How will
> >> > > the
> >> > user
> >> > > configure the index?
> >> > Most of those properties can be cluster level and exposed as a new sub
> >> > config for ignite.
> >> > > * How and where to store the settings? What are cluster-wide and
> what
> >> > > a
> >> > > local to the particular node?
> >> > All can be cluster level.
> >> > > * Will be all the settings immutable? Can be they changed on-fly?
> >> > > after
> >> > > node/grid restart?
> >> > They should be applied post restart.
> >> >
> >> > > * Any limitations on query syntax?
> >> > It depends on how we model our queries for text search.
> >> >
> >> > >
> >> > > SQL
> >> > > * Will we support FullText search in SQL?
> >> > We need custom functions for it. See Postgres's FTS functions.
> >> > > * How to integrate Lucene index into Calcite? What is the cost
> model?
> >> > There cannot be any cost model since there are no paths for a text
> >> > query. If we see a text query, we have to use Lucene index or return
> >> > an error. In this way, we need to model text search as a set of UDFs
> >> >
> >> > > Splitting rules? Traits?
> >> > Please see my reply above.
> >> > >
> >> > >
> >> > > With all of this, you can go with the IEP (or even some short
> >> > > summary)
> >> > and
> >> > > further POC and implementation.
> >> > > That's a big deal, so let's discuss what could be done here.
> >> > >
> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org>
> wrote:
> >> > >
> >> > > > I am actually happy to drive the feature for Ignite 3. FTS is very
> >> > > > important for me and I think Ignite users will benefit from it
> >> > > > greatly.
> >> > > >
> >> > > > If it makes sense to be focusing on Ignite 3 for this capability,
> I
> >> > > > am
> >> > > > eager to contribute there and lead the development.
> >> > > >
> >> > > > Please share your thoughts.
> >> > > >
> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> >> > > > <an...@gmail.com> wrote:
> >> > > > >
> >> > > > > Hi Atri,
> >> > > > >
> >> > > > > All the Jira tickets we have on the Full-text search (FTS) thing
> >> > > > > are
> >> > > > > targeted to Ignite 2.
> >> > > > >
> >> > > > > AFAIK, we want, but we have NOT committed to FTS support in
> Ignite
> >> > > > > 3,
> >> > > > yet.
> >> > > > > By the way, we are getting requests for this thing from the user
> >> > side,
> >> > > > and
> >> > > > > definitely,
> >> > > > > FTS would be a valuable feature for Ignite.
> >> > > > >
> >> > > > > It will be great if the one wants to drive it, any help will be
> >> > > > appreciated.
> >> > > > >
> >> > > > >
> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org>
> >> > wrote:
> >> > > > >
> >> > > > > > Hello,
> >> > > > > >
> >> > > > > > An update, please. I am working through persistence of Lucene
> >> > > > > > index
> >> > > > using
> >> > > > > > Ignite Dictionary, and will be asking some questions soon.
> >> > > > > >
> >> > > > > > I had one doubt - - where does this change go? Ignite 3?
> >> > > > > >
> >> > > > > > Also, I know we want to build native support for text searches
> >> > > > > > in
> >> > > > Ignite 3.
> >> > > > > > Is the work I am proposing here part of that, or will that be
> a
> >> > > > separate
> >> > > > > > effort?
> >> > > > > >
> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> >> > ilya.kasnacheev@gmail.com
> >> > > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Hello!
> >> > > > > > >
> >> > > > > > > I think that number one is the most important one, then
> maybe
> >> > > > > > > it
> >> > > > will see
> >> > > > > > > more use and other deficiencies become more apparent,
> leading
> >> > > > > > > to
> >> > more
> >> > > > > > > tickets and visibility.
> >> > > > > > >
> >> > > > > > > Maybe 2. and 3. will even use a different approach when
> >> > persistence
> >> > > > is
> >> > > > > > > implemented.
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > > --
> >> > > > > > > Ilya Kasnacheev
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > пн, 28 июн. 2021 г. в 14:34, Atri Sharma <at...@apache.org>:
> >> > > > > > >
> >> > > > > > > > Hello Again!
> >> > > > > > > >
> >> > > > > > > > I have been looking into the aforementioned and here are
> my
> >> > follow
> >> > > > up
> >> > > > > > > > thoughts:
> >> > > > > > > >
> >> > > > > > > > 1. Support persistence of Lucene indexes.
> >> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401
> >> > > > > > > > (Needs
> >> > > > fixing of
> >> > > > > > > > moving partitions first)
> >> > > > > > > > 3. Figure out how to return scores from nodes and use them
> >> > > > > > > > as
> >> > sort
> >> > > > > > > > parameters on the coordinator node
> >> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> >> > > > > > > >
> >> > > > > > > > Please let me know if this looks ok to make text queries
> >> > > > functional?
> >> > > > > > > >
> >> > > > > > > > Atri
> >> > > > > > > >
> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> >> > > > > > > > <al...@gmail.com> wrote:
> >> > > > > > > > >
> >> > > > > > > > > Hi.
> >> > > > > > > > >
> >> > > > > > > > > One of the biggest issues with text queries is a lack of
> >> > support
> >> > > > for
> >> > > > > > > > lucene
> >> > > > > > > > > indices persistence, which makes this functionality
> >> > > > > > > > > useless
> >> > if a
> >> > > > > > > > > persistence is enabled.
> >> > > > > > > > >
> >> > > > > > > > > I would first take care of it.
> >> > > > > > > > >
> >> > > > > > > > > пн, 21 июн. 2021 г. в 12:16, Maksim Timonin <
> >> > > > timonin.maxim@gmail.com
> >> > > > > > >:
> >> > > > > > > > >
> >> > > > > > > > > > Hi, Atri!
> >> > > > > > > > > >
> >> > > > > > > > > > You're right, Actually there is a lack of support for
> >> > > > TextQueries.
> >> > > > > > > For
> >> > > > > > > > the
> >> > > > > > > > > > last ticket I'm doing I see some obvious issues with
> >> > > > > > > > > > them
> >> > (no
> >> > > > page
> >> > > > > > > size
> >> > > > > > > > > > support, for example). I'm glad that somebody wants to
> >> > maintain
> >> > > > > > this
> >> > > > > > > > > > functionality. Thanks a lot!
> >> > > > > > > > > >
> >> > > > > > > > > > For the MergeSort algorithm there is already a patch
> >> > > > > > > > > > for
> >> > that
> >> > > > [1].
> >> > > > > > > It's
> >> > > > > > > > > > currently on review. This patch introduces an abstract
> >> > reducer
> >> > > > for
> >> > > > > > > > > > CacheQueries with 2 implementations (unordered,
> >> > merge-sort).
> >> > > > Then
> >> > > > > > > > TextQuery
> >> > > > > > > > > > leverages on MergeSort to order results from multiple
> >> > nodes by
> >> > > > > > score.
> >> > > > > > > > This
> >> > > > > > > > > > patch also fixes the pageSize issue, I've mentioned
> >> > > > > > > > > > before.
> >> > > > Could
> >> > > > > > you
> >> > > > > > > > > > please check if it fully matches your idea? Any issues
> >> > > > > > > > > > or
> >> > > > comments
> >> > > > > > > are
> >> > > > > > > > > > welcome.
> >> > > > > > > > > >
> >> > > > > > > > > > I've prepared this ticket, because I need the
> MergeSort
> >> > > > algorithm
> >> > > > > > for
> >> > > > > > > > the
> >> > > > > > > > > > new type of queries I'm implementing (IndexQuery, it
> >> > > > > > > > > > should
> >> > > > also
> >> > > > > > > > provide
> >> > > > > > > > > > ordered results over multiple nodes). Currently I'm
> not
> >> > > > planning to
> >> > > > > > > go
> >> > > > > > > > > > further with TextQuery, so if you're going to support
> >> > > > > > > > > > this
> >> > > > it'll
> >> > > > > > be a
> >> > > > > > > > great
> >> > > > > > > > > > contribution, I think.
> >> > > > > > > > > >
> >> > > > > > > > > > [1]
> https://issues.apache.org/jira/browse/IGNITE-14703
> >> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> >> > atri@apache.org>
> >> > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hi All,
> >> > > > > > > > > > >
> >> > > > > > > > > > > I have been looking into our text queries support
> and
> >> > > > > > > > > > > see
> >> > > > that it
> >> > > > > > > has
> >> > > > > > > > > > > limited community support.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the
> >> > module and
> >> > > > > > work
> >> > > > > > > on
> >> > > > > > > > > > > enhancing it further.
> >> > > > > > > > > > >
> >> > > > > > > > > > > First goal would be to move to Lucene 8.x, then work
> >> > > > > > > > > > > on
> >> > > > sorted
> >> > > > > > > reduce
> >> > > > > > > > > > > - merge across nodes. Fundamentally, this is doable
> >> > > > > > > > > > > since
> >> > > > Lucene
> >> > > > > > > > ranks
> >> > > > > > > > > > > documents according to their score, and documents
> are
> >> > > > returned in
> >> > > > > > > the
> >> > > > > > > > > > > order of their score. Since the scoring function is
> >> > > > homogeneous,
> >> > > > > > > this
> >> > > > > > > > > > > means that across nodes, we can compare scores and
> >> > > > > > > > > > > merge
> >> > > > sort.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Please let me know if I can take this up.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Atri
> >> > > > > > > > > > >
> >> > > > > > > > > > > --
> >> > > > > > > > > > > Regards,
> >> > > > > > > > > > >
> >> > > > > > > > > > > Atri
> >> > > > > > > > > > > Apache Concerted
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > >
> >> > > > > > > > > Best regards,
> >> > > > > > > > > Alexei Scherbakov
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Regards,
> >> > > > > > > >
> >> > > > > > > > Atri
> >> > > > > > > > Apache Concerted
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Best regards,
> >> > > > > Andrey V. Mashenkov
> >> > > >
> >> > > > --
> >> > > > Regards,
> >> > > >
> >> > > > Atri
> >> > > > Apache Concerted
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Best regards,
> >> > > Andrey V. Mashenkov
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Atri
> >> > Apache Concerted
> >> >
> >>
> >>
> >> --
> >> Best regards,
> >> Andrey V. Mashenkov
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
>
>
> --
>
> Best regards,
> Ivan Pavlukhin
>

Re: Text Queries Support

Posted by Ivan Pavlukhin <vo...@gmail.com>.

Folks,

Sorry if I have not read the thread thoroughly enough, but do we consider
Lucene the obviously right choice? As I understand it, Ignite's history
has clearly shown that the "fastest feature implementation" is usually not
the best one, and text queries are one example of this. Aren't we about
to make the same mistake again? FTS is a huge feature, and I do not
believe there is an easy win for it.


-- 

Best regards,
Ivan Pavlukhin

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

Andrey,

> Per-partition Lucene index looks simple to implement, but it may require
> per-partition SQL to make full-text search expressions work correctly
> within the SQL query.
I think that as long as we follow the map-reduce process that we
already do for other queries, we should be fine.

> Per-partition SQL index may kill the performance. We already tried to do
> that in Ignite 2. The QueryParallelism feature helps to speed up some
> data-intensive queries, but it hurts performance in simple cases, and at
> some point (e.g. when segments > number of CPUs) performance rapidly
> degrades with the increasing number of segments.

Yeah, that is always the case, but a global index will be a nightmare
in terms of concurrency, and pessimistic concurrency control will
kill the benefits anyway, coupled with the metadata requirements.
What were the specific issues with the per-partition index?
>
> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
> Maybe the map-reduce technique underneath FTS expressions and some hacks
> will add a minimal overhead.

Lucene uses many types of indices, but the point here is that
per-partition Lucene indices can return docIDs and we can merge them in
the reduce phase. So we are abstracted away from the specifics of the
internal index being used to serve the query.
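
To make the reduce-phase merge concrete, here is a minimal sketch in plain
Java. It assumes each node returns its hits already sorted by descending
score and that every hit carries a globally unique key. The names
ScoreMergeSketch, Hit and merge are hypothetical and are not Ignite or
Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical reduce-side merge; none of these names are Ignite or Lucene APIs. */
final class ScoreMergeSketch {
    /** One hit returned by a node: a globally unique document key plus its score. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Cursor over one node's hit list, which is assumed to be sorted by descending score. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit head() {
            return hits.get(pos);
        }
    }

    /** Merges per-node lists into one list of at most {@code limit} hits, best score first. */
    static List<Hit> merge(List<List<Hit>> perNode, int limit) {
        // Max-heap ordered by the score of the current head of every non-empty node list.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            Comparator.comparingDouble((Cursor c) -> c.head().score).reversed());

        for (List<Hit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        Set<String> seen = new HashSet<>();
        List<Hit> out = new ArrayList<>();

        while (!heap.isEmpty() && out.size() < limit) {
            Cursor c = heap.poll();
            Hit h = c.head();

            // Skip a key that was already emitted, e.g. the same entry served by a backup copy.
            if (seen.add(h.key))
                out.add(h);

            // Advance the cursor and re-insert it while the node still has hits.
            if (++c.pos < c.hits.size())
                heap.add(c);
        }

        return out;
    }
}
```

With real Lucene results, TopDocs.merge plays a similar role for per-shard
hits; the sketch only shows why comparable scores plus unique keys are
enough for the reducer.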

>
> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> > Lucene indices. The important thing here is to not treat Lucene
> > indices as source of truth.
> To use the WAL we should either relay Lucene files to our page memory or
> be aware of the Lucene file structure.
> The first looks tricky, as we would have to guarantee a contiguous address
> space in page memory to reflect a Lucene file. Maybe a separate managed
> memory segment with its own rules?

Why not use Lucene's MMapDirectory and map it to our storage classes?
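
On the point quoted above about rebuilding from WAL records instead of
treating the index as a source of truth, the following is a rough sketch
of the idea using only the Java standard library. It assumes the log holds
logical put/remove records in order; LogRecord, rebuild and the toy
term-to-keys map are hypothetical stand-ins, not Ignite WAL or Lucene
classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: a derived text index rebuilt by replaying logical log records. */
final class IndexRebuildSketch {
    /** A logical update record; a null value stands for a removal. */
    static final class LogRecord {
        final String key;
        final String value;

        LogRecord(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Replays the log in order and returns term -> keys of entries that currently contain the term. */
    static Map<String, Set<String>> rebuild(List<LogRecord> log) {
        // First fold the log into the latest value per key: the cache data is the source of truth.
        Map<String, String> latest = new HashMap<>();

        for (LogRecord r : log) {
            if (r.value == null)
                latest.remove(r.key);
            else
                latest.put(r.key, r.value);
        }

        // Then index only the surviving values, so stale terms never reach the index.
        Map<String, Set<String>> index = new HashMap<>();

        for (Map.Entry<String, String> e : latest.entrySet()) {
            for (String term : e.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty())
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());
            }
        }

        return index;
    }
}
```

Because the index is derived, losing or corrupting it only costs a replay;
nothing in the sketch needs to know how the index files are laid out.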

>
> >> Transactions.
> >> * Will we support transactions?
> > Lucene has no concept of transactions.
> Yes, but we do.
> The Lucene index may be non-transactional, but users never expect to see
> uncommitted data.
> How does this connect with transactional SQL?
We could have the Lucene writes done as a part of transactions and ack
back only when they succeed or fail. WDYT?
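
One way to read "writes done as a part of transactions" is to buffer index
updates per transaction and make them searchable only on commit. The
sketch below illustrates that under simplifying assumptions: a plain map
stands in for the text index, and TxIndexSketch with its methods is
hypothetical, not an existing Ignite or Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: index updates stay invisible to searches until the transaction commits. */
final class TxIndexSketch {
    /** Committed, searchable state: key -> text. Stands in for a real text index. */
    private final Map<String, String> committed = new HashMap<>();

    /** Updates buffered per transaction id. */
    private final Map<Long, Map<String, String>> pending = new HashMap<>();

    /** Buffers a write inside the given transaction. */
    synchronized void put(long txId, String key, String text) {
        pending.computeIfAbsent(txId, id -> new HashMap<>()).put(key, text);
    }

    /** Applies the buffered writes; the caller would acknowledge the transaction after this returns. */
    synchronized void commit(long txId) {
        Map<String, String> writes = pending.remove(txId);

        if (writes != null)
            committed.putAll(writes);
    }

    /** Drops the buffered writes of a rolled-back transaction. */
    synchronized void rollback(long txId) {
        pending.remove(txId);
    }

    /** Returns keys whose committed text contains the term; pending writes are never consulted. */
    synchronized List<String> search(String term) {
        List<String> keys = new ArrayList<>();

        for (Map.Entry<String, String> e : committed.entrySet()) {
            if (e.getValue().contains(term))
                keys.add(e.getKey());
        }

        return keys;
    }
}
```

This addresses the "never see uncommitted data" concern for readers outside
the transaction; whether a transaction should see its own pending writes in
a text query is a separate design question that the sketch leaves open.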
>
> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <at...@apache.org> wrote:
>
> > Sorry, I planned on creating a Wiki page for this, but it makes more
> > sense to be replying here.
> >
> > > * How can a Lucene index be split among the nodes?
> >
> > We can have partition-level indices on each node.
> >
> > > * If we have a single index for all partitions on a particular node,
> > > then how will index records be aware of partitioning?
> >
> > Index records don't need to be aware of partitioning -- each Lucene
> > index is independent.
> >
> > > This is important to filter out backup records from the results to avoid
> > > duplicates.
> >
> > We can merge documents from different nodes and remove duplicates as
> > long as docIDs are globally unique.
> >
> > > * How can results from several nodes be merged on the Reduce stage?
> >
> > As long as documents have a globally unique docID, Lucene has merge
> > functions that can combine multiple partial results.
> >
> > > * Does Lucene support something like a JOIN operation, or others that
> > > may require data from another partition or index?
> >
> > As illustrated by Ilya, Block-Join works for us.
> >
> > > If so, then it looks like a multistep query with merging of results on
> > > intermediate stages, and it requires detailed investigation and design.
> > > It is OK if Ignite has some limitations here, but we would like to
> > > know about them at an early stage.
> >
> > > * How can Lucene files be mapped to the page memory effectively? Is it
> > > even possible?
> >
> > Lucene has Directory implementations which allow storing Lucene
> > indices on different kinds of file structures. It has an
> > MMapDirectory that we could use?
> >
> > > Otherwise, how do we deal with potential OOM on large queries and memory
> > > capacity planning?
> >
> > We can use Lucene's MMapDirectory.
> >
> > >
> > > Persistence.
> > > * What consistency guarantees could we have/expect, and how?
> >
> > Lucene does not have a WAL, but it is append-only.
> >
> > > It seems we may not be able to write physical records for the Lucene
> > > index to our WAL. What can we do about this?
> >
> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> > Lucene indices. The important thing here is to not treat Lucene
> > indices as the source of truth.
> > >
> > > Transactions.
> > > * Will we support transactions?
> > Lucene has no concept of transactions.
> >
> > > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> > > versions for the records?
> > No
> > > * What will be consistency guarantees?
> > We can acknowledge writes back only after Lucene index is updated.
> > >
> > > UX
> > > * How to add FullText search queries syntax into Calcite?
> > Postgres's FTS functions are a good reference.
> > > * AFAIK, the Lucene index has many properties for tuning. How will the
> > user
> > > configure the index?
> > Most of those properties can be cluster level and exposed as a new sub
> > config for ignite.
> > > * How and where to store the settings? What are cluster-wide and what a
> > > local to the particular node?
> > All can be cluster level.
> > > * Will be all the settings immutable? Can be they changed on-fly? after
> > > node/grid restart?
> > They should be applied post restart.
> >
> > > * Any limitations on query syntax?
> > It depends on how we model our queries for text search.
> >
> > >
> > > SQL
> > > * Will we support FullText search in SQL?
> > We need custom functions for it. See Postgres's FTS functions.
> > > * How to integrate Lucene index into Calcite? What is the cost model?
> > There cannot be any cost model since there are no paths for a text
> > query. If we see a text query, we have to use Lucene index or return
> > an error. In this way, we need to model text search as a set of UDFs
> >
> > > Splitting rules? Traits?
> > Please see my reply above.
> > >
> > >
> > > With all of this, you can go with the IEP (or even some short summary)
> > and
> > > further POC and implementation.
> > > That's a big deal, so let's discuss what could be done here.
> > >
> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org> wrote:
> > >
> > > > I am actually happy to drive the feature for Ignite 3. FTS is very
> > > > important for me and I think Ignite users will benefit from it
> > > > greatly.
> > > >
> > > > If it makes sense to be focusing on Ignite 3 for this capability, I am
> > > > eager to contribute there and lead the development.
> > > >
> > > > Please share your thoughts.
> > > >
> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > > > <an...@gmail.com> wrote:
> > > > >
> > > > > Hi Atri,
> > > > >
> > > > > All the Jira tickets we have on the Full-text search (FTS) thing are
> > > > > targeted to Ignite 2.
> > > > >
> > > > > AFAIK, we want, but we have NOT committed to FTS support in Ignite 3,
> > > > yet.
> > > > > By the way, we are getting requests for this thing from the user
> > side,
> > > > and
> > > > > definitely,
> > > > > FTS would be a valuable feature for Ignite.
> > > > >
> > > > > It will be great if the one wants to drive it, any help will be
> > > > appreciated.
> > > > >
> > > > >
> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org>
> > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > An update, please. I am working through persistence of Lucene index
> > > > using
> > > > > > Ignite Dictionary, and will be asking some questions soon.
> > > > > >
> > > > > > I had one doubt - - where does this change go? Ignite 3?
> > > > > >
> > > > > > Also, I know we want to build native support for text searches in
> > > > Ignite 3.
> > > > > > Is the work I am proposing here part of that, or will that be a
> > > > separate
> > > > > > effort?
> > > > > >
> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> > ilya.kasnacheev@gmail.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > I think that number one is the most important one, then maybe it
> > > > will see
> > > > > > > more use and other deficiencies become more apparent, leading to
> > more
> > > > > > > tickets and visibility.
> > > > > > >
> > > > > > > Maybe 2. and 3. will even use a different approach when
> > persistence
> > > > is
> > > > > > > implemented.
> > > > > > >
> > > > > > > Regards,
> > > > > > > --
> > > > > > > Ilya Kasnacheev
> > > > > > >
> > > > > > >
> > > > > > > > On Mon, Jun 28, 2021 at 2:34 PM Atri Sharma <at...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hello Again!
> > > > > > > >
> > > > > > > > I have been looking into the aforementioned and here are my
> > follow
> > > > up
> > > > > > > > thoughts:
> > > > > > > >
> > > > > > > > 1. Support persistence of Lucene indexes.
> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs
> > > > fixing of
> > > > > > > > moving partitions first)
> > > > > > > > 3. Figure out how to return scores from nodes and use them as
> > sort
> > > > > > > > parameters on the coordinator node
> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > > > >
> > > > > > > > Please let me know if this looks ok to make text queries
> > > > functional?
> > > > > > > >
> > > > > > > > Atri
> > > > > > > >
> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > > > <al...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi.
> > > > > > > > >
> > > > > > > > > One of the biggest issues with text queries is a lack of
> > support
> > > > for
> > > > > > > > lucene
> > > > > > > > > indices persistence, which makes this functionality useless
> > if a
> > > > > > > > > persistence is enabled.
> > > > > > > > >
> > > > > > > > > I would first take care of it.
> > > > > > > > >
> > > > > > > > > > On Mon, Jun 21, 2021 at 12:16 PM Maksim Timonin <timonin.maxim@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Atri!
> > > > > > > > > >
> > > > > > > > > > You're right, Actually there is a lack of support for
> > > > TextQueries.
> > > > > > > For
> > > > > > > > the
> > > > > > > > > > last ticket I'm doing I see some obvious issues with them
> > (no
> > > > page
> > > > > > > size
> > > > > > > > > > support, for example). I'm glad that somebody wants to
> > maintain
> > > > > > this
> > > > > > > > > > functionality. Thanks a lot!
> > > > > > > > > >
> > > > > > > > > > For the MergeSort algorithm there is already a patch for
> > that
> > > > [1].
> > > > > > > It's
> > > > > > > > > > currently on review. This patch introduces an abstract
> > reducer
> > > > for
> > > > > > > > > > CacheQueries with 2 implementations (unordered,
> > merge-sort).
> > > > Then
> > > > > > > > TextQuery
> > > > > > > > > > leverages on MergeSort to order results from multiple
> > nodes by
> > > > > > score.
> > > > > > > > This
> > > > > > > > > > patch also fixes the pageSize issue, I've mentioned before.
> > > > Could
> > > > > > you
> > > > > > > > > > please check if it fully matches your idea? Any issues or
> > > > comments
> > > > > > > are
> > > > > > > > > > welcome.
> > > > > > > > > >
> > > > > > > > > > I've prepared this ticket, because I need the MergeSort
> > > > algorithm
> > > > > > for
> > > > > > > > the
> > > > > > > > > > new type of queries I'm implementing (IndexQuery, it should
> > > > also
> > > > > > > > provide
> > > > > > > > > > ordered results over multiple nodes). Currently I'm not
> > > > planning to
> > > > > > > go
> > > > > > > > > > further with TextQuery, so if you're going to support this
> > > > it'll
> > > > > > be a
> > > > > > > > great
> > > > > > > > > > contribution, I think.
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> > atri@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > > > I have been looking into our text queries support and see
> > > > that it
> > > > > > > has
> > > > > > > > > > > limited community support.
> > > > > > > > > > >
> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the
> > module and
> > > > > > work
> > > > > > > on
> > > > > > > > > > > enhancing it further.
> > > > > > > > > > >
> > > > > > > > > > > First goal would be to move to Lucene 8.x, then work on
> > > > sorted
> > > > > > > reduce
> > > > > > > > > > > - merge across nodes. Fundamentally, this is doable since
> > > > Lucene
> > > > > > > > ranks
> > > > > > > > > > > documents according to their score, and documents are
> > > > returned in
> > > > > > > the
> > > > > > > > > > > order of their score. Since the scoring function is
> > > > homogeneous,
> > > > > > > this
> > > > > > > > > > > means that across nodes, we can compare scores and merge
> > > > sort.
> > > > > > > > > > >
> > > > > > > > > > > Please let me know if I can take this up.
> > > > > > > > > > >
> > > > > > > > > > > Atri
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Atri
> > > > > > > > > > > Apache Concerted
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Alexei Scherbakov
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Atri
> > > > > > > > Apache Concerted
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Andrey V. Mashenkov
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Atri
> > > > Apache Concerted
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
>
>
> --
> Best regards,
> Andrey V. Mashenkov

-- 
Regards,

Atri
Apache Concerted

Re: Text Queries Support

Posted by Andrey Mashenkov <an...@gmail.com>.

Atri,

> We can have partition level indices on each node.
A per-partition Lucene index looks simple to implement, but it may require
per-partition SQL to make full-text search expressions work correctly
within the SQL query.
Per-partition SQL indexes may kill performance. We already tried that in
Ignite 2: the QueryParallelism feature helps to speed up some
data-intensive queries, but it hurts performance in simple cases, and at
some point (e.g. when the number of segments exceeds the number of CPUs)
performance degrades rapidly as the number of segments grows.

AFAIK, Lucene widely uses bitmap indices that are easy to merge.
Maybe, the map-reduce technique underneath FTS expressions and some hacks
will add a minimal overhead.


> We can use Lucene's MMapped directory.
> Lucene does not have WAL logs but is append only.
Yes.
Also, a per-partition index will require at least one file (or more?) per
partition, and the OS usually has tight limits on the number of open file
descriptors.
'Append only' sounds good; that might be helpful.


> As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> Lucene indices. The important thing here is to not treat Lucene
> indices as source of truth.
To use the WAL, we should either map Lucene files onto our page memory or
be aware of the Lucene file structure.
The first looks tricky, as we would have to guarantee a contiguous address
space in page memory to mirror a Lucene file. Maybe a separate managed
memory segment with its own rules?
The second looks like overkill.

>> Transactions.
>> * Will we support transactions?
> Lucene has no concept of transactions.
Yes, but we do.
The Lucene index may be non-transactional, but users never expect to see
uncommitted data.
How does this connect with transactional SQL?



On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <at...@apache.org> wrote:

> Sorry, I planned on creating a Wiki page for this, but it makes more
> sense to be replying here.
>
> > * How Lucene index can be split among the nodes?
>
> We can have partition level indices on each node.
>
> > * If we'll have a single index for all partitions on the particular node,
> > then how index records will be aware of partitioning?
>
> Index records dont need to be aware of partitioning -- each Lucene
> index is independent.
>
> > This is important to filter out backup records from the results to avoid
> > duplicates.
>
> We can merge documents from different nodes and remove duplicates as
> long as docIDs are globally unique.
>
> > * How results from several nodes can be merged on the Reduce stage?
>
> As long as documents have a globally unique docID, Lucene has merge
> functions that can merge results from multiple partial results.
>
> > * Does Lucene supports smth like JOIN operation or others that may
> require
> > data from another partition or index?
>
> As illustrated by Ilya, Block-Join works for us.
>
> > If so, then it likes to multistep query with merging results on
> > intermediate stages and requires detailed investigation and design.
> > It is ok if Ignite will have some limitations here, but we would like to
> > know about them at the early stage.
>
> > * How effectively map Lucene files to the page memory? Is it even
> possible?
>
> Lucene has PageDirectory implementations which allow storing Lucene
> indices on different kind of file structures. It has a
> MMappedFileDirectory that we could use?
>
> > Otherwise, how to deal with potential OOM on large queries and memory
> > capacity planning?
>
> We can use Lucene's MMapped directory.
>
> >
> > Persistence.
> > * How and what consistency guarantees could we have/expect?
>
> Lucene does not have WAL logs but is append only
>
> > Seems, we may not be able to write physical records for Lucene index to
> our
> > WAL. What can we do with this?
>
> As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> Lucene indices. The important thing here is to not treat Lucene
> indices as source of truth.
> >
> > Transactions.
> > * Will we support transactions?
> Lucene has no concept of transactions.
>
> > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> > versions for the records?
> No
> > * What will be consistency guarantees?
> We can acknowledge writes back only after Lucene index is updated.
> >
> > UX
> > * How to add FullText search queries syntax into Calcite?
> Postgres's FTS functions are a good reference.
> > * AFAIK, the Lucene index has many properties for tuning. How will the
> user
> > configure the index?
> Most of those properties can be cluster level and exposed as a new sub
> config for ignite.
> > * How and where to store the settings? What are cluster-wide and what a
> > local to the particular node?
> All can be cluster level.
> > * Will be all the settings immutable? Can be they changed on-fly? after
> > node/grid restart?
> They should be applied post restart.
>
> > * Any limitations on query syntax?
> It depends on how we model our queries for text search.
>
> >
> > SQL
> > * Will we support FullText search in SQL?
> We need custom functions for it. See Postgres's FTS functions.
> > * How to integrate Lucene index into Calcite? What is the cost model?
> There cannot be any cost model since there are no paths for a text
> query. If we see a text query, we have to use Lucene index or return
> an error. In this way, we need to model text search as a set of UDFs
>
> > Splitting rules? Traits?
> Please see my reply above.
> >
> >
> > With all of this, you can go with the IEP (or even some short summary)
> and
> > further POC and implementation.
> > That's a big deal, so let's discuss what could be done here.
> >
> > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org> wrote:
> >
> > > I am actually happy to drive the feature for Ignite 3. FTS is very
> > > important for me and I think Ignite users will benefit from it
> > > greatly.
> > >
> > > If it makes sense to be focusing on Ignite 3 for this capability, I am
> > > eager to contribute there and lead the development.
> > >
> > > Please share your thoughts.
> > >
> > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > > <an...@gmail.com> wrote:
> > > >
> > > > Hi Atri,
> > > >
> > > > All the Jira tickets we have on the Full-text search (FTS) thing are
> > > > targeted to Ignite 2.
> > > >
> > > > AFAIK, we want, but we have NOT committed to FTS support in Ignite 3,
> > > yet.
> > > > By the way, we are getting requests for this thing from the user
> side,
> > > and
> > > > definitely,
> > > > FTS would be a valuable feature for Ignite.
> > > >
> > > > It will be great if the one wants to drive it, any help will be
> > > appreciated.
> > > >
> > > >
> > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org>
> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > An update, please. I am working through persistence of Lucene index
> > > using
> > > > > Ignite Dictionary, and will be asking some questions soon.
> > > > >
> > > > > I had one doubt - - where does this change go? Ignite 3?
> > > > >
> > > > > Also, I know we want to build native support for text searches in
> > > Ignite 3.
> > > > > Is the work I am proposing here part of that, or will that be a
> > > separate
> > > > > effort?
> > > > >
> > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> ilya.kasnacheev@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hello!
> > > > > >
> > > > > > I think that number one is the most important one, then maybe it
> > > will see
> > > > > > more use and other deficiencies become more apparent, leading to
> more
> > > > > > tickets and visibility.
> > > > > >
> > > > > > Maybe 2. and 3. will even use a different approach when
> persistence
> > > is
> > > > > > implemented.
> > > > > >
> > > > > > Regards,
> > > > > > --
> > > > > > Ilya Kasnacheev
> > > > > >
> > > > > >
> > > > > > > On Mon, Jun 28, 2021 at 2:34 PM Atri Sharma <at...@apache.org> wrote:
> > > > > >
> > > > > > > Hello Again!
> > > > > > >
> > > > > > > I have been looking into the aforementioned and here are my
> follow
> > > up
> > > > > > > thoughts:
> > > > > > >
> > > > > > > 1. Support persistence of Lucene indexes.
> > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs
> > > fixing of
> > > > > > > moving partitions first)
> > > > > > > 3. Figure out how to return scores from nodes and use them as
> sort
> > > > > > > parameters on the coordinator node
> > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > > >
> > > > > > > Please let me know if this looks ok to make text queries
> > > functional?
> > > > > > >
> > > > > > > Atri
> > > > > > >
> > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > > <al...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi.
> > > > > > > >
> > > > > > > > One of the biggest issues with text queries is a lack of
> support
> > > for
> > > > > > > lucene
> > > > > > > > indices persistence, which makes this functionality useless
> if a
> > > > > > > > persistence is enabled.
> > > > > > > >
> > > > > > > > I would first take care of it.
> > > > > > > >
> > > > > > > > > On Mon, Jun 21, 2021 at 12:16 PM Maksim Timonin <timonin.maxim@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi, Atri!
> > > > > > > > >
> > > > > > > > > You're right, Actually there is a lack of support for
> > > TextQueries.
> > > > > > For
> > > > > > > the
> > > > > > > > > last ticket I'm doing I see some obvious issues with them
> (no
> > > page
> > > > > > size
> > > > > > > > > support, for example). I'm glad that somebody wants to
> maintain
> > > > > this
> > > > > > > > > functionality. Thanks a lot!
> > > > > > > > >
> > > > > > > > > For the MergeSort algorithm there is already a patch for
> that
> > > [1].
> > > > > > It's
> > > > > > > > > currently on review. This patch introduces an abstract
> reducer
> > > for
> > > > > > > > > CacheQueries with 2 implementations (unordered,
> merge-sort).
> > > Then
> > > > > > > TextQuery
> > > > > > > > > leverages on MergeSort to order results from multiple
> nodes by
> > > > > score.
> > > > > > > This
> > > > > > > > > patch also fixes the pageSize issue, I've mentioned before.
> > > Could
> > > > > you
> > > > > > > > > please check if it fully matches your idea? Any issues or
> > > comments
> > > > > > are
> > > > > > > > > welcome.
> > > > > > > > >
> > > > > > > > > I've prepared this ticket, because I need the MergeSort
> > > algorithm
> > > > > for
> > > > > > > the
> > > > > > > > > new type of queries I'm implementing (IndexQuery, it should
> > > also
> > > > > > > provide
> > > > > > > > > ordered results over multiple nodes). Currently I'm not
> > > planning to
> > > > > > go
> > > > > > > > > further with TextQuery, so if you're going to support this
> > > it'll
> > > > > be a
> > > > > > > great
> > > > > > > > > contribution, I think.
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> atri@apache.org>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > I have been looking into our text queries support and see
> > > that it
> > > > > > has
> > > > > > > > > > limited community support.
> > > > > > > > > >
> > > > > > > > > > Therefore, I volunteer to be the maintainer of the
> module and
> > > > > work
> > > > > > on
> > > > > > > > > > enhancing it further.
> > > > > > > > > >
> > > > > > > > > > First goal would be to move to Lucene 8.x, then work on
> > > sorted
> > > > > > reduce
> > > > > > > > > > - merge across nodes. Fundamentally, this is doable since
> > > Lucene
> > > > > > > ranks
> > > > > > > > > > documents according to their score, and documents are
> > > returned in
> > > > > > the
> > > > > > > > > > order of their score. Since the scoring function is
> > > homogeneous,
> > > > > > this
> > > > > > > > > > means that across nodes, we can compare scores and merge
> > > sort.
> > > > > > > > > >
> > > > > > > > > > Please let me know if I can take this up.
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > > Apache Concerted
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Alexei Scherbakov
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > >
> > > > > > > Atri
> > > > > > > Apache Concerted
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Andrey V. Mashenkov
> > >
> > > --
> > > Regards,
> > >
> > > Atri
> > > Apache Concerted
> > >
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
>
> --
> Regards,
>
> Atri
> Apache Concerted
>


-- 
Best regards,
Andrey V. Mashenkov

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

Sorry, I planned on creating a Wiki page for this, but it makes more
sense to reply here.

> * How Lucene index can be split among the nodes?

We can have partition level indices on each node.

> * If we'll have a single index for all partitions on the particular node,
> then how index records will be aware of partitioning?

Index records don't need to be aware of partitioning -- each Lucene
index is independent.

> This is important to filter out backup records from the results to avoid
> duplicates.

We can merge documents from different nodes and remove duplicates as
long as docIDs are globally unique.

> * How results from several nodes can be merged on the Reduce stage?

As long as documents have a globally unique docID, Lucene has merge
functions that can combine multiple partial results into one.
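
To make the reduce step concrete, here is a minimal sketch in plain
Python (not Ignite or Lucene code). It assumes each node returns its hits
already sorted by descending score, that scores are comparable across
nodes, and that every hit carries a globally unique docID. The names Hit
and merge_hits are hypothetical and exist only for this illustration.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical shape of a hit: (score, doc_id). Not an Ignite or Lucene type.
Hit = Tuple[float, str]


def merge_hits(per_node_hits: Iterable[List[Hit]], limit: int) -> List[Hit]:
    """Merge per-node hit lists (each sorted by descending score).

    A doc_id seen more than once (e.g. from a primary and a backup copy)
    is kept only the first time it appears in the merged order.
    """
    # heapq.merge does a lazy k-way merge; with reverse=True it expects
    # every input to be sorted from the largest key to the smallest.
    merged = heapq.merge(*per_node_hits, key=lambda hit: hit[0], reverse=True)

    seen = set()
    result: List[Hit] = []
    for score, doc_id in merged:
        if len(result) >= limit:
            break
        if doc_id in seen:
            continue  # duplicate from another node, skip it
        seen.add(doc_id)
        result.append((score, doc_id))
    return result
```

Because the merge is lazy, the reducer only pulls as many hits from each
node as it needs to fill the requested limit.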

> * Does Lucene supports smth like JOIN operation or others that may require
> data from another partition or index?

As illustrated by Ilya, Block-Join works for us.

> If so, then it likes to multistep query with merging results on
> intermediate stages and requires detailed investigation and design.
> It is ok if Ignite will have some limitations here, but we would like to
> know about them at the early stage.

> * How effectively map Lucene files to the page memory? Is it even possible?

Lucene has Directory implementations which allow storing Lucene
indices on different kinds of file structures. It has an
MMapDirectory that we could use?

> Otherwise, how to deal with potential OOM on large queries and memory
> capacity planning?

We can use Lucene's memory-mapped directory (MMapDirectory).

>
> Persistence.
> * How and what consistency guarantees could we have/expect?

Lucene does not have a WAL, but it is append-only.

> Seems, we may not be able to write physical records for Lucene index to our
> WAL. What can we do with this?

As illustrated by Ilya, we can use Ignite's WAL records to rebuild
Lucene indices. The important thing here is to not treat Lucene
indices as source of truth.
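
A rough sketch of the rebuild idea, in plain Python. The record shape
(key, text) with None as a removal marker, and the whitespace tokenizer,
are assumptions made for illustration; they stand in for Ignite's logical
WAL records and a Lucene analyzer, and rebuild_index is a hypothetical
name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical logical log record: (key, text); text=None means "removed".
LogRecord = Tuple[str, Optional[str]]


def rebuild_index(records: Iterable[LogRecord]) -> Dict[str, Set[str]]:
    """Replay logical records in order and return a term -> keys mapping.

    The log is the source of truth; the text index is derived from it and
    can be thrown away and rebuilt at any time.
    """
    # Keep only the latest state per key: later records override earlier ones.
    latest: Dict[str, Optional[str]] = {}
    for key, text in records:
        latest[key] = text

    index: Dict[str, Set[str]] = defaultdict(set)
    for key, text in latest.items():
        if text is None:
            continue  # the entry was removed, nothing to index
        for term in text.lower().split():
            index[term].add(key)
    return dict(index)
```

The point of the sketch is only that losing the index is recoverable as
long as the records it was derived from are durable.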
>
> Transactions.
> * Will we support transactions?
Lucene has no concept of transactions.

> * Should Lucene be aware of Transaction and track mvcc (or whatever)
> versions for the records?
No
> * What will be consistency guarantees?
We can acknowledge writes back only after Lucene index is updated.
>
> UX
> * How to add FullText search queries syntax into Calcite?
Postgres's FTS functions are a good reference.
> * AFAIK, the Lucene index has many properties for tuning. How will the user
> configure the index?
Most of those properties can be cluster level and exposed as a new sub
config for Ignite.
> * How and where to store the settings? What are cluster-wide and what a
> local to the particular node?
All can be cluster level.
> * Will be all the settings immutable? Can be they changed on-fly? after
> node/grid restart?
They should be applied post restart.

> * Any limitations on query syntax?
It depends on how we model our queries for text search.

>
> SQL
> * Will we support FullText search in SQL?
We need custom functions for it. See Postgres's FTS functions.
> * How to integrate Lucene index into Calcite? What is the cost model?
There cannot be any cost model since there are no alternative access
paths for a text query. If we see a text query, we either use the Lucene
index or return an error. So we need to model text search as a set of UDFs.
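
As a sketch of what "use the index or return an error" could look like,
here is a small plain-Python model. Everything in it is hypothetical: the
term function, the TEXT_FUNCTIONS registry and the dictionary standing in
for a text index are illustration only, not Ignite, Calcite or Lucene
APIs.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical stand-in for a per-table text index: term -> matching row keys.
TextIndex = Dict[str, Set[str]]


class NoTextIndexError(Exception):
    """Raised when a text-search function targets a table with no text index."""


def term(index: TextIndex, value: str) -> Set[str]:
    """Hypothetical TERM(...) function: exact lookup of a single term."""
    return set(index.get(value.lower(), set()))


# Registry of text-search functions exposed to SQL (hypothetical).
TEXT_FUNCTIONS: Dict[str, Callable[[TextIndex, str], Set[str]]] = {"term": term}


def evaluate_text_predicate(
    func_name: str, arg: str, indexes: Dict[str, TextIndex], table: str
) -> Set[str]:
    """Evaluate a text predicate through the table's text index, or fail.

    There is no fallback plan such as a full scan, so no cost comparison
    is needed: either the index exists and is used, or the query is rejected.
    """
    if func_name not in TEXT_FUNCTIONS:
        raise ValueError(f"unknown text function: {func_name}")
    index: Optional[TextIndex] = indexes.get(table)
    if index is None:
        raise NoTextIndexError(f"table {table} has no text index")
    return TEXT_FUNCTIONS[func_name](index, arg)
```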

> Splitting rules? Traits?
Please see my reply above.
>
>
> With all of this, you can go with the IEP (or even some short summary) and
> further POC and implementation.
> That's a big deal, so let's discuss what could be done here.
>
> On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org> wrote:
>
> > I am actually happy to drive the feature for Ignite 3. FTS is very
> > important for me and I think Ignite users will benefit from it
> > greatly.
> >
> > If it makes sense to be focusing on Ignite 3 for this capability, I am
> > eager to contribute there and lead the development.
> >
> > Please share your thoughts.
> >
> > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > <an...@gmail.com> wrote:
> > >
> > > Hi Atri,
> > >
> > > All the Jira tickets we have on the Full-text search (FTS) thing are
> > > targeted to Ignite 2.
> > >
> > > AFAIK, we want, but we have NOT committed to FTS support in Ignite 3,
> > yet.
> > > By the way, we are getting requests for this thing from the user side,
> > and
> > > definitely,
> > > FTS would be a valuable feature for Ignite.
> > >
> > > It will be great if the one wants to drive it, any help will be
> > appreciated.
> > >
> > >
> > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > An update, please. I am working through persistence of Lucene index
> > using
> > > > Ignite Dictionary, and will be asking some questions soon.
> > > >
> > > > I had one doubt - - where does this change go? Ignite 3?
> > > >
> > > > Also, I know we want to build native support for text searches in
> > Ignite 3.
> > > > Is the work I am proposing here part of that, or will that be a
> > separate
> > > > effort?
> > > >
> > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <ilya.kasnacheev@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hello!
> > > > >
> > > > > I think that number one is the most important one, then maybe it
> > will see
> > > > > more use and other deficiencies become more apparent, leading to more
> > > > > tickets and visibility.
> > > > >
> > > > > Maybe 2. and 3. will even use a different approach when persistence
> > is
> > > > > implemented.
> > > > >
> > > > > Regards,
> > > > > --
> > > > > Ilya Kasnacheev
> > > > >
> > > > >
> > > > > > On Mon, Jun 28, 2021 at 2:34 PM Atri Sharma <at...@apache.org> wrote:
> > > > >
> > > > > > Hello Again!
> > > > > >
> > > > > > I have been looking into the aforementioned and here are my follow
> > up
> > > > > > thoughts:
> > > > > >
> > > > > > 1. Support persistence of Lucene indexes.
> > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs
> > fixing of
> > > > > > moving partitions first)
> > > > > > 3. Figure out how to return scores from nodes and use them as sort
> > > > > > parameters on the coordinator node
> > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > >
> > > > > > Please let me know if this looks ok to make text queries
> > functional?
> > > > > >
> > > > > > Atri
> > > > > >
> > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > <al...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi.
> > > > > > >
> > > > > > > One of the biggest issues with text queries is a lack of support
> > for
> > > > > > lucene
> > > > > > > indices persistence, which makes this functionality useless if a
> > > > > > > persistence is enabled.
> > > > > > >
> > > > > > > I would first take care of it.
> > > > > > >
> > > > > > > > On Mon, Jun 21, 2021 at 12:16 PM Maksim Timonin <timonin.maxim@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi, Atri!
> > > > > > > >
> > > > > > > > You're right, Actually there is a lack of support for
> > TextQueries.
> > > > > For
> > > > > > the
> > > > > > > > last ticket I'm doing I see some obvious issues with them (no
> > page
> > > > > size
> > > > > > > > support, for example). I'm glad that somebody wants to maintain
> > > > this
> > > > > > > > functionality. Thanks a lot!
> > > > > > > >
> > > > > > > > For the MergeSort algorithm there is already a patch for that
> > [1].
> > > > > It's
> > > > > > > > currently on review. This patch introduces an abstract reducer
> > for
> > > > > > > > CacheQueries with 2 implementations (unordered, merge-sort).
> > Then
> > > > > > TextQuery
> > > > > > > > leverages on MergeSort to order results from multiple nodes by
> > > > score.
> > > > > > This
> > > > > > > > patch also fixes the pageSize issue, I've mentioned before.
> > Could
> > > > you
> > > > > > > > please check if it fully matches your idea? Any issues or
> > comments
> > > > > are
> > > > > > > > welcome.
> > > > > > > >
> > > > > > > > I've prepared this ticket, because I need the MergeSort
> > algorithm
> > > > for
> > > > > > the
> > > > > > > > new type of queries I'm implementing (IndexQuery, it should
> > also
> > > > > > provide
> > > > > > > > ordered results over multiple nodes). Currently I'm not
> > planning to
> > > > > go
> > > > > > > > further with TextQuery, so if you're going to support this
> > it'll
> > > > be a
> > > > > > great
> > > > > > > > contribution, I think.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <at...@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > > I have been looking into our text queries support and see
> > that it
> > > > > has
> > > > > > > > > limited community support.
> > > > > > > > >
> > > > > > > > > Therefore, I volunteer to be the maintainer of the module and
> > > > work
> > > > > on
> > > > > > > > > enhancing it further.
> > > > > > > > >
> > > > > > > > > First goal would be to move to Lucene 8.x, then work on
> > sorted
> > > > > reduce
> > > > > > > > > - merge across nodes. Fundamentally, this is doable since
> > Lucene
> > > > > > ranks
> > > > > > > > > documents according to their score, and documents are
> > returned in
> > > > > the
> > > > > > > > > order of their score. Since the scoring function is
> > homogeneous,
> > > > > this
> > > > > > > > > means that across nodes, we can compare scores and merge
> > sort.
> > > > > > > > >
> > > > > > > > > Please let me know if I can take this up.
> > > > > > > > >
> > > > > > > > > Atri
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Atri
> > > > > > > > > Apache Concerted
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Alexei Scherbakov
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Atri
> > > > > > Apache Concerted
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
>
>
> --
> Best regards,
> Andrey V. Mashenkov

-- 
Regards,

Atri
Apache Concerted

Re: Text Queries Support

Posted by Courtney Robinson <co...@hypi.io>.

+1, we're all saying the same thing here.

In my earlier example,

    select x from T0 where term(args to solr term query) AND ...

term(xxx) was meant to indicate a Lucene term query, so there would be a
list of Lucene functions exposed in a similar way.
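
To make the term(xxx) idea concrete, here is a minimal sketch in plain Java.
Everything in it is an assumption made for illustration: the class, the Row
type and the column names are invented, and a toy inverted index stands in
for Lucene. It is not how Calcite, Lucene or Ignite implement this; it only
shows a term() lookup taking part in the rest of a WHERE clause.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch only: a toy inverted index stands in for Lucene; nothing here is Lucene, Calcite or Ignite API. */
final class TermFunctionSketch {
    /** One row of the example table T0 (invented columns). */
    static final class Row {
        final int id;
        final String x;
        final String body;

        Row(int id, String x, String body) {
            this.id = id;
            this.x = x;
            this.body = body;
        }
    }

    /** Token -> ids of the rows whose body contains that token. */
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Splits the body on non-word characters and records each lowercased token. */
    void index(Row row) {
        for (String token : row.body.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty())
                postings.computeIfAbsent(token, t -> new HashSet<>()).add(row.id);
        }
    }

    /** Stands in for term(xxx): an exact single-token lookup, returned as a row filter. */
    Predicate<Row> term(String value) {
        Set<Integer> ids = postings.getOrDefault(value.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        return row -> ids.contains(row.id);
    }

    /** Evaluates "select x from T0 where <filter>" over an in-memory table. */
    static List<String> selectX(List<Row> t0, Predicate<Row> filter) {
        List<String> out = new ArrayList<>();
        for (Row row : t0) {
            if (filter.test(row))
                out.add(row.x);
        }
        return out;
    }

    /** Runs: select x from T0 where term('lucene') AND id > 1. */
    static List<String> demo() {
        List<Row> t0 = Arrays.asList(
            new Row(1, "a", "Lucene term query"),
            new Row(2, "b", "Ignite text query backed by Lucene"),
            new Row(3, "c", "plain SQL only"));
        TermFunctionSketch idx = new TermFunctionSketch();
        for (Row row : t0)
            idx.index(row);
        return selectX(t0, idx.term("lucene").and(row -> row.id > 1));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of returning a Predicate is that the text filter composes with
ordinary SQL conditions through and(), which is the behaviour I meant by
"AND ..." in the query.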


On Mon, Jul 26, 2021 at 5:45 PM Atri Sharma <at...@apache.org> wrote:

> +1
>
> Lets expose custom functions in Ignite SQL which allows us to use the full
> capabilities that Lucene offers
>

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

+1

Let's expose custom functions in Ignite SQL that allow us to use the full
capabilities that Lucene offers.
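
As a rough illustration of what "a set of custom functions" could look like,
here is a small sketch of a function registry. All names are hypothetical
(TextFunctionRegistrySketch, term, prefix, phrase), and a toy tokenizer
stands in for Lucene analysis; none of this is existing Ignite or Lucene API.
It only shows a SQL layer resolving a function name to a text matcher.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical sketch of a registry of SQL-visible text functions; not Ignite or Lucene API. */
final class TextFunctionRegistrySketch {
    /** Function name -> matcher taking (column text, function argument). */
    private static final Map<String, BiPredicate<String, String>> FUNCTIONS = new LinkedHashMap<>();

    static {
        // term: the column text contains the argument as a whole token.
        FUNCTIONS.put("term", (text, arg) -> tokens(text).contains(arg.toLowerCase(Locale.ROOT)));
        // prefix: at least one token starts with the argument.
        FUNCTIONS.put("prefix", (text, arg) ->
            tokens(text).stream().anyMatch(t -> t.startsWith(arg.toLowerCase(Locale.ROOT))));
        // phrase: the argument's tokens appear consecutively and in order.
        FUNCTIONS.put("phrase", (text, arg) -> Collections.indexOfSubList(tokens(text), tokens(arg)) >= 0);
    }

    /** Toy analyzer: lowercase, then split on non-word characters. */
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty())
                out.add(t);
        }
        return out;
    }

    /** What a SQL layer might do on seeing name(column, 'arg') in a WHERE clause. */
    static boolean call(String name, String columnText, String arg) {
        BiPredicate<String, String> fn = FUNCTIONS.get(name.toLowerCase(Locale.ROOT));
        if (fn == null)
            throw new IllegalArgumentException("Unknown text function: " + name);
        return fn.test(columnText, arg);
    }

    public static void main(String[] args) {
        String body = "Ignite text queries are backed by Lucene";
        System.out.println(call("term", body, "lucene"));
        System.out.println(call("prefix", body, "quer"));
        System.out.println(call("phrase", body, "backed by lucene"));
        System.out.println(call("phrase", body, "lucene backed"));
    }
}
```

In a real integration each entry would delegate to the corresponding Lucene
query type rather than to string matching; the registry shape is the only
part this sketch is meant to suggest.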

On Mon, 26 Jul 2021, 21:51 Andrey Mashenkov, <an...@gmail.com>
wrote:

> Val,
>
> > I believe this is something we can look into in the scope of Ignite 3.
> > Andrey, does Calcite have any support for this? What's your view on this?
>
> As Atri already mentioned, SQL 92 standard declares "LIKE" operator for
> pattern matching.
> Calcite supports LIKE operator.
>
> I've found it is a RexNode (expression) and I doubt it supports indices.
> Maybe, LIKE can use a sorted index for prefix matching or equality
> conditions, but it is very far from what we are talking about.
>
> Full-text search term is much wider than just a pattern matching.
> Lucene provides much more capabilities on that and has rich
> syntax contrary to "LIKE" operator.
> So, LIKE operator is the standard operator with the defined contract. I'm
> not sure it is worth integrating Lucene just for it.
> I think we should have native support for full-text search queries and/or a
> custom SQL function.
>
> E.g. Postgres syntax for FTS queries [1] is completely different to "LIKE"
> operator.
>
> [1]
>
> https://www.postgresql.org/docs/9.5/textsearch-intro.html#TEXTSEARCH-MATCHING
>
> On Sat, Jul 24, 2021 at 4:49 PM Courtney Robinson <
> courtney.robinson@hypi.io>
> wrote:
>
> > Hey Ari,
> > Yes, I wasn't suggesting that Solr should be used. That's just what we're
> > doing now out of necessity.
> > It was more the fact that Calcite's SqlOperator can be used to provide the
> > interface to Lucene.
> > For all the reasons you mentioned and more, using Lucene is the right
> > choice
> >
> > Calcite doesn't have support for Solr but it has an ES adapter which is
> > what we modified to support Solr.
> >
> > Regards,
> > Courtney Robinson
> > Founder and CEO, Hypi
> > Tel: ++44 208 123 2413 (GMT+0) <https://hypi.io>
> >
> > <https://hypi.io>
> > https://hypi.io
> >
> >
> > On Sat, Jul 24, 2021 at 1:59 PM Atri Sharma <at...@apache.org> wrote:
> >
> > > What that entails is that the end user has to keep a Solr cluster running,
> > > which comes with its own challenges (now you have to manage two systems
> > > instead of one).
> > >
> > > I believe Calcite has native support for Solr?
> > >
> > > OTOH, having native Lucene indices allow us to control per partition
> > > indices with no distributed overhead, since Lucene is a per node instance
> > > with no global coordination.
> > >
> > > On Sat, 24 Jul 2021, 16:57 Courtney Robinson, <courtney.robinson@hypi.io>
> > > wrote:
> > >
> > > > I'll add in here.
> > > > I agree with you Valentin, the decoupled state of text queries makes it
> > > > useless for most use cases we have.
> > > >
> > > > As it relates to Calcite and Ignite 3, one approach (the one we're taking
> > > > because we use calcite independent of Ignite) is to provide a bunch of SQL
> > > > functions that we implement as SqlOperator
> > > > <https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/SqlOperator.html>.
> > > > I forget how we've done aggregation functions but we have those too and
> > > > they map to Solr aggregations (which ultimately end up in lucene).
> > > >
> > > > This allows Solr filters to take part in the rest of the query. It's
> > > > probably more complex than this for Ignite but that's one possible route
> > > > but we generate queries like select x from T0 where term(args to solr term
> > > > query) AND ...
> > > >
> > > > Regards,
> > > > Courtney Robinson
> > > > Founder and CEO, Hypi
> > > > Tel: ++44 208 123 2413 (GMT+0) <https://hypi.io>
> > > >
> > > > <https://hypi.io>
> > > > https://hypi.io
> > > >
> > > >
> > > > On Fri, Jul 23, 2021 at 7:14 PM Valentin Kulichenko <
> > > > valentin.kulichenko@gmail.com> wrote:
> > > >
> > > > > Atri,
> > > > >
> > > > > Sure, go ahead. Let's put the ideas on paper and have a discussion.
> > > > >
> > > > > -Val
> > > > >
> > > > > On Fri, Jul 23, 2021 at 10:59 AM Atri Sharma <at...@apache.org> wrote:
> > > > >
> > > > > > Thanks Andrey.
> > > > > >
> > > > > > I have collected answers or proposals to many of these questions and
> > > > > > would like to start a wiki page covering what we can do for Ignite 3.
> > > > > >
> > > > > > Does that sound good, please?
> > > > > >
> > > > > > On Fri, Jul 23, 2021 at 4:26 PM Andrey Mashenkov
> > > > > > <an...@gmail.com> wrote:
> > > > > > >
> > > > > > > Atri,
> > > > > > >
> > > > > > > First of all, I'd recommend going through the Ignite ticket to gather
> > > > > > > information about the current implementation issues and users' wants.
> > > > > > > Then look at a code to get a complete understanding of how things work now,
> > > > > > > which may help in future decisions.
> > > > > > >
> > > > > > > As we use the outdated Lucene version, some things may be irrelevant for
> > > > > > > the latest Lucene version.
> > > > > > > So, you will need expertise in the internals of modern Lucene version to
> > > > > > > understand what capabilities, guarantees, and limitations Lucene has and
> > > > > > > could bring to the Ignite.
> > > > > > > The expertise could be got from the Lucene project code or Lucene project
> > > > > > > dev-list.
> > > > > > >
> > > > > > >
> > > > > > > As for now, the potential capabilities are not clear to me.
> > > > > > > At first glance, I see the next topics that must be covered at first:
> > > > > > >
> > > > > > > General questions
> > > > > > > * How Lucene index can be split among the nodes?
> > > > > > > * If we'll have a single index for all partitions on the particular node,
> > > > > > > then how index records will be aware of partitioning?
> > > > > > > This is important to filter out backup records from the results to avoid
> > > > > > > duplicates.
> > > > > > > * How results from several nodes can be merged on the Reduce stage?
> > > > > > > * Does Lucene supports smth like JOIN operation or others that may require
> > > > > > > data from another partition or index?
> > > > > > > If so, then it likes to multistep query with merging results on
> > > > > > > intermediate stages and requires detailed investigation and design.
> > > > > > > It is ok if Ignite will have some limitations here, but we would like to
> > > > > > > know about them at the early stage.
> > > > > > > * How effectively map Lucene files to the page memory? Is it even possible?
> > > > > > > Otherwise, how to deal with potential OOM on large queries and memory
> > > > > > > capacity planning?
> > > > > > >
> > > > > > > Persistence.
> > > > > > > * How and what consistency guarantees could we have/expect?
> > > > > > > Seems, we may not be able to write physical records for Lucene index to our
> > > > > > > WAL. What can we do with this?
> > > > > > >
> > > > > > > Transactions.
> > > > > > > * Will we support transactions?
> > > > > > > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> > > > > > > versions for the records?
> > > > > > > * What will be consistency guarantees?
> > > > > > >
> > > > > > > UX
> > > > > > > * How to add FullText search queries syntax into Calcite?
> > > > > > > * AFAIK, the Lucene index has many properties for tuning. How will the user
> > > > > > > configure the index?
> > > > > > > * How and where to store the settings? What are cluster-wide and what a
> > > > > > > local to the particular node?
> > > > > > > * Will be all the settings immutable? Can be they changed on-fly? after
> > > > > > > node/grid restart?
> > > > > > > * Any limitations on query syntax?
> > > > > > >
> > > > > > > SQL
> > > > > > > * Will we support FullText search in SQL?
> > > > > > > * How to integrate Lucene index into Calcite? What is the cost model?
> > > > > > > Splitting rules? Traits?
> > > > > > > * What about consistency with DDL operations, e.g. column rename?
> > > > > > > Ignite indices will operate column ID, so rename operation will not affect
> > > > > > > the index.
> > > > > > >
> > > > > > >
> > > > > > > With all of this, you can go with the IEP (or even some short summary) and
> > > > > > > further POC and implementation.
> > > > > > > That's a big deal, so let's discuss what could be done here.
> > > > > > >
> > > > > > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org>
> > > > wrote:
> > > > > > >
> > > > > > > > I am actually happy to drive the feature for Ignite 3. FTS is
> > > very
> > > > > > > > important for me and I think Ignite users will benefit from
> it
> > > > > > > > greatly.
> > > > > > > >
> > > > > > > > If it makes sense to be focusing on Ignite 3 for this
> > > capability, I
> > > > > am
> > > > > > > > eager to contribute there and lead the development.
> > > > > > > >
> > > > > > > > Please share your thoughts.
> > > > > > > >
> > > > > > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > > > > > > > <an...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Atri,
> > > > > > > > >
> > > > > > > > > All the Jira tickets we have on the Full-text search (FTS)
> > > thing
> > > > > are
> > > > > > > > > targeted to Ignite 2.
> > > > > > > > >
> > > > > > > > > AFAIK, we want, but we have NOT committed to FTS support in
> > > > Ignite
> > > > > 3,
> > > > > > > > yet.
> > > > > > > > > By the way, we are getting requests for this thing from the
> > > user
> > > > > > side,
> > > > > > > > and
> > > > > > > > > definitely,
> > > > > > > > > FTS would be a valuable feature for Ignite.
> > > > > > > > >
> > > > > > > > > It will be great if the one wants to drive it, any help
> will
> > be
> > > > > > > > appreciated.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <
> > atri@apache.org>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > An update, please. I am working through persistence of
> > Lucene
> > > > > index
> > > > > > > > using
> > > > > > > > > > Ignite Dictionary, and will be asking some questions
> soon.
> > > > > > > > > >
> > > > > > > > > > I had one doubt - - where does this change go? Ignite 3?
> > > > > > > > > >
> > > > > > > > > > Also, I know we want to build native support for text
> > > searches
> > > > in
> > > > > > > > Ignite 3.
> > > > > > > > > > Is the work I am proposing here part of that, or will
> that
> > > be a
> > > > > > > > separate
> > > > > > > > > > effort?
> > > > > > > > > >
> > > > > > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> > > > > > ilya.kasnacheev@gmail.com
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello!
> > > > > > > > > > >
> > > > > > > > > > > I think that number one is the most important one, then
> > > maybe
> > > > > it
> > > > > > > > will see
> > > > > > > > > > > more use and other deficiencies become more apparent,
> > > leading
> > > > > to
> > > > > > more
> > > > > > > > > > > tickets and visibility.
> > > > > > > > > > >
> > > > > > > > > > > Maybe 2. and 3. will even use a different approach when
> > > > > > persistence
> > > > > > > > is
> > > > > > > > > > > implemented.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > --
> > > > > > > > > > > Ilya Kasnacheev
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > пн, 28 июн. 2021 г. в 14:34, Atri Sharma <
> > atri@apache.org
> > > >:
> > > > > > > > > > >
> > > > > > > > > > > > Hello Again!
> > > > > > > > > > > >
> > > > > > > > > > > > I have been looking into the aforementioned and here
> > are
> > > my
> > > > > > follow
> > > > > > > > up
> > > > > > > > > > > > thoughts:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Support persistence of Lucene indexes.
> > > > > > > > > > > > 2.
> https://issues.apache.org/jira/browse/IGNITE-12401
> > > > (Needs
> > > > > > > > fixing of
> > > > > > > > > > > > moving partitions first)
> > > > > > > > > > > > 3. Figure out how to return scores from nodes and use
> > > them
> > > > as
> > > > > > sort
> > > > > > > > > > > > parameters on the coordinator node
> > > > > > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > > > > > > > >
> > > > > > > > > > > > Please let me know if this looks ok to make text
> > queries
> > > > > > > > functional?
> > > > > > > > > > > >
> > > > > > > > > > > > Atri
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > > > > > > > <al...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi.
> > > > > > > > > > > > >
> > > > > > > > > > > > > One of the biggest issues with text queries is a
> lack
> > > of
> > > > > > support
> > > > > > > > for
> > > > > > > > > > > > lucene
> > > > > > > > > > > > > indices persistence, which makes this functionality
> > > > useless
> > > > > > if a
> > > > > > > > > > > > > persistence is enabled.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would first take care of it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > пн, 21 июн. 2021 г. в 12:16, Maksim Timonin <
> > > > > > > > timonin.maxim@gmail.com
> > > > > > > > > > >:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi, Atri!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > You're right, Actually there is a lack of support
> > for
> > > > > > > > TextQueries.
> > > > > > > > > > > For
> > > > > > > > > > > > the
> > > > > > > > > > > > > > last ticket I'm doing I see some obvious issues
> > with
> > > > them
> > > > > > (no
> > > > > > > > page
> > > > > > > > > > > size
> > > > > > > > > > > > > > support, for example). I'm glad that somebody
> wants
> > > to
> > > > > > maintain
> > > > > > > > > > this
> > > > > > > > > > > > > > functionality. Thanks a lot!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For the MergeSort algorithm there is already a
> > patch
> > > > for
> > > > > > that
> > > > > > > > [1].
> > > > > > > > > > > It's
> > > > > > > > > > > > > > currently on review. This patch introduces an
> > > abstract
> > > > > > reducer
> > > > > > > > for
> > > > > > > > > > > > > > CacheQueries with 2 implementations (unordered,
> > > > > > merge-sort).
> > > > > > > > Then
> > > > > > > > > > > > TextQuery
> > > > > > > > > > > > > > leverages on MergeSort to order results from
> > multiple
> > > > > > nodes by
> > > > > > > > > > score.
> > > > > > > > > > > > This
> > > > > > > > > > > > > > patch also fixes the pageSize issue, I've
> mentioned
> > > > > before.
> > > > > > > > Could
> > > > > > > > > > you
> > > > > > > > > > > > > > please check if it fully matches your idea? Any
> > > issues
> > > > or
> > > > > > > > comments
> > > > > > > > > > > are
> > > > > > > > > > > > > > welcome.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've prepared this ticket, because I need the
> > > MergeSort
> > > > > > > > algorithm
> > > > > > > > > > for
> > > > > > > > > > > > the
> > > > > > > > > > > > > > new type of queries I'm implementing (IndexQuery,
> > it
> > > > > should
> > > > > > > > also
> > > > > > > > > > > > provide
> > > > > > > > > > > > > > ordered results over multiple nodes). Currently
> I'm
> > > not
> > > > > > > > planning to
> > > > > > > > > > > go
> > > > > > > > > > > > > > further with TextQuery, so if you're going to
> > support
> > > > > this
> > > > > > > > it'll
> > > > > > > > > > be a
> > > > > > > > > > > > great
> > > > > > > > > > > > > > contribution, I think.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1]
> > > https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> > > > > > atri@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I have been looking into our text queries
> support
> > > and
> > > > > see
> > > > > > > > that it
> > > > > > > > > > > has
> > > > > > > > > > > > > > > limited community support.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Therefore, I volunteer to be the maintainer of
> > the
> > > > > > module and
> > > > > > > > > > work
> > > > > > > > > > > on
> > > > > > > > > > > > > > > enhancing it further.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > First goal would be to move to Lucene 8.x, then
> > > work
> > > > on
> > > > > > > > sorted
> > > > > > > > > > > reduce
> > > > > > > > > > > > > > > - merge across nodes. Fundamentally, this is
> > doable
> > > > > since
> > > > > > > > Lucene
> > > > > > > > > > > > ranks
> > > > > > > > > > > > > > > documents according to their score, and
> documents
> > > are
> > > > > > > > returned in
> > > > > > > > > > > the
> > > > > > > > > > > > > > > order of their score. Since the scoring
> function
> > is
> > > > > > > > homogeneous,
> > > > > > > > > > > this
> > > > > > > > > > > > > > > means that across nodes, we can compare scores
> > and
> > > > > merge
> > > > > > > > sort.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Please let me know if I can take this up.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Atri
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Atri
> > > > > > > > > > > > > > > Apache Concerted
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Alexei Scherbakov
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Regards,
> > > > > > > > > > > >
> > > > > > > > > > > > Atri
> > > > > > > > > > > > Apache Concerted
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > > Andrey V. Mashenkov
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Atri
> > > > > > > > Apache Concerted
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Andrey V. Mashenkov
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Atri
> > > > > > Apache Concerted
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>

Re: Text Queries Support

Posted by Andrey Mashenkov <an...@gmail.com>.

Val,

> I believe this is something we can look into in the scope of Ignite 3.
> Andrey, does Calcite have any support for this? What's your view on this?

As Atri already mentioned, the SQL-92 standard defines the "LIKE" operator
for pattern matching, and Calcite supports it.

From what I've found, it is a RexNode (an expression), and I doubt it can
use indices. Maybe LIKE could use a sorted index for prefix matching or
equality conditions, but that is very far from what we are talking about.
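
To make the prefix case concrete, here is a minimal Java sketch. It is not
Calcite or Ignite code: the class and method names are made up for
illustration, and the "sorted index" is assumed to be a plain TreeSet of
string keys. A pattern like 'lu%' becomes a range scan that starts at the
prefix and stops at the first key that no longer matches, while a pattern
like '%ni%' has to visit every key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical illustration only; not part of Ignite or Calcite. */
final class PrefixLikeSketch {
    /** LIKE 'prefix%': seek to the prefix in the sorted index and stop at the first non-matching key. */
    static List<String> prefixScan(NavigableSet<String> index, String prefix) {
        List<String> out = new ArrayList<>();
        for (String key : index.tailSet(prefix, true)) {
            // Keys sharing a prefix are contiguous in sorted order, so we can stop early.
            if (!key.startsWith(prefix))
                break;
            out.add(key);
        }
        return out;
    }

    /** LIKE '%infix%': the sort order does not help, so every key is checked. */
    static List<String> infixScan(NavigableSet<String> index, String infix) {
        List<String> out = new ArrayList<>();
        for (String key : index) {
            if (key.contains(infix))
                out.add(key);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> index =
            new TreeSet<>(List.of("apache", "ignite", "index", "lucene", "luke"));
        System.out.println(prefixScan(index, "lu"));
        System.out.println(infixScan(index, "ni"));
    }
}
```

Neither scan tokenizes text or ranks results, which is the gap between LIKE
and a real full-text index.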

Full-text search is a much wider term than plain pattern matching.
Lucene provides many more capabilities there and has a rich query
syntax, in contrast to the "LIKE" operator.
LIKE is a standard operator with a defined contract, so I'm not sure it
is worth integrating Lucene just for it.
I think we should have native support for full-text search queries and/or a
custom SQL function.

E.g. the Postgres syntax for FTS queries [1] is completely different from
the "LIKE" operator.

[1]
https://www.postgresql.org/docs/9.5/textsearch-intro.html#TEXTSEARCH-MATCHING
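
As a rough illustration of why term-based search differs from pattern
matching, below is a toy inverted index in Java. It is a sketch under
simplifying assumptions and not how Lucene is implemented: the class name is
made up, tokenization is just lowercasing and splitting on non-alphanumeric
characters, there is no stemming or scoring, and queries use plain AND
semantics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy index for illustration; not Lucene and not an Ignite API. */
final class TinyInvertedIndex {
    /** term -> ids of the documents that contain it */
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty())
                terms.add(t);
        }
        return terms;
    }

    void add(int docId, String text) {
        for (String term : tokenize(text))
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    /** Returns ids of documents containing every query term, regardless of word order. */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null)
                result = new TreeSet<>(docs);
            else
                result.retainAll(docs);
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

With documents "Full-text search in Ignite" and "Text search: full support"
indexed, the query "full text" finds both, while a LIKE '%full text%'
predicate would match neither because the literal substring never occurs.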



-- 
Best regards,
Andrey V. Mashenkov

Re: Text Queries Support

Posted by Courtney Robinson <co...@hypi.io>.

Hey Atri,
Yes, I wasn't suggesting that Solr should be used. That's just what we're
doing now out of necessity.
My point was more that Calcite's SqlOperator can be used to provide the
interface to Lucene.
For all the reasons you mentioned and more, using Lucene is the right choice.

Calcite doesn't have support for Solr but it has an ES adapter which is
what we modified to support Solr.

Regards,
Courtney Robinson
Founder and CEO, Hypi
Tel: +44 208 123 2413 (GMT+0)
https://hypi.io


On Sat, Jul 24, 2021 at 1:59 PM Atri Sharma <at...@apache.org> wrote:

> What that entails is that the end user has to keep a Solr cluster running,
> which comes with its own challenges (now you have to manage two systems
> instead of one).
>
> I believe Calcite has native support for Solr?
>
> OTOH, having native Lucene indices allow us to control per partition
> indices with no distributed overhead, since Lucene is a per node instance
> with no global coordination.
>
> > > > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > > > > > <an...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Atri,
> > > > > > >
> > > > > > > All the Jira tickets we have on the Full-text search (FTS)
> thing
> > > are
> > > > > > > targeted to Ignite 2.
> > > > > > >
> > > > > > > AFAIK, we want, but we have NOT committed to FTS support in
> > Ignite
> > > 3,
> > > > > > yet.
> > > > > > > By the way, we are getting requests for this thing from the
> user
> > > > side,
> > > > > > and
> > > > > > > definitely,
> > > > > > > FTS would be a valuable feature for Ignite.
> > > > > > >
> > > > > > > It will be great if the one wants to drive it, any help will be
> > > > > > appreciated.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org>
> > > > wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > An update, please. I am working through persistence of Lucene
> > > index
> > > > > > using
> > > > > > > > Ignite Dictionary, and will be asking some questions soon.
> > > > > > > >
> > > > > > > > I had one doubt - - where does this change go? Ignite 3?
> > > > > > > >
> > > > > > > > Also, I know we want to build native support for text
> searches
> > in
> > > > > > Ignite 3.
> > > > > > > > Is the work I am proposing here part of that, or will that
> be a
> > > > > > separate
> > > > > > > > effort?
> > > > > > > >
> > > > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <
> > > > ilya.kasnacheev@gmail.com
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hello!
> > > > > > > > >
> > > > > > > > > I think that number one is the most important one, then
> maybe
> > > it
> > > > > > will see
> > > > > > > > > more use and other deficiencies become more apparent,
> leading
> > > to
> > > > more
> > > > > > > > > tickets and visibility.
> > > > > > > > >
> > > > > > > > > Maybe 2. and 3. will even use a different approach when
> > > > persistence
> > > > > > is
> > > > > > > > > implemented.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > --
> > > > > > > > > Ilya Kasnacheev
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Mon, 28 Jun 2021 at 14:34, Atri Sharma <atri@apache.org>:
> > > > > > > > >
> > > > > > > > > > Hello Again!
> > > > > > > > > >
> > > > > > > > > > I have been looking into the aforementioned and here are
> my
> > > > follow
> > > > > > up
> > > > > > > > > > thoughts:
> > > > > > > > > >
> > > > > > > > > > 1. Support persistence of Lucene indexes.
> > > > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401
> > (Needs
> > > > > > fixing of
> > > > > > > > > > moving partitions first)
> > > > > > > > > > 3. Figure out how to return scores from nodes and use
> them
> > as
> > > > sort
> > > > > > > > > > parameters on the coordinator node
> > > > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > > > > > >
> > > > > > > > > > Please let me know if this looks ok to make text queries
> > > > > > functional?
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > >
> > > > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > > > > > <al...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi.
> > > > > > > > > > >
> > > > > > > > > > > One of the biggest issues with text queries is a lack
> of
> > > > support
> > > > > > for
> > > > > > > > > > lucene
> > > > > > > > > > > indices persistence, which makes this functionality
> > useless
> > > > if a
> > > > > > > > > > > persistence is enabled.
> > > > > > > > > > >
> > > > > > > > > > > I would first take care of it.
> > > > > > > > > > >
> > > > > > > > > > > Mon, 21 Jun 2021 at 12:16, Maksim Timonin <timonin.maxim@gmail.com>:
> > > > > > > > > > >
> > > > > > > > > > > > Hi, Atri!
> > > > > > > > > > > >
> > > > > > > > > > > > You're right, Actually there is a lack of support for
> > > > > > TextQueries.
> > > > > > > > > For
> > > > > > > > > > the
> > > > > > > > > > > > last ticket I'm doing I see some obvious issues with
> > them
> > > > (no
> > > > > > page
> > > > > > > > > size
> > > > > > > > > > > > support, for example). I'm glad that somebody wants
> to
> > > > maintain
> > > > > > > > this
> > > > > > > > > > > > functionality. Thanks a lot!
> > > > > > > > > > > >
> > > > > > > > > > > > For the MergeSort algorithm there is already a patch
> > for
> > > > that
> > > > > > [1].
> > > > > > > > > It's
> > > > > > > > > > > > currently on review. This patch introduces an
> abstract
> > > > reducer
> > > > > > for
> > > > > > > > > > > > CacheQueries with 2 implementations (unordered,
> > > > merge-sort).
> > > > > > Then
> > > > > > > > > > TextQuery
> > > > > > > > > > > > leverages on MergeSort to order results from multiple
> > > > nodes by
> > > > > > > > score.
> > > > > > > > > > This
> > > > > > > > > > > > patch also fixes the pageSize issue, I've mentioned
> > > before.
> > > > > > Could
> > > > > > > > you
> > > > > > > > > > > > please check if it fully matches your idea? Any
> issues
> > or
> > > > > > comments
> > > > > > > > > are
> > > > > > > > > > > > welcome.
> > > > > > > > > > > >
> > > > > > > > > > > > I've prepared this ticket, because I need the
> MergeSort
> > > > > > algorithm
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > new type of queries I'm implementing (IndexQuery, it
> > > should
> > > > > > also
> > > > > > > > > > provide
> > > > > > > > > > > > ordered results over multiple nodes). Currently I'm
> not
> > > > > > planning to
> > > > > > > > > go
> > > > > > > > > > > > further with TextQuery, so if you're going to support
> > > this
> > > > > > it'll
> > > > > > > > be a
> > > > > > > > > > great
> > > > > > > > > > > > contribution, I think.
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> > > > atri@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I have been looking into our text queries support
> and
> > > see
> > > > > > that it
> > > > > > > > > has
> > > > > > > > > > > > > limited community support.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Therefore, I volunteer to be the maintainer of the
> > > > module and
> > > > > > > > work
> > > > > > > > > on
> > > > > > > > > > > > > enhancing it further.
> > > > > > > > > > > > >
> > > > > > > > > > > > > First goal would be to move to Lucene 8.x, then
> work
> > on
> > > > > > sorted
> > > > > > > > > reduce
> > > > > > > > > > > > > - merge across nodes. Fundamentally, this is doable
> > > since
> > > > > > Lucene
> > > > > > > > > > ranks
> > > > > > > > > > > > > documents according to their score, and documents
> are
> > > > > > returned in
> > > > > > > > > the
> > > > > > > > > > > > > order of their score. Since the scoring function is
> > > > > > homogeneous,
> > > > > > > > > this
> > > > > > > > > > > > > means that across nodes, we can compare scores and
> > > merge
> > > > > > sort.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please let me know if I can take this up.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Atri
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Atri
> > > > > > > > > > > > > Apache Concerted
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Alexei Scherbakov
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > > Apache Concerted
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Andrey V. Mashenkov
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Atri
> > > > > > Apache Concerted
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Andrey V. Mashenkov
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Atri
> > > > Apache Concerted
> > > >
> > >
> >
>
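
The score-ordered reduce discussed in the quoted thread above (IGNITE-12291,
IGNITE-14703) can be illustrated with a k-way merge. This is only a minimal
sketch, not the code from the patch under review: it assumes each node returns
its hits already sorted by descending score and that scores from different
nodes are directly comparable. The names `Hit` and `merge_by_score` are
hypothetical.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

# A hit is a (score, key) pair. Each node is assumed to return its hits
# already sorted by descending score.
Hit = Tuple[float, str]


def merge_by_score(per_node_hits: Iterable[List[Hit]], limit: int) -> List[Hit]:
    """K-way merge of per-node result lists into one global top-`limit` list."""
    merged = heapq.merge(
        *per_node_hits,
        key=lambda hit: hit[0],
        reverse=True,  # inputs are sorted high-to-low, so merge high-to-low
    )
    # The merge is lazy, so only the first `limit` hits are ever materialized.
    return list(islice(merged, limit))


node_a = [(0.92, "doc-7"), (0.41, "doc-2")]
node_b = [(0.88, "doc-5"), (0.67, "doc-9"), (0.10, "doc-1")]
top3 = merge_by_score([node_a, node_b], limit=3)
```

Because the merge is lazy, the reducer only needs to pull as many entries from
each node as the requested page requires.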

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

That means the end user has to keep a Solr cluster running, which comes
with its own challenges (now you have to manage two systems instead of
one).

I believe Calcite has native support for Solr?

OTOH, having native Lucene indices allows us to control per-partition
indices with no distributed overhead, since Lucene runs as a per-node
instance with no global coordination.
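
To make the per-partition idea concrete, here is a toy sketch in plain Python.
It is not Lucene or Ignite code: a dictionary-based inverted index stands in
for a Lucene index, and the class name `NodeTextIndex` and its methods are
hypothetical. It shows one node keeping a separate index per partition and
searching only the partitions it owns as primary, which is one possible answer
to the backup-duplicates question raised elsewhere in this thread.

```python
from collections import defaultdict
from typing import Dict, List, Set


class NodeTextIndex:
    """Toy node-local text index: one inverted index per partition."""

    def __init__(self, primary_partitions: Set[int]) -> None:
        self.primary_partitions = set(primary_partitions)
        # partition id -> term -> keys of entries whose text contains the term
        self._by_partition: Dict[int, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def put(self, partition: int, key: str, text: str) -> None:
        """Index an entry under the partition it belongs to."""
        for term in text.lower().split():
            self._by_partition[partition][term].add(key)

    def search(self, term: str) -> List[str]:
        """Search only primary partitions so backup copies are not returned."""
        hits: Set[str] = set()
        for partition, index in self._by_partition.items():
            if partition in self.primary_partitions:
                hits |= index.get(term.lower(), set())
        return sorted(hits)


node = NodeTextIndex(primary_partitions={0})
node.put(0, "k1", "Apache Ignite text queries")
node.put(1, "k2", "Ignite backup copy")  # partition 1 is only a backup here
result = node.search("ignite")
```

Each search touches node-local state only; no coordination with other nodes is
needed until the results are reduced.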

On Sat, 24 Jul 2021, 16:57 Courtney Robinson, <co...@hypi.io>
wrote:

> I'll add in here.
> I agree with you, Valentin: the decoupled state of text queries makes them
> useless for most use cases we have.
>
> As it relates to Calcite and Ignite 3, one approach (the one we're taking,
> because we use Calcite independently of Ignite) is to provide a set of SQL
> functions that we implement as SqlOperator
> <https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/SqlOperator.html>.
> I forget how we've done aggregation functions, but we have those too, and
> they map to Solr aggregations (which ultimately end up in Lucene).
>
> This allows Solr filters to take part in the rest of the query. It's
> probably more complex than this for Ignite, but that's one possible route.
> We generate queries like
> select x from T0 where term(args to solr term query) AND ...
>
> Regards,
> Courtney Robinson
> Founder and CEO, Hypi
> Tel: ++44 208 123 2413 (GMT+0)
> https://hypi.io
>

Re: Text Queries Support

Posted by Courtney Robinson <co...@hypi.io>.

I'll add in here.
I agree with you, Valentin: the decoupled state of text queries makes them
useless for most use cases we have.

As it relates to Calcite and Ignite 3, one approach (the one we're taking,
because we use Calcite independently of Ignite) is to provide a set of SQL
functions that we implement as SqlOperator
<https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/SqlOperator.html>.
I forget how we've done aggregation functions, but we have those too, and
they map to Solr aggregations (which ultimately end up in Lucene).

This allows Solr filters to take part in the rest of the query. It's
probably more complex than this for Ignite, but that's one possible route.
We generate queries like
select x from T0 where term(args to solr term query) AND ...
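
The general shape of that approach, a text predicate exposed as a SQL function
so that it composes with ordinary filters, can be sketched with Python's
built-in sqlite3 module. This is only an illustration under stated assumptions:
SQLite stands in for Calcite, the registered `term` function is a hypothetical
token match rather than a Solr term query, and the `body` and `lang` columns
are invented for the example.

```python
import sqlite3


def term(field_value, wanted):
    """Hypothetical stand-in for a term query: 1 if `wanted` is a token of the text."""
    return int(wanted.lower() in (field_value or "").lower().split())


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call term(text, word).
conn.create_function("term", 2, term)
conn.execute("CREATE TABLE T0 (x INTEGER, body TEXT, lang TEXT)")
conn.executemany(
    "INSERT INTO T0 VALUES (?, ?, ?)",
    [
        (1, "Lucene index persistence", "en"),
        (2, "Solr cluster management", "en"),
        (3, "Lucene scoring", "de"),
    ],
)
# The text predicate takes part in the WHERE clause like any other filter.
rows = conn.execute(
    "SELECT x FROM T0 WHERE term(body, 'lucene') AND lang = 'en' ORDER BY x"
).fetchall()
```

In a real planner the function would presumably be pushed down to the text
index instead of being evaluated row by row as it is here.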

Regards,
Courtney Robinson
Founder and CEO, Hypi
Tel: ++44 208 123 2413 (GMT+0)
https://hypi.io


On Fri, Jul 23, 2021 at 7:14 PM Valentin Kulichenko <
valentin.kulichenko@gmail.com> wrote:

> Atri,
>
> Sure, go ahead. Let's put the ideas on paper and have a discussion.
>
> -Val
>
> On Fri, Jul 23, 2021 at 10:59 AM Atri Sharma <at...@apache.org> wrote:
>
> > Thanks Andrey.
> >
> > I have collected answers or proposals to many of these questions and
> > would like to start a wiki page covering what we can do for Ignite 3.
> >
> > Does that sound good, please?
> >
> > On Fri, Jul 23, 2021 at 4:26 PM Andrey Mashenkov
> > <an...@gmail.com> wrote:
> > >
> > > Atri,
> > >
> > > First of all, I'd recommend going through the Ignite tickets to gather
> > > information about the current implementation issues and users' wants.
> > > Then look at the code to get a complete understanding of how things
> > > work now, which may help in future decisions.
> > >
> > > As we use an outdated Lucene version, some things may be irrelevant
> > > for the latest Lucene version.
> > > So, you will need expertise in the internals of a modern Lucene
> > > version to understand what capabilities, guarantees, and limitations
> > > Lucene has and could bring to Ignite.
> > > That expertise could be gained from the Lucene project code or the
> > > Lucene project dev-list.
> > >
> > >
> > > As for now, the potential capabilities are not clear to me.
> > > At first glance, I see the following topics that must be covered first:
> > >
> > > General questions
> > > * How can a Lucene index be split among the nodes?
> > > * If we have a single index for all partitions on a particular node,
> > > then how will index records be aware of partitioning?
> > > This is important to filter out backup records from the results to
> > > avoid duplicates.
> > > * How can results from several nodes be merged at the Reduce stage?
> > > * Does Lucene support something like a JOIN operation, or others that
> > > may require data from another partition or index?
> > > If so, it looks like a multistep query with merging of results at
> > > intermediate stages, and it requires detailed investigation and design.
> > > It is OK if Ignite has some limitations here, but we would like to
> > > know about them at an early stage.
> > > * How can Lucene files be mapped to page memory effectively? Is it even
> > > possible?
> > > Otherwise, how do we deal with potential OOM on large queries and with
> > > memory capacity planning?
> > >
> > > Persistence.
> > > * What consistency guarantees could we have/expect, and how?
> > > It seems we may not be able to write physical records for the Lucene
> > > index to our WAL. What can we do about this?
> > >
> > > Transactions.
> > > * Will we support transactions?
> > > * Should Lucene be aware of transactions and track MVCC (or whatever)
> > > versions for the records?
> > > * What will the consistency guarantees be?
> > >
> > > UX
> > > * How do we add full-text search query syntax to Calcite?
> > > * AFAIK, the Lucene index has many properties for tuning. How will the
> > > user configure the index?
> > > * How and where do we store the settings? Which are cluster-wide and
> > > which are local to a particular node?
> > > * Will all the settings be immutable? Can they be changed on the fly?
> > > After node/grid restart?
> > > * Any limitations on query syntax?
> > >
> > > SQL
> > > * Will we support full-text search in SQL?
> > > * How do we integrate the Lucene index into Calcite? What is the cost
> > > model? Splitting rules? Traits?
> > > * What about consistency with DDL operations, e.g. column rename?
> > > Ignite indices will operate on column IDs, so a rename operation will
> > > not affect the index.
> > >
> > >
> > > With all of this, you can go with the IEP (or even some short summary)
> > > and further POC and implementation.
> > > That's a big deal, so let's discuss what could be done here.
> > >
> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org> wrote:
> > >
> > > > I am actually happy to drive the feature for Ignite 3. FTS is very
> > > > important for me and I think Ignite users will benefit from it
> > > > greatly.
> > > >
> > > > If it makes sense to be focusing on Ignite 3 for this capability,
> > > > I am eager to contribute there and lead the development.
> > > >
> > > > Please share your thoughts.
> > > >
> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > > > <an...@gmail.com> wrote:
> > > > >
> > > > > Hi Atri,
> > > > >
> > > > > All the Jira tickets we have on the full-text search (FTS) feature
> > > > > are targeted at Ignite 2.
> > > > >
> > > > > AFAIK, we want FTS support in Ignite 3, but we have NOT committed
> > > > > to it yet.
> > > > > By the way, we are getting requests for this from the user side,
> > > > > and FTS would definitely be a valuable feature for Ignite.
> > > > >
> > > > > It would be great if someone wants to drive it; any help will be
> > > > > appreciated.
> > > > >
> > > > >
> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > An update, please. I am working through persistence of the
> > > > > > Lucene index using Ignite Dictionary, and will be asking some
> > > > > > questions soon.
> > > > > >
> > > > > > I had one doubt: where does this change go? Ignite 3?
> > > > > >
> > > > > > Also, I know we want to build native support for text searches
> > > > > > in Ignite 3. Is the work I am proposing here part of that, or
> > > > > > will that be a separate effort?
> > > > > >
> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <ilya.kasnacheev@gmail.com> wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > I think that number one is the most important one; then maybe
> > > > > > > it will see more use and other deficiencies will become more
> > > > > > > apparent, leading to more tickets and visibility.
> > > > > > >
> > > > > > > Maybe 2. and 3. will even use a different approach when
> > > > > > > persistence is implemented.
> > > > > > >
> > > > > > > Regards,
> > > > > > > --
> > > > > > > Ilya Kasnacheev
> > > > > > >
> > > > > > >
> > > > > > > Mon, 28 Jun 2021 at 14:34, Atri Sharma <at...@apache.org>:
> > > > > > >
> > > > > > > > Hello Again!
> > > > > > > >
> > > > > > > > I have been looking into the aforementioned and here are my
> > > > > > > > follow-up thoughts:
> > > > > > > >
> > > > > > > > 1. Support persistence of Lucene indexes.
> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (needs
> > > > > > > > fixing of moving partitions first)
> > > > > > > > 3. Figure out how to return scores from nodes and use them as
> > > > > > > > sort parameters on the coordinator node
> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > > > >
> > > > > > > > Please let me know if this looks OK to make text queries
> > > > > > > > functional.
> > > > > > > >
> > > > > > > > Atri
> > > > > > > >
> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > > > <al...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi.
> > > > > > > > >
> > > > > > > > > One of the biggest issues with text queries is a lack of
> > support
> > > > for
> > > > > > > > lucene
> > > > > > > > > indices persistence, which makes this functionality useless
> > if a
> > > > > > > > > persistence is enabled.
> > > > > > > > >
> > > > > > > > > I would first take care of it.
> > > > > > > > >
> > > > > > > > > пн, 21 июн. 2021 г. в 12:16, Maksim Timonin <
> > > > timonin.maxim@gmail.com
> > > > > > >:
> > > > > > > > >
> > > > > > > > > > Hi, Atri!
> > > > > > > > > >
> > > > > > > > > > You're right, Actually there is a lack of support for
> > > > TextQueries.
> > > > > > > For
> > > > > > > > the
> > > > > > > > > > last ticket I'm doing I see some obvious issues with them
> > (no
> > > > page
> > > > > > > size
> > > > > > > > > > support, for example). I'm glad that somebody wants to
> > maintain
> > > > > > this
> > > > > > > > > > functionality. Thanks a lot!
> > > > > > > > > >
> > > > > > > > > > For the MergeSort algorithm there is already a patch for
> > that
> > > > [1].
> > > > > > > It's
> > > > > > > > > > currently on review. This patch introduces an abstract
> > reducer
> > > > for
> > > > > > > > > > CacheQueries with 2 implementations (unordered,
> > merge-sort).
> > > > Then
> > > > > > > > TextQuery
> > > > > > > > > > leverages on MergeSort to order results from multiple
> > nodes by
> > > > > > score.
> > > > > > > > This
> > > > > > > > > > patch also fixes the pageSize issue, I've mentioned
> before.
> > > > Could
> > > > > > you
> > > > > > > > > > please check if it fully matches your idea? Any issues or
> > > > comments
> > > > > > > are
> > > > > > > > > > welcome.
> > > > > > > > > >
> > > > > > > > > > I've prepared this ticket, because I need the MergeSort
> > > > algorithm
> > > > > > for
> > > > > > > > the
> > > > > > > > > > new type of queries I'm implementing (IndexQuery, it
> should
> > > > also
> > > > > > > > provide
> > > > > > > > > > ordered results over multiple nodes). Currently I'm not
> > > > planning to
> > > > > > > go
> > > > > > > > > > further with TextQuery, so if you're going to support
> this
> > > > it'll
> > > > > > be a
> > > > > > > > great
> > > > > > > > > > contribution, I think.
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <
> > atri@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > > > I have been looking into our text queries support and
> see
> > > > that it
> > > > > > > has
> > > > > > > > > > > limited community support.
> > > > > > > > > > >
> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the
> > module and
> > > > > > work
> > > > > > > on
> > > > > > > > > > > enhancing it further.
> > > > > > > > > > >
> > > > > > > > > > > First goal would be to move to Lucene 8.x, then work on
> > > > sorted
> > > > > > > reduce
> > > > > > > > > > > - merge across nodes. Fundamentally, this is doable
> since
> > > > Lucene
> > > > > > > > ranks
> > > > > > > > > > > documents according to their score, and documents are
> > > > returned in
> > > > > > > the
> > > > > > > > > > > order of their score. Since the scoring function is
> > > > homogeneous,
> > > > > > > this
> > > > > > > > > > > means that across nodes, we can compare scores and
> merge
> > > > sort.
> > > > > > > > > > >
> > > > > > > > > > > Please let me know if I can take this up.
> > > > > > > > > > >
> > > > > > > > > > > Atri
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Atri
> > > > > > > > > > > Apache Concerted
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Alexei Scherbakov
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Atri
> > > > > > > > Apache Concerted
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Andrey V. Mashenkov
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Atri
> > > > Apache Concerted
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
>

Re: Text Queries Support

Posted by Valentin Kulichenko <va...@gmail.com>.

Atri,

Sure, go ahead. Let's put the ideas on paper and have a discussion.

-Val

On Fri, Jul 23, 2021 at 10:59 AM Atri Sharma <at...@apache.org> wrote:

> Thanks Andrey.
>
> I have collected answers or proposals to many of these questions and
> would like to start a wiki page covering what we can do for Ignite 3.
>
> Does that sound good, please?

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

Thanks Andrey.

I have collected answers or proposals to many of these questions and
would like to start a wiki page covering what we can do for Ignite 3.

Does that sound good, please?


-- 
Regards,

Atri
Apache Concerted

Re: Text Queries Support

Posted by Andrey Mashenkov <an...@gmail.com>.

Atri,

First of all, I'd recommend going through the Ignite tickets to gather
information about the current implementation issues and what users want.
Then look at the code to get a complete understanding of how things work now,
which may help in future decisions.

As we use an outdated Lucene version, some things may be irrelevant for
the latest Lucene version.
So, you will need expertise in the internals of a modern Lucene version to
understand what capabilities, guarantees, and limitations Lucene has and
could bring to Ignite.
That expertise can be gained from the Lucene project code or the Lucene
project dev list.

For now, the potential capabilities are not clear to me.
At first glance, I see the following topics that must be covered first:

General questions
* How can a Lucene index be split among the nodes?
* If we have a single index for all partitions on a particular node,
then how will index records be aware of partitioning?
This is important to filter out backup records from the results to avoid
duplicates.
* How can results from several nodes be merged at the Reduce stage?
* Does Lucene support something like a JOIN operation, or others that may
require data from another partition or index?
If so, then it looks like a multistep query with merging of results at
intermediate stages, and it requires detailed investigation and design.
It is ok if Ignite has some limitations here, but we would like to
know about them at an early stage.
* How can Lucene files be mapped to page memory effectively? Is it even possible?
Otherwise, how do we deal with potential OOM on large queries and with memory
capacity planning?
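
To make two of the questions above more concrete (filtering out backup
records and merging at the Reduce stage), here is a minimal sketch, not a
proposal for the actual implementation. It assumes that scores computed on
different nodes are directly comparable, as suggested earlier in this thread,
and that each node returns its hits already sorted by descending score. The
names ScoredHit, ScoreMergeSketch, primaryOnly and merge are hypothetical and
are not Ignite or Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical types for illustration; none of these are Ignite or Lucene classes.
final class ScoredHit {
    final String key;    // cache key of the matching entry
    final int partition; // partition that owns the entry
    final float score;   // relevance score computed on the node

    ScoredHit(String key, int partition, float score) {
        this.key = key;
        this.partition = partition;
        this.score = score;
    }
}

final class ScoreMergeSketch {
    /** One node's result stream: the current head plus the remaining hits. */
    private static final class Cursor {
        ScoredHit head;
        final Iterator<ScoredHit> rest;

        Cursor(Iterator<ScoredHit> it) {
            rest = it;
            head = it.next();
        }
    }

    /** Map side: drop hits from partitions this node holds only as a backup. */
    static List<ScoredHit> primaryOnly(List<ScoredHit> localHits, Set<Integer> primaryParts) {
        List<ScoredHit> res = new ArrayList<>();
        for (ScoredHit hit : localHits) {
            if (primaryParts.contains(hit.partition))
                res.add(hit);
        }
        return res;
    }

    /** Reduce side: k-way merge of per-node lists, each sorted by descending score. */
    static List<ScoredHit> merge(List<List<ScoredHit>> perNode, int limit) {
        // Max-heap ordered by the head score of each node's cursor.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Float.compare(b.head.score, a.head.score));
        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits.iterator()));
        }
        List<ScoredHit> merged = new ArrayList<>();
        while (!heap.isEmpty() && merged.size() < limit) {
            Cursor cur = heap.poll();
            merged.add(cur.head);
            if (cur.rest.hasNext()) {
                cur.head = cur.rest.next();
                heap.add(cur); // re-insert the cursor with its new head
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<ScoredHit> node1 = Arrays.asList(new ScoredHit("a", 0, 0.9f), new ScoredHit("b", 1, 0.4f));
        // Node 2 holds partition 0 only as a backup, so its copy of "a" is filtered out.
        List<ScoredHit> node2 = Arrays.asList(new ScoredHit("a", 0, 0.9f), new ScoredHit("c", 2, 0.7f));

        List<ScoredHit> fromNode1 = primaryOnly(node1, new HashSet<>(Arrays.asList(0, 1)));
        List<ScoredHit> fromNode2 = primaryOnly(node2, new HashSet<>(Arrays.asList(2)));

        for (ScoredHit hit : merge(Arrays.asList(fromNode1, fromNode2), 10))
            System.out.println(hit.key + " " + hit.score);
    }
}
```

The heap holds one cursor per node, so the reducer only needs the current head
of each node's stream in memory. Whether Lucene scores from separate per-node
indexes are really comparable is exactly the kind of thing that needs to be
checked, since scoring depends on per-index statistics.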

Persistence.
* What consistency guarantees could we have/expect, and how do we provide them?
It seems we may not be able to write physical records for the Lucene index to
our WAL. What can we do about this?

Transactions.
* Will we support transactions?
* Should Lucene be aware of transactions and track MVCC (or whatever)
versions for the records?
* What will the consistency guarantees be?

UX
* How do we add full-text search query syntax to Calcite?
* AFAIK, the Lucene index has many properties for tuning. How will the user
configure the index?
* How and where do we store the settings? Which are cluster-wide and which are
local to a particular node?
* Will all the settings be immutable? Can they be changed on the fly? After a
node/grid restart?
* Any limitations on query syntax?

SQL
* Will we support full-text search in SQL?
* How do we integrate the Lucene index into Calcite? What is the cost model?
Splitting rules? Traits?
* What about consistency with DDL operations, e.g. column rename?
Ignite indices will operate on column IDs, so a rename operation will not
affect the index.
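
As a small illustration of the column-ID point, the sketch below keeps the
name-to-ID mapping separate from the index registry, so a rename only touches
the mapping. Everything here (ColumnIdIndexSketch and its methods) is a
hypothetical stand-in and says nothing about how the Ignite 3 schema is
actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; this class is not an Ignite API.
final class ColumnIdIndexSketch {
    private final Map<String, Integer> nameToId = new HashMap<>();
    private final Map<Integer, String> indexByColumnId = new HashMap<>();
    private int nextId;

    /** Registers a column and returns its stable ID. */
    int addColumn(String name) {
        int id = nextId++;
        nameToId.put(name, id);
        return id;
    }

    /** Binds a text index to the column's ID, not to its name. */
    void createTextIndex(String column, String indexName) {
        Integer id = nameToId.get(column);
        if (id == null)
            throw new IllegalArgumentException("Unknown column: " + column);
        indexByColumnId.put(id, indexName);
    }

    /** Rename changes only the name-to-ID mapping; the index registry is untouched. */
    void renameColumn(String oldName, String newName) {
        Integer id = nameToId.remove(oldName);
        if (id == null)
            throw new IllegalArgumentException("Unknown column: " + oldName);
        nameToId.put(newName, id);
    }

    /** Returns the index bound to the column, or null if there is none. */
    String indexFor(String column) {
        Integer id = nameToId.get(column);
        return id == null ? null : indexByColumnId.get(id);
    }

    public static void main(String[] args) {
        ColumnIdIndexSketch schema = new ColumnIdIndexSketch();
        schema.addColumn("text");
        schema.createTextIndex("text", "fts_idx");
        schema.renameColumn("text", "body");
        System.out.println(schema.indexFor("body")); // index is still reachable under the new name
    }
}
```

A Lucene-backed index would still store field names inside its own documents,
so one open point is whether those fields should also be keyed by column ID.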

With all of this, you can proceed with an IEP (or even a short summary) and
then a POC and implementation.
That's a big deal, so let's discuss what could be done here.

On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <at...@apache.org> wrote:

> I am actually happy to drive the feature for Ignite 3. FTS is very
> important for me and I think Ignite users will benefit from it
> greatly.
>
> If it makes sense to be focusing on Ignite 3 for this capability, I am
> eager to contribute there and lead the development.
>
> Please share your thoughts.
> > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <at...@apache.org>
> > > > wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > I have been looking into our text queries support and see
> that it
> > > > has
> > > > > > > > limited community support.
> > > > > > > >
> > > > > > > > Therefore, I volunteer to be the maintainer of the module and
> > > work
> > > > on
> > > > > > > > enhancing it further.
> > > > > > > >
> > > > > > > > First goal would be to move to Lucene 8.x, then work on
> sorted
> > > > reduce
> > > > > > > > - merge across nodes. Fundamentally, this is doable since
> Lucene
> > > > > ranks
> > > > > > > > documents according to their score, and documents are
> returned in
> > > > the
> > > > > > > > order of their score. Since the scoring function is
> homogeneous,
> > > > this
> > > > > > > > means that across nodes, we can compare scores and merge
> sort.
> > > > > > > >
> > > > > > > > Please let me know if I can take this up.
> > > > > > > >
> > > > > > > > Atri
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Atri
> > > > > > > > Apache Concerted
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Best regards,
> > > > > > Alexei Scherbakov
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Atri
> > > > > Apache Concerted
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
>
> --
> Regards,
>
> Atri
> Apache Concerted
>

-- 
Best regards,
Andrey V. Mashenkov

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

I am actually happy to drive the feature for Ignite 3. Full-text search
(FTS) is very important to me, and I think Ignite users will benefit from
it greatly.

If it makes sense to focus on Ignite 3 for this capability, I am eager to
contribute there and lead the development.

Please share your thoughts.

On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
<an...@gmail.com> wrote:
>
> Hi Atri,
>
> All the Jira tickets we have for full-text search (FTS) are targeted at
> Ignite 2.
>
> AFAIK, we want FTS in Ignite 3, but we have NOT committed to supporting it
> yet. By the way, we are getting requests for it from users, and FTS would
> definitely be a valuable feature for Ignite.
>
> It will be great if someone wants to drive it; any help will be appreciated.
>
> --
> Best regards,
> Andrey V. Mashenkov

-- 
Regards,

Atri
Apache Concerted

Re: Text Queries Support

Posted by Andrey Mashenkov <an...@gmail.com>.

Hi Atri,

All the Jira tickets we have for full-text search (FTS) are targeted at
Ignite 2.

AFAIK, we want FTS in Ignite 3, but we have NOT committed to supporting it
yet. By the way, we are getting requests for it from users, and FTS would
definitely be a valuable feature for Ignite.

It will be great if someone wants to drive it; any help will be appreciated.


On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <at...@apache.org> wrote:

> Hello,
>
> An update: I am working through persistence of the Lucene index using
> Ignite Dictionary, and will be asking some questions soon.
>
> I have one question: where does this change go? Ignite 3?
>
> Also, I know we want to build native support for text searches in Ignite 3.
> Is the work I am proposing here part of that, or will that be a separate
> effort?


-- 
Best regards,
Andrey V. Mashenkov

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

Hello,

An update: I am working through persistence of the Lucene index using
Ignite Dictionary, and will be asking some questions soon.

I have one question: where does this change go? Ignite 3?

Also, I know we want to build native support for text searches in Ignite 3.
Is the work I am proposing here part of that, or will that be a separate
effort?

On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <il...@gmail.com>
wrote:

> Hello!
>
> I think that number one is the most important; once it is done, text
> queries may see more use and other deficiencies will become more apparent,
> leading to more tickets and visibility.
>
> Maybe items 2 and 3 will even use a different approach once persistence is
> implemented.
>
> Regards,
> --
> Ilya Kasnacheev

Re: Text Queries Support

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

I think that number one is the most important; once it is done, text
queries may see more use and other deficiencies will become more apparent,
leading to more tickets and visibility.

Maybe items 2 and 3 will even use a different approach once persistence is
implemented.

Regards,
-- 
Ilya Kasnacheev


On Mon, Jun 28, 2021 at 14:34, Atri Sharma <at...@apache.org> wrote:

> Hello Again!
>
> I have been looking into the above, and here are my follow-up
> thoughts:
>
> 1. Support persistence of Lucene indexes.
> 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs fixing of
> moving partitions first)
> 3. Figure out how to return scores from nodes and use them as sort
> parameters on the coordinator node
> (https://issues.apache.org/jira/browse/IGNITE-12291)
>
> Please let me know if this looks OK as a plan to make text queries
> functional.
>
> Atri

Re: Text Queries Support

Posted by Atri Sharma <at...@apache.org>.

Hello Again!

I have been looking into the above, and here are my follow-up thoughts:

1. Support persistence of Lucene indexes.
2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs fixing of
moving partitions first)
3. Figure out how to return scores from nodes and use them as sort
parameters on the coordinator node
(https://issues.apache.org/jira/browse/IGNITE-12291)
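
As a rough illustration of item 3 (a sketch only, not the design tracked in
IGNITE-12291): each node could return its hits together with the Lucene
score, and the coordinator could then use that score as the sort key. The
names ScoredKey, ScoreSortingCoordinator and topHits are hypothetical and do
not exist in Ignite; the sketch also assumes that scores computed on
different nodes are directly comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical wire format: a cache key plus the score computed on the owning node. */
final class ScoredKey {
    final String key;
    final float score;

    ScoredKey(String key, float score) {
        this.key = key;
        this.score = score;
    }
}

/** Hypothetical coordinator-side helper; not an Ignite class. */
final class ScoreSortingCoordinator {
    /** Highest score first; ties are broken by key so the order is deterministic. */
    static final Comparator<ScoredKey> BY_SCORE_DESC =
        Comparator.comparingDouble((ScoredKey e) -> e.score).reversed()
            .thenComparing(e -> e.key);

    /** Collects the pages returned by every node and keeps the best {@code limit} hits. */
    static List<ScoredKey> topHits(List<List<ScoredKey>> pagesFromNodes, int limit) {
        List<ScoredKey> all = new ArrayList<>();

        for (List<ScoredKey> page : pagesFromNodes)
            all.addAll(page);

        all.sort(BY_SCORE_DESC);

        return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
    }
}
```

This version sorts everything on the coordinator for simplicity. If every
node already returns its hits in score order, a streaming merge would avoid
holding all results in memory at once.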

Please let me know if this looks OK as a plan to make text queries
functional.

Atri

On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
<al...@gmail.com> wrote:
>
> Hi.
>
> One of the biggest issues with text queries is the lack of support for
> persisting Lucene indexes, which makes this functionality useless if
> persistence is enabled.
>
> I would first take care of it.
>
> --
>
> Best regards,
> Alexei Scherbakov

-- 
Regards,

Atri
Apache Concerted

Re: Text Queries Support

Posted by Alexei Scherbakov <al...@gmail.com>.

Hi.

One of the biggest issues with text queries is the lack of support for
persisting Lucene indexes, which makes this functionality useless if
persistence is enabled.

I would first take care of it.

On Mon, Jun 21, 2021 at 12:16, Maksim Timonin <ti...@gmail.com> wrote:

> Hi, Atri!
>
> You're right, there is actually a lack of support for TextQueries. While
> working on my latest ticket I saw some obvious issues with them (no page
> size support, for example). I'm glad that somebody wants to maintain this
> functionality. Thanks a lot!
>
> For the MergeSort algorithm there is already a patch [1] [2]. It's
> currently in review. This patch introduces an abstract reducer for
> CacheQueries with 2 implementations (unordered, merge-sort). TextQuery then
> uses MergeSort to order results from multiple nodes by score. This patch
> also fixes the pageSize issue I mentioned before. Could you please check
> whether it fully matches your idea? Any issues or comments are welcome.
>
> I prepared this ticket because I need the MergeSort algorithm for the new
> type of query I'm implementing (IndexQuery, which should also provide
> ordered results across multiple nodes). Currently I'm not planning to go
> further with TextQuery, so if you're going to support it, that will be a
> great contribution, I think.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-14703
> [2] https://github.com/apache/ignite/pull/9081


-- 

Best regards,
Alexei Scherbakov

Re: Text Queries Support

Posted by Maksim Timonin <ti...@gmail.com>.

Hi, Atri!

You're right, there is actually a lack of support for TextQueries. While
working on my latest ticket I saw some obvious issues with them (no page size
support, for example). I'm glad that somebody wants to maintain this
functionality. Thanks a lot!

For the MergeSort algorithm there is already a patch [1] [2]. It's
currently in review. This patch introduces an abstract reducer for
CacheQueries with 2 implementations (unordered, merge-sort). TextQuery then
uses MergeSort to order results from multiple nodes by score. This patch
also fixes the pageSize issue I mentioned before. Could you please check
whether it fully matches your idea? Any issues or comments are
welcome.
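
To make the idea concrete, here is a minimal sketch of an abstract reducer
with an unordered and a merge-sort implementation. It is not the code from
the patch: QueryReducer, UnorderedReducer and MergeSortReducer are
hypothetical names chosen for illustration, and the sketch assumes each node
hands back its results already sorted by the same comparator (for text
queries, descending score).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical reducer contract: turns per-node result streams into one result list. */
abstract class QueryReducer<T> {
    abstract List<T> reduce(List<Iterator<T>> perNode);
}

/** Unordered variant: drains the node streams one after another. */
final class UnorderedReducer<T> extends QueryReducer<T> {
    @Override List<T> reduce(List<Iterator<T>> perNode) {
        List<T> out = new ArrayList<>();

        for (Iterator<T> it : perNode)
            while (it.hasNext())
                out.add(it.next());

        return out;
    }
}

/** Merge-sort variant: assumes every node stream is already sorted by {@code cmp}. */
final class MergeSortReducer<T> extends QueryReducer<T> {
    private final Comparator<T> cmp;

    MergeSortReducer(Comparator<T> cmp) {
        this.cmp = cmp;
    }

    /** Current head of one node stream plus the rest of that stream. */
    private static final class Head<T> {
        final T val;
        final Iterator<T> rest;

        Head(T val, Iterator<T> rest) {
            this.val = val;
            this.rest = rest;
        }
    }

    @Override List<T> reduce(List<Iterator<T>> perNode) {
        // The queue always holds at most one pending element per node.
        PriorityQueue<Head<T>> heads = new PriorityQueue<>((a, b) -> cmp.compare(a.val, b.val));

        for (Iterator<T> it : perNode)
            if (it.hasNext())
                heads.add(new Head<>(it.next(), it));

        List<T> out = new ArrayList<>();

        while (!heads.isEmpty()) {
            Head<T> h = heads.poll();
            out.add(h.val);

            // Refill from the stream that just supplied the smallest element.
            if (h.rest.hasNext())
                heads.add(new Head<>(h.rest.next(), h.rest));
        }

        return out;
    }
}
```

With a comparator that puts higher scores first, the merge-sort variant
yields a globally score-ordered list while keeping only one pending element
per node in the queue.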

I prepared this ticket because I need the MergeSort algorithm for the new
type of query I'm implementing (IndexQuery, which should also provide
ordered results across multiple nodes). Currently I'm not planning to go
further with TextQuery, so if you're going to support it, that will be a
great contribution, I think.

[1] https://issues.apache.org/jira/browse/IGNITE-14703
[2] https://github.com/apache/ignite/pull/9081

On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <at...@apache.org> wrote:

> Hi All,
>
> I have been looking into our text queries support and see that it has
> limited community support.
>
> Therefore, I volunteer to be the maintainer of the module and work on
> enhancing it further.
>
> First goal would be to move to Lucene 8.x, then work on sorted reduce
> - merge across nodes. Fundamentally, this is doable since Lucene ranks
> documents according to their score, and documents are returned in the
> order of their score. Since the scoring function is homogeneous, this
> means that across nodes, we can compare scores and merge sort.
>
> Please let me know if I can take this up.
>
> Atri
>
> --
> Regards,
>
> Atri
> Apache Concerted
>