You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Andrei Sereda <an...@sereda.cc> on 2019/01/02 15:55:23 UTC

Re: ElasticSearch adapter. Exposing meta fields (like _id, _uid etc.)

Thanks, Stamatis. I wanted to confirm that adding one more mapping is not
redundant (this information might have existed somewhere).

I am not sure if you really need to keep the mapping between EXPR and ITEM
or just the field names

Do you mean having a List instead of a Map ? Currently, I’m doing the
latter. The mapping is between String and String.

/**
      * Keeps mapping between calcite expression identifier (like
{@code EXPR$0}) and
      * original item call like {@code _MAP['foo.bar']} ({@code
foo.bar} really).
      * This information otherwise might be lost during query translation.
      *
      * @see SqlStdOperatorTable#ITEM
      */
     final List<Map.Entry<String, String>> expressionItemMap = new
ArrayList<>();


On Mon, Dec 31, 2018 at 9:28 AM Stamatis Zampetakis <za...@gmail.com>
wrote:

> I would say that the plan you provided is the physical plan but this does
> not change the point that
> I was trying to make, which is that you have all the necessary information
> in the plan.
>
> Andrei>Does it mean one has to manually keep mapping between EXPR and ITEM
> somewhere ?
>
> I would say yes. I am not sure if you really need to keep the mapping
> between EXPR and ITEM or just the field names but certainly it is up to you
> to store the information that you need somewhere.
> The right place I guess would
> be org.apache.calcite.adapter.elasticsearch.ElasticsearchRel.Implementor.
> Note
> that you could make also Implementor#visitChild method return a result
> instead of being void and add there necessary information.
>
> Andrei>Is there a better way ?
>
> An alternative solution (not sure if it is better) would be to put
> necessary information in the Elastic operators when moving from the logical
> plan to the physical plan by having custom rules. This would require
> possibly many changes so I guess in the end it is isn't worth it.
>
> Andrei>In Sort Rel (ElasticsearchSort). How to know that sort0=[$2]
> (EXPR$2) is sort on ITEM($0, 'c') ?
>
> Sorting is position based so you just have to know that field c is in
> position $2. Again you can exploit information that you added in
> Implementor.
>
>
>
> Στις Σάβ, 29 Δεκ 2018 στις 7:00 μ.μ., ο/η Andrei Sereda <an...@sereda.cc>
> έγραψε:
>
> > Let’s take an example of a simple query:
> >
> > -- note _id fieldselect  _MAP['a'], _MAP['_id']  from elastic order by
> > _MAP['c']
> >
> > Logical plan is as follows:
> >
> > ElasticsearchToEnumerableConverter
> >   ElasticsearchSort(sort0=[$2], dir0=[ASC])
> >     ElasticsearchProject(EXPR$0=[ITEM($0, 'a')], EXPR$1=[ITEM($0,
> > '_id')], EXPR$2=[ITEM($0, 'c')])
> >       ElasticsearchTableScan(table=[[elastic, foo]])
> >
> > Questions
> >
> >    1. In ElasticsearchEnumerators
> >    <
> >
> https://github.com/asereda-gs/calcite/blob/elastic-idfield/elasticsearch/src/main/java/org/apache/calcite/adapter/elasticsearch/ElasticsearchEnumerators.java#L46
> > >
> >    I need to know if requested field is a meta field (_id) or not (to
> >    correctly fetch it from payload). Does it mean one has to manually
> keep
> >    mapping between EXPR and ITEM somewhere ? Is there a better way ?
> >    2. In Sort Rel (ElasticsearchSort). How to know that sort0=[$2]
> (EXPR$2)
> >    is sort on ITEM($0, 'c') ? Mapping above (1) will help.
> >
> > Thanks for your help.
> >
> > On Sat, Dec 29, 2018 at 3:08 AM Stamatis Zampetakis <za...@gmail.com>
> > wrote:
> >
> > > I think a concrete example of a query with its respective logical plan
> > > could clarify a few things. I would expect that the logical plan
> contains
> > > all the necessary information to decide which fields need to be fetched
> > > from Elastic without requiring additional aliases. If that information
> is
> > > lost when passing to the physical plan then we could possibly add some
> > > rules and/or operators to ensure this doesn't happen.
> > >
> > > Στις Παρ, 28 Δεκ 2018 στις 11:37 μ.μ., ο/η Andrei Sereda
> > <andrei@sereda.cc
> > > >
> > > έγραψε:
> > >
> > > > Other than that it seems to me _MAP[‘_id’] will become a RexCall
> > similar
> > > to
> > > > ITEM($0, ‘_ID’) so it appears to me that you can retrieve the field
> > name
> > > by
> > > > implementing a RexVisitor.
> > > >
> > > > Yes. That’s what I’m doing (see isItem
> > > > <
> > > >
> > >
> >
> https://github.com/apache/calcite/pull/982/files#diff-d1419ede28af115c0b45d6e02828cb6fR85
> > > > >
> > > > )
> > > >
> > > > I don’t see clearly why you need to set explicitly aliases but
> probably
> > > is
> > > > because I am missing implementation details related with the adapter
> > > > itself.
> > > >
> > > > For physical plan I get a list of expressions / projections (via
> > > > RelDataType
> > > > / getRowType) but at some point I need to know if (for example)
> EXPR$0
> > ==
> > > > ITEM($0, '_id') (_id is a system field and needs to be fetched
> > > differently
> > > > from payload). RelDataType doesn’t have item information.
> > > >
> > > > That’s the only reason I’m manually saving this mapping. Perhaps
> there
> > > is a
> > > > better way to do it ?
> > > >
> > > > On Fri, Dec 28, 2018 at 2:08 AM Stamatis Zampetakis <
> zabetak@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi Andrei,
> > > > >
> > > > > I tried to have a look in the PR but I am not familiar with the
> > Elastic
> > > > > adapter code so I am missing a lot of context.
> > > > >
> > > > > In general, I think that relying on the on the field names coming
> > from
> > > a
> > > > > RelDataType it is not safe. As you observed, it can get messed up
> > quite
> > > > > easily from many parts of Calcite.
> > > > > It is usually better to try to reason using the position and/or the
> > > type
> > > > of
> > > > > the field rather than its name.
> > > > >
> > > > > Other than that it seems to me _MAP['_id'] will become a RexCall
> > > similar
> > > > to
> > > > > ITEM($0, '_ID') so it appears to me that you can retrieve the field
> > > name
> > > > by
> > > > > implementing a RexVisitor.
> > > > > I don't see clearly why you need to set explicitly aliases but
> > probably
> > > > is
> > > > > because I am missing implementation details related with the
> adapter
> > > > > itself.
> > > > >
> > > > > Best,
> > > > > Stamatis
> > > > >
> > > > > Στις Πέμ, 27 Δεκ 2018 στις 8:36 μ.μ., ο/η Julian Hyde <
> > > jhyde@apache.org>
> > > > > έγραψε:
> > > > >
> > > > > > I added comments to the case. I did not look at the PR, sorry.
> > > > > >
> > > > > > > On Dec 26, 2018, at 1:56 PM, Andrei Sereda <an...@sereda.cc>
> > > wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Can somebody pls take a look at the following PR:
> > > > > > >
> > > > > > > PR: https://github.com/apache/calcite/pull/982
> > > > > > > JIRA: https://issues.apache.org/jira/browse/CALCITE-2755
> > > > > > >
> > > > > > > Specifically, I would like to understand if adding explicit
> > mapping
> > > > > > > <
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/calcite/pull/982/files#diff-f24822f58a8062386da5287cd57cae65R75
> > > > > > >
> > > > > > > between expression (`EXPR$n`) and item (`_MAP['foo.bar']`) is a
> > > > correct
> > > > > > > (and recommended) approach. The reason I need it is because the
> > > > > original
> > > > > > > item name (eg. `_MAP['foo.bar']`) is lost after converter
> > operation
> > > > (ie
> > > > > > I'm
> > > > > > > not able to derive it from `input.getRowType()`)
> > > > > > >
> > > > > > > ```sql
> > > > > > > -- Here I get EXPR$0 and EXPR$1 but no information about 'id'
> and
> > > > > > 'foo.bar'
> > > > > > > select _MAP['_id'], _MAP['foo.bar'] from elastic
> > > > > > >
> > > > > > > -- Query below preserves attribute information
> > > > > > > select * from (select _MAP['_id'] as \"_id\", _MAP['foo.bar']
> as
> > > > > > > \"foo.bar\" from elastic)
> > > > > > > ```
> > > > > > >
> > > > > > > Many thanks,
> > > > > > > Andrei.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Dec 22, 2018 at 3:11 AM Stamatis Zampetakis <
> > > > zabetak@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Andrei,
> > > > > > >>
> > > > > > >> Many data management systems have internal columns which the
> > users
> > > > may
> > > > > > be
> > > > > > >> able to query (see for example Postgres [1]). It doesn't sound
> > > > > > unnatural to
> > > > > > >> allow users access these fields. On the other hand such system
> > > > columns
> > > > > > do
> > > > > > >> not appear in SELECT * queries and I suppose they are pruned
> out
> > > by
> > > > > > >> projections early on if they are not used by the query. I
> guess
> > > you
> > > > > > could
> > > > > > >> take a similar approach in the ElasticSearch adapter.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Stamatis
> > > > > > >>
> > > > > > >> [1]
> > > https://www.postgresql.org/docs/current/ddl-system-columns.html
> > > > > > >>
> > > > > > >>
> > > > > > >> Στις Σάβ, 22 Δεκ 2018 στις 1:14 π.μ., ο/η Andrei Sereda
> > > > > > <an...@sereda.cc>
> > > > > > >> έγραψε:
> > > > > > >>
> > > > > > >>> Hello,
> > > > > > >>>
> > > > > > >>> Upon indexing, elastic allocates a unique _id
> > > > > > >>> <
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-id-field.html
> > > > > > >>>>
> > > > > > >>> for each document (unless specified by the user). Currently
> > when
> > > > > > mapping
> > > > > > >>> between elastic result and calcite return type we query only
> > > > _source
> > > > > or
> > > > > > >>> fields attributes. _id is not part those but exposed at
> higher
> > > > level
> > > > > in
> > > > > > >>> query response (see SearchHit
> > > > > > >>> <
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/6.1/_the_search_api.html
> > > > > > >>>>
> > > > > > >>> ).
> > > > > > >>>
> > > > > > >>> Currently  _MAP['foo'] becomes _source.get('foo') (or
> > > > > > fields.get('foo')).
> > > > > > >>>
> > > > > > >>> Do you think we should make _MAP['_id'] available implicitly
> ?
> > > > > > >>>
> > > > > > >>> I have a couple of use-cases where one needs to know document
> > ID.
> > > > > > >>>
> > > > > > >>> Please note this makes sense only for non-aggregate queries.
> > > > > > >>>
> > > > > > >>> Regards,
> > > > > > >>>
> > > > > > >>> Andrei.
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>