You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/28 21:08:32 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`

alamb opened a new issue, #6790:
URL: https://github.com/apache/arrow-datafusion/issues/6790

   ### Describe the bug
   
   A query is failing that it can't find a column which exists
   
   This was reported by a customer of IOx
   
   ### To Reproduce
   
   DataFusion CLI v27.0.0
   
   ```sql
   ❯
   create table "MyTable" (
   "A"    VARCHAR,
   "B"    VARCHAR,
   "time" timestamp
   );
   0 rows in set. Query took 0.002 seconds.
   
   ❯ 
   SELECT "A", COUNT(DISTINCT("B"))
   FROM "MyTable"
   WHERE time >= now() - interval '1 month'
   GROUP BY "A"
   LIMIT 10;
   
   Optimizer rule 'simplify_expressions' failed
   caused by
   Schema error: No field named mytable."A". Valid fields are "MyTable.A", "COUNT(DISTINCT MyTable.B)".
   ```
   
   
   
   ### Expected behavior
   
   The query should plan successfully (but return no rows)
   ```sql
   +-----------+---------------------------+
   | MyTable.A | COUNT(DISTINCT MyTable.B) |
   +-----------+---------------------------+
   +-----------+---------------------------+
   ```
   
   ### Additional context
   
   The query works fine in version `26.0.0` 👍 
   
   ```sql
   DataFusion CLI v26.0.0
   ❯ create table "MyTable" (
   "A"    VARCHAR,
   "B"    VARCHAR,
   "time" timestamp
   );
   
   0 rows in set. Query took 0.003 seconds.
   ❯ SELECT "A", COUNT(DISTINCT("B"))
   FROM "MyTable"
   WHERE time >= now() - interval '1 month'
   GROUP BY "A"
   LIMIT 10;
   
   +-----------+---------------------------+
   | MyTable.A | COUNT(DISTINCT MyTable.B) |
   +-----------+---------------------------+
   +-----------+---------------------------+
   ```
   
   Also if we remove the distinct the query also works:
   
   ```sql
   ❯ SELECT "A", COUNT("B")
   FROM "MyTable"
   WHERE time >= now() - interval '1 month'
   GROUP BY "A"
   LIMIT 10;
   +---+------------------+
   | A | COUNT(MyTable.B) |
   +---+------------------+
   +---+------------------+
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6790:
URL: https://github.com/apache/arrow-datafusion/issues/6790#issuecomment-1617807970

   I also hit this same error with some of the clickbench queries. 
   
   I think this was one:
   
   ```sql
   SELECT "RegionID", COUNT(DISTINCT "UserID") AS u FROM hits GROUP BY "RegionID" ORDER BY u DESC LIMIT 10;
   ```
   
   I hope to find time to work on this issue later in the week if you haven't had a chance @jackwener 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6790:
URL: https://github.com/apache/arrow-datafusion/issues/6790#issuecomment-1624194554

   > so, I try to resolve it by add a field in Alias to keep schema same.
   > I try to use https://github.com/apache/arrow-datafusion/issues/6681 https://github.com/apache/arrow-datafusion/pull/6826 to resolve it, but I think it will bring much complexity, I think it's a good idea and give up it.
   
   🤔  yeah it is the approach that seems to make the most sense from an incremental sense, but I trust you if you say we can't make it work in practice...
   
   @jackwener  do you think there is a good ticket that covers this issue adequately ? Your description in https://github.com/apache/arrow-datafusion/issues/6790#issuecomment-1623576542 is really nicely done, and I think it will likely get lost in this PR if we don't pull it into a ticket 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`
URL: https://github.com/apache/arrow-datafusion/issues/6790


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6790:
URL: https://github.com/apache/arrow-datafusion/issues/6790#issuecomment-1619038324

   Thank you @jackwener 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on issue #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on issue #6790:
URL: https://github.com/apache/arrow-datafusion/issues/6790#issuecomment-1618006762

   > I hope to find time to work on this issue later in the week if you haven't had a chance @jackwener
   
   This problem is caused by #6595. 
   
   Although the idea of #6595 is right, we cannot change so for the time being. Because it will expose the current design problem ( schema will be change in optimization).
   
   I solve it with #6826 by add `Field` in `Alias`, but I am worried that it will bring more complexity, so I think a better temporary method is to revert #6595
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6790:
URL: https://github.com/apache/arrow-datafusion/issues/6790#issuecomment-1624191231

   > This involves a design problem. we shouldn't use name as a references to get a column from child plan, we should use id to get a column from child plan.
   
   I believe this is also what Postgres does (refers to columns by id, rather than a name)
   
   Also, interestingly, this is what DataFusion `PhyicalExpr` do as well. 
   
   
   > In my opinion, it is unnecessary to keep Schema unchanged at the Optimization stage, because in the process of optimization, it indeed may change.
   
   > qualifier/name change: such as simplify_expr
   
   I somehow feel this should be fixable with `Alias`es but as you point out this apparently is tricky
   
   > nullable change: such as infer not null from predicate
    
   This is a good point
   
   
   > type change: such as https://github.com/apache/arrow-datafusion/pull/6862
   
   In my opinion, the type (other than nullable) should never change after we do the Analysis phase. Otherwise we could end up getting to the physical expressions and not having the types line up
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on issue #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on issue #6790:
URL: https://github.com/apache/arrow-datafusion/issues/6790#issuecomment-1612411514

   this is related with #6681.
   
   During the process of solving this problem, I encountered some difficulties, but I am working hard to resolve them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on issue #6790: Schema error: No field named mytable."A". when table is named `"MyTable"`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on issue #6790:
URL: https://github.com/apache/arrow-datafusion/issues/6790#issuecomment-1623576542

   I have been quite busy lately, and today a newcomer @YjyJeff in the community encountered the same issue and submitted a PR #6862 related to it. So I took some time to systematically analyze this problem and help others in the community better understand it.
   
   This problem is caused by #6595 directly. Although the idea of #6595 is right, but we cannot change so for the time being. Because it will expose the current design problem ( schema will be change in optimization).
   
   Let me have explain this problem:
   
   ```
   There is a optimizated rule: simplify expr:
   project: t1.a + 1 + 2
   its schema is Schema: Field [qualifier: None, name:`t1.a + 1 + 2`]
   
   after simplify expr:
   project: t1.a + 3
   
   if we use original schema: schema still is 
   Schema: Field [qualifier: None, name:`t1.a + 1 + 2`]
   if we recompute schema, its schema is Schema: Field [qualifier: None, name:`t1.a + 3`]
   
   so, if we recompute new schema, it will cause `mismatch schema error`
   of occurse, we can add alias, it will make field keep same after recomputation.
   
   BUT, alias can't avoid this corner case https://github.com/apache/arrow-datafusion/pull/6595#discussion_r1225815262:
   alias('t1.a'), field is [qualifier: none, name: t1.a], we hope field is [qualifier: t1, name: a].
   I try to use #6681 #6826 to resolve it, but it will cause more complexity, I think it's a good idea.
   
   so I think a better temporary method is to revert #6595
   ```
   
   ---
   
   Let me talk about more.
   
   In my opinion, it is unnecessary to keep `Schema` unchanged at the Optimization stage, because in the process of optimization, it indeed may change.
   
   - qualifier/name change
   - nullable change
   - type change #6862
   
   So, keep `Schema` isn't a good design, it will cause some trouble.
   
   ---
   
   Why we need to keep `Schema` unchanged?
   
   Based on my guess, the reason may be that we need to ensure the `references` are correct.
   Plan is based on `Schema`. if we change schema of child plan, plan may can't get right column from child plan.
   
   This involves a design problem. we shouldn't use `name` as a `references` to get a column from child plan, we should use `id` to get a column from child plan.
   Because we should use a immutable/unquie item as a `references`.
   If we use this design, we also can avoid problem like #6543.
   
   Many systems in the industry have adopted this design, like spark, impala, doris...
   
   such as spark
   ```scala
   case class AttributeReference(
       name: String,
       datatype: DataType,
       nullable: Boolean = true,
       override val metadata: Metadata = Metadata.empty)(
       val exprId: ExprId = NamedExpression.newExprId,
       val qualifier: Seq[String] = Seq.empty[String])
     extends Attribute with Unevaluable
   
     /**
      * Returns true iff the expression id is the same for both attributes.
      */
     def sameRef(other: AttributeReference): Boolean = this.exprId == other.exprId
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org