Posted to github@arrow.apache.org by "elbaro (via GitHub)" <gi...@apache.org> on 2023/02/05 20:47:50 UTC

[GitHub] [arrow-datafusion] elbaro opened a new issue, #5187: Dataframe API adds ?table? qualifier

elbaro opened a new issue, #5187:
URL: https://github.com/apache/arrow-datafusion/issues/5187

   **Describe the bug**
   A DataFrame constructed with the DataFrame API from multiple Parquet files has column names prefixed with '?table?.'.
   
   **To Reproduce**
   ```rs
   let mut config = ListingTableConfig::new_with_multi_paths(uris);
   config = config.infer(&ctx.state()).await?;
   let table = Arc::new(ListingTable::try_new(config)?);
   let df = ctx.read_table(table)?;
   let df = df.select_columns(&["key", "size", "last_modified"])?;
   
   // Running this fails with:
   // Error: Schema error: No field named 'last_modified'. Valid fields are '?table?'.'key', '?table?'.'size', '?table?'.'last_modified_date', '?table?'.'e_tag'.
   ```
   
   **Expected behavior**
   Column names without the `?table?` qualifier.
   
   **Additional context**
   I want to vstack Parquet files in a specific order, so I need ListingTable*.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] elbaro commented on issue #5187: Dataframe API adds ?table? qualifier

Posted by "elbaro (via GitHub)" <gi...@apache.org>.
elbaro commented on issue #5187:
URL: https://github.com/apache/arrow-datafusion/issues/5187#issuecomment-1418264366

   The bug is in parsing a column name.
   ```rs
   println!("{:?}", Column::from_qualified_name("?table?.key"));
   ```
   
   ```rs
   Column { relation: None, name: "?table?.key" }
   ```
   
   The following worked:
   ```rs
       let df = df.select(vec![
           col(Column::new(Some("?table?"), "key")).alias("key"),
           col(Column::new(Some("?table?"), "size")).alias("size"),
           col(Column::new(Some("?table?"), "last_modified_date")).alias("last_modified_date"),
       ])?;
   ```




[GitHub] [arrow-datafusion] elbaro commented on issue #5187: Dataframe API adds ?table? qualifier

Posted by "elbaro (via GitHub)" <gi...@apache.org>.
elbaro commented on issue #5187:
URL: https://github.com/apache/arrow-datafusion/issues/5187#issuecomment-1428521366

   My bad, I thought the typo was only in the first example. Thanks!
   




[GitHub] [arrow-datafusion] elbaro commented on issue #5187: Dataframe API adds ?table? qualifier

Posted by "elbaro (via GitHub)" <gi...@apache.org>.
elbaro commented on issue #5187:
URL: https://github.com/apache/arrow-datafusion/issues/5187#issuecomment-1428089611

   I see. Is this API intended, or is it still in progress?
   
   If I had to add double quotes everywhere, it would be very hard to use.
   
   ```
   SELECT "a".col1, "b".col2 FROM a JOIN b ON "a".col3="b".col3
   ```
   
   If a dot in column names is uncommon, I suggest this style (so that writing is as easy as SQL):
   
   ```
   col("a.col") => ("a", "col")
   Column::new(None, "a.col") => (None, "a.col")
   ```
   
   
   Also, when are we allowed to skip `?table?`?
   The following example from the docs uses `col("a")` without `?table?`.
   
   ```rs
   // create the dataframe
   let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
   
   // create a plan
   let df = df.filter(col("a").lt_eq(col("b")))? // why is this not col("\"?table?\".a")
              .aggregate(vec![col("a")], vec![min(col("b"))])?
              .limit(0, Some(100))?;
   ```




[GitHub] [arrow-datafusion] alamb commented on issue #5187: Dataframe API adds ?table? qualifier

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5187:
URL: https://github.com/apache/arrow-datafusion/issues/5187#issuecomment-1428507529

   > Also, when are we allowed to skip ?table?
   
   When the reference is not ambiguous (i.e., when only a single table in the relevant output has a column with that name).
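   
   A minimal sketch of that rule, using only the standard library: `resolve_unqualified` is a hypothetical helper (not a DataFusion API) that takes `(qualifier, name)` pairs and resolves a bare column name only when exactly one field has that name. The error strings are illustrative and are not DataFusion's messages.
   
   ```rust
   /// Hypothetical helper for illustration only; not part of DataFusion.
   /// `fields` holds (qualifier, column name) pairs for the current schema.
   /// Returns the qualifier that owns `name` when exactly one field matches.
   fn resolve_unqualified<'a>(
       fields: &[(&'a str, &'a str)],
       name: &str,
   ) -> Result<&'a str, String> {
       // Collect the qualifier of every field whose column name matches.
       let owners: Vec<&'a str> = fields
           .iter()
           .filter(|(_, n)| *n == name)
           .map(|(q, _)| *q)
           .collect();
   
       match owners.as_slice() {
           // No field has this name: the reference cannot be resolved.
           [] => Err(format!("No field named '{}'", name)),
           // Exactly one owner: the qualifier can safely be omitted.
           [only] => Ok(*only),
           // Several owners: the caller must spell out the qualifier.
           _ => Err(format!("Ambiguous reference to '{}'", name)),
       }
   }
   ```
   
   With the single-table schema from this issue, `key` resolves to the `?table?` qualifier without it being spelled out. After a join where both `a` and `b` have a `col3` column, the bare name is rejected and a qualifier is needed.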
   
   As @Jefffrey mentioned, I think the actual fix to your problem is to change
   
   ```rust
   let df = df.select_columns(&["key", "size", "last_modified"])?;
   ```
   
   to 
   
   ```rust
   let df = df.select_columns(&["key", "size", "last_modified_date"])?;
   ```
   
   The fact that there is a `?table?` in the message seems somewhat misleading.
   




[GitHub] [arrow-datafusion] elbaro commented on issue #5187: Dataframe API adds ?table? qualifier

Posted by "elbaro (via GitHub)" <gi...@apache.org>.
elbaro commented on issue #5187:
URL: https://github.com/apache/arrow-datafusion/issues/5187#issuecomment-1418262686

   ```rs
   let df = df.select(vec![
           col("?table?.key").alias("key"),
           col("?table?.size").alias("size"),
           col("?table?.last_modified").alias("last_modified"),
       ])?;
   ```
   
   ```
   Error: Schema error: No field named '?table?.key'. Valid fields are '?table?'.'key', '?table?'.'size',
   ..
   ```
   
   Also tried `col("'?table?'.'key'").alias("key"),`




[GitHub] [arrow-datafusion] Jefffrey commented on issue #5187: Dataframe API adds ?table? qualifier

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on issue #5187:
URL: https://github.com/apache/arrow-datafusion/issues/5187#issuecomment-1420422685

   This seems to be expected behaviour. A Dataframe has a default name of `?table?` if none is specified:
   
   https://github.com/apache/arrow-datafusion/blob/48732b4cb2c8e42fbe5be295429bbc465e5f5491/datafusion/expr/src/logical_plan/builder.rs#L52
   
   The error in the original issue post is due to attempting to select the column `last_modified`, whereas the correct name is `last_modified_date`.
   
   The error in the subsequent comment occurs because you need to quote the identifiers, like so:
   
   ```rust
   let df = df.select(vec![
           col("\"?table?\".key").alias("key"),
           col("\"?table?\".size").alias("size"),
           col("\"?table?\".last_modified").alias("last_modified"),
       ])?;
   ```
   
   Otherwise, the parser treats the entire string as the name of the column, because it cannot detect that there are two identifiers delimited by the period (where the first identifier is quoted with double quotes `"`).
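   
   To illustrate why the quotes matter, here is a minimal, standard-library-only sketch of this kind of splitting. It is not DataFusion's parser: `split_qualified` is a hypothetical helper, and it assumes that an unquoted identifier may contain only letters, digits, and underscores, and that any input it cannot split cleanly is treated as a single column name.
   
   ```rust
   /// Hypothetical helper for illustration only; not DataFusion's parser.
   /// Splits `input` into (relation, name) on a period outside double quotes.
   /// Assumption: unquoted identifiers contain only letters, digits, and `_`.
   fn split_qualified(input: &str) -> (Option<String>, String) {
       let mut parts: Vec<String> = Vec::new();
       let mut current = String::new();
       let mut in_quotes = false;
       let mut well_formed = true;
   
       for ch in input.chars() {
           match ch {
               // A double quote toggles quoted mode and is not kept.
               '"' => in_quotes = !in_quotes,
               // A period outside quotes ends the current identifier.
               '.' if !in_quotes => parts.push(std::mem::take(&mut current)),
               // Quoted text accepts any character; unquoted text is restricted.
               c if in_quotes || c.is_alphanumeric() || c == '_' => current.push(c),
               // Anything else (such as an unquoted `?`) cannot be an identifier.
               _ => well_formed = false,
           }
       }
       parts.push(current);
   
       // Fall back to treating the whole string as the column name.
       if !well_formed || in_quotes {
           return (None, input.to_string());
       }
       match parts.as_slice() {
           [name] => (None, name.clone()),
           [relation, name] => (Some(relation.clone()), name.clone()),
           _ => (None, input.to_string()),
       }
   }
   ```
   
   Under these assumptions, `split_qualified("?table?.key")` keeps the whole string as the name, matching the `Column { relation: None, name: "?table?.key" }` output reported earlier in this thread, while `split_qualified("\"?table?\".key")` yields the relation `?table?` and the name `key`.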




[GitHub] [arrow-datafusion] elbaro closed issue #5187: Dataframe API adds ?table? qualifier

Posted by "elbaro (via GitHub)" <gi...@apache.org>.
elbaro closed issue #5187: Dataframe API adds ?table? qualifier
URL: https://github.com/apache/arrow-datafusion/issues/5187

