You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/28 20:12:56 UTC

[GitHub] [arrow] Dandandan opened a new pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Dandandan opened a new pull request #9355:
URL: https://github.com/apache/arrow/pull/9355


   After this change, TCPH 3 is running successfully:
   
   ```
   Query 3 iteration 0 took 91.8 ms
   Query 3 iteration 1 took 88.9 ms
   Query 3 iteration 2 took 88.1 ms
   Query 3 iteration 3 took 91.7 ms
   Query 3 iteration 4 took 85.3 ms
   Query 3 iteration 5 took 91.3 ms
   Query 3 iteration 6 took 91.6 ms
   Query 3 iteration 7 took 94.5 ms
   Query 3 iteration 8 took 90.5 ms
   Query 3 iteration 9 took 86.6 ms
   Query 3 avg time: 90.03 ms
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#discussion_r566634273



##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1398,6 +1401,21 @@ fn register_aggregate_simple_csv(ctx: &mut ExecutionContext) -> Result<()> {
     Ok(())
 }
 
+fn register_aggregate_date_csv(ctx: &mut ExecutionContext) -> Result<()> {
+    // It's not possible to use aggregate_test_100, not enought similar values to test grouping on floats
+    let schema = Arc::new(Schema::new(vec![
+        Field::new("date", DataType::Date32(DateUnit::Day), false),
+        Field::new("cnt", DataType::Int32, false),
+    ]));
+
+    ctx.register_csv(
+        "dates",
+        "tests/dates.csv",
+        CsvReadOptions::new().schema(&schema),
+    )?;
+    Ok(())
+}

Review comment:
       Thanks for the suggestion, I like it / applied it, this helps to understand the test better indeed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#discussion_r566558997



##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1398,6 +1401,21 @@ fn register_aggregate_simple_csv(ctx: &mut ExecutionContext) -> Result<()> {
     Ok(())
 }
 
+fn register_aggregate_date_csv(ctx: &mut ExecutionContext) -> Result<()> {
+    // It's not possible to use aggregate_test_100, not enought similar values to test grouping on floats
+    let schema = Arc::new(Schema::new(vec![
+        Field::new("date", DataType::Date32(DateUnit::Day), false),
+        Field::new("cnt", DataType::Int32, false),
+    ]));
+
+    ctx.register_csv(
+        "dates",
+        "tests/dates.csv",
+        CsvReadOptions::new().schema(&schema),
+    )?;
+    Ok(())
+}

Review comment:
       What do you think about placing the data in `MemTable` instead and directly next to the SQL statement, like this:
   
   ```
       let data = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(StringArray::from(vec![
               Some("a"),
               Some("b"),
               Some("c"),
               None,
           ]))],
       )?;
       let table = MemTable::try_new(schema, vec![vec![data]])?;
       ctx.register_table("dates", Box::new(table));
   ```
   
   I find it way easier to understand the final result, the numbers "6" and "9", when the test data and statement are next to each other and next to the result.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9355:
URL: https://github.com/apache/arrow/pull/9355


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#issuecomment-770359535


   Thanks @Dandandan ! Could you rebase? We migrated from `Date32(_)` to `Date32` and some code here needs to be updated, unfortunately.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#issuecomment-770362147


   @jorgecarleitao already did that yesterday!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#discussion_r566558997



##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1398,6 +1401,21 @@ fn register_aggregate_simple_csv(ctx: &mut ExecutionContext) -> Result<()> {
     Ok(())
 }
 
+fn register_aggregate_date_csv(ctx: &mut ExecutionContext) -> Result<()> {
+    // It's not possible to use aggregate_test_100, not enought similar values to test grouping on floats
+    let schema = Arc::new(Schema::new(vec![
+        Field::new("date", DataType::Date32(DateUnit::Day), false),
+        Field::new("cnt", DataType::Int32, false),
+    ]));
+
+    ctx.register_csv(
+        "dates",
+        "tests/dates.csv",
+        CsvReadOptions::new().schema(&schema),
+    )?;
+    Ok(())
+}

Review comment:
       What do you think about placing the data in `MemTable` instead, like this:
   
   ```
       let data = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(StringArray::from(vec![
               Some("a"),
               Some("b"),
               Some("c"),
               None,
           ]))],
       )?;
       let table = MemTable::try_new(schema, vec![vec![data]])?;
       ctx.register_table("dates", Box::new(table));
   ```
   
   I find it way easier to understand the final result, the numbers "6" and "9", when the test data and statement are next to each other and next to the result.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#issuecomment-770443168


   Should be solved now @jorgecarleitao


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9355:
URL: https://github.com/apache/arrow/pull/9355


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#discussion_r566558997



##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1398,6 +1401,21 @@ fn register_aggregate_simple_csv(ctx: &mut ExecutionContext) -> Result<()> {
     Ok(())
 }
 
+fn register_aggregate_date_csv(ctx: &mut ExecutionContext) -> Result<()> {
+    // It's not possible to use aggregate_test_100, not enought similar values to test grouping on floats
+    let schema = Arc::new(Schema::new(vec![
+        Field::new("date", DataType::Date32(DateUnit::Day), false),
+        Field::new("cnt", DataType::Int32, false),
+    ]));
+
+    ctx.register_csv(
+        "dates",
+        "tests/dates.csv",
+        CsvReadOptions::new().schema(&schema),
+    )?;
+    Ok(())
+}

Review comment:
       What do you think about placing the data in `MemTable` and directly on the test, like this:
   
   ```
       let data = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(StringArray::from(vec![
               Some("a"),
               Some("b"),
               Some("c"),
               None,
           ]))],
       )?;
       let table = MemTable::try_new(schema, vec![vec![data]])?;
       ctx.register_table("dates", Box::new(table));
   ```
   
   It is much easier to understand the final result, the numbers "6" and "9", when the data and query are next to each other.
   
   Alternatively, the test caters for an explanation of why we get 2 entries and why its numbers are 6 and 9.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#discussion_r566558997



##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1398,6 +1401,21 @@ fn register_aggregate_simple_csv(ctx: &mut ExecutionContext) -> Result<()> {
     Ok(())
 }
 
+fn register_aggregate_date_csv(ctx: &mut ExecutionContext) -> Result<()> {
+    // It's not possible to use aggregate_test_100, not enought similar values to test grouping on floats
+    let schema = Arc::new(Schema::new(vec![
+        Field::new("date", DataType::Date32(DateUnit::Day), false),
+        Field::new("cnt", DataType::Int32, false),
+    ]));
+
+    ctx.register_csv(
+        "dates",
+        "tests/dates.csv",
+        CsvReadOptions::new().schema(&schema),
+    )?;
+    Ok(())
+}

Review comment:
       What do you think about placing the data in `MemTable` instead, like this:
   
   ```
       let data = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(StringArray::from(vec![
               Some("a"),
               Some("b"),
               Some("c"),
               None,
           ]))],
       )?;
       let table = MemTable::try_new(schema, vec![vec![data]])?;
       ctx.register_table("dates", Box::new(table));
   ```
   
   I find it way easier to understand the final result, in this where the numbers "6" and "9" come from.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#issuecomment-769353055


   https://issues.apache.org/jira/browse/ARROW-11421


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#issuecomment-770362461


   uhm, the CI is complaining 
   
   ```
   error[E0432]: unresolved import `arrow::datatypes::DateUnit`
     --> datafusion/tests/sql.rs:26:17
      |
   26 |     datatypes::{DateUnit, TimeUnit},
      |                 ^^^^^^^^ no `DateUnit` in `datatypes`
   
   error[E0618]: expected function, found enum variant `DataType::Date32`
       --> datafusion/tests/sql.rs:1910:28
        |
   1910 |         Field::new("date", DataType::Date32(DateUnit::Day), false),
        |                            ^^^^^^^^^^^^^^^^---------------
        |                            |
        |                            call expression requires function
        |
   help: `DataType::Date32` is a unit variant, you need to write it without the parenthesis
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9355: ARROW-11421: [Rust][DataFusion] Support GROUP BY Date32

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9355:
URL: https://github.com/apache/arrow/pull/9355#issuecomment-770442552


   Ah thanks @jorgecarleitao missed that 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org