Posted to github@arrow.apache.org by "maxburke (via GitHub)" <gi...@apache.org> on 2023/03/03 16:06:08 UTC

[GitHub] [arrow-datafusion] maxburke opened a new issue, #5470: Joins on type FixedSizeBinary(16) returning incorrect results

maxburke opened a new issue, #5470:
URL: https://github.com/apache/arrow-datafusion/issues/5470

   (Stemming from #5456)
   
   I've attached two Parquet files. Each file contains a single column with 131072 rows, generated from Arrow as a single record batch. The `fsb16.parquet` file contains a column of type `FixedSizeBinary(16)`; the `ints.parquet` file contains a column of type `Int64`.
   
   If I inner join the ints table with itself, I get a result set of the expected 131072 rows:
   
   ```
   ❯ create external table t0 stored as parquet location 'ints.parquet';
   ❯ select * from t0 inner join t0 as t1 on t0.ints = t1.ints;
   +--------+--------+
   ...[snip]...
   +--------+--------+
   131072 rows in set. Query took 0.530 seconds.
   ```
   
   But if I do the same query with the FixedSizeBinary(16) inputs, it returns 358946 rows (?):
   
   ```
   ❯ create external table t0 stored as parquet location 'fsb16.parquet';
   ❯ select * from t0 inner join t0 as t1 on t0.journey_id = t1.journey_id;
   +----------------------------------+----------------------------------+
   ...[snip]...
   +----------------------------------+----------------------------------+
   358946 rows in set. Query took 2.073 seconds.
   ```
   
   In this particular case, all the FixedSizeBinary(16) values are non-null, though I don't think that should make a difference.
   
   [fsb16.parquet.gz](https://github.com/apache/arrow-datafusion/files/10875507/fsb16.parquet.gz)
   [ints.parquet.gz](https://github.com/apache/arrow-datafusion/files/10875508/ints.parquet.gz)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] maxburke commented on issue #5470: Joins on type FixedSizeBinary(16) returning incorrect results

Posted by "maxburke (via GitHub)" <gi...@apache.org>.
maxburke commented on issue #5470:
URL: https://github.com/apache/arrow-datafusion/issues/5470#issuecomment-1453928585

   I've added a test to this branch -- datafusion/core, `fixed_size_binary_column_join`: https://github.com/urbanlogiq/arrow-datafusion/tree/18.0.0-ul-fsb-join-dupes




[GitHub] [arrow-datafusion] comphead commented on issue #5470: Joins on type FixedSizeBinary(16) returning incorrect results

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #5470:
URL: https://github.com/apache/arrow-datafusion/issues/5470#issuecomment-1456658617

   @maxburke please consider this equivalent local test:
   
   ```
   use std::sync::Arc;

   use datafusion::arrow::array::{FixedSizeBinaryArray, Int64Array};
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::arrow::record_batch::RecordBatch;
   use datafusion::error::Result;
   use datafusion::prelude::SessionContext;

   #[tokio::test]
   async fn test_join_binary() -> Result<()> {
       let t1 = Arc::new(Schema::new(vec![
           Field::new("a", DataType::FixedSizeBinary(16), true),
       ]));
       let t2 = Arc::new(Schema::new(vec![
           Field::new("a", DataType::Int64, true),
       ]));
       let batch1 = RecordBatch::try_new(
           t1,
           vec![
               Arc::new(FixedSizeBinaryArray::from(vec![Some("1111111111111111".as_bytes()), Some("1111111111111112".as_bytes()), None])),
           ],
       )?;
       let batch2 = RecordBatch::try_new(
           t2,
           vec![
               Arc::new(Int64Array::from(vec![None, Some(1111111111111111_i64), Some(1111111111111112_i64)])),
           ],
       )?;
   
       let ctx = SessionContext::new();
   
       ctx.register_batch("t1", batch1)?;
       ctx.register_batch("t2", batch2)?;
   
       // Self-join of t1 (aliased as t2 in the query); the registered Int64 table t2 is not referenced.
       let df = ctx.sql("select * from t1 inner join t1 as t2 on t1.a = t2.a").await?;
   
       let results = df.collect().await?;
       println!("{:?}", &results);
       Ok(())
   }
   ```
   
   No duplicates in the result.




[GitHub] [arrow-datafusion] maxburke commented on issue #5470: Joins on type FixedSizeBinary(16) returning incorrect results

Posted by "maxburke (via GitHub)" <gi...@apache.org>.
maxburke commented on issue #5470:
URL: https://github.com/apache/arrow-datafusion/issues/5470#issuecomment-1456726954

   Ugh; I think I was wrong here. The table has duplicates; when the values are unique the results are as expected. Validating against Postgres I get the same results as above.
   
   Apologies; closing this issue.
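
For readers wondering why duplicate keys inflate the row count: in an inner self-join on an equality condition, a key that occurs n times matches itself n * n times, so the result has more rows than the input as soon as any key repeats. The sketch below is only an illustration of that arithmetic using the standard library. `self_join_row_count` is a hypothetical helper, not part of DataFusion or Arrow, and it assumes standard SQL equality semantics in which NULL keys never match.

```rust
use std::collections::HashMap;

/// Number of rows an inner self-join on `keys` would produce under SQL
/// equality: a non-null key occurring n times contributes n * n rows,
/// and NULL (`None`) keys match nothing.
/// Hypothetical helper for illustration; not a DataFusion API.
fn self_join_row_count(keys: &[Option<[u8; 16]>]) -> u64 {
    let mut counts: HashMap<[u8; 16], u64> = HashMap::new();
    // `flatten` skips the `None` entries, mirroring NULL != NULL in a join.
    for key in keys.iter().flatten() {
        *counts.entry(*key).or_insert(0) += 1;
    }
    counts.values().map(|n| n * n).sum()
}

fn main() {
    let a = *b"1111111111111111";
    let b = *b"1111111111111112";

    // Unique keys: one output row per non-null input row.
    let unique = vec![Some(a), Some(b), None];
    println!("unique keys: {} rows", self_join_row_count(&unique));

    // `a` appears three times, so it alone contributes 3 * 3 rows.
    let dupes = vec![Some(a), Some(a), Some(a), Some(b)];
    println!("duplicated keys: {} rows", self_join_row_count(&dupes));
}
```

Under these assumptions, a result larger than the 131072 input rows is what a column with repeated values would be expected to produce, which is consistent with the Postgres cross-check mentioned above.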




[GitHub] [arrow-datafusion] maxburke closed issue #5470: Joins on type FixedSizeBinary(16) returning incorrect results

Posted by "maxburke (via GitHub)" <gi...@apache.org>.
maxburke closed issue #5470: Joins on type FixedSizeBinary(16) returning incorrect results
URL: https://github.com/apache/arrow-datafusion/issues/5470

