
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/01 21:30:44 UTC

[GitHub] [iceberg] joao-parana opened a new issue, #6097: Partitioning based on the "identity" transform doesn't work in 1.0.0 Java API.

joao-parana opened a new issue, #6097:
URL: https://github.com/apache/iceberg/issues/6097

   ### Apache Iceberg version
   
   1.0.0 (latest release)
   
   ### Query engine
   
   _No response_
   
   ### Please describe the bug 🐞
   
   Hi folks, I would like to report something that might be a problem with partitioning based on the "identity" transform.
   
   I created a schema like this:
   
   ```java
   this.schema = new Schema(
     required(1, "ts", Types.TimestampType.withoutZone()),
     required(2, "hotel_id", Types.LongType.get()),
     optional(3, "hotel_name", Types.StringType.get()),
     required(4, "arrival_date", Types.DateType.get()),
     required(5, "value", Types.DoubleType.get()));
   ```
   
   I then appended data as Parquet files to five tables, each with a different partition spec:
   
   1. unpartitioned
   2. identity("hotel_name")
   3. month("ts")
   4. identity("hotel_name") AND month("ts")
   5. day("ts")
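   
   For reference, here is a minimal sketch of the partition values these transforms would produce for a sample timestamp. It assumes, per the Iceberg table spec, that `month` and `day` yield integers counting months and days from 1970-01-01, and that `identity` passes the column value through unchanged. `PartitionValuesSketch` and its methods are hypothetical helpers built on `java.time` only; they are not Iceberg API calls.
   
   ```java
   import java.time.LocalDate;
   import java.time.LocalDateTime;
   import java.time.temporal.ChronoUnit;
   
   // Hypothetical helper, not part of Iceberg: mimics the month/day transform results.
   class PartitionValuesSketch {
       private static final LocalDate EPOCH = LocalDate.of(1970, 1, 1);
   
       /** month("ts"): whole months between 1970-01 and the month of the timestamp. */
       static int month(LocalDateTime ts) {
           LocalDate firstOfMonth = ts.toLocalDate().withDayOfMonth(1);
           return (int) ChronoUnit.MONTHS.between(EPOCH, firstOfMonth);
       }
   
       /** day("ts"): whole days between 1970-01-01 and the date of the timestamp. */
       static int day(LocalDateTime ts) {
           return (int) ts.toLocalDate().toEpochDay();
       }
   
       public static void main(String[] args) {
           LocalDateTime ts = LocalDateTime.parse("2022-10-30T21:00:53.653993");
           // identity("hotel_name") would simply carry the column value, e.g. "hotel_name-1000".
           System.out.println("month(ts) = " + month(ts)); // prints month(ts) = 633
           System.out.println("day(ts)   = " + day(ts));   // prints day(ts)   = 19295
       }
   }
   ```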
   
   For testing purposes, I inserted only one record into each table, using its respective partitioning. Querying each table then returns the following:
   
   ```txt
   001:		Record(2022-10-30T21:00:50.929375, 1000, hotel_name-1000, 2023-01-01, 4.13)
   --------------------------------------------------------------------------------------
   002:		Record(2022-10-30T21:00:53.238515, 1000, null, 2023-01-01, 4.13)
   --------------------------------------------------------------------------------------
   003:		Record(2022-10-30T21:00:53.461455, 1000, hotel_name-1000, 2023-01-01, 4.13)
   --------------------------------------------------------------------------------------
   004:		Record(2022-10-30T21:00:53.653993, 1000, null, 2023-01-01, 4.13)
   --------------------------------------------------------------------------------------
   005:		Record(2022-10-30T21:00:53.843971, 1000, hotel_name-1000, 2023-01-01, 4.13)
   ```
   
   Note that in cases 2 and 4, where I used `identity()` partitioning, **the hotel_name column is NULL**.
   
   BTW, the data files in Parquet format were created with the `SortOrder` shown below:
   
   ```java
   final SortOrder sortOrder = SortOrder.builderFor(schema)
     .asc("ts", NullOrder.NULLS_FIRST)
     .asc("hotel_name", NullOrder.NULLS_FIRST)
     .build();
   ```
   
   It is also worth noting that I ran a query (with Python 3.10) against the `root` directory of the catalog using `pyarrow` and `datafusion`.  
   That query correctly shows the `hotel_name` data for all Parquet files, including those for which the query via the **Iceberg Java API** returned null.
   It follows that the Parquet files themselves are correct.
   
   My test in Python 3.10 is:
   
   ```python
   d = '/tmp/iceberg-test-2/bookings/'
   import datafusion
   print(datafusion.__version__)
   import pyarrow as pa
   from datafusion import SessionContext
   ctx = SessionContext()
   ctx.register_parquet("soma", d)
   print(ctx.tables())
   rb = ctx.sql("SELECT * FROM soma").collect()
   t = pa.Table.from_batches(rb)
   print(t.to_pydict())
   ```
   
   And the result was:
   ```python
   {
     'ts': [
       datetime.datetime(2022, 10, 30, 19, 36, 45, 434510),
       datetime.datetime(2022, 10, 30, 19, 36, 43, 137844),
       datetime.datetime(2022, 10, 30, 19, 36, 44, 902886),
       datetime.datetime(2022, 10, 30, 19, 36, 45, 155755),
       datetime.datetime(2022, 10, 30, 19, 36, 45, 688557)
     ],
     'hotel_id': [1000, 1000, 1000, 1000, 1000],
     'hotel_name': ['hotel_name-1000', 'hotel_name-1000', 'hotel_name-1000', 'hotel_name-1000', 'hotel_name-1000'],
     'arrival_date': [
       datetime.date(2023, 1, 1),
       datetime.date(2023, 1, 1),
       datetime.date(2023, 1, 1),
       datetime.date(2023, 1, 1),
       datetime.date(2023, 1, 1)
     ],
     'value': [4.13, 4.13, 4.13, 4.13, 4.13]
   }
   ```
   
   Timestamps are in microseconds.
   
   The complete test code is here: https://gist.github.com/joao-parana/2adbd97c70c701668cd5e778a92262ea
   
   I'm using version `1.0.0` of the **Iceberg Java API**.
   
   This issue was also posted in the Iceberg Slack channel.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] mwullink commented on issue #6097: Partitioning based on the "identity" transform doesn't work in 1.0.0 Java API.

Posted by "mwullink (via GitHub)" <gi...@apache.org>.

mwullink commented on issue #6097:
URL: https://github.com/apache/iceberg/issues/6097#issuecomment-1663744093

   I have the same issue, but I cannot find the explanation in the Slack channel.
   Could you post some more information about the problem and the fix here, please?
   
   Thanks!



[GitHub] [iceberg] mwullink commented on issue #6097: Partitioning based on the "identity" transform doesn't work in 1.0.0 Java API.

Posted by "mwullink (via GitHub)" <gi...@apache.org>.

mwullink commented on issue #6097:
URL: https://github.com/apache/iceberg/issues/6097#issuecomment-1663896261

   Found my problem: I was using `GenericAppenderFactory` and forgot to pass the partition spec as the second constructor argument. Without it, the appender assumes no partitions are used.
   
   This works for partitioning:
   new GenericAppenderFactory(table.schema(), table.spec());
   
   



[GitHub] [iceberg] joao-parana commented on issue #6097: Partitioning based on the "identity" transform doesn't work in 1.0.0 Java API.

Posted by GitBox <gi...@apache.org>.

joao-parana commented on issue #6097:
URL: https://github.com/apache/iceberg/issues/6097#issuecomment-1302802386

   The error has been fixed. It was not a bug but an error in my program. @rdblue explained to me how to fix it (https://apache-iceberg.slack.com/archives/C03LG1D563F/p1667408382726389?thread_ts=1667163877.313839&cid=C03LG1D563F).
   
   See below what was missing:
   
   ```java
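   // Check whether the first partition field is named hotel_name.
   boolean partitionSpecHaveHotelName = partitionSpec.fields().get(0).name().equals("hotel_name");
   . . .
   // Build the partition tuple and set the value of the identity-partitioned column.
   GenericRecord partitionTuple = GenericRecord.create(partitionSpec.partitionType());
   partitionTuple.setField("hotel_name", "hotel_name_1000");
   . . . 
   // Attach the partition tuple to the data file's metadata.
   dataFile = DataFiles.builder(table.spec())
           .withPartition(partitionTuple)
           . . . 
           .build();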
   ```
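   
   A likely explanation of the symptom, stated here as an assumption rather than quoted from the Iceberg source: on read, identity-partitioned columns are filled in from the partition tuple stored in the data file's metadata instead of from the Parquet file itself, so an unset tuple yields null even though the file holds the value. The toy model below illustrates only that behavior in plain Java; `IdentityPartitionReadSketch` and `read` are hypothetical names, not Iceberg API.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   
   // Toy model only: not Iceberg code, just the assumed read-side behavior.
   class IdentityPartitionReadSketch {
       /**
        * Returns the row as a reader would expose it: identity-partitioned columns
        * are overwritten with the value from the partition tuple (null if absent).
        */
       static Map<String, Object> read(Map<String, ?> storedRow,
                                       List<String> identityColumns,
                                       Map<String, ?> partitionTuple) {
           Map<String, Object> result = new LinkedHashMap<>(storedRow);
           for (String column : identityColumns) {
               result.put(column, partitionTuple.get(column));
           }
           return result;
       }
   
       public static void main(String[] args) {
           Map<String, Object> storedRow = new LinkedHashMap<>();
           storedRow.put("hotel_id", 1000L);
           storedRow.put("hotel_name", "hotel_name-1000");
           List<String> identityColumns = List.of("hotel_name");
   
           // Partition tuple never set: the column reads as null.
           System.out.println(read(storedRow, identityColumns, Map.of()));
           // prints {hotel_id=1000, hotel_name=null}
   
           // Partition tuple set, as in the fix above: the value comes back.
           System.out.println(read(storedRow, identityColumns,
                   Map.of("hotel_name", "hotel_name-1000")));
           // prints {hotel_id=1000, hotel_name=hotel_name-1000}
       }
   }
   ```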
   



[GitHub] [iceberg] joao-parana closed issue #6097: Partitioning based on the "identity" transform doesn't work in 1.0.0 Java API.

Posted by GitBox <gi...@apache.org>.

joao-parana closed issue #6097: Partitioning based on the "identity" transform doesn't work in 1.0.0 Java API.
URL: https://github.com/apache/iceberg/issues/6097

