Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/02 03:21:53 UTC

[GitHub] [iceberg] youngxinler opened a new issue, #6510: When using the java iceberg api to write data into a partition table, an error occurs For a timestamp partition field, the type cannot be parsed correctly.

youngxinler opened a new issue, #6510:
URL: https://github.com/apache/iceberg/issues/6510

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Other
   
   ### Please describe the bug 🐞
   
   When using the Iceberg Java API to write data into a partitioned table, an error occurs: for a timestamp partition field, the value's type is not handled correctly.
   
   ```    
           Configuration configuration = new Configuration();
           // this is a local file catalog
           HadoopCatalog hadoopCatalog = new HadoopCatalog(configuration, icebergWareHousePath);
           TableIdentifier name = TableIdentifier.of("logging", "logs");
           Schema schema = new Schema(
                   Types.NestedField.required(1, "level", Types.StringType.get()),
                   Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()),
                   Types.NestedField.required(3, "message", Types.StringType.get()),
                   Types.NestedField.optional(4, "call_stack", Types.ListType.ofRequired(5, Types.StringType.get()))
           );
           PartitionSpec spec = PartitionSpec.builderFor(schema)
                   .hour("event_time")
                   .identity("level")
                   .build();
           Table table = hadoopCatalog.createTable(name, schema, spec);
   
           GenericAppenderFactory appenderFactory = new GenericAppenderFactory(table.schema());
   
           int partitionId = 1, taskId = 1;
           OutputFileFactory outputFileFactory = OutputFileFactory.builderFor(table, partitionId, taskId).format(FileFormat.PARQUET).build();
           final PartitionKey partitionKey = new PartitionKey(table.spec(), table.spec().schema());
   
           // PartitionedFanoutWriter routes each record to its partition and creates a writer per partition
           PartitionedFanoutWriter<Record> partitionedFanoutWriter = new PartitionedFanoutWriter<Record>(table.spec(), FileFormat.PARQUET, appenderFactory, outputFileFactory, table.io(), TARGET_FILE_SIZE_IN_BYTES) {
               @Override
               protected PartitionKey partition(Record record) {
                   partitionKey.partition(record);
                   return partitionKey;
               }
           };
   
           Random random = new Random();
           List<String> levels = Arrays.asList("info", "debug", "error", "warn");
           GenericRecord genericRecord = GenericRecord.create(table.schema());
   
           // write 1000 sample records
           for (int i = 0; i < 1000; i++) {
               GenericRecord record = genericRecord.copy();
               record.setField("level",  levels.get(random.nextInt(levels.size())));
   //            record.setField("event_time", System.currentTimeMillis());
               record.setField("event_time", OffsetDateTime.now());
               record.setField("message", "Iceberg is a great table format");
               record.setField("call_stack", Arrays.asList("NullPointerException"));
               partitionedFanoutWriter.write(record);
           }
   
   
           AppendFiles appendFiles = table.newAppend();
   
           // add the written data files to the pending append
           Arrays.stream(partitionedFanoutWriter.dataFiles()).forEach(appendFiles::appendFile);
   
           // commit the append as a new snapshot
           Snapshot newSnapshot = appendFiles.apply();
           appendFiles.commit();
   ```
   
   When I set event_time to a Long and write, it reports an error:
   ```
   java.lang.ClassCastException: java.lang.Long cannot be cast to java.time.OffsetDateTime
   
   	at org.apache.iceberg.data.parquet.BaseParquetWriter$TimestamptzWriter.write(BaseParquetWriter.java:281)
   	at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:589)
   	at org.apache.iceberg.parquet.ParquetWriter.add(ParquetWriter.java:138)
   	at org.apache.iceberg.io.DataWriter.write(DataWriter.java:71)
   	at org.apache.iceberg.io.BaseTaskWriter$RollingFileWriter.write(BaseTaskWriter.java:362)
   	at org.apache.iceberg.io.BaseTaskWriter$RollingFileWriter.write(BaseTaskWriter.java:345)
   	at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.write(BaseTaskWriter.java:277)
   	at org.apache.iceberg.io.PartitionedFanoutWriter.write(PartitionedFanoutWriter.java:63)
   ```
   
   When I set event_time to a java.time.OffsetDateTime and write, it also reports an error:
   ```
   java.lang.IllegalStateException: Not an instance of java.lang.Long: 2023-01-02T11:20:20.746+08:00
   
   	at org.apache.iceberg.data.GenericRecord.get(GenericRecord.java:123)
   	at org.apache.iceberg.Accessors$PositionAccessor.get(Accessors.java:71)
   	at org.apache.iceberg.Accessors$PositionAccessor.get(Accessors.java:58)
   	at org.apache.iceberg.PartitionKey.partition(PartitionKey.java:106)
   ```
   
   This problem does not occur if the table is unpartitioned. I looked at the internal code of PartitionKey; it seems the problem is related to the transform logic for the partition field?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] youngxinler commented on issue #6510: [Java API] When using the java iceberg api to write data into a partition table, an error occurs For a timestamp partition field, the type cannot be parsed correctly.

Posted by GitBox <gi...@apache.org>.
youngxinler commented on issue #6510:
URL: https://github.com/apache/iceberg/issues/6510#issuecomment-1378175142

   Thanks @nastra and @rdblue for the replies. I tried again, using InternalRecordWrapper when partitioning, and the problem is solved.
   
   I looked at the Java API doc in the Iceberg docs, and it does not describe how to use the Java API to write data to an Iceberg table. I think it would be better to add an example of how to write data using the Java API; this will help those who do not need Spark, Flink, or other large compute engines to write to Iceberg tables.
   I'd like to add this to the Iceberg API doc. What do you think? @nastra @rdblue




[GitHub] [iceberg] nastra commented on issue #6510: [Java API][BUG]When using the java iceberg api to write data into a partition table, an error occurs For a timestamp partition field, the type cannot be parsed correctly.

Posted by GitBox <gi...@apache.org>.
nastra commented on issue #6510:
URL: https://github.com/apache/iceberg/issues/6510#issuecomment-1375859259

   I've checked the code and it looks like we have a test that does something similar. Looking at https://github.com/apache/iceberg/blob/master/arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java#L911-L912 one can see that we're using a `LocalDateTimeToLongMicros` helper class that does some additional checking for `LocalDateTime` / `OffsetDateTime` stuff: 
   https://github.com/apache/iceberg/blob/master/arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java#L1215-L1250.
   
   @rdblue I wonder whether we should add this additional checking for `LocalDateTime` / `OffsetDateTime` to  `GenericRecord.get()`?




[GitHub] [iceberg] rdblue commented on issue #6510: [Java API][BUG]When using the java iceberg api to write data into a partition table, an error occurs For a timestamp partition field, the type cannot be parsed correctly.

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #6510:
URL: https://github.com/apache/iceberg/issues/6510#issuecomment-1377570948

   The problem is in the call to `PartitionKey.partition`, where this assumes that the data value from the Generic object model can be used for a partition key. It can't because the two use different representations. Generics use Java 8 date/time classes and internals (manifest writers in this case) expect the long micros representation.
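
A minimal sketch of the two representations described above, assuming the internal form of a `timestamptz` value is microseconds since the Unix epoch in UTC, and that an hourly partition value is the number of whole hours since the epoch. The names `TimestampMicros`, `toMicros`, and `hourOrdinal` are hypothetical and only `java.time` is used; this is not Iceberg's own code.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, for illustration only.
class TimestampMicros {
    private static final OffsetDateTime EPOCH = Instant.EPOCH.atOffset(ZoneOffset.UTC);
    private static final long MICROS_PER_HOUR = 3_600_000_000L;

    // Generic-model value (OffsetDateTime) -> internal value (long micros since epoch).
    static long toMicros(OffsetDateTime timestamp) {
        return ChronoUnit.MICROS.between(EPOCH, timestamp);
    }

    // Internal value -> whole hours since epoch; floorDiv keeps pre-1970 values consistent.
    static long hourOrdinal(long micros) {
        return Math.floorDiv(micros, MICROS_PER_HOUR);
    }

    public static void main(String[] args) {
        // The timestamp from the IllegalStateException message in the report.
        OffsetDateTime ts = OffsetDateTime.parse("2023-01-02T11:20:20.746+08:00");
        long micros = toMicros(ts);
        System.out.println(micros);              // prints 1672629620746000
        System.out.println(hourOrdinal(micros)); // prints 464619
    }
}
```

The `Long` on the second line is the kind of value the partitioning code asks for, which is why handing it an `OffsetDateTime` fails, while the Parquet writer for generic records expects the `OffsetDateTime` and rejects a `Long`.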
   
   You just need to wrap the record in `InternalRecordWrapper` when passing it to the `partition` method, which will convert between the different representations. You can reuse the wrapper.
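
The wrapping pattern suggested above can be sketched without Iceberg on the classpath. The types below (`Row`, `ArrayRow`, `ConvertingRowWrapper`) are hypothetical stand-ins, not Iceberg classes: `ArrayRow` plays the role of a generic record that holds `OffsetDateTime` values, and `ConvertingRowWrapper` plays the role of a reusable wrapper that exposes them as long micros. In the reporter's code, the corresponding change would presumably be to create one `InternalRecordWrapper` for the table's struct type and call `partitionKey.partition(wrapper.wrap(record))` inside `partition(Record record)`.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical stand-in for a positional record interface.
interface Row {
    <T> T get(int pos, Class<T> type);
}

// Stand-in for a generic record: values are stored as the caller set them.
final class ArrayRow implements Row {
    private final Object[] values;

    ArrayRow(Object... values) {
        this.values = values;
    }

    @Override
    public <T> T get(int pos, Class<T> type) {
        Object value = values[pos];
        if (value == null || type.isInstance(value)) {
            return type.cast(value);
        }
        throw new IllegalStateException("Not an instance of " + type.getName() + ": " + value);
    }
}

// Stand-in for a reusable wrapper that converts values on read.
final class ConvertingRowWrapper implements Row {
    private static final OffsetDateTime EPOCH = Instant.EPOCH.atOffset(ZoneOffset.UTC);
    private Row wrapped;

    // Re-point the same wrapper at another row instead of allocating a new one.
    ConvertingRowWrapper wrap(Row row) {
        this.wrapped = row;
        return this;
    }

    @Override
    public <T> T get(int pos, Class<T> type) {
        Object value = wrapped.get(pos, Object.class);
        if (value instanceof OffsetDateTime) {
            // Convert the date/time object to long micros since the epoch.
            value = ChronoUnit.MICROS.between(EPOCH, (OffsetDateTime) value);
        }
        return type.cast(value);
    }
}

class WrapperDemo {
    public static void main(String[] args) {
        Row record = new ArrayRow("info", OffsetDateTime.parse("2023-01-02T11:20:20.746+08:00"));
        ConvertingRowWrapper wrapper = new ConvertingRowWrapper();

        // Reading the raw record as Long would throw IllegalStateException;
        // reading through the wrapper yields the converted value.
        System.out.println(wrapper.wrap(record).get(1, Long.class));
    }
}
```

Because `wrap` only swaps the reference it holds, a single wrapper instance can be reused for every record in the write loop, which matches the note above that the wrapper can be reused.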




[GitHub] [iceberg] nastra commented on issue #6510: [Java API] When using the java iceberg api to write data into a partition table, an error occurs For a timestamp partition field, the type cannot be parsed correctly.

Posted by GitBox <gi...@apache.org>.
nastra commented on issue #6510:
URL: https://github.com/apache/iceberg/issues/6510#issuecomment-1377600332

   Thanks @rdblue for checking/reviewing, I wasn't aware that `InternalRecordWrapper` existed. 
   @youngxinler can you please check whether this solves the issue for you?




[GitHub] [iceberg] nastra closed issue #6510: [Java API] When using the java iceberg api to write data into a partition table, an error occurs For a timestamp partition field, the type cannot be parsed correctly.

Posted by GitBox <gi...@apache.org>.
nastra closed issue #6510: [Java API] When using the java iceberg api to write data into a partition table, an error occurs For a timestamp partition field, the type cannot be parsed correctly.
URL: https://github.com/apache/iceberg/issues/6510




[GitHub] [iceberg] nastra commented on issue #6510: [Java API] When using the java iceberg api to write data into a partition table, an error occurs For a timestamp partition field, the type cannot be parsed correctly.

Posted by GitBox <gi...@apache.org>.
nastra commented on issue #6510:
URL: https://github.com/apache/iceberg/issues/6510#issuecomment-1378327451

   @youngxinler yes this is a good idea to add some examples. Ping any of us to get a review on the PR. I'm going to close this issue, since it's resolved and the proposed approach worked for you.

