
Posted to issues@iceberg.apache.org by "romanstreamsets (via GitHub)" <gi...@apache.org> on 2023/02/09 23:11:06 UTC

[GitHub] [iceberg] romanstreamsets opened a new issue, #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark

romanstreamsets opened a new issue, #6796:
URL: https://github.com/apache/iceberg/issues/6796

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Say, I run this in Spark/Hive: `CREATE TABLE FOO (col1 int) USING iceberg;`
   
   Then I run something like this in Java/Groovy code:
   ```
   // Parse the Avro schema from its JSON string (Schema.parse(String) is deprecated in favor of Schema.Parser)
   org.apache.avro.Schema avroSchema = new org.apache.avro.Schema.Parser().parse("{ \"type\" : \"record\", \"name\" : \"Employee\", \"fields\" : [ { \"name\" : \"col1\", \"type\" : \"int\" } ] }");
   // Convert the Avro schema to an Iceberg schema
   org.apache.iceberg.Schema icebergSchema = AvroSchemaUtil.toIceberg(avroSchema);
   ```
   ... and proceed with writing data into the table:
   ```
   GenericRecord record = GenericRecord.create(icebergSchema);
   ImmutableList.Builder<GenericRecord> builder = ImmutableList.builder();
   builder.add(record.copy(ImmutableMap.of("col1", 111)));
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] romanstreamsets commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark

Posted by "romanstreamsets (via GitHub)" <gi...@apache.org>.

romanstreamsets commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1427150528

   Hi @rdblue, what's the rationale behind treating the Iceberg schema this way when creating a table?
   1. Create icebergSchema
   2. Create a table
   3. Discard the original schema
   4. Get the correct icebergSchema from the table
   
   It just looks like an unnecessary hoop.



Re: [I] AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1883991113

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.



Re: [I] AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
URL: https://github.com/apache/iceberg/issues/6796



[GitHub] [iceberg] github-actions[bot] commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1675550713

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.



[GitHub] [iceberg] Fokko commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1425782324

   I think we have to start the schema's field IDs from 1 anyway, so I took the liberty of creating a PR: https://github.com/apache/iceberg/pull/6802
   
   I'm still very curious to see what's going on with Hive returning a null, but that requires me to set up a Hive environment, which takes a bit more time.



[GitHub] [iceberg] rdblue commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1426203778

   I looked at @Fokko's PR, but I think that it doesn't address the underlying problem.
   
   It looks like the flow here is to take a schema then convert to Iceberg and create a table. Then convert the schema to Avro and expect that it matches the table. That is incorrect. When writing to a table, you must always use the table's schema. Tables are responsible for assigning and tracking field IDs and you should never guess at what they will probably be.
   
   The right solution is to convert your schema to Iceberg and create the table, then throw it away and use the table's schema. Convert the table schema to Avro and write records that way.
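
Why guessed field IDs can go wrong may be easier to see with a toy model in plain Java. It is only a sketch and assumes that a reader matches file columns to table columns by field ID and returns null for an ID it cannot find in the file. The class `ProjectByIdDemo` and its `read` method are hypothetical and are not Iceberg code, and this is not a confirmed diagnosis of the null mentioned elsewhere in this thread.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ProjectByIdDemo {
    /**
     * Toy read path: for each column in the table schema (name -> field ID),
     * look up the value stored in the file row under that ID.
     * An ID that is absent from the file row yields null.
     */
    public static Map<String, Object> read(Map<String, Integer> tableSchema,
                                           Map<Integer, Object> fileRow) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> column : tableSchema.entrySet()) {
            result.put(column.getKey(), fileRow.get(column.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Field IDs as the table assigned them (assumption: col1 got ID 1).
        Map<String, Integer> tableSchema = Map.of("col1", 1);

        // A row written by a writer that guessed col1 has ID 0.
        Map<Integer, Object> writtenWithGuessedIds = new HashMap<>();
        writtenWithGuessedIds.put(0, 111);

        // A row written by a writer that used the table's own schema.
        Map<Integer, Object> writtenWithTableIds = new HashMap<>();
        writtenWithTableIds.put(1, 111);

        System.out.println(read(tableSchema, writtenWithGuessedIds)); // {col1=null}
        System.out.println(read(tableSchema, writtenWithTableIds));   // {col1=111}
    }
}
```

In this model the value written under the guessed ID is still in the file, but no table column points at it, so the only safe source of IDs for a writer is the table's own schema.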



[GitHub] [iceberg] Fokko commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1425472316

   Hey @romanstreamsets,
   
   Thanks for opening up this issue. Do you want the field to be required or optional? There is a difference between the schema that's being parsed from the JSON and the one you define in the code:
   ![image](https://user-images.githubusercontent.com/1134248/218050928-72357430-056e-4f72-a584-380b78e11d57.png)
   
   However, this should not result in issues with reading null values. Let me try to reproduce this on my end.



[GitHub] [iceberg] romanstreamsets commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark

Posted by "romanstreamsets (via GitHub)" <gi...@apache.org>.

romanstreamsets commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1425523063

Hi @Fokko
That blog is exactly where I got my initial code. I have changed the source of my schema from "manual" to an Avro schema supplied as a String, because that's my use case.
Here is something I have found since posting this issue.
As you reproduced, converting the Avro schema yields an Iceberg schema whose field IDs are numbered starting from 0. It doesn't matter whether the fields are required or optional.
However, when I call catalog.createTable(..., icebergSchema, ...), the field IDs in the resulting table schema are numbered from 1.
When I call "create table" in Spark/Hive, the schema is also created with field IDs starting from 1.
So, my workaround currently is:
```
avroConvertedSchema = AvroSchemaUtil.toIceberg(avroSchema); // toIceberg() returns an Iceberg Schema, as in my original snippet
table = catalog.createTable(..., avroConvertedSchema, ...);
icebergSchema = table.schema(); // the table's schema carries the field IDs the table assigned
... // then I use icebergSchema when writing records to the file
```
In this case, avroConvertedSchema is not the same as icebergSchema; they differ exactly in the field ID numbering.
So the conversion method should number IDs from 1.
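
The numbering difference described above can be illustrated with a toy model in plain Java. It is only a sketch, assuming that the conversion numbers fields in order from 0 and that table creation ignores incoming field IDs and assigns fresh ones from 1, as observed in this thread. `FieldIdDemo`, `assignIds`, and `createTable` are hypothetical names and not Iceberg APIs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldIdDemo {
    /** Numbers the given column names in order, starting at firstId. Returns name -> field ID. */
    public static Map<String, Integer> assignIds(Iterable<String> names, int firstId) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int next = firstId;
        for (String name : names) {
            ids.put(name, next++);
        }
        return ids;
    }

    /** Toy table creation: ignores the IDs it is given and assigns fresh ones from 1. */
    public static Map<String, Integer> createTable(Map<String, Integer> requested) {
        return assignIds(requested.keySet(), 1);
    }

    public static void main(String[] args) {
        // Stand-in for the Avro-to-Iceberg conversion, numbering from 0.
        Map<String, Integer> converted = assignIds(List.of("col1", "col2"), 0);
        // Stand-in for catalog.createTable(...) followed by table.schema().
        Map<String, Integer> tableSchema = createTable(converted);

        System.out.println("converted=" + converted);     // converted={col1=0, col2=1}
        System.out.println("tableSchema=" + tableSchema); // tableSchema={col1=1, col2=2}
    }
}
```

Even if the conversion started at 1, the two schemas would only match by coincidence in this model, because the table never promises to keep the IDs it was handed.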


[GitHub] [iceberg] Fokko commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1425488573

   Can you share how you append the records to the table? This blog might also be helpful for what you're trying to achieve: https://tabular.io/blog/java-api-part-3/

