Posted to issues@iceberg.apache.org by "romanstreamsets (via GitHub)" <gi...@apache.org> on 2023/02/09 23:11:06 UTC
[GitHub] [iceberg] romanstreamsets opened a new issue, #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
romanstreamsets opened a new issue, #6796:
URL: https://github.com/apache/iceberg/issues/6796
### Apache Iceberg version
1.1.0 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
Say, I run this in Spark/Hive: `CREATE TABLE FOO (col1 int) USING iceberg;`
Then I run something like this in Java/Groovy code:
```
org.apache.avro.Schema avroSchema = org.apache.avro.Schema.parse("{ \"type\" : \"record\",\"name\" : \"Employee\", \"fields\" : [ { \"name\" : \"col1\" , \"type\" : \"int\" } ] }");
org.apache.iceberg.Schema icebergSchema = AvroSchemaUtil.toIceberg(avroSchema);
```
... and proceed with writing data into the table:
```
GenericRecord record = GenericRecord.create(icebergSchema);
ImmutableList.Builder<GenericRecord> builder = ImmutableList.builder();
builder.add(record.copy(ImmutableMap.of("col1", 111)));
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] romanstreamsets commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
Posted by "romanstreamsets (via GitHub)" <gi...@apache.org>.
romanstreamsets commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1427150528
Hi @rdblue, what's the rationale for handling the Iceberg Schema this way when creating a table?
1. Create icebergSchema
2. Create a table
3. Discard the original Schema
4. Get the correct icebergSchema from the table
It just looks like an unnecessary hoop.
Re: [I] AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark [iceberg]
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1883991113
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark [iceberg]
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
URL: https://github.com/apache/iceberg/issues/6796
[GitHub] [iceberg] github-actions[bot] commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1675550713
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
[GitHub] [iceberg] Fokko commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1425782324
I think we have to start the schema's field IDs from 1 anyway, so I took the liberty of creating a PR: https://github.com/apache/iceberg/pull/6802
I'm still very curious to see what's going on with Hive returning a null, but that requires me to set up a Hive environment, which takes a bit more time.
[GitHub] [iceberg] rdblue commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1426203778
I looked at @Fokko's PR, but I think that it doesn't address the underlying problem.
It looks like the flow here is to take a schema, convert it to Iceberg, and create a table, then convert the schema to Avro and expect that it matches the table. That is incorrect. When writing to a table, you must always use the table's schema. Tables are responsible for assigning and tracking field IDs, and you should never guess at what they will probably be.
The right solution is to convert your schema to Iceberg and create the table, then throw it away and use the table's schema. Convert the table schema to Avro and write records that way.
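To make the mismatch concrete, here is a minimal toy model in plain Java. It is only a sketch: `FieldIdDemo`, `Field`, `convertFromNames`, `createTable`, and `idOf` are hypothetical names invented for illustration and are not Iceberg's API. It assumes the behaviour reported in this thread: the converted schema numbers fields from 0, while the table assigns its own IDs starting from 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model only: every class and method here is hypothetical, NOT Iceberg's API. */
final class FieldIdDemo {

  /** A named field carrying a numeric ID, loosely mirroring a schema column. */
  static final class Field {
    final int id;
    final String name;

    Field(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  /** Stand-in for a converter that numbers fields from 0, as reported in the thread. */
  static List<Field> convertFromNames(List<String> names) {
    List<Field> fields = new ArrayList<>();
    for (int i = 0; i < names.size(); i++) {
      fields.add(new Field(i, names.get(i)));
    }
    return fields;
  }

  /** Stand-in for table creation: incoming IDs are ignored and fresh ones are assigned from 1. */
  static List<Field> createTable(List<Field> proposed) {
    List<Field> assigned = new ArrayList<>();
    int nextId = 1;
    for (Field field : proposed) {
      assigned.add(new Field(nextId++, field.name));
    }
    return Collections.unmodifiableList(assigned);
  }

  /** Returns the ID of the named field, or -1 if the schema has no such field. */
  static int idOf(List<Field> schema, String name) {
    for (Field field : schema) {
      if (field.name.equals(name)) {
        return field.id;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    List<Field> proposed = convertFromNames(Arrays.asList("col1"));
    List<Field> tableSchema = createTable(proposed);

    // The two schemas agree on names but not on IDs, so only tableSchema
    // is safe to use when writing records for this table.
    System.out.println("proposed col1 id = " + idOf(proposed, "col1"));
    System.out.println("table    col1 id = " + idOf(tableSchema, "col1"));
  }
}
```

In this model the proposed schema is useful only as input to `createTable`; anything written afterwards should look up IDs in the schema the table hands back, which is the same pattern as calling `table.schema()` after creating the real table.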
[GitHub] [iceberg] Fokko commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1425472316
Hey @romanstreamsets,
Thanks for opening up this issue. Do you want the field to be required or optional? There is a difference between the schema that's being parsed from the JSON and the one you define in the code:
![image](https://user-images.githubusercontent.com/1134248/218050928-72357430-056e-4f72-a584-380b78e11d57.png)
However, this should not result in issues with reading null values. Let me try to reproduce this on my end.
[GitHub] [iceberg] romanstreamsets commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
Posted by "romanstreamsets (via GitHub)" <gi...@apache.org>.
romanstreamsets commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1425523063
Hi @Fokko
That blog is exactly where I got my initial code. I changed the source of my schema from "manual" to an Avro schema supplied as a String, because that's my use case.
Here is something I have found since posting this issue.
As you reproduced, converting an Avro Schema yields an Iceberg Schema whose field IDs are numbered starting from 0. It doesn't matter whether the fields are required or optional.
However, when I call the catalog.createTable(..., icebergSchema, ...) method, the field IDs in the resulting schema are numbered from 1.
When I call "create table" in Spark/Hive, the schema is also created with field IDs starting from 1.
So, my workaround currently is:
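```
// toIceberg(...) returns an org.apache.iceberg.Schema, as in the snippet from the original report
avroConvertedSchema = AvroSchemaUtil.toIceberg(avroSchema);
table = catalog.createTable(..., avroConvertedSchema, ...);
// the table assigns its own field IDs, so read the schema back from the table
icebergSchema = table.schema();
... // then I use icebergSchema when writing records to the file
```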
In this case, avroConvertedSchema differs from icebergSchema precisely in its field ID numbering.
So, the convert() method should number IDs from 1.
[GitHub] [iceberg] Fokko commented on issue #6796: AvtoSchemaUtils.convert() method produces Iceberg schema different from that by Hive/Spark
Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #6796:
URL: https://github.com/apache/iceberg/issues/6796#issuecomment-1425488573
Can you share how you append the records to the table? This blog might also be helpful for what you're trying to achieve: https://tabular.io/blog/java-api-part-3/