You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/06 23:13:57 UTC

[GitHub] [iceberg] ayushtkn opened a new pull request, #6369: Increase Partition Start Id to 10000

ayushtkn opened a new pull request, #6369:
URL: https://github.com/apache/iceberg/pull/6369

   Increase the partition field start to 10K to avoid collisions with columns.
   [#6368](https://github.com/apache/iceberg/issues/6368)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341564011

   I'd be really careful with this change. Even though the spec may not mention it directly, that was always our assumption. I will need to take a closer look in a bit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ayushtkn commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
ayushtkn commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1340677494

   >written prior to this change will still have the collision with the partition field IDs and will only be fixed if they are, or at lest their metadata is rewritten, right?
   
   yep


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] gaborkaszab commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
gaborkaszab commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1340696570

   Thanks for the answer, @ayushtkn!
   I wonder if it would make sense to make the already written tables work as expected even with more than 1000 cols. E.g. when reading their metadata we could convert their field IDs (starting from 1000) to the new way (starting from 10000). They would remain the same in the written metadata files but internally we could keep track of the field IDs starting from 10000.
   Would it make sense? I can create a separate issue for this for further discussion and if it does make sense I could give the implementation a try. @RussellSpitzer ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341155158

   @gaborkaszab I would probably just recommended dropping and recreating the table (via metadata) or having a separate utility for modifying existing tables. I really don't think many folks have 1000 columns since we would have seen this before so I don't think the upgrade procedure really needs to exist inside the Iceberg repo.
   
   I think this change is pretty small and harmless though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ayushtkn commented on pull request #6369: Increase Partition Start Id to 10000

Posted by "ayushtkn (via GitHub)" <gi...@apache.org>.
ayushtkn commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1487665688

   I think you created : https://github.com/apache/iceberg/issues/7221
   Similar to my previous one here:
   https://github.com/apache/iceberg/issues/6368
   My first take was to bump this number up to 10k which I felt a safe number and was in a way inline with Hive limits, but the approach had concerns that it might be risky thing to do. I did validated couple of stuff but not in a state to say, ‘I guarantee it can’t cause any issue’ for folks migrating/upgrading from the version having older limits


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] gaborkaszab commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
gaborkaszab commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341585151

   I also checked if PartitionSpec.hasSequentialIds() could cause any issues with existing tables. The first use that you linked seems to be the case when we re-write the table metadata and it checks the new partition field ID. This should be fine in my opinion.
   However, the other use-case here (https://github.com/apache/iceberg/blob/6697129a314d98cb793601495f7ebb2ae000b40a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1409-L1414) seems suspicious. First I thought this is the case when we load metadata from JSON, but then it seems that TableMetadata.addPartitionSpecInternal() is used for [adding new partition specs to a table](https://github.com/apache/iceberg/blob/6697129a314d98cb793601495f7ebb2ae000b40a/core/src/main/java/org/apache/iceberg/MetadataUpdate.java#L122), that also seems okay with the new start ID.
   
   I can take another look tomorrow (it's kind of late now to think :) )


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ayushtkn commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
ayushtkn commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341587366

   Thanx folks, Just thinking about the sequentialId check, why it needs to rely on the start id, Does changing that check like that help
   ```
     static boolean hasSequentialIds(PartitionSpec spec) {
       for (int i = 1; i < spec.fields.length; i += 1) {
         if (spec.fields[i].fieldId() != spec.fields[i - 1].fieldId() + 1) {
           return false;
         }
       }
       return true;
     }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341825580

   Looks like @RussellSpitzer, @szehon-ho, and @aokolnychyi are looking at this and have noted the issues with v1 tables.
   
   I think that this is risky because not all v1 readers will use partition field IDs, but we do write them into partition specs now. Currently, we are careful that those IDs are always the same, but this change would cause them to differ. It may be safe, but I'd test very thoroughly and possibly put this behind a flag.
   
   I'd also like to understand why this is needed. Partition field IDs are stored in manifest files, not data files. Partition field IDs should generally not mix with data field IDs from the Iceberg schema.
   
   The only case I can think of right now is projecting the `_partition` metadata field when reading a table... but in that case I think there needs to be a better solution. Running into a collision at 10,000 fields is still possible with this PR. We should just assign new field IDs to the `_partition` metadata fields.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341558435

   Double checking the relevant part of the spec and we never actually demand that partition id's start at 1000. So I think we are in the clear hear from a backwards compatibility standpoint as well>
   
   ```Partition Evolution
   Table partitioning can be evolved by adding, removing, renaming, or reordering partition spec fields.
   
   Changing a partition spec produces a new spec identified by a unique spec ID that is added to the table’s list of partition specs and may be set as the table’s default spec.
   
   When evolving a spec, changes should not cause partition field IDs to change because the partition field IDs are used as the partition tuple field IDs in manifest files.
   
   In v2, partition field IDs must be explicitly tracked for each partition field. New IDs are assigned based on the last assigned partition ID in table metadata.
   
   In v1, partition field IDs were not tracked, but were assigned sequentially starting at 1000 in the reference implementation. This assignment caused problems when reading metadata tables based on manifest files from multiple specs because partition fields with the same ID may contain different data types. For compatibility with old versions, the following rules are recommended for partition evolution in v1 tables:
   
   Do not reorder partition fields
   Do not drop partition fields; instead replace the field’s transform with the void transform
   Only add partition fields at the end of the previous partition spec```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341588733

   Yeah I think the hasSequential code needs to be modified. Otherwise I can have a table whose first spec id is 1000 before this patch, then after this patch I try to add another field. The hasSequential would see 1000 != 10000 and would throw an error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] gaborkaszab commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
gaborkaszab commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1340673537

   This seems a reasonable change for me. Just a question for my better understanding: The tables that we have already written will still have their partition field IDs from 1000, right? So in case we have some tables that have more than 1000 cols and written prior to this change will still have the collision with the partition field IDs and will only be fixed if they are rewritten, right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ShubhamSharmaCSE commented on pull request #6369: Increase Partition Start Id to 10000

Posted by "ShubhamSharmaCSE (via GitHub)" <gi...@apache.org>.
ShubhamSharmaCSE commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1486398210

   any update on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] TuroczyX commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
TuroczyX commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341545626

   @szehon-ho https://www.youtube.com/watch?v=rR4n-0KYeKQ  about the LGTM :) Just for fun. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341567681

   The area I'm worried about now, is 
   https://github.com/apache/iceberg/blob/29187353e18a0ec477f638a00353340f24fb704e/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L602-L609
   
   Which is a check for V1 Tables which is only used
   https://github.com/apache/iceberg/blob/6697129a314d98cb793601495f7ebb2ae000b40a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L617-L626
   
   and
   https://github.com/apache/iceberg/blob/6697129a314d98cb793601495f7ebb2ae000b40a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1409-L1414
   
   So I may have to think about this more


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1356903474

   Any update on this? Should we keep it open or are we pursuing other solutions?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ayushtkn commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
ayushtkn commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341771756

   Went through the discussion, One good thing is it has lastPartitionId, and it is used for next allocations, so that should prevent any old table breaking due to this change.
   
   I couldn't figure out from where 1000 came in, it looks like `just like that`, And I don't feel if sequential is required, it should specifically start from the 'startId' either 1K or 10K
   
   The reason for sequential isn't mentioned over there, I need to explore a bit more around that area, 
   but there was a test which was checking the allocation should start from startId not just sequential also(Why?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341722542

   And some in depth discussion : https://github.com/apache/iceberg/issues/280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #6369: Increase Partition Start Id to 10000

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #6369:
URL: https://github.com/apache/iceberg/pull/6369#issuecomment-1341718987

   Nice catch, didnt realize it would throw an exception if its not sequential.  Hm Im not 100% sure atm why we need to throw an exception in this case versus start id assignment from last assigned id, looks like it came from comment: https://github.com/apache/iceberg/pull/845#discussion_r406889223


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org