Posted to dev@iceberg.apache.org by filip <fi...@gmail.com> on 2019/05/28 08:27:11 UTC
Need help trying to figure out if the issue on multiple partition
specs on same field is a tracked issue or not
A while back I ran into what looks like either an inconsistency in the
partition spec API or simply an implementation bug.
When I tried to define multiple partition specs on the same schema field, I
found that although the API allows multiple partition specs to be defined
for the same field, internally this conflicts with the assumption that
there is only one partition spec per field.
Given this partition spec:
PartitionSpec spec = PartitionSpec.builderFor(schema)
    .withSpecId(0)
    .year("timestamp")
    .month("timestamp")
    .day("timestamp")
    .hour("timestamp")
    .build();
Trying to validate partition pruning with code similar to:
// Equality predicate on the partition source column.
UnboundPredicate<Object> match = Expressions.equal("timestamp",
    Literal.of("2019-01-11T00:00:00.000000").to(TimestampType.withoutZone()).value());
// Constructing the evaluator projects the predicate through the spec.
Assert.assertTrue(
    new InclusiveManifestEvaluator(spec, match)
        .eval(table.currentSnapshot().manifests().get(0)));
I get an unexpected Google Collections (Guava) exception:
java.lang.IllegalArgumentException: Multiple entries with same key: 1=org.apache.iceberg.PartitionField@da8cdda7 and 1=org.apache.iceberg.PartitionField@e5c6fddb
    at com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:215)
    at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:209)
    at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:147)
    at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:110)
    at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:393)
    at org.apache.iceberg.PartitionSpec.lazyFieldsBySourceId(PartitionSpec.java:232)
    at org.apache.iceberg.PartitionSpec.getFieldBySourceId(PartitionSpec.java:95)
    at org.apache.iceberg.expressions.Projections$InclusiveProjection.predicate(Projections.java:208)
    at org.apache.iceberg.expressions.Projections$InclusiveProjection.predicate(Projections.java:200)
    at org.apache.iceberg.expressions.Projections$BaseProjectionEvaluator.predicate(Projections.java:185)
    at org.apache.iceberg.expressions.Projections$BaseProjectionEvaluator.predicate(Projections.java:136)
    at org.apache.iceberg.expressions.ExpressionVisitors.visit(ExpressionVisitors.java:152)
    at org.apache.iceberg.expressions.Projections$BaseProjectionEvaluator.project(Projections.java:152)
    at org.apache.iceberg.expressions.InclusiveManifestEvaluator.<init>(InclusiveManifestEvaluator.java:63)
    at org.apache.iceberg.expressions.InclusiveManifestEvaluator.<init>(InclusiveManifestEvaluator.java:56)
    at org.apache.iceberg.TestScansAndSchemaEvolution.testMultiPartitionPerFieldTransform(TestScansAndSchemaEvolution.java:177)
I was wondering if this issue is tracked so maybe I could help out.
Thanks,
/Filip
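The trace in the message above shows the failure coming from a map keyed by source field id (PartitionSpec.lazyFieldsBySourceId) that is handed two fields with the same key. The following is a minimal plain-Java sketch of that failure mode and of one possible alternative, not Iceberg code and not necessarily what any fix does. The class and method names are hypothetical, and Collectors.toMap stands in for Guava's ImmutableMap builder (it throws IllegalStateException rather than IllegalArgumentException on a duplicate key).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class FieldsBySourceIdSketch {
  // Hypothetical stand-in for a partition field: a source column id plus a transform name.
  static final class Field {
    final int sourceId;
    final String transform;

    Field(int sourceId, String transform) {
      this.sourceId = sourceId;
      this.transform = transform;
    }
  }

  // One-value-per-key index: throws when two fields share a source id.
  static Map<Integer, Field> uniqueIndex(List<Field> fields) {
    return fields.stream().collect(Collectors.toMap(f -> f.sourceId, f -> f));
  }

  // Multi-value index: keeps every field for a source id, in spec order.
  static Map<Integer, List<Field>> groupedIndex(List<Field> fields) {
    return fields.stream().collect(Collectors.groupingBy(f -> f.sourceId));
  }

  public static void main(String[] args) {
    // Four transforms on the same source column (id 1), as in the spec above.
    List<Field> fields = Arrays.asList(
        new Field(1, "year"), new Field(1, "month"),
        new Field(1, "day"), new Field(1, "hour"));

    try {
      uniqueIndex(fields);
    } catch (IllegalStateException e) {
      System.out.println("unique index failed: " + e.getMessage());
    }
    System.out.println("fields for source id 1: " + groupedIndex(fields).get(1).size());
  }
}
```

A lookup by source id against the grouped index returns a list, so a caller that projects a predicate would have to handle each field for that column rather than assume a single one.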
Re: Need help trying to figure out if the issue on multiple partition
specs on same field is a tracked issue or not
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for offering! I was already working on it though. Here's the PR:
https://github.com/apache/incubator-iceberg/pull/203
On Mon, Jun 3, 2019 at 1:09 AM Filip <fi...@gmail.com> wrote:
> I could try to take a stab at fixing this given that you've pointed out
> very clearly the expected behavior in your previous explanation.
--
Ryan Blue
Software Engineer
Netflix
Re: Need help trying to figure out if the issue on multiple partition
specs on same field is a tracked issue or not
Posted by Filip <fi...@gmail.com>.
I could try to take a stab at fixing this given that you've pointed out
very clearly the expected behavior in your previous explanation.
On Thu, May 30, 2019 at 10:31 PM Ryan Blue <rb...@netflix.com.invalid>
wrote:
> Yeah, this is a bug. You should be able to define multiple partition
> functions on the same field. But we do want to check that multiple time
> partitions are not used because they are redundant. I'll open a PR. Thanks
> for pointing this out!
--
Filip Bocse
Re: Need help trying to figure out if the issue on multiple partition
specs on same field is a tracked issue or not
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Yeah, this is a bug. You should be able to define multiple partition
functions on the same field. But we do want to check that multiple time
partitions are not used because they are redundant. I'll open a PR. Thanks
for pointing this out!
On Tue, May 28, 2019 at 4:15 AM Anton Okolnychyi
<ao...@apple.com.invalid> wrote:
> Hm, this is actually a good question.
>
> My understanding is that we shouldn't explicitly define partitioning by
> year/month/day/hour on the same column. Instead, we should be fine with
> hour only. Iceberg produces ordinals for time-based partition functions. As
> far as I remember, Ryan was planning to submit a PR in order to prohibit
> multiple partition functions.
>
> I believe in the above case you are trying to create one partition spec
> with multiple partition functions on the same field.
>
> Keep in mind that if you partition by hour only, the directory structure
> won’t contain year/month/day folders. If you are to have that directory
> structure, you need to have actual columns for year/month/day in your
> dataset and use identity partition function.
>
> Thanks,
> Anton
--
Ryan Blue
Software Engineer
Netflix
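The rule described in the reply above (several partition functions on one field are fine, but more than one time-based function on the same field is redundant) could be checked roughly as follows. This is only a sketch under stated assumptions: the Field class, the transform names as plain strings, and firstRedundantTimeSource are all hypothetical and are not taken from Iceberg or from the PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RedundantTimePartitionCheck {
  // Transform names treated as time-based in this sketch (an assumption).
  private static final Set<String> TIME_TRANSFORMS =
      new HashSet<>(Arrays.asList("year", "month", "day", "hour"));

  // Hypothetical (source column, transform) pair; not an Iceberg class.
  static final class Field {
    final String source;
    final String transform;

    Field(String source, String transform) {
      this.source = source;
      this.transform = transform;
    }
  }

  // Returns the first source column with more than one time-based transform, or null if none.
  static String firstRedundantTimeSource(List<Field> fields) {
    Set<String> sourcesWithTimeTransform = new HashSet<>();
    for (Field field : fields) {
      boolean isTime = TIME_TRANSFORMS.contains(field.transform);
      // Set.add returns false when the source already had a time-based transform.
      if (isTime && !sourcesWithTimeTransform.add(field.source)) {
        return field.source;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<Field> redundant = Arrays.asList(
        new Field("timestamp", "year"), new Field("timestamp", "hour"));
    List<Field> allowed = Arrays.asList(
        new Field("timestamp", "bucket"), new Field("timestamp", "hour"));

    System.out.println("redundant spec: " + firstRedundantTimeSource(redundant));
    System.out.println("allowed spec: " + firstRedundantTimeSource(allowed));
  }
}
```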
Re: Need help trying to figure out if the issue on multiple partition
specs on same field is a tracked issue or not
Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.
Hm, this is actually a good question.
My understanding is that we shouldn't explicitly define partitioning by year/month/day/hour on the same column. Instead, we should be fine with hour only. Iceberg produces ordinals for time-based partition functions. As far as I remember, Ryan was planning to submit a PR in order to prohibit multiple partition functions.
I believe in the above case you are trying to create one partition spec with multiple partition functions on the same field.
Keep in mind that if you partition by hour only, the directory structure won’t contain year/month/day folders. If you want that directory structure, you need actual year/month/day columns in your dataset, partitioned with the identity partition function.
Thanks,
Anton
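To illustrate the point above about ordinals: assuming each time-based partition function yields a count of whole units since 1970-01-01T00:00 (as the Iceberg table spec describes), the hour ordinal alone already determines the day, month, and year. The sketch below uses only java.time; the class and method names are hypothetical, and it only covers timestamps at or after the epoch, where truncation and flooring agree.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

final class TimeOrdinalsSketch {
  private static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);

  // Whole units elapsed between the epoch and the timestamp.
  static long ordinal(ChronoUnit unit, LocalDateTime timestamp) {
    return unit.between(EPOCH, timestamp);
  }

  // Rebuilds the start of the hour from an hour ordinal.
  static LocalDateTime fromHourOrdinal(long hourOrdinal) {
    return EPOCH.plusHours(hourOrdinal);
  }

  public static void main(String[] args) {
    // The timestamp used in the predicate from the original message.
    LocalDateTime ts = LocalDateTime.parse("2019-01-11T00:00:00.000000");

    long hour = ordinal(ChronoUnit.HOURS, ts);
    System.out.println("hour ordinal:  " + hour);
    System.out.println("day ordinal:   " + ordinal(ChronoUnit.DAYS, ts));
    System.out.println("month ordinal: " + ordinal(ChronoUnit.MONTHS, ts));
    System.out.println("year ordinal:  " + ordinal(ChronoUnit.YEARS, ts));

    // The coarser ordinals can be recomputed from the hour ordinal alone.
    LocalDateTime recovered = fromHourOrdinal(hour);
    System.out.println("day from hour: " + ordinal(ChronoUnit.DAYS, recovered));
  }
}
```

For this timestamp the ordinals are 429768 (hour), 17907 (day), 588 (month), and 49 (year), and the day, month, and year values recomputed from the hour ordinal match.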