Posted to dev@iceberg.apache.org by filip <fi...@gmail.com> on 2019/05/28 08:27:11 UTC

Need help trying to figure out if the issue on multiple partition specs on same field is a tracked issue or not

A while back I ran into what seems to be an inconsistency in the partition
spec API, or maybe it's just an implementation bug. When I tried to define
multiple partition specs on the same schema field, I found that the API
allows it, but internally this conflicts with the assumption that there is
only one partition spec per field.

Given this partition spec:

PartitionSpec spec = PartitionSpec.builderFor(schema)
            .withSpecId(0)
            .year("timestamp")
            .month("timestamp")
            .day("timestamp")
            .hour("timestamp")
            .build();

I then tried to validate partition pruning with code similar to:

UnboundPredicate<Object> match = Expressions.equal("timestamp",
            Literal.of("2019-01-11T00:00:00.000000").to(TimestampType.withoutZone()).value());
Assert.assertTrue(
            new InclusiveManifestEvaluator(spec, match)
                        .eval(table.currentSnapshot().manifests().get(0)));

I get an unexpected Google Collections (Guava) exception:

java.lang.IllegalArgumentException: Multiple entries with same key: 1=org.apache.iceberg.PartitionField@da8cdda7 and 1=org.apache.iceberg.PartitionField@e5c6fddb

    at com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:215)
    at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:209)
    at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:147)
    at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:110)
    at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:393)
    at org.apache.iceberg.PartitionSpec.lazyFieldsBySourceId(PartitionSpec.java:232)
    at org.apache.iceberg.PartitionSpec.getFieldBySourceId(PartitionSpec.java:95)
    at org.apache.iceberg.expressions.Projections$InclusiveProjection.predicate(Projections.java:208)
    at org.apache.iceberg.expressions.Projections$InclusiveProjection.predicate(Projections.java:200)
    at org.apache.iceberg.expressions.Projections$BaseProjectionEvaluator.predicate(Projections.java:185)
    at org.apache.iceberg.expressions.Projections$BaseProjectionEvaluator.predicate(Projections.java:136)
    at org.apache.iceberg.expressions.ExpressionVisitors.visit(ExpressionVisitors.java:152)
    at org.apache.iceberg.expressions.Projections$BaseProjectionEvaluator.project(Projections.java:152)
    at org.apache.iceberg.expressions.InclusiveManifestEvaluator.<init>(InclusiveManifestEvaluator.java:63)
    at org.apache.iceberg.expressions.InclusiveManifestEvaluator.<init>(InclusiveManifestEvaluator.java:56)
    at org.apache.iceberg.TestScansAndSchemaEvolution.testMultiPartitionPerFieldTransform(TestScansAndSchemaEvolution.java:177)
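
The failure mode in the trace above can probably be reproduced without Iceberg or Guava. The following is only a sketch under assumptions, not the Iceberg code: Field, uniqueIndex, multiIndex and sampleFields are hypothetical names, and the JDK's Collectors.toMap stands in for ImmutableMap.Builder (it throws IllegalStateException on a duplicate key, where Guava throws IllegalArgumentException). It shows why an index holding one entry per source id breaks once four fields share source id 1, and how a list-valued index would avoid that.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FieldsBySourceId {
    // Hypothetical stand-in for a partition field: a source column id plus a transform name.
    static final class Field {
        final int sourceId;
        final String transform;

        Field(int sourceId, String transform) {
            this.sourceId = sourceId;
            this.transform = transform;
        }
    }

    // Four transforms on the same source column (id 1), mirroring the spec in this message.
    static List<Field> sampleFields() {
        return Arrays.asList(
            new Field(1, "year"), new Field(1, "month"),
            new Field(1, "day"), new Field(1, "hour"));
    }

    // Assumes one field per source id; toMap throws IllegalStateException on a duplicate key.
    static Map<Integer, Field> uniqueIndex(List<Field> fields) {
        return fields.stream().collect(Collectors.toMap(f -> f.sourceId, f -> f));
    }

    // Allows several fields per source id by collecting them into a list.
    static Map<Integer, List<Field>> multiIndex(List<Field> fields) {
        return fields.stream().collect(Collectors.groupingBy(f -> f.sourceId));
    }

    public static void main(String[] args) {
        List<Field> fields = sampleFields();
        try {
            uniqueIndex(fields);
        } catch (IllegalStateException e) {
            System.out.println("unique index failed: " + e.getMessage());
        }
        System.out.println("fields for source id 1: " + multiIndex(fields).get(1).size());
    }
}
```

Whether a list-valued lookup is the right fix depends on what callers such as the projection code expect to get back for a source id, which this sketch does not model.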


I was wondering if this issue is tracked so maybe I could help out.

Thanks,
/Filip

Re: Need help trying to figure out if the issue on multiple partition specs on same field is a tracked issue or not

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for offering! I was already working on it though. Here's the PR:
https://github.com/apache/incubator-iceberg/pull/203

On Mon, Jun 3, 2019 at 1:09 AM Filip <fi...@gmail.com> wrote:

> I could try to take a stab at fixing this given that you've pointed out
> very clearly the expected behavior in your previous explanation.


-- 
Ryan Blue
Software Engineer
Netflix

Re: Need help trying to figure out if the issue on multiple partition specs on same field is a tracked issue or not

Posted by Filip <fi...@gmail.com>.
I could try to take a stab at fixing this given that you've pointed out
very clearly the expected behavior in your previous explanation.

On Thu, May 30, 2019 at 10:31 PM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Yeah, this is a bug. You should be able to define multiple partition
> functions on the same field. But we do want to check that multiple time
> partitions are not used because they are redundant. I'll open a PR. Thanks
> for pointing this out!


-- 
Filip Bocse

Re: Need help trying to figure out if the issue on multiple partition specs on same field is a tracked issue or not

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Yeah, this is a bug. You should be able to define multiple partition
functions on the same field. But we do want to check that multiple time
partitions are not used because they are redundant. I'll open a PR. Thanks
for pointing this out!
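
A rough sketch of the behavior described above, for illustration only: SpecBuilderSketch and its methods are hypothetical and are not the Iceberg builder or the code in the PR. It assumes the rule is "any number of partition functions per source column, but at most one time-based function per column" and rejects a second time-based function when it is added.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpecBuilderSketch {
    // Partition fields in the order they were added, rendered as "transform(column)".
    private final List<String> fields = new ArrayList<>();
    // Source column name -> time-based transform already applied to it.
    private final Map<String, String> timeTransforms = new HashMap<>();

    // Adds a time-based transform, rejecting a second one on the same column as redundant.
    private SpecBuilderSketch addTime(String transform, String column) {
        String existing = timeTransforms.putIfAbsent(column, transform);
        if (existing != null) {
            throw new IllegalArgumentException(
                "Cannot add redundant partition " + transform + "(" + column + "): "
                    + existing + "(" + column + ") is already defined");
        }
        fields.add(transform + "(" + column + ")");
        return this;
    }

    public SpecBuilderSketch year(String column) { return addTime("year", column); }
    public SpecBuilderSketch month(String column) { return addTime("month", column); }
    public SpecBuilderSketch day(String column) { return addTime("day", column); }
    public SpecBuilderSketch hour(String column) { return addTime("hour", column); }

    // Non-time transforms are not restricted, so they may share a column with a time transform.
    public SpecBuilderSketch bucket(String column, int numBuckets) {
        fields.add("bucket[" + numBuckets + "](" + column + ")");
        return this;
    }

    public List<String> build() {
        return new ArrayList<>(fields);
    }

    public static void main(String[] args) {
        // Allowed: two different kinds of partition function on the same column.
        System.out.println(new SpecBuilderSketch().bucket("timestamp", 16).hour("timestamp").build());
        // Rejected: two time-based functions on the same column.
        try {
            new SpecBuilderSketch().day("timestamp").hour("timestamp");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```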

On Tue, May 28, 2019 at 4:15 AM Anton Okolnychyi
<ao...@apple.com.invalid> wrote:

> Hm, this is actually a good question.
>
> My understanding is that we shouldn't explicitly define partitioning by
> year/month/day/hour on the same column. Instead, we should be fine with
> hour only. Iceberg produces ordinals for time-based partition functions. As
> far as I remember, Ryan was planning to submit a PR in order to prohibit
> multiple partition functions.
>
> I believe in the above case you are trying to create one partition spec
> with multiple partition functions on the same field.
>
> Keep in mind that if you partition by hour only, the directory structure
> won’t contain year/month/day folders. If you are to have that directory
> structure, you need to have actual columns for year/month/day in your
> dataset and use identity partition function.
>
> Thanks,
> Anton

-- 
Ryan Blue
Software Engineer
Netflix

Re: Need help trying to figure out if the issue on multiple partition specs on same field is a tracked issue or not

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.
Hm, this is actually a good question.

My understanding is that we shouldn't explicitly define partitioning by year/month/day/hour on the same column; partitioning by hour alone should be enough. Iceberg produces ordinals for time-based partition functions, so the hour value already determines the day, month, and year. As far as I remember, Ryan was planning to submit a PR to prohibit multiple partition functions.
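
To make the ordinal point concrete, here is a small sketch that uses only java.time rather than Iceberg's transform classes. It assumes the ordinals are counts of whole years, months, days, and hours since 1970-01-01T00:00, and the class and method names (TimeOrdinals, fromHours, and so on) are made up for illustration. The point is that the coarser ordinals can be recomputed from the hour ordinal alone.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class TimeOrdinals {
    private static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);

    // ChronoUnit.between truncates toward zero, so this sketch is only meant
    // for timestamps at or after the epoch.
    static long years(LocalDateTime ts) { return ChronoUnit.YEARS.between(EPOCH, ts); }
    static long months(LocalDateTime ts) { return ChronoUnit.MONTHS.between(EPOCH, ts); }
    static long days(LocalDateTime ts) { return ChronoUnit.DAYS.between(EPOCH, ts); }
    static long hours(LocalDateTime ts) { return ChronoUnit.HOURS.between(EPOCH, ts); }

    // Recover the timestamp, truncated to the hour, from an hour ordinal alone.
    static LocalDateTime fromHours(long hourOrdinal) {
        return EPOCH.plusHours(hourOrdinal);
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.parse("2019-01-11T00:00:00");
        long hour = hours(ts);
        LocalDateTime truncated = fromHours(hour);

        System.out.println("hour ordinal:  " + hour);
        // Each coarser ordinal computed from the hour ordinal matches the one
        // computed from the original timestamp.
        System.out.println("day ordinal:   " + days(truncated) + " vs " + days(ts));
        System.out.println("month ordinal: " + months(truncated) + " vs " + months(ts));
        System.out.println("year ordinal:  " + years(truncated) + " vs " + years(ts));
    }
}
```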

I believe in the above case you are trying to create one partition spec with multiple partition functions on the same field.

Keep in mind that if you partition by hour only, the directory structure won’t contain year/month/day folders. If you want that directory structure, you need actual year/month/day columns in your dataset, partitioned with the identity partition function.

Thanks,
Anton

