Posted to dev@iceberg.apache.org by Luis Otero <lo...@gmail.com> on 2020/03/12 17:09:34 UTC

AvroFileAppender metrics

Hi,

AvroFileAppender doesn't report min/max values (
https://github.com/apache/incubator-iceberg/blob/80cbc60ee55911ee627a7ad3013804394d7b5e9a/core/src/main/java/org/apache/iceberg/avro/AvroFileAppender.java#L60
).

As a side effect (I think), overwrite operations fail with "Cannot delete file
where some, but not all, rows match filter" when other data files share the
same partition, because StrictMetricsEvaluator can't confirm that all of their
rows match the filter.

For instance, if you modify TestLocalScan with:

    // bucket the table by id so several files can land in the same partition
    this.partitionSpec = PartitionSpec.builderFor(SCHEMA).bucket("id", 10).build();

    this.file1Records = new ArrayList<Record>();
    file1Records.add(record.copy(ImmutableMap.of("id", 60L, "data", UUID.randomUUID().toString())));
    DataFile file1 = writeFile(sharedTable.location(), format.addExtension("file-1"), file1Records);

    // file2 and file3 each hold a single row with id = 1, so they fall in the same bucket
    this.file2Records = new ArrayList<Record>();
    file2Records.add(record.copy(ImmutableMap.of("id", 1L, "data", UUID.randomUUID().toString())));
    DataFile file2 = writeFile(sharedTable.location(), format.addExtension("file-2"), file2Records);

    this.file3Records = new ArrayList<Record>();
    file3Records.add(record.copy(ImmutableMap.of("id", 1L, "data", UUID.randomUUID().toString())));
    DataFile file3 = writeFile(sharedTable.location(), format.addExtension("file-3"), file3Records);

    sharedTable.newAppend()
        .appendFile(file1)
        .commit();

    sharedTable.newAppend()
        .appendFile(file2)
        .commit();

    // replace the rows matching id == 1 (i.e. file2) with file3
    sharedTable.newOverwrite()
        .overwriteByRowFilter(equal("id", 1L))
        .addFile(file3)
        .commit();


With the AVRO format this fails with
'org.apache.iceberg.exceptions.ValidationException: Cannot delete file where
some, but not all, rows match filter ref(name="id") == 1:
file:/AVRO/file-2.avro', but it works fine with the PARQUET format.

Am I missing something here?

Thanks!!

Re: AvroFileAppender metrics

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Yeah, I would probably ignore the column size metric. That's really more
for columnar formats, where we could use it to estimate how much data from
a row group is being projected. For Avro, we'd have to read the same amount
either way.

For this, I'd probably create an appender that wraps another appender and
accumulates metrics before handing rows to the real writer. That way we can
use it for any format that doesn't keep its own metrics.
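
To make that concrete, a rough, hypothetical sketch of such a wrapper could
look like the following. It assumes the FileAppender<D> interface
(add/metrics/length/close) and the Metrics class from core, tracks only a
single long-valued column for illustration, and simplifies the Metrics
constructor, so it is a sketch of the approach rather than the real
implementation:

    import java.io.IOException;
    import java.util.function.Function;
    import org.apache.iceberg.Metrics;
    import org.apache.iceberg.io.FileAppender;

    // Hypothetical sketch only: delegates all writes to the format-specific
    // appender and accumulates stats on the way through.
    class MetricsTrackingAppender<D> implements FileAppender<D> {
      private final FileAppender<D> delegate;
      private final Function<D, Long> idExtractor;  // example: pull out the "id" column
      private long rowCount = 0L;
      private Long minId = null;
      private Long maxId = null;

      MetricsTrackingAppender(FileAppender<D> delegate, Function<D, Long> idExtractor) {
        this.delegate = delegate;
        this.idExtractor = idExtractor;
      }

      @Override
      public void add(D datum) {
        rowCount += 1;
        Long id = idExtractor.apply(datum);
        if (id != null) {
          minId = (minId == null || id < minId) ? id : minId;
          maxId = (maxId == null || id > maxId) ? id : maxId;
        }
        delegate.add(datum);  // the real writer still does the I/O
      }

      @Override
      public Metrics metrics() {
        // A real implementation would key the bounds by field id and serialize
        // them (e.g. with Conversions.toByteBuffer); this only shows where the
        // accumulated values would surface. Constructor arguments are simplified.
        return new Metrics(rowCount, null, null, null);
      }

      @Override
      public long length() {
        return delegate.length();
      }

      @Override
      public void close() throws IOException {
        delegate.close();
      }
    }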

-- 
Ryan Blue
Software Engineer
Netflix

Re: AvroFileAppender metrics

Posted by Luis Otero <lo...@gmail.com>.
Feedback/guidance request:

Byte size info in Avro is encapsulated in the encoder
(org.apache.avro.io.BufferedBinaryEncoder) and is not exposed by the Avro API.

Should we carry on with the task, ignoring that metric and gathering as much
info as we can inside Iceberg? Or is it feasible to get Avro modified to
expose that info?

Thanks,
L.

Re: AvroFileAppender metrics

Posted by Luis Otero <lo...@gmail.com>.
Hi Ryan,

I'll give it a try.

Regards,
L.

Re: AvroFileAppender metrics

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Luis,

You're right about what's happening. Because the Avro appender doesn't
track column-level stats, Iceberg can't determine that the file only
contains matching data rows and can be deleted. Parquet does keep those
stats, so even though the partitioning doesn't guarantee the delete is
safe, Iceberg can determine that it is.
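
For illustration, here is roughly the kind of column-level stats Parquet
already reports for file-2 above (a single row with id = 1), expressed as
Iceberg Metrics. The field id, the bounds-carrying Metrics constructor, and
the use of Conversions.toByteBuffer are assumptions made for this sketch:

    // Classes assumed from core: org.apache.iceberg.Metrics,
    // org.apache.iceberg.types.Conversions, org.apache.iceberg.types.Types,
    // plus Guava's ImmutableMap. Field id 1 is assumed to be the "id" column.
    Map<Integer, ByteBuffer> lowerBounds =
        ImmutableMap.of(1, Conversions.toByteBuffer(Types.LongType.get(), 1L));
    Map<Integer, ByteBuffer> upperBounds =
        ImmutableMap.of(1, Conversions.toByteBuffer(Types.LongType.get(), 1L));

    // rowCount = 1, valueCounts = {id: 1}, nullValueCounts = {id: 0}; column
    // sizes omitted. With lower == upper == 1 and no nulls for "id",
    // StrictMetricsEvaluator can confirm that ref("id") == 1 matches every row,
    // so the overwrite is allowed to delete the file.
    Metrics file2Metrics = new Metrics(
        1L, null, ImmutableMap.of(1, 1L), ImmutableMap.of(1, 0L),
        lowerBounds, upperBounds);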

The solution is to add column-level stats for Avro files. Is that something
you're interested in working on?

rb

-- 
Ryan Blue
Software Engineer
Netflix