Posted to dev@iceberg.apache.org by Vivekanand Vellanki <vi...@dremio.com> on 2021/04/07 08:41:14 UTC

Question on record_count field in the data-file entry of a manifest file

Hi,

We are in the process of converting Hive datasets to Iceberg datasets.

In this process, we noticed that each data-file entry in the manifest file
has a required record_count field.

Populating this accurately would require reading the footer/tail for
Parquet/ORC files. For AVRO files, it requires reading the block headers
for all blocks to determine the number of records in the AVRO file.
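
(For reference, the Avro object container format makes that block-header scan
cheap: each block header stores a record count and a serialized byte size, so
counting records is two varint decodes plus a seek per block, with no record
deserialization. A minimal pure-Python sketch; the helper names are ours for
illustration, not Iceberg code, and a real importer would use an Avro library.)

```python
# Sketch: count records in an Avro object container file by walking
# block headers only. Each header stores <record count, byte size>, so
# no record data is deserialized and compression codecs don't matter.
import io

MAGIC = b"Obj\x01"

def read_long(f):
    """Decode one zigzag-varint-encoded Avro long."""
    shift = accum = 0
    while True:
        b = f.read(1)
        if not b:
            raise EOFError("truncated varint")
        accum |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding

def skip_bytes_value(f):
    """Skip one length-prefixed string/bytes value."""
    f.read(read_long(f))

def count_avro_records(f):
    """Return the file's total record count from block headers alone."""
    if f.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    # Skip the file metadata map (avro.schema, avro.codec, ...).
    while True:
        n = read_long(f)
        if n == 0:
            break
        if n < 0:            # negative map-block count is followed by a size
            n = -n
            read_long(f)
        for _ in range(n):
            skip_bytes_value(f)   # key
            skip_bytes_value(f)   # value
    sync = f.read(16)
    total = 0
    while True:
        try:
            count = read_long(f)      # records in this block
        except EOFError:
            return total              # clean end of file
        size = read_long(f)           # serialized block size in bytes
        f.seek(size, io.SEEK_CUR)     # skip the payload entirely
        if f.read(16) != sync:
            raise ValueError("sync marker mismatch")
        total += count
```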

Is the record_count in the data-file entry expected to be accurate? Or can
we estimate it based on the file size and an estimated row size?

Thanks
Vivek

Re: Question on record_count field in the data-file entry of a manifest file

Posted by Russell Spitzer <ru...@gmail.com>.
I don't think anything actually uses record counts at the moment, but if you include them they should be correct. In general we allow any metric to also be empty, which is treated as "unknown". This looks like what we currently do with Avro.

When we import Avro files in Spark we skip doing any file analysis, using -1 for the record count and null for the missing metrics:
https://github.com/apache/iceberg/blob/f0a6b717dbf662caa9c762e72c47715a12625647/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L346-L358

With Parquet <https://github.com/apache/iceberg/blob/f0a6b717dbf662caa9c762e72c47715a12625647/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L378-L379> and ORC <https://github.com/apache/iceberg/blob/f0a6b717dbf662caa9c762e72c47715a12625647/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L413-L414>, on the other hand, we populate the metrics using the footer.

So the tl;dr: missing is OK, but inaccurate is not.

> On Apr 7, 2021, at 7:36 AM, Vivekanand Vellanki <vi...@dremio.com> wrote:
> 
> I understand the part about the file sizes. The recorded file size can be used to read the Parquet/ORC footers, assuming the size in the manifest is accurate.
> 
> My question is specific to record counts in these files. Are these expected to be accurate as well?
> 
> On Wed, Apr 7, 2021 at 5:53 PM <russell.spitzer@gmail.com <ma...@gmail.com>> wrote:
> Iceberg stores this information and other footer- and file-level details in manifests for just such a use case. The goal is always to read the files once and then save metrics and statistics in the manifest so they do not need to be read again. 
> 
> If the value is not accurate there is a bug in Iceberg (recently there was one of these with improperly recorded file sizes). 
> 
> I would suggest taking a look at the snapshot and migrate procedures, since we already have code for determining these values for existing files and Hive tables.
> 
> Sent from my iPhone
> 
> > On Apr 7, 2021, at 3:41 AM, Vivekanand Vellanki <vivek@dremio.com <ma...@dremio.com>> wrote:
> > 
> > 
> > Hi,
> > 
> > We are in the process of converting Hive datasets to Iceberg datasets.
> > 
> > In this process, we noticed that each data-file entry in the manifest file has a required record_count field.
> > 
> > Populating this accurately would require reading the footer/tail for Parquet/ORC files. For AVRO files, it requires reading the block headers for all blocks to determine the number of records in the AVRO file.
> > 
> > Is the record_count in the data-file entry expected to be accurate? Or can we estimate it based on the file size and an estimated row size?
> > 
> > Thanks
> > Vivek
> > 


Re: Question on record_count field in the data-file entry of a manifest file

Posted by Vivekanand Vellanki <vi...@dremio.com>.
I understand the part about the file sizes. The recorded file size can be
used to read the Parquet/ORC footers, assuming the size in the manifest is
accurate.

My question is specific to record counts in these files. Are these expected
to be accurate as well?

On Wed, Apr 7, 2021 at 5:53 PM <ru...@gmail.com> wrote:

> Iceberg stores this information and other footer- and file-level details in
> manifests for just such a use case. The goal is always to read the files
> once and then save metrics and statistics in the manifest so they do not
> need to be read again.
>
> If the value is not accurate there is a bug in Iceberg (recently there was
> one of these with improperly recorded file sizes).
>
> I would suggest taking a look at the snapshot and migrate procedures since
> we already have code for determining these values for existing files and
> Hive tables.
>
> Sent from my iPhone
>
> > On Apr 7, 2021, at 3:41 AM, Vivekanand Vellanki <vi...@dremio.com>
> wrote:
> >
> > 
> > Hi,
> >
> > We are in the process of converting Hive datasets to Iceberg datasets.
> >
> > In this process, we noticed that each data-file entry in the manifest
> file has a required record_count field.
> >
> > Populating this accurately would require reading the footer/tail for
> Parquet/ORC files. For AVRO files, it requires reading the block headers
> for all blocks to determine the number of records in the AVRO file.
> >
> > Is the record_count in the data-file entry expected to be accurate? Or
> can we estimate it based on the file size and an estimated row size?
> >
> > Thanks
> > Vivek
> >
>

Re: Question on record_count field in the data-file entry of a manifest file

Posted by ru...@gmail.com.
Iceberg stores this information and other footer- and file-level details in manifests for just such a use case. The goal is always to read the files once and then save metrics and statistics in the manifest so they do not need to be read again. 

If the value is not accurate there is a bug in Iceberg (recently there was one of these with improperly recorded file sizes). 

I would suggest taking a look at the snapshot and migrate procedures, since we already have code for determining these values for existing files and Hive tables.
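
(Those procedures ship with Iceberg's Spark integration as of 0.11. A
hypothetical invocation; the catalog name "demo" and the table names are
placeholders, and this assumes Spark is configured with an Iceberg catalog.)

```shell
# snapshot: create an Iceberg table from a Hive table, leaving the source intact
spark-sql -e "CALL demo.system.snapshot('db.hive_src', 'db.iceberg_copy')"

# migrate: replace the Hive table in place with an Iceberg table
spark-sql -e "CALL demo.system.migrate('db.hive_src')"
```

Both procedures compute per-file metrics (including record counts) from the
existing data files while building the new table's manifests.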

Sent from my iPhone

> On Apr 7, 2021, at 3:41 AM, Vivekanand Vellanki <vi...@dremio.com> wrote:
> 
> 
> Hi,
> 
> We are in the process of converting Hive datasets to Iceberg datasets.
> 
> In this process, we noticed that each data-file entry in the manifest file has a required record_count field.
> 
> Populating this accurately would require reading the footer/tail for Parquet/ORC files. For AVRO files, it requires reading the block headers for all blocks to determine the number of records in the AVRO file.
> 
> Is the record_count in the data-file entry expected to be accurate? Or can we estimate it based on the file size and an estimated row size?
> 
> Thanks
> Vivek
>