Posted to dev@iceberg.apache.org by Mayank Thirani <ma...@dremio.com> on 2022/03/29 01:11:32 UTC

Iceberg Partition by via Spark

Hi Team,

We are using Spark to create some sample tables for testing, to see how the
metadata files and folders look when we use "partition by". The documentation
we followed:
https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table--drop-partition-field

We created a table partitioned by one column (say, city) and can see the
metadata file and the per-city folders created in S3 using the command
below:
create table samplesMainTestPartitionCity partition by (city) as select *
from sampleTable limit 1000

[image: image.png]
*00000-fc....json* is the metadata file generated for the same:

Secondly, we tried to drop the partition field (city) using the below
command:
ALTER TABLE nessie.samplesMainTestPartitionCity DROP PARTITION FIELD city

We got a new metadata file for it (*00001-9d.....json* is the new one).
But we can still see the partition folders shown above. Our expectation was
that no such folders would remain.
So, after dropping the field, we tried to add a new partition field using the
command below:
ALTER TABLE nessie.samplesMainTestPartitionCity ADD PARTITION FIELD state

We got a new metadata file for it (*00002-26.....json*), but no new folders
were generated based on the state.
This looks incorrect to us. Can you please explain?


-- 
Thanks
-Mayank

Re: Iceberg Partition by via Spark

Posted by Russell Spitzer <ru...@gmail.com>.

Changing an Iceberg Partition Spec does not change existing files or the
layout of files within the table. What it does is tell Iceberg how to write
new files to the table and how they should be laid out.
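
To make that concrete, here is a minimal toy model in plain Python. It is a
sketch only: ToyTable, set_spec and append are hypothetical names invented for
illustration, not the Iceberg API, and the folder names are only meant to
resemble what you saw in S3. The point it shows is that changing the spec is a
metadata-only step, and only files written afterwards follow the new layout.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for an Iceberg table (hypothetical, not the real API)."""
    # spec_id -> tuple of partition columns; spec 0 is unpartitioned
    specs: list = field(default_factory=lambda: [()])
    current_spec_id: int = 0
    # (path, spec_id) for every data file ever written
    files: list = field(default_factory=list)

    def set_spec(self, *columns):
        # Metadata-only change: record a new spec, leave existing files alone.
        self.specs.append(tuple(columns))
        self.current_spec_id = len(self.specs) - 1

    def append(self, row, name):
        # New files are laid out according to the current spec only.
        cols = self.specs[self.current_spec_id]
        dirs = [f"{c}={row[c]}" for c in cols]
        path = "/".join(["data", *dirs, name])
        self.files.append((path, self.current_spec_id))
        return path


table = ToyTable()
table.set_spec("city")       # roughly: created with partition by (city)
table.append({"city": "Austin", "state": "TX"}, "a.parquet")

table.set_spec()             # roughly: DROP PARTITION FIELD city
table.set_spec("state")      # roughly: ADD PARTITION FIELD state
paths_after_alter = [p for p, _ in table.files]

new_path = table.append({"city": "Dallas", "state": "TX"}, "b.parquet")
```

In this toy model the city folder is still there after both ALTER steps, and a
state folder only appears once a new file is written.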

When evaluating predicates for scan planning, Iceberg always evaluates the
predicate against the partition spec that a particular file was written with.
For example, suppose you use the predicate x = 5, and one file is partitioned
by x while another file is not. The file partitioned by x can be filtered
using the predicate, while the file that was not partitioned by x is assumed
to possibly contain rows with x = 5 and is passed on to the next layer of
filtering.
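
As a rough illustration of that rule, the sketch below filters a list of files
in plain Python. It is a simplified model under my own assumptions: the
may_contain helper and the dictionary layout are made up for this example and
are not how Iceberg represents files internally. Each file is checked against
the spec it was written with, and a file whose spec does not cover the column
is always kept.

```python
def may_contain(file_spec, file_partition, column, value):
    """Return False only when the file's own partition data rules it out."""
    if column in file_spec:
        # The file was written partitioned by this column, so its partition
        # value tells us whether any row can match.
        return file_partition[column] == value
    # The file's spec says nothing about this column: keep the file and let
    # later filtering decide.
    return True


# Hypothetical files, each tagged with the spec it was written under.
files = [
    {"path": "x=5/a.parquet", "spec": ("x",), "partition": {"x": 5}},
    {"path": "x=7/b.parquet", "spec": ("x",), "partition": {"x": 7}},
    {"path": "c.parquet", "spec": (), "partition": {}},
]

# Plan a scan for the predicate x = 5.
planned = [
    f["path"]
    for f in files
    if may_contain(f["spec"], f["partition"], "x", 5)
]
```

Here the file written under x=7 is pruned, while the unpartitioned file stays
in the plan alongside the x=5 file.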

To move old files into the current spec you can use the
RewriteDataFilesAction.
