
Posted to issues@iceberg.apache.org by "mikulskibartosz (via GitHub)" <gi...@apache.org> on 2023/04/28 10:03:19 UTC

[GitHub] [iceberg] mikulskibartosz opened a new issue, #7457: PyIceberg doesn't support tables compacted with AWS Athena

mikulskibartosz opened a new issue, #7457:
URL: https://github.com/apache/iceberg/issues/7457

### Apache Iceberg version

1.1.0

### Query engine

Athena

### Please describe the bug 🐞

It's not possible to read an Iceberg table with PyIceberg if the data was written using PySpark and compacted with AWS Athena.

## Steps to reproduce

1. Create an Iceberg table:

```sql
CREATE TABLE IF NOT EXISTS table_name
(columns ...)
USING ICEBERG
PARTITIONED BY (date)
```

2. Write to the table using PySpark:

```python
# Convert the source DataFrame, sort by the partition column, and append to the table.
spark_df = self.spark_session.createDataFrame(df)
spark_df.sort(date_column).writeTo(table_name).append()
```

3. Read the table using PyIceberg:

```python
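from pyiceberg.catalog import load_glue
from pyiceberg.expressions import EqualTo

# Load the table through the AWS Glue catalog.
catalog = load_glue("default", {})
table = catalog.load_table('...')

# Scan only the rows matching the given date.
scan = table.scan(
    row_filter=EqualTo("date", date_as_string),
)
result = scan.to_arrow()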
```

The `result` variable contains correct data.

4. Compact the table files using the `OPTIMIZE` statement in AWS Athena (https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html):

```sql
OPTIMIZE table_name REWRITE DATA USING BIN_PACK WHERE date = 'date_as_string'
```

5. Optionally, `VACUUM` the table. This step does not change the behavior described below.

6. Query the table using the same PyIceberg code as in step 3.

7. `to_arrow` raises an exception: `ValueError: Iceberg schema is not embedded into the Parquet file, see https://github.com/apache/iceberg/issues/6505`

8. The table can still be accessed correctly in AWS Athena.
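
The error in step 7 suggests that the files rewritten by Athena lack the schema metadata PyIceberg looks for. Below is a minimal sketch for checking a single data file, assuming PyIceberg 0.3.0 looks for an `iceberg.schema` entry in the Parquet key-value metadata (that key name is my assumption and is not stated in the error message). Both function names are hypothetical helpers, not part of PyIceberg.

```python
from typing import Mapping, Optional

# Assumed key under which the Iceberg schema is stored in the Parquet
# key-value metadata; treat this as an assumption, not a documented contract.
ICEBERG_SCHEMA_KEY = b"iceberg.schema"


def has_embedded_iceberg_schema(metadata: Optional[Mapping[bytes, bytes]]) -> bool:
    """Return True if the given Parquet schema metadata contains the assumed key."""
    return metadata is not None and ICEBERG_SCHEMA_KEY in metadata


def parquet_has_iceberg_schema(path: str) -> bool:
    """Read only the schema of a local Parquet file and check its metadata."""
    # Imported lazily so the pure check above works without pyarrow installed.
    import pyarrow.parquet as pq

    # Schema.metadata is a dict of bytes keys/values, or None if absent.
    return has_embedded_iceberg_schema(pq.read_schema(path).metadata)
```

Running `parquet_has_iceberg_schema` on a data file downloaded before and after step 4 should show whether the compaction dropped the metadata, if the assumption about the key holds.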

## Expected behavior

The query in step 6 should succeed and return the same results as in step 3, instead of raising the exception shown in step 7.

## Dependency versions

### Writing data (step 2)

* pyarrow: 11.0.0
* pyspark: 3.3.1
* iceberg-spark-runtime-3.3_2.12-1.1.0.jar

### Reading data (steps 3 and 6)

```python
pyiceberg.__version__
'0.3.0'

pyarrow.__version__
'10.0.1'
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue closed issue #7457: PyIceberg doesn't support tables compacted with AWS Athena

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue closed issue #7457: PyIceberg doesn't support tables compacted with AWS Athena
URL: https://github.com/apache/iceberg/issues/7457



[GitHub] [iceberg] rdblue commented on issue #7457: PyIceberg doesn't support tables compacted with AWS Athena

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on issue #7457:
URL: https://github.com/apache/iceberg/issues/7457#issuecomment-1530631475

   Just merged #6505, which should address this.



[GitHub] [iceberg] Fokko commented on issue #7457: PyIceberg doesn't support tables compacted with AWS Athena

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on issue #7457:
URL: https://github.com/apache/iceberg/issues/7457#issuecomment-1527317639

   Thanks @mikulskibartosz for reporting this. Kudos for the comprehensive issue. This is a known issue that we're working on, and it will be fixed in the next release: https://github.com/apache/iceberg/issues/6647

