Posted to issues@iceberg.apache.org by "bigluck (via GitHub)" <gi...@apache.org> on 2023/05/30 09:59:44 UTC
[GitHub] [iceberg] bigluck opened a new issue, #7736: [PyIceberg] plan_files() fail on math-based partitioned table
bigluck opened a new issue, #7736:
URL: https://github.com/apache/iceberg/issues/7736
### Apache Iceberg version
main (development)
### Query engine
Other
### Please describe the bug 🐞
Ciao @Fokko; not sure if it's a bug, but I'm encountering strange behavior when trying to scan a partitioned table.
Dataset: taxi (full dataset)
Data catalog: glue
Table partitions: `request_datetime`, transform=`month`
This is my snippet:
```python
from datetime import timedelta, datetime, timezone
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual, LessThanOrEqual, And
catalog = load_catalog('default', type='glue')
table = catalog.load_table(('biglake', 'taxi_dremio_by_month'))
from_date = datetime(2021, 1, 1, 0, 0, 0, 0, tzinfo=timezone.utc)
to_date = from_date + timedelta(days=7)
scan = table.scan(
    row_filter=And(
        GreaterThanOrEqual('request_datetime', from_date.strftime('%Y-%m-%dT00:00:00.000+00:00')),
        LessThanOrEqual('request_datetime', to_date.strftime('%Y-%m-%dT00:00:00.000+00:00')),
    ),
    selected_fields=('request_datetime',),
)
files = [plan.file.file_path for plan in scan.plan_files()]
```
`scan.metadata.partitions_spec[0]` contains `{'name': 'request_datetime_month', 'transform': 'month', 'source-id': 4, 'field-id': 1000}` (it's the only partition), and this is the entire content of the scan object:
<img width="817" alt="Screenshot 2023-05-30 at 11 45 20" src="https://github.com/apache/iceberg/assets/1511095/f787af1f-5f2f-40a6-bc7f-6a01a0bae4ba">
The final value of the scan.row_filter variable is:
```python
And(left=GreaterThanOrEqual(term=Reference(name='request_datetime'), literal=literal('2021-01-01T00:00:00.000+00:00')), right=LessThanOrEqual(term=Reference(name='request_datetime'), literal=literal('2021-01-08T00:00:00.000+00:00')))
```
Once the code reaches the next statement (files = ...) it crashes with this error:
```
Traceback (most recent call last):
  File "/Users/bigluck/Desktop/duckbanch/run_pyiceberg2.py", line 121, in <module>
    res = run(
  File "/Users/bigluck/Desktop/duckbanch/run_pyiceberg2.py", line 96, in run
    files = [plan.file.file_path for plan in scan.plan_files()]
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 394, in plan_files
    *pool.starmap(
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 332, in _open_manifest
    return [FileScanTask(file) for file in matching_partition_data_files if metrics_evaluator(file)]
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 332, in <listcomp>
    return [FileScanTask(file) for file in matching_partition_data_files if metrics_evaluator(file)]
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 367, in <lambda>
    return lambda data_file: evaluator(data_file.partition)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 468, in eval
    return visit(self.bound, self)
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 177, in _
    left_result: T = visit(obj.left, visitor=visitor)
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 191, in _
    return visitor.visit_bound_predicate(predicate=obj)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 347, in visit_bound_predicate
    return visit_bound_predicate(predicate, self)
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 398, in _
    return visitor.visit_greater_than_or_equal(term=expr.term, literal=expr.literal)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 497, in visit_greater_than_or_equal
    return term.eval(self.struct) >= literal.value
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
```
I added a print at `pyiceberg/expressions/visitors.py`, line 347 (in `visit_bound_predicate`), and this is the content of the `predicate` variable:
```
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000, name='request_datetime_month', field_type=IntegerType(), required=False), accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000, name='request_datetime_month', field_type=IntegerType(), required=False), accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
[... the same two lines repeat 29 more times ...]
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000, name='request_datetime_month', field_type=IntegerType(), required=False), accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
```
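For context, the `LongLiteral(612)` in both predicates is consistent with the Iceberg `month` transform, which the table spec defines as the number of months since 1970-01. The sketch below is plain Python arithmetic, not PyIceberg code, and the helper name `months_since_epoch` is hypothetical. It suggests why both ends of the one-week range end up as the same literal:

```python
from datetime import datetime, timezone


def months_since_epoch(ts: datetime) -> int:
    """Months between 1970-01 and the month containing ts (hypothetical helper)."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


from_date = datetime(2021, 1, 1, tzinfo=timezone.utc)
to_date = datetime(2021, 1, 8, tzinfo=timezone.utc)

# Both bounds fall in January 2021, so they project to the same month ordinal.
print(months_since_epoch(from_date), months_since_epoch(to_date))  # 612 612
```

So the projected filter is effectively `request_datetime_month >= 612 AND request_datetime_month <= 612`, evaluated against each data file's partition value.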
It's unclear to me if it's a bug, a problem with the table itself or if I'm passing invalid values to the `row_filter` argument, but this SQL query (done using Athena) works:
```sql
SELECT DATE_TRUNC('day', "request_datetime"), COUNT(*) FROM "taxi_dremio_by_month"
WHERE "request_datetime" >= CAST('2021-01-01' AS DATE) AND "request_datetime" <= CAST('2021-01-08' AS DATE)
GROUP BY 1
ORDER BY 1
```
Can you help me? Thanks so much.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue closed issue #7736: [PyIceberg] plan_files() fail on math-based partitioned table
Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue closed issue #7736: [PyIceberg] plan_files() fail on math-based partitioned table
URL: https://github.com/apache/iceberg/issues/7736
[GitHub] [iceberg] Fokko commented on issue #7736: [PyIceberg] plan_files() fail on math-based partitioned table
Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #7736:
URL: https://github.com/apache/iceberg/issues/7736#issuecomment-1579086269
@bigluck Thanks for finding this! I'm able to reproduce it locally. A fix is inbound.
![image](https://github.com/apache/iceberg/assets/1134248/b9bb9156-2e94-4603-894c-08f72b0103c8)
[GitHub] [iceberg] bigluck commented on issue #7736: [PyIceberg] plan_files() fail on math-based partitioned table
Posted by "bigluck (via GitHub)" <gi...@apache.org>.
bigluck commented on issue #7736:
URL: https://github.com/apache/iceberg/issues/7736#issuecomment-1568239090
For reference, the table has 108,958 rows with `request_datetime=NULL` and 745,178,065 rows with a valid datetime.
It looks like the parser expects the partition to always have a valid value.
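The same `TypeError` can be reproduced without PyIceberg by comparing `None` with an `int`. Below is a minimal sketch of a null-tolerant comparison, assuming that a null partition value should simply not match a `>=` predicate. It is an illustration only, not the fix that went into PyIceberg, and `null_safe_gte` is a hypothetical name:

```python
from typing import Optional


def null_safe_gte(value: Optional[int], literal: int) -> bool:
    """Return False for a null partition value instead of raising TypeError."""
    return value is not None and value >= literal


# Partition values for three imaginary data files; the second one holds
# the rows where request_datetime is NULL, so its month value is None.
partition_values = [612, None, 613]

try:
    [v >= 612 for v in partition_values]
except TypeError as exc:
    print(f"naive comparison fails: {exc}")

print([null_safe_gte(v, 612) for v in partition_values])  # [True, False, True]
```

With the guard in place, the file holding the NULL partition is skipped for this predicate rather than aborting the whole scan.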