Posted to issues@iceberg.apache.org by "bigluck (via GitHub)" <gi...@apache.org> on 2023/05/30 09:59:44 UTC

[GitHub] [iceberg] bigluck opened a new issue, #7736: [PyIceberg] plan_files() fail on math-based partitioned table

bigluck opened a new issue, #7736:
URL: https://github.com/apache/iceberg/issues/7736

   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   Other
   
   ### Please describe the bug 🐞
   
   Ciao @Fokko; not sure if it's a bug, but I'm encountering strange behavior when trying to scan a partitioned table.
   
   Dataset: taxi (full dataset)
   Data catalog: glue
   Table partitions: `request_datetime`, transform=`month`
   
   
   This is my snippet:
   ```python
   from datetime import timedelta, datetime, timezone
   
   from pyiceberg.catalog import load_catalog
   from pyiceberg.expressions import GreaterThanOrEqual, LessThanOrEqual, And
   
   
   catalog = load_catalog('default', type='glue')
   table = catalog.load_table(('biglake', 'taxi_dremio_by_month'))
   
   from_date = datetime(2021, 1, 1, 0, 0, 0, 0, tzinfo=timezone.utc)
   to_date = from_date + timedelta(days=7)
   
   scan = table.scan(
       row_filter=And(
           GreaterThanOrEqual('request_datetime', from_date.strftime('%Y-%m-%dT00:00:00.000+00:00')),
           LessThanOrEqual('request_datetime', to_date.strftime('%Y-%m-%dT00:00:00.000+00:00')),
       ),
       selected_fields=('request_datetime',),
   )
   
   files = [plan.file.file_path for plan in scan.plan_files()]
   ```
   
   `scan.metadata.partitions_spec[0]` contains `{'name': 'request_datetime_month', 'transform': 'month', 'source-id': 4, 'field-id': 1000}` (it's the only partition), and this is the entire content of the scan object:
   
   <img width="817" alt="Screenshot 2023-05-30 at 11 45 20" src="https://github.com/apache/iceberg/assets/1511095/f787af1f-5f2f-40a6-bc7f-6a01a0bae4ba">
   
   The final value of the scan.row_filter variable is:
   
   ```python
   And(left=GreaterThanOrEqual(term=Reference(name='request_datetime'), literal=literal('2021-01-01T00:00:00.000+00:00')), right=LessThanOrEqual(term=Reference(name='request_datetime'), literal=literal('2021-01-08T00:00:00.000+00:00')))
   ```
   
   Once the code reaches the next statement (files = ...) it crashes with this error:
   
   ```
   Traceback (most recent call last):
     File "/Users/bigluck/Desktop/duckbanch/run_pyiceberg2.py", line 121, in <module>
       res = run(
     File "/Users/bigluck/Desktop/duckbanch/run_pyiceberg2.py", line 96, in run
       files = [plan.file.file_path for plan in scan.plan_files()]
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 394, in plan_files
       *pool.starmap(
     File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 375, in starmap
       return self._map_async(func, iterable, starmapstar, chunksize).get()
     File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 774, in get
       raise self._value
     File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 125, in worker
       result = (True, func(*args, **kwds))
     File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
       return list(itertools.starmap(args[0], args[1]))
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 332, in _open_manifest
       return [FileScanTask(file) for file in matching_partition_data_files if metrics_evaluator(file)]
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 332, in <listcomp>
       return [FileScanTask(file) for file in matching_partition_data_files if metrics_evaluator(file)]
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 367, in <lambda>
       return lambda data_file: evaluator(data_file.partition)
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 468, in eval
       return visit(self.bound, self)
     File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/functools.py", line 889, in wrapper
       return dispatch(args[0].__class__)(*args, **kw)
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 177, in _
       left_result: T = visit(obj.left, visitor=visitor)
     File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/functools.py", line 889, in wrapper
       return dispatch(args[0].__class__)(*args, **kw)
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 191, in _
       return visitor.visit_bound_predicate(predicate=obj)
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 347, in visit_bound_predicate
       return visit_bound_predicate(predicate, self)
     File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/functools.py", line 889, in wrapper
       return dispatch(args[0].__class__)(*args, **kw)
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 398, in _
       return visitor.visit_greater_than_or_equal(term=expr.term, literal=expr.literal)
     File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 497, in visit_greater_than_or_equal
       return term.eval(self.struct) >= literal.value
   TypeError: '>=' not supported between instances of 'NoneType' and 'int'
   ```
   
   I've added a print at the `File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 347, in visit_bound_predicate` line, and this is the content of the `predicate` variable on each call:
   
   ```
   BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000, name='request_datetime_month', field_type=IntegerType(), required=False), accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
   BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000, name='request_datetime_month', field_type=IntegerType(), required=False), accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
   ... (the two lines above repeat 29 more times)
   BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000, name='request_datetime_month', field_type=IntegerType(), required=False), accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
   ```
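
Both bounds in the output above collapse to `LongLiteral(612)`. That matches the Iceberg `month` transform, which stores the number of months since 1970-01. The following is a minimal sketch of that arithmetic, not the PyIceberg implementation; `months_since_epoch` is a hypothetical helper name used only for illustration.

```python
from datetime import datetime, timedelta, timezone


def months_since_epoch(ts: datetime) -> int:
    """Months elapsed between 1970-01 and the month containing `ts`."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


# Same bounds as the snippet in the issue.
from_date = datetime(2021, 1, 1, 0, 0, 0, 0, tzinfo=timezone.utc)
to_date = from_date + timedelta(days=7)

# Both timestamps fall in January 2021, so they project to the same partition value.
print(months_since_epoch(from_date))  # 612
print(months_since_epoch(to_date))    # 612
```

Because both dates land in the same month, the projected partition filter becomes `request_datetime_month >= 612 AND request_datetime_month <= 612`.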
   
   It's unclear to me whether this is a bug, a problem with the table itself, or invalid values passed to the `row_filter` argument, but this SQL query (run with Athena) works:
   
   ```sql
   SELECT DATE_TRUNC('day', "request_datetime"), COUNT(*) FROM "taxi_dremio_by_month"
   WHERE "request_datetime" >= CAST('2021-01-01' AS DATE) AND "request_datetime" <= CAST('2021-01-08' AS DATE)
   GROUP BY 1
   ORDER BY 1
   ```
   
   Can you help me? Thanks so much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed issue #7736: [PyIceberg] plan_files() fail on math-based partitioned table

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue closed issue #7736: [PyIceberg] plan_files() fail on math-based partitioned table
URL: https://github.com/apache/iceberg/issues/7736




[GitHub] [iceberg] Fokko commented on issue #7736: [PyIceberg] plan_files() fail on math-based partitioned table

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #7736:
URL: https://github.com/apache/iceberg/issues/7736#issuecomment-1579086269

   @bigluck Thanks for finding this! I'm able to reproduce it locally. A fix is inbound.
   
   ![image](https://github.com/apache/iceberg/assets/1134248/b9bb9156-2e94-4603-894c-08f72b0103c8)
   




[GitHub] [iceberg] bigluck commented on issue #7736: [PyIceberg] plan_files() fail on math-based partitioned table

Posted by "bigluck (via GitHub)" <gi...@apache.org>.
bigluck commented on issue #7736:
URL: https://github.com/apache/iceberg/issues/7736#issuecomment-1568239090

   For reference, the table has 108,958 rows with `request_datetime=NULL`, and 745,178,065 rows with a valid datetime.
   It looks like the evaluator expects the partition value to always be non-null, so the comparison fails on the partition that holds the NULL rows.
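
The failure can be illustrated outside PyIceberg. This is a minimal sketch, assuming the NULL rows live in a partition whose `request_datetime_month` value is `None`; `naive_gte` and `null_safe_gte` are hypothetical names, and this is not the fix that was applied upstream.

```python
from typing import Optional


def naive_gte(partition_value: Optional[int], literal: int) -> bool:
    # Same shape as `term.eval(self.struct) >= literal.value` in the traceback.
    return partition_value >= literal


def null_safe_gte(partition_value: Optional[int], literal: int) -> bool:
    # Assumption: a null partition value cannot satisfy a comparison predicate,
    # so the partition is treated as non-matching instead of raising.
    return partition_value is not None and partition_value >= literal


try:
    naive_gte(None, 612)
except TypeError as exc:
    # '>=' not supported between instances of 'NoneType' and 'int'
    print(exc)

print(null_safe_gte(None, 612))  # False
print(null_safe_gte(612, 612))   # True
```

With a guard like this, the partition holding the NULL rows would simply be skipped for a range filter, which is consistent with the Athena query returning results.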

