Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/03/21 21:31:34 UTC

[GitHub] [iceberg] Fokko commented on a diff in pull request #7163: Python: Add limit parameter to table scan

Fokko commented on code in PR #7163:
URL: https://github.com/apache/iceberg/pull/7163#discussion_r1143985117


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -484,21 +488,36 @@ def expression_to_pyarrow(expr: BooleanExpression) -> pc.Expression:
     return boolean_expression_visit(expr, _ConvertToArrowExpression())
 
 
+@lru_cache
+def _get_file_format(file_format: FileFormat, **kwargs: Dict[str, Any]) -> ds.FileFormat:
+    if file_format == FileFormat.PARQUET.value:
+        return ds.ParquetFileFormat(**kwargs)
+    elif file_format == FileFormat.ORC.value:

Review Comment:
   Let's remove the ORC branch from this PR. ORC support needs more work, so we can implement it separately in https://github.com/apache/iceberg/pull/7033.



##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -517,15 +536,22 @@ def _file_to_table(
         if file_schema is None:
             raise ValueError(f"Missing Iceberg schema in Metadata for file: {path}")
 
-        arrow_table = pq.read_table(
-            source=fout,
-            schema=parquet_schema,
-            pre_buffer=True,
-            buffer_size=8 * ONE_MEGABYTE,
-            filters=pyarrow_filter,
+        fragment_scanner = ds.Scanner.from_fragment(
+            fragment=fragment,
+            schema=physical_schema,
+            filter=pyarrow_filter,
             columns=[col.name for col in file_project_schema.columns],
         )
 
+        if limit:
+            arrow_table = fragment_scanner.head(limit)
+            with rows_counter.get_lock():

Review Comment:
   I think we can remove this lock: by the time we reach it, the expensive work (reading the fragment) is already done. Dropping it makes the code a bit simpler and avoids locking.
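
   One way to read the suggestion above, as a rough sketch rather than the PR's actual code: each task caps its own result at the limit, and the caller truncates once after combining, so no shared counter or lock is needed. The names `read_fragment` and `scan` are hypothetical, and plain lists stand in for Arrow tables so the snippet runs without pyarrow.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain
from typing import List, Optional


def read_fragment(rows: List[int], limit: Optional[int]) -> List[int]:
    # Hypothetical stand-in for reading one data file.
    # The expensive work happens here; each task caps its own output.
    # Slicing with limit=None returns all rows.
    return rows[:limit]


def scan(fragments: List[List[int]], limit: Optional[int] = None) -> List[int]:
    with ThreadPoolExecutor() as pool:
        # Executor.map() yields results in the order of the inputs.
        parts = list(pool.map(lambda rows: read_fragment(rows, limit), fragments))
    combined = list(chain.from_iterable(parts))
    # Truncate once in the calling thread: no shared state between tasks.
    return combined[:limit]
```

   The trade-off in this sketch is that every fragment is still read, even after enough rows have been collected; it only avoids the synchronization, not the extra reads.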



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

