Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/09 22:10:12 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #6398: Python: Integration tests

Fokko opened a new pull request, #6398:
URL: https://github.com/apache/iceberg/pull/6398

   This is the first version of a framework to read Iceberg tables, produced by Spark, using PyIceberg. This makes it easier to run end-to-end tests and also validate the behavior of PyArrow and DuckDB.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6398: Python: Integration tests

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#issuecomment-1470624741

   Thanks for the review, @rdblue. We can add more tests later on.




[GitHub] [iceberg] rdblue commented on a diff in pull request #6398: Python: Integration tests

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1136192947


##########
python/tests/test_integration.py:
##########
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint:disable=redefined-outer-name
+
+import math
+
+import pytest
+
+from pyiceberg.catalog import Catalog, load_catalog
+from pyiceberg.expressions import IsNaN, NotNaN
+from pyiceberg.table import Table
+
+
+@pytest.fixture()
+def catalog() -> Catalog:
+    return load_catalog(
+        "local",
+        **{
+            "type": "rest",
+            "uri": "http://localhost:8181",
+            "s3.endpoint": "http://localhost:9000",
+            "s3.access-key-id": "admin",
+            "s3.secret-access-key": "password",
+        },
+    )
+
+
+@pytest.fixture()
+def table_test_null_nan(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan")
+
+
+@pytest.fixture()
+def table_test_null_nan_rewritten(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan_rewritten")
+
+
+@pytest.mark.integration
+def test_pyarrow_nan(table_test_null_nan: Table) -> None:
+    arrow_table = table_test_null_nan.scan(row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")).to_arrow()
+    assert len(arrow_table) == 1
+    assert arrow_table[0][0].as_py() == 1
+    assert math.isnan(arrow_table[1][0].as_py())

Review Comment:
   I think it would be easier to read these tests if you called `as_py()` to produce rows and validated the rows. It looks like there's just one row, but the row/column indexes are backward because this is columnar?
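
To illustrate the suggestion above: a columnar table is indexed column-first, so `arrow_table[1][0]` means column 1, row 0. The sketch below models the one-row result as a plain dict of column lists (a stand-in for the Arrow table, not PyArrow itself) and converts it to row dicts before asserting. `columns_to_rows` is a hypothetical helper introduced only for this example; with PyArrow installed, `Table.to_pylist()` should give a comparable row-oriented view.

```python
import math
from typing import Any, Dict, List


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a column-oriented mapping into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Assumed stand-in for the filtered scan result: one row whose col_numeric is NaN.
columns = {"idx": [1], "col_numeric": [float("nan")]}

rows = columns_to_rows(columns)
assert len(rows) == 1
assert rows[0]["idx"] == 1
# NaN never compares equal to itself, so check it with math.isnan
# instead of comparing the whole row against an expected dict.
assert math.isnan(rows[0]["col_numeric"])
```

Validating by column name rather than by position also avoids the row/column index confusion raised in the comment.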





[GitHub] [iceberg] rdblue commented on a diff in pull request #6398: Python: Integration tests

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1136186713


##########
python/dev/spark-defaults.conf:
##########
@@ -0,0 +1,29 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+spark.sql.extensions                   org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
+spark.sql.catalog.demo                 org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.demo.catalog-impl    org.apache.iceberg.rest.RESTCatalog

Review Comment:
   type=rest?





[GitHub] [iceberg] rdblue commented on a diff in pull request #6398: Python: Integration tests

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1136188827


##########
python/dev/docker-compose-integration.yml:
##########
@@ -0,0 +1,76 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+version: "3"
+
+services:
+  spark-iceberg:
+    image: python-integration
+    container_name: pyiceberg-spark
+    build: .
+    depends_on:
+      - rest
+      - minio
+    volumes:
+      - ./warehouse:/home/iceberg/warehouse
+    environment:
+      - AWS_ACCESS_KEY_ID=admin
+      - AWS_SECRET_ACCESS_KEY=password
+      - AWS_REGION=us-east-1
+    ports:
+      - 8888:8888
+      - 8080:8080
+    links:
+      - rest:rest
+      - minio:minio
+  rest:
+    image: tabulario/iceberg-rest:0.2.0
+    container_name: pyiceberg-rest

Review Comment:
   Where does this store the underlying catalog metadata?





[GitHub] [iceberg] Fokko commented on a diff in pull request #6398: Python: Integration tests

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1137320389


##########
python/dev/spark-defaults.conf:
##########
@@ -0,0 +1,29 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+spark.sql.extensions                   org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
+spark.sql.catalog.demo                 org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.demo.catalog-impl    org.apache.iceberg.rest.RESTCatalog

Review Comment:
   Less is more, thanks!





[GitHub] [iceberg] Fokko merged pull request #6398: Python: Integration tests

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko merged PR #6398:
URL: https://github.com/apache/iceberg/pull/6398




[GitHub] [iceberg] rdblue commented on a diff in pull request #6398: Python: Integration tests

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1136189567


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -428,11 +428,11 @@ def visit_not_in(self, term: BoundTerm[pc.Expression], literals: Set[Any]) -> pc
 
     def visit_is_nan(self, term: BoundTerm[Any]) -> pc.Expression:
         ref = pc.field(term.ref().field.name)
-        return ref.is_null(nan_is_null=True) & ref.is_valid()
+        return pc.is_nan(ref)

Review Comment:
   This probably shouldn't be in this PR, right? It seems like an update for a new version of PyArrow.
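
For context, both expressions in this hunk are meant to select the same rows: values that are NaN but not null. Below is a minimal plain-Python model of that logic, using `None` for null. The helper names are hypothetical and the code does not call PyArrow, so it says nothing about how a particular PyArrow version evaluates either expression.

```python
import math
from typing import Optional


def is_nan_old(value: Optional[float]) -> bool:
    """Model of `is_null(nan_is_null=True) & is_valid()`."""
    null_or_nan = value is None or math.isnan(value)
    valid = value is not None
    return null_or_nan and valid


def is_nan_new(value: Optional[float]) -> bool:
    """Model of `is_nan`, assuming a null never passes the filter."""
    return value is not None and math.isnan(value)


samples = [None, float("nan"), 0.0, 1.5]
# Under this model the old and new predicates agree on every sample.
assert [is_nan_old(v) for v in samples] == [is_nan_new(v) for v in samples]
```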





[GitHub] [iceberg] rdblue commented on a diff in pull request #6398: Python: Integration tests

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1136189961


##########
python/pyiceberg/table/__init__.py:
##########
@@ -331,7 +331,7 @@ def __init__(
         self,
         table: Table,
         row_filter: Union[str, BooleanExpression] = ALWAYS_TRUE,
-        selected_fields: Tuple[str] = ("*",),
+        selected_fields: Tuple[str, ...] = ("*",),

Review Comment:
   This also seems like a separate PR change, but good cleanup.
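
The distinction behind this cleanup: `Tuple[str]` describes a tuple of exactly one string, so a type checker would be expected to flag a call such as `selected_fields=("idx", "col_numeric")`, while `Tuple[str, ...]` accepts any length. A small sketch using only the standard library; `count_fields` is a hypothetical function for illustration, not part of PyIceberg.

```python
from typing import Tuple, get_args

one_field = Tuple[str]        # exactly one str element
any_fields = Tuple[str, ...]  # any number of str elements

# The Ellipsis in the type arguments marks the variable-length form.
print(get_args(one_field))
print(get_args(any_fields))


def count_fields(selected_fields: Tuple[str, ...] = ("*",)) -> int:
    """Accepts one or many field names, matching the corrected annotation."""
    return len(selected_fields)


print(count_fields())
print(count_fields(("idx", "col_numeric")))
```

Note that the annotation is not enforced at runtime; the change matters for static checkers such as mypy.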





[GitHub] [iceberg] Fokko commented on a diff in pull request #6398: Python: Integration tests

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1137353269


##########
python/tests/test_integration.py:
##########
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint:disable=redefined-outer-name
+
+import math
+
+import pytest
+
+from pyiceberg.catalog import Catalog, load_catalog
+from pyiceberg.expressions import IsNaN, NotNaN
+from pyiceberg.table import Table
+
+
+@pytest.fixture()
+def catalog() -> Catalog:
+    return load_catalog(
+        "local",
+        **{
+            "type": "rest",
+            "uri": "http://localhost:8181",
+            "s3.endpoint": "http://localhost:9000",
+            "s3.access-key-id": "admin",
+            "s3.secret-access-key": "password",
+        },
+    )
+
+
+@pytest.fixture()
+def table_test_null_nan(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan")
+
+
+@pytest.fixture()
+def table_test_null_nan_rewritten(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan_rewritten")
+
+
+@pytest.mark.integration
+def test_pyarrow_nan(table_test_null_nan: Table) -> None:
+    arrow_table = table_test_null_nan.scan(row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")).to_arrow()
+    assert len(arrow_table) == 1
+    assert arrow_table[0][0].as_py() == 1
+    assert math.isnan(arrow_table[1][0].as_py())
+
+
+@pytest.mark.integration
+def test_pyarrow_nan_rewritten(table_test_null_nan_rewritten: Table) -> None:
+    arrow_table = table_test_null_nan_rewritten.scan(
+        row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")
+    ).to_arrow()
+    assert len(arrow_table) == 1
+    assert arrow_table[0][0].as_py() == 1
+    assert math.isnan(arrow_table[1][0].as_py())
+
+
+@pytest.mark.integration
+@pytest.mark.skip(reason="Fixing issues with NaN's: https://github.com/apache/arrow/issues/34162")
+def test_pyarrow_not_nan_count(table_test_null_nan: Table) -> None:
+    not_nan = table_test_null_nan.scan(row_filter=NotNaN("col_numeric"), selected_fields=("idx",)).to_arrow()
+    assert len(not_nan) == 2
+
+
+@pytest.mark.integration
+def test_duckdb_nan(table_test_null_nan_rewritten: Table) -> None:
+    con = table_test_null_nan_rewritten.scan().to_duckdb("table_test_null_nan")
+    result = con.query("SELECT idx FROM table_test_null_nan WHERE isnan(col_numeric)").fetchone()
+    assert result == (1,)

Review Comment:
   Now it does :)





[GitHub] [iceberg] rdblue commented on a diff in pull request #6398: Python: Integration tests

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1136184834


##########
.github/workflows/python-integration.yml:
##########
@@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: "Python CI"
+on:
+  push:
+    branches:
+    - 'master'
+    - '0.**'
+    tags:
+    - 'apache-iceberg-**'
+  pull_request:
+    paths:
+    - '.github/workflows/python-ci.yml'
+    - 'python/**'
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+
+jobs:
+  integration-test:
+    runs-on: ubuntu-20.04
+
+    steps:
+    - uses: actions/checkout@v3
+      with:
+        fetch-depth: 2
+    - shell: pwsh
+      id: check_file_changed
+      run: |
+        $diff = git diff --name-only HEAD^ HEAD
+        $SourceDiff = $diff | Where-Object { $_ -match '^python/dev/Dockerfile$' }
+        $HasDiff = $SourceDiff.Length -gt 0
+        Write-Host "::set-output name=docs_changed::$HasDiff"
+    - name: Restore image
+      id: cache-docker
+      uses: actions/cache@v3
+      with:
+        path: ci/cache/docker/python
+        key: cache-mintegration
+    - name: Update Image Cache if cache miss
+      if: steps.cache-docker.outputs.cache-hit != 'true' || steps.check_file_changed.outputs.docs_changed == 'True'
+      run: |
+        docker build -t python-integration python/dev/ && \
+        mkdir -p ci/cache/docker/python && \
+        docker image save python-integration --output ./ci/cache/docker/python/python-integration.tar
+    - name: Use Image Cache if cache hit
+      if: steps.cache-docker.outputs.cache-hit == 'true'
+      run: docker image load --input ./ci/cache/docker/python/python-integration.tar
+    - name: Run Apache-Spark setup
+      working-directory: ./python
+      run: |
+        docker-compose -f dev/docker-compose-integration.yml up -d
+        sleep 10
+    - name: Install poetry
+      run: pip install poetry
+    - uses: actions/setup-python@v4
+      with:
+        python-version: '3.9'
+        cache: poetry
+        cache-dependency-path: |
+          ./python/poetry.lock

Review Comment:
   Should there be more than just the lock file?





[GitHub] [iceberg] Fokko commented on a diff in pull request #6398: Python: Integration tests

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1137351565


##########
python/tests/test_integration.py:
##########
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint:disable=redefined-outer-name
+
+import math
+
+import pytest
+
+from pyiceberg.catalog import Catalog, load_catalog
+from pyiceberg.expressions import IsNaN, NotNaN
+from pyiceberg.table import Table
+
+
+@pytest.fixture()
+def catalog() -> Catalog:
+    return load_catalog(
+        "local",
+        **{
+            "type": "rest",
+            "uri": "http://localhost:8181",
+            "s3.endpoint": "http://localhost:9000",
+            "s3.access-key-id": "admin",
+            "s3.secret-access-key": "password",
+        },
+    )
+
+
+@pytest.fixture()
+def table_test_null_nan(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan")
+
+
+@pytest.fixture()
+def table_test_null_nan_rewritten(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan_rewritten")
+
+
+@pytest.mark.integration
+def test_pyarrow_nan(table_test_null_nan: Table) -> None:
+    arrow_table = table_test_null_nan.scan(row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")).to_arrow()
+    assert len(arrow_table) == 1
+    assert arrow_table[0][0].as_py() == 1
+    assert math.isnan(arrow_table[1][0].as_py())

Review Comment:
   I've changed it into `assert math.isnan(arrow_table["col_numeric"][0].as_py())`





[GitHub] [iceberg] Fokko commented on a diff in pull request #6398: Python: Integration tests

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1137346006


##########
python/tests/test_integration.py:
##########
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint:disable=redefined-outer-name
+
+import math
+
+import pytest
+
+from pyiceberg.catalog import Catalog, load_catalog
+from pyiceberg.expressions import IsNaN, NotNaN
+from pyiceberg.table import Table
+
+
+@pytest.fixture()
+def catalog() -> Catalog:
+    return load_catalog(
+        "local",
+        **{
+            "type": "rest",
+            "uri": "http://localhost:8181",
+            "s3.endpoint": "http://localhost:9000",
+            "s3.access-key-id": "admin",
+            "s3.secret-access-key": "password",
+        },
+    )
+
+
+@pytest.fixture()
+def table_test_null_nan(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan")
+
+
+@pytest.fixture()
+def table_test_null_nan_rewritten(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan_rewritten")
+
+
+@pytest.mark.integration
+def test_pyarrow_nan(table_test_null_nan: Table) -> None:
+    arrow_table = table_test_null_nan.scan(row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")).to_arrow()
+    assert len(arrow_table) == 1
+    assert arrow_table[0][0].as_py() == 1
+    assert math.isnan(arrow_table[1][0].as_py())

Review Comment:
   Let me rewrite those tests a bit





[GitHub] [iceberg] rdblue commented on a diff in pull request #6398: Python: Integration tests

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1136193990


##########
python/tests/test_integration.py:
##########
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint:disable=redefined-outer-name
+
+import math
+
+import pytest
+
+from pyiceberg.catalog import Catalog, load_catalog
+from pyiceberg.expressions import IsNaN, NotNaN
+from pyiceberg.table import Table
+
+
+@pytest.fixture()
+def catalog() -> Catalog:
+    return load_catalog(
+        "local",
+        **{
+            "type": "rest",
+            "uri": "http://localhost:8181",
+            "s3.endpoint": "http://localhost:9000",
+            "s3.access-key-id": "admin",
+            "s3.secret-access-key": "password",
+        },
+    )
+
+
+@pytest.fixture()
+def table_test_null_nan(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan")
+
+
+@pytest.fixture()
+def table_test_null_nan_rewritten(catalog: Catalog) -> Table:
+    return catalog.load_table("default.test_null_nan_rewritten")
+
+
+@pytest.mark.integration
+def test_pyarrow_nan(table_test_null_nan: Table) -> None:
+    arrow_table = table_test_null_nan.scan(row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")).to_arrow()
+    assert len(arrow_table) == 1
+    assert arrow_table[0][0].as_py() == 1
+    assert math.isnan(arrow_table[1][0].as_py())
+
+
+@pytest.mark.integration
+def test_pyarrow_nan_rewritten(table_test_null_nan_rewritten: Table) -> None:
+    arrow_table = table_test_null_nan_rewritten.scan(
+        row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")
+    ).to_arrow()
+    assert len(arrow_table) == 1
+    assert arrow_table[0][0].as_py() == 1
+    assert math.isnan(arrow_table[1][0].as_py())
+
+
+@pytest.mark.integration
+@pytest.mark.skip(reason="Fixing issues with NaN's: https://github.com/apache/arrow/issues/34162")
+def test_pyarrow_not_nan_count(table_test_null_nan: Table) -> None:
+    not_nan = table_test_null_nan.scan(row_filter=NotNaN("col_numeric"), selected_fields=("idx",)).to_arrow()
+    assert len(not_nan) == 2
+
+
+@pytest.mark.integration
+def test_duckdb_nan(table_test_null_nan_rewritten: Table) -> None:
+    con = table_test_null_nan_rewritten.scan().to_duckdb("table_test_null_nan")
+    result = con.query("SELECT idx FROM table_test_null_nan WHERE isnan(col_numeric)").fetchone()
+    assert result == (1,)

Review Comment:
   It doesn't return NaN?





[GitHub] [iceberg] Fokko commented on a diff in pull request #6398: Python: Integration tests

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1137319685


##########
.github/workflows/python-integration.yml:
##########
@@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: "Python CI"
+on:
+  push:
+    branches:
+    - 'master'
+    - '0.**'
+    tags:
+    - 'apache-iceberg-**'
+  pull_request:
+    paths:
+    - '.github/workflows/python-ci.yml'
+    - 'python/**'
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+
+jobs:
+  integration-test:
+    runs-on: ubuntu-20.04
+
+    steps:
+    - uses: actions/checkout@v3
+      with:
+        fetch-depth: 2
+    - shell: pwsh
+      id: check_file_changed
+      run: |
+        $diff = git diff --name-only HEAD^ HEAD
+        $SourceDiff = $diff | Where-Object { $_ -match '^python/dev/Dockerfile$' }
+        $HasDiff = $SourceDiff.Length -gt 0
+        Write-Host "::set-output name=docs_changed::$HasDiff"
+    - name: Restore image
+      id: cache-docker
+      uses: actions/cache@v3
+      with:
+        path: ci/cache/docker/python
+        key: cache-mintegration
+    - name: Update Image Cache if cache miss
+      if: steps.cache-docker.outputs.cache-hit != 'true' || steps.check_file_changed.outputs.docs_changed == 'True'
+      run: |
+        docker build -t python-integration python/dev/ && \
+        mkdir -p ci/cache/docker/python && \
+        docker image save python-integration --output ./ci/cache/docker/python/python-integration.tar
+    - name: Use Image Cache if cache hit
+      if: steps.cache-docker.outputs.cache-hit == 'true'
+      run: docker image load --input ./ci/cache/docker/python/python-integration.tar
+    - name: Run Apache-Spark setup
+      working-directory: ./python
+      run: |
+        docker-compose -f dev/docker-compose-integration.yml up -d
+        sleep 10
+    - name: Install poetry
+      run: pip install poetry
+    - uses: actions/setup-python@v4
+      with:
+        python-version: '3.9'
+        cache: poetry
+        cache-dependency-path: |
+          ./python/poetry.lock

Review Comment:
   If you change the dependencies, you need to regenerate the lock file, so keying the cache on `poetry.lock` should be enough.
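
   A minimal sketch of that reasoning, not part of the PR. It assumes the cache key is derived from a hash of the file named in `cache-dependency-path`; `cache_key` and the lock file contents are hypothetical and only illustrate that a regenerated lock file produces a new key.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(lock_file: Path, prefix: str = "poetry") -> str:
    """Hypothetical helper: build a cache key from the lock file's bytes."""
    digest = hashlib.sha256(lock_file.read_bytes()).hexdigest()
    return f"{prefix}-{digest[:16]}"


with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / "poetry.lock"

    # Made-up lock file content; an unchanged file always yields the same key.
    lock.write_text('[[package]]\nname = "pyarrow"\nversion = "11.0.0"\n')
    key_before = cache_key(lock)
    key_same = cache_key(lock)

    # Regenerating the lock file after a dependency change alters its bytes,
    # so the derived key changes and the old cache entry is no longer matched.
    lock.write_text('[[package]]\nname = "pyarrow"\nversion = "12.0.0"\n')
    key_after = cache_key(lock)

print(key_before == key_same, key_before != key_after)
```

   Under that assumption, no extra paths are needed as long as every dependency change goes through the lock file.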





[GitHub] [iceberg] Fokko commented on a diff in pull request #6398: Python: Integration tests

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1137321903


##########
python/dev/docker-compose-integration.yml:
##########
@@ -0,0 +1,76 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+version: "3"
+
+services:
+  spark-iceberg:
+    image: python-integration
+    container_name: pyiceberg-spark
+    build: .
+    depends_on:
+      - rest
+      - minio
+    volumes:
+      - ./warehouse:/home/iceberg/warehouse
+    environment:
+      - AWS_ACCESS_KEY_ID=admin
+      - AWS_SECRET_ACCESS_KEY=password
+      - AWS_REGION=us-east-1
+    ports:
+      - 8888:8888
+      - 8080:8080
+    links:
+      - rest:rest
+      - minio:minio
+  rest:
+    image: tabulario/iceberg-rest:0.2.0
+    container_name: pyiceberg-rest

Review Comment:
   It is backed by an in-memory SQLite database.
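
   A small illustration of what an in-memory SQLite store implies, using Python's `sqlite3` module rather than the REST image itself. The table name, columns, and metadata path are hypothetical; the assumption is only that the same "state lives as long as the process" behaviour carries over to the catalog container.

```python
import sqlite3


def open_catalog() -> sqlite3.Connection:
    """Open a fresh in-memory database with a hypothetical catalog table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE IF NOT EXISTS iceberg_tables ("
        "name TEXT PRIMARY KEY, metadata_location TEXT)"
    )
    return con


# First "run" of the catalog: register one table.
first = open_catalog()
first.execute(
    "INSERT INTO iceberg_tables VALUES (?, ?)",
    ("default.test_null_nan", "s3://warehouse/default/test_null_nan/metadata/v1.json"),
)
rows_before_restart = first.execute("SELECT count(*) FROM iceberg_tables").fetchone()[0]
first.close()

# Simulated restart: a new in-memory database starts out empty.
second = open_catalog()
rows_after_restart = second.execute("SELECT count(*) FROM iceberg_tables").fetchone()[0]
second.close()

print(rows_before_restart, rows_after_restart)
```

   If that holds for the container, the tables have to be re-created each time the integration environment is brought up.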





[GitHub] [iceberg] Fokko commented on a diff in pull request #6398: Python: Integration tests

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #6398:
URL: https://github.com/apache/iceberg/pull/6398#discussion_r1137324576


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -428,11 +428,11 @@ def visit_not_in(self, term: BoundTerm[pc.Expression], literals: Set[Any]) -> pc
 
     def visit_is_nan(self, term: BoundTerm[Any]) -> pc.Expression:
         ref = pc.field(term.ref().field.name)
-        return ref.is_null(nan_is_null=True) & ref.is_valid()
+        return pc.is_nan(ref)

Review Comment:
   This is actually to make the CI pass. I've [created a PR](https://github.com/apache/arrow/pull/34184) to allow `ref.is_nan()` as well, but this is not released yet.
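
   To show why the two forms are meant to select the same rows, here is a plain-Python model of both predicates. It does not call PyArrow; `is_null` and `is_valid` below are hypothetical stand-ins that mimic the intent of `ref.is_null(nan_is_null=True)` and `ref.is_valid()`, and a null input is simply treated as "not NaN".

```python
import math
from typing import List, Optional

# One ordinary value, one NaN, and one null (None), as in the test tables.
values: List[Optional[float]] = [1.0, float("nan"), None]


def is_null(v: Optional[float], nan_is_null: bool = False) -> bool:
    """Stand-in for a null check that can optionally count NaN as null."""
    return v is None or (nan_is_null and math.isnan(v))


def is_valid(v: Optional[float]) -> bool:
    """Stand-in for 'the value is not null'."""
    return v is not None


# Old form: (null or NaN) and not null, which leaves only NaN.
old_predicate = [is_null(v, nan_is_null=True) and is_valid(v) for v in values]

# New form: ask for NaN directly.
new_predicate = [v is not None and math.isnan(v) for v in values]

print(old_predicate, new_predicate)
```

   Both lists mark only the NaN row, which is the single row the `IsNaN` tests expect back.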


