Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/23 09:22:36 UTC

[GitHub] [iceberg] dungdm93 opened a new pull request, #6254: Python: implement `to_pandas`

dungdm93 opened a new pull request, #6254:
URL: https://github.com/apache/iceberg/pull/6254

   After #6233, we can read an Iceberg table into a `pandas` DataFrame via Apache Arrow:
   ```python
   taxi.scan().to_arrow().to_pandas()
   ```
   
   Since pandas is the most popular Python data analysis library, I'd like to add a shortcut that converts a `TableScan` to a `pd.DataFrame` directly.
   ```python
   taxi.scan().to_pandas()
   ```
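
A minimal sketch of how such a shortcut could delegate to the existing Arrow path. `ArrowTableStandIn` and `TableScanSketch` are hypothetical stand-ins that avoid third-party imports; they are not PyIceberg's actual classes. In the real library, `to_arrow()` would presumably return a `pyarrow.Table`, whose `to_pandas()` method produces the DataFrame.

```python
from typing import Any, Dict, List


class ArrowTableStandIn:
    """Hypothetical stand-in for pyarrow.Table (assumption, stdlib only)."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self.columns = columns

    def to_pandas(self) -> Dict[str, List[Any]]:
        # A real pyarrow.Table would return a pandas.DataFrame here;
        # the stand-in returns a plain dict of columns instead.
        return dict(self.columns)


class TableScanSketch:
    """Hypothetical scan object exposing both conversion paths."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self._columns = columns

    def to_arrow(self) -> ArrowTableStandIn:
        return ArrowTableStandIn(self._columns)

    def to_pandas(self) -> Dict[str, List[Any]]:
        # The shortcut only chains the two existing calls.
        return self.to_arrow().to_pandas()


scan = TableScanSketch({"trip_id": [1, 2], "fare": [9.5, 12.0]})
# Both spellings yield the same result; the shortcut adds no new behavior.
assert scan.to_pandas() == scan.to_arrow().to_pandas()
```

The point of the sketch is that the shortcut is a one-line delegation, so its cost lies mainly in the extra optional dependency rather than in new logic.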
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dungdm93 commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
dungdm93 commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1324797834

   `pandas` has type hints, but it seems to be missing a `py.typed` file.




[GitHub] [iceberg] Fokko commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1345636980

   Thanks @dungdm93 for working on this 🙌🏻 




[GitHub] [iceberg] dungdm93 commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
dungdm93 commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1328131979

   @rdblue In that case, I'd like to have a docs page that explains how to use Iceberg with popular frameworks.




[GitHub] [iceberg] dungdm93 commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
dungdm93 commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1345456299

   @rdblue, @Fokko I have rebased, and the conflicts are resolved.




[GitHub] [iceberg] Fokko commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1325556537

   Fair point @rdblue. I think it makes more sense to return a PyArrow dataset: https://github.com/apache/iceberg/pull/6258#discussion_r1030733926 Then this would translate to: `to_arrow().to_table().to_pandas()`. We could also split them into `to_pyarrow_dataset()` and `to_pyarrow_table()` (or `to_pyarrow_dataset() -> pa.Dataset` and `to_pyarrow() -> pa.Table`, to keep `to_pyarrow` as is).




[GitHub] [iceberg] dungdm93 commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
dungdm93 commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1324777707

   @Fokko Yes. `poetry.lock` has been updated.




[GitHub] [iceberg] rdblue commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1325529068

   If the value of this is just to get rid of one intermediate call (turning `to_arrow().to_pandas()` into `to_pandas()`), then I'm not sure it is worth the trouble. We don't need the additional complexity of all the requirements and yet another optional dependency.




[GitHub] [iceberg] dungdm93 commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
dungdm93 commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1325884010

   @rdblue As far as I can see, `to_duckdb` is basically just `con.register(...)`, so what is the difference?




[GitHub] [iceberg] rdblue commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1344935248

   Sounds good to me. Merge away.




[GitHub] [iceberg] Fokko commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1328652496

   @rdblue I don't see much harm as the dependencies are optional. I'd also like end-to-end tests in the future so we don't break any integrations without us knowing.
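
One way to keep an optional dependency low-risk is to import it only when the feature is used and to fail with an actionable message otherwise. The sketch below illustrates that idea; `require_optional` is a hypothetical helper, not part of PyIceberg, and the install hint is only an example string.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Re-raise with a hint so the user knows how to enable the feature.
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this feature; {install_hint}"
        ) from exc


# Demonstrated with a stdlib module so the sketch runs anywhere; a library
# would pass the name of its optional dependency (for example "pandas").
json_module = require_optional("json", "it ships with Python")
```

With this pattern, users who never call the pandas-specific method never need pandas installed, and those who do get a clear error instead of a bare import failure.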




[GitHub] [iceberg] Fokko commented on a diff in pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#discussion_r1030205856


##########
python/pyiceberg/table/__init__.py:
##########
@@ -54,6 +55,10 @@
 )
 from pyiceberg.types import StructType
 
+if TYPE_CHECKING:

Review Comment:
   This is nice 👍🏻 
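
For context, a minimal sketch of the `TYPE_CHECKING` pattern the diff introduces. It assumes the guarded import is `pandas`, which the excerpt above does not show; `TableScanSketch` is a hypothetical class, not the actual code in `python/pyiceberg/table/__init__.py`.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by type checkers such as mypy, never at runtime,
    # so pandas does not have to be installed to import this module.
    import pandas as pd


class TableScanSketch:
    """Hypothetical class; not the actual pyiceberg TableScan."""

    def to_arrow(self):
        # Placeholder: the real method would return a pyarrow.Table.
        raise NotImplementedError("stand-in for the Arrow conversion")

    def to_pandas(self) -> "pd.DataFrame":
        # The quoted annotation is a forward reference, so the name `pd`
        # is only resolved by the type checker.
        return self.to_arrow().to_pandas()
```

Because `TYPE_CHECKING` is `False` at runtime, the module imports cleanly without pandas while the return type stays visible to static analysis.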





[GitHub] [iceberg] rdblue commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1328118366

   @dungdm93, I agree. Maybe we shouldn't have a `to_duckdb` either. Is it really worth the additional dependency headache?




[GitHub] [iceberg] Fokko commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1324786981

   @dungdm93 thanks! It looks like pandas doesn't have any types. We can ignore this by adding:
   
   ```
   [[tool.mypy.overrides]]
   module = "pandas.*"
   ignore_missing_imports = true
   ```
   
   to the `pyproject.toml`.




[GitHub] [iceberg] Fokko commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1324775075

   @dungdm93 It looks like you need to run `poetry update` to update the `poetry.lock` file.




[GitHub] [iceberg] Dr-Irv commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
Dr-Irv commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1325339822

   You can get typing for pandas by installing `pandas-stubs`.




[GitHub] [iceberg] TomAugspurger commented on pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
TomAugspurger commented on PR #6254:
URL: https://github.com/apache/iceberg/pull/6254#issuecomment-1325280070

   xref https://github.com/pandas-dev/pandas/issues/28142. cc @Dr-Irv on whether we're ready to add this or not yet.




[GitHub] [iceberg] Fokko merged pull request #6254: Python: implement `to_pandas`

Posted by GitBox <gi...@apache.org>.
Fokko merged PR #6254:
URL: https://github.com/apache/iceberg/pull/6254

