Posted to github@arrow.apache.org by "idavi-bcs (via GitHub)" <gi...@apache.org> on 2023/10/26 15:03:25 UTC

Re: [I] write_table does not respect indices dtype [arrow]

idavi-bcs commented on issue #36589:
URL: https://github.com/apache/arrow/issues/36589#issuecomment-1781310090

   Unfortunately it looks like this is a well-known bug at this point, reported multiple times (#36589, #30302, #27616). However, I want to point out an important impact that I haven't seen mentioned yet. I have a large data table of int8 categoricals (genotype data); it fits comfortably in memory and can easily be written to Parquet. But I cannot *read* the Parquet file back into memory, because the read now takes roughly 5 times as much space: the dictionary indices come back as int32, so int32 indices and int8 codes (4 + 1 bytes per value) are held in memory simultaneously while Pandas casts back to int8. So my data is effectively lost.
   
   In other words, this is not just a performance bug -- it can actually cause data loss in the case of large tables!
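   A minimal sketch of the round trip being described (column name and category values are made up for illustration; behavior shown is the reported bug, observed with pyarrow at the time of this issue):

   ```python
   import os
   import tempfile

   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq

   # A categorical column whose codes fit in int8 (stand-in for genotype calls).
   df = pd.DataFrame({"gt": pd.Categorical(["0/0", "0/1", "1/1"] * 1000)})

   # Converting to Arrow preserves the narrow index type (int8 here).
   table = pa.Table.from_pandas(df)
   index_type_before = table.schema.field("gt").type.index_type

   path = os.path.join(tempfile.mkdtemp(), "gt.parquet")
   pq.write_table(table, path)

   # Reading the file back restores a dictionary column, but per this issue
   # the indices come back widened to int32 instead of the original int8,
   # so pandas briefly holds both while downcasting (~5 bytes/value vs 1).
   table2 = pq.read_table(path)
   index_type_after = table2.schema.field("gt").type.index_type

   print(index_type_before, index_type_after)
   ```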


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org