Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/28 08:52:51 UTC

[GitHub] [arrow] svjack opened a new issue #9026: Interpretation of underlying logic of pyarrow manipulate hive

svjack opened a new issue #9026:
URL: https://github.com/apache/arrow/issues/9026


   I used pyarrow to handle HDFS files that back Hive tables, and I have been reviewing the pyarrow source code.
   The main utilities for the HDFS filesystem are the Parquet-related functions, plus a lot of IO and metadata/schema-inference code, which is rich and convenient to use.
   The other option is the plain read functions, which read the data as text when manipulating text files on the HDFS filesystem.
   As far as I know, when I create a table in Hive the default storage format is text, and when I use HadoopFileSystem to go down to the actual HDFS path of the Hive table, it seems that the schema and metadata of the table (and the automatic parsing of the delimited lines) cannot be retrieved through pyarrow's internal API.
   I don't want to use SQL tools such as PyHive or others, because that turns this into a "two source" problem (one source being the abstract SQL layer, the other the plain filesystem), even if it is simple.
   So at present I must use pd.read_csv on the file object returned by fs.open, and retrieve the schema info from the TBLS table in MySQL, where the Hive metastore actually keeps the detailed schema. I think this design is not ideal.
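   What I mean is something like the following sketch (the namenode host/port, warehouse path and column names are only illustrative assumptions; Hive's default text format is \x01-delimited with no header row):

       import pandas as pd
       from pyarrow import fs

       # Connect to HDFS; host/port are placeholders.
       hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

       # Open the raw file behind the Hive table and parse it with pandas.
       # Column names have to come from the metastore, fetched separately.
       with hdfs.open_input_stream("/user/hive/warehouse/mydb.db/mytable/000000_0") as f:
           df = pd.read_csv(f, sep="\x01", header=None,
                            names=["id", "name", "amount"])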
   So what I want to know is: did I miss some details about pyarrow's underlying logic related to the HDFS filesystem as used by Hive? Please give me an explanation.
   All of this is about pyarrow's internal design,
   not about other frameworks.
   I would also like a brief introduction to the dataset API's support for Hive's Parquet files and text files. Can you give me some examples of them, mainly for the text storage format in Hive's HDFS warehouse?
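   For example, something along these lines is what I have in mind (the paths, the delimiter and the partitioning flavour are assumptions on my side):

       import pyarrow.dataset as ds
       from pyarrow import csv, fs

       hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

       # Parquet-backed Hive table: partition directories such as dt=2020-12-01
       # can be discovered with the "hive" partitioning flavour.
       parquet_ds = ds.dataset("/user/hive/warehouse/mydb.db/parquet_table",
                               filesystem=hdfs, format="parquet",
                               partitioning="hive")

       # Text-backed Hive table: treat the files as CSV with Hive's \x01 delimiter.
       # Column names/types are inferred from the files, not taken from the
       # metastore, and Hive text files have no header row, so names would
       # still need to be supplied by hand.
       text_format = ds.CsvFileFormat(parse_options=csv.ParseOptions(delimiter="\x01"))
       text_ds = ds.dataset("/user/hive/warehouse/mydb.db/text_table",
                            filesystem=hdfs, format=text_format,
                            partitioning="hive")

       table = text_ds.to_table()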
   I also took a glance at a data-transfer toolkit called sqoop; in its AppendUtils.java it uses some detailed partition-manipulation helpers to perform data appends, and I think all of those functions could be rebuilt with pyarrow. But as I review the pyarrow source code, I cannot find any developed logic for "partition" and "warehouse" manipulation. Has someone built a project, using pyarrow or other Arrow APIs, that implements these functions?
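   What I would hope for is something like the following, where the dataset API writes Hive-style partition directories; the append and warehouse orchestration that sqoop's AppendUtils handles would still be up to the caller. A rough sketch, with illustrative names and paths:

       import pyarrow as pa
       import pyarrow.dataset as ds
       from pyarrow import fs

       hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

       table = pa.table({"id": [1, 2],
                         "amount": [10.0, 20.0],
                         "dt": ["2020-12-28", "2020-12-28"]})

       # Lay out Hive-style partition directories (dt=2020-12-28/...) under the table path.
       ds.write_dataset(table, "/user/hive/warehouse/mydb.db/target_table",
                        filesystem=hdfs, format="parquet",
                        partitioning=ds.partitioning(pa.schema([("dt", pa.string())]),
                                                     flavor="hive"))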


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm closed issue #9026: Interpretation of underlying logic of pyarrow manipulate hive

Posted by GitBox <gi...@apache.org>.
wesm closed issue #9026:
URL: https://github.com/apache/arrow/issues/9026


   





[GitHub] [arrow] jorisvandenbossche commented on issue #9026: Interpretation of underlying logic of pyarrow manipulate hive

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #9026:
URL: https://github.com/apache/arrow/issues/9026#issuecomment-753949295


   > So at present I must use pd.read_csv on the file object returned by fs.open, and retrieve the schema info from the TBLS table in MySQL, where the Hive metastore actually keeps the detailed schema. I think this design is not ideal.
   
   Arrow doesn't have functionality to natively interact with or understand Hive metastores. So if you have a CSV file stored and you want to read it using the schema stored in the Hive metastore, then at the moment you will always need to do something manual, like what you described above.
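   As an illustration of that manual approach (a sketch only; the schema, host and path below are assumptions, and the column names/types would come from the metastore query you mentioned):

       import pyarrow as pa
       from pyarrow import csv, fs

       hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

       # Schema obtained out-of-band, e.g. from the Hive metastore tables.
       schema = pa.schema([("id", pa.int64()),
                           ("name", pa.string()),
                           ("amount", pa.float64())])

       with hdfs.open_input_stream("/user/hive/warehouse/mydb.db/mytable/000000_0") as source:
           table = csv.read_csv(
               source,
               read_options=csv.ReadOptions(column_names=schema.names),
               parse_options=csv.ParseOptions(delimiter="\x01"),
               convert_options=csv.ConvertOptions(
                   column_types={field.name: field.type for field in schema}),
           )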
   
   

