Posted to issues@ozone.apache.org by "mingchao zhao (Jira)" <ji...@apache.org> on 2020/02/09 10:35:00 UTC

[jira] [Commented] (HDDS-2443) Python client/interface for Ozone

    [ https://issues.apache.org/jira/browse/HDDS-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033169#comment-17033169 ] 

mingchao zhao commented on HDDS-2443:
-------------------------------------

Hi [~cxorm] Any progress on the previous question? Here's what I got:

I had a look at how pyarrow's connect() executes. It calls the JNI-based libhdfs function [hdfsConnect|https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/hdfs.py#L206]. Here are some concerns:
The first time connect() is called in a process, it takes a long time to load the library.
In my test, each operation started a separate process, which then connected and uploaded. Each connect cost about 1.5 seconds. If a user's scenario is the same as mine, their operations will be slow too. We also tested the AWS Python client (boto3), and it performed much better under the same conditions.

*It would be much better if the user created the connection only once and then reused it.* I tested reusing the connection, and performance improved tremendously:
Test cluster (using the pyarrow client): 9 physical machines, each with 10 HDD disks; 1 machine as master for OM and SCM, 8 as datanodes.

|Upload files|Total size|Multi Raft latency (s), reuse connect|Multi Raft latency (s), no reuse connect|
|100KB * 1000 files|100MB|151.858362913|2471.23463202|
|100KB * 20000 files|2GB|2482.97329998 (=~0.69h)|49398.845176 (=~13.7h)|
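
The difference between the two latency columns comes down to how often connect() runs. Below is a minimal sketch of the two patterns, not the script used for the test above. The helper name upload_files and its parameters are hypothetical; connect is assumed to be any zero-argument callable that returns a filesystem object with an open(path, mode) method, which is the shape of the object returned by pyarrow's legacy hdfs.connect().

```python
from typing import Callable, Iterable, Tuple


def upload_files(connect: Callable[[], object],
                 items: Iterable[Tuple[str, bytes]],
                 reuse_connection: bool = True) -> int:
    """Write each (path, payload) pair and return how many times connect() ran.

    Hypothetical helper for illustration only: `connect` is any zero-argument
    callable returning an object that exposes open(path, mode).
    """
    connections = 0
    fs = None
    for path, payload in items:
        # Reconnect for every file only when reuse is disabled; otherwise
        # connect once, on the first file, and keep using the same handle.
        if fs is None or not reuse_connection:
            fs = connect()
            connections += 1
        with fs.open(path, "wb") as f:
            f.write(payload)
    return connections
```

With the legacy pyarrow API this might be called as upload_files(lambda: pyarrow.hdfs.connect(host, port), items), where host and port are placeholders for your cluster. That API was deprecated in later pyarrow releases in favor of pyarrow.fs.HadoopFileSystem, so treat the call as an assumption about your installed version. Reuse only helps within one process: if every upload runs in its own process, as in my test, each process still pays the library load and connect cost.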

> Python client/interface for Ozone
> ---------------------------------
>
>                 Key: HDDS-2443
>                 URL: https://issues.apache.org/jira/browse/HDDS-2443
>             Project: Hadoop Distributed Data Store
>          Issue Type: New Feature
>          Components: Ozone Client
>            Reporter: Li Cheng
>            Priority: Major
>         Attachments: Ozone with pyarrow.html, OzoneS3.py
>
>
> This Jira will be used to track development for python client/interface of Ozone.
> Original ideas: item#25 in [https://cwiki.apache.org/confluence/display/HADOOP/Ozone+project+ideas+for+new+contributors]
> Ozone Client(Python) for Data Science Notebook such as Jupyter.
>  # Size: Large
>  # PyArrow: [https://pypi.org/project/pyarrow/]
>  # Python -> libhdfs HDFS JNI library (HDFS, S3,...) -> Java client API. Impala uses libhdfs.
> Path to try:
>  # s3 interface: Ozone s3 gateway(already supported) + AWS python client (boto3)
>  # python native RPC
>  # pyarrow + libhdfs, which use the Java client under the hood.
>  # python + C interface of go / rust ozone library. I created POC go / rust clients earlier which can be improved if the libhdfs interface is not good enough. [By [~elek]]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org