Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/23 07:31:05 UTC

[GitHub] [arrow] Neighbor-Mr-Wang opened a new issue, #14713: 【pyarrow】upload jpeg files to s3 much slower than boto

Neighbor-Mr-Wang opened a new issue, #14713:
URL: https://github.com/apache/arrow/issues/14713

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   I need to upload a large number of small images to S3. I benchmarked pyarrow against boto3 under the same conditions (85625 JPEG images, 9.2GB total, 8 processes): boto3 took 100s, while pyarrow took 500s, roughly five times slower. As I understand it, pyarrow also uses boto under the hood, so why is pyarrow so much slower?
   ```
   from typing import List
    from concurrent.futures import ProcessPoolExecutor, as_completed
   from boto3.session import Session
   import os
   import time
   from pyarrow import fs
   
   NUM_WORKERS = 8
   
   class S3Info(object):
       ACCESS_KEY = 'xxxx'
       SECRET_KEY = 'xxxx'
       ENDPOINT = 'http://xxxx'
       BUCKET = 'xxx'
   
   def import_data(file_list: List[str]):
        worker_pool = ProcessPoolExecutor(max_workers=NUM_WORKERS)  # match the 8-process benchmark
       futures, fail_list = [], []
       length = len(file_list)
       step = int(length / NUM_WORKERS) + 1
       for i in range(0, length, step):
           sub_fd_list = file_list[i: i + step]
           futures.append(worker_pool.submit(put_list_files_by_arrow, sub_fd_list))  # 8 processes, use 500s
           # or: futures.append(worker_pool.submit(put_list_files_by_boto, sub_fd_list))  8 processes, use 100s
       
       for future in as_completed(futures):
           sub_fail_list = future.result()
           fail_list += sub_fail_list
        return fail_list
   
   def put_list_files_by_boto(file_list: List[str]):
       dst_session = Session(aws_access_key_id=S3Info.ACCESS_KEY, aws_secret_access_key=S3Info.SECRET_KEY)
       dst_s3 = dst_session.client("s3", endpoint_url=S3Info.ENDPOINT)
   
       dst_bucket = 'xxx'
       file_dir = '/data/test/10GJPEG'
       
       fail_list = []
       for file in file_list:
           try:
               file_path = os.path.join(file_dir, file)
               with open(file_path, 'rb') as f:
                   data = f.read()
                   dst_key = 'ds_test/put_test_3/'+file
                   dst_s3.put_object(Body=data, Key=dst_key, Bucket=dst_bucket)
           except Exception as exc:
               fail_list.append(file)
   
       return fail_list
   
   def put_list_files_by_arrow(file_list: List[str]):
       handle = fs.S3FileSystem(
           access_key=S3Info.ACCESS_KEY,
           secret_key=S3Info.SECRET_KEY,
           endpoint_override=S3Info.ENDPOINT)
   
       dst_bucket = 'xxx'
       file_dir = '/data/test/10GJPEG'
       
       fail_list = []
       for file in file_list:
           try:
               file_path = os.path.join(file_dir, file)
               with open(file_path, 'rb') as f:
                   data = f.read()
                    dst_key = 'xxx/ds_test/put_test_3/'+file  # pyarrow S3 paths include the bucket name
                    with handle.open_output_stream(dst_key) as f_new:
                        f_new.write(data)
           except Exception as exc:
               fail_list.append(file)
   
       return fail_list
   
   if __name__ == "__main__":
    src_dir = '/data/test/10GJPEG'  # 85625 JPEG images, 9.2GB total
    files = os.listdir(src_dir)
       print('start .......')
       begin = time.time()
       import_data(files)
       end = time.time()
       print(f'{NUM_WORKERS} workers, use time: {end-begin}')
   ```
   
   ### Component
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] assignUser commented on issue #14713: 【pyarrow】upload jpeg files to s3 much slower than boto

Posted by GitBox <gi...@apache.org>.
assignUser commented on issue #14713:
URL: https://github.com/apache/arrow/issues/14713#issuecomment-1326489119

   Please see this related [JIRA issue](https://issues.apache.org/jira/browse/ARROW-17961).

