Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/12 17:11:43 UTC
[GitHub] jkhalid commented on issue #5400: [SPARK-6190][core] create LargeByteBuffer for eliminating 2GB block limit
jkhalid commented on issue #5400: [SPARK-6190][core] create LargeByteBuffer for eliminating 2GB block limit
URL: https://github.com/apache/spark/pull/5400#issuecomment-462847339
@squito @SparkQA @vanzin @shaneknapp @tgravescs
I am using spark.sql on AWS Glue to generate a single large compressed CSV file (it is the client's requirement to have a single file), which is definitely greater than 2GB. I am running into this issue:
write(transformed_feed)
File "script_2019-02-12-15-57-55.py", line 161, in write
output_path_premium, header=True, compression="gzip")
File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 766, in csv
File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o210.csv.
...
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-172-32-189-222.ec2.internal, executor 1): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
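(For reference, Integer.MAX_VALUE is 2^31 - 1 = 2,147,483,647 bytes, i.e. just under 2 GiB, which is the largest size a single java.nio.ByteBuffer, and hence a single Spark block, can hold.)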
Below is the Python code used to write the file:
import os

def write(dataframe):
    # write two files: premium and non-premium listings (criteria: listing_priority >= 30 = premium)
    dataframe.filter(dataframe["listing_priority"] >= 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_premium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_premium + '/part-*' + ' ' + output_path_premium + output_file_premium
    os.system(shell_command)
    dataframe.filter(dataframe["listing_priority"] < 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_nonpremium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_nonpremium + '/part-*' + ' ' + output_path_nonpremium + output_file_nonpremium
    os.system(shell_command)
I am assuming it's because the file is greater than 2GB. Has this issue been fixed?
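If the root cause is a single partition/block hitting that limit, one workaround I could try (just a sketch, untested; the partition count of 16 and the local merge path under /tmp are assumptions on my part) is to write several smaller gzip part files and then concatenate them, since concatenated gzip members still decompress as one stream with gunzip/zcat:

import os

def write_premium_in_parts(dataframe):
    premium = (dataframe.filter(dataframe["listing_priority"] >= 30)
                        .drop('listing_priority')
                        .drop('image_count'))
    # Spread the rows over several tasks so no single block reaches Integer.MAX_VALUE.
    premium.repartition(16).write.csv(
        output_path_premium, header=True, compression="gzip")
    # getmerge concatenates all part files in the output directory into one local file.
    os.system("hdfs dfs -getmerge " + output_path_premium + " /tmp/" + output_file_premium)

The merged file would then still have to be copied back to HDFS/S3 with hdfs dfs -put, since -getmerge writes to the local filesystem.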