Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/12 17:11:43 UTC
[GitHub] jkhalid commented on issue #5400: [SPARK-6190][core] create LargeByteBuffer for eliminating 2GB block limit
jkhalid commented on issue #5400: [SPARK-6190][core] create LargeByteBuffer for eliminating 2GB block limit
URL: https://github.com/apache/spark/pull/5400#issuecomment-462847339
@squito @SparkQA @vanzin @shaneknapp @tgravescs
I am using spark.sql on AWS Glue to generate a single large compressed CSV file (it is the client's requirement to have a single file), which is definitely greater than 2GB. I am running into this issue:
write(transformed_feed)
File "script_2019-02-12-15-57-55.py", line 161, in write
output_path_premium, header=True, compression="gzip")
File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 766, in csv
File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o210.csv.
...
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-172-32-189-222.ec2.internal, executor 1): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
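(For reference, Integer.MAX_VALUE is 2^31 - 1 = 2,147,483,647 bytes, i.e. just under 2 GiB, which is the largest size a single java.nio.ByteBuffer, and hence a single Spark block, can hold.)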
Below is the Python code used to write the file:
import os

def write(dataframe):
    # write two files: premium and non-premium listings (criteria: listing_priority >= 30 = premium)
    dataframe.filter(dataframe["listing_priority"] >= 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_premium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_premium + '/part-*' + ' ' + output_path_premium + output_file_premium
    os.system(shell_command)
    dataframe.filter(dataframe["listing_priority"] < 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_nonpremium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_nonpremium + '/part-*' + ' ' + output_path_nonpremium + output_file_nonpremium
    os.system(shell_command)
I am assuming it's because the file is greater than 2GB. Has this issue been fixed?
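If the root cause is a single partition/block hitting that limit, one workaround I could try (just a sketch, untested; the partition count of 16 and the local merge path under /tmp are assumptions on my part) is to write several smaller gzip part files and then concatenate them, since concatenated gzip members still decompress as one stream with gunzip/zcat:

import os

def write_premium_in_parts(dataframe):
    premium = (dataframe.filter(dataframe["listing_priority"] >= 30)
                        .drop('listing_priority')
                        .drop('image_count'))
    # Spread the rows over several tasks so no single block reaches Integer.MAX_VALUE.
    premium.repartition(16).write.csv(
        output_path_premium, header=True, compression="gzip")
    # getmerge concatenates all part files in the output directory into one local file.
    os.system("hdfs dfs -getmerge " + output_path_premium + " /tmp/" + output_file_premium)

The merged file would then still have to be copied back to HDFS/S3 with hdfs dfs -put, since -getmerge writes to the local filesystem.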