Posted to user@spark.apache.org by shiva <sh...@gmail.com> on 2021/02/22 12:50:09 UTC

s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

Hi,
I'm running Spark 3 on Kubernetes and using the S3A staging committer
(directory committer) to write data to an S3 bucket. The same setup works
fine with Spark 2, but with Spark 3 the final data (written in Parquet
format) is not visible in the S3 bucket, and reading that Parquet data fails
because the path is empty, with no data underneath.
As the S3A staging committer requires a shared filesystem (such as NFS or
HDFS) for staging data, I have set up a shared PVC for all executors and the
driver (i.e., spark.hadoop.fs.s3a.committer.staging.tmp.path is set to a
shared PVC mounted readWriteMany).
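
For reference, the committer-related settings in my configuration follow the
pattern sketched below (a minimal sketch only; the paths and values here are
illustrative, the property names are from the Hadoop S3A committer and Spark
cloud-integration docs, and my actual config is in the attached
spark-default.conf; the two spark.sql.* classes come from the
spark-hadoop-cloud module):

# Select the S3A directory (staging) committer -- values illustrative
spark.hadoop.fs.s3a.committer.name                          directory
spark.hadoop.fs.s3a.committer.staging.conflict-mode         replace
# Staging path on the shared readWriteMany PVC (example path)
spark.hadoop.fs.s3a.committer.staging.tmp.path              /mnt/shared-pvc/s3a-staging
# Wire Spark's commit protocol to the Hadoop committer factory
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a   org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.sql.sources.commitProtocolClass                       org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class                    org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter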

In the S3 bucket I can see only the _SUCCESS file, without any data files.

bash-4.2# s3cmd ls  --no-ssl --host=${AWS_ENDPOINT} --host-bucket=
s3://rookbucket/shiva/ --recursive | grep people.parquet
2021-02-22 11:55      4074   s3://rookbucket/shiva/people.parquet/_SUCCESS
bash-4.2#

The _SUCCESS file is in JSON format with the content below:

==============================
{
  "name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
  "timestamp" : 1613994948681,
  "date" : "Mon Feb 22 11:55:48 UTC 2021",
  "hostname" : "spark-thrift-hdfs",
  "committer" : "directory",
  "description" : "Task committer attempt_20210222115547_0000_m_000000_0",
  "metrics" : {
    "stream_write_block_uploads" : 0,
    "files_created" : 5,
    "S3guard_metadatastore_put_path_latencyNumOps" : 0,
    "stream_write_block_uploads_aborted" : 0,
    "committer_commits_reverted" : 0,
    "op_open" : 2,
    "stream_closed" : 12,
    "committer_magic_files_created" : 0,
    "object_copy_requests" : 0,
    "s3guard_metadatastore_initialization" : 0,
    "S3guard_metadatastore_put_path_latency90thPercentileLatency" : 0,
    "stream_write_block_uploads_committed" : 0,
    "S3guard_metadatastore_throttle_rate75thPercentileFrequency (Hz)" : 0,
    "S3guard_metadatastore_throttle_rate90thPercentileFrequency (Hz)" : 0,
    "committer_bytes_committed" : 0,
    "op_create" : 5,
    "stream_read_fully_operations" : 0,
    "committer_commits_completed" : 0,
    "object_put_requests_active" : 0,
    "s3guard_metadatastore_retry" : 0,
    "stream_write_block_uploads_active" : 0,
    "stream_opened" : 12,
    "S3guard_metadatastore_throttle_rate95thPercentileFrequency (Hz)" : 0,
    "op_create_non_recursive" : 0,
    "object_continue_list_requests" : 0,
    "committer_jobs_completed" : 5,
    "S3guard_metadatastore_put_path_latency50thPercentileLatency" : 0,
    "stream_close_operations" : 12,
    "stream_read_operations" : 378,
    "object_delete_requests" : 4,
    "fake_directories_deleted" : 8,
    "stream_aborted" : 0,
    "op_rename" : 0,
    "object_multipart_aborted" : 0,
    "committer_commits_created" : 0,
    "op_get_file_status" : 26,
    "s3guard_metadatastore_put_path_request" : 9,
    "committer_commits_failed" : 0,
    "stream_bytes_read_in_close" : 0,
    "op_glob_status" : 1,
    "stream_read_exceptions" : 0,
    "op_exists" : 5,
    "stream_read_version_mismatches" : 0,
    "S3guard_metadatastore_throttle_rate50thPercentileFrequency (Hz)" : 0,
    "S3guard_metadatastore_put_path_latency95thPercentileLatency" : 0,
    "stream_write_block_uploads_pending" : 4,
    "directories_created" : 0,
    "S3guard_metadatastore_throttle_rateNumEvents" : 0,
    "S3guard_metadatastore_put_path_latency99thPercentileLatency" : 0,
    "stream_bytes_backwards_on_seek" : 0,
    "stream_bytes_read" : 2997558,
    "stream_write_total_data" : 16282,
    "committer_jobs_failed" : 0,
    "stream_read_operations_incomplete" : 29,
    "files_copied_bytes" : 0,
    "op_delete" : 8,
    "object_put_bytes_pending" : 0,
    "stream_write_block_uploads_data_pending" : 0,
    "op_list_located_status" : 0,
    "object_list_requests" : 19,
    "stream_forward_seek_operations" : 0,
    "committer_tasks_completed" : 0,
    "committer_commits_aborted" : 0,
    "object_metadata_requests" : 45,
    "object_put_requests_completed" : 4,
    "stream_seek_operations" : 0,
    "op_list_status" : 0,
    "store_io_throttled" : 0,
    "stream_write_failures" : 0,
    "op_get_file_checksum" : 0,
    "files_copied" : 0,
    "ignored_errors" : 8,
    "committer_bytes_uploaded" : 0,
    "committer_tasks_failed" : 0,
    "stream_bytes_skipped_on_seek" : 0,
    "op_list_files" : 0,
    "files_deleted" : 0,
    "stream_bytes_discarded_in_abort" : 0,
    "op_mkdirs" : 1,
    "op_copy_from_local_file" : 0,
    "op_is_directory" : 1,
    "s3guard_metadatastore_throttled" : 0,
    "S3guard_metadatastore_put_path_latency75thPercentileLatency" : 0,
    "stream_write_total_time" : 0,
    "stream_backward_seek_operations" : 0,
    "object_put_requests" : 4,
    "object_put_bytes" : 16282,
    "directories_deleted" : 0,
    "op_is_file" : 2,
    "S3guard_metadatastore_throttle_rate99thPercentileFrequency (Hz)" : 0
  },
  "diagnostics" : {
    "fs.s3a.metadatastore.impl" :
"org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore",
    "fs.s3a.committer.magic.enabled" : "false",
    "fs.s3a.metadatastore.authoritative" : "false"
  },
  "filenames" : [ ]
}

===============================
With the same S3 bucket, if I run the job with Spark 2 it writes data to
s3://rookbucket/shiva/people.parquet/ and the _SUCCESS file looks similar to
the one above, except that its "filenames" key contains the list of part
files (the Parquet data files); with Spark 3 it is an empty list, as shown
above. Consistent with that, the metrics above show
"committer_tasks_completed" and "committer_commits_completed" as 0. There is
no exception or error during the write, but reads fail to get the schema
because the Parquet path is empty.
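
As a quick way to see the difference from the shell (assuming jq is
installed, and that this build of s3cmd accepts '-' as the download
destination to stream the object to stdout):

bash-4.2# s3cmd get --no-ssl --host=${AWS_ENDPOINT} --host-bucket= \
    s3://rookbucket/shiva/people.parquet/_SUCCESS - | jq '.committer, .filenames'

With Spark 2 the second value is the list of part files; with Spark 3 it
prints "directory" and an empty array [].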

I'm not sure what is causing the issue. I have attached the Spark
configuration used to submit the job ( spark-default.conf
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t11249/spark-default.conf>
).

I'm using Ceph as the underlying storage for the S3 buckets, and if I use
the rados command to inspect the data I can see the Parquet data under a
multipart-upload object name like the one below (but not under the final
output S3 path):

bash-4.2# rados ls  -p rook-ceph-store.rgw.buckets.data | grep
"part-00000-43466165-16d1-4b36-ab90-acb6c3c309a5"
4bd26ab1-6211-4aa5-92d9-9595ad0ee383.454449.1__multipart_shiva/people.parquet/part-00000-43466165-16d1-4b36-ab90-acb6c3c309a5-c000-spark-66e7529285e54226b94d61c2263be83b.snappy.parquet.2~d3to_jPrAO_BLxTu74GXr_g_sz4pvQF.1
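
If I understand the staging committer correctly, each task writes its output
as an uncommitted S3 multipart upload, and the job committer is supposed to
complete those uploads at job commit; an object visible only through rados
therefore looks like an upload that was started but never completed. Assuming
that is the case, the pending uploads should also be listable from the S3
side, for example (the 'hadoop s3guard uploads' subcommand exists in Hadoop
3.1+; exact s3cmd multipart syntax may vary by version):

bash-4.2# hadoop s3guard uploads -list s3a://rookbucket/shiva/
bash-4.2# s3cmd multipart --no-ssl --host=${AWS_ENDPOINT} --host-bucket= s3://rookbucket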

Could someone help me debug this issue, or is there a known issue around this?

Regards,
Shiva








Re: s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

Posted by shiva <sh...@gmail.com>.
Any suggestions or help would be greatly appreciated!





Re: s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Shiva,

This works on Spark 3.0.1 on-prem, but not on Google Dataproc with Spark
3.1.1-RC2.

These are the jar files used for Structured Streaming, all added under
$SPARK_HOME/jars on all nodes:

spark-sql-kafka-0-10_2.12-3.0.1.jar
kafka-clients-2.7.0.jar
spark-token-provider-kafka-0-10_2.12-3.0.1.jar
commons-pool2-2.9.0.jar

Also add these to $SPARK_HOME/conf/spark-defaults.conf on all nodes:

spark.driver.extraClassPath        $SPARK_HOME/jars/*.jar
spark.executor.extraClassPath      $SPARK_HOME/jars/*.jar


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_201)

HTH

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 2 Mar 2021 at 11:47, shiva <sh...@gmail.com> wrote:

> Hi Mich Talebzadeh,
> Could you please share the Spark configuration you used to run the job?
> You mentioned it works on 3.0.1; I will check whether I am using the same
> configuration.
>
> Regards,
> Shiva

Re: s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

Posted by shiva <sh...@gmail.com>.
Hi Mich Talebzadeh,
Could you please share the Spark configuration you used to run the job? You
mentioned it works on 3.0.1; I will check whether I am using the same
configuration.

Regards,
Shiva





Re: s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,


We also have an issue with data not being displayed, on Google Cloud
Dataproc 2, which uses Spark 3.1.1.


It works on Spark 3.0.1 on-prem but not on 3.1.1 on Google Dataproc (offered
as a service), so it may be related to the Spark version.


It is concerning.


HTH






On Mon, 1 Mar 2021 at 18:55, shiva <sh...@gmail.com> wrote:

> Hi Mich Talebzadeh,
> Thanks for your reply; the issue is seen in Spark 3.0.0, and with Spark
> 2.4.5 it works without any problem.
>
> Regards,
> Shiva

Re: s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

Posted by shiva <sh...@gmail.com>.
Hi Mich Talebzadeh,
Thanks for your reply; the issue is seen in Spark 3.0.0, and with Spark
2.4.5 it works without any problem.

Regards,
Shiva





Re: s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

What exact version of Spark is it?

HTH






On Mon, 22 Feb 2021 at 14:41, shiva <sh...@gmail.com> wrote:

> Hi,
> I'm running Spark 3 on Kubernetes and using the S3A staging committer
> (directory committer) to write data to an S3 bucket. [...]