Posted to user@spark.apache.org by Daniel Lopes <da...@onematch.com.br> on 2016/09/09 16:56:06 UTC

Spark + Parquet + IBM Block Storage at Bluemix

Hi, can someone help?

I'm trying to use Parquet on IBM Block Storage with Spark, but when I try to
load the data I get the error below.

I'm using this config:

credentials = {
  "name": "keystone",
  "auth_url": "https://identity.open.softlayer.com",
  "project": "object_storage_23f274c1_d11XXXXXXXXXXXXXXXe634",
  "projectId": "XXXXXXd9c4aa39b7c7eCCCCCCCCb",
  "region": "dallas",
  "userId": "XXXXX64087180b40XXXXX2b909",
  "username": "admin_XXXX9dd810f8901d48778XXXXXX",
  "password": "chXXXXXXXXXXXXX6_",
  "domainId": "c1ddad17cfcXXXXXXXXX41",
  "domainName": "10XXXXXX",
  "role": "admin"
}

def set_hadoop_config(credentials):
    """Set the Hadoop configuration with the given credentials,
    so that data can be accessed through the SparkContext."""

    # All Swift filesystem options share the prefix
    # fs.swift.service.<service-name>; the same service name must
    # appear after the dot in the swift:// URLs used to read data.
    prefix = "fs.swift.service." + credentials['name']
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials['auth_url'] + '/v3/auth/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials['projectId'])
    hconf.set(prefix + ".username", credentials['userId'])
    hconf.set(prefix + ".password", credentials['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials['region'])
    hconf.setBoolean(prefix + ".public", True)

set_hadoop_config(credentials)
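
(For reference, the failing DataFrame would have been loaded with something
along these lines, using the notebook's pre-defined sqlContext; the container
and file names here are hypothetical. The service name after the dot in the
swift:// URL has to match credentials['name'] above.)

train = sqlContext.read.parquet("swift://mycontainer.keystone/train.parquet")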

-------------------------------------------------

Py4JJavaError Traceback (most recent call last)
<ipython-input-55-5a14928215eb> in <module>()
----> 1 train.groupby('Acordo').count().show()

Py4JJavaError: An error occurred while calling o406.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 60 in stage 30.0 failed 10 times, most recent failure: Lost task 60.9 in stage 30.0 (TID 2556, yp-spark-dal09-env5-0039): org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing mandatory configuration option: fs.swift.service.keystone.auth.url
at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:223)
at org.apache.hadoop.fs.swift.http.RestClientBindings.bind(RestClientBindings.java:147)


Daniel Lopes
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br

Re: Spark + Parquet + IBM Block Storage at Bluemix

Posted by Steve Loughran <st...@hortonworks.com>.
On 12 Sep 2016, at 13:04, Daniel Lopes <da...@onematch.com.br> wrote:

Thanks Steve,

But this error occurs only with Parquet files; CSVs work.


Out of my depth then, I'm afraid. Sorry.



Re: Spark + Parquet + IBM Block Storage at Bluemix

Posted by Daniel Lopes <da...@onematch.com.br>.
Thanks Steve,

But this error occurs only with Parquet files; CSVs work.

Best,

Daniel Lopes
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br

On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran <st...@hortonworks.com>
wrote:

> [...]
>
> In my own code, I'd assume that the value of credentials['name'] didn't
> match that of the URL, assuming you have something like
> swift://bucket.keystone. Failing that: the options were set too late.
>
> Instead of asking for the Hadoop config and editing it, set the options
> in your Spark context, before it is launched, with the prefix "hadoop".

Re: Spark + Parquet + IBM Block Storage at Bluemix

Posted by Steve Loughran <st...@hortonworks.com>.
On 9 Sep 2016, at 17:56, Daniel Lopes <da...@onematch.com.br> wrote:

[...]

Py4JJavaError: An error occurred while calling o406.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 60 in stage 30.0 failed 10 times, most recent failure: Lost task 60.9 in stage 30.0 (TID 2556, yp-spark-dal09-env5-0039): org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing mandatory configuration option: fs.swift.service.keystone.auth.url


In my own code, I'd assume that the value of credentials['name'] didn't match that of the URL, assuming you have something like swift://bucket.keystone. Failing that: the options were set too late.

Instead of asking for the Hadoop config and editing it, set the options in your Spark context, before it is launched, with the prefix "hadoop".
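
A minimal sketch of that suggestion, assuming PySpark and the credentials dict
from the original message (hypothetical; in a Bluemix notebook the pre-created
sc would first have to be stopped):

from pyspark import SparkConf, SparkContext

# Options carrying the "spark.hadoop." prefix are copied into the Hadoop
# configuration when the context starts, so the Swift filesystem sees them
# on the driver and on every executor.
prefix = "spark.hadoop.fs.swift.service." + credentials['name']

conf = SparkConf()
conf.set(prefix + ".auth.url", credentials['auth_url'] + '/v3/auth/tokens')
conf.set(prefix + ".auth.endpoint.prefix", "endpoints")
conf.set(prefix + ".tenant", credentials['projectId'])
conf.set(prefix + ".username", credentials['userId'])
conf.set(prefix + ".password", credentials['password'])
conf.set(prefix + ".http.port", "8080")
conf.set(prefix + ".region", credentials['region'])
conf.set(prefix + ".public", "true")

sc = SparkContext(conf=conf)  # options are fixed before launch, not edited after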

