Posted to dev@spark.apache.org by Ignacio Zendejas <iz...@node.io> on 2015/06/03 00:13:40 UTC

createDataframe from s3 results in error

I've run into an error when trying to create a dataframe. Here's the code:

---
from pyspark import StorageLevel
from pyspark.sql import HiveContext, Row

table = 'blah'
# `sc` is the SparkContext provided by the pyspark shell.
sqlContext = HiveContext(sc)

data = sc.textFile('s3://bucket/some.tsv')

def deserialize(s):
  # Split one TSV line into fields; the last column (score) is numeric.
  p = s.strip().split('\t')
  p[-1] = float(p[-1])
  return Row(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
             created_at=p[3], layer_id=p[4], score=p[5])

blah = data.map(deserialize)
df = sqlContext.inferSchema(blah)

---

I've also tried s3n and using createDataFrame instead of inferSchema. Our
setup is on EMR instances, using the setup script Amazon provides. After
lots of debugging, I suspect the problem lies with this setup.
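
For reference, a minimal sketch of the createDataFrame variant mentioned
above. It assumes Spark 1.3 or later, where createDataFrame accepts an RDD
of Row objects. FIELDS, parse_line and build_dataframe are hypothetical
names used only for illustration, not part of the original script. The
parsing step is kept free of Spark imports so it can be checked on its own.

```python
# Column names in the order they appear in the TSV file.
FIELDS = ("normalized_page_sha1", "name", "phrase",
          "created_at", "layer_id", "score")


def parse_line(line):
    """Split one TSV line into a dict keyed by FIELDS; score becomes a float."""
    parts = line.strip().split("\t")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    parts[-1] = float(parts[-1])
    return dict(zip(FIELDS, parts))


def build_dataframe(sc, sql_context, path):
    """Read a TSV file and build a DataFrame via createDataFrame.

    Assumption: sc is a SparkContext and sql_context is a SQLContext or
    HiveContext, as provided by the pyspark shell.
    """
    # Imported here so that parse_line can be exercised without Spark installed.
    from pyspark.sql import Row

    rows = sc.textFile(path).map(lambda s: Row(**parse_line(s)))
    return sql_context.createDataFrame(rows)
```

From the pyspark shell this would be called as
build_dataframe(sc, sqlContext, 's3://bucket/some.tsv').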

What's weird is that if I run this in the pyspark shell and then re-run the
last line (inferSchema/createDataFrame), it actually works.

We're getting warnings like this:
http://pastebin.ca/3016476

Here's the actual error:
http://www.pastebin.ca/3016473

Any help would be greatly appreciated.

Thanks,
Ignacio

Re: createDataframe from s3 results in error

Posted by Reynold Xin <rx...@databricks.com>.
What version of Spark is this?
