Posted to user@spark.apache.org by Chengi Liu <ch...@gmail.com> on 2014/02/26 18:28:06 UTC

Dealing with headers in csv file pyspark

Hi,
  How do we deal with headers in a CSV file?
For example:
id, counts
1,2
1,5
2,20
2,25
... and so on


And I want to sum the counts for each id, so the result will be:

1,7
2,45

and so on..
My code:
counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)

But I see this error:
ValueError: invalid literal for int() with base 10: 'counts'

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)


I guess this is because of the header...

Q1) How do I exclude the header from this?
Q2) Rather than using the pyspark shell, how do I run Python programs on Spark?

Thanks

Re: Dealing with headers in csv file pyspark

Posted by Bryn Keller <xo...@xoltar.org>.
In the past I've handled this by filtering out the header line, but it
seems to me that it would be useful to have a way of dealing with files
that preserves sequence, so that e.g. you could just do
mySequentialRDD.drop(1) to get rid of the header. There are other use
cases like this that currently have to be solved outside of Spark, or
by writing a custom InputFormat to do the reading, that could perhaps
be simplified along these lines.
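
For concreteness, a minimal sketch of the filtering approach, assuming
the data was loaded with sc.textFile from a hypothetical "data.csv" and
the header line starts with the column name "id":

    # Sketch only: drop the header by filtering, then split and sum.
    lines = sc.textFile("data.csv")
    rows = lines.filter(lambda line: not line.startswith("id"))
    counts = (rows.map(lambda line: line.split(","))
                  .map(lambda x: (x[0].strip(), int(x[1].strip())))
                  .reduceByKey(lambda a, b: a + b))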

Thanks,
Bryn



Re: Dealing with headers in csv file pyspark

Posted by Ewen Cheslack-Postava <me...@ewencp.org>.
You must be parsing each line of the file at some point anyway, so 
adding a step to filter out the header should work fine. It'll get 
executed at the same time as your parsing/conversion to ints, so there's 
no significant overhead aside from the check itself.

For standalone programs, there's a section in the PySpark programming 
guide, along with a link to a complete example: 
http://spark.incubator.apache.org/docs/latest/python-programming-guide.html#standalone-programs
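
As a rough illustration (hypothetical file name and a local master; the
guide above is the authoritative reference), a standalone script might
look like:

    # standalone_counts.py: a minimal sketch, not a drop-in solution.
    from pyspark import SparkContext

    sc = SparkContext("local", "CSV header example")
    lines = sc.textFile("data.csv")

    # Filter out the header, then parse and sum the counts per id.
    counts = (lines.filter(lambda line: not line.startswith("id"))
                   .map(lambda line: line.split(","))
                   .map(lambda x: (x[0].strip(), int(x[1].strip())))
                   .reduceByKey(lambda a, b: a + b))
    print counts.collect()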


Re: Dealing with headers in csv file pyspark

Posted by Chengi Liu <ch...@gmail.com>.
I am not sure.. is the suggestion to open a terabyte file and remove a
line? That doesn't sound great.
I am hacking my way around it by using a filter..
Can I put a try/except clause in my lambda function? Maybe I should
just try that out.
But thanks for the suggestion.

Also, can I run scripts against Spark rather than using the pyspark
shell?
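
(Thinking out loud: a Python lambda can only hold an expression, not a
try/except statement, so it would have to be a named function. A rough
sketch of what I mean:)

    # Rough sketch: skip any row whose count doesn't parse as an int
    # (e.g. the header). A lambda can't contain try/except, hence the def.
    def parse(x):
        try:
            return [(x[0], int(x[1]))]
        except ValueError:
            return []

    counts = data.flatMap(parse).reduceByKey(lambda a, b: a + b)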





Re: Dealing with headers in csv file pyspark

Posted by Mayur Rustagi <ma...@gmail.com>.
A bad solution is to run a mapper over the data and null out the counts;
a good solution is to trim the header beforehand, without Spark.
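
For example, a one-time preprocessing step along these lines
(hypothetical file names):

    # Copy everything except the first (header) line, outside Spark.
    with open("data.csv") as src, open("data_noheader.csv", "w") as dst:
        next(src)  # skip the header
        for line in src:
            dst.write(line)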