Posted to user@spark.apache.org by Denny Lee <de...@gmail.com> on 2014/12/14 17:15:42 UTC

Limit the # of columns in Spark Scala

I have a large number of files within HDFS that I would like to do a group by
statement on, ala

val table = sc.textFile("hdfs://....")
val tabs = table.map(_.split("\t"))

I'm trying to do something similar to
tabs.map(c => (c._(167), c._(110), c._(200))

where I create a new RDD that only has those three columns,
but that isn't quite right because I'm not really manipulating sequences.

BTW, I cannot use SparkSQL / case classes right now because my table has 200
columns (and I'm on Scala 2.10.3, where case classes are capped at 22 fields)

Thanks!
Denny

Re: Limit the # of columns in Spark Scala

Posted by Denny Lee <de...@gmail.com>.
Oh, just figured it out:

tabs.map(c => Array(c(167), c(110), c(200)))

Thanks for all of the advice, eh?!
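
For reference, a minimal sketch of the full flow I was after, assuming
tab-delimited lines (the path and column indices are just the placeholders
from my original mail):

val table = sc.textFile("hdfs://....")
val tabs = table.map(_.split("\t"))

// keep only the three columns of interest as a tuple key
val projected = tabs.map(c => (c(167), c(110), c(200)))

// group by that key, e.g. to count occurrences
// (in a compiled app, import org.apache.spark.SparkContext._ for reduceByKey)
val counts = projected.map(k => (k, 1L)).reduceByKey(_ + _)
counts.take(10).foreach(println)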





On Sun Dec 14 2014 at 1:14:00 PM Yana Kadiyska <ya...@gmail.com>
wrote:

> Denny, I am not sure what exception you're observing but I've had luck
> with 2 things:
>
> val table = sc.textFile("hdfs://....")
>
> You can try calling table.first here and you'll see the first line of the
> file.
> You can also do val debug = table.first.split("\t") which would give you
> an array and you can indeed verify that the array contains what you want in
> positions 167, 110 and 200. In the case of large files with a random bad
> line, I find wrapping the call within the map in try/catch very valuable --
> you can dump out the whole line in the catch statement.
>
> Lastly, I would guess that you're getting a compile error and not a runtime
> error -- I believe c is an array of values, so I think you want
> tabs.map(c => (c(167), c(110), c(200))) instead of tabs.map(c => (c._(167),
> c._(110), c._(200)).
>
>
>
> On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee <de...@gmail.com> wrote:
>>
>> Yes - that works great! Sorry for implying I couldn't. Was just more
>> flummoxed that I couldn't make the Scala call work on its own. Will
>> continue to debug ;-)
>>
>> On Sun, Dec 14, 2014 at 11:39 Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>>> BTW, I cannot use SparkSQL / case right now because my table has 200
>>>> columns (and I'm on Scala 2.10.3)
>>>>
>>>
>>> You can still apply the schema programmatically:
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
>>>
>>

Re: Limit the # of columns in Spark Scala

Posted by Yana Kadiyska <ya...@gmail.com>.
Denny, I am not sure what exception you're observing but I've had luck with
2 things:

val table = sc.textFile("hdfs://....")

You can try calling table.first here and you'll see the first line of the
file.
You can also do val debug = table.first.split("\t") which would give you an
array and you can indeed verify that the array contains what you want in
positions 167, 110 and 200. In the case of large files with a random bad
line, I find wrapping the call within the map in try/catch very valuable --
you can dump out the whole line in the catch statement.

Lastly, I would guess that you're getting a compile error and not a runtime
error -- I believe c is an array of values, so I think you want
tabs.map(c => (c(167), c(110), c(200))) instead of tabs.map(c => (c._(167),
c._(110), c._(200)).
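
Something like this is what I have in mind -- a rough sketch only, assuming
tab-separated input (the path and indices are just the ones from your mail):

val table = sc.textFile("hdfs://....")

// sanity checks on the first line
table.first
val debug = table.first.split("\t")
debug(167); debug(110); debug(200)

// guard against the occasional malformed line and dump it out
val tabs = table.flatMap { line =>
  try {
    val c = line.split("\t")
    Some((c(167), c(110), c(200)))
  } catch {
    case e: Exception =>
      println("Bad line: " + line)   // shows up in the executor log
      None
  }
}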



On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee <de...@gmail.com> wrote:
>
> Yes - that works great! Sorry for implying I couldn't. Was just more
> flummoxed that I couldn't make the Scala call work on its own. Will
> continue to debug ;-)
>
> On Sun, Dec 14, 2014 at 11:39 Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> BTW, I cannot use SparkSQL / case right now because my table has 200
>>> columns (and I'm on Scala 2.10.3)
>>>
>>
>> You can still apply the schema programmatically:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
>>
>

Re: Limit the # of columns in Spark Scala

Posted by Denny Lee <de...@gmail.com>.
Yes - that works great! Sorry for implying I couldn't. Was just more
flummoxed that I couldn't make the Scala call work on its own. Will
continue to debug ;-)
On Sun, Dec 14, 2014 at 11:39 Michael Armbrust <mi...@databricks.com>
wrote:

> BTW, I cannot use SparkSQL / case right now because my table has 200
>> columns (and I'm on Scala 2.10.3)
>>
>
> You can still apply the schema programmatically:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
>

Re: Limit the # of columns in Spark Scala

Posted by Michael Armbrust <mi...@databricks.com>.
>
> BTW, I cannot use SparkSQL / case right now because my table has 200
> columns (and I'm on Scala 2.10.3)
>

You can still apply the schema programmatically:
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
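
A rough, untested sketch of what that looks like for a 200-column tab file
(field names here are invented, and this assumes the 1.x SQLContext /
applySchema API described on that page):

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
val table = sc.textFile("hdfs://....")

// build a 200-field, all-string schema programmatically instead of a case class
val schema = StructType(
  (0 until 200).map(i => StructField("col" + i, StringType, nullable = true)))

// convert each split line into a Row and apply the schema
val rowRDD = table.map(_.split("\t")).map(c => Row(c: _*))
val schemaRDD = sqlContext.applySchema(rowRDD, schema)

schemaRDD.registerTempTable("mytable")
sqlContext.sql("SELECT col167, col110, col200 FROM mytable").take(10).foreach(println)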

Re: Limit the # of columns in Spark Scala

Posted by Denny Lee <de...@gmail.com>.
Getting a bunch of syntax errors. Let me get back with the full statement
and error later today. Thanks for verifying my thinking wasn't out in left
field.
On Sun, Dec 14, 2014 at 08:56 Gerard Maas <ge...@gmail.com> wrote:

> Hi,
>
> I don't get what the problem is. That map to selected columns looks like
> the way to go given the context. What's not working?
>
> Kr, Gerard
> On Dec 14, 2014 5:17 PM, "Denny Lee" <de...@gmail.com> wrote:
>
>> I have a large number of files within HDFS that I would like to do a group by
>> statement on, ala
>>
>> val table = sc.textFile("hdfs://....")
>> val tabs = table.map(_.split("\t"))
>>
>> I'm trying to do something similar to
>> tabs.map(c => (c._(167), c._(110), c._(200))
>>
>> where I create a new RDD that only has those three columns,
>> but that isn't quite right because I'm not really manipulating sequences.
>>
>> BTW, I cannot use SparkSQL / case right now because my table has 200
>> columns (and I'm on Scala 2.10.3)
>>
>> Thanks!
>> Denny
>>
>>

Re: Limit the # of columns in Spark Scala

Posted by Gerard Maas <ge...@gmail.com>.
Hi,

I don't get what the problem is. That map to selected columns looks like
the way to go given the context. What's not working?

Kr, Gerard
On Dec 14, 2014 5:17 PM, "Denny Lee" <de...@gmail.com> wrote:

> I have a large number of files within HDFS that I would like to do a group by
> statement on, ala
>
> val table = sc.textFile("hdfs://....")
> val tabs = table.map(_.split("\t"))
>
> I'm trying to do something similar to
> tabs.map(c => (c._(167), c._(110), c._(200))
>
> where I create a new RDD that only has those three columns,
> but that isn't quite right because I'm not really manipulating sequences.
>
> BTW, I cannot use SparkSQL / case right now because my table has 200
> columns (and I'm on Scala 2.10.3)
>
> Thanks!
> Denny
>
>