Posted to user@spark.apache.org by Ashok Kumar <as...@yahoo.com.INVALID> on 2016/09/05 12:48:50 UTC
Splitting columns from a text file
Hi,
I have a text file as below that I read in
74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000
val textFile = sc.textFile("/tmp/mytextfile.txt")
Now I want to split the rows separated by ","
scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")
However, the above throws an error.
Any ideas what is wrong, or how I can do this, ideally without converting it to String?
Thanks
Re: Splitting columns from a text file
Posted by Somasundaram Sekar <so...@tigeranalytics.com>.
Please have a look at the documentation for information on how to work with
RDDs. Start with the quick start guide: http://spark.apache.org/docs/latest/quick-start.html
On 5 Sep 2016 7:00 pm, "Ashok Kumar" <as...@yahoo.com> wrote:
> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at <console>:27
>
> How can I work on individual columns. I understand they are strings
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
> | )
> <console>:27: error: value getString is not a member of Array[String]
> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
Re: Splitting columns from a text file
Posted by Somasundaram Sekar <so...@tigeranalytics.com>.
sc.textFile("filename").map(_.split(",")).filter(arr => arr.length == 3 && arr(2).toDouble > 50).collect
This will give you an Array[Array[String]]; do with it as you wish. And
please read up on RDDs.
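A minimal sketch of doing the whole job on the RDD, along the lines of the one-liner above, with no DataFrame involved. It assumes Spark 2.x with a local SparkSession and uses the six sample rows from the thread in place of the file. The object name FilterThirdColumn and the helper parse are made up for illustration. Rows that do not have exactly three fields, or whose fields do not parse, are dropped instead of failing the job.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object FilterThirdColumn {
  // Hypothetical helper: turn one line into typed fields, or None if the
  // line does not have exactly three fields or a field fails to parse.
  def parse(line: String): Option[(Int, String, Double)] = {
    val arr = line.split(",")
    if (arr.length != 3) None
    else Try((arr(0).trim.toInt, arr(1).trim, arr(2).trim.toDouble)).toOption
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("filter-col3").master("local[*]").getOrCreate()

    // Stand-in for sc.textFile("/tmp/mytextfile.txt")
    val lines = spark.sparkContext.parallelize(Seq(
      "74,20160905-133143,98.11218069128827594148",
      "75,20160905-133143,49.52776998815916807742",
      "76,20160905-133143,56.08029957123980984556",
      "77,20160905-133143,46.63689526544407522777",
      "78,20160905-133143,84.88227141164402181551",
      "79,20160905-133143,68.72408602520662115000"))

    // Keep only well-formed rows, then filter on the third column
    val over50 = lines
      .flatMap(line => parse(line).toSeq)
      .filter { case (_, _, value) => value > 50.0 }

    over50.collect().foreach(println)
    spark.stop()
  }
}
```

With the sample rows, this keeps rows 74, 76, 78 and 79. Unlike the one-liner, the result holds typed tuples rather than Array[String], so the third column can be used as a Double without converting again.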
On 5 Sep 2016 8:51 pm, "Ashok Kumar" <as...@yahoo.com> wrote:
> Thanks everyone.
>
> I am not skilled like you gentlemen
>
> This is what I did
>
> 1) Read the text file
>
> val textFile = sc.textFile("/tmp/myfile.txt")
>
> 2) That produces an RDD of String.
>
> 3) Create a DF after splitting the file into an Array
>
> val df = textFile.map(line => line.split(",")).map(x => (x(0).toInt, x(1).toString, x(2).toDouble)).toDF
>
> 4) Create a class for column headers
>
> case class Columns(col1: Int, col2: String, col3: Double)
>
> 5) Assign the column headers
>
> val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString,
> p(2).toString.toDouble))
>
> 6) Only interested in column 3 > 50
>
> h.filter(col("Col3") > 50.0)
>
> 7) Now I just want Col3 only
>
> h.filter(col("Col3") > 50.0).select("col3").show(5)
> +-----------------+
> | col3|
> +-----------------+
> |95.42536350467836|
> |61.56297588648554|
> |76.73982017179868|
> |68.86218120274728|
> |67.64613810115105|
> +-----------------+
> only showing top 5 rows
>
> Does that make sense. Are there shorter ways gurus? Can I just do all this
> on RDD without DF?
>
> Thanking you
Re: Splitting columns from a text file
Posted by Ashok Kumar <as...@yahoo.com.INVALID>.
Thanks everyone.
I am not skilled like you gentlemen
This is what I did
1) Read the text file
val textFile = sc.textFile("/tmp/myfile.txt")
2) That produces an RDD of String.
3) Create a DF after splitting the file into an Array
val df = textFile.map(line => line.split(",")).map(x=>(x(0).toInt,x(1).toString,x(2).toDouble)).toDF
4) Create a class for column headers
case class Columns(col1: Int, col2: String, col3: Double)
5) Assign the column headers
val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString, p(2).toString.toDouble))
6) Only interested in column 3 > 50
h.filter(col("Col3") > 50.0)
7) Now I just want Col3 only
h.filter(col("Col3") > 50.0).select("col3").show(5)
+-----------------+
|             col3|
+-----------------+
|95.42536350467836|
|61.56297588648554|
|76.73982017179868|
|68.86218120274728|
|67.64613810115105|
+-----------------+
only showing top 5 rows
Does that make sense? Are there shorter ways, gurus? Can I do all of this on an RDD, without a DataFrame?
Thanking you
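On the "shorter ways" question, here is one possible sketch, assuming Spark 2.x and a local SparkSession. Mapping each line straight into the case class gives the DataFrame its column names in one pass, so steps 3 to 5 above collapse into a single map. The Columns class is the one from step 4. The object name ShorterDataFrame and the helper toColumns are made up for illustration, and two sample rows stand in for the file.

```scala
import org.apache.spark.sql.SparkSession

// Same shape as the Columns class in step 4; the field names become column names
case class Columns(col1: Int, col2: String, col3: Double)

object ShorterDataFrame {
  // Hypothetical helper: split one line and convert each field to its type
  def toColumns(line: String): Columns = {
    val a = line.split(",")
    Columns(a(0).toInt, a(1), a(2).toDouble)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shorter-df").master("local[*]").getOrCreate()
    import spark.implicits._ // brings in toDF() and the $"..." column syntax

    // Stand-in for sc.textFile("/tmp/myfile.txt")
    val lines = spark.sparkContext.parallelize(Seq(
      "74,20160905-133143,98.11218069128827594148",
      "75,20160905-133143,49.52776998815916807742"))

    val df = lines.map(line => toColumns(line)).toDF()
    df.filter($"col3" > 50.0).select("col3").show(5)
    spark.stop()
  }
}
```

This version still fails on malformed rows, because toInt and toDouble throw on bad input; that matches the behaviour of the steps above.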
On Monday, 5 September 2016, 15:19, ayan guha <gu...@gmail.com> wrote:
Then, You need to refer third term in the array, convert it to your desired data type and then use filter.
Re: Splitting columns from a text file
Posted by ayan guha <gu...@gmail.com>.
Then you need to refer to the third element of the array, convert it to your
desired data type, and then use filter.
On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar <as...@yahoo.com> wrote:
> Hi,
> I want to filter them for values.
>
> This is what is in array
>
> 74,20160905-133143,98.11218069128827594148
>
> I want to filter anything > 50.0 in the third column
>
> Thanks
--
Best Regards,
Ayan Guha
Re: Splitting columns from a text file
Posted by Fridtjof Sander <fr...@googlemail.com>.
Ask yourself how to access the third element in an array in Scala.
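For readers who want the hint spelled out, here is a small plain-Scala sketch with no Spark involved. Array elements are read with parentheses and a zero-based index, so the third element is at index 2. The object name ThirdElement and the helper thirdAsDouble are made up for illustration.

```scala
object ThirdElement {
  // Hypothetical helper: split a line and return its third field as a Double
  def thirdAsDouble(line: String): Double = {
    val fields = line.split(",") // Array[String]
    fields(2).toDouble           // indices start at 0, so (2) is the third element
  }

  def main(args: Array[String]): Unit = {
    val value = thirdAsDouble("74,20160905-133143,98.11218069128827594148")
    println(value > 50.0) // prints true
  }
}
```

The getString(0) call tried elsewhere in the thread belongs to Spark's Row type, not to Array[String], which is why the compiler rejected it.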
On 05.09.2016 at 16:14, Ashok Kumar wrote:
> Hi,
> I want to filter them for values.
>
> This is what is in array
>
> 74,20160905-133143,98.11218069128827594148
>
> I want to filter anything > 50.0 in the third column
>
> Thanks
Re: Splitting columns from a text file
Posted by Ashok Kumar <as...@yahoo.com.INVALID>.
Hi,
I want to filter them for values.
This is what is in the array
74,20160905-133143,98.11218069128827594148
I want to filter anything > 50.0 in the third column
Thanks
On Monday, 5 September 2016, 15:07, ayan guha <gu...@gmail.com> wrote:
Hi
x.split returns an array. So, after first map, you will get RDD of arrays. What is your expected outcome of 2nd map?
Re: Splitting columns from a text file
Posted by ayan guha <gu...@gmail.com>.
Hi
x.split returns an array. So, after the first map, you will get an RDD of arrays.
What outcome do you expect from the second map?
On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar <as...@yahoo.com.invalid>
wrote:
> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at <console>:27
>
> How can I work on individual columns. I understand they are strings
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
> | )
> <console>:27: error: value getString is not a member of Array[String]
> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
--
Best Regards,
Ayan Guha
Re: Splitting columns from a text file
Posted by Ashok Kumar <as...@yahoo.com.INVALID>.
Thank you sir.
This is what I get
scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at <console>:27
How can I work on individual columns. I understand they are strings
scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))
regards
On Monday, 5 September 2016, 13:51, Somasundaram Sekar <so...@tigeranalytics.com> wrote:
Basic error, you get back an RDD on transformations like map.
sc.textFile("filename").map(x => x.split(","))
Re: Splitting columns from a text file
Posted by Somasundaram Sekar <so...@tigeranalytics.com>.
Basic error: you get back an RDD from transformations like map, so split has to be applied inside the map.
sc.textFile("filename").map(x => x.split(","))
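A minimal sketch of the point being made here, assuming Spark 2.x with a local SparkSession. split is a method on String, not on RDD, so it has to run on each element inside map. The object name SplitLines and the helper splitLines are made up for illustration, and two rows from the original post stand in for the file.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SplitLines {
  // split is a String method, so it is called on each line inside map
  def splitLines(lines: RDD[String]): RDD[Array[String]] =
    lines.map(line => line.split(","))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-lines").master("local[*]").getOrCreate()

    // Stand-in for sc.textFile("/tmp/mytextfile.txt")
    val textFile = spark.sparkContext.parallelize(Seq(
      "74,20160905-133143,98.11218069128827594148",
      "75,20160905-133143,49.52776998815916807742"))

    // Each element is now an Array[String] holding the three fields of one row
    splitLines(textFile).collect().foreach(fields => println(fields.mkString(" | ")))
    spark.stop()
  }
}
```

The extra map(x => x.toString) in the original attempt is not needed, since sc.textFile already yields an RDD of String.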
On 5 Sep 2016 6:19 pm, "Ashok Kumar" <as...@yahoo.com.invalid> wrote:
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
> textFile.map(x=>x.toString).split(",")
>
> However, the above throws error?
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>
Re: Splitting columns from a text file
Posted by Gourav Sengupta <go...@gmail.com>.
Just use Spark CSV. All other ways of splitting and working with the file are
just reinventing the wheel and a monumental waste of time.
Regards,
Gourav
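A rough sketch of what the CSV route can look like, assuming Spark 2.x, where a CSV reader ships with Spark itself; on Spark 1.x the separate spark-csv package plays this role. The file in the thread has no header row, so the column names id, ts and value are invented here, as are the object name ReadCsv and the helper readRows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object ReadCsv {
  // Column names are made up for illustration; the file has no header
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = true),
    StructField("ts", StringType, nullable = true),
    StructField("value", DoubleType, nullable = true)))

  // Hypothetical helper: read the file with an explicit schema so each
  // column arrives with its type and no manual split is needed
  def readRows(spark: SparkSession, path: String): DataFrame =
    spark.read.schema(schema).csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-csv").master("local[*]").getOrCreate()
    import spark.implicits._ // for the $"..." column syntax

    val path = if (args.nonEmpty) args(0) else "/tmp/mytextfile.txt"
    val df = readRows(spark, path)
    df.filter($"value" > 50.0).select("value").show(5)
    spark.stop()
  }
}
```

Supplying the schema up front avoids an extra pass over the data for type inference and keeps the third column a Double from the start.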
On Mon, Sep 5, 2016 at 1:48 PM, Ashok Kumar <as...@yahoo.com.invalid>
wrote:
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
> textFile.map(x=>x.toString).split(",")
>
> However, the above throws error?
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>