Posted to user@spark.apache.org by syepes <sy...@gmail.com> on 2016/01/30 01:43:54 UTC

Reading lzo+index with spark-csv (Splittable reads)

Hello,

I have managed to speed up the read stage when loading CSV files by using
the classic "newAPIHadoopFile" method. The issue is that I would like to use
the spark-csv package instead, and it seems that it does not take the LZO
index file into account, so the reads are not splittable.

# Using the classic method the read is fully parallelized (splittable):
sc.newAPIHadoopFile("/user/sy/data.csv.lzo", .... ).count

# When spark-csv is used the file is read from only one node (no splittable
# reads):
sqlContext.read.format("com.databricks.spark.csv").options(Map("path" ->
"/user/sy/data.csv.lzo", "header" -> "true", "inferSchema" ->
"false")).load().count()

Does anyone know if this is currently supported?


Re: Reading lzo+index with spark-csv (Splittable reads)

Posted by Hyukjin Kwon <gu...@gmail.com>.
Hm, as I said here:
https://github.com/databricks/spark-csv/issues/245#issuecomment-177682354

the request sounds reasonable in a way, though to me it might only address
some fairly narrow use cases.

How about using csvRdd() instead?
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L143-L162

I think you can do this like below. Note that csvRdd() expects an
RDD[String], so the (key, value) pairs from newAPIHadoopFile need to be
mapped down to the line text first:

import com.databricks.spark.csv.CsvParser

// Splittable read through the LZO-aware input format, keeping only the
// line contents.
val rdd = sc.newAPIHadoopFile("/file.csv.lzo",
                    classOf[com.hadoop.mapreduce.LzoTextInputFormat],
                    classOf[org.apache.hadoop.io.LongWritable],
                    classOf[org.apache.hadoop.io.Text])
  .map { case (_, line) => line.toString }

val df = new CsvParser()
  .csvRdd(sqlContext, rdd)
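
If you also want the header handling from your options map, CsvParser has
builder-style setters; a sketch, assuming the setter names below match your
spark-csv version:

val dfWithHeader = new CsvParser()
  .withUseHeader(true)     // treat the first row as column names
  .withInferSchema(false)  // keep all columns as strings
  .csvRdd(sqlContext, rdd)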



2016-01-30 10:04 GMT+09:00 syepes <sy...@gmail.com>:

> Well, looking at the src it looks like it's not implemented:
>
> https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala#L34-L36
>

Re: Reading lzo+index with spark-csv (Splittable reads)

Posted by syepes <sy...@gmail.com>.
Well, looking at the src it looks like it's not implemented:

https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala#L34-L36
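
Those lines just hand the path to SparkContext.textFile, which goes through
the plain TextInputFormat, so the LZO .index file is never consulted.
Roughly (a paraphrase of the effect, not the exact source):

// What spark-csv effectively does: a plain text read that treats the
// .lzo file as a single non-splittable blob.
val lines = sc.textFile("/user/sy/data.csv.lzo")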




