Posted to user@spark.apache.org by Selvam Raman <se...@gmail.com> on 2016/09/10 09:14:13 UTC

Spark CSV skip lines

Hi,

I am using spark-csv to read a CSV file. The issue is that the first n
lines of my files contain some report text, followed by the actual data
(the header and the rest of the data).

How can I skip the first n lines in spark-csv? I don't have any specific
comment character in the first byte.

Please give me some ideas.

-- 
Selvam Raman
"Avoid bribery; hold your head high."

Re: Spark CSV skip lines

Posted by Hyukjin Kwon <gu...@gmail.com>.
As you are reading each file as a single record via wholeTextFiles and
flattening it into rows, I think you can just drop the few lines you want.

Can you simply drop or skip those lines in reader.readAll().map(...)?

Also, are you sure this is an issue in Spark rather than in the external
CSV library?

If you think so, do you mind sharing the stack trace?
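For example, a sketch of that idea applied to the snippet from Selvam's
mail (I haven't run this; the drop count, path, and length guard are
assumptions for illustration, and the opencsv import depends on your
version):

```scala
import java.io.StringReader
import scala.collection.JavaConverters._
import com.opencsv.CSVReader          // older opencsv versions: au.com.bytecode.opencsv.CSVReader
import org.apache.spark.sql.Row

val test = sc.wholeTextFiles("/path/to/files").flatMap { case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt))
  reader.readAll().asScala
    .drop(2)                          // skip the two report lines at the top of each file
    .filter(_.length > 14)            // guard against empty lines and short report rows
    .map(data => Row(data(3), data(4), data(7), data(9), data(14)))
}
```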

On 11 Sep 2016 1:50 a.m., "Selvam Raman" <se...@gmail.com> wrote:

> [quoted message omitted; it appears in full below in this thread]

Re: Spark CSV skip lines

Posted by Selvam Raman <se...@gmail.com>.
Hi,

I had already seen these two options; thanks for the ideas anyway.

I am using wholeTextFiles to read my data (because there are \n characters
in the middle of records) and opencsv to parse it. In my data, the first
two lines are just a report. How can I eliminate them?

*How do I eliminate the first two lines after reading with wholeTextFiles?*

val test = wholeTextFiles.flatMap { case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt))
  reader.readAll().map(data =>
    Row(data(3), data(4), data(7), data(9), data(14)))
}

The code above throws an ArrayIndexOutOfBoundsException for the empty line
and the report lines.
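The exception is easy to reproduce outside Spark: the report line and an
empty line parse to short arrays, so indexing data(14) runs past their
length. A minimal pure-Scala sketch of the failure mode and a length
guard (illustrative data only):

```scala
// Rows roughly as a CSV parser would return them.
val parsed: List[Array[String]] = List(
  Array("some report text"),         // report row: length 1
  Array(""),                         // empty line: length 1
  Array.tabulate(15)(i => s"c$i")    // a real data row with 15 columns
)

// Keeping only rows long enough to index avoids the exception.
val safe = parsed.filter(_.length > 14)
                 .map(data => (data(3), data(14)))
// safe == List(("c3", "c14"))
```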


On Sat, Sep 10, 2016 at 3:02 PM, Hyukjin Kwon <gu...@gmail.com> wrote:

> [quoted message omitted; it appears in full below in this thread]


-- 
Selvam Raman
"Avoid bribery; hold your head high."

Re: Spark CSV skip lines

Posted by Hyukjin Kwon <gu...@gmail.com>.
Hi Selvam,

If your report lines are commented with some character (e.g. #), you can
skip them via the comment option [1].
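For instance, a sketch of the comment option in the Spark 2.0
DataFrameReader API (the path and the comment character are assumptions;
this only helps when every report line starts with that character, which
is not the case in this thread):

```scala
// Lines beginning with '#' are skipped by the built-in CSV reader.
val df = spark.read
  .option("header", "true")
  .option("comment", "#")
  .csv("/path/to/file.csv")
```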

If you are using Spark 1.x, then you might be able to do this by manually
skipping lines in the RDD and then turning it into a DataFrame, as below.

I haven't tested this, but I think it should work:

val rdd = sparkContext.textFile("...")
val filteredRdd = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) {
    // The leading report lines live in the first partition, so drop them there.
    iter.drop(10)
  } else {
    iter
  }
}
val df = new CsvParser().csvRdd(sqlContext, filteredRdd)

If you are using Spark 2.0, then it seems there is no way to manually
modify the source data, because loading an existing RDD or Dataset[String]
into a DataFrame is not yet supported.

There is an open issue for this [2]. I hope this is helpful.

Thanks.

[1]
https://github.com/apache/spark/blob/27209252f09ff73c58e60c6df8aaba73b308088c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L369
[2] https://issues.apache.org/jira/browse/SPARK-15463




On 10 Sep 2016 6:14 p.m., "Selvam Raman" <se...@gmail.com> wrote:

> [quoted message omitted; the original message appears in full at the top
> of this thread]