Posted to user@spark.apache.org by SLiZn Liu <sl...@gmail.com> on 2016/02/07 08:44:00 UTC

Imported CSV file content isn't identical to the original file

Hi Spark Users Group,

I have a CSV file to analyze with Spark, but I’m having trouble importing
it as a DataFrame.

Here’s a minimal reproducible example. Suppose I have a
*10(rows)x2(cols)* *space-delimited csv* file, shown below:

1446566430 2015-11-04<SP>00:00:30
1446566430 2015-11-04<SP>00:00:30
1446566430 2015-11-04<SP>00:00:30
1446566430 2015-11-04<SP>00:00:30
1446566430 2015-11-04<SP>00:00:30
1446566431 2015-11-04<SP>00:00:31
1446566431 2015-11-04<SP>00:00:31
1446566431 2015-11-04<SP>00:00:31
1446566431 2015-11-04<SP>00:00:31
1446566431 2015-11-04<SP>00:00:31

The <SP> in column 2 marks a sub-delimiter within that column. The file is
stored on HDFS; let’s say the path is hdfs:///tmp/1.csv

I’m using *spark-csv* to import this file as Spark *DataFrame*:

sqlContext.read.format("com.databricks.spark.csv")
        .option("header", "false") // the files have no header line
        .option("inferSchema", "false") // do not infer types; read all columns as strings
        .option("delimiter", " ")
        .load("hdfs:///tmp/1.csv")
        .show

Oddly, the output shows only a part of each column:

[image: Screenshot from 2016-02-07 15-27-51.png]

and even the table borders weren’t rendered correctly. I also tried the
other way of reading a CSV file, via sc.textFile(...).map(_.split(" ")) and
sqlContext.createDataFrame, and the result is the same. Can someone point
out where I went wrong?
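As a side note, the tokenization done by the sc.textFile(...).map(_.split(" ")) path can be sketched in plain Scala without Spark. One thing to watch: if <SP> stands for a literal space, a plain split(" ") yields three tokens per line, while split(" ", 2) keeps the second column whole. The sample line below is an assumption based on the example data, with "<SP>" kept as literal text; this is only an illustrative sketch, not the poster's actual code.

```scala
object SplitSketch {
  // Split a line into its two columns. limit = 2 splits only on the first
  // space, so any further spaces stay inside the second column.
  def parse(line: String): Array[String] =
    line.split(" ", 2)

  def main(args: Array[String]): Unit = {
    val cols = parse("1446566430 2015-11-04<SP>00:00:31")
    println(cols.mkString("|")) // 1446566430|2015-11-04<SP>00:00:31
  }
}
```

Feeding each parsed array into sqlContext.createDataFrame would then mirror the alternative reading path described above.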

—
BR,
Todd Leo

Re: Imported CSV file content isn't identical to the original file

Posted by SLiZn Liu <sl...@gmail.com>.
This error message no longer appears after I upgraded to 1.6.0.

--
Cheers,
Todd Leo


Re: Imported CSV file content isn't identical to the original file

Posted by SLiZn Liu <sl...@gmail.com>.
At least it works for me after temporarily disabling the Kryo serializer,
until I upgrade to 1.6.0. Thanks for your update. :)

Re: Imported CSV file content isn't identical to the original file

Posted by Luciano Resende <lu...@gmail.com>.
Sorry, same expected results with trunk and Kryo serializer



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: Imported CSV file content isn't identical to the original file

Posted by SLiZn Liu <sl...@gmail.com>.
I’ve found the trigger of my issue: if I start spark-shell or submit via
spark-submit with --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame
content goes wrong, as I described earlier.
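For concreteness, the two launch modes being compared can be sketched as below. This is an assumption of the setup (Spark 1.5.2 with the spark-csv package on the classpath); the exact package coordinates come from elsewhere in the thread.

```shell
# Launch that reportedly corrupts the DataFrame content:
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer

# Workaround from the thread: omit the Kryo conf (falling back to the
# default JavaSerializer) until upgrading to 1.6.0:
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
```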


Re: Imported CSV file content isn't identical to the original file

Posted by SLiZn Liu <sl...@gmail.com>.
Thanks Luciano; now it looks like I’m the only one having this issue. My
options have narrowed down to upgrading my Spark to 1.6.0, to see if the
issue goes away.

—
Cheers,
Todd Leo



Re: Imported CSV file content isn't identical to the original file

Posted by Luciano Resende <lu...@gmail.com>.
I tried 1.5.0, 1.6.0, and the 2.0.0 trunk with
com.databricks:spark-csv_2.10:1.3.0 and got the expected results, where the
columns seem to be read properly.

+-----------+-----------------------+
|C0         |C1                     |
+-----------+-----------------------+
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566431 | 2015-11-04<SP>00:00:31|
|1446566431 | 2015-11-04<SP>00:00:31|
|1446566431 | 2015-11-04<SP>00:00:31|
|1446566431 | 2015-11-04<SP>00:00:31|
|1446566431 | 2015-11-04<SP>00:00:31|
+-----------+-----------------------+






-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: Imported CSV file content isn't identical to the original file

Posted by SLiZn Liu <sl...@gmail.com>.
*Update*: in local mode (spark-shell --master local[2], no matter whether
reading from the local file system or HDFS), it works well. But that
doesn’t solve my issue, since my data scale requires hundreds of CPU cores
and hundreds of GB of RAM.

BTW, it’s traditional Chinese New Year now. I wish you all a happy new year
and great fortune in the Year of the Monkey!

—
BR,
Todd Leo


Re: Imported CSV file content isn't identical to the original file

Posted by SLiZn Liu <sl...@gmail.com>.
Hi Igor,

In my case, it’s not a matter of *truncation*. As the show() documentation
in the Spark API reads,

truncate: Whether truncate long strings. If true, strings more than 20
characters will be truncated and all cells will be aligned right…

whereas in my case the leading characters of both columns are missing.

Good to know the way to show a cell’s whole content, though.

—
BR,
Todd Leo





Re: Imported CSV file content isn't identical to the original file

Posted by Igor Berman <ig...@gmail.com>.
show has a truncate argument;
pass false so it won’t truncate your results.
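Concretely, this advice looks like the fragment below, run inside a Spark shell. It assumes `df` is the DataFrame loaded earlier in the thread; both overloads exist in the Spark 1.5.x Scala API.

```scala
// Assuming `df` is the DataFrame loaded from hdfs:///tmp/1.csv:
df.show(false)      // default 20 rows, cell contents not truncated
df.show(100, false) // first 100 rows, cell contents not truncated
```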


Re: Imported CSV file content isn't identical to the original file

Posted by SLiZn Liu <sl...@gmail.com>.
Plus, I’m using *Spark 1.5.2* with *spark-csv 1.3.0*. I also tried
HiveContext, but the result is exactly the same.
