Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/11/17 13:11:21 UTC

Handling Windows characters with Spark CSV on Linux

Hi,

In the past, with the Databricks CSV package, I occasionally had to do
some cleaning at the Linux directory level before ingesting a CSV file
into the HDFS staging directory for Spark to read.

I have a more generic issue that may need to be addressed.

Assume that a provider uses FTP to push CSV files into Windows
directories. The whole solution is built around Windows and .NET.

Now you want to ingest those files into HDFS and process them with Spark
CSV.

One can create NFS directories visible to both the Windows server and
HDFS. However, there may be issues with character sets etc. What are the
best ways of handling this? One way would be to use some scripts to make
these spreadsheet-type files compatible with Linux and then load them into
HDFS. For example, I know that if I save an Excel spreadsheet file in DOS
format, that file will work fine with Spark CSV. Are there tools to do
this as well?

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Handling Windows characters with Spark CSV on Linux

Posted by Hyukjin Kwon <gu...@gmail.com>.
Actually, the CSV datasource supports an encoding option [1] (although it
does not support non-ASCII-compatible encoding types).

[1]
https://github.com/apache/spark/blob/44c8bfda793b7655e2bd1da5e9915a09ed9d42ce/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L364
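
A minimal sketch of that option in use, assuming Spark 2.x, a
Windows-1252 source file, and a hypothetical HDFS path (the option is
also accepted under the alias "charset"):

    // Hypothetical path; tells the CSV reader how the bytes are
    // encoded instead of letting it assume UTF-8.
    val df = spark.read
      .option("header", "true")
      .option("encoding", "windows-1252")
      .csv("hdfs:///staging/input.csv")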

On 17 Nov 2016 10:59 p.m., "ayan guha" <gu...@gmail.com> wrote:

> There is a utility called dos2unix. You can give it a try.

Re: Handling Windows characters with Spark CSV on Linux

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Ayan.

That only works for extra characters like ^M (carriage returns) etc.
Unfortunately it does not cure specific character sets.
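
For the character sets themselves, one pre-ingest option is to re-encode
the files before pushing them to HDFS (what iconv -f CP1252 -t UTF-8
does). A minimal sketch in Scala, assuming Windows-1252 sources and
hypothetical paths:

    import java.nio.charset.Charset
    import java.nio.file.{Files, Paths}

    // Hypothetical paths; decode with the Windows code page, then
    // rewrite the file as UTF-8 so downstream readers see clean text.
    val bytes = Files.readAllBytes(Paths.get("/staging/input.csv"))
    val text  = new String(bytes, Charset.forName("windows-1252"))
    Files.write(Paths.get("/staging/input-utf8.csv"), text.getBytes("UTF-8"))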

cheers

Dr Mich Talebzadeh






On 17 November 2016 at 13:59, ayan guha <gu...@gmail.com> wrote:

> There is a utility called dos2unix. You can give it a try.

Re: Handling Windows characters with Spark CSV on Linux

Posted by ayan guha <gu...@gmail.com>.
There is a utility called dos2unix. You can give it a try.
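
For what it's worth, Hadoop's line reader already treats CRLF as a line
ending, but if stray carriage returns survive, stripping them inside
Spark is a one-liner. A minimal sketch, with a hypothetical path:

    // Hypothetical path; drop any trailing \r that DOS/Windows line
    // endings leave behind, which is essentially what dos2unix does.
    val cleaned = spark.sparkContext
      .textFile("hdfs:///staging/input.csv")
      .map(_.stripSuffix("\r"))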

Re: Handling Windows characters with Spark CSV on Linux

Posted by Jörn Franke <jo...@gmail.com>.
You can do the character set conversion (is this the issue?) as part of your loading process in Spark.
As far as I know, the Spark CSV package is based on Hadoop's TextInputFormat, which to the best of my knowledge supports only UTF-8, so you have to convert from the Windows encoding to UTF-8. If you are referring to language-specific settings (numbers, dates, etc.), these are also not supported.
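
A minimal sketch of that in-flight conversion, assuming the sources are Windows-1252 encoded and the path is hypothetical; it re-decodes the raw Text bytes instead of letting Spark assume UTF-8:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // Hypothetical path; Text carries the raw bytes, so we can decode
    // them as Windows-1252 rather than UTF-8 before parsing the CSV.
    val lines = spark.sparkContext
      .hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///staging/input.csv")
      .map { case (_, line) =>
        new String(line.getBytes, 0, line.getLength, "windows-1252")
      }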

I started to work on the hadoopoffice library (which you can use with Spark), where you can read Excel files directly (https://github.com/ZuInnoTe/hadoopoffice). However, there is no official release yet. There you can also specify the language in which you want to represent data values, numbers, etc. when reading the file.
