Posted to issues@spark.apache.org by "Babu (JIRA)" <ji...@apache.org> on 2019/02/22 14:18:00 UTC
[jira] [Created] (SPARK-26971) How to read delimiter (Cedilla) in spark RDD and Dataframes
Babu created SPARK-26971:
----------------------------
Summary: How to read delimiter (Cedilla) in spark RDD and Dataframes
Key: SPARK-26971
URL: https://issues.apache.org/jira/browse/SPARK-26971
Project: Spark
Issue Type: Question
Components: PySpark
Affects Versions: 1.6.0
Reporter: Babu
I am trying to read a cedilla-delimited HDFS text file. I am getting the error below; has anyone faced a similar issue?
{{hadoop fs -cat test_file.dat}}
{{1ÇCelvelandÇOhio}}
{{2ÇDurhamÇNC}}
{{3ÇDallasÇTexas}}
{{>>> rdd = sc.textFile("test_file.dat")}}
{{>>> rdd.collect()}}
{{[u'1\xc7Celveland\xc7Ohio', u'2\xc7Durham\xc7NC', u'3Dallas\xc7Texas']}}
{{>>> rdd.map(lambda p: p.split("\xc7")).collect()}}
{{UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 0: ordinal not in range(128)}}
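The error comes from mixing string types in Python 2 (which PySpark 1.6 uses): `rdd.collect()` returns unicode strings, but `"\xc7"` is a byte string, so `split` triggers an implicit ASCII decode that fails on the 0xC7 byte. Passing a unicode delimiter (`u"\xc7"` in Python 2) avoids the implicit decode. A minimal sketch of the decode-then-split flow, written in Python 3 where all strings are unicode, and assuming the file is ISO-8859-1 encoded (an assumption; adjust to the real encoding):

```python
# One record as raw bytes, as it would come off HDFS (0xC7 is "Ç" in ISO-8859-1).
raw = b"1\xc7Celveland\xc7Ohio"

# Decode the bytes to text first, using the file's actual encoding.
line = raw.decode("iso-8859-1")

# Split on the cedilla character itself; in Python 2 this would be u"\xc7".
fields = line.split("\u00c7")
print(fields)  # ['1', 'Celveland', 'Ohio']
```

In the Python 2 REPL above, the equivalent fix would be `rdd.map(lambda p: p.split(u"\xc7")).collect()`.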
{{>>> sqlContext.read.format("text").option("delimiter","Ç").option("encoding","ISO-8859").load("/user/cloudera/test_file.dat").show()}}
{{|1ÇCelvelandÇOhio|}}
{{|2ÇDurhamÇNC|}}
{{|3DallasÇTexas|}}
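The whole-line output above is expected: the `text` format reads each line as a single `value` column and ignores a `delimiter` option, so the rows never get split. Delimiter-aware parsing needs a CSV-style reader instead (in Spark 1.6 that would be the external spark-csv package; built-in `format("csv")` arrived in 2.0). Since Spark itself can't run in a snippet here, a pure-Python sketch of the same cedilla-delimited parsing with the stdlib `csv` module, again assuming ISO-8859-1 data:

```python
import csv
import io

# The three sample rows from the file, already decoded to text.
data = io.StringIO("1\u00c7Celveland\u00c7Ohio\n"
                   "2\u00c7Durham\u00c7NC\n"
                   "3\u00c7Dallas\u00c7Texas\n")

# csv accepts any single-character delimiter, including "Ç".
rows = list(csv.reader(data, delimiter="\u00c7"))
print(rows)  # [['1', 'Celveland', 'Ohio'], ['2', 'Durham', 'NC'], ['3', 'Dallas', 'Texas']]
```

The analogous Spark call (hedged, depends on having spark-csv on the classpath in 1.6) would set the delimiter and charset on the CSV reader, e.g. `sqlContext.read.format("com.databricks.spark.csv").option("delimiter", u"\xc7").option("charset", "ISO-8859-1").load(...)`, rather than `format("text")`.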
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)