Posted to users@zeppelin.apache.org by Meethu Mathew <me...@flytxt.com> on 2017/04/19 12:30:29 UTC

UnicodeDecodeError in zeppelin 0.7.1

Hi,

I just migrated from zeppelin 0.7.0 to zeppelin 0.7.1 and I am facing this
error while creating an RDD (in pyspark).

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
> invalid start byte


I was able to create the RDD without any error after adding
use_unicode=False as follows:

> sc.textFile("file.csv",use_unicode=False)


But it fails when I try to stem the text. I am getting a similar error when
trying to apply stemming to the text using the python interpreter.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
> ordinal not in range(128)

All this code works in version 0.7.0. There is no change in the
dataset or code. Is there any change in the encoding type in the new
version of zeppelin?
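For illustration (this example is mine, not from the original mail), the first traceback can be reproduced in plain Python: 0x80 is never a valid UTF-8 start byte, and decoding with errors='replace' is one hedged workaround when the input may contain such bytes.

```python
# Sketch: reproduce the reported 'utf8' decode failure on a raw byte string.
raw = b"\x80 bad start byte"

try:
    raw.decode("utf8")
except UnicodeDecodeError as e:
    print(e)  # the UnicodeDecodeError message, as in the traceback above

# Replacing undecodable bytes with U+FFFD instead of raising is a common workaround:
cleaned = raw.decode("utf8", errors="replace")
print(cleaned)
```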

Regards,
Meethu Mathew

Re: UnicodeDecodeError in zeppelin 0.7.1

Posted by Meethu Mathew <me...@flytxt.com>.
Hi All,

I am getting this error in zeppelin 0.7.2 also with the following code. I had
reported the same error in 0.7.1 as well (please find the mail below).


def textPreProcessor(text):
    for w in text.split():
        regex = re.compile('[%s]' % re.escape(string.punctuation))
        no_punctuation = unicode(regex.sub(' ', w), 'utf8')
        tokens = word_tokenize(no_punctuation)
        lowercased = [t.lower() for t in tokens]
        no_stopwords = [w for w in lowercased if not w in stopwordsX]
        stemmed = [stemmerX.stem(w) for w in no_stopwords]
        return [w for w in stemmed if w]



   - docs = sc.textFile(hdfs_path + training_data, use_unicode=False).repartition(96)
   - docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
Note: In version 0.7.0 the code ran fine without using use_unicode and
unicode(regex.sub(' ', w),'utf8').

Please help to fix this issue.
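Since use_unicode=False makes sc.textFile hand back raw byte strings, a defensive variant of the preprocessor would decode each record explicitly before tokenizing. A minimal sketch in plain Python (no Spark or NLTK; STOPWORDS and the tolerant decode are my assumptions, standing in for the thread's stopwordsX and word_tokenize):

```python
import re
import string

STOPWORDS = {"the", "a", "is"}  # hypothetical stand-in for the thread's stopwordsX
PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))

def preprocess(raw_bytes):
    """Decode one raw record tolerantly, strip punctuation, drop stopwords."""
    # With use_unicode=False each record arrives as bytes; decode once, up
    # front, replacing undecodable bytes instead of raising UnicodeDecodeError.
    text = raw_bytes.decode("utf8", errors="replace")
    no_punctuation = PUNCT_RE.sub(" ", text)
    tokens = [t.lower() for t in no_punctuation.split()]
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess(b"The caf\xc3\xa9 is open!"))  # ['café', 'open']
```

Decoding once at the boundary keeps everything downstream (tokenizing, stemming) working on unicode text, which avoids the implicit ascii coercion that produces the second traceback.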

Regards,
Meethu Mathew



Re: UnicodeDecodeError in zeppelin 0.7.1

Posted by Meethu Mathew <me...@flytxt.com>.
Hi,

Thanks for the response.

@moon soo Lee: The interpreter settings are the same in 0.7.0 and 0.7.1.

@Felix Cheung: The Python version is the same.

The code is as follows:

*PYSPARK*

def textPreProcessor(text):
    for w in text.split():
        regex = re.compile('[%s]' % re.escape(string.punctuation))
        no_punctuation = unicode(regex.sub(' ', w), 'utf8')
        tokens = word_tokenize(no_punctuation)
        lowercased = [t.lower() for t in tokens]
        no_stopwords = [w for w in lowercased if not w in stopwordsX]
        stemmed = [stemmerX.stem(w) for w in no_stopwords]
        return [w for w in stemmed if w]



   - docs = sc.textFile(hdfs_path + training_data, use_unicode=False).repartition(96)
   - docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
*Error:*

   - UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position 17: invalid start byte
   - The same error occurs when use_unicode=False is not used.
   - The error changes to 'ascii' codec can't decode byte 0x97 in position 3: ordinal not in range(128) when no_punctuation = regex.sub(' ', w) is used instead of no_punctuation = unicode(regex.sub(' ', w),'utf8').

Note: In version 0.7.0 the code ran fine without using use_unicode and
unicode(regex.sub(' ', w),'utf8').
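The switch between the 'utf8' and 'ascii' codecs in the errors above is itself a clue: in Python 2, mixing a raw str with unicode triggers an implicit decode with the ascii codec, which fails on any byte >= 0x80. A small illustration (the example bytes are mine, not from the dataset):

```python
data = b"caf\xc3\xa9"  # the UTF-8 encoding of "café"

# An explicit utf8 decode succeeds:
assert data.decode("utf8") == u"caf\u00e9"

# Decoding with ascii (what implicit str/unicode coercion does in Python 2)
# fails on the first byte outside range(128):
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
```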

*PYTHON*

def textPreProcessor(text_column):
    processed_text = []
    for text in text_column:
        for w in text.split():
            regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex for punctuation
            no_punctuation = unicode(regex.sub(' ', w), 'utf8')
            tokens = word_tokenize(no_punctuation)
            lowercased = [t.lower() for t in tokens]
            no_stopwords = [w for w in lowercased if not w in stopwordsX]
            stemmed = [stemmerX.stem(w) for w in no_stopwords]
            processed_text.append([w for w in stemmed if w])
    return processed_text


   - new_training = pd.read_csv(training_data, header=None, delimiter=delimiter, error_bad_lines=False, usecols=[label_column, text_column], names=['label', 'msg']).dropna()
   - new_training['processed_msg'] = textPreProcessor(new_training['msg'])

This Python code works and I am getting results. In version 0.7.0, I was
getting output without using the unicode function.

Hope the problem is clear now.

Regards,
Meethu Mathew



Re: UnicodeDecodeError in zeppelin 0.7.1

Posted by Felix Cheung <fe...@hotmail.com>.
And are they running with the same Python version? What is the Python version?





Re: UnicodeDecodeError in zeppelin 0.7.1

Posted by moon soo Lee <mo...@apache.org>.
Hi,

0.7.1 didn't change any encoding type as far as I know.
One difference is that the 0.7.1 official artifact was built with JDK8 while
0.7.0 was built with JDK7 (we'll use JDK7 to build the upcoming 0.7.2 binary). But
I'm not sure that can change pyspark and spark encoding behavior.

Do you have exactly the same interpreter setting in 0.7.1 and 0.7.0?
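One quick way to compare the two setups is to print, from the same pyspark or python paragraph in both 0.7.0 and 0.7.1, the encodings Python will fall back on implicitly; if any of these differ, that would explain the behavior change (a diagnostic sketch, not something from the thread):

```python
import locale
import sys

# Defaults Python uses for implicit decoding and file I/O; run this in both
# Zeppelin versions and diff the output.
print(sys.getdefaultencoding())     # codec used for implicit str/unicode coercion
print(sys.getfilesystemencoding())
print(locale.getpreferredencoding())
```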

Thanks,
moon
