Posted to dev@pig.apache.org by "Santhosh Srinivasan (JIRA)" <ji...@apache.org> on 2009/02/19 18:22:02 UTC

[jira] Commented: (PIG-681) TextDataParser does not handle non-ASCII UTF-8 characters

    [ https://issues.apache.org/jira/browse/PIG-681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675048#action_12675048 ] 

Santhosh Srinivasan commented on PIG-681:
-----------------------------------------

The query and the exception stack trace from the user:

{code}
phrases = load 'phrases' as (data: chararray, f: int);
a = group phrases by f;
b = foreach a generate group as f, phrases.data as data;
store b into 'grouped';

b = load 'grouped' as (f: int, data: bag{t: tuple(data: chararray)});
c = foreach b generate f, data;       -- just store in this sample
store c into 'final';

[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2043: Unexpected error during execution.

org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 3.
	at org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:619)
	at org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:568)
	at org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:623)
	at org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:153)
	at org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:85)
	at org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:345)
	at org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42)
	at org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:71)
	at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:79)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:908)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:244)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:198)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:226)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:187)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:203)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:194)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
	at org.apache.hadoop.mapred.Child.main(Child.java:158)
{code}
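
The trace above shows Utf8StorageConverter.bytesToBag handing the stored bag text to TextDataParser. As a rough illustration of how non-ASCII input differs at the byte level, here is a minimal standalone sketch. It is not Pig code: the class name, the helper isAllAscii, and the sample strings are all hypothetical, and the user's actual failing data is not shown in the report. It only demonstrates that a bag literal containing a non-ASCII character encodes to bytes outside the 7-bit range, which a lexer written only for ASCII input would presumably not match.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Pig.
public class NonAsciiBagBytes {

    // Returns true if every byte is in the 7-bit ASCII range (0x00-0x7F).
    static boolean isAllAscii(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed sample data: a bag holding one single-field tuple.
        String asciiBag = "{(cafe)}";
        String utf8Bag = "{(caf\u00e9)}"; // same text with U+00E9 as the last letter

        byte[] asciiBytes = asciiBag.getBytes(StandardCharsets.UTF_8);
        byte[] utf8Bytes = utf8Bag.getBytes(StandardCharsets.UTF_8);

        // 8 characters -> 8 bytes, all ASCII
        System.out.println(asciiBytes.length + " " + isAllAscii(asciiBytes));
        // 8 characters -> 9 bytes, because U+00E9 takes two bytes in UTF-8
        System.out.println(utf8Bytes.length + " " + isAllAscii(utf8Bytes));

        // Decoding with the same charset restores the original string.
        String roundTrip = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(utf8Bag));
    }
}
```

Running it prints "8 true", "9 false", and "true". Whether this is exactly what trips the token manager here is an assumption; the reported position (line 1, column 3) would at least be consistent with a failure on the first character after the opening "{(".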

> TextDataParser does not handle non-ASCII UTF-8 characters
> ---------------------------------------------------------
>
>                 Key: PIG-681
>                 URL: https://issues.apache.org/jira/browse/PIG-681
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Santhosh Srinivasan
>             Fix For: types_branch
>
>
> The TextDataParser handles ASCII data but not non-ASCII UTF-8 data. Since Pig supports UTF-8 data, the parser should be modified to handle non-ASCII characters as well.
