Posted to user@pig.apache.org by Alejandra García Rojas Martínez <al...@gmail.com> on 2012/03/16 11:01:22 UTC

Python to BoundScript encoding question

Hello, I am having an encoding problem when binding a Python script to a
Pig script. I pass the text "Catégorie" as a parameter to the Pig script,
but the bound script does not use the right encoding and produces
"Catgorie".

This is my Python script:

# -*- coding: UTF-8 -*-
# Import added so that the snippet is self-contained
from org.apache.pig.scripting import Pig
Prefix = "Catégorie"
params = { "prefix":Prefix, "output":"workspace/fr_30" }
P1 = Pig.compileFromFile("topic-corpus/test.pig")
bound1 = P1.bind(params)
stats1 = bound1.run()

And the Pig script:

items = LOAD '$output/items.tsv' AS (id: chararray, count: long);
update_items = FOREACH items GENERATE
  id, REPLACE(id, '$prefix:', '') AS candidate_id;

When I run the script, the binding generates this query to run:

2012-03-16 10:54:18,001 [main] INFO  org.apache.pig.scripting.BoundScript -
Query to run:
items = LOAD 'workspace/fr_30/items.tsv'  AS (id: chararray, count: long,
childen:long, parents:long);items_replace = FOREACH items GENERATE  id,
REPLACE(id, 'Cat̩gorie:', '') AS candidateId;STORE items_replace INTO
'workspace/fr_30/replaced__items.tsv';


So the prefix comes out as "Cat̩gorie": it is not decoded correctly, and as a
result nothing is replaced.
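To illustrate the kind of corruption involved, here is a minimal sketch in plain Python. It does not use Pig and is not taken from the BoundScript source. It only shows one common way an accented character gets mangled: the UTF-8 bytes of the text are read back one byte per character. Whether this is what happens during the binding is an assumption, and the variable names are made up for the illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical illustration only: plain Python, no Pig involved.
prefix = u"Catégorie"

# The UTF-8 encoding stores "é" as two bytes (0xC3 0xA9).
utf8_bytes = prefix.encode("utf-8")

# Reading each byte as one character (Latin-1) turns "é" into two characters.
misread = utf8_bytes.decode("latin-1")

# Decoding the same bytes with the matching codec restores the text.
restored = utf8_bytes.decode("utf-8")

print(len(prefix), len(utf8_bytes), len(misread))
print(misread == prefix, restored == prefix)
```

The misread string is one character longer than the original, so a REPLACE on the literal prefix would no longer match. If something similar happens here, passing the parameter as a unicode literal (u"Catégorie") may be worth trying, though that is untested against this setup.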

What am I missing?

Thanks!