Posted to user@pig.apache.org by Alejandra García Rojas Martínez <al...@gmail.com> on 2012/03/16 11:01:22 UTC
Python to BoundScript encoding question
Hello, I am having an encoding problem when binding a Python script to a
Pig script. I pass the text "Catégorie" as a parameter to the Pig script,
but the binding does not use the right encoding and produces "Catgorie"
instead.
This is my Python script:
# -*- coding: UTF-8 -*-
from org.apache.pig.scripting import Pig
Prefix = "Catégorie"
params = { "prefix":Prefix, "output":"workspace/fr_30" }
P1 = Pig.compileFromFile("topic-corpus/test.pig")
bound1 = P1.bind(params)
stats1 = bound1.run()
And the Pig script:
items = LOAD '$output/items.tsv' AS (id: chararray, count: long);
update_items = FOREACH items GENERATE
id, REPLACE(id, '$prefix:', '') AS candidate_id;
When I run the script, the binding generates this query:
2012-03-16 10:54:18,001 [main] INFO org.apache.pig.scripting.BoundScript -
Query to run:
items = LOAD 'workspace/fr_30/items.tsv' AS (id: chararray, count: long,
childen:long, parents:long);items_replace = FOREACH items GENERATE id,
REPLACE(id, 'Cat̩gorie:', '') AS candidateId;STORE items_replace INTO
'workspace/fr_30/replaced__items.tsv';
So the prefix "Cat̩gorie" is not decoded correctly, and as a result nothing
is replaced.
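To illustrate the symptom outside of Pig, here is a minimal sketch in plain Python. It assumes the UTF-8 bytes of the prefix get decoded with a different single-byte charset somewhere along the way. Latin-1 is used purely as an example; I do not know which charset the binding actually applies. The names row_id and result are made up for the illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: simulates a charset mismatch, it does not call Pig.

prefix = u"Catégorie"

# In UTF-8 the character "é" is stored as the two bytes 0xC3 0xA9.
utf8_bytes = prefix.encode("utf-8")

# ASSUMPTION: the bytes are decoded with the wrong charset (Latin-1 here).
# Each of the two bytes then becomes its own character.
misdecoded = utf8_bytes.decode("latin-1")

# Hypothetical id as it might appear in items.tsv.
row_id = u"Catégorie:123"

# Python equivalent of REPLACE(id, '$prefix:', '') with the garbled prefix:
# the pattern no longer matches, so the id comes back unchanged.
result = row_id.replace(misdecoded + u":", u"")

print(repr(misdecoded))
print(result == row_id)
```

With the correctly decoded prefix the same replace call would strip "Catégorie:" from the id, so the mismatch alone would be enough to explain why nothing is replaced.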
What am I missing?
Thanks!