Posted to user@pig.apache.org by Björn-Elmar Macek <em...@cs.uni-kassel.de> on 2012/10/30 18:22:58 UTC

Python UDF got problems converting Strings to Integers

Hi all,

I have a UDF that sums up histograms given as tuples. The function I
wrote looks like this:

@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None: return None;
    hist_len = len(aHistogramSet[0])
    result=[0]*hist_len

    for aHistogram in aHistogramSet:
        for i in range(0,hist_len):
            value = int(''.join(map(str,aHistogram[i])));
            result[i] = result[i] + (value)
    return tuple(result)

So for the input {(1,23,45),(0,0,0)} I should get the output (1,23,45).
But instead I get: (49,5051,52,5353)
I played around with this for some time and found that the line
"value = int(''.join(map(str,aHistogram[i])));" does not convert the
"23" to 23. Instead, it takes every single digit, starting with the most
significant one, and adds 48 to it: 2+48=50 and 3+48=51, resulting in 5051.

Why does this happen? Can anybody help me here?

Best regards,
Elmar
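
The behaviour described above can be reproduced outside Pig. The following
is a minimal sketch, assuming the field reaches the UDF as a sequence of
byte values rather than as a number (a later message in this thread reports
that aHistogram[i] is of type array). A plain Python bytearray stands in for
the field here, and the variable names are made up for illustration.

```python
# Stand-in (assumption) for one histogram field as it reaches the UDF:
# the ASCII bytes of the text "23".
field = bytearray(b"23")

# Iterating over the bytes yields integer character codes, not digit characters.
codes = list(field)

# str() turns each code into its decimal text ("50", "51"); joining gives "5051".
wrong = int(''.join(map(str, codes)))

# The character '0' has code 48, which matches the offset observed above.
offset = ord('0')

print(codes, wrong, offset)
```

Since ord('0') is 48, each element would be the character code of one digit,
which is consistent with the "digit plus 48" pattern in the output.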

Re: Python UDF got problems converting Strings to Integers

Posted by Björn-Elmar Macek <em...@cs.uni-kassel.de>.
OK, I solved it after realizing what happens internally. The
solution looks like this:
@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None:
        return None
    hist_len = len(aHistogramSet[0])
    result = [0] * hist_len

    for aHistogram in aHistogramSet:
        for i in range(0, hist_len):
            value = aHistogram[i]  # arrives as an array, not as an int
            val_len = len(value)
            tmp_conv = ''
            for j in range(0, val_len):
                # each element is a digit increased by 48; undo the offset
                tmp_conv = tmp_conv + str(int(value[j]) - 48)
            value2 = int(tmp_conv)
            result[i] = result[i] + value2

    return tuple(result)

It is important to know that aHistogram[i] is of type array. If it is left
untouched and returned by the function, it correctly displays the value
of the histogram tuple at position i. Any direct conversion to int or
string, however, does not work as expected. If you access the individual
positions (value[j]), you get the j-th most significant digit of the
integer, but increased by 48. The code above restores the information
encoded in this array. It is not a clean solution and looks more like
a hack, but it does the trick.

Thanks,
Björn-Elmar
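
The manual offset handling above can also be written by turning each code
back into its character. This is only a sketch, under the assumption that
every element of the array is the ASCII code of one character of the number.
field_to_int and agg_histo are hypothetical names, and the Pig decorator is
left out so that the snippet runs on its own.

```python
def field_to_int(field):
    """Convert a sequence of character codes, e.g. [50, 51], to the int 23."""
    # chr() maps each code back to its character, so no manual "- 48" is needed.
    return int(''.join(chr(code) for code in field))


def agg_histo(histograms):
    """Sum histograms element-wise; every field is a sequence of character codes."""
    if not histograms:
        return None
    result = [0] * len(histograms[0])
    for histogram in histograms:
        for i, field in enumerate(histogram):
            result[i] += field_to_int(field)
    return tuple(result)


# Sample input modelled on the bag {(1,23,45),(0,0,0)} from the first message.
bag = [
    [bytearray(b"1"), bytearray(b"23"), bytearray(b"45")],
    [bytearray(b"0"), bytearray(b"0"), bytearray(b"0")],
]
total = agg_histo(bag)
print(total)
```

Because chr(45) is '-', a leading minus sign would also survive this
conversion, whereas subtracting 48 from every element only works for digits.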


On 31.10.12 10:36, Björn-Elmar Macek wrote:
> Hi Cheolsoo,
>
> this is because I have a 24-dimensional tuple, and the definition alone
> is a pain. It makes my code unreadable and harder to interpret or fix:
> imagine how many errors you can make there.
>
> I would prefer solving this issue within Python, so my Pig calls do
> not get too complicated and possibly messy.
>
> Thanks,
> Björn-Elmar


Re: Python UDF got problems converting Strings to Integers

Posted by Björn-Elmar Macek <em...@cs.uni-kassel.de>.
Hi Cheolsoo,

this is because I have a 24-dimensional tuple, and the definition alone
is a pain. It makes my code unreadable and harder to interpret or fix:
imagine how many errors you can make there.

I would prefer solving this issue within Python, so my Pig calls do not
get too complicated and possibly messy.

Thanks,
Björn-Elmar
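
If writing out the typed schema by hand is the main obstacle, its text can
be generated instead. A small sketch follows. The helper int_schema and the
field names h0, h1, ... are hypothetical, and the file name is taken from
the example elsewhere in this thread. The snippet only builds a string that
could be pasted into the Pig script; it does not talk to Pig.

```python
def int_schema(n, prefix="h"):
    """Build a Pig schema clause such as "(h0:int, h1:int)" for n int fields."""
    fields = ["%s%d:int" % (prefix, i) for i in range(n)]
    return "(" + ", ".join(fields) + ")"


# Schema text for a 24-dimensional histogram tuple.
schema_24 = int_schema(24)

# Example of embedding the generated clause in a load statement (text only).
load_stmt = "a = load '1.txt' using PigStorage(',') as %s;" % schema_24
print(load_stmt)
```

With the fields declared as int, the UDF receives numbers and the
conversion inside Python is no longer needed.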


On 31.10.12 05:59, Cheolsoo Park wrote:
> Hi,
>
> First of all, why can't you pass a tuple of integers to your udf in the
> first place? Because then you don't have to cast strings to integers inside
> your udf.


Re: Python UDF got problems converting Strings to Integers

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi,

First of all, why can't you pass a tuple of integers to your UDF in the
first place? Then you would not have to cast strings to integers inside
your UDF.

Here is how I got your udf working.

cheolsoo@localhost:~/workspace/pig-trunk $ cat 1.txt
1,2,3
4,5,6

cheolsoo@localhost:~/workspace/pig-trunk $ cat test.pig
register 'test.py' using jython as myfuncs;
a = load '1.txt' using PigStorage(',') as (i:int, j:int, k:int); -- declare as integers
b = group a all;
c = foreach b generate myfuncs.aggHisto(a);
dump c;

@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None:
        return None

    hist_len = len(aHistogramSet[0])
    result = [0] * hist_len
    print(aHistogramSet)  # debug output of the incoming bag

    for aHistogram in aHistogramSet:
        for i in range(0, hist_len):
            result[i] = result[i] + aHistogram[i]  # vector addition
    return tuple(result)

I get the following result:
((5,7,9))

Thanks,
Cheolsoo
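
The same element-wise sum can be checked without a Pig installation. In the
sketch below the outputSchema decorator is replaced by a stand-in, which is
an assumption made only so that the file runs under plain Python; when the
script is registered in Pig, the decorator is expected to come from Pig's
Jython support instead. The column-wise zip is a compact variant of the
nested loops above, not the code that was posted.

```python
def outputSchema(schema):
    """Stand-in decorator (assumption) so the UDF can be run outside Pig."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    """Sum a bag of integer tuples position by position."""
    if not aHistogramSet:
        return None
    # zip(*rows) groups the i-th entries of all tuples into one column.
    return tuple(sum(column) for column in zip(*aHistogramSet))


# Rows taken from the 1.txt sample in this message.
result = aggHisto([(1, 2, 3), (4, 5, 6)])
print(result)
```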
