Posted to user@pig.apache.org by Johannes Schwenk <jo...@adition.com> on 2012/05/29 14:35:20 UTC

Verifying unordered output with PigUnit

Hello all,

I'd like to verify the output of a Pig script that does not sort its
results before writing them, so the order of the tuples in the output is
non-deterministic. I would rather not add sorting to my script, because
I am potentially dealing with a lot of data here. As far as I can tell,
Pig Latin does not support conditional statements like "if PIG_UNIT_TEST
do stepsA else do stepsB fi", so that is not an option either (quite apart
from having duplicate and differing logic for test and non-test runs!).

So how could I do this?

Greetings,
Johannes Schwenk

-- 
Software Developer (Reporting)
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg

http://www.adition.com

T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION

(Landline rate 14 ct/min; mobile rates at most 42 ct/min)

Registered at the Amtsgericht Düsseldorf (district court) under HRB 54076
Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the Supervisory Board: Daniel Raimer, Attorney-at-Law
VAT ID No.: DE 218 858 434


Re: Verifying unordered output with PigUnit

Posted by Johannes Schwenk <jo...@adition.com>.
Hello again!

I don't have to sort the output in normal operation of my script, so I
would rather not, as this prolongs the running time unnecessarily...

So I still have the problem that I cannot compare the unsorted output of
the script to the expected one. I am doing this in PigUnit, so I had a
look at org.apache.pig.pigunit.PigTest, and the only option I could see
is to override assertOutput and write a new version of readFile, ensuring
that those functions return sorted records, which does not seem very
elegant to me...
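
One way to picture the "sort only inside the test" idea is the small Java sketch below. It is an illustration, not PigUnit code: the class SortedOutputCheck and its methods are hypothetical names, and it assumes the expected and actual rows have already been read into lists of strings by whatever means the test uses. Sorting happens on copies in the test harness, so the Pig script itself stays unsorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for a test class; not part of PigUnit.
class SortedOutputCheck {

    // Returns a sorted copy so the caller's list is left untouched.
    static List<String> sortedCopy(List<String> rows) {
        List<String> copy = new ArrayList<>(rows);
        Collections.sort(copy);
        return copy;
    }

    // True when both lists hold the same rows with the same
    // multiplicities, regardless of the order they arrive in.
    static boolean sameRowsIgnoringOrder(List<String> expected, List<String> actual) {
        return sortedCopy(expected).equals(sortedCopy(actual));
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("(a,1)", "(b,2)", "(c,3)");
        List<String> actual = Arrays.asList("(c,3)", "(a,1)", "(b,2)");
        System.out.println(sameRowsIgnoringOrder(expected, actual)); // prints true
    }
}
```

Because only the small test fixture is sorted, the cost of the sort is paid in the unit test and never in the production run.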

Has nobody had this problem with PigUnit to date?

Thanks!

On 29.05.2012 19:42, Jonathan Coveney wrote:
> Generally, sorting is the way to go. It's going to be difficult to get
> around doing some sort of processing in order to make it easier to evaluate
> equality.
> 
> If you want something generally O(n) instead of O(n log n), you could
> calculate the hashCode for every tuple then SUM it (which is algebraic),
> and only in the case that these are not equal (exceedingly rare) would you
> sort and directly do the comparison.

Johannes Schwenk


Re: Verifying unordered output with PigUnit

Posted by Jonathan Coveney <jc...@gmail.com>.
Generally, sorting is the way to go. It's going to be difficult to get
around doing some sort of processing in order to make it easier to evaluate
equality.

If you want something generally O(n) instead of O(n log n), you could
calculate the hashCode for every tuple then SUM it (which is algebraic),
and only in the case that these are not equal (exceedingly rare) would you
sort and directly do the comparison.
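
The hash-sum idea can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not a PigUnit or Pig API: the class HashSumCheck and its method names are hypothetical, rows are represented as strings, and String.hashCode stands in for the tuple hash. Addition is commutative, so the sum does not depend on row order. Note that a matching sum only makes equality likely, since hash collisions can in principle hide a difference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; names are illustrative only.
class HashSumCheck {

    // Order-independent fingerprint: one pass, O(n).
    static long hashSum(List<String> rows) {
        long sum = 0;
        for (String row : rows) {
            sum += row.hashCode();
        }
        return sum;
    }

    // Fast check: same row count and same hash sum.
    // A true result means "very likely equal", not "proven equal".
    static boolean probablySameRows(List<String> expected, List<String> actual) {
        return expected.size() == actual.size()
                && hashSum(expected) == hashSum(actual);
    }

    // Slow path, meant to run only after the fast check failed:
    // sort copies of both sides and report the first difference.
    static String describeMismatch(List<String> expected, List<String> actual) {
        List<String> e = new ArrayList<>(expected);
        List<String> a = new ArrayList<>(actual);
        Collections.sort(e);
        Collections.sort(a);
        int common = Math.min(e.size(), a.size());
        for (int i = 0; i < common; i++) {
            if (!e.get(i).equals(a.get(i))) {
                return "sorted position " + i + ": expected " + e.get(i)
                        + " but was " + a.get(i);
            }
        }
        if (e.size() != a.size()) {
            return "row counts differ: expected " + e.size() + " but was " + a.size();
        }
        return "no difference found";
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("b", "a");
        List<String> actual = Arrays.asList("a", "c");
        if (!probablySameRows(expected, actual)) {
            System.out.println(describeMismatch(expected, actual));
        }
    }
}
```

In a test, the fast check would run on every execution, and the sorted comparison would only be needed to produce a readable failure message.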
