Posted to user@pig.apache.org by Clément MATHIEU <cl...@unportant.info> on 2012/11/30 14:48:30 UTC

How to perform a logical diff on two PigStorage files

Hi all,

I'm trying to build a non-regression testing tool to verify that the files
produced by two Pig scripts are equal.

The files are in PigStorage format. The first field is a key and the remaining
fields are opaque data (primitive or complex types).

Example:
	1	43	{(10), (12), (14)}	{(55), (90)}	0	60

I want to check that each key is present in both files or in neither, and that
for each key the lines are equal. By equal I mean logical equality, not string
or byte equality. For example, the following two lines should be equal:
	1	43	{(10), (12), (14)}	{(55), (90)}	0	60
	1	43	{(12), (10), (14)}	{(90), (55)}	0	60


My issue is that, since this tool needs to operate on a lot of different
files, it should not rely on a predefined schema. I experimented with
the following idea:

------
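	-- Load both files without a schema; every field is a bytearray.
	f1 = LOAD '$FILE1' USING PigStorage();
	f2 = LOAD '$FILE2' USING PigStorage();

	-- Collect the lines of each file by key (first field).
	g_f1 = GROUP f1 BY $0;
	g_f2 = GROUP f2 BY $0;

	-- Full outer join so that keys missing on either side are kept.
	joined = JOIN
		g_f1 BY group FULL OUTER,
		g_f2 BY group;

	-- Keep keys present on one side only, or whose bags of lines differ.
	cmp = FILTER joined BY
		g_f1::group IS NULL
		OR g_f2::group IS NULL
		OR SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;

	DUMP cmp;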
------

Unfortunately, since no schema is specified at load time, g_f1::f1 and
g_f2::f2 are instances of DataByteArray. This means that the DIFF function
does not behave as wanted: a byte-by-byte comparison is performed rather than
a logical comparison. For example, "1       {(2),(1)}" and "1       {(1),(2)}"
are considered different since their byte representations are not the same.

Do you know whether such a tool already exists, or how to write one?

I currently foresee three options:

   1- Specify the schema. It could be done using scripting and a file-to-schema
      mapping. The schema would be inserted using a variable. However, the
      schema of each file has to be described manually, which is a cumbersome
      process.
   2- Use PigStorageSchema instead of PigStorage. I believe this would solve
      the issue, but being stuck with 0.8.1 I'm wondering whether
      PigStorageSchema is reasonably robust and free of side effects, so that
      it can be used in production scripts.
   3- Write a custom DIFF UDF taking two DataByteArray values. This option
      avoids modifying production scripts, but I don't know how much effort is
      required to write such a UDF. Parsing the DataByteArray to rebuild a
      set/list/string structure seems quite easy. Do you think some part of
      the Pig code, like Utf8StorageConverter, can be reused, or should I
      simply write my own parser?
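
The parsing step of option 3 could be sketched as follows. This is plain
Python, not a Pig UDF and not Utf8StorageConverter; the names parse_field and
parse_line are hypothetical. It assumes no maps, no nulls inside tuples, and no
commas, parentheses, or braces inside atoms nested in a bag or tuple. Bags are
normalized by sorting their items, so element order stops mattering while
duplicates still count.

```python
def _parse(s, i):
    """Parse the bag, tuple, or atom starting at s[i]; return (value, next_index)."""
    if s[i] in "{(":
        is_bag = s[i] == "{"
        close = "}" if is_bag else ")"
        items = []
        i += 1
        while True:
            # Skip separators between items.
            while i < len(s) and s[i] in ", ":
                i += 1
            if i >= len(s):
                raise ValueError("unterminated bag or tuple")
            if s[i] == close:
                i += 1
                break
            item, i = _parse(s, i)
            items.append(item)
        if is_bag:
            # A bag is unordered: sort the items so that order is irrelevant.
            return ("bag", tuple(sorted(items, key=repr))), i
        return ("tuple", tuple(items)), i
    # Atom: read up to the next structural character.
    j = i
    while j < len(s) and s[j] not in ",(){}":
        j += 1
    if j == i:
        raise ValueError(f"unexpected character {s[i]!r} at offset {i}")
    return s[i:j].strip(), j


def parse_field(text):
    """Turn one PigStorage field into a value that compares logically."""
    text = text.strip()
    if not text or text[0] not in "{(":
        return text  # atom or empty field: keep the (stripped) raw text
    value, end = _parse(text, 0)
    if end != len(text):
        raise ValueError("trailing characters after value")
    return value


def parse_line(line, sep="\t"):
    """Split a line on the field separator and parse every field."""
    return tuple(parse_field(f) for f in line.rstrip("\n").split(sep))


a = parse_line("1\t43\t{(10), (12), (14)}\t{(55), (90)}\t0\t60")
b = parse_line("1\t43\t{(12), (10), (14)}\t{(90), (55)}\t0\t60")
print(a == b)  # True
```

Atoms are compared as text here, so "1.0" and "1.00" would still count as
different; a real tool would have to decide how to treat such cases.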


Thanks!

- Clément

Re: How to perform a logical diff on two PigStorage files

Posted by Bill Graham <bi...@gmail.com>.

I've done this in two passes. First I do an intersection test and determine
the outer misses by join key on each side, similar to what you've done. I
then store the left_only and right_only sides for further inspection.
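
A rough sketch of that first pass, in plain Python rather than Pig Latin. The
function name split_by_key and the tab separator are assumptions made for the
illustration, not the actual script.

```python
from collections import defaultdict


def split_by_key(lines_a, lines_b, sep="\t"):
    """Partition keys into left-only, right-only, and shared ones."""
    def index(lines):
        # Map each key (first field) to the list of lines that carry it.
        by_key = defaultdict(list)
        for line in lines:
            line = line.rstrip("\n")
            by_key[line.split(sep, 1)[0]].append(line)
        return by_key

    a, b = index(lines_a), index(lines_b)
    left_only = {k: a[k] for k in a.keys() - b.keys()}
    right_only = {k: b[k] for k in b.keys() - a.keys()}
    both = {k: (a[k], b[k]) for k in a.keys() & b.keys()}
    return left_only, right_only, both
```

The left_only and right_only dictionaries play the role of the outer misses,
and both holds the pairs that go on to the field-by-field comparison.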

Then I take the intersection relation, which contains a left and a right
tuple, and pass it through a UDF. This is similar to your proposal #3,
except that the UDF takes two tuples. It traverses them in parallel and
outputs a string representation of a bitmask showing which tuple fields
matched or missed. Group on the bitmasks to generate counts and you get a
report of all the different combinations of field misses, all without a
known schema.
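
The bitmask idea can be sketched like this, again in plain Python and not as
the actual UDF. The names field_mask and mask_report are hypothetical, and a
plain == stands in for whatever per-field comparison the real UDF performs;
the fields are assumed to be in a comparable form already.

```python
from collections import Counter
from itertools import zip_longest

_MISSING = object()  # marks a field that is absent on one side


def field_mask(left, right):
    """Return one character per field: '1' if it matches, '0' if it differs."""
    return "".join(
        "1" if l == r else "0"
        for l, r in zip_longest(left, right, fillvalue=_MISSING)
    )


def mask_report(pairs):
    """Count how many (left, right) pairs produce each mask."""
    return Counter(field_mask(l, r) for l, r in pairs)


pairs = [
    (("1", "43", "a"), ("1", "43", "a")),
    (("2", "43", "a"), ("2", "44", "a")),
    (("3", "10", "b"), ("3", "11", "b")),
]
for mask, count in sorted(mask_report(pairs).items()):
    print(mask, count)
```

Each distinct mask in the report corresponds to one combination of mismatching
fields, so a single dominant mask usually points at one field to investigate.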



On Fri, Nov 30, 2012 at 12:49 PM, Ruslan Al-Fakikh <me...@gmail.com> wrote:

> Hi,
>
> As for point 1: it will always be cumbersome to work on such files. I would
> recommend using Avro, where the schema is included in the file.
> Also, you could try to sort the contents or apply some transformation to
> force the files to look the same, then just diff the files outside of Pig.
> That's just an idea; I'm not sure whether it'll work for you.
>
> Thanks



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: How to perform a logical diff on two PigStorage files

Posted by Ruslan Al-Fakikh <me...@gmail.com>.

Hi,

As for point 1: it will always be cumbersome to work on such files. I would
recommend using Avro, where the schema is included in the file.
Also, you could try to sort the contents or apply some transformation to
force the files to look the same, then just diff the files outside of Pig.
That's just an idea; I'm not sure whether it'll work for you.
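
One way the sort-then-diff idea might look, as a plain Python sketch. The
function names are hypothetical. It only handles bags whose tuples contain no
nested bags or tuples, and it sorts tuples as text.

```python
import difflib
import re

_BAG = re.compile(r"\{([^{}]*)\}")   # a bag with no nested bag inside
_TUPLE = re.compile(r"\([^()]*\)")   # a tuple with no nested tuple inside


def normalize_line(line):
    """Rewrite every flat bag with its tuples sorted and uniformly separated."""
    def sort_bag(match):
        items = sorted(_TUPLE.findall(match.group(1)))
        return "{" + ",".join(items) + "}"
    return _BAG.sub(sort_bag, line.rstrip("\n"))


def normalized_diff(lines_a, lines_b):
    """Diff two files after normalizing the bags and sorting the lines."""
    a = sorted(normalize_line(line) for line in lines_a)
    b = sorted(normalize_line(line) for line in lines_b)
    return list(difflib.unified_diff(a, b, lineterm=""))
```

An empty result means the two inputs are identical after normalization; any
other result lists the lines that still differ.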

Thanks

