Posted to user@pig.apache.org by Clément MATHIEU <cl...@unportant.info> on 2012/11/30 14:48:30 UTC
How to perform a logical diff on two PigStorage files
Hi all,
I'm trying to build a non-regression testing tool to verify that the files
produced by two Pig scripts are equal.
The files are in PigStorage format. The first field is a key and the remaining
fields are opaque data (primitive or complex types).
Example:
1 43 {(10), (12), (14)} {(55), (90)} 0 60
I want to check that each key is present in both files or in neither, and that
for each key the lines are equal. By equal I mean logically equal, not
string or byte equal. For example, the following two lines should be
equal:
1 43 {(10), (12), (14)} {(55), (90)} 0 60
1 43 {(12), (10), (14)} {(90), (55)} 0 60
My issue is that since this tool needs to operate on a lot of different
files, it should not rely on a predefined schema. I experimented with
the following idea:
------
f1 = LOAD '$FILE1' USING PigStorage();
f2 = LOAD '$FILE2' USING PigStorage();
g_f1 = GROUP f1 BY $0;
g_f2 = GROUP f2 BY $0;
joined = JOIN
g_f1 by group full outer,
g_f2 by group;
cmp = FILTER joined by
g_f1::group is null
or g_f2::group is null
or SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;
dump cmp;
------
Unfortunately, since no schema is specified at load time, g_f1::f1 and
g_f2::f2 are instances of DataByteArray. This means that the DIFF function
does not behave as wanted: a byte-by-byte comparison is performed rather than
a logical comparison. For example, "1 {(2),(1)}" and "1 {(1),(2)}" are
considered different since their byte representations are not the same.
Do you know if such a tool already exists, or how to write one?
I currently foresee three options:
1- Specify the schema. It could be done using scripting and a file-to-schema
mapping. The schema would be inserted using a variable. However, the schema
of each file has to be described manually, which is a cumbersome process.
2- Use PigStorageSchema instead of PigStorage. I believe this would solve
the issue, but being stuck with 0.8.1 I'm wondering if PigStorageSchema
is robust and side-effect free enough to be used in production scripts.
3- Write a custom DIFF UDF taking two DataByteArray. This option avoids
modifying production scripts, but I don't know how much effort is required
to write such a UDF. Parsing the DataByteArray to rebuild a
set/list/string structure seems quite easy. Do you think some part of
the Pig code, like Utf8StorageConverter, can be reused, or should I simply
write my own parser?
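To make option 3 concrete, here is a minimal sketch of the "own parser" route, written in plain Python rather than as a real Pig UDF. It assumes fields are tab-separated, bags look like {(a),(b)}, tuples look like (a,b), maps are not used, and scalars never contain commas, braces or parentheses. The names parse_field, parse_line and logically_equal are made up for this illustration and are not part of Pig. Bags are turned into multisets so that tuple order is ignored; scalars stay as strings, so "1" and "1.0" would still differ.

```python
from collections import Counter


def _skip_spaces(s, i):
    while i < len(s) and s[i] == " ":
        i += 1
    return i


def _parse_items(s, i, close):
    """Parse comma-separated values up to the closing character."""
    items = []
    i = _skip_spaces(s, i)
    if i < len(s) and s[i] == close:  # empty bag or tuple
        return items, i + 1
    while True:
        value, i = _parse_value(s, i)
        items.append(value)
        i = _skip_spaces(s, i)
        if i >= len(s):
            raise ValueError("unterminated bag or tuple: %r" % s)
        if s[i] == close:
            return items, i + 1
        if s[i] != ",":
            raise ValueError("unexpected %r at offset %d in %r" % (s[i], i, s))
        i += 1


def _parse_value(s, i):
    i = _skip_spaces(s, i)
    if i < len(s) and s[i] == "{":
        items, i = _parse_items(s, i + 1, "}")
        # A bag is an unordered multiset: keep counts, drop the order.
        return frozenset(Counter(items).items()), i
    if i < len(s) and s[i] == "(":
        items, i = _parse_items(s, i + 1, ")")
        return tuple(items), i
    j = i
    while j < len(s) and s[j] not in ",(){}":
        j += 1
    return s[i:j].strip(), j  # scalars are kept as strings


def parse_field(text):
    """Parse one field; anything not starting with { or ( is a scalar."""
    stripped = text.strip()
    if not stripped.startswith(("{", "(")):
        return stripped
    value, pos = _parse_value(stripped, 0)
    if pos != len(stripped):
        raise ValueError("trailing data in field: %r" % text)
    return value


def parse_line(line):
    return tuple(parse_field(f) for f in line.rstrip("\n").split("\t"))


def logically_equal(line_a, line_b):
    return parse_line(line_a) == parse_line(line_b)


a = "1\t43\t{(10), (12), (14)}\t{(55), (90)}\t0\t60"
b = "1\t43\t{(12), (10), (14)}\t{(90), (55)}\t0\t60"
print(a == b, logically_equal(a, b))
```

The same recursive structure should carry over to a Java UDF, but whether existing Pig classes can replace the hand-written parser is exactly the open question above.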
Thanks!
- Clément
Re: How to perform a logical diff on two PigStorage files
Posted by Bill Graham <bi...@gmail.com>.
I've done this in two passes. First I do an intersection test and determine
the outer misses by join key on each side, similar to what you've done. I
then store the left_only and right_only sides for further inspection.
Then I take the intersection relation, which contains a left and a right
tuple, and pass it through a UDF. This is similar to your #3 proposal,
only the UDF takes two tuples. It traverses them in parallel before
outputting a string representation of a bitmask of which tuple fields
matched or missed. Group on the bitmasks to generate counts and you get a
report of all the different combinations of field misses, all without a
known schema.
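A rough sketch of the two passes, in plain Python on in-memory data rather than as Pig relations and a real UDF. It assumes each side is a dict mapping the join key to a single tuple of fields, and that "1" in the bitmask means the field matched and "0" means it differed or was missing on one side; that encoding and the function names split_by_key, field_bitmask and mismatch_report are choices made for this illustration only.

```python
from collections import Counter


def split_by_key(left, right):
    """First pass: keys present only on the left, only on the right, or on both."""
    left_only = sorted(left.keys() - right.keys())
    right_only = sorted(right.keys() - left.keys())
    both = sorted(left.keys() & right.keys())
    return left_only, right_only, both


def field_bitmask(left_tuple, right_tuple):
    """Second pass, per key: walk both tuples in parallel, one bit per field."""
    width = max(len(left_tuple), len(right_tuple))
    bits = []
    for i in range(width):
        same = (
            i < len(left_tuple)
            and i < len(right_tuple)
            and left_tuple[i] == right_tuple[i]
        )
        bits.append("1" if same else "0")
    return "".join(bits)


def mismatch_report(left, right):
    """Group the intersection by bitmask and count each pattern."""
    left_only, right_only, both = split_by_key(left, right)
    counts = Counter(field_bitmask(left[k], right[k]) for k in both)
    return left_only, right_only, counts


left = {"1": ("1", "43", "0"), "2": ("2", "7", "5"), "3": ("3", "9", "9")}
right = {"1": ("1", "43", "0"), "2": ("2", "8", "5"), "4": ("4", "1", "1")}
left_only, right_only, counts = mismatch_report(left, right)
print(left_only, right_only, dict(counts))
```

The per-field comparison here is plain equality; to ignore bag ordering, the fields would first need to be put into some order-insensitive form.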
Re: How to perform a logical diff on two PigStorage files
Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Hi,
As for point 1: it will always be cumbersome to work with such files. I would
recommend using Avro, where the schema is included in the file.
Also, you could try sorting the contents or applying some transformation to
force the files to look the same, then just diff the files outside of Pig.
That's just an idea; I'm not sure whether it'll work for you.
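Something along these lines, as a sketch only, in Python outside of Pig. It assumes bags are flat (tuples that contain no nested bags or tuples), rewrites each bag with its tuples sorted as text, sorts the lines, and then runs an ordinary text diff. The helper names canonical_line and canonical_diff are invented for this example.

```python
import difflib
import re

# A bag with no nested bags inside, e.g. {(10), (12)}.
FLAT_BAG = re.compile(r"\{([^{}]*)\}")
# A tuple with no nested tuples inside, e.g. (10).
FLAT_TUPLE = re.compile(r"\([^()]*\)")


def canonical_line(line):
    """Rewrite every flat bag with its tuples sorted and comma-joined."""

    def sort_bag(match):
        tuples = sorted(FLAT_TUPLE.findall(match.group(1)))
        return "{" + ",".join(tuples) + "}"

    return FLAT_BAG.sub(sort_bag, line.rstrip("\n"))


def canonical_diff(lines_a, lines_b):
    """Diff two files after canonicalizing and sorting their lines."""
    a = sorted(canonical_line(line) for line in lines_a)
    b = sorted(canonical_line(line) for line in lines_b)
    return list(difflib.unified_diff(a, b, "file1", "file2", lineterm=""))


file1 = ["1\t43\t{(10), (12), (14)}\t{(55), (90)}\t0\t60\n"]
file2 = ["1\t43\t{(12), (10), (14)}\t{(90), (55)}\t0\t60\n"]
for row in canonical_diff(file1, file2):
    print(row)
```

The sort is purely textual, so it only makes equal bags look the same; nested bags would need a real parser.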
Thanks