Posted to user@pig.apache.org by Yang <te...@gmail.com> on 2012/11/07 22:05:22 UTC
Re: debug feature??

ok, I found this practice to be useful:


I divide my code into sections, each section implemented as a macro.

Then I debug each macro separately. At the end of each macro, I manually
write its output variables to temporary storage. For each macro, I also
write a corresponding "***_fake.pig" macro, which has the same signature but
populates the same return variables by loading them from that temporary
storage.

Once I am done with one section, I change its IMPORT statement to import
the ***_fake.pig script instead, so that the same computation is not done
again.
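
A minimal sketch of what this can look like, in case it helps. The block
shows three separate files, marked by comments. The file names, the macro
name clean_events, the schema (user_id, ts), and the /tmp/pig_debug paths
are all made up for illustration; only DEFINE ... RETURNS, IMPORT, LOAD and
STORE are standard Pig Latin.

```pig
-- ===== file: section1.pig (the real macro; file name is hypothetical) =====
DEFINE clean_events(events) RETURNS cleaned {
    -- The actual (expensive) computation for this section.
    $cleaned = FILTER $events BY user_id IS NOT NULL;
    -- Manually save the output so the fake macro can reload it later.
    STORE $cleaned INTO '/tmp/pig_debug/section1_cleaned';
};

-- ===== file: section1_fake.pig (same name and signature) =====
DEFINE clean_events(events) RETURNS cleaned {
    -- The input relation is ignored; the saved result is loaded instead.
    -- The schema has to be restated by hand in the AS clause.
    $cleaned = LOAD '/tmp/pig_debug/section1_cleaned'
               AS (user_id:chararray, ts:long);
};

-- ===== file: main.pig =====
-- While debugging section 1:
IMPORT 'section1.pig';
-- Once section 1 is debugged, swap the line above for:
-- IMPORT 'section1_fake.pig';

raw     = LOAD 'input/events' AS (user_id:chararray, ts:long);
cleaned = clean_events(raw);

-- Section 2, the part still being debugged.
by_user = GROUP cleaned BY user_id;
counts  = FOREACH by_user GENERATE group AS user_id, COUNT(cleaned) AS n;
STORE counts INTO '/tmp/pig_debug/section2_counts';
```

Two things to watch with this sketch: STORE fails if the output location
already exists, so the tmp directory has to be removed before the real
macro is re-run, and the AS clause in the fake macro must be kept in sync
with the real macro's output schema by hand.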




On Tue, Oct 23, 2012 at 11:11 AM, Yang <te...@gmail.com> wrote:

> nice, thanks
>
> macros and mock.Storage() are both new to me, I believe it will help a lot
>
>
> On Mon, Oct 22, 2012 at 5:32 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>
>> Some testing tips:
>>
>> 1) parametrize your load/store statements so that if you have to run
>> in hadoop mode, it's easy to switch to debug inputs / outputs (and
>> debug input/output loaders and storers). It's vastly preferable to
>> test in local mode when possible, since the iterations are so much
>> faster.
>>
>> 2) it's a good thing that PigUnit makes you test small pieces of code!
>> Factor out macros so that you can create unit tests; don't copy and
>> paste code, use macros and the import statement.
>>
>> 3) Try using mock.Storage (see
>> https://issues.apache.org/jira/browse/PIG-2650) to automatically
>> create inputs and examine outputs in your unit tests, if you are on
>> Pig 0.11.
>>
>> D
>>
>> On Fri, Oct 19, 2012 at 12:01 PM, Yang <te...@gmail.com> wrote:
>> > I am using PigUnit, but it's somewhat limited: it can run only in local
>> > mode, so I can't find issues that come with fairly large test data; you
>> > have to create small snippets of code that you cut out manually from your
>> > original code, so after you have tested a snippet, you have to copy-paste
>> > it back into the production code, which introduces possible copy-paste
>> > errors. If you compare this to Java JUnit, this is really very crude: in
>> > Java, you have a class, and you can do JUnit testing on individual methods
>> > of the class, instead of having to copy-paste and create a special "test
>> > version" of that class.
>> >
>> >
>> > Overall, I feel that testability is an area where Pig could spend a lot
>> > more effort, and doing so would greatly benefit its wider adoption. Some
>> > other tools (Cascading, Cascalog, etc.) advertise testability as one of
>> > their important features.
>> >
>> > let me check out penny... thanks
>> >
>> > On Fri, Oct 19, 2012 at 2:18 AM, Jagat Singh <ja...@gmail.com> wrote:
>> >
>> >> Hello,
>> >>
>> >> I understand the pain :)
>> >>
>> >> Have you seen PigUnit and Penny?
>> >>
>> >> http://pig.apache.org/docs/r0.10.0/test.html
>> >>
>> >>
>> >>
>> >> On Fri, Oct 19, 2012 at 8:09 PM, Yang <te...@gmail.com> wrote:
>> >>
>> >> > One of the greatest pains I face when debugging Pig code is that the
>> >> > iteration cycles are really long: the applications for which we use Pig
>> >> > typically deal with large datasets, and if a Pig script involves many
>> >> > JOIN/GENERATE/FILTER steps, every step takes a lot of time. But every
>> >> > time I fix one step, I have to rerun from the start, which is pointless.
>> >> >
>> >> > What I am doing so far, to avoid wasting time re-running
>> >> > already-debugged steps, is to manually divide my script into many small
>> >> > scripts and save the last variable of each out to HDFS. Once a small
>> >> > script is debugged, I load that saved variable in the next small
>> >> > script.
>> >> >
>> >> > After all the small scripts are done, I connect them back manually into
>> >> > the original big script.
>> >> >
>> >> >
>> >> > Is there a way to automate this? For example, add a mark around a
>> >> > particular step that tells Pig that the result is to be saved and
>> >> > that none of the following steps are to be executed. Then, when we move
>> >> > on to the next step, it knows where to pick up the last-saved data.
>> >> >
>> >> > Writing a preprocessor to do the above is not trivial, so I can't whip
>> >> > up something immediately, because it needs to figure out the schemas of
>> >> > the variables that propagate through the steps.
>> >> >
>> >> >
>> >> > Thanks
>> >> > Yang
>> >> >
>> >>
>>
>
>