Posted to user@hadoop.apache.org by Björn-Elmar Macek <ma...@cs.uni-kassel.de> on 2012/10/01 18:11:28 UTC

HDFS "file" missing a part-file

Hi,

I am kind of unsure where to post this problem, but I think it is more
related to Hadoop than to Pig.

By successfully executing a Pig script I created a new file in my HDFS.
Sadly though, I cannot use it for further processing except for
"dump"ing and viewing the data: every data-manipulation script command
such as "foreach" throws exceptions during the map phase.
Since there was no problem executing the same script on the first 100
lines of my data (LIMIT statement), I copied the output to my local fs folder.
What I realized is that one of the files, namely part-r-000001, was empty
and contained within the _temporary folder.

Is there any reason for this? How can I fix this issue? Did the job
(which created the file we are talking about) NOT run properly until its
end, although the tasktracker worked until the very end and the file was
created?
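
A rough way to check this on the HDFS side (pig_out is only a placeholder
for the actual output path of the store; whether a _SUCCESS marker is
written depends on the Hadoop version and job configuration):

hadoop fs -ls pig_out              # all part-r-* files should be non-empty
hadoop fs -ls pig_out/_temporary   # should no longer exist after a clean job commit
hadoop fs -du pig_out              # quick way to spot a zero-byte part file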

Best regards,
Björn

Re: HDFS "file" missing a part-file

Posted by Robert Molina <rm...@hortonworks.com>.
What I guess might be happening is that your data may contain some text
that Pig is not fully parsing because the data contains characters
that Pig uses as delimiters (e.g. commas and curly brackets). Thus, you can
probably take a look at the data and see if you can find any of the
characters Pig uses to distinguish values, bags, and tuples. You also might
want to move this topic to the Pig forums to see if anyone else has faced a
similar issue.
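
A rough sketch of such a check from the command line (the relation name is
taken from the script quoted below; the unbalanced-curly-bracket scan is
only an illustrative heuristic, not something Pig itself provides):

# peek at how the stored records were serialized
hadoop fs -cat 'tag_count_ts_pro_userpair/part-*' | head -n 20
# count lines whose curly brackets do not balance, which would confuse the tuple/bag parser
hadoop fs -cat 'tag_count_ts_pro_userpair/part-*' | \
  awk '{ o = gsub(/\{/, "{"); c = gsub(/\}/, "}"); if (o != c) bad++ } END { print bad + 0, "suspicious lines" }'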

On Tue, Oct 2, 2012 at 5:26 AM, Björn-Elmar Macek <ma...@cs.uni-kassel.de> wrote:

> Hi again,
>
> I executed a slightly different script again that included some more
> operations. The logs look similar, but this time I have two attempt files for
> the same reduce task:
> (1) _temporary/_attempt_201210021204_0001_r_000001_0/part-r-00001
> (2) _temporary/_attempt_201210021204_0001_r_000001_1/part-r-00001
>
> To me they look like two results of the same reduce task - this time both
> are non-empty, unlike before, and each is roughly one block size, about 700 MB.
> I hoped that both files contained the same content, but "diff" showed me
> that this was not the case. I merged both files with a combination of "cat"
> and "sort -u": the result is a file of about 1.2 GB, which indicates to
> me that there were many different lines. I suppose the cluster didn't
> manage to compute this part-file, though I have no idea what makes this file
> so special that it is always this one which is corrupt(?).
>
> The worst solution would be for me to simply ignore this error and
> continue working with the merged file. Is there anybody who has experienced
> similar things?
> If there is a way to fix this, I would love to know how. Possible reasons
> for the problems are also very much appreciated! :)
>
>
> On 01.10.2012 22:36, Björn-Elmar Macek wrote:
>
>
>> The script I now want to execute looks like this:
>>
>> x = load 'tag_count_ts_pro_userpair' as
>>     (group:tuple(),cnt:int,times:bag{t:tuple(c:chararray)});
>> y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00',
>>     times);
>> store y into 'test_daysFromStart';
>>
>>
>> The problem is that I do not have the logs anymore due to space
>> constraints within the cluster. But I think I can explain the important
>> parts:
>> The script that created this data was a GROUP statement followed by a
>> FOREACH calculating a COUNT on the bag mentioned above as "times", which is
>> represented in the 2nd column named "cnt". The results were stored via a
>> simple "store".
>> The resulting Pig calculation started as expected, but stopped showing
>> me progress at a certain percentage. A "tail -f" on the hadoop/logs dir
>> revealed that the Hadoop calculation progressed nonetheless - although some
>> of the tasktrackers permanently vanished during the shuffle phase with the
>> committed/eof/mortbay exception and at least stopped producing any more log
>> output. As I watched the log continuously, I could see that those
>> work packages were handled by the remaining servers after some of them
>> had already computed packages to progress 1.0. Even the cleanup phase at the
>> end was done, ALTHOUGH(!) the Pig log didn't reflect the calculations of
>> the cluster. And since I found the file as output in HDFS, I supposed the
>> missing Pig progress log entries were simply Pig problems. Maybe I'm wrong
>> about that.
>>
>> But I did the calculations several times and this happened during every
>> execution.
>>
>> Is there something wrong with the data or the calculations?
>>
>>
>> On Mon, 1 Oct 2012 13:01:41 -0700, Robert Molina <rm...@hortonworks.com>
>> wrote:
>>
>>> It seems that maybe the previous Pig script didn't generate the output
>>> data or write it correctly to HDFS. Can you provide the Pig script you
>>> are trying to run? Also, for the original script that ran and
>>> generated the file, can you verify whether that job had any failed tasks?
>>>
>>> On Mon, Oct 1, 2012 at 10:31 AM, Björn-Elmar Macek  wrote:
>>>
>>>  Hi Robert,
>>>
>>>  the exception I see in the output of the grunt shell and in the Pig
>>> log, respectively, is:
>>>
>>>  Backend error message
>>>  ---------------------
>>>  java.util.EmptyStackException
>>>          at java.util.Stack.peek(Stack.java:102)
>>>          at org.apache.pig.builtin.Utf8StorageConverter.consumeTuple(Utf8StorageConverter.java:182)
>>>          at org.apache.pig.builtin.Utf8StorageConverter.bytesToTuple(Utf8StorageConverter.java:501)
>>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:905)
>>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
>>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
>>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
>>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:233)
>>>          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
>>>          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
>>>          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>>>          at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>          at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>          at java.security.AccessController.doPrivileged(Native Method)
>>>          at javax.security.auth.Subject.doAs(Subject.java:415)
>>>          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>>          at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>
>>>  On Mon, 1 Oct 2012 10:12:22 -0700, Robert Molina  wrote:
>>>
>>>  Hi Bjorn,
>>>  Can you post the exception you are getting during the map phase?
>>>
>>>  On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek  wrote:
>>>
>>>   Hi,
>>>
>>>   I am kind of unsure where to post this problem, but I think it is
>>>  more related to Hadoop than to Pig.
>>>
>>>   By successfully executing a Pig script I created a new file in my
>>>  HDFS. Sadly though, I cannot use it for further processing except for
>>>  "dump"ing and viewing the data: every data-manipulation script command
>>>  such as "foreach" throws exceptions during the map phase.
>>>   Since there was no problem executing the same script on the first 100
>>>  lines of my data (LIMIT statement), I copied the output to my local fs
>>>  folder.
>>>   What I realized is that one of the files, namely part-r-000001, was
>>>  empty and contained within the _temporary folder.
>>>
>>>   Is there any reason for this? How can I fix this issue? Did the job
>>>  (which created the file we are talking about) NOT run properly until
>>>  its end, although the tasktracker worked until the very end and the
>>>  file was created?
>>>
>>>   Best regards,
>>>   Björn
>>>
>>>
>>
>>
>>
>

Re: HDFS "file" missing a part-file

Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Hi again,

I executed a slightly different script again that included some more
operations. The logs look similar, but this time I have two attempt files
for the same reduce task:
(1) _temporary/_attempt_201210021204_0001_r_000001_0/part-r-00001
(2) _temporary/_attempt_201210021204_0001_r_000001_1/part-r-00001

To me they look like two results of the same reduce task - this time both
are non-empty, unlike before, and each is roughly one block size, about
700 MB. I hoped that both files contained the same content, but "diff"
showed me that this was not the case. I merged both files with a
combination of "cat" and "sort -u": the result is a file of about 1.2
GB, which indicates to me that there were many different lines. I
suppose the cluster didn't manage to compute this part-file, though I
have no idea what makes this file so special that it is always this one
which is corrupt(?).
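
For reference, the comparison and merge described above look roughly like
this (OUT is just a placeholder for the job's output directory in HDFS):

hadoop fs -get OUT/_temporary/_attempt_201210021204_0001_r_000001_0/part-r-00001 attempt0
hadoop fs -get OUT/_temporary/_attempt_201210021204_0001_r_000001_1/part-r-00001 attempt1
diff attempt0 attempt1 | head              # the two attempts disagree
cat attempt0 attempt1 | sort -u > merged   # ~1.2 GB of distinct lines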

The worst solution would be for me to simply ignore this error and
continue working with the merged file. Is there anybody who has
experienced similar things?
If there is a way to fix this, I would love to know how. Possible
reasons for the problems are also very much appreciated! :)


On 01.10.2012 22:36, Björn-Elmar Macek wrote:
>
> The script I now want to execute looks like this:
>
> x = load 'tag_count_ts_pro_userpair' as 
> (group:tuple(),cnt:int,times:bag{t:tuple(c:chararray)});
> y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00', 
> times);
> store y into 'test_daysFromStart';
>
>
> The problem is that I do not have the logs anymore due to space
> constraints within the cluster. But I think I can explain the
> important parts:
> The script that created this data was a GROUP statement followed by a
> FOREACH calculating a COUNT on the bag mentioned above as "times",
> which is represented in the 2nd column named "cnt". The results were
> stored via a simple "store".
> The resulting Pig calculation started as expected, but stopped
> showing me progress at a certain percentage. A "tail -f" on the
> hadoop/logs dir revealed that the Hadoop calculation progressed
> nonetheless - although some of the tasktrackers permanently vanished
> during the shuffle phase with the committed/eof/mortbay exception and
> at least stopped producing any more log output. As I watched the log
> continuously, I could see that those work packages were handled by the
> remaining servers after some of them had already computed packages to
> progress 1.0. Even the cleanup phase at the end was done, ALTHOUGH(!)
> the Pig log didn't reflect the calculations of the cluster. And since
> I found the file as output in HDFS, I supposed the missing Pig
> progress log entries were simply Pig problems. Maybe I'm wrong about
> that.
>
> But I did the calculations several times and this happened during
> every execution.
>
> Is there something wrong with the data or the calculations?
>
>
> On Mon, 1 Oct 2012 13:01:41 -0700, Robert Molina 
> <rm...@hortonworks.com> wrote:
>> It seems that maybe the previous Pig script didn't generate the output
>> data or write it correctly to HDFS. Can you provide the Pig script you
>> are trying to run? Also, for the original script that ran and
>> generated the file, can you verify whether that job had any failed tasks?
>>
>> On Mon, Oct 1, 2012 at 10:31 AM, Björn-Elmar Macek  wrote:
>>
>>  Hi Robert,
>>
>>  the exception I see in the output of the grunt shell and in the Pig
>> log, respectively, is:
>>
>>  Backend error message
>>  ---------------------
>>  java.util.EmptyStackException
>>          at java.util.Stack.peek(Stack.java:102)
>>          at org.apache.pig.builtin.Utf8StorageConverter.consumeTuple(Utf8StorageConverter.java:182)
>>          at org.apache.pig.builtin.Utf8StorageConverter.bytesToTuple(Utf8StorageConverter.java:501)
>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:905)
>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>>          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:233)
>>          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
>>          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
>>          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>>          at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>          at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>          at java.security.AccessController.doPrivileged(Native Method)
>>          at javax.security.auth.Subject.doAs(Subject.java:415)
>>          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>          at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>>  On Mon, 1 Oct 2012 10:12:22 -0700, Robert Molina  wrote:
>>
>>  Hi Bjorn,
>>  Can you post the exception you are getting during the map phase?
>>
>>  On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek  wrote:
>>
>>   Hi,
>>
>>   I am kind of unsure where to post this problem, but I think it is
>>  more related to Hadoop than to Pig.
>>
>>   By successfully executing a Pig script I created a new file in my
>>  HDFS. Sadly though, I cannot use it for further processing except for
>>  "dump"ing and viewing the data: every data-manipulation script command
>>  such as "foreach" throws exceptions during the map phase.
>>   Since there was no problem executing the same script on the first 100
>>  lines of my data (LIMIT statement), I copied the output to my local fs
>>  folder.
>>   What I realized is that one of the files, namely part-r-000001, was
>>  empty and contained within the _temporary folder.
>>
>>   Is there any reason for this? How can I fix this issue? Did the job
>>  (which created the file we are talking about) NOT run properly until
>>  its end, although the tasktracker worked until the very end and the
>>  file was created?
>>
>>   Best regards,
>>   Björn
>>
>
>


>> Method)
>>          at javax.security.auth.Subject.doAs(Subject.java:415)
>>          at
>>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) 
>>
>>          at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>>  On Mon, 1 Oct 2012 10:12:22 -0700, Robert Molina  wrote:
>>
>>  Hi Bjorn,
>>  Can you post the exception you are getting during the map phase?
>>
>>  On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek  wrote:
>>
>>   Hi,
>>
>>   i am kind of unsure where to post this problem, but i think it is
>>  more related to hadoop than to pig.
>>
>>   By successfully executing a pig script i created a new file in my
>>  hdfs. Sadly though, i cannot use it for further processing except for
>>  "dump"ing and viewing the data: every data-manipulation
>> script-command
>>  just as "foreach" gives exceptions during the map phase.
>>   Since there was no problem executing the same script on the first
>> 100
>>  lines of my data (LIMIT statement),i copied it to my local fs folder.
>>   What i realized is, that one of the files namely part-r-000001 was
>>  empty and contained within the _temporary folder.
>>
>>   Is there any reason for this? How can i fix this issue? Did the job
>>  (which created the file we are talking about) NOT run properly til
>> its
>>  end, although the tasktracker worked til the very end and the file
>> was
>>  created?
>>
>>   Best regards,
>>   Björn
>>
>>  Links:
>>  ------
>>  [1] mailto:macek@cs.uni-kassel.de [3]
>>
>>
>>
>> Links:
>> ------
>> [1] mailto:ema@cs.uni-kassel.de
>> [2] mailto:rmolina@hortonworks.com
>> [3] mailto:macek@cs.uni-kassel.de
>
>


Re: HDFS "file" missing a part-file

Posted by Björn-Elmar Macek <em...@cs.uni-kassel.de>.
 The script I now want to execute looks like this:

 x = load 'tag_count_ts_pro_userpair' as 
 (group:tuple(),cnt:int,times:bag{t:tuple(c:chararray)});
 y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00', 
 times);
 store y into 'test_daysFromStart';


 The problem is that I do not have the logs anymore due to space 
 constraints within the cluster. But I think I can explain the important 
 parts:
 The script that created this data was a GROUP statement followed by a 
 FOREACH calculating a COUNT on the bag mentioned above as "times", which 
 is represented in the second column named "cnt". The results were stored 
 via a simple "store".
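
 A minimal sketch of what that generating script could have looked like - 
 the input path, field names and grouping key are assumptions; only the 
 GROUP, FOREACH and COUNT structure is taken from the description above:

 raw     = load 'some_input' as (userpair:chararray, time:chararray); -- hypothetical input
 grp     = group raw by userpair;                                     -- the GROUP statement
 counted = foreach grp generate group, COUNT(raw.time) as cnt,        -- COUNT over the bag
                                raw.time as times;                    -- keep the bag of timestamps
 store counted into 'tag_count_ts_pro_userpair';                      -- the simple "store"
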
 The resulting Pig job started as expected, but at a certain percentage it 
 stopped showing me any progress. A "tail -f" on the hadoop/logs dir 
 revealed that the Hadoop calculation progressed nonetheless - although 
 some of the tasktrackers permanently vanished during the shuffle phase 
 with the committed/eof/mortbay exception and at least stopped producing 
 any more log output. As I watched the logs continuously, I could see 
 that those work packages were picked up by the remaining servers after 
 some of them had already finished packages at progress 1.0. Even the 
 cleanup phase at the end was done, ALTHOUGH(!) the pig log didn't 
 reflect the calculations of the cluster. And since I found the file as 
 output in HDFS, I assumed the missing Pig progress log entries were 
 simply a Pig problem. Maybe I'm wrong about that.

 But I ran the calculations several times, and this happened during every 
 execution.

 Is there something wrong with the data or the calculations?


 On Mon, 1 Oct 2012 13:01:41 -0700, Robert Molina 
 <rm...@hortonworks.com> wrote:
> It seems that maybe the previous pig script didn't generate the 
> output
> data or write correctly on hdfs. Can you provide the pig script you
> are trying to run?  Also, for the original script that ran and
> generated the file, can you verify if that job had any failed tasks?
>
> On Mon, Oct 1, 2012 at 10:31 AM, Björn-Elmar Macek  wrote:
>
>  Hi Robert,
>
>  the exception i see in the output of the grunt shell and in the pig
> log respectively is:
>
>  Backend error message
>  ---------------------
>  java.util.EmptyStackException
>          at java.util.Stack.peek(Stack.java:102)
>          at
> 
> org.apache.pig.builtin.Utf8StorageConverter.consumeTuple(Utf8StorageConverter.java:182)
>          at
> 
> org.apache.pig.builtin.Utf8StorageConverter.bytesToTuple(Utf8StorageConverter.java:501)
>          at
> 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:905)
>          at
> 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
>          at
> 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
>          at
> 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
>          at
> 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>          at
> 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:233)
>          at
> 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
>          at
> 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
>          at
> 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>          at
> org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>          at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>          at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>          at java.security.AccessController.doPrivileged(Native
> Method)
>          at javax.security.auth.Subject.doAs(Subject.java:415)
>          at
> 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>          at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>  On Mon, 1 Oct 2012 10:12:22 -0700, Robert Molina  wrote:
>
>  Hi Bjorn, 
>  Can you post the exception you are getting during the map phase?
>
>  On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek  wrote:
>
>   Hi,
>
>   i am kind of unsure where to post this problem, but i think it is
>  more related to hadoop than to pig.
>
>   By successfully executing a pig script i created a new file in my
>  hdfs. Sadly though, i cannot use it for further processing except 
> for
>  "dump"ing and viewing the data: every data-manipulation
> script-command
>  just as "foreach" gives exceptions during the map phase.
>   Since there was no problem executing the same script on the first
> 100
>  lines of my data (LIMIT statement),i copied it to my local fs 
> folder.
>   What i realized is, that one of the files namely part-r-000001 was
>  empty and contained within the _temporary folder.
>
>   Is there any reason for this? How can i fix this issue? Did the job
>  (which created the file we are talking about) NOT run properly til
> its
>  end, although the tasktracker worked til the very end and the file
> was
>  created?
>
>   Best regards,
>   Björn
>
>  Links:
>  ------
>  [1] mailto:macek@cs.uni-kassel.de [3]
>
>
>
> Links:
> ------
> [1] mailto:ema@cs.uni-kassel.de
> [2] mailto:rmolina@hortonworks.com
> [3] mailto:macek@cs.uni-kassel.de


Re: HDFS "file" missing a part-file

Posted by Robert Molina <rm...@hortonworks.com>.
It seems that the previous Pig script may not have generated the output data
correctly, or may not have written it to HDFS correctly. Can you provide the
Pig script you are trying to run?  Also, for the original script that ran and
generated the file, can you verify whether that job had any failed tasks?
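
For example, a minimal sketch of how the stored data could be spot-checked 
from the grunt shell before re-running the full job (the load statement is 
copied from the script discussed elsewhere in this thread; the relation 
names and the row count are only examples):

x = load 'tag_count_ts_pro_userpair' as
    (group:tuple(),cnt:int,times:bag{t:tuple(c:chararray)});
sample_rows = limit x 100;  -- look at a handful of rows only, as in the earlier test
dump sample_rows;           -- print them to the console
describe x;                 -- show the schema Pig assumes for the relation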


On Mon, Oct 1, 2012 at 10:31 AM, Björn-Elmar Macek <em...@cs.uni-kassel.de>wrote:

>
> Hi Robert,
>
> the exception i see in the output of the grunt shell and in the pig log
> respectively is:
>
>
> Backend error message
> ---------------------
> java.util.EmptyStackException
>         at java.util.Stack.peek(Stack.**java:102)
>         at org.apache.pig.builtin.**Utf8StorageConverter.**consumeTuple(**
> Utf8StorageConverter.java:182)
>         at org.apache.pig.builtin.**Utf8StorageConverter.**bytesToTuple(**
> Utf8StorageConverter.java:501)
>         at org.apache.pig.backend.hadoop.**executionengine.physicalLayer.*
> *expressionOperators.POCast.**getNext(POCast.java:905)
>         at org.apache.pig.backend.hadoop.**executionengine.physicalLayer.*
> *PhysicalOperator.getNext(**PhysicalOperator.java:334)
>         at org.apache.pig.backend.hadoop.**executionengine.physicalLayer.*
> *relationalOperators.POForEach.**processPlan(POForEach.java:**332)
>         at org.apache.pig.backend.hadoop.**executionengine.physicalLayer.*
> *relationalOperators.POForEach.**getNext(POForEach.java:284)
>         at org.apache.pig.backend.hadoop.**executionengine.physicalLayer.*
> *PhysicalOperator.processInput(**PhysicalOperator.java:290)
>         at org.apache.pig.backend.hadoop.**executionengine.physicalLayer.*
> *relationalOperators.POForEach.**getNext(POForEach.java:233)
>         at org.apache.pig.backend.hadoop.**executionengine.**
> mapReduceLayer.**PigGenericMapBase.runPipeline(**
> PigGenericMapBase.java:271)
>         at org.apache.pig.backend.hadoop.**executionengine.**
> mapReduceLayer.**PigGenericMapBase.map(**PigGenericMapBase.java:266)
>         at org.apache.pig.backend.hadoop.**executionengine.**
> mapReduceLayer.**PigGenericMapBase.map(**PigGenericMapBase.java:64)
>         at org.apache.hadoop.mapreduce.**Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.**MapTask.runNewMapper(MapTask.**
> java:764)
>         at org.apache.hadoop.mapred.**MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.**Child$4.run(Child.java:255)
>         at java.security.**AccessController.doPrivileged(**Native Method)
>         at javax.security.auth.Subject.**doAs(Subject.java:415)
>         at org.apache.hadoop.security.**UserGroupInformation.doAs(**
> UserGroupInformation.java:**1121)
>         at org.apache.hadoop.mapred.**Child.main(Child.java:249)
>
>
>
>
>
> On Mon, 1 Oct 2012 10:12:22 -0700, Robert Molina <rm...@hortonworks.com>
> wrote:
>
>> Hi Bjorn,
>> Can you post the exception you are getting during the map phase?
>>
>> On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek  wrote:
>>
>>  Hi,
>>
>>  i am kind of unsure where to post this problem, but i think it is
>> more related to hadoop than to pig.
>>
>>  By successfully executing a pig script i created a new file in my
>> hdfs. Sadly though, i cannot use it for further processing except for
>> "dump"ing and viewing the data: every data-manipulation script-command
>> just as "foreach" gives exceptions during the map phase.
>>  Since there was no problem executing the same script on the first 100
>> lines of my data (LIMIT statement),i copied it to my local fs folder.
>>  What i realized is, that one of the files namely part-r-000001 was
>> empty and contained within the _temporary folder.
>>
>>  Is there any reason for this? How can i fix this issue? Did the job
>> (which created the file we are talking about) NOT run properly til its
>> end, although the tasktracker worked til the very end and the file was
>> created?
>>
>>  Best regards,
>>  Björn
>>
>>
>>
>> Links:
>> ------
>> [1] mailto:macek@cs.uni-kassel.de
>>
>
>

Re: HDFS "file" missing a part-file

Posted by Björn-Elmar Macek <em...@cs.uni-kassel.de>.
 Hi Robert,

 the exception I see both in the output of the grunt shell and in the pig 
 log is:


 Backend error message
 ---------------------
 java.util.EmptyStackException
         at java.util.Stack.peek(Stack.java:102)
         at 
 org.apache.pig.builtin.Utf8StorageConverter.consumeTuple(Utf8StorageConverter.java:182)
         at 
 org.apache.pig.builtin.Utf8StorageConverter.bytesToTuple(Utf8StorageConverter.java:501)
         at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:905)
         at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
         at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
         at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
         at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
         at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:233)
         at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
         at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
         at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
         at 
 org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:415)
         at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
         at org.apache.hadoop.mapred.Child.main(Child.java:249)
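
Since the stack trace fails while the text is being parsed back into a 
tuple (Utf8StorageConverter.consumeTuple via POCast), one way to inspect 
the raw data without any tuple or bag parsing is the following minimal 
sketch, with 'test_input' standing in for the affected output directory:

raw = load 'test_input' using TextLoader() as (line:chararray); -- whole lines, no casts
few = limit raw 20;                                             -- take a small sample
dump few;                                                       -- eyeball the raw text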




 On Mon, 1 Oct 2012 10:12:22 -0700, Robert Molina 
 <rm...@hortonworks.com> wrote:
> Hi Bjorn, 
> Can you post the exception you are getting during the map phase?
>
> On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek  wrote:
>  Hi,
>
>  i am kind of unsure where to post this problem, but i think it is
> more related to hadoop than to pig.
>
>  By successfully executing a pig script i created a new file in my
> hdfs. Sadly though, i cannot use it for further processing except for
> "dump"ing and viewing the data: every data-manipulation 
> script-command
> just as "foreach" gives exceptions during the map phase.
>  Since there was no problem executing the same script on the first 
> 100
> lines of my data (LIMIT statement),i copied it to my local fs folder.
>  What i realized is, that one of the files namely part-r-000001 was
> empty and contained within the _temporary folder.
>
>  Is there any reason for this? How can i fix this issue? Did the job
> (which created the file we are talking about) NOT run properly til 
> its
> end, although the tasktracker worked til the very end and the file 
> was
> created?
>
>  Best regards,
>  Björn
>
>
>
> Links:
> ------
> [1] mailto:macek@cs.uni-kassel.de


Re: HDFS "file" missing a part-file

Posted by Robert Molina <rm...@hortonworks.com>.
Hi Bjorn,
Can you post the exception you are getting during the map phase?



On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek <ma...@cs.uni-kassel.de> wrote:

> Hi,
>
> i am kind of unsure where to post this problem, but i think it is more
> related to hadoop than to pig.
>
> By successfully executing a pig script i created a new file in my hdfs.
> Sadly though, i cannot use it for further processing except for "dump"ing
> and viewing the data: every data-manipulation script-command just as
> "foreach" gives exceptions during the map phase.
> Since there was no problem executing the same script on the first 100
> lines of my data (LIMIT statement),i copied it to my local fs folder.
> What i realized is, that one of the files namely part-r-000001 was empty
> and contained within the _temporary folder.
>
> Is there any reason for this? How can i fix this issue? Did the job (which
> created the file we are talking about) NOT run properly til its end,
> although the tasktracker worked til the very end and the file was created?
>
> Best regards,
> Björn
>

Re: HDFS "file" missing a part-file

Posted by Robert Molina <rm...@hortonworks.com>.
Hi Bjorn,
Can you post the exception you are getting during the map phase?



On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek <ma...@cs.uni-kassel.de>wrote:

> Hi,
>
> i am kind of unsure where to post this problem, but i think it is more
> related to hadoop than to pig.
>
> By successfully executing a pig script i created a new file in my hdfs.
> Sadly though, i cannot use it for further processing except for "dump"ing
> and viewing the data: every data-manipulation script-command just as
> "foreach" gives exceptions during the map phase.
> Since there was no problem executing the same script on the first 100
> lines of my data (LIMIT statement),i copied it to my local fs folder.
> What i realized is, that one of the files namely part-r-000001 was empty
> and contained within the _temporary folder.
>
> Is there any reason for this? How can i fix this issue? Did the job (which
> created the file we are talking about) NOT run properly til its end,
> although the tasktracker worked til the very end and the file was created?
>
> Best regards,
> Björn
>

Re: HDFS "file" missing a part-file

Posted by Robert Molina <rm...@hortonworks.com>.
Hi Bjorn,
Can you post the exception you are getting during the map phase?



On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek <ma...@cs.uni-kassel.de>wrote:

> Hi,
>
> i am kind of unsure where to post this problem, but i think it is more
> related to hadoop than to pig.
>
> By successfully executing a pig script i created a new file in my hdfs.
> Sadly though, i cannot use it for further processing except for "dump"ing
> and viewing the data: every data-manipulation script-command just as
> "foreach" gives exceptions during the map phase.
> Since there was no problem executing the same script on the first 100
> lines of my data (LIMIT statement),i copied it to my local fs folder.
> What i realized is, that one of the files namely part-r-000001 was empty
> and contained within the _temporary folder.
>
> Is there any reason for this? How can i fix this issue? Did the job (which
> created the file we are talking about) NOT run properly til its end,
> although the tasktracker worked til the very end and the file was created?
>
> Best regards,
> Björn
>

Re: HDFS "file" missing a part-file

Posted by Robert Molina <rm...@hortonworks.com>.
Hi Bjorn,
Can you post the exception you are getting during the map phase?



On Mon, Oct 1, 2012 at 9:11 AM, Björn-Elmar Macek <ma...@cs.uni-kassel.de>wrote:

> Hi,
>
> i am kind of unsure where to post this problem, but i think it is more
> related to hadoop than to pig.
>
> By successfully executing a pig script i created a new file in my hdfs.
> Sadly though, i cannot use it for further processing except for "dump"ing
> and viewing the data: every data-manipulation script-command just as
> "foreach" gives exceptions during the map phase.
> Since there was no problem executing the same script on the first 100
> lines of my data (LIMIT statement),i copied it to my local fs folder.
> What i realized is, that one of the files namely part-r-000001 was empty
> and contained within the _temporary folder.
>
> Is there any reason for this? How can i fix this issue? Did the job (which
> created the file we are talking about) NOT run properly til its end,
> although the tasktracker worked til the very end and the file was created?
>
> Best regards,
> Björn
>