Posted to user@crunch.apache.org by David Ortiz <dp...@gmail.com> on 2019/09/23 17:40:07 UTC

Re: AvroParquetPathPerKeyTarget with Spark

Josh,

     To circle back to this after forever, I was finally able to get permission to hand this off.  I attached it to CRUNCH-670.

Thanks,
     Dave

> On May 25, 2018, at 7:21 PM, David Ortiz <dp...@gmail.com> wrote:
> 
> The answer from my employer on uploading a patch seems to be no. Double-checking on that, though.
> 
> On Fri, May 25, 2018, 4:40 PM Josh Wills <josh.wills@gmail.com> wrote:
> CRUNCH-670 is the issue, FWIW
> 
> On Fri, May 25, 2018 at 1:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
> Ah, of course-- nice detective work! Can you send me the code so I can patch it in?
> 
> On Fri, May 25, 2018 at 1:33 PM, David Ortiz <dpo5003@gmail.com> wrote:
> Josh,
> 
>      When I dug into the code a little more, I saw that both AvroPathPerKeyOutputFormat and AvroParquetPathPerKeyOutputFormat use "part" as the default basePath when "mapreduce.output.basename" has no value.  My guess is that when running via a SparkPipeline, that value is not set.  I changed my local copy to use "out0" as the defaultValue instead of "part", and the job was able to write output successfully.
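> 
>      To illustrate, here is a minimal sketch of the lookup I mean -- not the actual Crunch source, just the shape of it; only the property key "mapreduce.output.basename" and the "part"/"out0" values come from this thread:
> 
>     import org.apache.hadoop.conf.Configuration;
> 
>     public class BaseNameDefault {
>       // Hadoop's standard property key for the output file base name.
>       static final String BASE_NAME_KEY = "mapreduce.output.basename";
> 
>       // The fallback value is the crux: the local fix changes it from
>       // "part" to "out0" so it matches what the copy step looks for.
>       static String baseName(Configuration conf) {
>         return conf.get(BASE_NAME_KEY, "part");
>       }
> 
>       public static void main(String[] args) {
>         Configuration conf = new Configuration(false);
>         System.out.println(baseName(conf)); // "part": what gets written under Spark
>         conf.set(BASE_NAME_KEY, "out0");
>         System.out.println(baseName(conf)); // "out0": what the copy step expects
>       }
>     }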
> 
> Thanks,
>     Dave
> 
> On Fri, May 25, 2018 at 2:20 PM David Ortiz <dpo5003@gmail.com> wrote:
> Josh,
> 
>      After cleaning up the logs a little bit I noticed this.
> 
> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p12/out0
> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p13/out0
> 
> When I look in those tmp directories while the job runs, the output is actually being written to the subdirectory "part" rather than "out0", so that would be another reason it's having issues.  Any thoughts on where that output path is coming from?  If you point me in the right direction I can try to figure it out.
> 
> Thanks,
>      Dave
> 
> On Fri, May 25, 2018 at 2:01 PM David Ortiz <dpo5003@gmail.com> wrote:
> Hey Josh,
> 
>      I am still messing around with it a little bit, but I seem to be getting the same behavior even after rebuilding with the patch.
> 
> Thanks,
>      Dave
> 
> On Thu, May 24, 2018 at 1:50 AM Josh Wills <josh.wills@gmail.com> wrote:
> David,
> 
> Take a look at CRUNCH-670; I think that patch fixes the problem in the most minimal way I can think of.
> 
> https://issues.apache.org/jira/browse/CRUNCH-670
> 
> J
> 
> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <josh.wills@gmail.com> wrote:
> I think that must be it, Dave, but I can't for the life of me figure out where in the code that's happening. Will take another look tonight.
> 
> J
> 
> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <dpo5003@gmail.com> wrote:
> Josh,
> 
>      Is there any chance that the output path for AvroParquetPathPerKey is somehow getting twisted up when it goes through the compilation step?  Watching it while it runs, the output in the /tmp/crunch/p<stage> directory basically looks like what I would expect to see in the output directory.  AvroPathPerKeyTarget also showed similar behavior when I was experimenting to see whether it would work.
> 
> Thanks,
>      Dave
> 
> On Thu, May 17, 2018 at 6:03 PM Josh Wills <josh.wills@gmail.com> wrote:
> Hrm, got it-- now at least I know where to look (although I'm surprised that overriding finalize() didn't fix it, as I ran into similar problems with my own cluster and created a SlackPipeline class that overrides that method).
> 
> 
> J
> 
> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <dpo5003@gmail.com> wrote:
> Josh,
> 
>      Those adjustments (overriding finalize() with an empty block when creating the SparkPipeline, and running with pipeline.run() instead of done()) did not appear to do anything to stop the tmp directory from being removed at the end of the job execution.  However, I can confirm that I see the stage output for the two output directories, complete with Parquet files partitioned by key.  Neither they nor anything else ever makes it to the output directory, which is never even created.
> 
> Thanks,
>      Dave
> 
> On Fri, May 11, 2018 at 8:24 AM David Ortiz <dpo5003@gmail.com> wrote:
> Hey Josh,
> 
>      Thanks for taking a look.  I can definitely play with that on Monday when I'm back at work.
> 
> Thanks,
>      Dave
> 
> On Fri, May 11, 2018 at 1:46 AM Josh Wills <josh.wills@gmail.com> wrote:
> Hey David,
> 
> Looking at the code, the problem isn't obvious to me, but there are only two places things could be going wrong: writing the data out of Spark into the temp directory where intermediate outputs get stored (i.e., Spark isn't writing the data out for some reason), or moving the data from the temp directory to the final location.
> 
> The temp data is usually deleted at the end of a Crunch run, but you can disable this by a) not calling Pipeline.cleanup or Pipeline.done at the end of the run, and b) subclassing SparkPipeline with dummy code that overrides the finalize() method (which is implemented in the top-level DistributedPipeline abstract base class) to be a no-op. Is that easy to try out, to see if we can isolate the source of the error? Otherwise I can play with this a bit tomorrow on my own cluster.
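> 
> A minimal sketch of what I mean by (b) -- the SparkPipeline("local", appName) constructor here is an assumption for illustration; the essential parts are the empty finalize() body and calling run() rather than done():
> 
>     import org.apache.crunch.Pipeline;
>     import org.apache.crunch.impl.spark.SparkPipeline;
> 
>     public class KeepTempDirs {
>       public static void main(String[] args) {
>         // Anonymous subclass whose no-op finalize() skips the temp-directory
>         // cleanup inherited from DistributedPipeline, so intermediate outputs
>         // under /tmp/crunch-* survive for inspection.
>         Pipeline pipeline = new SparkPipeline("local", "keep-temp-dirs") {
>           @Override
>           protected void finalize() {
>             // intentionally empty: leave the temp directory in place
>           }
>         };
>         // ... build the reads and writes here as usual ...
>         pipeline.run(); // run() rather than done(), so cleanup() is never invoked
>       }
>     }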
> 
> J
> 
> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <dpo5003@gmail.com> wrote:
> Awesome.  Thanks for taking a look!
> 
> On Thu, May 10, 2018 at 5:18 PM Josh Wills <josh.wills@gmail.com> wrote:
> Hrm, that sounds like something is wrong with the commit operation on the Spark side; let me take a look at it this evening!
> 
> J
> 
> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <dpo5003@gmail.com> wrote:
> Hello,
> 
>      Are there any known issues with the AvroParquetPathPerKeyTarget when running a Spark pipeline?  When I run my pipeline with MapReduce, I get output; when I run it with Spark, the preceding step where I list out my partition keys (we use them to add partitions to Hive) shows data being present, but the output directory remains empty.  This behavior occurs when targeting both HDFS and S3 directly.
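> 
>      For reference, a minimal sketch of the kind of per-key write in question, shown with AvroPathPerKeyTarget (the plain-Avro sibling target also discussed in this thread) so the example stays self-contained; the constructors, paths, and key-extraction logic here are illustrative assumptions, not the actual job:
> 
>     import org.apache.crunch.MapFn;
>     import org.apache.crunch.PTable;
>     import org.apache.crunch.Pair;
>     import org.apache.crunch.Pipeline;
>     import org.apache.crunch.impl.spark.SparkPipeline;
>     import org.apache.crunch.io.avro.AvroPathPerKeyTarget;
>     import org.apache.crunch.types.avro.Avros;
>     import org.apache.hadoop.fs.Path;
> 
>     public class PerKeyWriteSketch {
>       public static void main(String[] args) {
>         Pipeline pipeline = new SparkPipeline("local", "per-key-write");
>         PTable<String, String> byPartition = pipeline
>             .readTextFile("/data/in")
>             .parallelDo(new MapFn<String, Pair<String, String>>() {
>               @Override
>               public Pair<String, String> map(String line) {
>                 // first comma-separated column is the partition key
>                 return Pair.of(line.split(",", 2)[0], line);
>               }
>             }, Avros.tableOf(Avros.strings(), Avros.strings()));
>         // each distinct key should become a subdirectory under /data/out
>         byPartition.write(new AvroPathPerKeyTarget(new Path("/data/out")));
>         pipeline.done();
>       }
>     }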
> 
> Thanks,
>      Dave
> 


Re: AvroParquetPathPerKeyTarget with Spark

Posted by Josh Wills <jo...@gmail.com>.
Yay!! Thanks Dave!
