You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@crunch.apache.org by Bryan Baugher <bj...@gmail.com> on 2015/07/01 21:29:04 UTC

Plan Dotfile in Configuration

We recently ran into an issue where our code to serialize a pipeline's
configuration was throwing an exception because one of the key/values in
the config was too big (65k characters). We found this key/value was
'crunch.planner.dotfile' which is included in the pipelines config from
Crunch.

My question is why does Crunch provide this value into the config object?

Crunch saves the dotfile string in the MRExecutor context[1] and I don't
think any pipeline would need this at runtime. It also seems like there are
no references to this config value anywhere within Crunch other then to
write the value into the config object.

[1] -
https://github.com/apache/crunch/blob/d176778cf803374506cb7743069a05e28e07e2cf/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/DotfileUtills.java#L139-L140

Re: Plan Dotfile in Configuration

Posted by Gabriel Reid <ga...@gmail.com>.
Ok, thanks for the info Bryan.

I'll log a jira ticket for making the change that I mentioned in my
previous mail.

- Gabriel


On Thu, Jul 2, 2015 at 4:13 PM Bryan Baugher <bj...@gmail.com> wrote:

> The code throwing the exception was our code. We can get around it for
> now. Just thought it was strange to see such a large config value.
>
> On Thu, Jul 2, 2015 at 12:26 AM Gabriel Reid <ga...@gmail.com>
> wrote:
>
>> From what I remember, the original intent of storing the dotfile in the
>> Configuration object was:
>> * to have a simple way of making it available to clients without putting
>> anything dotfile-specific in an API (because dotfiles are only available
>> for MR pipelines)
>> * making it available as soon as MRPipeline.plan is called (which rules
>> out putting it in the PipelineResult object)
>> * also making it easy to get at after a full pipeline had been run
>>
>> Looking at how things are now since CRUNCH-418 and CRUNCH-438, I think
>> that most of the above points are no longer valid. The dotfile can be
>> retrieved via the MRExecutor that is returned from MRPipeline.plan, and can
>> be automatically written to an output directory. The only reason to keep it
>> around in the Configuration object is for backwards compatibility.
>>
>> What I would propose is that we deprecate
>> PlanningParameters#PIPELINE_PLAN_DOTFILE, and remove it in an upcoming
>> release. That means we probably still need to work around the issue that
>> Brian is encountering though.
>>
>> @Brian, was the code throwing the exception your own code, or is there a
>> hard limit in the Configuration class somewhere? My initial thought is that
>> we could skip adding the dotfile to the Configuration that is serialized,
>> and only add it when we return the Configuration from
>> MRPipeline.getConfiguration.
>>
>> - Gabriel
>>
>>
>>
>> On Thu, Jul 2, 2015 at 12:05 AM Christian Tzolov <
>> christian.tzolov@gmail.com> wrote:
>>
>>> Hi Bryan, Josh,
>>>
>>> IIRC this comes from the original dotfile jobplan implementation. I kept
>>> it for backward compatible. You can see that only the "jobplan" (e.g. the
>>> original/main plan) is stored in the Configuration.
>>>
>>> +Gabriel i am not sure I remember the original intent to have the
>>> jobplan stored in the Configuration?
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jul 1, 2015 at 11:02 PM, Josh Wills <jo...@gmail.com>
>>> wrote:
>>>
>>>> +Christian
>>>>
>>>> I'm not sure what the intent was there-- Christian?
>>>>
>>>> J
>>>>
>>>> On Wed, Jul 1, 2015 at 12:29 PM, Bryan Baugher <bj...@gmail.com>
>>>> wrote:
>>>>
>>>>> We recently ran into an issue where our code to serialize a pipeline's
>>>>> configuration was throwing an exception because one of the key/values in
>>>>> the config was too big (65k characters). We found this key/value was
>>>>> 'crunch.planner.dotfile' which is included in the pipelines config from
>>>>> Crunch.
>>>>>
>>>>> My question is why does Crunch provide this value into the config
>>>>> object?
>>>>>
>>>>> Crunch saves the dotfile string in the MRExecutor context[1] and I
>>>>> don't think any pipeline would need this at runtime. It also seems like
>>>>> there are no references to this config value anywhere within Crunch other
>>>>> then to write the value into the config object.
>>>>>
>>>>> [1] -
>>>>> https://github.com/apache/crunch/blob/d176778cf803374506cb7743069a05e28e07e2cf/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/DotfileUtills.java#L139-L140
>>>>>
>>>>>
>>>>
>>>

Re: Plan Dotfile in Configuration

Posted by Bryan Baugher <bj...@gmail.com>.
The code throwing the exception was our code. We can get around it for now.
Just thought it was strange to see such a large config value.

On Thu, Jul 2, 2015 at 12:26 AM Gabriel Reid <ga...@gmail.com> wrote:

> From what I remember, the original intent of storing the dotfile in the
> Configuration object was:
> * to have a simple way of making it available to clients without putting
> anything dotfile-specific in an API (because dotfiles are only available
> for MR pipelines)
> * making it available as soon as MRPipeline.plan is called (which rules
> out putting it in the PipelineResult object)
> * also making it easy to get at after a full pipeline had been run
>
> Looking at how things are now since CRUNCH-418 and CRUNCH-438, I think
> that most of the above points are no longer valid. The dotfile can be
> retrieved via the MRExecutor that is returned from MRPipeline.plan, and can
> be automatically written to an output directory. The only reason to keep it
> around in the Configuration object is for backwards compatibility.
>
> What I would propose is that we deprecate
> PlanningParameters#PIPELINE_PLAN_DOTFILE, and remove it in an upcoming
> release. That means we probably still need to work around the issue that
> Brian is encountering though.
>
> @Brian, was the code throwing the exception your own code, or is there a
> hard limit in the Configuration class somewhere? My initial thought is that
> we could skip adding the dotfile to the Configuration that is serialized,
> and only add it when we return the Configuration from
> MRPipeline.getConfiguration.
>
> - Gabriel
>
>
>
> On Thu, Jul 2, 2015 at 12:05 AM Christian Tzolov <
> christian.tzolov@gmail.com> wrote:
>
>> Hi Bryan, Josh,
>>
>> IIRC this comes from the original dotfile jobplan implementation. I kept
>> it for backward compatible. You can see that only the "jobplan" (e.g. the
>> original/main plan) is stored in the Configuration.
>>
>> +Gabriel i am not sure I remember the original intent to have the jobplan
>> stored in the Configuration?
>>
>>
>>
>>
>>
>> On Wed, Jul 1, 2015 at 11:02 PM, Josh Wills <jo...@gmail.com> wrote:
>>
>>> +Christian
>>>
>>> I'm not sure what the intent was there-- Christian?
>>>
>>> J
>>>
>>> On Wed, Jul 1, 2015 at 12:29 PM, Bryan Baugher <bj...@gmail.com> wrote:
>>>
>>>> We recently ran into an issue where our code to serialize a pipeline's
>>>> configuration was throwing an exception because one of the key/values in
>>>> the config was too big (65k characters). We found this key/value was
>>>> 'crunch.planner.dotfile' which is included in the pipelines config from
>>>> Crunch.
>>>>
>>>> My question is why does Crunch provide this value into the config
>>>> object?
>>>>
>>>> Crunch saves the dotfile string in the MRExecutor context[1] and I
>>>> don't think any pipeline would need this at runtime. It also seems like
>>>> there are no references to this config value anywhere within Crunch other
>>>> then to write the value into the config object.
>>>>
>>>> [1] -
>>>> https://github.com/apache/crunch/blob/d176778cf803374506cb7743069a05e28e07e2cf/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/DotfileUtills.java#L139-L140
>>>>
>>>>
>>>
>>

Re: Plan Dotfile in Configuration

Posted by Gabriel Reid <ga...@gmail.com>.
>From what I remember, the original intent of storing the dotfile in the
Configuration object was:
* to have a simple way of making it available to clients without putting
anything dotfile-specific in an API (because dotfiles are only available
for MR pipelines)
* making it available as soon as MRPipeline.plan is called (which rules out
putting it in the PipelineResult object)
* also making it easy to get at after a full pipeline had been run

Looking at how things are now since CRUNCH-418 and CRUNCH-438, I think that
most of the above points are no longer valid. The dotfile can be retrieved
via the MRExecutor that is returned from MRPipeline.plan, and can be
automatically written to an output directory. The only reason to keep it
around in the Configuration object is for backwards compatibility.

What I would propose is that we deprecate
PlanningParameters#PIPELINE_PLAN_DOTFILE, and remove it in an upcoming
release. That means we probably still need to work around the issue that
Brian is encountering though.

@Brian, was the code throwing the exception your own code, or is there a
hard limit in the Configuration class somewhere? My initial thought is that
we could skip adding the dotfile to the Configuration that is serialized,
and only add it when we return the Configuration from
MRPipeline.getConfiguration.

- Gabriel



On Thu, Jul 2, 2015 at 12:05 AM Christian Tzolov <ch...@gmail.com>
wrote:

> Hi Bryan, Josh,
>
> IIRC this comes from the original dotfile jobplan implementation. I kept
> it for backward compatible. You can see that only the "jobplan" (e.g. the
> original/main plan) is stored in the Configuration.
>
> +Gabriel i am not sure I remember the original intent to have the jobplan
> stored in the Configuration?
>
>
>
>
>
> On Wed, Jul 1, 2015 at 11:02 PM, Josh Wills <jo...@gmail.com> wrote:
>
>> +Christian
>>
>> I'm not sure what the intent was there-- Christian?
>>
>> J
>>
>> On Wed, Jul 1, 2015 at 12:29 PM, Bryan Baugher <bj...@gmail.com> wrote:
>>
>>> We recently ran into an issue where our code to serialize a pipeline's
>>> configuration was throwing an exception because one of the key/values in
>>> the config was too big (65k characters). We found this key/value was
>>> 'crunch.planner.dotfile' which is included in the pipelines config from
>>> Crunch.
>>>
>>> My question is why does Crunch provide this value into the config
>>> object?
>>>
>>> Crunch saves the dotfile string in the MRExecutor context[1] and I don't
>>> think any pipeline would need this at runtime. It also seems like there are
>>> no references to this config value anywhere within Crunch other then to
>>> write the value into the config object.
>>>
>>> [1] -
>>> https://github.com/apache/crunch/blob/d176778cf803374506cb7743069a05e28e07e2cf/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/DotfileUtills.java#L139-L140
>>>
>>>
>>
>

Re: Plan Dotfile in Configuration

Posted by Christian Tzolov <ch...@gmail.com>.
Hi Bryan, Josh,

IIRC this comes from the original dotfile jobplan implementation. I kept it
for backward compatible. You can see that only the "jobplan" (e.g. the
original/main plan) is stored in the Configuration.

+Gabriel i am not sure I remember the original intent to have the jobplan
stored in the Configuration?





On Wed, Jul 1, 2015 at 11:02 PM, Josh Wills <jo...@gmail.com> wrote:

> +Christian
>
> I'm not sure what the intent was there-- Christian?
>
> J
>
> On Wed, Jul 1, 2015 at 12:29 PM, Bryan Baugher <bj...@gmail.com> wrote:
>
>> We recently ran into an issue where our code to serialize a pipeline's
>> configuration was throwing an exception because one of the key/values in
>> the config was too big (65k characters). We found this key/value was
>> 'crunch.planner.dotfile' which is included in the pipelines config from
>> Crunch.
>>
>> My question is why does Crunch provide this value into the config object?
>>
>> Crunch saves the dotfile string in the MRExecutor context[1] and I don't
>> think any pipeline would need this at runtime. It also seems like there are
>> no references to this config value anywhere within Crunch other then to
>> write the value into the config object.
>>
>> [1] -
>> https://github.com/apache/crunch/blob/d176778cf803374506cb7743069a05e28e07e2cf/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/DotfileUtills.java#L139-L140
>>
>>
>

Re: Plan Dotfile in Configuration

Posted by Josh Wills <jo...@gmail.com>.
+Christian

I'm not sure what the intent was there-- Christian?

J

On Wed, Jul 1, 2015 at 12:29 PM, Bryan Baugher <bj...@gmail.com> wrote:

> We recently ran into an issue where our code to serialize a pipeline's
> configuration was throwing an exception because one of the key/values in
> the config was too big (65k characters). We found this key/value was
> 'crunch.planner.dotfile' which is included in the pipelines config from
> Crunch.
>
> My question is why does Crunch provide this value into the config object?
>
> Crunch saves the dotfile string in the MRExecutor context[1] and I don't
> think any pipeline would need this at runtime. It also seems like there are
> no references to this config value anywhere within Crunch other then to
> write the value into the config object.
>
> [1] -
> https://github.com/apache/crunch/blob/d176778cf803374506cb7743069a05e28e07e2cf/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/DotfileUtills.java#L139-L140
>
>