Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/09/22 09:16:19 UTC

Remind me about 'transient'?

Someone remind me about why we use 'transient' in the code?

It has no meaning outside of java.io.Serializable, and customizing its
behavior. It's not seen in non-Serializable classes.
But, I remember someone commented at one point that it has meaning for
the GSON serialization mechanism. True?

If so, is there a sense of which classes that affects? I ask because I see
it in classes I don't think would ever be serialized, like OnlineAuc.
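
For reference, a minimal sketch of what 'transient' does under plain
java.io serialization. AucState and TransientDemo are made-up names for
illustration only; they are not Mahout classes and say nothing about how
OnlineAuc is actually written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class for illustration; not Mahout's OnlineAuc.
class AucState implements Serializable {
    private static final long serialVersionUID = 1L;

    double auc;                 // written by java.io serialization
    transient double[] scratch; // skipped; null again after deserialization

    AucState(double auc, double[] scratch) {
        this.auc = auc;
        this.scratch = scratch;
    }
}

class TransientDemo {
    // Serializes the object to a byte array and reads it straight back.
    static AucState roundTrip(AucState in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream is = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (AucState) is.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AucState copy = roundTrip(new AucState(0.87, new double[] {1.0, 2.0}));
        System.out.println("auc kept: " + copy.auc);
        System.out.println("scratch dropped: " + (copy.scratch == null));
    }
}
```

In a class that does not implement Serializable, the keyword compiles
fine but this mechanism never looks at it.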

Re: Remind me about 'transient'?

Posted by Ted Dunning <te...@gmail.com>.
I don't have a good answer to this question yet. This is partly
because I am still learning the operational considerations. Gson was
introduced because we had it and it satisfied the multilingual
requirement. Beyond that I don't have a good story other than Avro
someday mumble.

On Sep 22, 2010, at 6:41 AM, Sean Owen <sr...@gmail.com> wrote:

> I figured there was a reason.
> But what serialization mechanism? I'm getting at whether there is any
> unneeded inconsistency in approach to serialization. (There may not be.)
>
> On Wed, Sep 22, 2010 at 2:38 PM, Ted Dunning <te...@gmail.com> wrote:
>> Actually OnlineAuc is serialized when an AdaptiveLogisticRegression is
>> serialized.  This is done to checkpoint a large incremental training run.
>> That serialized form is also used in some cases to deploy a model to
>> production or to do model diagnostics.

Re: Remind me about 'transient'?

Posted by Sean Owen <sr...@gmail.com>.
I figured there was a reason.
But what serialization mechanism? I'm getting at whether there is any
unneeded inconsistency in approach to serialization. (There may not be.)

On Wed, Sep 22, 2010 at 2:38 PM, Ted Dunning <te...@gmail.com> wrote:
> Actually OnlineAuc is serialized when an AdaptiveLogisticRegression is
> serialized.  This is done to checkpoint a large incremental training run.
>  That serialized form is also used in some cases to deploy a model to
> production or to do model diagnostics.

Re: Remind me about 'transient'?

Posted by Ted Dunning <te...@gmail.com>.
Actually OnlineAuc is serialized when an AdaptiveLogisticRegression is  
serialized.  This is done to checkpoint a large incremental training  
run.  That serialized form is also used in some cases to deploy a  
model to production or to do model diagnostics.

On Sep 22, 2010, at 6:19 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> On 9/22/10 3:16 AM, Sean Owen wrote:
>> Someone remind me about why we use 'transient' in the code?
>>
>> It has no meaning outside of java.io.Serializable, and customizing its
>> behavior. It's not seen in non-Serializable classes.
>> But, I remember someone commented at one point that it has meaning for
>> the GSON serialization mechanism. True?
>>
>> If so, is there a sense of which classes that affects? since I see it
>> in classes I don't think would ever be serialized like OnlineAuc.
>>
> Yes, Gson will not serialize transient state. It's also a reasonable
> way to indicate transient state that is not serialized by Writable.
> See AbstractCluster for an example.

Re: Remind me about 'transient'?

Posted by Sean Owen <sr...@gmail.com>.
No worries, I wasn't planning on any sudden moves.
But this has perhaps earned a JIRA issue for tracking. I'll open one.

On Wed, Sep 22, 2010 at 3:45 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
>  Ok, based upon this I'm -1 on removing it. Perhaps we should deprecate and
> remove in 0.5 or later?
>

Re: Remind me about 'transient'?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Ok, based upon this I'm -1 on removing it. Perhaps we should deprecate 
and remove in 0.5 or later?

On 9/22/10 10:31 AM, Ted Dunning wrote:
> For the model stuff, it is a bit more than vestigial (aka in production
> now).  I would definitely like to migrate to a better format, but that will
> take a bit of time to complete.
>
> On Wed, Sep 22, 2010 at 7:21 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>>   +0 Before there was Writable, we used Json to communicate internal state.
>> With Writable now the standard, the Json is vestigial. Pretty easy to rip
>> out if nobody thinks it is useful. We don't currently have textual I/O forms
>> that are uniform and complete other than Json.


Re: Remind me about 'transient'?

Posted by Ted Dunning <te...@gmail.com>.
For the model stuff, it is a bit more than vestigial (aka in production
now).  I would definitely like to migrate to a better format, but that will
take a bit of time to complete.

On Wed, Sep 22, 2010 at 7:21 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

>  +0 Before there was Writable, we used Json to communicate internal state.
> With Writable now the standard, the Json is vestigial. Pretty easy to rip
> out if nobody thinks it is useful. We don't currently have textual I/O forms
> that are uniform and complete other than Json.

Re: Remind me about 'transient'?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  +0 Before there was Writable, we used Json to communicate internal 
state. With Writable now the standard, the Json is vestigial. Pretty 
easy to rip out if nobody thinks it is useful. We don't currently have 
textual I/O forms that are uniform and complete other than Json.

On 9/22/10 10:05 AM, Sean Owen wrote:
> Ah, a fourth mechanism. Not bad per se, as long as there is a Good Reason
> to use Serializable in one place, Writable another, GSON elsewhere and
> Avro as well.
>
> Reasons come from goals and use cases. Right now IMHO there are really
> two input / output formats for anything that interacts with the
> outside world:
>
> - Text of various stripes
> - Vector Writable
>
> Text is the ultimate lowest-common-denominator: human readable,
> cross-language, but not efficient. Vector Writable is the opposite.
> And between them, if I squint, that answers the use cases.
>
> Wild idea: is that about right? What happens if GSON is removed, Avro
> not used, Serializable not used?
>
>
> Mahout's nature will always be a bit of a 'bazaar' project, really a
> loose confederation of implementations that are not entirely
> consistent. I imagine though that taking targeted shots at chunky
> issues like this (and standardizing on Hadoop 0.20.x APIs for
> instance) gets rid of 80% of the divergence. And that's pretty fine
> for such a project. Better perhaps than many closed / proprietary code
> bases.
>


Re: Remind me about 'transient'?

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Sep 22, 2010 at 7:05 AM, Sean Owen <sr...@gmail.com> wrote:

> And between them, if I squint, that (text and writable) answers the use
> cases.
>

Except when it takes 20GB of extra heap to serialize a model (it really
can).


>
> Wild idea: is that about right? What happens if GSON is removed, Avro
> not used, Serializable not used?
>

Right.  This migration sounds good to me.  It gives us what we have with
better memory feasibility.


>
> Mahout's nature will always be a bit of a 'bazaar' project, really a
> loose confederation of implementations that are not entirely
> consistent. I imagine though that taking targeted shots at chunky
> issues like this (and standardizing on Hadoop 0.20.x APIs for
> instance) gets rid of 80% of the divergence. And that's pretty fine
> for such a project. Better perhaps than many closed / proprietary code
> bases.
>

Sounds right to me.

Re: Remind me about 'transient'?

Posted by Sean Owen <sr...@gmail.com>.
Ah, a fourth mechanism. Not bad per se, as long as there is a Good Reason
to use Serializable in one place, Writable another, GSON elsewhere and
Avro as well.

Reasons come from goals and use cases. Right now IMHO there are really
two input / output formats for anything that interacts with the
outside world:

- Text of various stripes
- Vector Writable

Text is the ultimate lowest-common-denominator: human readable,
cross-language, but not efficient. Vector Writable is the opposite.
And between them, if I squint, that answers the use cases.

Wild idea: is that about right? What happens if GSON is removed, Avro
not used, Serializable not used?


Mahout's nature will always be a bit of a 'bazaar' project, really a
loose confederation of implementations that are not entirely
consistent. I imagine though that taking targeted shots at chunky
issues like this (and standardizing on Hadoop 0.20.x APIs for
instance) gets rid of 80% of the divergence. And that's pretty fine
for such a project. Better perhaps than many closed / proprietary code
bases.

Re: Remind me about 'transient'?

Posted by Ted Dunning <te...@gmail.com>.

On Sep 22, 2010, at 6:27 AM, Sean Owen <sr...@gmail.com> wrote:

> Embedded in my question is, what is GSON used for at this stage? There
> are potentially three serialization mechanisms in play (java.io,
> Writable, GSON) and want to rationalize them. Is GSON still needed?

GSON is useful as an inspectable serialization. Avro would perhaps be
preferable, but Writable and Java native serialization are not, because
of the need to inspect models using non-Java languages.

I am very sympathetic about the desire to rationalize these, but the
way forward that I see is Avro, since it serves the purposes of the
other options. In particular, with large models Gson begins to eat up
vast amounts of transient heap space when writing or reading. Avro
would avoid this.

> (Could well be) 'transient' doesn't have meaning for Writable, though
> I take the point about annotation. But an annotation or comment could
> be a better thing. 'transient' in a non-Serializable class looks at
> first like a mistake.

Except where it is meaningful to the compiler, I agree that transient  
is less than perspicuous. Where it is used only for GSON there should  
always be a comment about why.

>
> On Wed, Sep 22, 2010 at 2:19 PM, Jeff Eastman
> <jd...@windwardsolutions.com> wrote:
>>  On 9/22/10 3:16 AM, Sean Owen wrote:
>>>
>>> Someone remind me about why we use 'transient' in the code?
>>>
>>> It has no meaning outside of java.io.Serializable, and customizing its
>>> behavior. It's not seen in non-Serializable classes.
>>> But, I remember someone commented at one point that it has meaning for
>>> the GSON serialization mechanism. True?
>>>
>>> If so, is there a sense of which classes that affects? since I see it
>>> in classes I don't think would ever be serialized like OnlineAuc.
>>>
>> Yes, Gson will not serialize transient state. It's also a reasonable way to
>> indicate transient state that is not serialized by Writable. See
>> AbstractCluster for an example.
>>

Re: Remind me about 'transient'?

Posted by Sean Owen <sr...@gmail.com>.
Embedded in my question is, what is GSON used for at this stage? There
are potentially three serialization mechanisms in play (java.io,
Writable, GSON) and I want to rationalize them. Is GSON still needed?
(Could well be.)

'transient' doesn't have meaning for Writable, though
I take the point about annotation. But an annotation or comment could
be a better thing. 'transient' in a non-Serializable class looks at
first like a mistake.
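
To make the Writable point concrete, here is a rough sketch. It assumes
only that Hadoop's Writable declares write(DataOutput) and
readFields(DataInput); SimpleWritable is a local stand-in for that
interface so the example compiles without Hadoop, and ClusterSketch is a
hypothetical class, not AbstractCluster.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Local stand-in with the same two methods as org.apache.hadoop.io.Writable.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical cluster-like class for illustration only.
class ClusterSketch implements SimpleWritable {
    int id;
    long numPoints;
    // 'transient' changes nothing here: the field is skipped only because
    // write() below never writes it. The keyword acts as documentation.
    transient double cachedRadius;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeLong(numPoints);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        numPoints = in.readLong();
    }

    // Writes the object to a byte array and reads it into a fresh instance.
    static ClusterSketch roundTrip(ClusterSketch in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            in.write(new DataOutputStream(bytes));
            ClusterSketch copy = new ClusterSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Deleting the 'transient' keyword from cachedRadius would not change the
bytes written, which is why a comment or a dedicated annotation might
state the intent more clearly.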

On Wed, Sep 22, 2010 at 2:19 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
>  On 9/22/10 3:16 AM, Sean Owen wrote:
>>
>> Someone remind me about why we use 'transient' in the code?
>>
>> It has no meaning outside of java.io.Serializable, and customizing its
>> behavior. It's not seen in non-Serializable classes.
>> But, I remember someone commented at one point that it has meaning for
>> the GSON serialization mechanism. True?
>>
>> If so, is there a sense of which classes that affects? since I see it
>> in classes I don't think would ever be serialized like OnlineAuc.
>>
> Yes, Gson will not serialize transient state. It's also a reasonable way to
> indicate transient state that is not serialized by Writable. See
> AbstractCluster for an example.
>

Re: Remind me about 'transient'?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  On 9/22/10 3:16 AM, Sean Owen wrote:
> Someone remind me about why we use 'transient' in the code?
>
> It has no meaning outside of java.io.Serializable, and customizing its
> behavior. It's not seen in non-Serializable classes.
> But, I remember someone commented at one point that it has meaning for
> the GSON serialization mechanism. True?
>
> If so, is there a sense of which classes that affects? since I see it
> in classes I don't think would ever be serialized like OnlineAuc.
>
Yes, Gson will not serialize transient state. It's also a reasonable way 
to indicate transient state that is not serialized by Writable. See 
AbstractCluster for an example.
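
A rough illustration of why a reflection-based serializer can honor
'transient' even on a class that is not Serializable: the modifier is
visible through java.lang.reflect. This is not Gson's code, just a toy
writer (TinyJson, a made-up name) that applies the same default rule of
skipping transient and static fields. ModelSketch is likewise
hypothetical, and the writer only handles numeric fields.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.StringJoiner;

// Hypothetical model class; note that it does not implement Serializable.
class ModelSketch {
    int numFeatures = 10;
    double lambda = 0.5;
    transient int recordsSeen = 12345; // working state, not part of the model
}

class TinyJson {
    // Emits the numeric fields of an object as JSON, skipping transient
    // and static fields. Field order follows whatever reflection returns.
    static String toJson(Object o) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Field f : o.getClass().getDeclaredFields()) {
            int mods = f.getModifiers();
            if (Modifier.isTransient(mods) || Modifier.isStatic(mods) || f.isSynthetic()) {
                continue;
            }
            try {
                f.setAccessible(true);
                json.add("\"" + f.getName() + "\":" + f.get(o));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return json.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(new ModelSketch()));
    }
}
```

The output contains numFeatures and lambda but not recordsSeen.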