
Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2016/01/13 21:28:30 UTC

versioning cas serializations

Hi,

I'm working on UIMA-4743, fixing some binary CAS serialization problems. The
fix will unfortunately make the binary serialization for "delta" formats not
backward compatible (the fixed format may have extra bytes in it).

We currently have a partially architected scheme for serialization forms, which
looks like:
  - 1 word encoding U + I + M + A and also serving to identify byte order
  - 1 word for bit-encoding some categorizations:
     -- a bit for delta / non delta
     -- a bit for compressed / non compressed
  - 0 or 1 additional word for incrementing in some fashion a version number for
a particular serialization category (named below as "2nd version word")

This 2nd version word is currently only used with compressed serialization formats.
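
For illustration only, a minimal Python sketch of how a reader might use the
first word both as a magic number and as a byte-order marker. The function
names are hypothetical, and treating the letters in the order "UIMA" as
big-endian is an assumption of this sketch, not something stated above.

```python
import struct

MAGIC = b"UIMA"  # the four letters U, I, M, A as one 4-byte word


def detect_byte_order(first_word: bytes) -> str:
    """Return the struct byte-order prefix implied by the first header word.

    Assumption: letters in order means big-endian; reversed means the
    writer used the opposite byte order.
    """
    if first_word == MAGIC:
        return ">"
    if first_word == MAGIC[::-1]:
        return "<"
    raise ValueError("first word is not the UIMA marker")


def read_word(buf: bytes, offset: int, order: str) -> int:
    """Read one unsigned 32-bit word at offset using the detected order."""
    return struct.unpack_from(order + "I", buf, offset)[0]
```

Every later word (the bit-flag word and the optional 2nd version word) would
then be read with the byte order detected here.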

I'm thinking of assigning another bit in the first version word (the bit-flag
word) to indicate there's a 2nd version word present.

I would turn this on for the repaired binary delta format, and supply a version
number.
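
A rough sketch of the presence-bit idea, in Python for brevity. The flag
names and bit positions (DELTA, COMPRESSED, HAS_VERSION_WORD) are invented
for this example and are not the actual UIMA assignments; the sketch also
fixes big-endian order and ignores the existing rule that compressed formats
already carry a 2nd version word.

```python
import struct
from enum import IntFlag


class Flags(IntFlag):
    # Hypothetical bit positions, chosen only for this sketch.
    DELTA = 0x01
    COMPRESSED = 0x02
    HAS_VERSION_WORD = 0x04  # the proposed new bit


def write_header(flags: Flags, version: int = 0) -> bytes:
    """Marker word, flag word, then the version word only if flagged."""
    out = b"UIMA" + struct.pack(">I", int(flags))
    if flags & Flags.HAS_VERSION_WORD:
        out += struct.pack(">I", version)
    return out


def read_header(buf: bytes):
    """Return (flags, version or None, header length in bytes)."""
    if buf[:4] != b"UIMA":
        raise ValueError("bad marker word")
    flags = Flags(struct.unpack_from(">I", buf, 4)[0])
    if flags & Flags.HAS_VERSION_WORD:
        return flags, struct.unpack_from(">I", buf, 8)[0], 12
    return flags, None, 8
```

With a bit like this, an old-style delta stream (no bit set, 8-byte header)
and a repaired one (bit set, 12-byte header with a version) can be told apart
before any payload is read.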

Our current compressed formats use "1" as the incrementing version number.

Thinking ahead, perhaps the serialization formats should have a multi-part 2nd
version word, following some standard.
The "semantic versioning" standard has sparked some push-back (see
https://gist.github.com/jashkenas/cbd2b088e20279ae2c8e )
basically saying the "mechanical" approach of semantic versioning isn't rich
enough for the grey areas of real world use, and ends up obscuring the purpose
of indicating how "far" one version is from another. 

I'm leaning toward something simple, such as using the Major/Minor/Patch format,
each value 1 byte, in the 3 lower bytes of the 2nd version word, giving 256
possibilities for each (more than I've ever seen used).
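
As a sketch of that layout (Python, hypothetical function names): one byte
each for major, minor and patch in the three lower bytes of a 32-bit word.
Putting major in the highest of the three bytes is an assumption of this
example; it has the side effect that plain integer comparison orders versions
correctly.

```python
def pack_version(major: int, minor: int, patch: int) -> int:
    """Pack three 0..255 values into the 3 lower bytes of one word."""
    for part in (major, minor, patch):
        if not 0 <= part <= 0xFF:
            raise ValueError("each part must fit in one byte (0..255)")
    return (major << 16) | (minor << 8) | patch


def unpack_version(word: int) -> tuple:
    """Inverse of pack_version; the top byte of the word is ignored."""
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF


word = pack_version(1, 2, 3)       # 0x010203
legacy = unpack_version(1)         # (0, 0, 1)
```

Note that under this layout the existing version number "1" used by the
compressed formats would decode as 0.0.1.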

Other ideas?

-Marshall

Re: versioning cas serializations

Posted by Marshall Schor <ms...@schor.com>.

The advantage of a slightly more complex 2nd version word (with
major/minor/patch) may be that, in the future, some better backward-compatibility
tests could be done.  Also, it really costs essentially nothing, I think.

+1 on your general analysis of versioning :-).

-Marshall

On 1/13/2016 3:56 PM, Richard Eckart de Castilho wrote:
> Hi,
>
> On 13.01.2016, at 21:28, Marshall Schor <ms...@schor.com> wrote:
>> I would turn this on for the repaired binary delta format, and supply a version
>> number.
>>
>> Our current compressed formats use "1" as the incrementing version number.
>>
>> I'm leaning toward something simple, such as using the Major/Minor/Patch format,
>> each value 1 byte, in the 3 lower bytes of the 2nd version word, giving 256
>> possibilities for each (more than I've ever seen used).
> +1 for versioning the CAS formats. Every data format should include version information :) The BinaryCasWriter in DKPro Core uses 'D', 'K', 'P', 'r', 'o', '1' as the header for the 6+ format (serialization with compression form 6  prepended with type system information).
>
> Is it really necessary to have a complex versioning scheme for data formats? I'd rather tend towards a plain int versioning: 1, 2, 3, 4, etc. Wouldn't that be sufficient?
>
>> The "semantic versioning" standard has sparked some push-back (see
>> https://gist.github.com/jashkenas/cbd2b088e20279ae2c8e )
>> basically saying the "mechanical" approach of semantic versioning isn't rich
>> enough for the grey areas of real world use, and ends up obscuring the purpose
>> of indicating how "far" one version is from another. 
>
> Regarding SemVer: I don't personally fully trust the plugin we are using. E.g. I tried doing some changes to uimaFIT that I believe are backwards-compatible but the semver plugin believes otherwise. 
>
> Other than that, I am not quite convinced of the criticism towards semver either. 
>
> Let's just consider (for software):
>
> - if we do bug-fixes, we typically make this a x.y.+1 - bug-fixes shouldn't change the API - sounds reasonable to me
>
> - when adding new features, I would personally always tend towards a x.+1.0 - in the past, we had various UIMA releases that added cool new features but increased the version only at the last digit. Undeserved, I think. Since we use semver, we increase the middle digit more and I think that is very appropriate and reflects the activity in the project much better.
>
> - that leaves the first digit, which IMHO is often a marketing digit: increase it to tell people that all is new and shiny and they should have another fresh look at the project. I don't think we need that. Using it to indicate major breaking changes (which are typically part of a major refactoring with cool new features that people should have a look at) seems quite appropriate to me. We are now in UIMA 2. UIMA 1 was IBM UIMA. I do believe that if we are introducing major changes now like a completely new CAS, that warrants going to UIMA 3.
>
> So looking at that and minus some doubts that I have about the accuracy of the semver plugin, I believe that the idea of semver in general is quite sensible - at least when going with a three-part versioning scheme. I would consider the plugin as an automatic alert for accidentally introducing incompatible changes and the semver idea as a guideline. When we consider it a good idea, I think we should add exceptions and overrides to the plugin for particular releases.
>
> Cheers,
>
> -- Richard

Re: versioning cas serializations

Posted by Marshall Schor <ms...@schor.com>.

I now agree with you, Richard, on just using simple incrementing numbers as
the version for each serialization format, with some extra bits encoding things
like form 4, form 6, delta, etc.

I'll put up a Jira for this.

-Marshall




Re: versioning cas serializations

Posted by Richard Eckart de Castilho <re...@apache.org>.

Hi,

On 13.01.2016, at 21:28, Marshall Schor <ms...@schor.com> wrote:
> I would turn this on for the repaired binary delta format, and supply a version
> number.
> 
> Our current compressed formats use "1" as the incrementing version number.
> 

> I'm leaning toward something simple, such as using the Major/Minor/Patch format,
> each value 1 byte, in the 3 lower bytes of the 2nd version word, giving 256
> possibilities for each (more than I've ever seen used).

+1 for versioning the CAS formats. Every data format should include version information :) The BinaryCasWriter in DKPro Core uses 'D', 'K', 'P', 'r', 'o', '1' as the header for the 6+ format (serialization with compression form 6  prepended with type system information).

Is it really necessary to have a complex versioning scheme for data formats? I'd rather tend towards a plain int versioning: 1, 2, 3, 4, etc. Wouldn't that be sufficient?
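
To make the plain-int idea concrete, here is a small Python sketch (all names
hypothetical, and the "readers" are placeholders rather than real
deserializers): the reader keeps one decoding routine per known version
number and rejects anything it does not know.

```python
def read_v1(payload: bytes) -> str:
    # Placeholder for the original format's decoding logic.
    return "v1:" + payload.hex()


def read_v2(payload: bytes) -> str:
    # Placeholder for a later, incompatible format.
    return "v2:" + payload.hex()


# One entry per format version this reader understands.
READERS = {1: read_v1, 2: read_v2}


def read_payload(version: int, payload: bytes) -> str:
    """Dispatch on the plain integer version found in the header."""
    try:
        reader = READERS[version]
    except KeyError:
        raise ValueError(f"unsupported format version {version}") from None
    return reader(payload)
```

With a single counter, "can I read this?" is a lookup; there is no notion of
partial compatibility to interpret.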

> The "semantic versioning" standard has sparked some push-back (see
> https://gist.github.com/jashkenas/cbd2b088e20279ae2c8e )
> basically saying the "mechanical" approach of semantic versioning isn't rich
> enough for the grey areas of real world use, and ends up obscuring the purpose
> of indicating how "far" one version is from another. 


Regarding SemVer: I don't personally fully trust the plugin we are using. E.g. I tried doing some changes to uimaFIT that I believe are backwards-compatible but the semver plugin believes otherwise. 

Other than that, I am not quite convinced of the criticism towards semver either. 

Let's just consider (for software):

- if we do bug-fixes, we typically make this a x.y.+1 - bug-fixes shouldn't change the API - sounds reasonable to me

- when adding new features, I would personally always tend towards a x.+1.0 - in the past, we had various UIMA releases that added cool new features but increased the version only at the last digit. Undeserved, I think. Since we use semver, we increase the middle digit more and I think that is very appropriate and reflects the activity in the project much better.

- that leaves the first digit, which IMHO is often a marketing digit: increase it to tell people that all is new and shiny and they should have another fresh look at the project. I don't think we need that. Using it to indicate major breaking changes (which are typically part of a major refactoring with cool new features that people should have a look at) seems quite appropriate to me. We are now in UIMA 2. UIMA 1 was IBM UIMA. I do believe that if we are introducing major changes now like a completely new CAS, that warrants going to UIMA 3.
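
The three rules above can be written down as a tiny Python sketch. The
function name and the change labels ("bugfix", "feature", "breaking") are
made up for this example, and the version strings are arbitrary.

```python
def bump(version: str, change: str) -> str:
    """Return the next x.y.z version for the given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "bugfix":      # x.y.+1: no API change
        return f"{major}.{minor}.{patch + 1}"
    if change == "feature":     # x.+1.0: new, compatible functionality
        return f"{major}.{minor + 1}.0"
    if change == "breaking":    # +1.0.0: incompatible change
        return f"{major + 1}.0.0"
    raise ValueError(f"unknown change type: {change}")


bump("2.8.1", "bugfix")    # "2.8.2"
bump("2.8.1", "feature")   # "2.9.0"
bump("2.8.1", "breaking")  # "3.0.0"
```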

So looking at that and minus some doubts that I have about the accuracy of the semver plugin, I believe that the idea of semver in general is quite sensible - at least when going with a three-part versioning scheme. I would consider the plugin as an automatic alert for accidentally introducing incompatible changes and the semver idea as a guideline. When we consider it a good idea, I think we should add exceptions and overrides to the plugin for particular releases.

Cheers,

-- Richard
