Posted to dev@avro.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2010/07/22 22:21:53 UTC

[jira] Created: (AVRO-600) add support for type and field name aliases

add support for type and field name aliases
-------------------------------------------

                 Key: AVRO-600
                 URL: https://issues.apache.org/jira/browse/AVRO-600
             Project: Avro
          Issue Type: New Feature
          Components: java, spec
            Reporter: Doug Cutting


It would be good if Avro permitted one to still read data after a type or field name has been changed.  I propose we add a notion of name _aliases_.  Aliases could be listed in the reader's schema for every named type and for record fields.  The writer's schema would then be permitted to use any of the aliases in place of the name.

In general, this permits one to construct schemas that can read different types into a single type.  One could use this not just to handle renamings, but also to join different datasets.  For example, if two datasets each contain differently named records with a date and an IP address field, this could be used to project both onto a single record with just those fields.
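
To illustrate the join case, here is a minimal sketch in Python that uses plain dicts rather than any Avro API. Every record, field, and alias name in it is hypothetical, and the helper accepted_names is invented for this example only.

```python
# Hypothetical reader schema (all names here are invented for illustration).
# It projects two differently named writer records onto one record that
# keeps only a date and an IP address.
READER = {
    "type": "record",
    "name": "org.example.Hit",
    "aliases": ["org.web.AccessLogEntry", "org.cdn.Request"],
    "fields": [
        {"name": "date", "type": "long", "aliases": ["timestamp", "when"]},
        {"name": "ip", "type": "string", "aliases": ["clientIp", "remoteAddr"]},
    ],
}


def accepted_names(named):
    """Return every name a writer's schema may use for this type or field."""
    return {named["name"], *named.get("aliases", [])}


# Either dataset's record name resolves to the same reader record.
print(sorted(accepted_names(READER)))
```

Data written as either of the two aliased record types, with any of the aliased field names, would be read into the single Hit record.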

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-600) add support for type and field name aliases

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-600:
------------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags:   (was: [Incompatible change])
      Resolution: Fixed

I committed this.



[jira] Commented: (AVRO-600) add support for type and field name aliases

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898399#action_12898399 ] 

Doug Cutting commented on AVRO-600:
-----------------------------------

I will commit this today unless someone objects.



[jira] Commented: (AVRO-600) add support for type and field name aliases

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891647#action_12891647 ] 

Doug Cutting commented on AVRO-600:
-----------------------------------

> This seems like it adds quite a bit of complexity to the base Avro system.

I think this should be easy to implement as a single-pass re-write of the writer's schema, rewriting any names that are aliases in the reader's schema.  In Java, this will be a single recursive method, plus a single call to this method in GenericDatumReader just before the ResolvingDecoder is created.
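
A rough sketch of such a rewrite, in Python over plain dict schemas rather than the Java classes named above. It is not the committed patch. It assumes every named type carries its full name in "name" (no separate "namespace" attribute), and the function names collect_aliases, rewrite and apply_aliases are made up for this illustration. One walk over the reader's schema gathers the aliases, then a single recursive pass rewrites the writer's schema.

```python
NAMED = ("record", "enum", "fixed")


def collect_aliases(reader, types, fields):
    """Walk the reader's schema, recording alias -> name mappings."""
    if isinstance(reader, list):                       # union branches
        for branch in reader:
            collect_aliases(branch, types, fields)
    elif isinstance(reader, dict):
        kind = reader["type"]
        if kind in NAMED:
            for alias in reader.get("aliases", []):
                types[alias] = reader["name"]
        if kind == "record":
            renames = fields.setdefault(reader["name"], {})
            for field in reader["fields"]:
                for alias in field.get("aliases", []):
                    renames[alias] = field["name"]
                collect_aliases(field["type"], types, fields)
        elif kind == "array":
            collect_aliases(reader["items"], types, fields)
        elif kind == "map":
            collect_aliases(reader["values"], types, fields)


def rewrite(writer, types, fields):
    """Return a copy of the writer's schema with aliased names replaced."""
    if isinstance(writer, str):                        # primitive or named reference
        return types.get(writer, writer)
    if isinstance(writer, list):                       # union branches
        return [rewrite(branch, types, fields) for branch in writer]
    out = dict(writer)
    kind = writer["type"]
    if kind in NAMED:
        out["name"] = types.get(writer["name"], writer["name"])
    if kind == "record":
        renames = fields.get(out["name"], {})
        out["fields"] = [
            {**f,
             "name": renames.get(f["name"], f["name"]),
             "type": rewrite(f["type"], types, fields)}
            for f in writer["fields"]
        ]
    elif kind == "array":
        out["items"] = rewrite(writer["items"], types, fields)
    elif kind == "map":
        out["values"] = rewrite(writer["values"], types, fields)
    return out


def apply_aliases(writer, reader):
    """Rewrite the writer's schema using the aliases in the reader's schema."""
    types, fields = {}, {}
    collect_aliases(reader, types, fields)
    return rewrite(writer, types, fields)


# The Foo/Bar schemas from the example posted on this issue.
WRITER = {"type": "record", "name": "org.x.Foo", "fields": [
    {"name": "a", "type": "int"},
    {"name": "b", "type": "int"}]}
READER = {"type": "record", "name": "org.y.Bar", "aliases": ["org.x.Foo"], "fields": [
    {"name": "c", "type": "int", "aliases": ["a"]},
    {"name": "d", "type": "int", "default": 0}]}

print(apply_aliases(WRITER, READER))
```

The rewritten schema is then handed to ordinary schema resolution, which needs no knowledge of aliases.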

Moreover, this can be an optional feature.  The schema stored with the data always fully and accurately describes the data.  Applications built using implementations without this feature would have to manually correlate data that has different names, as they do today.

Consider an alternate, functionally-equivalent, implementation that puts such aliases in a separate data structure that's passed to the reader, i.e., an aliasing feature of that particular reader implementation.  Such a feature would be useful, and would be completely consistent with the Avro specification.  The only difference between that and the proposal here is that the aliases are made available via the schema to every implementation in a standard form should they choose to implement this feature.
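
To make the equivalence concrete, the sketch below (Python, plain dicts, no Avro API) splits a reader's schema into an alias-free schema plus a separate alias table of the kind such a reader implementation might accept. The function split_aliases and the table layout are assumptions made for illustration, and only a single flat record is handled.

```python
def split_aliases(reader):
    """Separate a flat record schema from its aliases.

    Returns (schema_without_aliases, alias_table). Only a single record with
    primitive fields is handled; nested types are out of scope for this sketch.
    """
    table = {
        "types": {alias: reader["name"] for alias in reader.get("aliases", [])},
        "fields": {
            alias: field["name"]
            for field in reader["fields"]
            for alias in field.get("aliases", [])
        },
    }
    plain = {k: v for k, v in reader.items() if k != "aliases"}
    plain["fields"] = [
        {k: v for k, v in field.items() if k != "aliases"}
        for field in reader["fields"]
    ]
    return plain, table


# Reader's schema from the example posted on this issue.
READER = {"type": "record", "name": "org.y.Bar", "aliases": ["org.x.Foo"], "fields": [
    {"name": "c", "type": "int", "aliases": ["a"]},
    {"name": "d", "type": "int", "default": 0}]}

plain, table = split_aliases(READER)
print(plain)
print(table)
```

Both forms carry the same information; the proposal only standardizes where the table lives.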




[jira] Commented: (AVRO-600) add support for type and field name aliases

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891325#action_12891325 ] 

Doug Cutting commented on AVRO-600:
-----------------------------------

An example:

Data written with:

{code}
{"type": "record", "name": "org.x.Foo", "fields": [
    {"name": "a", "type": "int"},
    {"name": "b", "type": "int"}
  ]
}
{code}

Could be read with:
{code}
{"type": "record", "name": "org.y.Bar", "fields": [
    {"name": "c", "type": "int", "aliases": ["a"]},
    {"name": "d", "type": "int", "default": 0}
  ],
  "aliases": ["org.x.Foo"]
}
{code}

It would be an error for a type alias to name an already-defined type or for a field alias to name an already-defined field.
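
One possible reading of this rule, sketched in Python over plain dict schemas: an alias is rejected if it equals the name of a type defined anywhere in the same schema, or the name of a field in the same record. The scope of "already-defined" is an assumption here, and check_aliases and its helpers are hypothetical names, not part of any Avro API.

```python
NAMED = ("record", "enum", "fixed")


def children(schema):
    """Yield the schemas nested directly inside a schema."""
    if isinstance(schema, list):                 # union branches
        yield from schema
    elif isinstance(schema, dict):
        for field in schema.get("fields", []):
            yield field["type"]
        for key in ("items", "values"):          # array items, map values
            if key in schema:
                yield schema[key]


def defined_types(schema, names=None):
    """Collect the names of all named types defined in a schema."""
    names = set() if names is None else names
    if isinstance(schema, dict) and schema.get("type") in NAMED:
        names.add(schema["name"])
    for child in children(schema):
        defined_types(child, names)
    return names


def check_aliases(schema, type_names=None):
    """Raise ValueError if an alias collides with an already-defined name."""
    if type_names is None:
        type_names = defined_types(schema)
    if isinstance(schema, dict):
        if schema.get("type") in NAMED:
            for alias in schema.get("aliases", []):
                if alias in type_names:
                    raise ValueError(f"type alias {alias!r} names a defined type")
        fields = schema.get("fields", [])
        field_names = {f["name"] for f in fields}
        for field in fields:
            for alias in field.get("aliases", []):
                if alias in field_names:
                    raise ValueError(f"field alias {alias!r} names a defined field")
    for child in children(schema):
        check_aliases(child, type_names)


# The reader's schema above passes; aliasing "c" to the existing field "d" would not.
GOOD = {"type": "record", "name": "org.y.Bar", "aliases": ["org.x.Foo"], "fields": [
    {"name": "c", "type": "int", "aliases": ["a"]},
    {"name": "d", "type": "int", "default": 0}]}
check_aliases(GOOD)
```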

The semantics would be equivalent to rewriting the writer's schema, replacing matching aliased types and fields with their names in the reader's schema.  In the above example, the writer's schema would be rewritten as:


{code}
{"type": "record", "name": "org.y.Bar", "fields": [
    {"name": "c", "type": "int"},
    {"name": "b", "type": "int"}
  ]
}
{code}

When instances are read, values for "a" would be read into the "c" field, values for "b" would be dropped, and "d" would have the default value of zero.
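
The effect on a single instance can be sketched as follows, in Python with plain dicts and no Avro API. The written values are modeled as a list in writer field order, since Avro's binary encoding stores record fields positionally without names. The function project is a hypothetical stand-in for ordinary schema resolution, restricted to one flat record.

```python
# Rewritten writer's schema and reader's schema from the example above.
REWRITTEN_WRITER = {"type": "record", "name": "org.y.Bar", "fields": [
    {"name": "c", "type": "int"},
    {"name": "b", "type": "int"}]}
READER = {"type": "record", "name": "org.y.Bar", "aliases": ["org.x.Foo"], "fields": [
    {"name": "c", "type": "int", "aliases": ["a"]},
    {"name": "d", "type": "int", "default": 0}]}


def project(values, rewritten_writer, reader):
    """Resolve one written record against the reader's schema.

    values holds the written field values in writer field order.
    """
    # Label each written value with its (possibly renamed) field name.
    written = {f["name"]: v for f, v in zip(rewritten_writer["fields"], values)}
    result = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in written:
            result[name] = written[name]          # matched by name
        elif "default" in field:
            result[name] = field["default"]       # absent from the data
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result                                 # unmatched writer fields are dropped


# An org.x.Foo instance written with a=1, b=2.
print(project([1, 2], REWRITTEN_WRITER, READER))
```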



[jira] Commented: (AVRO-600) add support for type and field name aliases

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891348#action_12891348 ] 

Philip Zeyliger commented on AVRO-600:
--------------------------------------

This seems like it adds quite a bit of complexity to the base Avro system.  Could this be layered on top?  Perhaps a separate way to indicate a transformation from one schema to another, which could then be used at read time?

-- Philip



[jira] Updated: (AVRO-600) add support for type and field name aliases

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-600:
------------------------------

    Attachment: AVRO-600.patch

Updated patch with documentation added to spec.



[jira] Assigned: (AVRO-600) add support for type and field name aliases

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting reassigned AVRO-600:
---------------------------------

    Assignee: Doug Cutting



[jira] Updated: (AVRO-600) add support for type and field name aliases

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-600:
------------------------------

    Attachment: AVRO-600.patch

Here's a patch for this, with tests.



[jira] Updated: (AVRO-600) add support for type and field name aliases

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-600:
------------------------------

           Status: Patch Available  (was: Open)
     Hadoop Flags: [Incompatible change]
    Fix Version/s: 1.4.0

I think this is ready to commit.



[jira] Updated: (AVRO-600) add support for type and field name aliases

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-600:
------------------------------

    Attachment: AVRO-600.patch

Here's a new version of the patch, updated for conflicts with AVRO-557.

Note that now the re-written, aliased schema is cached in the resolving decoder.  Performance of the Perf benchmarks is not measurably different.
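
The general idea of caching the rewritten schema can be sketched as a small memo keyed by the (writer, reader) pair. This Python illustration is not the patch's code and says nothing about how the resolving decoder actually stores it. AliasCache is a hypothetical name, and the rewrite function is passed in so the sketch stays independent of any particular implementation.

```python
import json


class AliasCache:
    """Memoize rewritten writer schemas per (writer, reader) pair."""

    def __init__(self, rewrite):
        self._rewrite = rewrite      # any function (writer, reader) -> schema
        self._cache = {}

    def get(self, writer, reader):
        # Serialized schemas make a hashable key; sort_keys keeps it stable.
        key = (json.dumps(writer, sort_keys=True),
               json.dumps(reader, sort_keys=True))
        if key not in self._cache:
            self._cache[key] = self._rewrite(writer, reader)
        return self._cache[key]
```

With a cache like this, the rewrite runs once per schema pair rather than once per datum read.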
