Posted to dev@avro.apache.org by "Jeff Hammerbacher (JIRA)" <ji...@apache.org> on 2010/09/08 08:52:32 UTC

[jira] Created: (AVRO-659) Portable specification of the location of schema and protocol files

Portable specification of the location of schema and protocol files
-------------------------------------------------------------------

                 Key: AVRO-659
                 URL: https://issues.apache.org/jira/browse/AVRO-659
             Project: Avro
          Issue Type: New Feature
            Reporter: Jeff Hammerbacher


Avro doesn't require code generation, which is great. However, if you want to use a protocol or a schema, your code needs to know where to find it. When your code is ported to new systems, the protocol or schema file must be placed in the same place as on the previous system for things to work correctly.

For importing modules in a portable fashion, Python provides a default set of places it will look for modules and an environment variable called PYTHONPATH that programs can use to override these defaults. It may be useful to explore similar constructs for Avro implementations that don't do code generation. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-659) Portable specification of the location of schema and protocol files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907288#action_12907288 ] 

Doug Cutting commented on AVRO-659:
-----------------------------------

Jeff, I'm still trying to understand the use case you have in mind.

Most folks writing data to files should use an Avro data file, which includes the schema.  If folks are doing RPC, then the protocol they use to write data is typically a file in their source code tree, and the protocol they use to read data is determined through the handshake.   If folks are writing individual records to a database then a best practice is to maintain a registry of schemas used in the database as a separate table, and have each instance refer to its schema in the registry via its MD5 hash.  The application would still probably store or create the schemas it uses for new database records with the source code.  The registry is updated when writing records and accessed when reading them.
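
A minimal Python sketch of the registry pattern described above. The SchemaRegistry class and its method names are hypothetical, the in-memory dict stands in for the separate database table, and the hash is taken over a re-serialized JSON form chosen here for illustration rather than any normalization defined by Avro.

```python
import hashlib
import json


class SchemaRegistry:
    """In-memory stand-in for a schema table keyed by MD5 hash (hypothetical)."""

    def __init__(self):
        self._schemas = {}

    @staticmethod
    def fingerprint(schema_json):
        # Re-serialize so whitespace and object key order do not change the hash.
        # This normalization is an assumption made for the sketch.
        normalized = json.dumps(json.loads(schema_json),
                                sort_keys=True, separators=(",", ":"))
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def register(self, schema_json):
        # Called when writing: store the schema once and return the key
        # that each record would carry alongside its data.
        key = self.fingerprint(schema_json)
        self._schemas.setdefault(key, schema_json)
        return key

    def lookup(self, key):
        # Called when reading: recover the writer's schema from the stored hash.
        return self._schemas[key]
```

A writer would call register() before storing a record and keep the returned key with it; a reader would call lookup() with that key before decoding.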

We do not want to encourage folks to write data without also storing the schema used to write that data in the same repository as the data. I don't feel a path-based schema registry is a good idea.  Keeping a copy of the schema with source code that writes data is a good practice: the schema is part of the writing code and should be versioned with it.  Generating schemas on the fly when writing data is a fine practice too.  But whenever data is persisted, its schema should be stored with it.



[jira] Commented: (AVRO-659) Portable specification of the location of schema and protocol files

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907282#action_12907282 ] 

Jeff Hammerbacher commented on AVRO-659:
----------------------------------------

bq. Generically helping applications find a text file that they need to load in their application is really up to the developer. AVRO shouldn't really get involved at that layer.

Pardon my ignorance, but why not? Most programming languages provide a perfectly good way of producing an ordered list of paths to search for the files they need; why shouldn't Avro do the same?



[jira] Commented: (AVRO-659) Portable specification of the location of schema and protocol files

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907281#action_12907281 ] 

Philip Zeyliger commented on AVRO-659:
--------------------------------------

Generically helping applications find a text file that they need to load in their application is really up to the developer.  AVRO shouldn't really get involved at that layer.  (In Java, you would just throw it in your jar and load it with getResource().  In Python, good luck dereferencing __file__ and hoping you're not in an egg, but that's a different story.)

That said, there's nothing preventing us (or our users) from creating an avsc2py program that maps "[null, int]" into a "schema.py" file with "SCHEMA = '"[null, int]"'", i.e., creates a python file with the schema appropriately escaped so that it can be loaded that way.
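
A rough sketch of what such an avsc2py step could look like. No such tool is described in this thread beyond the idea, so the function names and the SCHEMA variable name below are assumptions made for illustration.

```python
import json


def avsc_to_python_source(avsc_text, variable="SCHEMA"):
    """Return Python source that defines `variable` as the schema string."""
    json.loads(avsc_text)  # fail early if the input is not valid JSON
    # repr() produces a correctly escaped Python string literal.
    return "%s = %s\n" % (variable, repr(avsc_text))


def write_schema_module(avsc_path, py_path):
    # Read the .avsc file and write the generated module to py_path.
    with open(avsc_path) as src:
        text = src.read()
    with open(py_path, "w") as dst:
        dst.write(avsc_to_python_source(text))
```

The generated module can then be imported like any other, so the schema travels with the code through the normal Python import machinery.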



[jira] Commented: (AVRO-659) Portable specification of the location of schema and protocol files

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907296#action_12907296 ] 

Jeff Hammerbacher commented on AVRO-659:
----------------------------------------

Doug: yeah, we're talking past one another. Philip is onto what I'm talking about.

Philip: your code assumes that the myschema.avsc file lives at {{os.path.dirname(__path__)}}; I'm arguing that this location should be replaced by a list of locations to be searched, in the same way that PATH, PYTHONPATH, CLASSPATH, and friends work.
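
To make the proposal concrete, here is a minimal sketch of such a lookup in Python. The find_schema function and the AVRO_SCHEMA_PATH environment variable are hypothetical; Avro defines neither.

```python
import os


def find_schema(name, search_path=None, env_var="AVRO_SCHEMA_PATH"):
    """Return the first existing file called `name` on the search path.

    AVRO_SCHEMA_PATH is a hypothetical variable; Avro does not define it.
    """
    if search_path is None:
        # Split the variable the same way PATH-like variables are split.
        raw = os.environ.get(env_var, "")
        search_path = [d for d in raw.split(os.pathsep) if d]
    for directory in search_path:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    raise IOError("%s not found in %r" % (name, list(search_path)))
```

Directories are tried in order and the first match wins, which is the behavior PATH, PYTHONPATH, and CLASSPATH share.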

Users have had issues when moving their code from one machine to the other because the packaging and versioning of the schema file may happen separately from the packaging and versioning of the code, so that, when the code is ported to another system, the schema file may be located somewhere crazy (e.g. on a shared file system under a well-known path).



[jira] Commented: (AVRO-659) Portable specification of the location of schema and protocol files

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907278#action_12907278 ] 

Jeff Hammerbacher commented on AVRO-659:
----------------------------------------

Handling imports with Avro IDL is a separate problem from the one I'm trying to describe (poorly). Once Avro IDL outputs a .avsc, that .avsc file needs to be stored somewhere and that location needs to be communicated to the code that will be using the schema. You could just inline the JSON into your code during the Avro IDL compile stage, but some developers might prefer to keep the file separate. If the file is separate, we need a standard way for code that uses the file to locate it.

Does that make sense?



[jira] Commented: (AVRO-659) Portable specification of the location of schema and protocol files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907274#action_12907274 ] 

Doug Cutting commented on AVRO-659:
-----------------------------------

JSON-format schemas should be self-contained, since they must fully describe binary data.

However, Avro IDL supports file inclusion (AVRO-495).  There was discussion in that issue about searching a set of directories for included files.  I suggested we follow the C convention, where double-quoted imports search the application's path and angle-bracketed imports search a "system" path.  Would this address the need you describe?



[jira] Commented: (AVRO-659) Portable specification of the location of schema and protocol files

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907286#action_12907286 ] 

Philip Zeyliger commented on AVRO-659:
--------------------------------------

Because AVRO is not a programming language?  AVRO's metadata (the schema) is a single string.  It has no external references.  The "compiler" layer (in this case, AvroIDL) takes care of external references (via include) and creates a string.  Avro will help you embed that string in an avro data file, and it will help you embed that string in generated code (org.foo.bar.MyRecord.SCHEMA$), or you can store that string however you'd like and get at it.

It may be that we're talking past each other.  I'm guessing that your motivation is that it's tricky to write:

{noformat}
schema = schema.parse(file(os.path.join(os.path.dirname(__path__), "myschema.avsc")))
{noformat}
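
A runnable variant of that one-liner might look like the sketch below. It assumes __file__ was intended (in Python, __path__ is a package attribute holding a list of directories, not a single path), and the helper name read_schema_beside is hypothetical.

```python
import os


def read_schema_beside(module_file, name="myschema.avsc"):
    """Read the schema text stored in the same directory as `module_file`."""
    directory = os.path.dirname(os.path.abspath(module_file))
    with open(os.path.join(directory, name)) as f:
        return f.read()
```

A module would call read_schema_beside(__file__) and hand the resulting string to the Avro library's schema parser. This still relies on the module living in a plain directory on disk.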

In Java, one would write that roughly as Schema.parse(this.getClass().getResourceAsStream("myschema.avsc")).  (I don't recall whether Schema.parse will take a stream, but it certainly could.)

What would be better?  schema.parse_from_avro_specific_logic_for_importing_files("myschema.avsc") would be more confusing.  Generating a file that contains the schema (i.e., "from mygeneratedcode.schema import SCHEMA_STRING; schema.parse(SCHEMA_STRING)") would be reasonable (and is my suggestion above).



[jira] Commented: (AVRO-659) Portable specification of the location of schema and protocol files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907310#action_12907310 ] 

Doug Cutting commented on AVRO-659:
-----------------------------------

Jeff, that sounds like a language-specific issue.  In Java, one generally uses, as Philip indicated, the CLASSPATH via ClassLoader#getResource().  If folks find that idiom too complex, then we could add it as a method in Schema.java, e.g.:

public static Schema getResource(String resource) throws IOException;

Applications simply bundle their .avsc files into their jars with their .class files, then use the above method to load them.

Should we add this method and similar methods to each language?  That seems a fine goal.  Probably we should have a separate Jira issue for each language though.
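
For Python, a rough analogue of the proposed method might look like the sketch below. The function name get_schema_resource is hypothetical and not part of any Avro API; it relies on the standard library's pkgutil.get_data, which reads through the package's loader rather than assuming a plain directory on disk.

```python
import pkgutil


def get_schema_resource(package, resource):
    """Return the text of `resource` bundled inside `package`."""
    data = pkgutil.get_data(package, resource)
    if data is None:
        # get_data returns None when the package cannot be located or its
        # loader does not support reading data files.
        raise IOError("%s not found in package %s" % (resource, package))
    return data.decode("utf-8")
```

An application would ship myschema.avsc inside one of its packages and call, for example, get_schema_resource("myapp.schemas", "myschema.avsc"), where myapp.schemas is a placeholder package name.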

