Posted to dev@sqoop.apache.org by "Keegan Witt (JIRA)" <ji...@apache.org> on 2015/04/09 19:07:13 UTC
[jira] [Commented] (SQOOP-1395) Potential naming conflict in Avro schema

    [ https://issues.apache.org/jira/browse/SQOOP-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487672#comment-14487672 ] 

Keegan Witt commented on SQOOP-1395:
------------------------------------

I wish I had known about this when it was being discussed.  This change broke us :(

Here are some use cases I believe this change breaks:
# Users who maintain their own Avro schema files and generated classes so that their code can more easily interact with Sqooped data.  Not only would they need to change their schemas and class names, they would also need to keep both the old and the new schema names and classes unless they convert the existing data to the new name.  Wouldn't this be most users?  Sqoop doesn't generate the Avro SpecificRecord classes, right?
# Users who add Avro files to a directory that they query with Hive or Impala.  There will now be 2 different schemas in that directory (which will cause Hive/Impala to choke) even though they made no change to the schema itself.  These users will have to convert existing data to the new name, alter their ingestion process to rename the schema before putting files in the directory for Hive/Impala, or have 2 tables (1 for existing data and 1 for new data going forward).
# This isn't as significant a breakage, but users who have scripting around Sqoop may need to adjust their scripts to account for the change in directory name if they aren't explicitly setting the directory.  There could also be other places in their code where they assume the old default behavior (either that the schema and target directory will match or that the table and the target schema will match); who knows what might be out there.
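
A minimal Python sketch of the second case, using only plain dicts to stand in for Avro schemas. The field list is made up for illustration and is not taken from the issue; only the two record names ("users" and "sqoop_import_users") come from the issue description. It shows that files written before and after the change carry identical fields but different record names, so a directory holding both ends up with two distinct schemas.

```python
import json

# Hypothetical field list for a "users" table (an assumption, not from the issue).
FIELDS = [
    {"name": "id", "type": ["null", "int"]},
    {"name": "name", "type": ["null", "string"]},
]


def make_schema(record_name):
    """Build an Avro record schema (as a plain dict) with the given record name."""
    return {"type": "record", "name": record_name, "fields": FIELDS}


def distinct_record_names(schemas):
    """Return the sorted record names found in a collection of schemas."""
    return sorted({schema["name"] for schema in schemas})


old_schema = make_schema("users")               # record name before the change
new_schema = make_schema("sqoop_import_users")  # record name after the change

# Schemas of the files in one directory: two older imports plus one new import.
directory_schemas = [old_schema, old_schema, new_schema]

print(json.dumps(new_schema, indent=2))
print(distinct_record_names(directory_schemas))
```

The fields never changed, yet any consumer that keys on the record name now sees two kinds of record in the same directory.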

Even if it hadn't broken our process, in my opinion it's wrong to pollute the schema name with HOW the data was generated (though a comment is OK).  The schema name should reflect only WHAT the data is.

Why not tell users to use {{\--outdir}} if conflicts occur instead of this breaking change?  Or maybe even default to a random directory in /tmp instead of the current directory?  Or if you didn't like that, just use {{\--class-name}}?  Or if you insist on changing the schema name, why not allow the user to override that without changing the default?
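
To make the last suggestion concrete, here is a rough Python sketch of an opt-in override. The function name {{avro_record_name}} and the {{override}} parameter are hypothetical and only illustrate the idea; they are not existing Sqoop options or code.

```python
def avro_record_name(table_name, override=None):
    """Pick the Avro record name for an imported table.

    Hypothetical helper: `override` stands in for an opt-in user setting.
    When it is not given, the record name stays equal to the table name,
    which is the behavior from before the change.
    """
    if override:
        return override
    return table_name


# Default: existing users keep the record name they already depend on.
print(avro_record_name("users"))

# Opt-in: a user who hits the class-name conflict picks a different name.
print(avro_record_name("users", override="sqoop_import_users"))
```

With something along these lines, only users who actually run into the conflict would need to change anything.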

> Potential naming conflict in Avro schema
> ----------------------------------------
>
>                 Key: SQOOP-1395
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1395
>             Project: Sqoop
>          Issue Type: Sub-task
>          Components: tools
>            Reporter: Qian Xu
>            Assignee: Qian Xu
>            Priority: Minor
>             Fix For: 1.4.6
>
>         Attachments: SQOOP-1395.patch
>
>
> If you import a table "users", Sqoop will generate an entity class named "users.java". The class will be compiled, submitted and used by a MapReduce job. If the target file format is Avro or Parquet, an Avro schema will be generated as well. In terms of the Avro specification, the entity class is described as a "record", and the name of the "record" is "users".
> For Parquet file format handling, we use the Kite SDK to manage Parquet file reading and writing with minimal effort. Kite requires an Avro schema and all data records to be packed into GenericRecord instances. This is where the problem arises. Kite reads the schema first and tries to instantiate a record based on its name. In this case, Kite will try to instantiate a "users" class. Unfortunately, there is already a "users.java" out there. This causes the MapReduce job to fail.
> The patch proposes to change the {{AvroSchemaGenerator}} class so that the record name has a prefix. In this example, the record name for "users.java" will be changed to "sqoop_import_users".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)