You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by vihangk1 <gi...@git.apache.org> on 2018/01/06 02:21:59 UTC

[GitHub] hive pull request #287: HIVE-17580 : Remove standalone-metastore's dependenc...

GitHub user vihangk1 opened a pull request:

    https://github.com/apache/hive/pull/287

    HIVE-17580 : Remove standalone-metastore's dependency with serdes

    Removing the dependency on serdes for the metastore requires a series of changes. I have created multiple commits which hopefully would be easier to review. Each major commit has a descriptive commit message to give a high level idea of what the change is doing. There are still some bits which need to be completed but it would be good to a review.
    
    Overview of all the changes done:
    1. Creates a new module called serde-api under storage-api like discussed. Although I think we can keep it separate as well.
    2. Moved List, Map, Struct, Constant, Primitive, Union ObjectInspectors to serde-api
    3. Moved PrimitiveTypeInfo, PrimitiveTypeEntry and TypeInfo to serde-api.
    4. Moved TypeInfoParser, TypeInfoFactory to serde-api
    5. Added a new class which reading avro storage schema by copying the code from AvroSerde and AvroSerdeUtils. The parsing is done such that String value is first converted into TypeInfos and then into FieldSchemas bypassing the need for ObjectInspectors. In theory we could get rid of TypeInfos as well but that path was getting too difficult with lot of duplicate code between Hive and metastore.
    6. Introduces a default storage schema reader. I noticed that most of the serdes use the same logic to parse the metadata information. This code should be refactored to a common place instead of having many copies (one in standalone hms and another set in multiple serdes)
    7. Moved HiveChar, HiveVarchar, HiveCharWritable, HiveVarcharWritable to storage-api. I noticed that HiveDecimal is already in storage-api. It probably makes sense to move the other primitive types (timestamp, interval etc)to storage-api as well but it requires storage-api to be upgraded to Java 8.
    8. Adds a basic test for the schema reader. I plan to add more tests as this code is reviewed.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vihangk1/hive vihangk1_HIVE-17580

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hive/pull/287.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #287
    
----
commit bbfb7dc44904db74a840167c02b07f50a6010b69
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-11-09T00:52:39Z

    HIVE-17580 : Remove dependency of get_fields_with_environment_context API to serde

commit d54879845eff10c19bc17bda9e09dda16f6fa295
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-11-29T00:54:23Z

    Moved List, Map, Struct OI to storage-api

commit a12d6c7ba3de598c6b6f75da1bd4efcac43036b1
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-11-29T04:19:39Z

    Moved ConstantObjectInspector PrimitiveObjectInspector and UnionObjectInspector

commit 13fb832fc2d51958e75d5e609f6781f87449aed8
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-12-28T01:25:59Z

    Moved PrimitiveTypeInfo to serde-api
    
    In order to move PrimitiveTypeInfo we need to move the PrimitiveTypeEntry as well.
    PrimitiveTypeEntry depends on PrimitiveObjectInspectorUtils which cannot be pulled
    into serde-api. Hence the static final maps are moved to PrimitiveEntry and we provide
    static access methods to these maps along with the register method to add the key
    value pairs in the maps

commit 9cbc789fd3f4ced7ce66a7313c451b75a154976f
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-12-28T20:51:39Z

    Moved the other TypeInfos to serde-api
    
    In order to move the other TypeInfo classes to serde-api we need to move the serdeConstants.java
    as well. This is a thrift generated class. This commit copies the serde.thrift instead of moving.
    The only reason I did not move it is in case of backwards compatibility reasons (in case someone is using the thrift file location to do something).
    If it is okay to move serde.thrift from serde module to serde-api we can delete it from serde module in a separate
    change.
    
    The other concern is there are some TypeInfo classes which do some validation like VarCharTypeInfo, DecimalTypeInfo.
    The validating methods use the actual type implementation like HiveChar, HiveDecimal etc to ensure that the params
    are under the correct limits. This creates a problem since we cannot bring in the type implementations as well to
    serde-api. Currently, I have marked these as TODO and commented them out.

commit 23aa899a90d17648e560d39602a3bd29bf53661e
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-12-29T22:14:06Z

    Moved TypeInfoParser and TypeInfoFactory to serde-api
    
    In order to use TypeInfoParser in standalone metastore, we should move it to serde-api
    There is a problem in moving the TypeInfoParser to serde-api which is there are some validation util
    methods which validated the max lengths of varchar, char or decimal typeinfo's precision and scale.
    These values are originally defined in HiveChar, HiveVarChar and HiveDecimal which cannot be moved to
    serde-api.
    
    In order to fix this this change moved these constants to serde.thrift so that it can be accessed from
    serdeConstants instead of HiveChar or HiveVarChar (not sure if this is the right thing to do, but there
    doesn't seem any good option here)

commit 0b29dc43e78eaa33e0cb672fb9cdabfb5b02d534
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-03T19:45:32Z

    Add avro storeage schema reader
    
    This commit adds a AvroStorageSchemaReader which reads the Avro schema files both for external schema and regular avro tables.
    Most of the util methods are in AvroSchemaUtils class which has methods copied from AvroSerDeUtils. Some of the needed classes like
    SchemaResolutionProblem, InstanceCache, SchemaToTypeInfo, TypeInfoToSchema are also copied from Hive. The constants defined
    in AvroSerde are copied in AvroSerdeConstants. The class AvroFieldSchemaGenerator converts the AvroSchema into List of
    FieldSchema which is returned by the AvroStorageSchemaReader

commit 5f2b174cd4e038e9871348dd4cce5389f0f75776
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-04T01:02:40Z

    Introduce default storage schema reader
    
    This change introduces a default storage schema reader which copies the common code from serdes
    initialization method and uses it to parse the column name, type and comments from the table
    properties. For custom storage schema reades like Avro we will have to add more schema readers
    as and when required

commit 2ecd4ffd1b65c29371cb4b384c43445c7c661161
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-04T18:33:02Z

    Moved the schema reader classes to a separate package

commit 6f734eb0c5a9472b239bdc91da56ec0efe7ceef8
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-04T19:18:03Z

    Integrates the avro schema reader into the DefaultStorageaSchemaReader

commit d93ce33559036020facd12ec5ea258223eb05c4c
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-05T01:40:36Z

    Moved HiveChar, HiveVarChar, HiveCharWritable, HiveVarcharWritable to storage-api
    
    The PrimitiveEntry caches (static maps) which keep a mapping between the typename to the classes needs to be
    populated before using in standalone-metastore. This is done in StorageSchemaUtils static initialization block.
    However, this initialization needs access to HiveChar, HiveVarchar and Timestamp related classes from serde.
    
    I see that HiveDecimal class is in storage-api. HiveChar, HiveVarChar and their base classes are comparitively
    light weight and can be moved to storage-api. However, Timestamp related primitive types like TimestampWritable,
    TimestampLocalTZWritable, TimestampTZ, HiveIntervalYearMonth, HiveIntervalYearMonthWritable, HiveIntervalDayTimeWritable
    classes cannot be moved in a straight-forward way because they have dependency with Java 8 and some util methods and
    classes in LazyBinaryUtils. I have marked this as a TODO currently. As far as I can tell TypeInfoParser does not
    need these classes while parsing type string to deserialize the types.
    
    Note: HiveIntervalDayTime is already in storage-api. Not sure why the other time related types are not in storage-api
    in the first place.

commit 9bfdb9beb48088712db82b31d8bc24651dc0d474
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-05T02:38:28Z

    Added a test for getFields method in standalone-metastore
    
    Adds a test case. Also fixes a bug in AvroFieldSchemaGenerator

----


---

[GitHub] hive pull request #287: HIVE-17580 : Remove standalone-metastore's dependenc...

Posted by vihangk1 <gi...@git.apache.org>.
GitHub user vihangk1 reopened a pull request:

    https://github.com/apache/hive/pull/287

    HIVE-17580 : Remove standalone-metastore's dependency with serdes

    Removing the dependency on serdes for the metastore requires a series of changes. I have created multiple commits which hopefully would be easier to review. Each major commit has a descriptive commit message to give a high level idea of what the change is doing. There are still some bits which need to be completed but it would be good to a review.
    
    Overview of all the changes done:
    1. Creates a new module called serde-api under storage-api like discussed. Although I think we can keep it separate as well.
    2. Moved List, Map, Struct, Constant, Primitive, Union ObjectInspectors to serde-api
    3. Moved PrimitiveTypeInfo, PrimitiveTypeEntry and TypeInfo to serde-api.
    4. Moved TypeInfoParser, TypeInfoFactory to serde-api
    5. Added a new class which reading avro storage schema by copying the code from AvroSerde and AvroSerdeUtils. The parsing is done such that String value is first converted into TypeInfos and then into FieldSchemas bypassing the need for ObjectInspectors. In theory we could get rid of TypeInfos as well but that path was getting too difficult with lot of duplicate code between Hive and metastore.
    6. Introduces a default storage schema reader. I noticed that most of the serdes use the same logic to parse the metadata information. This code should be refactored to a common place instead of having many copies (one in standalone hms and another set in multiple serdes)
    7. Moved HiveChar, HiveVarchar, HiveCharWritable, HiveVarcharWritable to storage-api. I noticed that HiveDecimal is already in storage-api. It probably makes sense to move the other primitive types (timestamp, interval etc)to storage-api as well but it requires storage-api to be upgraded to Java 8.
    8. Adds a basic test for the schema reader. I plan to add more tests as this code is reviewed.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vihangk1/hive vihangk1_HIVE-17580

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hive/pull/287.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #287
    
----
commit bbfb7dc44904db74a840167c02b07f50a6010b69
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-11-09T00:52:39Z

    HIVE-17580 : Remove dependency of get_fields_with_environment_context API to serde

commit d54879845eff10c19bc17bda9e09dda16f6fa295
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-11-29T00:54:23Z

    Moved List, Map, Struct OI to storage-api

commit a12d6c7ba3de598c6b6f75da1bd4efcac43036b1
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-11-29T04:19:39Z

    Moved ConstantObjectInspector PrimitiveObjectInspector and UnionObjectInspector

commit 13fb832fc2d51958e75d5e609f6781f87449aed8
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-12-28T01:25:59Z

    Moved PrimitiveTypeInfo to serde-api
    
    In order to move PrimitiveTypeInfo we need to move the PrimitiveTypeEntry as well.
    PrimitiveTypeEntry depends on PrimitiveObjectInspectorUtils which cannot be pulled
    into serde-api. Hence the static final maps are moved to PrimitiveEntry and we provide
    static access methods to these maps along with the register method to add the key
    value pairs in the maps

commit 9cbc789fd3f4ced7ce66a7313c451b75a154976f
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-12-28T20:51:39Z

    Moved the other TypeInfos to serde-api
    
    In order to move the other TypeInfo classes to serde-api we need to move the serdeConstants.java
    as well. This is a thrift generated class. This commit copies the serde.thrift instead of moving.
    The only reason I did not move it is in case of backwards compatibility reasons (in case someone is using the thrift file location to do something).
    If it is okay to move serde.thrift from serde module to serde-api we can delete it from serde module in a separate
    change.
    
    The other concern is there are some TypeInfo classes which do some validation like VarCharTypeInfo, DecimalTypeInfo.
    The validating methods use the actual type implementation like HiveChar, HiveDecimal etc to ensure that the params
    are under the correct limits. This creates a problem since we cannot bring in the type implementations as well to
    serde-api. Currently, I have marked these as TODO and commented them out.

commit 23aa899a90d17648e560d39602a3bd29bf53661e
Author: Vihang Karajgaonkar <vi...@...>
Date:   2017-12-29T22:14:06Z

    Moved TypeInfoParser and TypeInfoFactory to serde-api
    
    In order to use TypeInfoParser in standalone metastore, we should move it to serde-api
    There is a problem in moving the TypeInfoParser to serde-api which is there are some validation util
    methods which validated the max lengths of varchar, char or decimal typeinfo's precision and scale.
    These values are originally defined in HiveChar, HiveVarChar and HiveDecimal which cannot be moved to
    serde-api.
    
    In order to fix this this change moved these constants to serde.thrift so that it can be accessed from
    serdeConstants instead of HiveChar or HiveVarChar (not sure if this is the right thing to do, but there
    doesn't seem any good option here)

commit 0b29dc43e78eaa33e0cb672fb9cdabfb5b02d534
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-03T19:45:32Z

    Add avro storeage schema reader
    
    This commit adds a AvroStorageSchemaReader which reads the Avro schema files both for external schema and regular avro tables.
    Most of the util methods are in AvroSchemaUtils class which has methods copied from AvroSerDeUtils. Some of the needed classes like
    SchemaResolutionProblem, InstanceCache, SchemaToTypeInfo, TypeInfoToSchema are also copied from Hive. The constants defined
    in AvroSerde are copied in AvroSerdeConstants. The class AvroFieldSchemaGenerator converts the AvroSchema into List of
    FieldSchema which is returned by the AvroStorageSchemaReader

commit 5f2b174cd4e038e9871348dd4cce5389f0f75776
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-04T01:02:40Z

    Introduce default storage schema reader
    
    This change introduces a default storage schema reader which copies the common code from serdes
    initialization method and uses it to parse the column name, type and comments from the table
    properties. For custom storage schema reades like Avro we will have to add more schema readers
    as and when required

commit 2ecd4ffd1b65c29371cb4b384c43445c7c661161
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-04T18:33:02Z

    Moved the schema reader classes to a separate package

commit 6f734eb0c5a9472b239bdc91da56ec0efe7ceef8
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-04T19:18:03Z

    Integrates the avro schema reader into the DefaultStorageaSchemaReader

commit d93ce33559036020facd12ec5ea258223eb05c4c
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-05T01:40:36Z

    Moved HiveChar, HiveVarChar, HiveCharWritable, HiveVarcharWritable to storage-api
    
    The PrimitiveEntry caches (static maps) which keep a mapping between the typename to the classes needs to be
    populated before using in standalone-metastore. This is done in StorageSchemaUtils static initialization block.
    However, this initialization needs access to HiveChar, HiveVarchar and Timestamp related classes from serde.
    
    I see that HiveDecimal class is in storage-api. HiveChar, HiveVarChar and their base classes are comparitively
    light weight and can be moved to storage-api. However, Timestamp related primitive types like TimestampWritable,
    TimestampLocalTZWritable, TimestampTZ, HiveIntervalYearMonth, HiveIntervalYearMonthWritable, HiveIntervalDayTimeWritable
    classes cannot be moved in a straight-forward way because they have dependency with Java 8 and some util methods and
    classes in LazyBinaryUtils. I have marked this as a TODO currently. As far as I can tell TypeInfoParser does not
    need these classes while parsing type string to deserialize the types.
    
    Note: HiveIntervalDayTime is already in storage-api. Not sure why the other time related types are not in storage-api
    in the first place.

commit 9bfdb9beb48088712db82b31d8bc24651dc0d474
Author: Vihang Karajgaonkar <vi...@...>
Date:   2018-01-05T02:38:28Z

    Added a test for getFields method in standalone-metastore
    
    Adds a test case. Also fixes a bug in AvroFieldSchemaGenerator

----


---

[GitHub] hive pull request #287: HIVE-17580 : Remove standalone-metastore's dependenc...

Posted by vihangk1 <gi...@git.apache.org>.
Github user vihangk1 closed the pull request at:

    https://github.com/apache/hive/pull/287


---