Posted to dev@hive.apache.org by "Zhang, Liyun" <lz...@ebay.com> on 2018/08/19 21:51:06 UTC

Can anyone help view the problem about avro schema

Hi :

I have a question about 'avro.schema.url'. I have a partitioned table with a huge number of partitions, like the following:

CREATE TABLE episodes_partitioned
PARTITIONED BY (doctor_pt INT)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///user/YOURUSER/examples/schema/twitter.avsc'
);

I found that several methods call AvroSerdeUtils.determineSchemaOrThrowException. If 'avro.schema.url' is defined, it calls getSchemaFromFS to read the schema. Because getSchemaFromFS is called for every partition, this generates a huge number of RPC calls. So my question is: is there a better way to avoid this, other than defining avro.schema.literal in the CREATE TABLE statement?


Call stack leading to AvroSerdeUtils.determineSchemaOrThrowException:

    at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:109)
    at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:191)
    at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:110)
    at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
    at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:540)
    at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:184)
    at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:295)
    at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:423)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:106)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-1)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    a

AvroSerdeUtils#determineSchemaOrThrowException:


  public static Schema determineSchemaOrThrowException(Configuration conf, Properties properties)
      throws IOException, AvroSerdeException {
    …..

    try {
      // When avro.schema.url is defined, the schema is read from the file system (HDFS here).
      Schema s = getSchemaFromFS(schemaString, conf);
      if (s == null) {
        // In case the schema is not on a file system.
        return AvroSerdeUtils.getSchemaFor(new URL(schemaString));
      }
      return s;
    } catch (IOException ioe) {
      throw new AvroSerdeException("Unable to read schema from given path: " + schemaString, ioe);
    } catch (URISyntaxException urie) {
      throw new AvroSerdeException("Unable to read schema from given path: " + schemaString, urie);
    }

    …..
  }
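
To make the cost concrete, here is a minimal sketch of the difference between fetching the schema once per partition and memoizing it by URL. It is only an illustration under stated assumptions: SchemaCache, countFetches and the loader function are hypothetical names, the loader stands in for a remote read such as getSchemaFromFS, and none of this is Hive's actual code or the way any Hive release solves the problem.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class SchemaCacheSketch {

    /** Hypothetical cache: each distinct schema URL is loaded at most once. */
    static final class SchemaCache {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final Function<String, String> loader;

        SchemaCache(Function<String, String> loader) {
            this.loader = loader;
        }

        String get(String schemaUrl) {
            // computeIfAbsent invokes the loader only when the URL is not cached yet.
            return cache.computeIfAbsent(schemaUrl, loader);
        }
    }

    /** Simulates initializing one SerDe per partition; returns how many remote fetches occurred. */
    static int countFetches(int partitions, boolean useCache) {
        AtomicInteger fetches = new AtomicInteger();
        // Stand-in for a remote schema read; it only counts calls and returns a dummy schema.
        Function<String, String> loader = url -> {
            fetches.incrementAndGet();
            return "{\"type\":\"string\"}";
        };
        SchemaCache cache = new SchemaCache(loader);
        String url = "hdfs:///user/YOURUSER/examples/schema/twitter.avsc";
        for (int i = 0; i < partitions; i++) {
            String schema = useCache ? cache.get(url) : loader.apply(url);
        }
        return fetches.get();
    }

    public static void main(String[] args) {
        System.out.println("uncached fetches: " + countFetches(1000, false));
        System.out.println("cached fetches:   " + countFetches(1000, true));
    }
}
```

With 1000 simulated partitions sharing one schema URL, the uncached path calls the loader 1000 times and the cached path calls it once. A cache like this only helps within a single JVM, so each mapper would still fetch the schema at least once.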



Can anyone help look into this Avro schema problem? Thanks!

Best Regards
ZhangLiyun/Kelly Zhang


Re: Can anyone help view the problem about avro schema

Posted by Mithun RK <my...@gmail.com>.
Hello, Kelly Zhang.

I have had to struggle with this as well. This should have been fixed as
part of HIVE-14792. This should be available in Hive 3.x and the head of
branch-2 (but not the 2.3 release :/).  What version are you seeing this
problem on?

If 3.x, one should be able to enable the optimization via " set
hive.optimize.update.table.properties.from.serde=true; ".
If 2.x, one might need to port HIVE-14792 over. (This should be an easy
port.)

Mithun
