Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/08/07 20:34:32 UTC

[GitHub] [incubator-iceberg] rdblue commented on issue #170: Add support for Iceberg MR / InputFormat and OutputFormat APIs

rdblue commented on issue #170: Add support for Iceberg MR / InputFormat and OutputFormat APIs
URL: https://github.com/apache/incubator-iceberg/issues/170#issuecomment-519258569
 
 
   > IcebergPigInputFormat relies on Java serialization via org.apache.pig.impl.util.ObjectSerializer, shall I reuse the same approach or use a different serialization mechanism?
   
   I don't think a generic InputFormat should rely on Pig, so using Java serialization where necessary is a better choice.
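   A minimal sketch of that idea (the helper names here are hypothetical, not existing Iceberg API): plain Java serialization paired with base64 encoding, so the result fits in a `Configuration` string property much like Pig's `ObjectSerializer` does, but without the Pig dependency:

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.io.ObjectInputStream;
   import java.io.ObjectOutputStream;
   import java.io.Serializable;
   import java.util.Base64;

   // Hypothetical helper for stashing a Serializable object in a Hadoop
   // Configuration as a string property.
   final class SerializationUtil {

     private SerializationUtil() {
     }

     // Serialize an object to a base64-encoded string.
     static String serializeToBase64(Serializable obj) throws IOException {
       ByteArrayOutputStream bytes = new ByteArrayOutputStream();
       try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
         out.writeObject(obj);
       }
       return Base64.getEncoder().encodeToString(bytes.toByteArray());
     }

     // Deserialize an object previously written by serializeToBase64.
     @SuppressWarnings("unchecked")
     static <T> T deserializeFromBase64(String encoded)
         throws IOException, ClassNotFoundException {
       byte[] bytes = Base64.getDecoder().decode(encoded);
       try (ObjectInputStream in =
           new ObjectInputStream(new ByteArrayInputStream(bytes))) {
         return (T) in.readObject();
       }
     }
   }
   ```

   The encoded string can then be set on the job `Configuration` and decoded again on the task side.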
   
   Also, there are some situations where you'd want to use other options:
   * Classes that have JSON parsers, like `Schema` and `PartitionSpec`, should serialize to/from JSON, because that produces a human-readable string
   * Classes with a serialization requirement from MR should use that. For example, `InputSplit` needs to be `Writable`, so I would expect an `IcebergInputSplit` to serialize itself correctly using the `Writable` interface.
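   For the split case, a rough sketch of the `Writable` pattern follows. The real class would extend `InputSplit` and implement `org.apache.hadoop.io.Writable`; the `write`/`readFields` signatures below match that interface, but the Hadoop types and field layout are placeholders so the example stays self-contained:

   ```java
   import java.io.DataInput;
   import java.io.DataOutput;
   import java.io.IOException;

   // Sketch of how an IcebergInputSplit might serialize itself.
   // Field names and contents are hypothetical.
   class IcebergInputSplitSketch {
     private String taskJson; // e.g. a JSON-serialized scan task (hypothetical)
     private long length;

     // MR requires a no-arg constructor so readFields can restore state.
     IcebergInputSplitSketch() {
     }

     IcebergInputSplitSketch(String taskJson, long length) {
       this.taskJson = taskJson;
       this.length = length;
     }

     // Writable.write: serialize this split's state.
     public void write(DataOutput out) throws IOException {
       out.writeUTF(taskJson);
       out.writeLong(length);
     }

     // Writable.readFields: restore the state written by write().
     public void readFields(DataInput in) throws IOException {
       this.taskJson = in.readUTF();
       this.length = in.readLong();
     }

     String taskJson() {
       return taskJson;
     }

     long length() {
       return length;
     }
   }
   ```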
   
   > Thoughts on using GenericRecord as data container for both the Input and Output formats?
   
   I'd use Iceberg's `Record` interface instead. `GenericRecord` is the implementation class. It's also the name of an Avro class, which we shouldn't expose.
   
   > IcebergPigInputFormat declares a public constructor IcebergPigInputFormat(Table table) but I don’t think that’s an option for MR since the JobSubmitter class instantiates the input format via reflection without constructor arguments. How do we solve this?
   
   The easiest solution is to use `HiveCatalogs` like we do in Spark and set the configuration on the catalog that gets returned. The drawback there is that we didn't intend to use `HiveCatalogs` for very long because Spark has a catalog plugin system coming in 3.0 to handle this.
   
   Another option is instantiating using a no-arg constructor and setting the Configuration if the Catalog implements [Configurable](https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/conf/Configurable.html). That sounds reasonable to me, but it is not very generic.
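   That pattern might look like the following sketch, where `Configuration` and `Configurable` are stand-ins for the Hadoop interfaces (so the example is self-contained) and the catalog class is hypothetical:

   ```java
   // Stand-in for org.apache.hadoop.conf.Configuration.
   interface Configuration {
     String get(String key);
   }

   // Stand-in for org.apache.hadoop.conf.Configurable.
   interface Configurable {
     void setConf(Configuration conf);
   }

   interface Catalog {
     String name();
   }

   // A catalog that opts in to receiving the job configuration.
   class ConfigurableCatalog implements Catalog, Configurable {
     private Configuration conf;

     @Override
     public void setConf(Configuration conf) {
       this.conf = conf;
     }

     @Override
     public String name() {
       return conf == null ? "unconfigured" : conf.get("catalog.name");
     }
   }

   class CatalogLoader {
     // Instantiate with a no-arg constructor (MR would first look the class up
     // by name from the configuration), then configure only if it opts in.
     static Catalog load(Class<? extends Catalog> catalogClass, Configuration conf)
         throws Exception {
       Catalog catalog = catalogClass.getDeclaredConstructor().newInstance();
       if (catalog instanceof Configurable) {
         ((Configurable) catalog).setConf(conf);
       }
       return catalog;
     }
   }
   ```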
   
   Last, you could add a `ServiceLoader`-based factory for catalogs. Then you could load a factory by name and call `getCatalog(Configuration)` on it.
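   A rough sketch of that approach, with `CatalogFactory` as a hypothetical service interface (real implementations would register themselves under `META-INF/services/CatalogFactory`; the configuration is simplified to a `String` here in place of Hadoop's `Configuration`):

   ```java
   import java.util.ServiceLoader;

   interface Catalog {
   }

   // Hypothetical service interface; each catalog implementation would ship
   // a factory and register it in META-INF/services.
   interface CatalogFactory {
     // Short name used to select this factory, e.g. "hive" or "hadoop".
     String name();

     // Build a catalog from the job configuration.
     Catalog getCatalog(String conf);
   }

   class CatalogFactories {
     // Find a registered factory by name, or return null if none matches.
     static CatalogFactory find(String name) {
       for (CatalogFactory factory : ServiceLoader.load(CatalogFactory.class)) {
         if (factory.name().equals(name)) {
           return factory;
         }
       }
       return null;
     }
   }
   ```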

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org