Posted to user@pig.apache.org by Richard Park <rp...@linkedin.com> on 2010/05/26 04:08:14 UTC

Custom OutputFormat and Schemas

Hi,
I’m porting our Load/Store funcs from Pig 0.6 to 0.7. Currently we’re storing data in serialized binary JSON. The format requires that the schema metadata be stored in the header of the output file. Converting our LoadFunc was a fairly painless experience.

However, I’ve hit a snag while doing the StoreFunc. We’re using a custom SequenceFileOutputFormat, and when its getRecordWriter is invoked, we create a SequenceFile.Metadata with the schema and pass it to the SequenceFile.Writer constructor. Unfortunately, with Pig 0.7 the schema doesn’t seem to be available at the time the Writer is constructed. In 0.6, there was a MapRedUtil function that allowed us to get the schema through the StoreConfig, but that seems to have been removed.

checkSchema gets called on the client end, and StoreMetadata.storeSchema seems to be invoked too late. How would I go about getting this schema data early (before the writer is created)? I suppose I could add the schema as a parameter in the Configuration, but I’m not sure whether Configuration parameters will be properly propagated between the Load and Store funcs. I’ll play around with configs next.

Any advice would be appreciated.

Thanks,
-Richard


Re: Custom OutputFormat and Schemas

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Richard,
I think you want to use UDFContext to pass along the schema you get from
checkSchema. Here are the docs:

http://hadoop.apache.org/pig/docs/r0.7.0/udf.html#Passing+Configurations+to+UDFs
http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/impl/util/UDFContext.html

-D
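
A minimal sketch of the hand-off suggested above, assuming the schema can be serialized to a string on the client and read back before the writer is created. So that it runs without Pig on the classpath, a plain java.util.Properties stands in for the properties object that UDFContext would supply. The class name, method names, and property key are hypothetical; the comments note where the real Pig calls would likely go.

```java
import java.util.Properties;

// Stand-in for a store function. A plain Properties object replaces the one
// Pig's UDFContext would hand back, so the sketch runs without Pig.
public class SchemaHandoffSketch {
    // Hypothetical property key, not a Pig constant.
    static final String SCHEMA_KEY = "sketch.output.schema";

    private final Properties udfProps;

    // Assumption: in Pig the properties would come from something like
    // UDFContext.getUDFContext().getUDFProperties(getClass()).
    SchemaHandoffSketch(Properties udfProps) {
        this.udfProps = udfProps;
    }

    // Client side: in Pig this would be the body of checkSchema(...),
    // with the schema serialized to a string first.
    void rememberSchema(String serializedSchema) {
        udfProps.setProperty(SCHEMA_KEY, serializedSchema);
    }

    // Task side: would be called from getRecordWriter, before the
    // SequenceFile.Metadata and SequenceFile.Writer are built.
    String schemaForHeader() {
        String schema = udfProps.getProperty(SCHEMA_KEY);
        if (schema == null) {
            throw new IllegalStateException("schema was not stored on the client side");
        }
        return schema;
    }

    public static void main(String[] args) {
        // Assumption: Pig ships these properties with the job.
        Properties shared = new Properties();
        new SchemaHandoffSketch(shared).rememberSchema("name:chararray,age:int");
        // A separate instance, as on a task node, reads the same value back.
        System.out.println(new SchemaHandoffSketch(shared).schemaForHeader());
    }
}
```

If the same store function class is used more than once in a script, the property lookup would presumably need to be keyed per store statement so the schemas do not overwrite each other; the sketch ignores that case.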
