Posted to user@pig.apache.org by Mason <ma...@verbasoftware.com> on 2013/01/15 23:23:27 UTC

possible to infer schema from TSV header?

I have TSVs with a lot of columns, and I would like to address them by
name, as specified in the header line (first row), within Pig.

The best I can come up with at the moment is to write a script that strips the
header line from the file and converts it to the form (col1:chararray,
col2:chararray, ...), then plug that schema string into the AS portion of
my LOAD statement. Then I'll project the columns I want and manually
typecast them.
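
A minimal sketch of that header-to-schema step, in Java. It assumes the header is a single tab-separated line and that every column should be declared as chararray (Pig's type name for strings). The class name HeaderToSchema, the method toPigSchema, and the "c_" prefix for names that do not start with a letter are illustrative choices, not part of Pig.

```java
import java.util.ArrayList;
import java.util.List;

final class HeaderToSchema {
    /**
     * Turns a tab-separated header line into the body of a Pig AS clause,
     * e.g. "(col1:chararray, col2:chararray)".
     */
    public static String toPigSchema(String headerLine) {
        List<String> fields = new ArrayList<>();
        // Limit -1 keeps trailing empty column names instead of dropping them.
        for (String raw : headerLine.split("\t", -1)) {
            // Pig aliases may contain only letters, digits and underscores.
            String name = raw.trim().replaceAll("[^A-Za-z0-9_]", "_");
            // Pig aliases must start with a letter; the "c_" prefix is an assumption.
            if (name.isEmpty() || !Character.isLetter(name.charAt(0))) {
                name = "c_" + name;
            }
            fields.add(name + ":chararray");
        }
        return "(" + String.join(", ", fields) + ")";
    }

    public static void main(String[] args) {
        // Prints the schema string for a sample header.
        System.out.println(toPigSchema("user id\tsigned-up\t2012_total"));
    }
}
```

The resulting string still has to be pasted or templated into the LOAD ... AS statement, and the header row itself still has to be removed from the data. The sketch does not detect two headers that sanitize to the same name.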

Is there a better, simpler way?

-Mason

Re: possible to infer schema from TSV header?

Posted by Bill Graham <bi...@gmail.com>.
Take a look at the org.apache.pig.builtin.PigStorage.getSchema(..) method.
You can subclass PigStorage and override that method to read the schema
from the first line of the file. Alternatively, you can implement the
LoadMetadata interface in the loader you're already using.


On Tue, Jan 15, 2013 at 2:27 PM, Mason <ma...@verbasoftware.com> wrote:

> Actually, I'll probably just end up computing positions to use, rather
> than pasting in a schema, but the general point is that I'd love to do
> it some other way, because little hacks like these make my data
> pipeline feel fragile.
>
> I'm willing to write some Java if anyone could point me in the right
> direction.
>
> -Mason




Re: possible to infer schema from TSV header?

Posted by Mason <ma...@verbasoftware.com>.
Actually, I'll probably just end up computing positions to use, rather
than pasting in a schema, but the general point is that I'd love to do
it some other way, because little hacks like these make my data
pipeline feel fragile.
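
A rough sketch of the "compute positions" idea, in Java. It assumes a tab-separated header line with unique column names and relies on Pig's positional notation ($0, $1, ...). HeaderPositions, positions and ref are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HeaderPositions {
    /** Maps each column name in a tab-separated header line to its zero-based position. */
    public static Map<String, Integer> positions(String headerLine) {
        Map<String, Integer> byName = new LinkedHashMap<>();
        String[] names = headerLine.split("\t", -1);
        for (int i = 0; i < names.length; i++) {
            String name = names[i].trim();
            // Refuse ambiguous headers rather than silently keeping the last match.
            if (byName.put(name, i) != null) {
                throw new IllegalArgumentException("duplicate column name: " + name);
            }
        }
        return byName;
    }

    /** Returns the Pig positional reference (such as "$2") for a named column. */
    public static String ref(Map<String, Integer> byName, String column) {
        Integer index = byName.get(column);
        if (index == null) {
            throw new IllegalArgumentException("unknown column: " + column);
        }
        return "$" + index;
    }

    public static void main(String[] args) {
        Map<String, Integer> cols = positions("id\tname\tprice");
        System.out.println(ref(cols, "price")); // prints $2
    }
}
```

Failing fast on unknown or duplicate names at least surfaces a changed header before the Pig job runs, which addresses part of the fragility concern.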

I'm willing to write some Java if anyone could point me in the right direction.

-Mason
