Posted to user@pig.apache.org by Vadi Hombal <vk...@gmail.com> on 2013/03/27 14:30:47 UTC

Efficient load for data with large number of columns

suppose my data has 100 columns or fields, and i want to impose a schema.
is there a way i can create a separate file describing the schema of these
fields, and let PIG read the schema from that file?

for example.
instead of verbose typing in the Pig script....
A = load mydata as (c1:int, c2:chararray, ...... ,c100:chararray)

can i do something like.
A = load mydata as described in myschema.txt

myschema.txt would be something like
c1: int
c2: chararray
....
....
c100: chararray

thanks
vkh

Re: Efficient load for data with large number of columns

Posted by Mike Sukmanowsky <mi...@parsely.com>.
Yes, as of Pig 0.10.0 you can specify a schema file along with PigStorage
when loading or storing data; see:
http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
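
Roughly, it looks like this (a sketch; the relation and path names are
placeholders, not from your data):

-- storing with the '-schema' option writes a hidden .pig_schema JSON file
-- next to the output data
store A into 'output_with_schema' using PigStorage(',', '-schema');

-- a later load from that directory can pick the schema up from that file,
-- so there is no need to spell out all 100 columns in an as clause
B = load 'output_with_schema' using PigStorage(',', '-schema');
describe B;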


On Wed, Mar 27, 2013 at 9:30 AM, Vadi Hombal <vk...@gmail.com> wrote:

> suppose my data has 100 columns or fields, and i want to impose a schema.
> is there a way i can create a separate file describing the schema of these
> fields, and let PIG read the schema from that file?
>
> for example.
> instead of verbose typing in the Pig script....
> A = load mydata as (c1:int, c2:chararray, ...... ,c100:chararray)
>
> can i do something like.
> A = load mydata as described in myschema.txt
>
> myschema.txt would be something like
> c1: int
> c2: chararray
> ....
> ....
> c100: chararray
>
> thanks
> vkh
>



-- 
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY  10018
p: +1 (416) 953-4248
e: mike@parsely.com

Re: Efficient load for data with large number of columns

Posted by MARCOS MEDRADO RUBINELLI <ma...@buscapecompany.com>.
> i did a store to figure out how to write the schema in json and then used
> that as a template to create a schema for load.
>
> from my experiments, for data with three columns (int, chararray, float) i
> figured this is the minimal schema:
> {"fields":
>   [
>     {"name":"year","type":10},
>     {"name":"name","type":55},
>     {"name":"num","type":20}
>   ]
> }
>
> is there any literature on how to write proper json for schemas?
>
> thanks
> vkh

Sadly, there isn't. For a simple, flat schema it isn't hard: you just add another entry to the "fields" array with the column's name and the numeric code of its DataType:
http://pig.apache.org/docs/r0.10.0/api/constant-values.html#org.apache.pig.data.DataType.GENERIC_WRITABLECOMPARABLE
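
For example, adding one more column to the minimal three-field schema you
posted is just one more entry in the "fields" array (a sketch; the "score"
column is made up, and 25 is the code listed on that page for DataType.DOUBLE):

{"fields":
  [
    {"name":"year","type":10},
    {"name":"name","type":55},
    {"name":"num","type":20},
    {"name":"score","type":25}
  ]
}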

For a more complex schema, it's easier to actually construct a ResourceSchema object and serialize it with Jackson:

http://pig.apache.org/docs/r0.10.0/api/index.html?org/apache/pig/ResourceSchema.html

Regards,
Marcos

Re: Efficient load for data with large number of columns

Posted by Vadi Hombal <vk...@gmail.com>.
thank you Mike and Marcos,
this worked well.

i did a store to figure out how to write the schema in json and then used
that as a template to create a schema for load.

from my experiments, for data with three columns (int, chararray, float) i
figured this is the minimal schema
{"fields":
  [
    {"name":"year","type":10},
    {"name":"name","type":55},
    {"name":"num","type":20}
  ]
}
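
with that saved next to the data as .pig_schema, the load no longer needs the
long as clause. roughly (the path here is just a placeholder):

A = load 'mydata' using PigStorage(',');
describe A;   -- should report year: int, name: chararray, num: float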

is there any literature on how to write proper json for schemas?

thanks
vkh

On Wed, Mar 27, 2013 at 10:16 AM, MARCOS MEDRADO RUBINELLI <
marcosm@buscapecompany.com> wrote:

> suppose my data has 100 columns or fields, and i want to impose a schema.
> is there a way i can create a separate file describing the schema of these
> fields, and let PIG read the schema from that file?
>
>
> Yes, if you put a json file named ".pig_schema" in the same directory as
> your data, Pig will use it to determine the schema:
>
> http://pig.apache.org/docs/r0.10.0/func.html#pigstorage
>
> Regards,
> Marcos
>

Re: Efficient load for data with large number of columns

Posted by MARCOS MEDRADO RUBINELLI <ma...@buscapecompany.com>.
> suppose my data has 100 columns or fields, and i want to impose a schema.
> is there a way i can create a separate file describing the schema of these
> fields, and let PIG read the schema from that file?


Yes, if you put a json file named ".pig_schema" in the same directory as your data, Pig will use it to determine the schema:

http://pig.apache.org/docs/r0.10.0/func.html#pigstorage
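
A minimal sketch of what that ends up looking like (the paths and column names
below are placeholders, not from your data):

/data/mydata/part-00000       the records themselves
/data/mydata/.pig_schema      the JSON schema file Pig picks up

A = load '/data/mydata' using PigStorage(',');
describe A;   -- the schema comes from .pig_schema, no as clause needed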

Regards,
Marcos