Posted to user@pig.apache.org by Vadi Hombal <vk...@gmail.com> on 2013/03/27 14:30:47 UTC
Efficient load for data with large number of columns
suppose my data has 100 columns or fields, and i want to impose a schema.
is there a way i can create a separate file describing the schema of these
fields, and let PIG read the schema from that file?
for example.
instead of verbose typing in the pig script....
A = load mydata as (c1:int, c2:chararray, ...... ,c100:chararray)
can i do something like.
A = load mydata as described in myschema.txt
myschema.txt would be something like
c1: int
c2: chararray
....
....
c100: chararray
thanks
vkh
Re: Efficient load for data with large number of columns
Posted by Mike Sukmanowsky <mi...@parsely.com>.
Yes, as of Pig 0.10.0 you can specify a schema file along with PigStorage
when loading or storing data; see
http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
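For concreteness, a minimal sketch of that workflow (file names, the two-column schema, and the tab delimiter are illustrative): storing with the '-schema' option makes PigStorage write a hidden .pig_schema JSON file next to the output, and a later load can pick the schema up without spelling out an AS clause.

```pig
-- store once with '-schema' so PigStorage writes a .pig_schema
-- (JSON) file alongside the output part files
A = LOAD 'mydata' AS (c1:int, c2:chararray);
STORE A INTO 'mydata_with_schema' USING PigStorage('\t', '-schema');

-- later loads can rely on the stored schema; no AS clause needed
B = LOAD 'mydata_with_schema' USING PigStorage('\t', '-schema');
DESCRIBE B;
```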
--
Mike Sukmanowsky
Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY 10018
p: +1 (416) 953-4248
e: mike@parsely.com
Re: Efficient load for data with large number of columns
Posted by MARCOS MEDRADO RUBINELLI <ma...@buscapecompany.com>.
> i did a store to figure out how to write the schema in json and then used
> that as a template to create a schema for load.
> from my experiments, for data with three columns (int, chararray, float) i
> figured this is the minimal schema
> {"fields":
> [
> {"name":"year","type":10},
> {"name":"name","type":55},
> {"name":"num","type":20}
> ]
> }
> is there any literature on how to write proper json for schemas?
> thanks
> vkh
Sadly, there isn't. For a simple, flat schema, it isn't hard: you just add one entry per field, with its name and the corresponding DataType code:
http://pig.apache.org/docs/r0.10.0/api/constant-values.html#org.apache.pig.data.DataType.GENERIC_WRITABLECOMPARABLE
For a more complex schema, it's easier to actually construct a ResourceSchema object and serialize it with Jackson:
http://pig.apache.org/docs/r0.10.0/api/index.html?org/apache/pig/ResourceSchema.html
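Putting the two together, here is a slightly fuller hand-written sketch (field names are illustrative), assuming the usual 0.10 DataType constants from the constant-values page above: BOOLEAN=5, INTEGER=10, LONG=15, FLOAT=20, DOUBLE=25, BYTEARRAY=50, CHARARRAY=55.

```json
{"fields": [
  {"name": "id",     "type": 15},
  {"name": "label",  "type": 55},
  {"name": "score",  "type": 25},
  {"name": "active", "type": 5}
]}
```

Pig may also emit extra keys per field (e.g. "description" and a nested "schema" for complex types), so generating a template by doing a store first, as described in this thread, remains the safest starting point.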
Regards,
Marcos
Re: Efficient load for data with large number of columns
Posted by Vadi Hombal <vk...@gmail.com>.
thank you Mike and Marcos,
this worked well.
i did a store to figure out how to write the schema in json and then used
that as a template to create a schema for load.
from my experiments, for data with three columns (int, chararray, float) i
figured this is the minimal schema
{"fields":
[
{"name":"year","type":10},
{"name":"name","type":55},
{"name":"num","type":20}
]
}
is there any literature on how to write proper json for schemas?
thanks
vkh
Re: Efficient load for data with large number of columns
Posted by MARCOS MEDRADO RUBINELLI <ma...@buscapecompany.com>.
> suppose my data has 100 columns or fields, and i want to impose a schema.
> is there a way i can create a separate file describing the schema of these
> fields, and let PIG read the schema from that file?
Yes, if you put a json file named ".pig_schema" in the same directory as your data, Pig will use it to determine the schema:
http://pig.apache.org/docs/r0.10.0/func.html#pigstorage
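A sketch of what this looks like in practice (the directory name is illustrative): once a .pig_schema file sits next to the data, a plain PigStorage load picks the schema up on its own, so the 100-column AS clause disappears from the script.

```pig
-- mydata/.pig_schema holds the JSON schema; PigStorage reads it
-- automatically on load (pass '-noschema' to ignore it instead)
A = LOAD 'mydata' USING PigStorage();
DESCRIBE A;  -- shows the schema taken from .pig_schema
```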
Regards,
Marcos