Posted to user@pig.apache.org by Russell Jurney <ru...@gmail.com> on 2012/03/31 00:23:47 UTC
AvroStorage Question - ARRAY_ELEM bothers me. It called me stupid.
I sent this to the Avro list but got no reply, so I thought I'd try here.
Is it possible to name string elements in the schema of an array?
Specifically, below I want to name the email addresses in the
from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by
Pig's AvroStorage. I know I can probably fix this in Java in the Pig
AvroStorage UDF, but I'm hoping I can also fix it more easily in the
schema. Last time I read Avro's array docs in this context, my hit-points
dropped by a third, so pardon me if I've not rtfm this time :)
Complete description of what I'm doing follows:
Avro schema for my emails:
{
"namespace": "agile.data.avro",
"name": "Email",
"type": "record",
"fields": [
{"name":"message_id", "type": ["string", "null"]},
{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"bcc","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
{"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
{"name":"subject", "type": ["string", "null"]},
{"name":"body", "type": ["string", "null"]},
{"name":"date", "type": ["string", "null"]}
]
}
Pig to publish my Avros:
grunt> emails = load '/me/tmp/emails' using AvroStorage();
grunt> describe emails
emails:
{
message_id: chararray,
from:
{ PIG_WRAPPER: (*ARRAY_ELEM*: chararray) },
to:
{ PIG_WRAPPER: (*ARRAY_ELEM*: chararray) },
cc:
{ PIG_WRAPPER: (*ARRAY_ELEM*: chararray) },
bcc:
{ PIG_WRAPPER: (*ARRAY_ELEM*: chararray) },
reply_to:
{ PIG_WRAPPER: (*ARRAY_ELEM*: chararray) },
in_reply_to:
{ PIG_WRAPPER: (*ARRAY_ELEM*: chararray) },
subject: chararray,
body: chararray,
date: chararray
}
grunt> store emails into 'mongodb://localhost/agile_data.emails' using MongoStorage();
My emails in MongoDB:
> db.emails.findOne()
{
"_id" : ObjectId("4f738a35414e113e75707b97"),
"message_id" : "<4f...@li169-134.mail>",
"from" : [
{
"*ARRAY_ELEM*" : "daily@jobchangealerts.com"
}
],
"to" : [
{
"*ARRAY_ELEM*" : "Russell.jurney@gmail.com"
}
],
"cc" : null,
"bcc" : null,
"reply_to" : null,
"in_reply_to" : null,
"subject" : "Daily Job Change Alerts from SalesLoft",
"body" : "Daily Job Change Alerts from SalesLoft",
"date" : "2012-03-27T08:00:29"
}
My email on screen:
[image: Inline image 1]
My face when I see ARRAY_ELEM, because it means more complex presentation
code: *:(*
What I really want is just an array of strings. Is this possible?
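For reference, the extra presentation code I'd like to avoid looks roughly like the sketch below. It is plain JavaScript, not anything provided by Pig, AvroStorage or MongoDB. The names unwrapElements and flattenEmail are hypothetical, and it assumes every array element is a single-field wrapper object, as in the findOne() output above. Null fields and anything that doesn't match that shape are passed through untouched.

```javascript
// Hypothetical helper: turn [ { <wrapper key>: "a@b.com" } ] into [ "a@b.com" ].
// Assumes each wrapped element is an object with exactly one field;
// the wrapper key name itself is never inspected.
function unwrapElements(value) {
  if (!Array.isArray(value)) {
    return value; // null fields such as "cc" stay null
  }
  return value.map(function (elem) {
    if (elem !== null && typeof elem === "object" && !Array.isArray(elem)) {
      var keys = Object.keys(elem);
      if (keys.length === 1) {
        return elem[keys[0]]; // pull the single wrapped value out
      }
    }
    return elem; // already a plain value, or an unexpected shape
  });
}

// Hypothetical helper: return a shallow copy of one email document
// with its address fields unwrapped. Field names come from the schema above.
function flattenEmail(doc) {
  var addressFields = ["from", "to", "cc", "bcc", "reply_to", "in_reply_to"];
  var out = {};
  Object.keys(doc).forEach(function (key) {
    out[key] = doc[key];
  });
  addressFields.forEach(function (field) {
    if (field in out) {
      out[field] = unwrapElements(out[field]);
    }
  });
  return out;
}
```

Calling flattenEmail on a document shaped like the one above would leave subject, body and date alone and hand back from and to as plain arrays of strings. It works, but it has to run on every read, which is why I'd rather fix this at the schema or storage level.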
--
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com