Posted to user@avro.apache.org by Russell Jurney <ru...@gmail.com> on 2012/03/30 03:05:19 UTC

AvroStorage/Avro Schema Question

Is it possible to name string elements in the schema of an array?
Specifically, below I want to name the email addresses in the
from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by
Pig's AvroStorage.  I know I can probably fix this in Java in the Pig
AvroStorage UDF, but I'm hoping I can also fix it more easily in the
schema.  Last time I read Avro's array docs in this context, my hit-points
dropped by a third, so pardon me if I've not rtfm this time :)

Complete description of what I'm doing follows:

Avro schema for my emails:

{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"bcc","type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"subject", "type": ["string", "null"]},
        {"name":"body", "type": ["string", "null"]},
        {"name":"date", "type": ["string", "null"]}
    ]
}


Pig to publish my Avros:

grunt> emails = load '/me/tmp/emails' using AvroStorage();
grunt> describe emails

emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM:
chararray)},to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER:
(ARRAY_ELEM: chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM:
chararray)},reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},in_reply_to:
{PIG_WRAPPER: (ARRAY_ELEM: chararray)},subject: chararray,body:
chararray,date: chararray}

grunt> store emails into 'mongodb://localhost/agile_data.emails' using MongoStorage();


My emails in MongoDB:

> db.emails.findOne()
{
    "_id" : ObjectId("4f738a35414e113e75707b97"),
    "message_id" : "<4f...@li169-134.mail>",
    "from" : [
        {
            "ARRAY_ELEM" : "daily@jobchangealerts.com"
        }
    ],
    "to" : [
        {
            "ARRAY_ELEM" : "Russell.jurney@gmail.com"
        }
    ],
    "cc" : null,
    "bcc" : null,
    "reply_to" : null,
    "in_reply_to" : null,
    "subject" : "Daily Job Change Alerts from SalesLoft",
    "body" : "Daily Job Change Alerts from SalesLoft",
    "date" : "2012-03-27T08:00:29"
}


My email on screen:

[image: Inline image 1]

My face when I see ARRAY_ELEM, because it means more complex presentation
code: *:(*
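
For reference, the extra presentation code in question could be as small as the sketch below. It assumes documents shaped like the findOne() output above; the helper name unwrap_array_elems is hypothetical and is not part of Pig, Avro or MongoDB.

```python
def unwrap_array_elems(doc, key="ARRAY_ELEM"):
    """Return a copy of doc where lists of {key: value} dicts become lists of values."""
    cleaned = {}
    for field, value in doc.items():
        if isinstance(value, list):
            # Unwrap only items that are exactly {key: value}; leave anything else alone.
            cleaned[field] = [
                item[key] if isinstance(item, dict) and set(item) == {key} else item
                for item in value
            ]
        else:
            cleaned[field] = value
    return cleaned


# Trimmed-down copy of the document shown above.
email = {
    "from": [{"ARRAY_ELEM": "daily@jobchangealerts.com"}],
    "to": [{"ARRAY_ELEM": "Russell.jurney@gmail.com"}],
    "cc": None,
    "subject": "Daily Job Change Alerts from SalesLoft",
}
flat = unwrap_array_elems(email)
print(flat["from"])
```

This only hides the wrapper at display time; the stored documents still carry ARRAY_ELEM.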
-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: AvroStorage/Avro Schema Question

Posted by Russell Jurney <ru...@gmail.com>.
The fix was this: 

{
    "type":"record",
    "name":"Email",
    "fields":
    [
        {
            "name":"message_id",
            "type":["null","string"],
            "doc":""
        },
        {
            "name":"in_reply_to",
            "type": ["string", "null"]
        },
        {
            "name":"subject", 
            "type": ["string", "null"]
        },
        {
            "name":"body", 
            "type": ["string", "null"]
        },
        {
            "name":"date", 
            "type": ["string", "null"]
        },
        {
            "name":"froms",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"from",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        },
        {
            "name":"tos",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"to",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        },        
        {
            "name":"ccs",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"cc",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        },
        {
            "name":"bccs",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"bcc",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        },
        {
            "name":"reply_tos",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"reply_to",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        }
    ]
}
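
The five address fields in the schema above all follow one pattern: a nullable array whose items are nullable named records with real_name and address. A minimal Python sketch that generates the same field definitions, in case typing them by hand gets tedious. The helper names nullable and address_list_field are hypothetical, and the output is plain JSON, not anything Avro-specific.

```python
import json


def nullable(avro_type):
    """Union with null listed first, as in the schema above."""
    return ["null", avro_type]


def address_list_field(field_name, record_name):
    """Build one 'froms'-style field: a nullable array of nullable named records."""
    record = {
        "type": "record",
        "name": record_name,
        "fields": [
            {"name": "real_name", "type": nullable("string"), "doc": ""},
            {"name": "address", "type": nullable("string"), "doc": ""},
        ],
    }
    return {
        "name": field_name,
        "type": nullable({"type": "array", "items": nullable(record)}),
        "doc": "",
    }


# (plural field name, singular record name) pairs taken from the schema above.
pairs = [("froms", "from"), ("tos", "to"), ("ccs", "cc"),
         ("bccs", "bcc"), ("reply_tos", "reply_to")]
fields = [address_list_field(field, record) for field, record in pairs]
print(json.dumps(fields[0], indent=4))
```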

On Tue, Apr 10, 2012 at 2:36 AM, Russell Jurney <ru...@gmail.com> wrote:
Hmmmm unable to get this to work:

{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"froms","type": [{"type":"record", "name":"from", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"tos","type": [{"type":"record", "name":"to", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"ccs","type": [{"type":"record", "name":"cc", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"bccs","type": [{"type":"record", "name":"bcc", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"reply_tos","type": [{"type":"record", "name":"reply_to", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"subject", "type": ["string", "null"]},
        {"name":"body", "type": ["string", "null"]},
        {"name":"date", "type": ["string", "null"]}
    ]
}

On Tue, Apr 10, 2012 at 2:26 AM, Russell Jurney <ru...@gmail.com> wrote:
In thinking about it more... it seems that unfortunately, the only thing I can really do is to change the schema for all email address fields:

{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
to:
{"name":"froms","type": [{"type":"record", "name":"from", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},

That is, to pluralize everything and then individually name array elements. I will try running this through my stack.


On Mon, Apr 2, 2012 at 9:13 AM, Scott Carey <sc...@apache.org> wrote:
It appears as though the Avro to PigStorage schema translation names (in pig) all arrays ARRAY_ELEM.  The nullable wrapper is 'visible' and the field name is not moved onto the bag name.   

About a year and a half ago I started
https://issues.apache.org/jira/browse/AVRO-592

but before finishing it AvroStorage was written elsewhere.  I don't recall exactly what I did with the schema translation there, but I recall the mapping from an Avro schema to pig tried to hide the nullable wrappers more.


In Avro, arrays are unnamed types, so I see two things you could probably do without any code changes:

* Add a line in the pig script to project / rename the fields to what you want (unfortunate and clumsy, but I think it will work). I think you want "from::PIG_WRAPPER::ARRAY_ELEM as from"  or "FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", something like that.
* Add a record wrapper to your schema (which may inject more messiness in the pig schema view):
{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"from","type": [{"type":"record", "name":"From", "fields": [[{"type":"array", "items":"string"},"null"]], "null"]},
       …
    ]
}

But that is very awkward — requiring a named record for each field that is an unnamed type.


Ideally PigStorage would treat any union of null and one other thing as a simple pig type with no wrapper, and project the name of a field or record into the name of the thing inside a bag.
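
The "union of null and one other thing" rule described above can be sketched in a few lines of Python. This is only an illustration of the idea on schemas held as parsed JSON; it is not how AvroStorage is implemented, and unwrap_nullable is a hypothetical name.

```python
def unwrap_nullable(schema):
    """If schema is a two-branch union containing "null", return the other branch.

    Any other schema (primitive, record, wider union) is returned unchanged.
    """
    if isinstance(schema, list) and len(schema) == 2 and "null" in schema:
        others = [branch for branch in schema if branch != "null"]
        if len(others) == 1:
            return others[0]
    return schema


# The type of the "from" field in the original Email schema.
field_type = [{"type": "array", "items": "string"}, "null"]
print(unwrap_nullable(field_type))
```

A translator applying this before mapping types would see a plain array for the "from" field and would have no nullable wrapper to expose.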


-Scott

Re: AvroStorage/Avro Schema Question

Posted by Russell Jurney <ru...@gmail.com>.
Hmmmm unable to get this to work:

{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"froms","type": [{"type":"record", "name":"from", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"tos","type": [{"type":"record", "name":"to", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"ccs","type": [{"type":"record", "name":"cc", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"bccs","type": [{"type":"record", "name":"bcc", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"reply_tos","type": [{"type":"record", "name":"reply_to", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"subject", "type": ["string", "null"]},
        {"name":"body", "type": ["string", "null"]},
        {"name":"date", "type": ["string", "null"]}
    ]
}
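
One likely reason a parser rejects this schema: per the Avro specification, every entry in a record's "fields" array must be a JSON object with a "name" and a "type", whereas the entries inside the from/to/cc/bcc/reply_to records here are a bare array schema and the string "null". A minimal Python sketch of that single check, using only the standard json module. find_bad_fields is a hypothetical helper, not part of any Avro library, and it does not validate anything else about the schema.

```python
import json


def find_bad_fields(schema, path="Email"):
    """Collect record field entries that are not objects with both name and type."""
    problems = []
    if isinstance(schema, list):  # union: check every branch
        for branch in schema:
            problems += find_bad_fields(branch, path)
    elif isinstance(schema, dict):
        if schema.get("type") == "record":
            record_name = schema.get("name", path)
            for entry in schema.get("fields", []):
                if isinstance(entry, dict) and "name" in entry and "type" in entry:
                    problems += find_bad_fields(entry["type"], record_name)
                else:
                    problems.append((record_name, entry))
        elif schema.get("type") == "array":
            problems += find_bad_fields(schema.get("items"), path)
    return problems


# One field from the schema above, reduced to the part that matters.
broken = json.loads("""
{"name": "Email", "type": "record", "fields": [
  {"name": "froms", "type": [
    {"type": "record", "name": "from",
     "fields": [{"type": "array", "items": "string"}, "null"]},
    "null"]}
]}
""")
bad = find_bad_fields(broken)
for record_name, entry in bad:
    print(record_name, "->", json.dumps(entry))
```

Both entries of the inner "from" record are reported, which matches the shape of the problem: the array needs to sit inside a named field, or the record needs to become the array's item type.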

Re: AvroStorage/Avro Schema Question

Posted by Russell Jurney <ru...@gmail.com>.
In thinking about it more... it seems that unfortunately, the only thing I
can really do is to change the schema for all email address fields:

{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
to:
{"name":"froms","type": [{"type":"record", "name":"from", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},

That is, to pluralize everything and then individually name array elements.
I will try running this through my stack.


Fwd: AvroStorage/Avro Schema Question

Posted by Russell Jurney <ru...@gmail.com>.
Whoops, sorry to post to user.  Scott Carey explains how to fix my
ARRAY_ELEM problem.

---------- Forwarded message ----------
From: Scott Carey <sc...@apache.org>
Date: Mon, Apr 2, 2012 at 9:13 AM
Subject: Re: AvroStorage/Avro Schema Question
To: user@avro.apache.org


It appears as though the Avro to PigStorage schema translation names (in
pig) all arrays ARRAY_ELEM.  The nullable wrapper is 'visible' and the
field name is not moved onto the bag name.

About a year and a half ago I started
https://issues.apache.org/jira/browse/AVRO-592

but before finishing it AvroStorage was written elsewhere.  I don't recall
exactly what I did with the schema translation there, but I recall the
mapping from an Avro schema to pig tried to hide the nullable wrappers more.


In Avro, arrays are unnamed types, so I see two things you could probably
do without any code changes:

* Add a line in the pig script to project / rename the fields to what you
want (unfortunate and clumsy, but I think it will work). I think you want
"from::PIG_WRAPPER::ARRAY_ELEM as from"  or
"FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", something like that.
* Add a record wrapper to your schema (which may inject more messiness in
the pig schema view):
{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"from","type": [{"type":"record", "name":"From", "fields": [[{"type":"array", "items":"string"},"null"]], "null"]},
       …
    ]
}

But that is very awkward — requiring a named record for each field that is
an unnamed type.


Ideally PigStorage would treat any union of null and one other thing as a
simple pig type with no wrapper, and project the name of a field or record
into the name of the thing inside a bag.


-Scott

Fwd: AvroStorage/Avro Schema Question

Posted by Russell Jurney <ru...@gmail.com>.
I am having trouble with ARRAY_ELEM getting injected into my Pig data when
I store it.  Scott Carey had good insight into how to address the issue.

---------- Forwarded message ----------
From: Scott Carey <sc...@apache.org>
Date: Mon, Apr 2, 2012 at 9:13 AM
Subject: Re: AvroStorage/Avro Schema Question
To: user@avro.apache.org


It appears as though the Avro to PigStorage schema translation names (in
pig) all arrays ARRAY_ELEM.  The nullable wrapper is 'visible' and the
field name is not moved onto the bag name.

About a year and a half ago I started
https://issues.apache.org/jira/browse/AVRO-592

but before finishing it AvroStorage was written elsewhere.  I don't recall
exactly what I did with the schema translation there, but I recall the
mapping from an Avro schema to pig tried to hide the nullable wrappers more.


In Avro, arrays are unnamed types, so I see two things you could probably
do without any code changes:

* Add a line in the pig script to project / rename the fields to what you
want (unfortunate and clumsy, but I think it will work).  I think you want
"from::PIG_WRAPPER::ARRAY_ELEM as from"  or
"FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", something like that.
* Add a record wrapper to your schema (which may inject more messiness in
the pig schema view):
{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"from","type": [{"type":"record", "name":"From", "fields":
[[{"type":"array", "items":"string"},"null"]], "null"]},
       …
    ]
}

But that is very awkward — requiring a named record for each field that is
an unnamed type.
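
The wrapper schema above is abbreviated: Avro record fields each need a "name" and a "type", and the inner field has no name in the sketch. Below is one way it could be spelled out, written as a small Python helper so the repetition per field is visible. The inner field name "addresses" and the helper name wrapped_array_field are assumptions for illustration only, and whether AvroStorage then produces a nicer Pig schema is untested here.

```python
import json


def wrapped_array_field(field_name, record_name, inner_name="addresses"):
    """Build a nullable field whose value is a named record wrapping
    a nullable array of strings. inner_name is a hypothetical choice."""
    nullable_array = [{"type": "array", "items": "string"}, "null"]
    wrapper = {
        "type": "record",
        "name": record_name,
        "fields": [{"name": inner_name, "type": nullable_array}],
    }
    return {"name": field_name, "type": [wrapper, "null"]}


schema = {
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name": "message_id", "type": ["string", "null"]},
        wrapped_array_field("from", "From"),
        wrapped_array_field("to", "To"),
    ],
}

# Serialize to the JSON text that would go in the schema file.
schema_json = json.dumps(schema, indent=4)
print(schema_json)
```

Each wrapped field needs its own record name (From, To, ...), because Avro does not allow the same named type to be defined twice in one schema.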


Ideally AvroStorage would treat any union of null and one other thing as a
simple pig type with no wrapper, and project the name of a field or record
into the name of the thing inside a bag.


-Scott

-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: AvroStorage/Avro Schema Question

Posted by Scott Carey <sc...@apache.org>.
It appears as though the Avro to Pig schema translation in AvroStorage names
(in pig) all array elements ARRAY_ELEM.  The nullable wrapper is 'visible' and
the field name is not moved onto the bag name.

About a year and a half ago I started
https://issues.apache.org/jira/browse/AVRO-592

but before finishing it AvroStorage was written elsewhere.  I don't recall
exactly what I did with the schema translation there, but I recall the
mapping from an Avro schema to pig tried to hide the nullable wrappers more.


In Avro, arrays are unnamed types, so I see two things you could probably do
without any code changes:

* Add a line in the pig script to project / rename the fields to what you
want (unfortunate and clumsy, but I think it will work).  I think you want
"from::PIG_WRAPPER::ARRAY_ELEM as from"  or
"FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", something like that.
* Add a record wrapper to your schema (which may inject more messiness in
the pig schema view):
{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"from","type": [{"type":"record", "name":"From", "fields":
[[{"type":"array", "items":"string"},"null"]], "null"]},
       …
    ]
}

But that is very awkward, requiring a named record for each field that is
an unnamed type.


Ideally AvroStorage would treat any union of null and one other thing as a
simple pig type with no wrapper, and project the name of a field or record
into the name of the thing inside a bag.
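
To make that idea concrete, here is a rough Python sketch of the translation rule, not AvroStorage's actual code. It assumes the Avro schema has been parsed from JSON into Python lists and dicts, and it only handles primitives and arrays; the function names unwrap_nullable and to_pig are invented for this illustration.

```python
# Avro primitive type -> Pig type name.
PRIMITIVES = {
    "string": "chararray",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
    "bytes": "bytearray",
}


def unwrap_nullable(avro_type):
    """If avro_type is a union of "null" and exactly one other type,
    return that other type; otherwise return avro_type unchanged."""
    if isinstance(avro_type, list) and len(avro_type) == 2:
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) == 1:
            return non_null[0]
    return avro_type


def to_pig(name, avro_type):
    """Render a Pig-style schema string for one field, hiding nullable
    wrappers and reusing the field name for the element inside a bag."""
    t = unwrap_nullable(avro_type)
    if isinstance(t, str) and t in PRIMITIVES:
        return f"{name}: {PRIMITIVES[t]}"
    if isinstance(t, dict) and t.get("type") == "array":
        inner = to_pig(name, t["items"])
        return f"{name}: {{({inner})}}"
    raise ValueError(f"type not handled in this sketch: {t!r}")


from_field = to_pig("from", [{"type": "array", "items": "string"}, "null"])
subject_field = to_pig("subject", ["string", "null"])
```

Under this rule the from field would describe as a bag of tuples whose single element is itself named from, with no PIG_WRAPPER or ARRAY_ELEM in sight.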


-Scott
