Posted to user@pig.apache.org by Matthew Smith <Ma...@g2-inc.com> on 2010/06/25 20:20:02 UTC

Writing to excel files from Pig

Title really says it all. I'm looking to run a job that takes the output
of a Pig script and writes it to an Excel file for further analysis.
Can somebody point me to a past thread, or tell me what commands would
do this?

 

Thanks,

Matt


Re: Writing to excel files from Pig

Posted by Russell Jurney <ru...@gmail.com>.
Unless it is going to be used as a UDF <http://wiki.apache.org/pig/UDFManual>
by analysts running Pig themselves, I would not bother to do it in Java.
This is a solved problem: stream to Python and use matplotlib
<http://matplotlib.sourceforge.net/>, or write out JSON to be read by
something like protovis <http://vis.stanford.edu/protovis/>.
You could also script Excel with VBScript to create charts from a TSV.  R
is a good option for batch charts - you can generate them offline if you
don't need too many, but if you go that route do yourself a favor and get R
in a Nutshell <http://oreilly.com/catalog/9780596801717> first.  Without that
book I was helpless in R despite lots of effort.  Also check out:
http://stackoverflow.com/questions/2196985/information-dashboards-in-r-with-ggplot2
There was a series of posts about R dashboards being superior to anything
out there for the web, but I can't find them.

Russ
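
A minimal sketch of the "out to JSON" step mentioned above, in Python. It
assumes the Pig output is tab-delimited text with one record per line
(PigStorage's default) and that the column names are supplied by hand.
rows_to_json and the sample column names are hypothetical; they are not part
of Pig or protovis.

```python
import json


def rows_to_json(lines, columns):
    """Turn tab-delimited lines into a JSON array of objects keyed by column name."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        values = line.split("\t")
        # Pair each value with its (assumed) column name.
        records.append(dict(zip(columns, values)))
    return json.dumps(records, indent=2)
```

In a streaming or post-processing script this could be called as
rows_to_json(sys.stdin, ["word", "count"]). Every value comes out as a
string, so numeric columns would need converting before charting.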


RE: Writing to excel files from Pig

Posted by Matthew Smith <Ma...@g2-inc.com>.
I don't know yet whether I want it to be a usable function or a cron
job; I still haven't worked through that use case 100%. If I get the leeway
from the higher-ups to do this, I'll definitely commit it.


Re: Writing to excel files from Pig

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
You can use Pig streaming to dump the TSV into R, read in binary output, and
store it.
Or use Azkaban / Oozie to kick off R locally for the visualization once a
Pig job completes. That's simpler.
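
As a rough illustration of the "run R once the Pig job completes" idea, here
is a Python sketch that runs commands in sequence and stops at the first
failure. It is a stand-in for what a scheduler such as Azkaban or Oozie would
do, not their API. run_in_order and the script names report.pig and
make_charts.R are hypothetical.

```python
import subprocess


def run_in_order(commands, runner=subprocess.run):
    """Run each command in turn; stop at the first non-zero exit code.

    Returns the number of commands that completed successfully.
    """
    completed = 0
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            break  # do not start later steps after a failure
        completed += 1
    return completed


# Hypothetical script names; substitute your own.
PIPELINE = [
    ["pig", "-f", "report.pig"],
    ["Rscript", "make_charts.R"],
]
```

Calling run_in_order(PIPELINE) from cron would start the R step only if the
Pig step exits cleanly.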


Re: Writing to excel files from Pig

Posted by Russell Jurney <ru...@gmail.com>.
Yeah, if they're using it regularly, I'd probably want analysts using a UDF
in Java.  I'd probably hack up a streaming job first to get the
formatting/charts right, though ;)

If you do this, please commit to piggybank, or just post it here and I'll do
it for you :)

I've actually used R to do this in the past. You can run it on TSV output,
but that was a scheduled job, not a regularly used function.

Russ


RE: Writing to excel files from Pig

Posted by Matthew Smith <Ma...@g2-inc.com>.
That's what I'm working on, Russell :D. I want the output of the script
to come in a nice package for an analyst to make quick decisions. Pretty
pictures always help. Thank you for the info; I think I'm well on my
way.

Matt


Re: Writing to excel files from Pig

Posted by Russell Jurney <ru...@gmail.com>.
PigStorage is TSV by default, which will open directly in Excel.  A STORE
with no USING clause will do that.  Dmitriy has a UDF in PiggyBank that
adds column names, called SchemaAwarePigLoader, if you need that:
http://wiki.apache.org/pig/PiggyBank  If you need real Excel files, use
streaming (http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#STREAM)
and something like http://search.cpan.org/dist/Spreadsheet-WriteExcel/ and Perl.

A storefunc that used a Java Excel lib and pregenerated Excel with summary
charts would be cool :)

Russell Jurney
russell.jurney@gmail.com
(404) 317-3620
http://twitter.com/rjurney
http://linkedin.com/in/russelljurney
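
One way to get column names into the TSV that Excel opens, sketched in
Python. This is a post-processing substitute for a schema-aware UDF, not the
PiggyBank code. It assumes the job's output directory has already been copied
to the local filesystem and contains part-* files. merge_parts_with_header
and any paths passed to it are hypothetical.

```python
import glob
import os


def merge_parts_with_header(output_dir, columns, dest_path):
    """Concatenate Pig part files into one TSV file with a header row.

    Returns the number of data rows written (the header is not counted).
    """
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    rows_written = 0
    with open(dest_path, "w", encoding="utf-8", newline="") as dest:
        dest.write("\t".join(columns) + "\n")
        for part_path in part_paths:
            with open(part_path, encoding="utf-8", newline="") as part:
                for line in part:
                    # Make sure the last line of each part ends cleanly.
                    if not line.endswith("\n"):
                        line += "\n"
                    dest.write(line)
                    rows_written += 1
    return rows_written
```

The column names have to match the order of the fields in the STORE'd
relation, since the part files themselves carry no schema.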


Re: Writing to excel files from Pig

Posted by Mark Stetzer <st...@gmail.com>.
It's not exactly Excel, but I've written a custom StoreFunc that
outputs CSV.  It was all of about 10 lines of code (at least with Pig
< 0.7).  That's the approach I've taken in the past.

In theory you could use something like POI in a similar setup if you
were really dead set on outputting Excel.  I like being able to view
my job output with less though.

-Mark
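
For comparison, a Python sketch of the TSV-to-CSV conversion done outside
Pig. This is not the StoreFunc described above (that would be Java); it
assumes tab-delimited input lines and uses the standard csv module so that
fields containing commas or quotes are escaped. tsv_to_csv is a hypothetical
name.

```python
import csv
import io


def tsv_to_csv(lines):
    """Convert tab-delimited lines to CSV text, quoting fields where needed."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        # Split on tabs only; commas inside a field are left to the writer to quote.
        writer.writerow(line.rstrip("\n").split("\t"))
    return out.getvalue()
```

The result is still plain text, so it can be paged with less and also opened
in Excel.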
