
Posted to dev@crunch.apache.org by Christian Tzolov <ch...@gmail.com> on 2013/04/08 05:32:15 UTC

Crunch integration with ElasticSearch

I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
 integration over the weekend :)

Here is my first prototype:
https://github.com/tzolov/elasticsearch-hadoop#crunch and a sample
application: http://bit.ly/Y7lasW.

It implements ES Source and Target on top of the ES-Hadoop's (
https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat and
ESOutputFormat.

I'm not sure, though, what the best/right way is to build Sources/Targets for
new Input/Output Formats. Any suggestions or references?

The write to ES is tricky and at the moment looks more like a hack (see the
doc).

Cheers
Chris

P.S. The prototype doesn't support AvroTypeFamily yet, but I've been looking
at a jackson-dataformat-avro kind of solution (ES-Hadoop relies on Jackson
for the JSON serialisation).

Re: Crunch integration with ElasticSearch

Posted by Christian Tzolov <ch...@gmail.com>.

+1 for releasing 0.6.0


On Mon, Apr 8, 2013 at 5:48 PM, Matthias Friedrich <ma...@mafr.de> wrote:

> On Monday, 2013-04-08, Josh Wills wrote:
> > On Mon, Apr 8, 2013 at 2:58 AM, Christian Tzolov <christian.tzolov@gmail.com> wrote:
>
> >> Shall we deploy the 0.6.0-SNAPSHOT in some public snapshot repo? The
> >> https://repository.apache.org/content/groups/snapshots/org/apache/crunch/ is
> >> empty. Perhaps we can deploy the latest Jenkins builds into this
> >> snapshot repo? Unless there is some policy against it?
>
> > I just think it means it's time to cut the 0.6.0 release. I would have
> > liked to get CRUNCH-165 in as well, but I don't think it's been tested
> > enough.
>
> We can deploy to Apache's snapshot repo if we really want to, but
> we're not allowed to publish links to it because dev snapshots are
> no official Apache releases.
>
> In any case, cutting 0.6.0 sounds like a good idea because it means
> we can finish cleaning up at the incubator.
>
> Regards,
>   Matthias
>

Re: Crunch integration with ElasticSearch

Posted by Matthias Friedrich <ma...@mafr.de>.

On Monday, 2013-04-08, Josh Wills wrote:
> On Mon, Apr 8, 2013 at 2:58 AM, Christian Tzolov <christian.tzolov@gmail.com> wrote:

>> Shall we deploy the 0.6.0-SNAPSHOT in some public snapshot repo? The
>> https://repository.apache.org/content/groups/snapshots/org/apache/crunch/ is
>> empty. Perhaps we can deploy the latest Jenkins builds into this
>> snapshot repo? Unless there is some policy against it?

> I just think it means it's time to cut the 0.6.0 release. I would have
> liked to get CRUNCH-165 in as well, but I don't think it's been tested
> enough.

We can deploy to Apache's snapshot repo if we really want to, but
we're not allowed to publish links to it because dev snapshots are
not official Apache releases.

In any case, cutting 0.6.0 sounds like a good idea because it means
we can finish cleaning up at the incubator.

Regards,
  Matthias

Re: Crunch integration with ElasticSearch

Posted by Josh Wills <jw...@cloudera.com>.

On Mon, Apr 8, 2013 at 2:58 AM, Christian Tzolov <christian.tzolov@gmail.com> wrote:

> Hey Josh,
>
> Thanks for the tips!
>
> I followed the HBaseSource.java for implementing the ESSource and copied
> the inputId handling approach:
>
> https://github.com/tzolov/elasticsearch-hadoop/blob/master/src/main/java/org/elasticsearch/hadoop/crunch/ESSource.java
>
> I don't completely understand the implication of the dummy Path parameter.
> In this context is the Path needed only for input equality check?
>
> The ESTarget is more tricky. I was not sure what to do with the keyClass
> parameter in the CrunchOutputs.addNamedOutput() so I've set it to String.
> The ES-Hadoop uses Jackson for JSON serializations and it fails when trying
> to serialize internal Crunch Writable types. I guess because they are not
> public. Storing internal Crunch Writable types in ES doesn't make much
> sense anyway. The current implementation expects a custom (Writable) class
> to define the JSON format. Perhaps with Avro we can try to reuse the Avro
> schema.
>
> Here is the ES-Hadoop ticket for adding Crunch to the ES-Hadoop project:
> https://github.com/elasticsearch/elasticsearch-hadoop/issues/20
>
> Shall we deploy the 0.6.0-SNAPSHOT in some public snapshot repo? The
> https://repository.apache.org/content/groups/snapshots/org/apache/crunch/ is
> empty. Perhaps we can deploy the latest Jenkins builds into this
> snapshot repo? Unless there is some policy against it?
>

I just think it means it's time to cut the 0.6.0 release. I would have
liked to get CRUNCH-165 in as well, but I don't think it's been tested
enough.


> Cheers,
> Chris
>
>
>
>
>
>
>
>
> On Mon, Apr 8, 2013 at 7:18 AM, Josh Wills <jw...@cloudera.com> wrote:
>
> > Hey Christian,
> >
> > Supe-cool. Replies inlined.
> >
> > On Sun, Apr 7, 2013 at 8:32 PM, Christian Tzolov <christian.tzolov@gmail.com> wrote:
> >
> > > I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
> > > integration over the weekend :)
> > >
> > > Here is my first prototype:
> > > https://github.com/tzolov/elasticsearch-hadoop#crunch and a sample
> > > application: http://bit.ly/Y7lasW.
> > >
> > > It implements ES Source and Target on top of the ES-Hadoop's (
> > > https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat and
> > > ESOutputFormat.
> > >
> > > Not sure though what is the best/right way to build Source/Targets for new
> > > Input/Output Formats? Any suggestions, references?
> > >
> >
> > I built a Source for HCatalog last week as part of ML:
> >
> >
> >
> > https://github.com/cloudera/ml/blob/master/hcatalog/src/main/java/com/cloudera/science/ml/hcatalog/HCatalogSource.java
> >
> > The interesting bit is really in the configureSource method: if the inputId
> > is < 0, then it's a single-input MapReduce job, and you can essentially
> > configure the input just as you would for a regular MapReduce. If the
> > inputId >= 0, then it's a multi-input job (e.g., for a join), and you have
> > to use CrunchInputs w/a FormatBundle object. The FormatBundle wraps an
> > InputFormat or an OutputFormat w/any Configuration settings that the
> > InputFormat/OutputFormat needs. This way, you can have multiple inputs that
> > use the same InputFormat, but have different configuration settings (e.g.,
> > when you're joining multiple Avro files together and they each need to have
> > their own schema specified.)
> >
> >
> >
> > > The write to ES is tricky and at the moment looks more like a hack (see the
> > > doc).
> > >
> > > Cheers
> > > Chris
> > >
> > > (P.S The prototype doesn't support AvroTypeFamily yet but I've been looking
> > > at jackson-dataformat-avro kind of solution (ES-Hadoop relies on Jackson
> > > for the JSON serialisation)
> > >
> >
> > I'd like to work on this as well-- I'll take a look tomorrow and try to put
> > together a pull req for anything that I think should be configured
> > differently.
> >
> > J
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Crunch integration with ElasticSearch

Posted by Christian Tzolov <ch...@gmail.com>.

Hey Josh,

Thanks for the tips!

I followed the HBaseSource.java for implementing the ESSource and copied
the inputId handling approach:
https://github.com/tzolov/elasticsearch-hadoop/blob/master/src/main/java/org/elasticsearch/hadoop/crunch/ESSource.java

I don't completely understand the implications of the dummy Path parameter.
In this context, is the Path needed only for the input equality check?

The ESTarget is trickier. I was not sure what to do with the keyClass
parameter in CrunchOutputs.addNamedOutput(), so I've set it to String.
ES-Hadoop uses Jackson for JSON serialization, and it fails when trying
to serialize internal Crunch Writable types, I guess because they are not
public. Storing internal Crunch Writable types in ES doesn't make much
sense anyway. The current implementation expects a custom (Writable) class
to define the JSON format. Perhaps with Avro we can try to reuse the Avro
schema.
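
To make the "custom (Writable) class defines the JSON format" idea concrete, here is a minimal sketch. It has no Hadoop or Jackson dependency: WritableLike is a local stand-in that only mirrors the two methods of Hadoop's Writable interface, and UserRecord with its fields is a hypothetical example, not a class from the prototype. The assumption is that a public class with public getters is what a bean-style JSON serializer such as Jackson can inspect by default; how ES-Hadoop actually maps the object is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Dependency-free sketch; all names below are hypothetical illustrations.
class EsDocumentSketch {

    /** Local stand-in mirroring the two methods of Hadoop's Writable interface. */
    interface WritableLike {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Public bean: the getters name the properties a bean-style serializer would see. */
    public static class UserRecord implements WritableLike {
        private String user = "";
        private long visits;

        public UserRecord() { }  // no-arg constructor, needed to deserialize via readFields

        public UserRecord(String user, long visits) {
            this.user = user;
            this.visits = visits;
        }

        public String getUser() { return user; }
        public long getVisits() { return visits; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(user);
            out.writeLong(visits);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in the same order they were written.
            user = in.readUTF();
            visits = in.readLong();
        }
    }

    /** Serializes a record to bytes and reads it back into a fresh instance. */
    static UserRecord roundTrip(UserRecord record) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            record.write(new DataOutputStream(bytes));
            UserRecord copy = new UserRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        UserRecord copy = roundTrip(new UserRecord("chris", 3L));
        System.out.println(copy.getUser() + " " + copy.getVisits());
    }
}
```

The point of the sketch is only the shape: the binary Writable-style methods and the public getters live on the same user-defined class, so nothing internal to Crunch needs to be serialized to JSON.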

Here is the ES-Hadoop ticket for adding Crunch to the ES-Hadoop project:
https://github.com/elasticsearch/elasticsearch-hadoop/issues/20

Shall we deploy the 0.6.0-SNAPSHOT in some public snapshot repo? The
https://repository.apache.org/content/groups/snapshots/org/apache/crunch/ is
empty. Perhaps we can deploy the latest Jenkins builds into this
snapshot repo? Unless there is some policy against it?

Cheers,
Chris








On Mon, Apr 8, 2013 at 7:18 AM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Christian,
>
> Supe-cool. Replies inlined.
>
> On Sun, Apr 7, 2013 at 8:32 PM, Christian Tzolov <christian.tzolov@gmail.com> wrote:
>
> > I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
> > integration over the weekend :)
> >
> > Here is my first prototype:
> > https://github.com/tzolov/elasticsearch-hadoop#crunch and a sample
> > application: http://bit.ly/Y7lasW.
> >
> > It implements ES Source and Target on top of the ES-Hadoop's (
> > https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat and
> > ESOutputFormat.
> >
> > Not sure though what is the best/right way to build Source/Targets for new
> > Input/Output Formats? Any suggestions, references?
> >
>
> I built a Source for HCatalog last week as part of ML:
>
>
> https://github.com/cloudera/ml/blob/master/hcatalog/src/main/java/com/cloudera/science/ml/hcatalog/HCatalogSource.java
>
> The interesting bit is really in the configureSource method: if the inputId
> is < 0, then it's a single-input MapReduce job, and you can essentially
> configure the input just as you would for a regular MapReduce. If the
> inputId >= 0, then it's a multi-input job (e.g., for a join), and you have
> to use CrunchInputs w/a FormatBundle object. The FormatBundle wraps an
> InputFormat or an OutputFormat w/any Configuration settings that the
> InputFormat/OutputFormat needs. This way, you can have multiple inputs that
> use the same InputFormat, but have different configuration settings (e.g.,
> when you're joining multiple Avro files together and they each need to have
> their own schema specified.)
>
>
>
> > The write to ES is tricky and at the moment looks more like a hack (see the
> > doc).
> >
> > Cheers
> > Chris
> >
> > (P.S The prototype doesn't support AvroTypeFamily yet but I've been looking
> > at jackson-dataformat-avro kind of solution (ES-Hadoop relies on Jackson
> > for the JSON serialisation)
> >
>
> I'd like to work on this as well-- I'll take a look tomorrow and try to put
> together a pull req for anything that I think should be configured
> differently.
>
> J
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

Re: Crunch integration with ElasticSearch

Posted by Josh Wills <jw...@cloudera.com>.

Hey Christian,

Super-cool. Replies inlined.

On Sun, Apr 7, 2013 at 8:32 PM, Christian Tzolov <christian.tzolov@gmail.com> wrote:

> I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
> integration over the weekend :)
>
> Here is my first prototype:
> https://github.com/tzolov/elasticsearch-hadoop#crunch and a sample
> application: http://bit.ly/Y7lasW.
>
> It implements ES Source and Target on top of the ES-Hadoop's (
> https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat and
> ESOutputFormat.
>
> Not sure though what is the best/right way to build Source/Targets for new
> Input/Output Formats? Any suggestions, references?
>

I built a Source for HCatalog last week as part of ML:

https://github.com/cloudera/ml/blob/master/hcatalog/src/main/java/com/cloudera/science/ml/hcatalog/HCatalogSource.java

The interesting bit is really in the configureSource method: if the inputId
is < 0, then it's a single-input MapReduce job, and you can essentially
configure the input just as you would for a regular MapReduce job. If the
inputId is >= 0, then it's a multi-input job (e.g., for a join), and you have
to use CrunchInputs with a FormatBundle object. The FormatBundle wraps an
InputFormat or an OutputFormat with any Configuration settings that the
InputFormat/OutputFormat needs. This way, you can have multiple inputs that
use the same InputFormat but have different configuration settings (e.g.,
when you're joining multiple Avro files together and each needs to have
its own schema specified).
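
The inputId branching can be sketched as a small, dependency-free model. Everything below is a hypothetical stand-in written for illustration: BundleModel plays the role of the FormatBundle idea, JobModel the role of a job's configuration, and the format label and "schema" key are made up. The real CrunchInputs and FormatBundle signatures are not reproduced here, so treat this as a picture of the control flow rather than of the Crunch API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Dependency-free model of the inputId branching. Every type here is a
// hypothetical stand-in; none of them is a Crunch or Hadoop class.
class InputIdSketch {

    /** Stand-in for the FormatBundle idea: a format name plus its own settings. */
    static class BundleModel {
        final String format;
        final Map<String, String> settings = new LinkedHashMap<>();

        BundleModel(String format) { this.format = format; }

        BundleModel set(String key, String value) {
            settings.put(key, value);
            return this;
        }
    }

    /** Stand-in for a job: job-wide settings plus per-input bundles. */
    static class JobModel {
        String inputFormat;
        final Map<String, String> conf = new LinkedHashMap<>();
        final Map<Integer, BundleModel> perInput = new LinkedHashMap<>();
    }

    /** Stand-in for a Source that knows how to configure itself on a job. */
    static class SourceModel {
        private final BundleModel bundle;

        SourceModel(BundleModel bundle) { this.bundle = bundle; }

        void configureSource(JobModel job, int inputId) {
            if (inputId < 0) {
                // Single input: write the format and its settings straight onto the job.
                job.inputFormat = bundle.format;
                job.conf.putAll(bundle.settings);
            } else {
                // Multiple inputs: keep the settings scoped to this input's bundle.
                job.perInput.put(inputId, bundle);
            }
        }
    }

    public static void main(String[] args) {
        JobModel join = new JobModel();
        new SourceModel(new BundleModel("AvroInputFormat").set("schema", "A")).configureSource(join, 0);
        new SourceModel(new BundleModel("AvroInputFormat").set("schema", "B")).configureSource(join, 1);
        // Same format twice, yet each input keeps its own "schema" setting.
        System.out.println(join.perInput.get(0).settings + " " + join.perInput.get(1).settings);
    }
}
```

Keeping the settings inside the per-input bundle is what prevents the second input's "schema" from overwriting the first one, which is what would happen if both were written onto the shared job configuration.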

> The write to ES is tricky and at the moment looks more like a hack (see the
> doc).
>
> Cheers
> Chris
>
> (P.S The prototype doesn't support AvroTypeFamily yet but I've been looking
> at jackson-dataformat-avro kind of solution (ES-Hadoop relies on Jackson
> for the JSON serialisation)
>

I'd like to work on this as well-- I'll take a look tomorrow and try to put
together a pull req for anything that I think should be configured
differently.

J

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>