You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Roger Hoover <ro...@gmail.com> on 2016/02/09 19:39:14 UTC

HTTP-based Elasticsearch system producer and reusable task

Hi Samza folks,

For people who want to use HTTP to integrate with Elasticsearch, I wrote an
HTTP-based system producer and a reusable task, including latency stats
from event origin time, task processing time, and time spent talking to
Elasticsearch API.

https://github.com/quantiply/rico/blob/master/docs/common_tasks/es-push.md

Cheers,

Roger

Re: HTTP-based Elasticsearch system producer and reusable task

Posted by Roger Hoover <ro...@gmail.com>.
Hi Yi,

Please see my comment inline.

On Tue, Feb 9, 2016 at 10:08 PM, Yi Pan <ni...@gmail.com> wrote:

> Hi, Roger,
>
> Got it! I would like to understand more on the SystemProducer API changes
> required by #1 and #2. Could you elaborate a bit more?
>
>
For #1, IndexRequestFactory interface (
https://github.com/apache/samza/blob/master/samza-elasticsearch/src/main/java/org/apache/samza/system/elasticsearch/indexrequest/IndexRequestFactory.java#L32)
returns an object of type (org.elasticsearch.action.index.IndexRequest)
which comes from the elasticsearch jar (
https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/index/IndexRequest.java
).

For #2, the way to support Updates and Deletes would be to have the
"IndexRequestFactory" return an more generic "ActionRequest" which could be
an index, update, or delete action.  (in which case it should be called
ActionRequestFactory).




> Regarding to JDK8 required in the new HTTP-based Elasticsearch producer, I
> want to ask how you are motivated to go w/ JDK8. It does bring a lot more
> nice features. If we deprecate source-level compatibility to JDK7, we can
> benefit from a lot of new features from JDK8, like lambda, stream APIs,
> etc. And refactor Scala code to JDK8 is also much easier.
>
>
I really like that functions are first-class citizens in java 8 and stream
api is quite helpful as well.

For example, first-class functions helped me avoid duplicate code that
would have occurred because JEST doesn't expose a common ancestor type for
each type of builder (index, update, delete).  Passing in functions instead
of an objects with a common ancestor type solved the problem.

https://github.com/quantiply/rico/blob/master/samza-elasticsearch/src/main/java/com/quantiply/elasticsearch/HTTPBulkLoader.java#L237-L272


> Thanks!
>
> -Yi
>
> On Tue, Feb 9, 2016 at 4:19 PM, Roger Hoover <ro...@gmail.com>
> wrote:
>
> > Hi Yi,
> >
> > It could be merged into the Samza project if there's enough interest but
> > may need some re-working depending on which dependencies are ok to bring
> > in.  I did it outside of the Samza project first because I had to get it
> > done quickly so it relies on Java 8 features, dropwizard metrics for
> > histogram metrics, and JEST (https://github.com/searchbox-io/Jest) which
> > itself drags in more dependencies (Guava, Gson, commons http).
> >
> > There are few issues with the existing ElasticsearchSystemProducer:
> >
> >    1. The plugin API (IndexRequestFactory) is tied to the Elasticsearch
> >    Java API (a bulky dependency)
> >    2. It only supports index requests.  I needed to also support updates
> >    and deletes.
> >    3. There currently no plugin mechanism to register a flush listener.
> >    The reason I needed that was to be able to report end to end latency
> > stats
> >    (total pipeline latency = commit time - event time).
> >
> > #3 is easily solvable with a additional plugin options. #1 and #2 require
> > changing the system producer API.
> >
> > Roger
> >
> > On Tue, Feb 9, 2016 at 10:56 AM, Yi Pan <ni...@gmail.com> wrote:
> >
> > > Hi, Roger,
> > >
> > > That's awesome! Are you planning to submit the HTTP-based system
> producer
> > > in Samza open-source samza-elasticsearch module? If ElasticSearch
> > community
> > > suggest that HTTP-based clients be the recommended way, we should use
> it
> > in
> > > samza-elasticsearch as well. And what's your opinion on the existing
> > > ElasticsearchSystemProducer? If the SystemProducer APIs and configure
> > > options do not change, I would vote to replace the implementation w/
> > > HTTP-based ElasticsearchSystemProducer.
> > >
> > > Thanks for putting this new additions up!
> > >
> > > -Yi
> > >
> > > On Tue, Feb 9, 2016 at 10:39 AM, Roger Hoover <ro...@gmail.com>
> > > wrote:
> > >
> > > > Hi Samza folks,
> > > >
> > > > For people who want to use HTTP to integrate with Elasticsearch, I
> > wrote
> > > an
> > > > HTTP-based system producer and a reusable task, including latency
> stats
> > > > from event origin time, task processing time, and time spent talking
> to
> > > > Elasticsearch API.
> > > >
> > > >
> > >
> >
> https://github.com/quantiply/rico/blob/master/docs/common_tasks/es-push.md
> > > >
> > > > Cheers,
> > > >
> > > > Roger
> > > >
> > >
> >
>

Re: HTTP-based Elasticsearch system producer and reusable task

Posted by Yi Pan <ni...@gmail.com>.
Hi, Roger,

Got it! I would like to understand more on the SystemProducer API changes
required by #1 and #2. Could you elaborate a bit more?

Regarding to JDK8 required in the new HTTP-based Elasticsearch producer, I
want to ask how you are motivated to go w/ JDK8. It does bring a lot more
nice features. If we deprecate source-level compatibility to JDK7, we can
benefit from a lot of new features from JDK8, like lambda, stream APIs,
etc. And refactor Scala code to JDK8 is also much easier.

Thanks!

-Yi

On Tue, Feb 9, 2016 at 4:19 PM, Roger Hoover <ro...@gmail.com> wrote:

> Hi Yi,
>
> It could be merged into the Samza project if there's enough interest but
> may need some re-working depending on which dependencies are ok to bring
> in.  I did it outside of the Samza project first because I had to get it
> done quickly so it relies on Java 8 features, dropwizard metrics for
> histogram metrics, and JEST (https://github.com/searchbox-io/Jest) which
> itself drags in more dependencies (Guava, Gson, commons http).
>
> There are few issues with the existing ElasticsearchSystemProducer:
>
>    1. The plugin API (IndexRequestFactory) is tied to the Elasticsearch
>    Java API (a bulky dependency)
>    2. It only supports index requests.  I needed to also support updates
>    and deletes.
>    3. There currently no plugin mechanism to register a flush listener.
>    The reason I needed that was to be able to report end to end latency
> stats
>    (total pipeline latency = commit time - event time).
>
> #3 is easily solvable with a additional plugin options. #1 and #2 require
> changing the system producer API.
>
> Roger
>
> On Tue, Feb 9, 2016 at 10:56 AM, Yi Pan <ni...@gmail.com> wrote:
>
> > Hi, Roger,
> >
> > That's awesome! Are you planning to submit the HTTP-based system producer
> > in Samza open-source samza-elasticsearch module? If ElasticSearch
> community
> > suggest that HTTP-based clients be the recommended way, we should use it
> in
> > samza-elasticsearch as well. And what's your opinion on the existing
> > ElasticsearchSystemProducer? If the SystemProducer APIs and configure
> > options do not change, I would vote to replace the implementation w/
> > HTTP-based ElasticsearchSystemProducer.
> >
> > Thanks for putting this new additions up!
> >
> > -Yi
> >
> > On Tue, Feb 9, 2016 at 10:39 AM, Roger Hoover <ro...@gmail.com>
> > wrote:
> >
> > > Hi Samza folks,
> > >
> > > For people who want to use HTTP to integrate with Elasticsearch, I
> wrote
> > an
> > > HTTP-based system producer and a reusable task, including latency stats
> > > from event origin time, task processing time, and time spent talking to
> > > Elasticsearch API.
> > >
> > >
> >
> https://github.com/quantiply/rico/blob/master/docs/common_tasks/es-push.md
> > >
> > > Cheers,
> > >
> > > Roger
> > >
> >
>

Re: HTTP-based Elasticsearch system producer and reusable task

Posted by Roger Hoover <ro...@gmail.com>.
Hi Yi,

It could be merged into the Samza project if there's enough interest but
may need some re-working depending on which dependencies are ok to bring
in.  I did it outside of the Samza project first because I had to get it
done quickly so it relies on Java 8 features, dropwizard metrics for
histogram metrics, and JEST (https://github.com/searchbox-io/Jest) which
itself drags in more dependencies (Guava, Gson, commons http).

There are few issues with the existing ElasticsearchSystemProducer:

   1. The plugin API (IndexRequestFactory) is tied to the Elasticsearch
   Java API (a bulky dependency)
   2. It only supports index requests.  I needed to also support updates
   and deletes.
   3. There currently no plugin mechanism to register a flush listener.
   The reason I needed that was to be able to report end to end latency stats
   (total pipeline latency = commit time - event time).

#3 is easily solvable with a additional plugin options. #1 and #2 require
changing the system producer API.

Roger

On Tue, Feb 9, 2016 at 10:56 AM, Yi Pan <ni...@gmail.com> wrote:

> Hi, Roger,
>
> That's awesome! Are you planning to submit the HTTP-based system producer
> in Samza open-source samza-elasticsearch module? If ElasticSearch community
> suggest that HTTP-based clients be the recommended way, we should use it in
> samza-elasticsearch as well. And what's your opinion on the existing
> ElasticsearchSystemProducer? If the SystemProducer APIs and configure
> options do not change, I would vote to replace the implementation w/
> HTTP-based ElasticsearchSystemProducer.
>
> Thanks for putting this new additions up!
>
> -Yi
>
> On Tue, Feb 9, 2016 at 10:39 AM, Roger Hoover <ro...@gmail.com>
> wrote:
>
> > Hi Samza folks,
> >
> > For people who want to use HTTP to integrate with Elasticsearch, I wrote
> an
> > HTTP-based system producer and a reusable task, including latency stats
> > from event origin time, task processing time, and time spent talking to
> > Elasticsearch API.
> >
> >
> https://github.com/quantiply/rico/blob/master/docs/common_tasks/es-push.md
> >
> > Cheers,
> >
> > Roger
> >
>

Re: HTTP-based Elasticsearch system producer and reusable task

Posted by Yi Pan <ni...@gmail.com>.
Hi, Roger,

That's awesome! Are you planning to submit the HTTP-based system producer
in Samza open-source samza-elasticsearch module? If ElasticSearch community
suggest that HTTP-based clients be the recommended way, we should use it in
samza-elasticsearch as well. And what's your opinion on the existing
ElasticsearchSystemProducer? If the SystemProducer APIs and configure
options do not change, I would vote to replace the implementation w/
HTTP-based ElasticsearchSystemProducer.

Thanks for putting this new additions up!

-Yi

On Tue, Feb 9, 2016 at 10:39 AM, Roger Hoover <ro...@gmail.com>
wrote:

> Hi Samza folks,
>
> For people who want to use HTTP to integrate with Elasticsearch, I wrote an
> HTTP-based system producer and a reusable task, including latency stats
> from event origin time, task processing time, and time spent talking to
> Elasticsearch API.
>
> https://github.com/quantiply/rico/blob/master/docs/common_tasks/es-push.md
>
> Cheers,
>
> Roger
>