Posted to commits@samza.apache.org by "Dan Harvey (JIRA)" <ji...@apache.org> on 2015/07/10 16:47:05 UTC
[jira] [Updated] (SAMZA-654) Add Elasticsearch Producer

     [ https://issues.apache.org/jira/browse/SAMZA-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Harvey updated SAMZA-654:
-----------------------------
    Attachment: ElasticsearchSystemProducer.patch

Sorry for the delay, I've rebased this against the latest master. I made *createCoordinatorStream* throw *UnsupportedOperationException*, as with the other methods that are not supported via Elasticsearch.

I think this is all ready to be merged now.

[~theduderog] As this ticket is all reviewed now, I think it would be best to get this in and then create a new ticket to add the metrics to it?

> Add Elasticsearch Producer
> --------------------------
>
>                 Key: SAMZA-654
>                 URL: https://issues.apache.org/jira/browse/SAMZA-654
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Dan Harvey
>            Assignee: Dan Harvey
>            Priority: Minor
>         Attachments: 0001-SAMZA-654-Added-ElasticsearchSystemProducer-and-Fact.patch, ElasticsearchSystemProducer.patch, ElasticsearchSystemProducer.patch, ElasticsearchSystemProducer.patch, ElasticsearchSystemProducer.patch
>
>
> Elasticsearch is likely to be a common output datastore for Samza so it would be good to have a Producer (and possibly Consumer) as part of the core project.
> Elasticsearch organises data into indexes, which can contain multiple types of documents. Each document has an id by which it is identified and a source, which is its actual data. These map well onto concepts in Samza.
> Elasticsearch also has mappings, which define how it indexes the documents pushed to it. I don't think the Samza System should be concerned with this.
> (index, type) -> stream
> id -> key
> source -> message
> The main one needing to be agreed upon is how the index and type are defined as a stream. We've started by simply joining them with a /, as they would be in the Elasticsearch REST API, and using that as the stream name.
> The Java Elasticsearch client can deal with the source being a variety of types that can be represented as JSON (Object, Map, byte[]). We could just pass objects to the Producer or use the Samza JSON serde to handle that (or maybe both). We're currently passing through the message object and assuming the client can deal with it.
> Elasticsearch also has the ability to bulk index documents, so combining this with correctly flushing the Producer can get good performance.
> Finally, we'd need to think about how this can be configured. The Elasticsearch Java client has two different transports, each with its own configuration. Currently we are only using the TransportClient. We should probably make the transport configurable, though maybe not initially.
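
For illustration only, the (index, type) -> stream naming described above could be sketched roughly as follows. This is a minimal sketch, not code from the attached patch: the class name ElasticsearchStreamName and its methods are hypothetical, and it assumes that neither the index nor the type contains a / itself.

```java
// Hypothetical helper, NOT taken from the attached patch: it only illustrates
// treating a Samza stream name as "<index>/<type>".
final class ElasticsearchStreamName {
    private final String index;
    private final String type;

    private ElasticsearchStreamName(String index, String type) {
        this.index = index;
        this.type = type;
    }

    // Splits a stream name of the form "index/type" into its two parts.
    // Assumption: exactly one '/' separator and two non-empty parts.
    static ElasticsearchStreamName parse(String streamName) {
        if (streamName == null) {
            throw new IllegalArgumentException("stream name must not be null");
        }
        // A limit of -1 keeps trailing empty strings, so "index/" is rejected below.
        String[] parts = streamName.split("/", -1);
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException(
                "expected a stream name of the form index/type, got: " + streamName);
        }
        return new ElasticsearchStreamName(parts[0], parts[1]);
    }

    String index() {
        return index;
    }

    String type() {
        return type;
    }

    // Joins the parts back into the stream name form.
    String toStreamName() {
        return index + "/" + type;
    }
}
```

With this sketch, ElasticsearchStreamName.parse("logs/event") would yield the index "logs" and the type "event", while a name without a separator is rejected rather than guessed at.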



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)