Posted to dev@flume.apache.org by "Rotem Hermon (JIRA)" <ji...@apache.org> on 2015/02/09 09:22:35 UTC

[jira] [Commented] (FLUME-2390) Flume-ElasticSearch Data gets posted multiple times when one of the event fail validation at elastic search sink for JSON Data

    [ https://issues.apache.org/jira/browse/FLUME-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311919#comment-14311919 ] 

Rotem Hermon commented on FLUME-2390:
-------------------------------------

Data gets posted multiple times because, when Flume retries a batch, all of the documents in it are indexed again in Elasticsearch. Since no document ID is provided for the documents, Elasticsearch treats them as new documents and stores them again as duplicates.
The correct solution for this specific problem is to generate the document _id for each indexed document, so that indexing it again overwrites the same document instead of creating a new one.
We use this approach in our extended serializer; see https://github.com/gigya/flume-ng-elasticsearch-ser-ex#generating-document-ids-for-events
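
A minimal sketch of the idea, not the serializer's actual implementation: it assumes the _id is derived from a SHA-1 hash of the event body, and it uses an in-memory dictionary as a stand-in for the index. The names document_id and FakeIndex are hypothetical and exist only for illustration.

```python
import hashlib


def document_id(body: bytes) -> str:
    """Derive a stable _id from the event body (hypothetical helper).

    The same bytes always hash to the same ID, so a retried event
    maps onto the document that was already indexed.
    """
    return hashlib.sha1(body).hexdigest()


class FakeIndex:
    """In-memory stand-in for an index: indexing an existing _id overwrites it."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id: str, source: bytes) -> None:
        self.docs[doc_id] = source


index = FakeIndex()
batch = [b'{"msg": "a"}', b'{"msg": "b"}']

# Simulate the sink delivering the same batch three times.
for _attempt in range(3):
    for body in batch:
        index.index(document_id(body), body)

# The document count stays at the batch size despite the retries.
print(len(index.docs))
```

One consequence of hashing only the body is that two genuinely distinct events with identical content would collapse into one document, so a real ID would probably also include a timestamp or another distinguishing header.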

> Flume-ElasticSearch Data gets posted multiple times when one of the event fail validation at elastic search sink for JSON Data
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLUME-2390
>                 URL: https://issues.apache.org/jira/browse/FLUME-2390
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.4.0
>         Environment: CDH4.5
>            Reporter: Deepak Subhramanian
>
> Hi,
> I am using the Elasticsearch sink to post JSON data. I used the temporary fix mentioned in https://issues.apache.org/jira/browse/FLUME-2126 to get JSON data posted to Elasticsearch. When one of the messages fails validation against the Elasticsearch mapping for JSON data (for example, an empty message), Flume seems to post the entire batch again and again until I restart Flume. Because of that, the number of events went from an average of 100 to an average of 2000 per 10 minutes. As a temporary fix, I set a header in my Flume HTTP source for invalid JSON and used an interceptor to send data to multiple Elasticsearch sinks that have different index names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Commented] (FLUME-2390) Flume-ElasticSearch Data gets posted multiple times when one of the event fail validation at elastic search sink for JSON Data

Posted by Deepak Subhramanian <de...@gmail.com>.
Sorry, I missed replying to the previous question.

The issue is not related to the document ID. We were sending XML and
JSON documents to the same channel while migrating the data structure
from XML to JSON. When a JSON message is sent first, it sets a
mapping in the index. Any later batch that contains an XML message then
fails validation, and the batch is processed repeatedly without success.
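
As a rough sketch of how a mixed batch could be separated before it reaches the sink (an illustration only, not Flume code): the hypothetical split_batch helper below assumes each event body is raw bytes and that only bodies that parse as a JSON object should go to the JSON index.

```python
import json


def split_batch(bodies):
    """Split event bodies into (valid_json_objects, rejected).

    Hypothetical helper: anything that is not a JSON object
    (XML, empty bodies, bare arrays or scalars) is rejected so it
    cannot make the whole batch fail validation.
    """
    valid, rejected = [], []
    for body in bodies:
        try:
            parsed = json.loads(body)
        except ValueError:
            # Covers JSON syntax errors and undecodable bytes.
            rejected.append(body)
            continue
        if isinstance(parsed, dict):
            valid.append(body)
        else:
            rejected.append(body)
    return valid, rejected


valid, rejected = split_batch([b'{"a": 1}', b"<event/>", b"", b"[1]"])
```

The rejected bodies could then be tagged with a header and routed to a separate index, along the lines of the workaround described in the ticket.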

As Ben mentioned, the same issue is logged in ticket FLUME-2254. You
can close this ticket as a duplicate.
https://issues.apache.org/jira/browse/FLUME-2254

Thanks everyone for looking into it.




-- 
Deepak Subhramanian