Posted to user@metron.apache.org by at...@gmail.com, at...@gmail.com on 2019/03/26 02:25:29 UTC

Metron Writes partial JSON to HDFS

Hi,
 When I try to index the data using batchSize=150, the default batchTimeout, and a TimedRotationPolicy set to 30 minutes, Metron creates some JSON files in HDFS with incomplete data: the last record in the HDFS file contains only a portion of the JSON record. When I try to read the indexed data through a Hive external table, it throws an exception due to the partial JSON in the file, so while the data is streaming I am not able to do any operation on the indexed data.
 
Indexed file example:
    {"key1":"value1","key2":"value2","key3":"value3"}
    {"key1":"value1","key2":"value2","key3" 
	
When I tried to find the root cause of this behavior, I came across the following observations:
  1. Metron flushes the data to HDFS based on a CountSyncPolicy; by default its value is set to the batchSize.
  2. When Metron performs the file rotation, it first closes the current file, which also results in a flush to HDFS.
  3. Regardless of the batchSize, Metron writes the data to HDFS after the batchTimeout.
  4. The CountSyncPolicy has no relation to the batchTimeout: even if the batchTimeout expires and Metron writes the data to HDFS, it won't initiate the sync; it still waits for the number of messages to reach the CountSyncPolicy count. Is this behavior intentional? Without the sync the end user won't be able to access the data completely, which defeats the purpose of the batchTimeout.
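
To make observation 4 concrete, here is a toy model in plain Python (my own sketch, not Metron or Storm code): the writer syncs only on a record count, while the timeout flush hands bytes to HDFS without a sync, so a reader may see the file cut at an arbitrary byte boundary, i.e. a torn tail record like the example above.

```python
class TornTailWriter:
    """Toy model of a count-synced HDFS writer; not real Metron/Storm code."""

    def __init__(self, sync_count):
        self.sync_count = sync_count  # plays the role of CountSyncPolicy
        self.stream = ""              # everything handed to the HDFS stream
        self.synced = ""              # what a reader is guaranteed to see
        self.since_sync = 0

    def write(self, record):
        self.stream += record + "\n"
        self.since_sync += 1
        if self.since_sync >= self.sync_count:
            self.sync()

    def sync(self):
        # hsync: expose everything written so far to readers
        self.synced = self.stream
        self.since_sync = 0

    def batch_timeout_flush(self):
        # The timeout flush does NOT sync; a reader may observe the file
        # truncated mid-record. Simulate an arbitrary byte boundary by
        # exposing all but the last few bytes.
        self.synced = self.stream[:-5]

w = TornTailWriter(sync_count=3)
w.write('{"key1":"value1"}')
w.write('{"key2":"value2"}')
w.batch_timeout_flush()  # timeout fires before the 3rd record arrives
print(w.synced)          # the last visible line is a partial JSON record
```

Once a third record arrives and the count threshold is reached, `sync()` runs and the torn tail heals, which matches the behavior described in the observations.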
  
Due to the amount of data I am writing, I won't be able to set the CountSyncPolicy to 1, as that would impact performance.

Currently our indexing directory structure is "yyyy/MM/dd". I need to do some operations on the newly indexed data based on a sliding window, currently configured with max_window = 1 and a window size of one hour. Every hour I move the window to current_window_hour + 1. While the data is still streaming, I hit the JSON format error in Hive.

Can you suggest any methods to overcome this issue?
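
One read-side workaround for the torn tail (a sketch of my own, independent of Hive; the function name is hypothetical) is to skip any line that does not parse as complete JSON before handing the data to downstream processing:

```python
import json

def read_complete_records(path):
    """Yield only the JSON records that parse completely, skipping any
    torn partial line left at the end of a not-yet-synced HDFS file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Most likely the torn tail record; skip (or log) it.
                continue
```

A similar effect may be achievable inside Hive itself if the SerDe in use supports tolerating malformed rows, but that depends on the SerDe and would need to be verified for your setup.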

Re: Metron Writes partial JSON to HDFS

Posted by Michael Miklavcic <mi...@gmail.com>.
Hi, this looks like you may be getting failures when writing to HDFS. For
example, if there's a problem with a batch, it's possible that it will be
partially written to HDFS. I would expect that in instances like this
you will see duplicate entries in your HDFS records. The reason for this is
that we are an at-least-once processing system, which means there may be
duplicates and/or errant records that need to be purged or skipped when
processing. A couple of ways to confirm are:

   - look for any errors indexed to HDFS
   - look in ES or Solr, e.g. curl -XGET "http://node1:9200/error*/_search"
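
The purging/skipping of duplicates mentioned above can be sketched like this (plain Python, my own example; it assumes each record carries a unique id field such as the "guid" Metron typically attaches to messages):

```python
import json

def dedupe_records(lines, key="guid"):
    """Drop duplicate records produced by at-least-once delivery,
    keyed on a unique id field (assumed to be 'guid' here)."""
    seen = set()
    out = []
    for line in lines:
        rec = json.loads(line)
        k = rec.get(key)
        if k is None:
            out.append(rec)  # no id to dedupe on; keep the record
            continue
        if k in seen:
            continue         # duplicate delivery; skip it
        seen.add(k)
        out.append(rec)
    return out
```

In practice the same keyed-deduplication could be done in Hive with a GROUP BY on the id field, but the idea is the same: at-least-once delivery means downstream consumers must be prepared to drop repeats.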



On Mon, Mar 25, 2019 at 8:25 PM athulpersonal@gmail.com <
athulpersonal@gmail.com> wrote:
