Posted to user@storm.apache.org by Gaurav Agarwal <ga...@gmail.com> on 2015/12/01 19:09:09 UTC

Re: Writing file to storm hdfs

Thanks, Aaron.
On Dec 1, 2015 1:54 AM, "Aaron.Dossett" <Aa...@target.com> wrote:

> Well, not all of the reasons were entirely unrelated:
>
>
>    - If data stopped flowing from Kafka completely then a rotation might
>    not happen for a very long time and I wanted to guarantee time bounds on
>    when I processed files.  A time-based rotation policy would have addressed
>    this, but that was not desirable for other reasons.
>    - If the topology or bolt completely crashed and restarted, rotation
>    actions would never be triggered, as I understand it, on the files that
>    were open at the time of the crash.  It’s for this reason that I have never
>    used rotation actions.
>
> -Aaron
>
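For reference, the kind of time-based rotation policy and rotation action Aaron
mentions (but decided against) would be configured on the storm-hdfs HdfsBolt
roughly as follows. This is only a sketch: the filesystem URL, paths, delimiter,
and the 30-minute interval are placeholders, not values from his setup.

    import org.apache.storm.hdfs.bolt.HdfsBolt;
    import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
    import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
    import org.apache.storm.hdfs.bolt.rotation.TimedRotationPolicy;
    import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
    import org.apache.storm.hdfs.common.rotation.MoveFileAction;

    // Rotate every 30 minutes so a stalled Kafka feed still yields a bounded
    // file, and move each rotated file into a "ready" directory that a
    // downstream job can watch.
    HdfsBolt hdfsBolt = new HdfsBolt()
            .withFsUrl("hdfs://namenode:8020")
            .withFileNameFormat(new DefaultFileNameFormat().withPath("/storm/open/"))
            .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter("|"))
            .withSyncPolicy(new CountSyncPolicy(1000))
            .withRotationPolicy(new TimedRotationPolicy(30.0f, TimedRotationPolicy.TimeUnit.MINUTES))
            .addRotationAction(new MoveFileAction().toDestination("/storm/ready/"));

    // Caveat from the discussion above: the MoveFileAction only fires on a
    // clean rotation, so a file that was open when the bolt crashed is never
    // moved, which is why Aaron tracks processed files externally instead.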
> From: Aaron Dossett <Aa...@target.com>
> Reply-To: "user@storm.apache.org" <us...@storm.apache.org>
> Date: Monday, November 30, 2015 at 2:08 PM
> To: "user@storm.apache.org" <us...@storm.apache.org>
> Subject: Re: Writing file to storm hdfs
>
> No, I have a separate process that runs periodically and determines which
> files haven’t been processed before.  Hooking directly into the rotation
> wasn’t an option for me for unrelated reasons.
>
> From: Gaurav Agarwal <ga...@gmail.com>
> Reply-To: "user@storm.apache.org" <us...@storm.apache.org>
> Date: Monday, November 30, 2015 at 2:06 PM
> To: "user@storm.apache.org" <us...@storm.apache.org>
> Subject: Re: Writing file to storm hdfs
>
> Hello Aaron,
> Please correct me if I am wrong: you start processing files as soon as they
> are written and rotated by the HDFS bolt?
> On Dec 1, 2015 12:41 AM, "Aaron.Dossett" <Aa...@target.com> wrote:
>
>> I recently had to solve a use case like that.  I decided to track which
>> files I had processed instead of records within each file.  If a file is
>> still open for writing you could ignore it and come back for it later, or
>> insert it more than once if your process is idempotent.
>>
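A skeleton of the separate, periodic process Aaron describes might look like the
sketch below. It is only an illustration: the ten-minute "probably still being
written" heuristic and the in-memory set of processed names are assumptions for
the example, not details of his implementation (in practice the processed list
would be persisted, for example in the target database).

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ProcessedFileTracker {
        // Names of files already handed off to the database-load job.
        private final Set<String> processed = new HashSet<String>();

        // Returns files that look finished and have not been handled yet.
        public List<FileStatus> findNewFiles(FileSystem fs, Path outputDir) throws IOException {
            long cutoff = System.currentTimeMillis() - 10 * 60 * 1000L;
            List<FileStatus> ready = new ArrayList<FileStatus>();
            for (FileStatus status : fs.listStatus(outputDir)) {
                String name = status.getPath().getName();
                // Skip files modified recently: the HDFS bolt may still be
                // writing to them, so come back on the next run.
                if (status.getModificationTime() > cutoff) {
                    continue;
                }
                if (!processed.add(name)) {
                    continue; // already handled on an earlier run
                }
                ready.add(status);
            }
            return ready;
        }
    }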
>> From: Gaurav Agarwal <ga...@gmail.com>
>> Reply-To: "user@storm.apache.org" <us...@storm.apache.org>
>> Date: Monday, November 30, 2015 at 1:01 PM
>> To: "user@storm.apache.org" <us...@storm.apache.org>
>> Subject: Writing file to storm hdfs
>>
>> Hello
>>
>> In our Storm topology we are receiving millions of tuples from Kafka, and we
>> have to perform some calculations in a bolt. In parallel we have a bolt that
>> writes into HDFS; the parallelism hint for the writing bolt is 8, so there
>> will be 8 files.
>> The problem is that once the snapshot data is enriched, written to the
>> multiple files, and completed, we have to trigger another job that copies
>> the records from those files into a database.
>> With multiple files being created and bolts writing to them in parallel, how
>> can we tell which is the last record written, so that we can trigger the
>> next job? Any ideas?
>>
>
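For context, the layout described in the question (a Kafka spout feeding a
calculation bolt, plus an HDFS-writing bolt with parallelism hint 8, hence 8
open files, one per HdfsBolt executor) would be wired together roughly like
this. The component names are placeholders, EnrichmentBolt is a hypothetical
stand-in for the calculation bolt, and kafkaSpout and hdfsBolt are assumed to
be built elsewhere.

    import backtype.storm.topology.TopologyBuilder;

    TopologyBuilder builder = new TopologyBuilder();

    // Kafka spout (storm-kafka SpoutConfig construction omitted)
    builder.setSpout("kafka-spout", kafkaSpout, 4);

    // Calculation / enrichment bolt
    builder.setBolt("enrich-bolt", new EnrichmentBolt(), 8)
           .shuffleGrouping("kafka-spout");

    // HDFS writer: parallelism hint 8, so up to 8 files are open at once,
    // one per HdfsBolt executor
    builder.setBolt("hdfs-bolt", hdfsBolt, 8)
           .shuffleGrouping("enrich-bolt");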