Posted to user@storm.apache.org by Aiman Najjar <na...@gmail.com> on 2014/11/03 16:24:52 UTC

Storm Trident use case

Hello,

I'm new to Storm. I'm trying to build a Trident topology that exports rows
from an Oracle database to HDFS, using an existing implementation of
HdfsState.

I wrote my own Trident spout that emits tuples in one-hour time slices (so
the coordinator metadata are timestamps). For instance, if the time now is
5:30pm and the current position of the spout coordinator is 12pm, the spout
will emit 5 batches of tuples.
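
In case it helps, the batch coordinator is roughly like the sketch below
(simplified, with illustrative names and constants, not my exact code):

    import storm.trident.spout.ITridentSpout;

    // Simplified sketch: the coordinator metadata for each batch is the
    // start-of-slice timestamp (epoch millis); slices are one hour long.
    public class HourSliceCoordinator implements ITridentSpout.BatchCoordinator<Long> {

        private static final long SLICE_MS = 60L * 60L * 1000L;   // one-hour slices

        private final long firstSliceStart;   // where the export begins, e.g. the 12pm position
        private long nextSliceStart;

        public HourSliceCoordinator(long firstSliceStart) {
            this.firstSliceStart = firstSliceStart;
            this.nextSliceStart = firstSliceStart;
        }

        @Override
        public Long initializeTransaction(long txid, Long prevMetadata, Long currMetadata) {
            // The metadata handed to the emitter is simply the slice's start timestamp.
            long sliceStart = (prevMetadata == null) ? firstSliceStart : prevMetadata + SLICE_MS;
            nextSliceStart = sliceStart + SLICE_MS;
            return sliceStart;
        }

        @Override
        public boolean isReady(long txid) {
            // Only start a new batch once the next full hour has elapsed.
            return nextSliceStart + SLICE_MS <= System.currentTimeMillis();
        }

        @Override
        public void success(long txid) { }

        @Override
        public void close() { }
    }

The emitter side then queries Oracle for the rows that fall inside
[sliceStart, sliceStart + 1 hour) and emits them as that batch's tuples.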

My problem is that I'm not sure I'm taking advantage of Storm's distributed
processing: will those 5 batches be processed in parallel? Is my design
optimal? Here's what my topology looks like:

    topology.newStream(tableName + "_AUDIT_STREAM", auditSpout)
            .partitionPersist(factory, tableFields, new HdfsUpdater(), new Fields());
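
One variation I've been wondering about is something like the following; the
shuffle() repartitioning and the parallelism value of 4 are just guesses on
my part, not something I've tested:

    topology.newStream(tableName + "_AUDIT_STREAM", auditSpout)
            .parallelismHint(1)    // single spout coordinator
            .shuffle()             // spread each batch's tuples across the persist tasks
            .partitionPersist(factory, tableFields, new HdfsUpdater(), new Fields())
            .parallelismHint(4);   // e.g. four HdfsState partitions writing in parallel

From what I've read, batches from a single Trident spout are processed largely
in order, with the number of in-flight batches capped by
topology.max.spout.pending, so I assume most of the parallelism has to come
from partitioning the tuples within each batch rather than from the batches
themselves.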


Thanks

Re: Storm Trident use case

Posted by Aiman Najjar <na...@gmail.com>.
Oh, that's a valid question. I forgot to mention that once HDFS is brought up
to date (i.e. the spout coordinator's time position has reached the current
time), the topology will no longer emit one-hour slices; instead it will emit,
in near real time, any new data that has just been inserted into Oracle. It
will still batch them in 5-second slices, though, so at that point it becomes
real-time streaming to HDFS. I'm not sure that's something that can be done
via Sqoop? At some point we'll be streaming to a web service endpoint as well,
which I don't think is doable via Sqoop.
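
Roughly, the coordinator logic for that switch would be something like the
sketch below. It's only a simplified variation of what I described in my first
message, with the metadata widened to a start/end pair so a slice's boundaries
stay fixed once the coordinator creates them (to keep replays of a failed
batch consistent); all names and constants are illustrative:

    import java.io.Serializable;

    import storm.trident.spout.ITridentSpout;

    // Illustrative slice metadata: boundaries are fixed when the coordinator
    // creates the slice, so replays and the emitter always see the same range.
    class TimeSlice implements Serializable {
        final long start;
        final long end;
        TimeSlice(long start, long end) { this.start = start; this.end = end; }
    }

    public class CatchUpCoordinator implements ITridentSpout.BatchCoordinator<TimeSlice> {

        private static final long HOUR_MS = 60L * 60L * 1000L;
        private static final long REALTIME_MS = 5 * 1000L;

        private final long firstSliceStart;   // e.g. the 12pm position from my first message
        private long nextStart;

        public CatchUpCoordinator(long firstSliceStart) {
            this.firstSliceStart = firstSliceStart;
            this.nextStart = firstSliceStart;
        }

        @Override
        public TimeSlice initializeTransaction(long txid, TimeSlice prev, TimeSlice curr) {
            if (curr != null) {
                return curr;                  // replaying a failed batch: reuse its original slice
            }
            long start = (prev == null) ? firstSliceStart : prev.end;
            long now = System.currentTimeMillis();
            // Still a full hour behind? Keep backfilling in one-hour slices;
            // otherwise switch to 5-second slices for near-real-time streaming.
            long end = (start + HOUR_MS <= now) ? start + HOUR_MS : start + REALTIME_MS;
            nextStart = end;
            return new TimeSlice(start, end);
        }

        @Override
        public boolean isReady(long txid) {
            // Start the next batch once at least a 5-second slice of new data can exist.
            return nextStart + REALTIME_MS <= System.currentTimeMillis();
        }

        @Override
        public void success(long txid) { }

        @Override
        public void close() { }
    }

The emitter would then just query Oracle for the rows in [slice.start,
slice.end), whether that slice is an hour or 5 seconds long.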

Thanks



On Mon, Nov 3, 2014 at 1:15 PM, Babu, Prashanth <Pr...@nttdata.com>
wrote:

>  Any specific reason for using Storm for this use case and not Sqoop?

RE: Storm Trident use case

Posted by "Babu, Prashanth" <Pr...@nttdata.com>.
Any specific reason for using Storm for this use case and not Sqoop?
