Posted to user@storm.apache.org by Noam Weinberger <NW...@cablevision.com> on 2016/07/19 20:40:05 UTC

Trident HBaseState query and update ordering

Hello,

(I hope I’m sending this to the correct place. If not, I apologize.)

I am writing a Trident topology that receives new data from a spout, reads values from HBase, averages the new data with the HBase data, and writes the result to HBase.

I initially used the HBaseState library for a separate stateQuery and partitionPersist. However, I’m concerned that it is possible for Batch 2 to query HBase before Batch 1 has updated it (i.e., while Batch 1 is still doing the averaging calculations). In this case, Batch 2 will receive outdated information and write over Batch 1 in HBase.
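To make the concern concrete, here is a minimal, illustrative sketch of the interleaving described above. It does not use Storm or HBase at all: a plain in-memory map stands in for the HBase table, and the class name, row key, starting value, and batch values are all hypothetical. It only shows how a stale read leads to a lost update when two read-average-write cycles overlap.

```java
import java.util.HashMap;
import java.util.Map;

public class LostUpdateDemo {
    // Hypothetical stand-in for the HBase table: row key -> stored average.
    static Map<String, Double> newStore() {
        Map<String, Double> store = new HashMap<>();
        store.put("row", 10.0);
        return store;
    }

    // The "averaging" step: combine the stored value with the new data.
    static double average(double stored, double incoming) {
        return (stored + incoming) / 2.0;
    }

    // Batch 1 completes its read-average-write cycle before Batch 2 reads.
    static double serial(double batch1, double batch2) {
        Map<String, Double> store = newStore();
        store.put("row", average(store.get("row"), batch1));
        store.put("row", average(store.get("row"), batch2));
        return store.get("row");
    }

    // Batch 2 reads while Batch 1 is still averaging, so it sees a stale value.
    static double interleaved(double batch1, double batch2) {
        Map<String, Double> store = newStore();
        double read1 = store.get("row");           // Batch 1 query
        double read2 = store.get("row");           // Batch 2 query (stale)
        store.put("row", average(read1, batch1));  // Batch 1 update
        store.put("row", average(read2, batch2));  // Batch 2 overwrites Batch 1
        return store.get("row");
    }

    public static void main(String[] args) {
        System.out.println("serial      = " + serial(20.0, 30.0));      // 22.5
        System.out.println("interleaved = " + interleaved(20.0, 30.0)); // 20.0
    }
}
```

In the interleaved case the final value no longer reflects Batch 1's data at all, which is the outcome the question is trying to rule out.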

Is there any way to query and write to an external database and enforce ordering through the entire process, such that one batch will not even query until the previous batch has finished updating?

Thanks,
Noam


--------------------------------------------------------
The information transmitted in this email and any of its attachments is intended only for the person or entity to which it is addressed and may contain information concerning Cablevision and/or its affiliates and subsidiaries that is proprietary, privileged, confidential and/or subject to copyright. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient(s) is prohibited and may be unlawful. If you received this in error, please contact the sender immediately and delete and destroy the communication and all of the attachments you have received and all copies thereof.
--------------------------------------------------------


Re: Trident HBaseState query and update ordering

Posted by Noam Weinberger <NW...@cablevision.com>.
Arun,

Thank you for your prompt reply and helpful information.

I just want to quickly clarify:
Setting topology.max.spout.pending to 1 allows only one batch at a time to be processed anywhere in the topology. So until Batch 1 has reached the end of the whole topology, Batch 2 will not even be emitted from the spout. Is that correct?

-Noam

From: Arun Iyer <ai...@hortonworks.com> on behalf of Arun Mahadevan <ar...@apache.org>
Reply-To: "user@storm.apache.org" <us...@storm.apache.org>
Date: Tuesday, July 19, 2016 at 9:30 PM
To: "user@storm.apache.org" <us...@storm.apache.org>
Subject: Re: Trident HBaseState query and update ordering


If I understand correctly, the issue is that your Trident topology queries the same state that’s being updated to compute the result. You can control the number of batches that Trident processes simultaneously by adjusting the value of “topology.max.spout.pending”. If you set it to 1, the batches should be processed serially. See the pipelining section at http://storm.apache.org/releases/1.0.1/Trident-spouts.html

Thanks,
Arun

Re: Trident HBaseState query and update ordering

Posted by Arun Mahadevan <ar...@apache.org>.
If I understand correctly, the issue is that your Trident topology queries the same state that’s being updated to compute the result. You can control the number of batches that Trident processes simultaneously by adjusting the value of “topology.max.spout.pending”. If you set it to 1, the batches should be processed serially. See the pipelining section at http://storm.apache.org/releases/1.0.1/Trident-spouts.html
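
As a rough sketch of that setting: Storm's topology configuration is a map of string keys to values, so the fragment below builds the entry with a plain HashMap to stay self-contained. The class and method names are hypothetical. In an actual topology you would most likely use org.apache.storm.Config (itself a HashMap) and call conf.setMaxSpoutPending(1) before passing it to StormSubmitter.submitTopology.

```java
import java.util.HashMap;
import java.util.Map;

public class SerialBatchConfig {
    // Same key string as Storm's Config.TOPOLOGY_MAX_SPOUT_PENDING constant.
    static final String MAX_SPOUT_PENDING = "topology.max.spout.pending";

    // Builds a topology config map that limits Trident to one in-flight batch.
    static Map<String, Object> serialBatchConf() {
        Map<String, Object> conf = new HashMap<>();
        conf.put(MAX_SPOUT_PENDING, 1);
        return conf;
    }

    public static void main(String[] args) {
        // Prints {topology.max.spout.pending=1}
        System.out.println(serialBatchConf());
    }
}
```

Note that processing one batch at a time gives up the throughput benefit of pipelining, so this trades speed for ordering.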

Thanks,
Arun