Posted to user@pig.apache.org by Bing Wei <bl...@gmail.com> on 2011/04/21 00:59:47 UTC

pig query on Cassandra

Hi, All.

When I do a pig query on Cassandra, and the Cassandra is updated by
application at the same time, what will happen? I may get inconsistent
results, right?

-- 
Bing

Graduate Student
Computer Science Department, UCSB :)

Re: pig query on Cassandra

Posted by Jeremy Hanna <je...@gmail.com>.
On Apr 21, 2011, at 9:25 AM, Mridul Muralidharan wrote:

> On Thursday 21 April 2011 06:41 PM, Jeremy Hanna wrote:
>> 
>> On Apr 21, 2011, at 3:19 AM, Mridul Muralidharan wrote:
>> 
>>> 
>>> In general (on hadoop based systems), if the input is not immutable - you can end up with issues during task re-execution, etc.
>>> This happens not just for cassandra but for hbase, others too - where you modify data in-place.
>>> 
>> 
>> So do you mean that between the time of the first execution and the time of the re-execution, input data can change?  Yes that's possible.  However, unless you are reading stale data the second time, it's not a consistency issue, is it?  I mean, if I am guaranteed to read the most recent data on the first execution and the second execution, that's consistent.  If I am reading updated data the second time, that's consistent and may or may not be a problem.
>> 
>> Just trying to make sure I understand.
> 
> To clarify, I am referring to re-execution of a task, not job.
> 
> From a (single) hadoop job point of view (and everything else which consumes its output) - it is a consistency issue: the re-execution of a task can generate a set of key/values which are different from the initial invocation (which might have been used by some reducers).
> 

Good point about inputs that are not immutable.  Currently Cassandra doesn't have a way to snapshot the data to be immutable inputs.  Created a ticket to address that - https://issues.apache.org/jira/browse/CASSANDRA-2527

I guess I was more focused on Cassandra's architecture wrt consistency since it's often misunderstood -  and how to use consistency levels with mapreduce/pig.
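
Until then, one thing that narrows the window (only a partial mitigation - it doesn't help with tasks that get retried after a failure) is to turn off speculative execution, so hadoop doesn't launch duplicate attempts of the same task that could read the data at a different point in time.  Something like this at the top of the Pig script should do it (these are the old mapred.* property names from the 0.20.x line - adjust for your hadoop version):

-- keep hadoop from launching duplicate (speculative) attempts of each task
set mapred.map.tasks.speculative.execution 'false';
set mapred.reduce.tasks.speculative.execution 'false';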

> 
> Regards,
> Mridul
> 
>> 
>>> 
>>> 
>>> Regards,
>>> Mridul
>>> 
>>> On Thursday 21 April 2011 04:29 AM, Bing Wei wrote:
>>>> Hi, All.
>>>> 
>>>> When I do a pig query on Cassandra, and the Cassandra is updated by
>>>> application at the same time, what will happen? I may get inconsistent
>>>> results, right?
>>>> 
>>> 
>> 
> 


Re: pig query on Cassandra

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
On Thursday 21 April 2011 06:41 PM, Jeremy Hanna wrote:
>
> On Apr 21, 2011, at 3:19 AM, Mridul Muralidharan wrote:
>
>>
>> In general (on hadoop based systems), if the input is not immutable - you can end up with issues during task re-execution, etc.
>> This happens not just for cassandra but for hbase, others too - where you modify data in-place.
>>
>
> So do you mean that between the time of the first execution and the time of the re-execution, input data can change?  Yes that's possible.  However, unless you are reading stale data the second time, it's not a consistency issue, is it?  I mean, if I am guaranteed to read the most recent data on the first execution and the second execution, that's consistent.  If I am reading updated data the second time, that's consistent and may or may not be a problem.
>
> Just trying to make sure I understand.

To clarify, I am referring to re-execution of a task, not job.

From a (single) hadoop job point of view (and everything else which consumes its output) - it is a consistency issue: the re-execution of a task can generate a set of key/values which are different from the initial invocation (which might have been used by some reducers).


Regards,
Mridul

>
>>
>>
>> Regards,
>> Mridul
>>
>> On Thursday 21 April 2011 04:29 AM, Bing Wei wrote:
>>> Hi, All.
>>>
>>> When I do a pig query on Cassandra, and the Cassandra is updated by
>>> application at the same time, what will happen? I may get inconsistent
>>> results, right?
>>>
>>
>


Re: pig query on Cassandra

Posted by Jeremy Hanna <je...@gmail.com>.
On Apr 21, 2011, at 3:19 AM, Mridul Muralidharan wrote:

> 
> In general (on hadoop based systems), if the input is not immutable - you can end up with issues during task re-execution, etc.
> This happens not just for cassandra but for hbase, others too - where you modify data in-place.
> 

So do you mean that between the time of the first execution and the time of the re-execution, input data can change?  Yes that's possible.  However, unless you are reading stale data the second time, it's not a consistency issue, is it?  I mean, if I am guaranteed to read the most recent data on the first execution and the second execution, that's consistent.  If I am reading updated data the second time, that's consistent and may or may not be a problem.

Just trying to make sure I understand.

> 
> 
> Regards,
> Mridul
> 
> On Thursday 21 April 2011 04:29 AM, Bing Wei wrote:
>> Hi, All.
>> 
>> When I do a pig query on Cassandra, and the Cassandra is updated by
>> application at the same time, what will happen? I may get inconsistent
>> results, right?
>> 
> 


Re: pig query on Cassandra

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
In general (on hadoop based systems), if the input is not immutable - 
you can end up with issues during task re-execution, etc.
This happens not just for cassandra but for hbase, others too - where 
you modify data in-place.



Regards,
Mridul

On Thursday 21 April 2011 04:29 AM, Bing Wei wrote:
> Hi, All.
>
> When I do a pig query on Cassandra, and the Cassandra is updated by
> application at the same time, what will happen? I may get inconsistent
> results, right?
>


Re: pig query on Cassandra

Posted by Bing Wei <bl...@gmail.com>.
Thanks Jeremy. The link is of great help. The pig query only cares about
rows with certain key patterns. For example, it only cares about rows with
key values beginning with "aaa". For each row, the query only cares about
one column in the row.
For writes, a new row with that column can be inserted, or a row with that
column can be deleted.
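
Just to make that concrete, the read side of the query looks roughly like
this in Pig (the keyspace, column family, and column names below are made
up, and the exact load schema depends on which version of CassandraStorage
you use):

rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage()
       AS (key: chararray, columns: bag {T: tuple(name, value)});

-- keep only rows whose keys begin with "aaa"
prefix_rows = FILTER rows BY key matches 'aaa.*';

-- pull out the single column of interest from each row
-- (rows that don't have that column drop out at the FLATTEN)
wanted = FOREACH prefix_rows {
    col = FILTER columns BY name == 'the_column_i_care_about';
    GENERATE key, FLATTEN(col.value);
};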

On Wed, Apr 20, 2011 at 5:36 PM, Jeremy Hanna <je...@gmail.com> wrote:

> The answer is that it depends on which consistency level you are reading
> and writing at.  You can make sure you are always reading consistent data by
> using quorum for reads and quorum for writes.
>
> For more information on consistency level, see:
> http://www.datastax.com/docs/0.7/consistency/index
>
> With Pig, you can specify the consistency level that you want to read at
> with the following property in your hadoop configuration:
> cassandra.consistencylevel.read
>
> So you can read at whatever consistency level you wish for each row.  The
> peculiarity with pig for reading and writing at the same time is that pig is
> by nature a batch job.  It's going to go over a set of columns for every row
> in the column family.  So when you say you're writing at the same time,
> which row do you mean?  But for example, if you are reading a particular row
> with consistency level "quorum" and you're writing with consistency level
> "quorum" to that row, you will see consistent results.
>
> On Apr 20, 2011, at 5:59 PM, Bing Wei wrote:
>
> > Hi, All.
> >
> > When I do a pig query on Cassandra, and the Cassandra is updated by
> > application at the same time, what will happen? I may get inconsistent
> > results, right?
> >
> > --
> > Bing
> >
> > Graduate Student
> > Computer Science Department, UCSB :)
>
>


-- 
Bing

Graduate Student
Computer Science Department, UCSB :)

Re: pig query on Cassandra

Posted by Jeremy Hanna <je...@gmail.com>.
The answer is that it depends on which consistency level you are reading and writing at.  You can make sure you are always reading consistent data by using quorum for reads and quorum for writes.

For more information on consistency level, see:
http://www.datastax.com/docs/0.7/consistency/index

With Pig, you can specify the consistency level that you want to read at with the following property in your hadoop configuration:
cassandra.consistencylevel.read

So you can read at whatever consistency level you wish for each row.  The peculiarity with pig for reading and writing at the same time is that pig is by nature a batch job.  It's going to go over a set of columns for every row in the column family.  So when you say you're writing at the same time, which row do you mean?  But for example, if you are reading a particular row with consistency level "quorum" and you're writing with consistency level "quorum" to that row, you will see consistent results.
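
For example, you can set that at the top of your Pig script and it gets passed through to the job's hadoop configuration - the rest of the script doesn't need to change (the value is just the ConsistencyLevel name, e.g. ONE, QUORUM, ALL):

-- read every row in this job at QUORUM
set cassandra.consistencylevel.read 'QUORUM';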

On Apr 20, 2011, at 5:59 PM, Bing Wei wrote:

> Hi, All.
> 
> When I do a pig query on Cassandra, and the Cassandra is updated by
> application at the same time, what will happen? I may get inconsistent
> results, right?
> 
> -- 
> Bing
> 
> Graduate Student
> Computer Science Department, UCSB :)