Posted to user@cassandra.apache.org by 김준영 <ju...@gmail.com> on 2011/01/15 06:00:45 UTC

Is it possible to map one input from a file and one from Cassandra?

Hi,

Cassandra supports Hadoop, so a MapReduce job can read its input from Cassandra.

Now I am trying to find a way to map over a file and Cassandra together.

If both of them were files on my disk, this would be possible by using splits.

But in this situation, which approach is possible?

For example:

In Cassandra:
key1| value1 | value2
key2| value3 | value4
key3| value5 | value6

In a file:
key1| value1 | value2
key2| value7 | value4
key3| value7 | value6


Both data sets are very large.
I want to get the diff between them:

Which keys were deleted?
Which values were changed?

Thanks.
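
To pin down what the requested diff would report, here is a minimal in-memory sketch in plain Java. It is not a Hadoop job and would not scale to very large inputs; it only illustrates the intended semantics. The class and method names (KeyedDiff, keysOnlyIn, changedKeys) are made up for this illustration, and since the post does not say which side counts as "deleted", the sketch reports missing keys in both directions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, for illustration only.
class KeyedDiff {
    /** Keys that appear in {@code left} but not in {@code right}. */
    static Set<String> keysOnlyIn(Map<String, List<String>> left,
                                  Map<String, List<String>> right) {
        Set<String> only = new TreeSet<>(left.keySet());
        only.removeAll(right.keySet());
        return only;
    }

    /** Keys that appear on both sides but carry different values. */
    static Set<String> changedKeys(Map<String, List<String>> left,
                                   Map<String, List<String>> right) {
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : left.entrySet()) {
            List<String> other = right.get(entry.getKey());
            if (other != null && !other.equals(entry.getValue())) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // The sample rows from the post.
        Map<String, List<String>> cassandra = new HashMap<>();
        cassandra.put("key1", Arrays.asList("value1", "value2"));
        cassandra.put("key2", Arrays.asList("value3", "value4"));
        cassandra.put("key3", Arrays.asList("value5", "value6"));

        Map<String, List<String>> file = new HashMap<>();
        file.put("key1", Arrays.asList("value1", "value2"));
        file.put("key2", Arrays.asList("value7", "value4"));
        file.put("key3", Arrays.asList("value7", "value6"));

        System.out.println("only in Cassandra: " + keysOnlyIn(cassandra, file));
        System.out.println("only in file: " + keysOnlyIn(file, cassandra));
        System.out.println("changed: " + changedKeys(cassandra, file));
    }
}
```

With the sample rows, no key is missing on either side, and key2 and key3 are reported as changed.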

Re: Is it possible to map one input from a file and one from Cassandra?

Posted by Jun Young Kim <ju...@gmail.com>.

Thanks, everyone.

-----
Junyoung Kim (juneng603@gmail.com)

On 01/17/2011 10:58 AM, Aaron Morton wrote:
> Thanks for the update.
> Aaron

Re: Is it possible to map one input from a file and one from Cassandra?

Posted by Aaron Morton <aa...@thelastpickle.com>.

Thanks for the update. 
Aaron


On 17 Jan, 2011, at 02:51 PM, Brandon Williams <dr...@gmail.com> wrote:

> 2011/1/16 Jun Young Kim <ju...@gmail.com>
>
>> Hi Aaron.
>>
>> I think that if Pig is able to support mapping it, the same job could be expressed in Java code itself.
>>
>> I believe that we can call a map function by loading a file and Cassandra at the same time.
>>
>> PS: I don't need to join them. I just want to compare the keys that are read from each of them.
>
> We went over this on IRC, but I will repeat the summary for posterity.
>
> This is a case where using the Thrift API, rather than an o.a.c.hadoop construct, is probably better (right now) because ColumnFamilyInputFormat expects to go over the entire CF, and a join in the reducer is costly. Instead, what you really want is per-row access after reading an entry from the file in the map task, so using Hector inside the Hadoop job makes the most sense.
>
> -Brandon

Re: Is it possible to map one input from a file and one from Cassandra?

Posted by Brandon Williams <dr...@gmail.com>.

2011/1/16 Jun Young Kim <ju...@gmail.com>

> Hi Aaron.
>
> I think that if Pig is able to support mapping it, the same job could be
> expressed in Java code itself.
>
> I believe that we can call a map function by loading a file and Cassandra
> at the same time.
>
> PS: I don't need to join them. I just want to compare the keys that are
> read from each of them.
>

We went over this on IRC, but I will repeat the summary for posterity.

This is a case where using the Thrift API, rather than an o.a.c.hadoop
construct, is probably better (right now) because ColumnFamilyInputFormat
expects to go over the entire CF, and a join in the reducer is costly. Instead,
what you really want is per-row access after reading an entry from the file
in the map task, so using Hector inside the Hadoop job makes the most sense.

-Brandon
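
The per-row approach summarized above can be sketched roughly as follows, in plain Java without the Hadoop or Hector dependencies. RowLookup is a hypothetical interface standing in for whatever client call fetches a single row from Cassandra; it is not Hector's or Thrift's actual API. The line format is assumed from the example in the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FileVersusCassandra {
    /** Hypothetical stand-in for a single-row read through a Cassandra client. */
    interface RowLookup {
        /** Returns the column values stored under the key, or null if there is no such row. */
        List<String> fetchRow(String key);
    }

    /**
     * What the body of the map task would do for one input line of the form
     * "key| value | value": look the key up and report how the row differs.
     * Returns null when the file and Cassandra agree.
     */
    static String compareLine(String line, RowLookup cassandra) {
        String[] parts = line.split("\\|");
        String key = parts[0].trim();
        List<String> fileValues = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) {
            fileValues.add(parts[i].trim());
        }
        List<String> stored = cassandra.fetchRow(key);
        if (stored == null) {
            return key + "\tMISSING_IN_CASSANDRA";
        }
        if (!stored.equals(fileValues)) {
            return key + "\tCHANGED";
        }
        return null;
    }

    public static void main(String[] args) {
        // An in-memory map plays the role of the column family here.
        Map<String, List<String>> table = new HashMap<>();
        table.put("key1", Arrays.asList("value1", "value2"));
        table.put("key2", Arrays.asList("value3", "value4"));
        RowLookup lookup = table::get;

        String[] lines = {"key1| value1 | value2", "key2| value7 | value4", "key9| a | b"};
        for (String line : lines) {
            String result = compareLine(line, lookup);
            if (result != null) {
                System.out.println(result);
            }
        }
    }
}
```

Note that a job driven only by the file sees the keys that are in the file. Keys that exist in Cassandra but not in the file would need a separate pass.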

Re: Is it possible to map one input from a file and one from Cassandra?

Posted by Aaron Morton <aa...@thelastpickle.com>.

Yup, everything you can do in Pig is doable in plain Hadoop. When you say you want to compare the keys, you're effectively doing an outer join. That's why I thought Pig might make your life a bit easier.

Good luck.
Aaron
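
To illustrate why comparing keys amounts to an outer join, here is a rough plain-Java sketch of the reduce-side pattern: each record is tagged with its source, records are grouped by key (the role of the shuffle), and each key is then classified. The names (OuterJoinDiff, Tagged, the status labels) are invented for this sketch and are not part of Hadoop, Pig, or Cassandra.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class OuterJoinDiff {
    /** One input record, tagged with the source it came from: "cassandra" or "file". */
    static final class Tagged {
        final String key;
        final String source;
        final String values;

        Tagged(String key, String source, String values) {
            this.key = key;
            this.source = source;
            this.values = values;
        }
    }

    /** Reduce step: classify one key from the values seen on each side (null means absent). */
    static String classify(String cassandraValues, String fileValues) {
        if (cassandraValues == null) {
            return "ONLY_IN_FILE";
        }
        if (fileValues == null) {
            return "ONLY_IN_CASSANDRA";
        }
        return cassandraValues.equals(fileValues) ? "SAME" : "CHANGED";
    }

    /** Groups the tagged records by key, then classifies every key. */
    static Map<String, String> outerJoin(List<Tagged> records) {
        // Slot 0 holds the Cassandra side, slot 1 the file side.
        Map<String, String[]> grouped = new TreeMap<>();
        for (Tagged record : records) {
            String[] slot = grouped.computeIfAbsent(record.key, k -> new String[2]);
            slot["cassandra".equals(record.source) ? 0 : 1] = record.values;
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String[]> entry : grouped.entrySet()) {
            result.put(entry.getKey(), classify(entry.getValue()[0], entry.getValue()[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tagged> records = new ArrayList<>();
        records.add(new Tagged("key1", "cassandra", "value1|value2"));
        records.add(new Tagged("key1", "file", "value1|value2"));
        records.add(new Tagged("key2", "cassandra", "value3|value4"));
        records.add(new Tagged("key2", "file", "value7|value4"));
        records.add(new Tagged("key3", "cassandra", "value5|value6"));
        System.out.println(outerJoin(records));
    }
}
```

Unlike an inner join, the outer join keeps keys that appear on only one side, which is what makes deleted keys visible.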

On 17/01/2011, at 1:07 PM, Jun Young Kim <ju...@gmail.com> wrote:

> Hi Aaron.
> 
> I think that if Pig is able to support mapping it, the same job could be expressed in Java code itself.
> 
> I believe that we can call a map function by loading a file and Cassandra at the same time.
> 
> PS: I don't need to join them. I just want to compare the keys that are read from each of them.
> 
> Thanks.

Re: Is it possible to map one input from a file and one from Cassandra?

Posted by Jun Young Kim <ju...@gmail.com>.

Hi Aaron.

I think that if Pig is able to support mapping it, the same job could be
expressed in Java code itself.

I believe that we can call a map function by loading a file and Cassandra at
the same time.

PS: I don't need to join them. I just want to compare the keys that are
read from each of them.

Thanks.
On 2011. 1. 17. at 5:56 AM, "Aaron Morton" <aa...@thelastpickle.com> wrote:
> The Pig readers are just the same as any other data source, so you should
> be able to mix and match them as you please.
>
> The sample Pig script in contrib/pig/example-script.pig specifies the
> CassandraStorage source when loading data:
>
> rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage();
>
> The LOAD command in Pig Latin supports a USING keyword to identify the
> data source type:
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Load%2FStore+Functions
>
> I'm less familiar with Hadoop, but it should be possible. AFAIK, though,
> it's going to be easier to do a join between data sources with Pig.
>
> Hope that helps.
> Aaron

Re: Is it possible to map one input from a file and one from Cassandra?

Posted by Aaron Morton <aa...@thelastpickle.com>.

The Pig readers are just the same as any other data source, so you should be able to mix and match them as you please.

The sample Pig script in contrib/pig/example-script.pig specifies the CassandraStorage source when loading data:

rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage();

The LOAD command in Pig Latin supports a USING keyword to identify the data source type:
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Load%2FStore+Functions

I'm less familiar with Hadoop, but it should be possible. AFAIK, though, it's going to be easier to do a join between data sources with Pig.

Hope that helps.
Aaron
 

