Posted to user@cassandra.apache.org by Mark Schnitzius <ma...@cxense.com> on 2010/05/03 12:49:51 UTC

Feeding in specific Cassandra columns into Hadoop

Hi all...  I am trying to feed a specific list of Cassandra column names
as input to a Hadoop process, but for some reason only some of the columns
I specify get fed in, not all of them.

This is a short description of the problem - I'll see if anyone might have
some insight before I dump a big load of code on you...

1.  I've uploaded a bunch of data into Cassandra; the column names are longs
(dates, basically) converted to byte[8].

2.  I can successfully set a SlicePredicate using setSlice_range to return
all the data for a set of columns.

3.  However, if I instead call setColumn_names on the SlicePredicate, only
some of the specified columns get fed into Hadoop.  (Both setups are
sketched below, after this list.)

4.  This faulty behavior is repeatable, with the same columns going missing
each time for the same input parameters.

5.  For the values that fail, I'm fairly certain that the value for the
column name is getting inserted successfully, and that the exact same
column name is specified in the call to setColumn_names.
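
For concreteness, here is a trimmed-down sketch of the two setups.  The
column family configuration is omitted, the dates and count are
placeholders, and the SliceRange arguments are from memory, so treat it
as a rough illustration rather than the exact job code:

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.hadoop.conf.Configuration;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class PredicateSetup {

    public static void main(String[] args) {
        // Variant A: a slice range over everything -- this works, and all
        // columns come through to the Hadoop job.
        SlicePredicate rangePredicate = new SlicePredicate();
        rangePredicate.setSlice_range(
                new SliceRange(new byte[0], new byte[0], false, 1000));

        // Variant B: explicit column names (dates as longs packed into
        // byte[8]) -- only some of these make it into the job.
        List<byte[]> columnNames = new ArrayList<byte[]>();
        columnNames.add(toBytes(1271253600000L));  // placeholder date
        columnNames.add(toBytes(1271340000000L));  // placeholder date
        SlicePredicate namePredicate = new SlicePredicate();
        namePredicate.setColumn_names(columnNames);

        // The job is configured with one variant or the other.
        Configuration conf = new Configuration();
        ConfigHelper.setSlicePredicate(conf, namePredicate);  // or rangePredicate
    }

    private static byte[] toBytes(long l) {
        byte[] bytes = new byte[8];
        ByteBuffer.wrap(bytes).putLong(l);
        return bytes;
    }
}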

Any clues?


AdTHANKSvance,
Mark

Re: Feeding in specific Cassandra columns into Hadoop

Posted by Mark Schnitzius <ma...@cxense.com>.
> You should test that getSlicePredicate(conf).equals(originalPredicate)

That's it!  The byte arrays come back slightly different after the
predicate is set on the Hadoop config and read back out.  Below is a simple
test which demonstrates the bug -- it should print "true" but instead
prints "false".  Please let me know if a bug gets raised so I can track it.


Thanks
Mark


import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/**
 * A class which demonstrates a bug in Cassandra's ConfigHelper: a
 * SlicePredicate with binary column names does not survive the round
 * trip through the Hadoop Configuration.
 */
public class SlicePredicateTest {

    public static void main(String[] args) {
        // A single column name: a date expressed as a long, packed into byte[8].
        long columnValue = 1271253600000L;
        byte[] columnBytes = getBytes(columnValue);

        List<byte[]> columnNames = new ArrayList<byte[]>();
        columnNames.add(columnBytes);

        SlicePredicate originalPredicate = new SlicePredicate();
        originalPredicate.setColumn_names(columnNames);

        // Write the predicate into the Hadoop config, read it back out,
        // and compare.  This should print "true" but instead prints "false".
        Configuration conf = new Configuration();
        ConfigHelper.setSlicePredicate(conf, originalPredicate);
        System.out.println(
                ConfigHelper.getSlicePredicate(conf).equals(originalPredicate));
    }

    private static byte[] getBytes(long l) {
        byte[] bytes = new byte[8];
        ByteBuffer.wrap(bytes).putLong(l);
        return bytes;
    }
}
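
For what it's worth, dumping the first column name before and after the
round trip makes the difference visible.  This is a hypothetical addition
to the end of main() above, assuming the generated getter is
getColumn_names():

        // Show the raw bytes before and after the trip through the
        // Hadoop Configuration.
        SlicePredicate roundTripped = ConfigHelper.getSlicePredicate(conf);
        System.out.println("before: "
                + java.util.Arrays.toString(originalPredicate.getColumn_names().get(0)));
        System.out.println("after:  "
                + java.util.Arrays.toString(roundTripped.getColumn_names().get(0)));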

Re: Feeding in specific Cassandra columns into Hadoop

Posted by Jonathan Ellis <jb...@gmail.com>.
We serialize the SlicePredicate as part of the Hadoop Configuration
string.  It's quite possible that either

 - one of your column names is exposing a bug in the Thrift JSON serializer
 - Hadoop is silently truncating large predicates

You should test that getSlicePredicate(conf).equals(originalPredicate)
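
If you want to take Hadoop out of the picture entirely, something like
this would exercise just the Thrift JSON round trip (an untested sketch,
assuming the standard libthrift TSerializer/TDeserializer API):

import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TJSONProtocol;

import java.nio.ByteBuffer;
import java.util.Collections;

public class ThriftJsonRoundTrip {

    public static void main(String[] args) throws Exception {
        // A binary column name like yours: a date-as-long packed into byte[8].
        byte[] name = new byte[8];
        ByteBuffer.wrap(name).putLong(1271253600000L);

        SlicePredicate original = new SlicePredicate();
        original.setColumn_names(Collections.singletonList(name));

        // Serialize and deserialize with the Thrift JSON protocol only --
        // no Hadoop Configuration involved.
        TSerializer serializer = new TSerializer(new TJSONProtocol.Factory());
        TDeserializer deserializer = new TDeserializer(new TJSONProtocol.Factory());

        SlicePredicate copy = new SlicePredicate();
        deserializer.deserialize(copy, serializer.serialize(original));

        // "false" here would point at the JSON serialization of binary
        // column names rather than at Hadoop.
        System.out.println(copy.equals(original));
    }
}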

On Mon, May 3, 2010 at 8:15 PM, Mark Schnitzius
<ma...@cxense.com> wrote:
> If I take the exact same SlicePredicate that fails in the Hadoop example,
> and pass it in to a multiget_slice, the data is returned successfully.  So
> it appears the problem does lie somewhere in the tie-in to Hadoop.
> I will try to create a maximally-trimmed-down example that's complete enough
> to run on its own that demonstrates the failure, and will post here.  I was
> just hoping that there might've been an easy fix recognizable from my
> description before I had to resort to that...
>
> Thanks
> Mark



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Feeding in specific Cassandra columns into Hadoop

Posted by Mark Schnitzius <ma...@cxense.com>.
If I take the exact same SlicePredicate that fails in the Hadoop example,
and pass it in to a multiget_slice, the data is returned successfully.  So
it appears the problem does lie somewhere in the tie-in to Hadoop.
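
In case it's useful, this is roughly the direct Thrift call that succeeds
(0.6-style API from memory; host, keyspace, column family and row key are
placeholders):

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class MultigetCheck {

    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        // The same date-as-long column name that goes missing via Hadoop.
        byte[] name = new byte[8];
        ByteBuffer.wrap(name).putLong(1271253600000L);
        List<byte[]> columnNames = new ArrayList<byte[]>();
        columnNames.add(name);

        SlicePredicate predicate = new SlicePredicate();
        predicate.setColumn_names(columnNames);

        ColumnParent parent = new ColumnParent();
        parent.setColumn_family("MyColumnFamily");  // placeholder

        // Going straight to Thrift, every requested column comes back.
        Map<String, List<ColumnOrSuperColumn>> rows = client.multiget_slice(
                "MyKeyspace",                    // placeholder keyspace
                Arrays.asList("some-row-key"),   // placeholder row key
                parent, predicate, ConsistencyLevel.ONE);

        System.out.println(rows);
        socket.close();
    }
}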

I will try to create a maximally trimmed-down, self-contained example that
demonstrates the failure, and will post it here.  I was just hoping there
might be an easy fix recognizable from my description before I had to
resort to that...


Thanks
Mark


On Tue, May 4, 2010 at 1:40 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> Can you reproduce outside the Hadoop environment, i.e. w/ Thrift code?

Re: Feeding in specific Cassandra columns into Hadoop

Posted by Jonathan Ellis <jb...@gmail.com>.
Can you reproduce outside the Hadoop environment, i.e. w/ Thrift code?




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com