Posted to user@cassandra.apache.org by Sonny Heer <so...@gmail.com> on 2010/03/01 20:57:20 UTC

Re: Binary memory table flush question

After I moved the column family creation so it happens only once per
column family, I now get too much data ingested.  It's almost as if the
columns aren't being put into the correct rows.

Can someone explain what the method below is doing?

    public static Message createMessage(String Keyspace, String Key,
                                        String CFName, LinkedList<ColumnFamily> ColumnFamiles)
    {
        ColumnFamily baseColumnFamily;
        DataOutputBuffer bufOut = new DataOutputBuffer();
        RowMutation rm;
        Message message;
        Column column;

        /* Get the first column family from the list; this is just to get
           past validation. */
        baseColumnFamily = new ColumnFamily(CFName, "Standard",
                DatabaseDescriptor.getComparator(Keyspace, CFName), null);

        for (ColumnFamily cf : ColumnFamiles)
        {
            bufOut.reset();
            try
            {
                /* Serialize the whole column family into a byte[]... */
                ColumnFamily.serializer().serializeWithIndexes(cf, bufOut);
                byte[] data = new byte[bufOut.getLength()];
                System.arraycopy(bufOut.getData(), 0, data, 0, bufOut.getLength());

                /* ...and store it as a single column whose name is the
                   column family's name, with timestamp 0. */
                column = new Column(cf.name().getBytes("UTF-8"), data, 0);
                baseColumnFamily.addColumn(column);
            }
            catch (IOException e)
            {
                throw new RuntimeException("loc 9 " + e);
            }
        }
        rm = new RowMutation(Keyspace, Key);
        rm.add(baseColumnFamily);

        try
        {
            /* Make message */
            message = rm.makeRowMutationMessage(StorageService.binaryVerbHandler_);
        }
        catch (IOException e)
        {
            throw new RuntimeException("loc 10 " + e);
        }
        return message;
    }



It seems the row mutation doesn't always put my data into the correct
row.  Note I'm using Cassandra 0.5, and I'm not using Hadoop but reading
directly from a directory of text files.  For a given row I'm getting
far too many columns inserted.  When I change the program to use the
Cassandra Thrift API directly, I get the correct data.  Any ideas?

Thanks.



On Fri, Feb 26, 2010 at 3:55 PM, Sonny Heer <so...@gmail.com> wrote:
> I believe the problem is caused by the create call.  I wasn't sure what
> exactly that method was doing; now I do :)
> Thanks.
>
> On Fri, Feb 26, 2010 at 3:32 PM, Sonny Heer <so...@gmail.com> wrote:
>> Hey,
>>
>> I have an application which is iterating over a directory of text
>> files.  For each document it ingests words as keys, with the docid as
>> the column name and an empty column value (no super columns).
>>
>> Below is the code I'm using to construct a key and column:
>>
>> ColumnFamily docCF = ColumnFamily.create(keyspaceStr, cfStr);
>> docCF.addColumn(
>>   new Column( docid.getBytes("UTF-8"), "".getBytes("UTF-8"), 0 )
>> );
>> LinkedList<ColumnFamily> docCFs = new LinkedList<ColumnFamily>();
>> docCFs.add(docCF);
>>
>> Message message = createMessage(keyspaceStr, key, cfStr, docCFs);
>>
>> /* Send message to end point */
>> for (InetAddress endpoint:
>> StorageService.instance().getNaturalEndpoints(key.toString()))
>> {
>>      MessagingService.instance().sendOneWay(message, endpoint);
>> }
>>
>> The createMessage method is the same code as in
>> CassandraBulkLoader.java.  That snippet of code is run for each term
>> found in a document.  This should give me unique terms as keys, each
>> with the list of docids in which that term appears.
>>
>> After running this program, I do a flush of the keyspace.  I noticed
>> I'm only getting the last message sent.  It appears to be overwriting
>> my previous (key, docCFs) combinations.
>>
>> I tried flushing programmatically after each send like so:
>> StorageService.instance().forceTableFlush(keyspaceStr, cfStr);
>>
>> This did not have any effect.  I'm new to Cassandra, so I hope someone
>> with more experience can chime in with some help :)
>>
>> Thanks.
>>
>