Posted to user@cassandra.apache.org by Arijit Mukherjee <ar...@gmail.com> on 2010/12/07 11:11:19 UTC

About a drastic change in performance

Hi All

I was building an application which stores some telecom call records
in a Cassandra store and later performs some analysis on them. I
created two versions: (1) where the key is of the form "A|B", where A
and B are two mobile numbers and A calls B, and (2) where the key is
of the form "A|B|TS", where A and B are the same as before and TS is
a time stamp (the time when the call was made). In both cases, the
record structure is as follows: [key, DURATION_S1, DURATION_S2],
where S1 and S2 are two different sources for the same information
(two network elements).

Basically I have two files from two network elements, and I parse the
files and store the records in Cassandra. In both versions, there is a
column family (<ColumnFamily CompareWith="UTF8Type" Name="Event"/>)
which is used to store the records. In the first version, as (A|B) may
occur multiple times within the files, the DURATION_S* fields are
updated every time a duplicate key is encountered. In the second case,
(A|B|TS) is unique, so there is no need to update the DURATION_S*
fields. As a result, the second case has slightly more records in the
Event CF: 9590, compared to 8378 in the first.
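
For reference, this is roughly how each record is written. It is only
a sketch against the 0.6 Thrift API that the query code below also
uses (KEYSPACE, ENCODING and client are the same fields as in that
code, and storeEvent is just an illustrative helper, not a method I
actually have under that name):

    // Sketch: write one Event row via the 0.6 Thrift API.
    // The key is either "A|B" (version 1) or "A|B|TS" (version 2).
    private void storeEvent(String key, String durSrc1, String durSrc2) throws Exception {
        long timestamp = System.currentTimeMillis();

        ColumnPath src1Path = new ColumnPath("Event");
        src1Path.setColumn("DUR_SRC1".getBytes(ENCODING));
        client.insert(KEYSPACE, key, src1Path, durSrc1.getBytes(ENCODING),
                timestamp, ConsistencyLevel.ONE);

        ColumnPath src2Path = new ColumnPath("Event");
        src2Path.setColumn("DUR_SRC2".getBytes(ENCODING));
        client.insert(KEYSPACE, key, src2Path, durSrc2.getBytes(ENCODING),
                timestamp, ConsistencyLevel.ONE);
    }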

In both versions, the records are processed and sent to Cassandra
within a reasonable period of time. The problem is in a range_slice
query. I am trying to find all the records for which DURATION_S1 !=
DURATION_S2. And in fact, the code to do this is almost the same in
both versions:

    /**
     * Selects all entries from the Event CF and finds out all those entries
     * where DUR_SRC1 != DUR_SRC2
     */
    public void findDurationMismatches() {
        boolean iterate = true;
        long count = 0;
        long totalCount = 0;
        int rowCount = 500;
        try {
            KeyRange keyRange = new KeyRange();
            keyRange.setStart_key("");
            keyRange.setEnd_key("");
            // use a keyCount of 500 - means iterate over 500 records until all keys are considered
            keyRange.setCount(rowCount);
            List<byte[]> columns = new ArrayList<byte[]>();
            columns.add("DUR_SRC1".getBytes(ENCODING));
            columns.add("DUR_SRC2".getBytes(ENCODING));

            SlicePredicate slicePredicate = new SlicePredicate();
            slicePredicate.setColumn_names(columns);
            ColumnParent columnParent = new ColumnParent(EVENT_CF);
            List<KeySlice> keySlices = client.get_range_slices(KEYSPACE, columnParent,
                    slicePredicate, keyRange, ConsistencyLevel.ONE);

            while (iterate) {
                //logger.debug("Number of rows retrieved: " + keySlices.size());
                totalCount = totalCount + keySlices.size();
                if (keySlices.size() < rowCount) {
                    // this is the last set
                    iterate = false;
                }
                for (KeySlice keySlice : keySlices) {
                    List<ColumnOrSuperColumn> result = keySlice.getColumns();
                    if (result.size() == 2) {
                        String count_src1 = new String(result.get(0).getColumn().getValue(), ENCODING);
                        String count_src2 = new String(result.get(1).getColumn().getValue(), ENCODING);
                        if (!count_src1.equals(count_src2)) {
                            count++;
                            //printToConsole(keySlice.getKey(), keySlice.getColumns());
                        }
                        keyRange.setStart_key(keySlice.getKey());
                    }
                    keySlices = client.get_range_slices(KEYSPACE, columnParent, slicePredicate,
                            keyRange, ConsistencyLevel.ONE);
                }
            }
            logger.debug("Found " + count + " records with mismatched duration fields.");
            logger.debug("Total number of records processed: " + totalCount);

        } catch (Exception exception) {
            exception.printStackTrace();
            logger.error("Exception: " + exception.getMessage());
        }
    }

The trouble is that the same code takes more than 5 minutes to iterate
over the 9590 records in the 2nd version, whereas it takes about 2-3
seconds to iterate over the 8300 records in the 1st version, on the
same machine.

I can't think of any reason why the performance would change so
drastically. What am I doing wrong here?

Regards
Arijit



-- 
"And when the night is cloudy,
There is still a light that shines on me,
Shine on until tomorrow, let it be."

Re: About a drastic change in performance

Posted by Nick Bailey <ni...@riptano.com>.
On Wed, Dec 8, 2010 at 1:19 AM, Arijit Mukherjee <ar...@gmail.com> wrote:

>  So how do you iterate over all records


You can iterate over your records with RandomPartitioner; they will
just come back in the order of their hashes, not the order of the
keys.
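
Roughly, the paging loop would look something like this (just a sketch
reusing the client, KEYSPACE, columnParent, slicePredicate and rowCount
from your code; the point is that the next get_range_slices call is
issued once per page, with start_key set to the last key of the
previous page, and the repeated first row skipped):

    // Sketch: page over all rows with get_range_slices under RandomPartitioner.
    // Rows come back in hash order; each page starts from the last key already seen.
    KeyRange keyRange = new KeyRange();
    keyRange.setStart_key("");
    keyRange.setEnd_key("");
    keyRange.setCount(rowCount);

    String lastKey = null;
    boolean more = true;
    while (more) {
        List<KeySlice> keySlices = client.get_range_slices(KEYSPACE, columnParent,
                slicePredicate, keyRange, ConsistencyLevel.ONE);
        more = (keySlices.size() == rowCount);
        for (KeySlice keySlice : keySlices) {
            // start_key is inclusive, so every page after the first begins with
            // the last row of the previous page - skip it
            if (keySlice.getKey().equals(lastKey)) {
                continue;
            }
            lastKey = keySlice.getKey();
            // ... compare DUR_SRC1 and DUR_SRC2 here ...
        }
        if (more) {
            keyRange.setStart_key(lastKey);
        }
    }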


> or try to find a list of all records matching a certain criteria?


It sounds like you want a secondary index on a specific column.
http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes
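
Note that the built-in indexes in 0.7 only let you query for an exact
match on an indexed column value, so a condition like DUR_SRC1 !=
DUR_SRC2 would have to be precomputed at write time, e.g. as a
MISMATCH flag column that you index. A rough sketch of the read side
with the 0.7 Thrift API (the MISMATCH column and its KEYS index are
assumptions about the schema, not something in your current column
family):

    // Sketch: query an indexed MISMATCH column with the 0.7 Thrift API.
    // Assumes the Event CF declares column_metadata for MISMATCH with index_type KEYS,
    // and that the writer stores "1" in MISMATCH whenever DUR_SRC1 != DUR_SRC2.
    client.set_keyspace(KEYSPACE);

    IndexExpression expr = new IndexExpression(
            ByteBuffer.wrap("MISMATCH".getBytes(ENCODING)),
            IndexOperator.EQ,
            ByteBuffer.wrap("1".getBytes(ENCODING)));

    IndexClause clause = new IndexClause();
    clause.addToExpressions(expr);
    clause.setStart_key(ByteBuffer.wrap(new byte[0]));
    clause.setCount(500);

    SlicePredicate predicate = new SlicePredicate();
    predicate.addToColumn_names(ByteBuffer.wrap("DUR_SRC1".getBytes(ENCODING)));
    predicate.addToColumn_names(ByteBuffer.wrap("DUR_SRC2".getBytes(ENCODING)));

    List<KeySlice> matches = client.get_indexed_slices(
            new ColumnParent("Event"), clause, predicate, ConsistencyLevel.ONE);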




Re: About a drastic change in performance

Posted by Arijit Mukherjee <ar...@gmail.com>.
Apologies. I've been too stupid to realize that I had placed the
pagination statement in a ridiculous place :-(

One question though - OPP is mandatory for such pagination, isn't it?
But then I've read elsewhere on this list that OPP has its drawbacks.
So how do you iterate over all records, or find a list of all records
matching certain criteria? Is the Hadoop approach the only
alternative?

Arijit




-- 
"And when the night is cloudy,
There is still a light that shines on me,
Shine on until tomorrow, let it be."