Posted to dev@cassandra.apache.org by Tharindu Mathew <mc...@gmail.com> on 2011/09/12 07:35:38 UTC

Implementing an input format that splits according to column size

Hi,

I plan to do $subject and contribute.

Right now, the Hadoop integration splits input by the number of rows,
applying a single slice predicate to each row. This doesn't scale when a
row has a large number of columns.
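
For reference, this is roughly how a job is wired up today (a minimal
sketch; keyspace/CF names are made up and the exact ConfigHelper method
names have shifted between releases, so treat it as illustrative):

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class WideRowJobSetup
    {
        public static void main(String[] args) throws Exception
        {
            Job job = new Job(new Configuration(), "wide-row-example");
            job.setInputFormatClass(ColumnFamilyInputFormat.class);

            // Splits are computed over rows (token ranges); the predicate
            // below is applied to every row, so one huge row still lands on
            // a single mapper.
            ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                                              "MyKeyspace", "MyIndexCF");

            // fetch all columns of each row, capped at 1000
            SliceRange range = new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                                              ByteBufferUtil.EMPTY_BYTE_BUFFER,
                                              false, 1000);
            ConfigHelper.setInputSlicePredicate(job.getConfiguration(),
                                                new SlicePredicate().setSlice_range(range));

            // (initial address, rpc port and partitioner settings omitted)
        }
    }

The count on the SliceRange caps how many columns come back per row, but
the row as a whole still goes to one mapper; the row itself is never
split.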

I'd like to know from the cassandra-devs how feasible this is.

-- 
Regards,

Tharindu

blog: http://mackiemathew.com/

Re: Implementing an input format that splits according to column size

Posted by Brandon Williams <dr...@gmail.com>.
On Mon, Sep 12, 2011 at 1:54 PM, Tharindu Mathew <mc...@gmail.com> wrote:
> Thanks Brandon for the clarification.
>
> I'd like to support a use case where an index is built within a single row of a CF.

If you're just _building_ the row, the current state of things will
work just fine.  The trouble starts when you need to read it via
hadoop.

> So, as the starting point for a query, a known row with a large number of
> columns will have to be selected. The splits handed to the Hadoop nodes
> should start at that level, i.e. within that single row.

The other problem here is that if you want 10 nodes to operate on the row
and have RF=3, you're losing locality for 7 of the nodes.  If the task is
heavily CPU-bound this is probably OK; otherwise it may be better to use
only 3 nodes (since they will have a local replica).

> Is this a common use case?

I'm not entirely sure what it is you want to do yet, but maybe I
answered it above.

-Brandon

Re: Implementing an input format that splits according to column size

Posted by Tharindu Mathew <mc...@gmail.com>.
Thanks Brandon for the clarification.

I'd like to support a use case where an index is built within a single row of a CF.

So, as the starting point for a query, a known row with a large number of
columns will have to be selected. The splits handed to the Hadoop nodes
should start at that level, i.e. within that single row.
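
To make the layout concrete: the index is one well-known wide row whose
columns map indexed values to target row keys. Written over Thrift it
would look something like this (a rough sketch; the keyspace, CF, row key
and method names here are made up for illustration, connection setup
omitted):

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class IndexRowWriter
    {
        // client is an already-connected Cassandra.Client
        public void addIndexEntry(Cassandra.Client client,
                                  String indexedValue, String targetKey)
                throws Exception
        {
            // all index entries live in one row: one column per
            // indexed value -> target row key
            Column col = new Column();
            col.setName(ByteBufferUtil.bytes(indexedValue));
            col.setValue(ByteBufferUtil.bytes(targetKey));
            col.setTimestamp(System.currentTimeMillis() * 1000);

            client.insert(ByteBufferUtil.bytes("my_index_row"),
                          new ColumnParent("MyIndexCF"),
                          col,
                          ConsistencyLevel.QUORUM);
        }
    }

A query then starts by slicing columns out of that one row, which is why
the split needs to happen inside the row rather than across rows.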

Is this a common use case?

Maybe there is a way to do this with the current implementation that I'm
not seeing. If so, could you share how?

On Mon, Sep 12, 2011 at 7:01 PM, Brandon Williams <dr...@gmail.com> wrote:

> On Mon, Sep 12, 2011 at 12:35 AM, Tharindu Mathew <mc...@gmail.com>
> wrote:
> > Hi,
> >
> > I plan to do $subject and contribute.
> >
> > Right now, the Hadoop integration splits input by the number of rows,
> > applying a single slice predicate to each row. This doesn't scale when a
> > row has a large number of columns.
> >
> > I'd like to know from the cassandra-devs how feasible this is.
>
> It's feasible, but not entirely easy.  Essentially you need to page
> through the row since you can't know how large it is beforehand.  IIRC
> though, this breaks the current input format contract, since an entire
> row is expected to be returned.
>
> -Brandon
>



-- 
Regards,

Tharindu

blog: http://mackiemathew.com/

Re: Implementing an input format that splits according to column size

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Sep 12, 2011 at 8:31 AM, Brandon Williams <dr...@gmail.com> wrote:
> It's feasible, but not entirely easy.  Essentially you need to page
> through the row since you can't know how large it is beforehand.  IIRC
> though, this breaks the current input format contract, since an entire
> row is expected to be returned.

This is one scenario that
https://issues.apache.org/jira/browse/CASSANDRA-2474 addresses, btw.
(Once we update CFIF to accept a CQL query.)
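
To illustrate the shape of it with the CQL that exists over Thrift today
(purely illustrative: the exact syntax, and how CFIF would issue such a
query per page, is precisely what is still being worked out; the CF and
row names are made up):

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Compression;
    import org.apache.cassandra.thrift.CqlResult;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class CqlPageSketch
    {
        // client is an already-connected Cassandra.Client.
        // Fetch one page of columns from a single wide row, resuming
        // from the last column seen on the previous page.
        public CqlResult fetchPage(Cassandra.Client client, String lastColumn)
                throws Exception
        {
            // query string is illustrative only
            String cql = "SELECT FIRST 1000 '" + lastColumn + "'..'' " +
                         "FROM MyIndexCF WHERE KEY = 'my_index_row'";
            return client.execute_cql_query(ByteBufferUtil.bytes(cql),
                                            Compression.NONE);
        }
    }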

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Implementing an input format that splits according to column size

Posted by Brandon Williams <dr...@gmail.com>.
On Mon, Sep 12, 2011 at 12:35 AM, Tharindu Mathew <mc...@gmail.com> wrote:
> Hi,
>
> I plan to do $subject and contribute.
>
> Right now, the Hadoop integration splits input by the number of rows,
> applying a single slice predicate to each row. This doesn't scale when a
> row has a large number of columns.
>
> I'd like to know from the cassandra-devs how feasible this is.

It's feasible, but not entirely easy.  Essentially you need to page
through the row since you can't know how large it is beforehand.  IIRC
though, this breaks the current input format contract, since an entire
row is expected to be returned.
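
The paging itself is simple enough at the Thrift level; per row, the
record reader would need to do something like this (a rough sketch, names
made up, connection setup omitted; the awkward part is fitting it behind
the existing row-at-a-time contract):

    import java.nio.ByteBuffer;
    import java.util.List;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class WideRowPager
    {
        private static final int PAGE_SIZE = 1000;

        // client is an already-connected Cassandra.Client
        public void pageRow(Cassandra.Client client, ByteBuffer rowKey, String cf)
                throws Exception
        {
            ByteBuffer start = ByteBufferUtil.EMPTY_BYTE_BUFFER;
            while (true)
            {
                SliceRange range = new SliceRange(start,
                                                  ByteBufferUtil.EMPTY_BYTE_BUFFER,
                                                  false, PAGE_SIZE);
                SlicePredicate predicate = new SlicePredicate().setSlice_range(range);
                List<ColumnOrSuperColumn> page =
                    client.get_slice(rowKey, new ColumnParent(cf), predicate,
                                     ConsistencyLevel.ONE);

                // when resuming, the first column is a repeat of the last one
                // from the previous page, since the slice start is inclusive
                boolean skipFirst = start.remaining() > 0;
                for (ColumnOrSuperColumn cosc : page)
                {
                    if (skipFirst) { skipFirst = false; continue; }
                    process(cosc);
                }

                // a short page means we've reached the end of the row
                if (page.size() < PAGE_SIZE)
                    break;
                start = page.get(page.size() - 1).column.name;
            }
        }

        private void process(ColumnOrSuperColumn cosc) { /* hand off to the mapper */ }
    }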

-Brandon