Posted to common-user@hadoop.apache.org by murat migdisoglu <mu...@gmail.com> on 2012/05/21 19:31:28 UTC

Is mapper called per row when used with Cassandra

Hi,

I'm quite new to Hadoop and am trying to understand how task splitting works
when used with Cassandra's ColumnFamilyInputFormat.

I have a very basic scenario: Cassandra stores a sessionId and BSON data
that contains a username. I want to go through all rows and dump a row to a
file when its username matches certain criteria. I do not need a Reducer or
Combiner for now.

After writing the following very simple Hadoop job, I see from the logs
that my map function is called once per row. Is that normal? If so, such a
search over a big dataset would take hours if not days...

I guess I need a better understanding of how splitting the job into tasks
actually works.


    @Override
    public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                    Context context)
            throws IOException, InterruptedException
    {
        // Row key of the current Cassandra row, and the filter value
        // passed in through the job configuration.
        String rowkey = ByteBufferUtil.string(key);
        String ip = context.getConfiguration().get(IP);

        // Skip rows that do not contain the column we need.
        IColumn column = columns.get(sourceColumn);
        if (column == null)
            return;

        // Duplicate the buffer so that deserializing from one view does
        // not disturb the position of the view we write out later.
        ByteBuffer byteBuffer = column.value();
        ByteBuffer bb2 = byteBuffer.duplicate();

        DataConvertor convertor = fromBson(byteBuffer, DataConvertor.class);
        String username = convertor.getUsername();
        // Emit the row only when the username matches the configured value.
        if (username != null && username.equals(ip)) {
            byte[] arr = convertToByteArray(bb2);
            BytesWritable value = new BytesWritable();
            value.set(arr, 0, arr.length);
            context.write(new Text(rowkey), value);
        } else {
            log.info("ip does not match [" + ip + "]");
        }
    }
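
For completeness, my driver is set up roughly like the sketch below. This is
a minimal sketch rather than my exact code: the names SessionDumpJob and
SessionMapper, the keyspace/column family ("MyKeyspace", "Sessions"), the
column name "data", and the split size are all illustrative placeholders. As
far as I understand it, ConfigHelper.setInputSplitSize controls how many rows
go into each input split, and each split becomes one map task, so map() still
runs once per row but the splits are processed in parallel:

    import java.util.Arrays;

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SessionDumpJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "session-dump");
            job.setJarByClass(SessionDumpJob.class);
            job.setMapperClass(SessionMapper.class); // mapper with the map() above
            job.setNumReduceTasks(0);                // map-only: no Reducer or Combiner
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(BytesWritable.class);
            job.setInputFormatClass(ColumnFamilyInputFormat.class);

            // Where to reach Cassandra and which column family to read
            // (keyspace and column family names are placeholders).
            ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
            ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
            ConfigHelper.setInputPartitioner(job.getConfiguration(),
                    "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                    "MyKeyspace", "Sessions");

            // Fetch only the column the mapper actually reads.
            SlicePredicate predicate = new SlicePredicate()
                    .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("data")));
            ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

            // Rows per input split; each split is handled by one map task,
            // so this setting governs parallelism, not the per-row map() calls.
            ConfigHelper.setInputSplitSize(job.getConfiguration(), 65536);

            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

At least that is how I read ColumnFamilyInputFormat; please correct me if the
split size does not work that way.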

Thanks in advance
Kind Regards

Re: Is mapper called per row when used with Cassandra

Posted by highpointe <hi...@gmail.com>.
Here is my SS:  259 71 2451
