Posted to common-user@hadoop.apache.org by murat migdisoglu <mu...@gmail.com> on 2012/05/21 19:31:28 UTC
Is mapper called per row when used with Cassandra
Hi,
I'm quite new to Hadoop and am trying to understand how task splitting works
when it is used with Cassandra's ColumnFamilyInputFormat.
I have a very basic scenario: Cassandra stores a sessionId and BSON data
that contains the username. I want to go through all rows and dump a row
to a file when its username matches a certain criterion. I do not
need a Reducer or Combiner for now.
After writing the following very simple Hadoop job, I see from the
logs that my map function is called once for each row. Is that normal? If
that is the case, doing such a search on a big dataset would take
hours, if not days.
I guess I need a better understanding of exactly how a job is split into
tasks.
@Override
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
        throws IOException, InterruptedException
{
    String rowkey = ByteBufferUtil.string(key);
    // Value to match against, read from the job configuration under the key IP.
    String ip = context.getConfiguration().get(IP);
    IColumn column = columns.get(sourceColumn);
    if (column == null)
        return;
    ByteBuffer byteBuffer = column.value();
    // duplicate() shares the bytes but keeps its own position, so decoding cannot disturb it.
    ByteBuffer bb2 = byteBuffer.duplicate();
    DataConvertor convertor = fromBson(byteBuffer, DataConvertor.class);
    String username = convertor.getUsername();
    if (username != null && username.equals(ip)) {
        // Emit the raw column bytes, keyed by the row key.
        byte[] arr = convertToByteArray(bb2);
        BytesWritable value = new BytesWritable(arr);
        Text tkey = new Text(rowkey);
        context.write(tkey, value);
    } else {
        log.info("ip not match [" + ip + "]");
    }
}
Thanks in advance
Kind Regards
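On the question above: Hadoop calls map() once per input record, so one call per row is expected. The parallelism comes from the input being divided into splits, with each split handed to its own map task. Below is a toy model of that relationship in plain Java, with no Hadoop or Cassandra dependency. The class SplitSketch, its methods, and the thread pool standing in for map tasks are all hypothetical names made up for illustration; how ColumnFamilyInputFormat actually sizes its splits is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: a thread pool plays the role of the cluster's map tasks.
public class SplitSketch {
    // Counts how many times map() is invoked across all simulated tasks.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Stand-in for a mapper: called once per row, emits the row key on a match.
    static void map(String rowKey, String username, String wanted, List<String> out) {
        mapCalls.incrementAndGet();
        if (wanted.equals(username)) {
            out.add(rowKey);
        }
    }

    // Cut the rows into contiguous splits of at most splitSize rows each.
    static List<List<String[]>> splits(List<String[]> rows, int splitSize) {
        List<List<String[]>> result = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += splitSize) {
            result.add(rows.subList(i, Math.min(i + splitSize, rows.size())));
        }
        return result;
    }

    // One simulated task per split; each task loops over its own rows.
    static List<String> run(List<String[]> rows, int splitSize, String wanted) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<List<String>>> tasks = new ArrayList<>();
            for (List<String[]> split : splits(rows, splitSize)) {
                tasks.add(pool.submit(() -> {
                    List<String> out = new ArrayList<>();
                    for (String[] row : split) {
                        map(row[0], row[1], wanted, out);
                    }
                    return out;
                }));
            }
            // Collect results in split order so the output is deterministic.
            List<String> matches = new ArrayList<>();
            for (Future<List<String>> task : tasks) {
                matches.addAll(task.get());
            }
            return matches;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Ten fake rows: {sessionId, username}.
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(new String[] {"s" + i, i % 5 == 0 ? "alice" : "bob"});
        }
        List<String> matches = run(rows, 3, "alice");
        System.out.println("splits=" + splits(rows, 3).size()
                + " mapCalls=" + mapCalls.get() + " matches=" + matches);
        // prints: splits=4 mapCalls=10 matches=[s0, s5]
    }
}
```

In this model the number of map() calls always equals the number of rows, while the number of tasks that can run side by side depends only on how many splits there are. A full scan therefore speeds up with more splits and more task slots, not with fewer map() calls.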
Re: Is mapper called per row when used with Cassandra
Posted by highpointe <hi...@gmail.com>.