Posted to user@cassandra.apache.org by Mubarak Seyed <mu...@gmail.com> on 2010/07/13 04:35:37 UTC

CassandraBulkLoader

Where can I find the documentation for BinaryMemTable (bmt_example in
contrib) to use CassandraBulkLoader?

Do I need HDFS to store my storage-conf.xml? What input needs to be
supplied to CassandraBulkLoader?

How should I form the input data, and what format does it need to be in?

-- 
Thanks,
Mubarak Seyed.

Re: CassandraBulkLoader

Posted by Torsten Curdt <tc...@vafer.org>.
> When i run bmt_example, M/R job gets executed, cassandra server  gets the
> data but it goes as HintedHandoff to 127.0.0.2 and it is trying to send data
> to 127.0.0.2 as if 127.0.0.2 is an actual node.

Well, it kind of becomes an actual node.

> Any idea, why does StorageService
> returns 127.0.0.2 as a EndPoint even though 127.0.0.1 is up?

Sorry, not sure. But is it really a problem?

cheers
--
Torsten

Re: CassandraBulkLoader

Posted by Mubarak Seyed <mu...@gmail.com>.
Hi Torsten,

When I run bmt_example, the M/R job executes and the Cassandra server gets
the data, but it goes as hinted handoff to 127.0.0.2, and the server tries to
send data to 127.0.0.2 as if 127.0.0.2 were an actual node. When the job is
done, close() stops the StorageService instance. Any idea why StorageService
returns 127.0.0.2 as an endpoint even though 127.0.0.1 is up?

I am using 127.0.0.1 as the Cassandra server and 127.0.0.2 as a fat client
(configured via word_count/storage-conf.xml).

from CassandraBulkLoader
---------------------------------------
10/07/19 10:58:50 INFO mapred.JobClient:  map 100% reduce 0%
10/07/19 10:58:50 INFO config.DatabaseDescriptor: Auto DiskAccessMode determined to be mmap
10/07/19 10:58:51 INFO service.StorageService: Starting up client gossip
10/07/19 10:58:53 INFO gms.Gossiper: Node /127.0.0.1 is now part of the cluster
10/07/19 10:58:54 INFO gms.Gossiper: InetAddress /127.0.0.1 is now UP


10/07/19 10:59:04 INFO net.MessagingService: Shutting down MessageService...
10/07/19 10:59:04 INFO net.MessagingService: MessagingService shutting down server thread.
10/07/19 10:59:04 INFO net.MessagingService: Shutdown complete (no further commands will be processed)
10/07/19 10:59:04 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
10/07/19 10:59:04 INFO mapred.LocalJobRunner:
10/07/19 10:59:04 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
10/07/19 10:59:04 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/Users/xxxxxxxx/Documents/apache-cassandra-0.6.3-src/contrib/bmt_example/test
10/07/19 10:59:04 INFO mapred.LocalJobRunner: reduce > reduce
10/07/19 10:59:04 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
10/07/19 10:59:05 INFO mapred.JobClient:  map 100% reduce 100%
10/07/19 10:59:05 INFO mapred.JobClient: Job complete: job_local_0001


from Cassandra_home/logs/system.log
--------------------------------------------------------
DEBUG [WRITE-/127.0.0.2] 2010-07-19 10:58:52,501 OutboundTcpConnection.java (line 142) attempting to connect to /127.0.0.2
 INFO [GMFD:1] 2010-07-19 10:58:52,530 Gossiper.java (line 587) Node /127.0.0.2 is now part of the cluster
 INFO [GMFD:1] 2010-07-19 10:58:53,444 Gossiper.java (line 579) InetAddress /127.0.0.2 is now UP
 INFO [HINTED-HANDOFF-POOL:1] 2010-07-19 10:58:53,444 HintedHandOffManager.java (line 153) Started hinted handoff for endPoint /127.0.0.2
 INFO [HINTED-HANDOFF-POOL:1] 2010-07-19 10:58:53,445 HintedHandOffManager.java (line 210) Finished hinted handoff of 0 rows to endpoint /127.0.0.2
DEBUG [Timer-1] 2010-07-19 10:59:06,990 LoadDisseminator.java (line 36) Disseminating load info ...
 INFO [Timer-0] 2010-07-19 10:59:10,223 Gossiper.java (line 181) InetAddress /127.0.0.2 is now dead.
 INFO [WRITE-/127.0.0.2] 2010-07-19 10:59:31,225 OutboundTcpConnection.java (line 102) error writing to /127.0.0.2

On Thu, Jul 15, 2010 at 12:19 PM, Torsten Curdt <tc...@vafer.org> wrote:

> > If you could can you please share the command line function (to load TSV)?
>
> There is no command line function ... you have to write code for this.
>
> > and Can you please help me on storing storage-conf.xml on HDFS part?
>
> As I said. Maybe you better start with a simpler scenario and leave
> out HDFS for now.
>
> cheers
> --
> Torsten
>



-- 
Thanks,
Mubarak Seyed.

Re: CassandraBulkLoader

Posted by Torsten Curdt <tc...@vafer.org>.
> If you could can you please share the command line function (to load TSV)?

There is no command line function ... you have to write code for this.
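
Since there is no ready-made command, the loading code has to start by turning each TSV line into a row key, a column name, and a value. A minimal sketch of that parsing step follows, assuming a hypothetical three-column layout (key, column, value separated by tabs). The thread does not specify the actual format, and `TsvLine`, `parse`, and `parseAll` are illustrative names, not part of bmt_example.

```java
import java.util.ArrayList;
import java.util.List;

/** Parses "key<TAB>column<TAB>value" lines; this layout is an assumption. */
final class TsvLine {
    final String key;
    final String column;
    final String value;

    TsvLine(String key, String column, String value) {
        this.key = key;
        this.column = column;
        this.value = value;
    }

    /** Splits one line on tabs; the -1 limit keeps trailing empty fields. */
    static TsvLine parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length != 3) {
            throw new IllegalArgumentException(
                "expected 3 tab-separated fields, got " + f.length);
        }
        return new TsvLine(f[0], f[1], f[2]);
    }

    /** Parses every line that is not blank, in input order. */
    static List<TsvLine> parseAll(List<String> lines) {
        List<TsvLine> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                out.add(parse(line));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TsvLine row = parse("user42\temail\tuser42@example.org");
        System.out.println(row.key + " / " + row.column + " / " + row.value);
    }
}
```

Rejecting malformed lines up front keeps bad rows out of the load; whether to skip or abort on them is a choice the real loader would have to make.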

> and Can you please help me on storing storage-conf.xml on HDFS part?

As I said. Maybe you better start with a simpler scenario and leave
out HDFS for now.

cheers
--
Torsten

Re: CassandraBulkLoader

Posted by Mubarak Seyed <mu...@gmail.com>.
Hi Torsten,

If you can, could you please share the command-line function (to load TSV)?
And can you please help me with the part about storing storage-conf.xml on
HDFS?

Thanks,
Mubarak

On Tue, Jul 13, 2010 at 1:27 AM, Torsten Curdt <tc...@vafer.org> wrote:

> On Tue, Jul 13, 2010 at 04:35, Mubarak Seyed <mu...@gmail.com> wrote:
> > Where can i find the documentation for BinaryMemTable (btm_example in contrib)
> > to use CassandraBulkLoader? What is the input to be supplied to CassandraBulkLoader?
> > How to form the input data and what is the format of an input data?
>
> The code is the documentation I fear.
>
> I'll see if I get permission to get our updated code contributed.
> We added command line fu and using it to import large TSVs.
>
> > Do i need the HDFS to store my storage-conf.xml?
>
> Why HDFS?
>
> The machine running the bulk loader joins the cassandra ring kind of
> like a temporary node.
> So you will need the storage-conf.xml on that machine.
>
> cheers
> --
> Torsten
>



-- 
Thanks,
Mubarak Seyed.

Re: CassandraBulkLoader

Posted by Torsten Curdt <tc...@vafer.org>.
> look at contrib/bmt_example, with the caveat that it's usually
> premature optimization

I wish that was true for us :)

>> Fact: It has always been straightforward to send the output of Hadoop jobs
>> to Cassandra, and Facebook, Digg, and others have been using Hadoop like
>> this as a Cassandra bulk-loader for over a year.

That we've done as well. With a custom OutputFormat.

>> Does anyone from Facebook or Digg share details on how to use Cassandra
>> BulkLoader?

You just use the StorageProxy and create the RowMutations.
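
Before any RowMutation can be created, the incoming column writes need to be collected per row key, so that each row ends up in one mutation. The sketch below shows only that grouping step in plain Java so that it runs without the Cassandra jar. `ColumnWrite` and `groupByRowKey` are hypothetical stand-ins, and the actual RowMutation construction and StorageProxy call are left as a comment because their exact signatures depend on the Cassandra version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one column write; not a Cassandra class. */
final class ColumnWrite {
    final String rowKey;
    final String column;
    final String value;

    ColumnWrite(String rowKey, String column, String value) {
        this.rowKey = rowKey;
        this.column = column;
        this.value = value;
    }

    /** Groups writes by row key, preserving first-seen key order. */
    static Map<String, List<ColumnWrite>> groupByRowKey(List<ColumnWrite> writes) {
        Map<String, List<ColumnWrite>> byKey = new LinkedHashMap<>();
        for (ColumnWrite w : writes) {
            byKey.computeIfAbsent(w.rowKey, k -> new ArrayList<>()).add(w);
        }
        return byKey;
    }

    public static void main(String[] args) {
        List<ColumnWrite> writes = new ArrayList<>();
        writes.add(new ColumnWrite("user1", "name", "Ann"));
        writes.add(new ColumnWrite("user2", "name", "Bob"));
        writes.add(new ColumnWrite("user1", "email", "ann@example.org"));

        for (Map.Entry<String, List<ColumnWrite>> e : groupByRowKey(writes).entrySet()) {
            // In a real loader, each entry would become one RowMutation for
            // e.getKey(), and the batch would then be handed to StorageProxy.
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " column(s)");
        }
    }
}
```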

>> I could see some details from Arin's presentation on Cassandra @ Digg about
>> data load from MySQL -> Hadoop -> Cassandra.

Maybe you should just try with a simple bulk load first?

>> Can someone please help me?

You need to tell us how :)

cheers
--
Torsten

Re: CassandraBulkLoader

Posted by Jonathan Ellis <jb...@gmail.com>.
look at contrib/bmt_example, with the caveat that it's usually
premature optimization

On Tue, Jul 13, 2010 at 12:31 PM, Mubarak Seyed <mu...@gmail.com> wrote:
> Thanks Torsten.
> Jonathan's blog on Fact Vs Fiction says that
> Fact: It has always been straightforward to send the output of Hadoop jobs
> to Cassandra, and Facebook, Digg, and others have been using Hadoop like
> this as a Cassandra bulk-loader for over a year.
> Does anyone from Facebook or Digg share details on how to use Cassandra
> BulkLoader?
> I could see some details from Arin's presentation on Cassandra @ Digg about
> data load from MySQL -> Hadoop -> Cassandra.
> Can someone please help me?
> Thanks,
> Mubarak
>
> On Tue, Jul 13, 2010 at 1:27 AM, Torsten Curdt <tc...@vafer.org> wrote:
>>
>> On Tue, Jul 13, 2010 at 04:35, Mubarak Seyed <mu...@gmail.com> wrote:
>> > Where can i find the documentation for BinaryMemTable (btm_example in contrib)
>> > to use CassandraBulkLoader? What is the input to be supplied to CassandraBulkLoader?
>> > How to form the input data and what is the format of an input data?
>>
>> The code is the documentation I fear.
>>
>> I'll see if I get permission to get our updated code contributed.
>> We added command line fu and using it to import large TSVs.
>>
>> > Do i need the HDFS to store my storage-conf.xml?
>>
>> Why HDFS?
>>
>> The machine running the bulk loader joins the cassandra ring kind of
>> like a temporary node.
>> So you will need the storage-conf.xml on that machine.
>>
>> cheers
>> --
>> Torsten
>
>
>
> --
> Thanks,
> Mubarak Seyed.
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: CassandraBulkLoader

Posted by Mubarak Seyed <mu...@gmail.com>.
Thanks Torsten.

Jonathan's blog on Fact Vs Fiction says that

Fact: It has always been straightforward to send the output of Hadoop jobs
to Cassandra, and Facebook, Digg, and others have been using Hadoop like
this as a Cassandra bulk-loader for over a year.

Can anyone from Facebook or Digg share details on how to use the Cassandra
BulkLoader?

I could see some details from Arin's presentation on Cassandra @ Digg about
data load from MySQL -> Hadoop -> Cassandra.

Can someone please help me?

Thanks,
Mubarak

On Tue, Jul 13, 2010 at 1:27 AM, Torsten Curdt <tc...@vafer.org> wrote:

> On Tue, Jul 13, 2010 at 04:35, Mubarak Seyed <mu...@gmail.com> wrote:
> > Where can i find the documentation for BinaryMemTable (btm_example in contrib)
> > to use CassandraBulkLoader? What is the input to be supplied to CassandraBulkLoader?
> > How to form the input data and what is the format of an input data?
>
> The code is the documentation I fear.
>
> I'll see if I get permission to get our updated code contributed.
> We added command line fu and using it to import large TSVs.
>
> > Do i need the HDFS to store my storage-conf.xml?
>
> Why HDFS?
>
> The machine running the bulk loader joins the cassandra ring kind of
> like a temporary node.
> So you will need the storage-conf.xml on that machine.
>
> cheers
> --
> Torsten
>



-- 
Thanks,
Mubarak Seyed.

Re: CassandraBulkLoader

Posted by Torsten Curdt <tc...@vafer.org>.
On Tue, Jul 13, 2010 at 04:35, Mubarak Seyed <mu...@gmail.com> wrote:
> Where can i find the documentation for BinaryMemTable (btm_example in contrib)
> to use CassandraBulkLoader? What is the input to be supplied to CassandraBulkLoader?
> How to form the input data and what is the format of an input data?

The code is the documentation I fear.

I'll see if I get permission to get our updated code contributed.
We added command line fu and are using it to import large TSVs.

> Do i need the HDFS to store my storage-conf.xml?

Why HDFS?

The machine running the bulk loader joins the cassandra ring kind of
like a temporary node.
So you will need the storage-conf.xml on that machine.
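
In practice this means the loader's JVM has to be pointed at a local directory containing storage-conf.xml before any Cassandra class reads its configuration. A minimal sketch, assuming the 0.6-era `storage-config` system property is what names that directory (worth verifying against your Cassandra version); `LoaderConfig` and `useConfigDir` are hypothetical names.

```java
import java.io.File;

/** Points the JVM at a local config directory; names here are illustrative. */
final class LoaderConfig {
    // Assumed property name (0.6-era Cassandra); verify for your version.
    static final String CONFIG_PROPERTY = "storage-config";

    /** Checks that dir contains storage-conf.xml, then sets the property. */
    static File useConfigDir(File dir) {
        File conf = new File(dir, "storage-conf.xml");
        if (!conf.isFile()) {
            throw new IllegalStateException("missing " + conf.getAbsolutePath());
        }
        System.setProperty(CONFIG_PROPERTY, dir.getAbsolutePath());
        return conf;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: LoaderConfig <config-dir>");
            return;
        }
        File conf = useConfigDir(new File(args[0]));
        System.out.println("using " + conf);
    }
}
```

Failing fast when the file is missing gives a clearer error than letting the client start up with no configuration.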

cheers
--
Torsten