Posted to user@hbase.apache.org by amit jaiswal <am...@yahoo.com> on 2010/11/30 05:49:12 UTC

Current status of HBasene?

Hi,

I am trying to explore HBasene for using HBase as a backend for a Lucene
index store, but it seems that the current code on GitHub is not in a
working state, and there is no active development either
(https://github.com/akkumar/hbasene/).

Can somebody tell me its current status? I didn't get any response on the
HBasene mailing list.

-regards
Amit

Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by Kevin Fox <Ke...@pnl.gov>.
As I understand it, and please correct me if I'm wrong, a map/reduce job
has an instance of a FileSystem object on either side: one that the data
is read out of on the map side, and one that the data is fed into on the
reduce side.

Can't you run the MapReduce job on the storage cluster that stores the
archival data, feeding the map side with
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/RawLocalFileSystem.html
from the mounted, POSIX parallel file system,
and feeding the output into the Hadoop cluster on the reduce side? That
would mean the network in the middle would only see the reduced data set
cross the wire, and you could parallelize the data reduction as close to
the archive as possible.
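
To make that concrete, here is a minimal, untested sketch of such a job.
The mount point /archive, the namenode address hdfs://namenode:8020, and
the paths are all made-up placeholders, and the default identity
mapper/reducer stand in for the real reduction logic.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReduceOnArchiveJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "reduce-on-archive");
        job.setJarByClass(ReduceOnArchiveJob.class);
        // Map input is read straight off the mounted POSIX file system
        // (the local file system), so the read side never touches HDFS.
        FileInputFormat.addInputPath(job, new Path("file:///archive/raw"));
        // Reduce output is written into the remote Hadoop cluster, so only
        // the reduced data set crosses the wire.
        FileOutputFormat.setOutputPath(job,
            new Path("hdfs://namenode:8020/ingest/reduced"));
        // The default (identity) mapper and reducer are placeholders for
        // the real data-reduction logic.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }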

Thanks,
Kevin

On Tue, 2010-12-28 at 14:04 -0800, Taylor, Ronald C wrote:
> Folks,
> 
> We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with Hbase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So - we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB range. 
> 
> Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would work fastest. So my idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program. 
> 
> Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <source> <dest>") in parallel with all the other Mappers, with the Mappers operating on all the nodes of the cluster, spreading out the file upload across the nodes.
> 
> Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.
> 
>   - Ron Taylor
> 
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
> 
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: ronald.taylor@pnl.gov
> 
> 



Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by Patrick Angeles <pa...@cloudera.com>.
To add to what Ted says,

"hadoop fs -copyFromLocal" assumes that the file is present locally on the
datanode. That means the file would have to have been transfered to the
node, copied to local disk, and only after that is it written to HDFS, so
there's an extra trip to disk that you could have avoided.

You could avoid that by doing a direct pull from a MapReduce task that
writes directly to HDFS. But, as Ted mentions, that is not as efficient as
a source-based push and is also more complex.
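
A rough, untested sketch of that direct-write path is below. The namenode
address and paths are invented, and the source here is just a local file
standing in for whatever stream the data actually arrives on.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class DirectHdfsWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // The source stream; a plain local file here, but it could just as
        // well be a socket or HTTP stream from the archive.
        InputStream in = new FileInputStream("/archive/raw/sample.dat");
        // Create the HDFS file and stream the bytes straight across, so
        // nothing is staged on a datanode's local disk first.
        FSDataOutputStream out = hdfs.create(new Path("/ingest/sample.dat"));
        IOUtils.copyBytes(in, out, 65536, true); // 64 KB buffer; closes both streams
      }
    }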

On Tue, Dec 28, 2010 at 6:07 PM, Ted Dunning <td...@maprtech.com> wrote:

> if the data is coming off of a single machine then simply running multiple
> threads on that machine spraying the data into the cluster is likely to be
> faster than a map-reduce program.  The reason is that you can run the
> spraying process continuously and can tune it to carefully saturate your
> outbound link toward the cluster.  With a map-reduce program it will be very
> easy to flatten the link.
>
> Another issue is that it is easy to push data to the cluster from a local
> disk rather than to pull it from nodes in the cluster because most network
> file protocols aren't as efficient as you might like.
>
>
> On Tue, Dec 28, 2010 at 2:47 PM, Taylor, Ronald C <ro...@pnl.gov>wrote:
>
>> 2) some way of parallelizing the reads
>>
>> So - I will check into network hardware, in regard to (1). But for (2), is
>> the MapReduce method that I was think of, a way that uses "hadoop fs
>> -copyFromLocal" in each Mapper, a good way to go at the destination end? I
>> believe that you were saying that it is indeed OK, but I want to
>> double-check, since this will be a critical piece of our work flow.
>>
>
>

Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by Ted Dunning <td...@maprtech.com>.
if the data is coming off of a single machine then simply running multiple
threads on that machine spraying the data into the cluster is likely to be
faster than a map-reduce program.  The reason is that you can run the
spraying process continuously and can tune it to carefully saturate your
outbound link toward the cluster.  With a map-reduce program it will be very
easy to flatten the link.

Another issue is that it is easy to push data to the cluster from a local
disk rather than to pull it from nodes in the cluster because most network
file protocols aren't as efficient as you might like.
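
For illustration only, here is a bare-bones, untested sketch of such a
multi-threaded pusher. The list file, the target directory, and the
namenode address are placeholders, and the fixed pool size of 8 is simply
the knob you would tune against your outbound link.

    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSprayer {
      public static void main(String[] args) throws Exception {
        // One absolute local path per line (the list file name is made up).
        List<String> localFiles = Files.readAllLines(
            Paths.get("/tmp/upload-list.txt"), StandardCharsets.UTF_8);
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // The pool size is the knob to tune so the outbound link stays busy
        // without being flooded.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (String local : localFiles) {
          pool.submit(() -> {
            try {
              // Each task pushes one file into HDFS (destination is a placeholder).
              hdfs.copyFromLocalFile(new Path(local),
                  new Path("/ingest/" + new Path(local).getName()));
            } catch (Exception e) {
              e.printStackTrace();
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
      }
    }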

On Tue, Dec 28, 2010 at 2:47 PM, Taylor, Ronald C <ro...@pnl.gov>wrote:

> 2) some way of parallelizing the reads
>
> So - I will check into network hardware, in regard to (1). But for (2), is
> the MapReduce method that I was think of, a way that uses "hadoop fs
> -copyFromLocal" in each Mapper, a good way to go at the destination end? I
> believe that you were saying that it is indeed OK, but I want to
> double-check, since this will be a critical piece of our work flow.
>

RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Taylor, Ronald C" <ro...@pnl.gov>.
Patrick,

Thanks for the info (and quick reply). I want to make sure I understand: Presuming that the data files are coming off a set of disk drives attached to a single Linux file server, you say I need two things to optimize the transfer:

1)  a fat network pipe

2) some way of parallelizing the reads

So - I will check into network hardware, in regard to (1). But for (2), is the MapReduce method that I was thinking of, one that uses "hadoop fs -copyFromLocal" in each Mapper, a good way to go at the destination end? I believe that you were saying that it is indeed OK, but I want to double-check, since this will be a critical piece of our workflow.
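
For what it's worth, a minimal, untested sketch of that kind of Mapper is
below, using the FileSystem API as the programmatic equivalent of "hadoop
fs -copyFromLocal". The /ingest destination is a placeholder, it assumes
each listed path is readable from whichever node the map task lands on,
and it is meant for a map-only job fed by something like NLineInputFormat
so that each task handles only a few paths.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input: a text file listing one absolute local path per line. Each map()
    // call pushes one listed file into HDFS under /ingest (a placeholder path).
    public class UploadMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
      private FileSystem hdfs;

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        hdfs = FileSystem.get(context.getConfiguration());
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        Path local = new Path(value.toString().trim());
        Path dest = new Path("/ingest", local.getName());
        // Programmatic equivalent of "hadoop fs -copyFromLocal <source> <dest>".
        hdfs.copyFromLocalFile(local, dest);
        context.write(new Text(local + " -> " + dest), NullWritable.get());
      }
    }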

 Ron


________________________________
From: patrickangeles@gmail.com [mailto:patrickangeles@gmail.com] On Behalf Of Patrick Angeles
Sent: Tuesday, December 28, 2010 2:27 PM
To: general@hadoop.apache.org
Cc: user@hbase.apache.org; Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
Subject: Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Ron,

While MapReduce can help to parallelize the load effort, your likely bottleneck is the source system (where the files come from). If the files are coming from a single server, then parallelizing the load won't gain you much past a certain point. You have to figure in how fast you can read the file(s) off disk(s) and push the bits through your network and finally onto HDFS.

The best scenario is if you can parallelize the reads and have a fat network pipe (10GbE or more) going into your Hadoop cluster.

Regards,

- Patrick

On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C <ro...@pnl.gov> wrote:

Folks,

We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with Hbase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So - we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB range.

Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would work fastest. So my idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program.

Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <source> <dest>") in parallel with all the other Mappers, with the Mappers operating on all the nodes of the cluster, spreading out the file upload across the nodes.

Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.

 - Ron Taylor

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group

Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA  99352 USA
Office:  509-372-6568
Email: ronald.taylor@pnl.gov




RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Hiller, Dean (Contractor)" <de...@broadridge.com>.
Thanks for the info, missed that at the bottom of that page.
Dean

-----Original Message-----
From: Fox, Kevin M [mailto:kevin.fox@pnl.gov] 
Sent: Wednesday, December 29, 2010 2:21 PM
To: Hiller, Dean (Contractor); general@hadoop.apache.org; Patrick
Angeles
Cc: user@hbase.apache.org; Brown, David M JR
Subject: RE: What is the fastest way to get a large amount of data into
the Hadoop HDFS file system (or Hbase)?

http://wiki.apache.org/hadoop/MountableHDFS

Under Known Issues:
2. Writes are approximately 33% slower than the DFSClient. TBD how to
optimize this. see: HADOOP-3805  - try using -obig_writes if on a
>2.6.26 kernel, should perform much better since bigger writes implies
less context switching.

3. Reads are ~20-30% slower even with the read buffering.

Sounds like just pushing it in would be better.

Thanks,
Kevin
-----Original Message-----
From: Hiller, Dean (Contractor) [mailto:dean.hiller@broadridge.com] 
Sent: Wednesday, December 29, 2010 1:16 PM
To: general@hadoop.apache.org; Fox, Kevin M; Patrick Angeles
Cc: user@hbase.apache.org; Brown, David M JR
Subject: RE: What is the fastest way to get a large amount of data into
the Hadoop HDFS file system (or Hbase)?

I wonder if having linux mount hdfs would help here so as people put the
file on your linux /hdfs directory, it was actually writing to hdfs and
not linux ;) (yeah, you still have that one machine bottle neck as the
files come in unless that can be clustered too somehow).  Just google
mounting hdfs from linux....something that sounds pretty cool that we
may be using later.  

Later,
Dean

-----Original Message-----
From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov] 
Sent: Tuesday, December 28, 2010 5:05 PM
To: Fox, Kevin M; Patrick Angeles
Cc: general@hadoop.apache.org; user@hbase.apache.org; Brown, David M JR;
Taylor, Ronald C
Subject: RE: What is the fastest way to get a large amount of data into
the Hadoop HDFS file system (or Hbase)?

 
Hi Kevin,

So - from what Patrick and Ted are saying it sounds like we want the
best way to parallelize a source-based push, rather than doing a
parallelized pull through a MapReduce program. And I see that what you
ask about below is on parallelizing a push, so we are on the same page.
 Ron

-----Original Message-----
From: Fox, Kevin M 
Sent: Tuesday, December 28, 2010 3:39 PM
To: Patrick Angeles
Cc: general@hadoop.apache.org; user@hbase.apache.org; Taylor, Ronald C;
Brown, David M JR
Subject: Re: What is the fastest way to get a large amount of data into
the Hadoop HDFS file system (or Hbase)?

On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> Ron,
> 
> 
> While MapReduce can help to parallelize the load effort, your likely 
> bottleneck is the source system (where the files come from). If the 
> files are coming from a single server, then parallelizing the load 
> won't gain you much past a certain point. You have to figure in how 
> fast you can read the file(s) off disk(s) and push the bits through 
> your network and finally onto HDFS.
> 
> 
> The best scenario is if you can parallelize the reads and have a fat 
> network pipe (10GbE or more) going into your Hadoop cluster.


We have a way to parallelize a push from the archive storage cluster to
the hadoop storage cluster.

Is there a way to target a particular storage node with a push into the
hadoop file system? The hadoop cluster nodes are 1gig attached to its
core switch and we have a 10 gig uplink to the core from the storage
archive. Say, we have 4 nodes in each storage cluster (we have more,
just a simplified example):

a0 --\                                /-- h0
a1 --+                                +-- h1
a2 --+ (A switch) -10gige- (h switch) +-- h2
a3 --/                                \-- h3

I want to be able to have a0 talk to h0 and not have h0 decide the data
belongs on h3, slowing down a3's ability to write data into h3, greatly
reducing bandwidth.

Thanks,
Kevin

> 
> 
> Regards,
> 
> 
> - Patrick
> 
> On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C 
> <ro...@pnl.gov> wrote:
>         
>         Folks,
>         
>         We plan on uploading large amounts of data on a regular basis
>         onto a Hadoop cluster, with Hbase operating on top of Hadoop.
>         Figure eventually on the order of multiple terabytes per week.
>         So - we are concerned about doing the uploads themselves as
>         fast as possible from our native Linux file system into HDFS.
>         Figure files will be in, roughly, the 1 to 300 GB range.
>         
>         Off the top of my head, I'm thinking that doing this in
>         parallel using a Java MapReduce program would work fastest. So
>         my idea would be to have a file listing all the data files
>         (full paths) to be uploaded, one per line, and then use that
>         listing file as input to a MapReduce program.
>         
>         Each Mapper would then upload one of the data files (using
>         "hadoop fs -copyFromLocal <source> <dest>") in parallel with
>         all the other Mappers, with the Mappers operating on all the
>         nodes of the cluster, spreading out the file upload across the
>         nodes.
>         
>         Does that sound like a wise way to approach this? Are there
>         better methods? Anything else out there for doing automated
>         upload in parallel? We would very much appreciate advice in
>         this area, since we believe upload speed might become a
>         bottleneck.
>         
>          - Ron Taylor
>         
>         ___________________________________________
>         Ronald Taylor, Ph.D.
>         Computational Biology & Bioinformatics Group
>         
>         Pacific Northwest National Laboratory
>         902 Battelle Boulevard
>         P.O. Box 999, Mail Stop J4-33
>         Richland, WA  99352 USA
>         Office:  509-372-6568
>         Email: ronald.taylor@pnl.gov
>         
>         
> 
> 


RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Fox, Kevin M" <ke...@pnl.gov>.
http://wiki.apache.org/hadoop/MountableHDFS

Under Known Issues:
2. Writes are approximately 33% slower than the DFSClient. TBD how to optimize this. see: HADOOP-3805  - try using -obig_writes if on a >2.6.26 kernel, should perform much better since bigger writes implies less context switching.

3. Reads are ~20-30% slower even with the read buffering.

Sounds like just pushing it in would be better.

Thanks,
Kevin
-----Original Message-----
From: Hiller, Dean (Contractor) [mailto:dean.hiller@broadridge.com] 
Sent: Wednesday, December 29, 2010 1:16 PM
To: general@hadoop.apache.org; Fox, Kevin M; Patrick Angeles
Cc: user@hbase.apache.org; Brown, David M JR
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

I wonder if having linux mount hdfs would help here so as people put the
file on your linux /hdfs directory, it was actually writing to hdfs and
not linux ;) (yeah, you still have that one machine bottle neck as the
files come in unless that can be clustered too somehow).  Just google
mounting hdfs from linux....something that sounds pretty cool that we
may be using later.  

Later,
Dean

-----Original Message-----
From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov] 
Sent: Tuesday, December 28, 2010 5:05 PM
To: Fox, Kevin M; Patrick Angeles
Cc: general@hadoop.apache.org; user@hbase.apache.org; Brown, David M JR;
Taylor, Ronald C
Subject: RE: What is the fastest way to get a large amount of data into
the Hadoop HDFS file system (or Hbase)?

 
Hi Kevin,

So - from what Patrick and Ted are saying it sounds like we want the
best way to parallelize a source-based push, rather than doing a
parallelized pull through a MapReduce program. And I see that what you
ask about below is on parallelizing a push, so we are on the same page.
 Ron

-----Original Message-----
From: Fox, Kevin M 
Sent: Tuesday, December 28, 2010 3:39 PM
To: Patrick Angeles
Cc: general@hadoop.apache.org; user@hbase.apache.org; Taylor, Ronald C;
Brown, David M JR
Subject: Re: What is the fastest way to get a large amount of data into
the Hadoop HDFS file system (or Hbase)?

On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> Ron,
> 
> 
> While MapReduce can help to parallelize the load effort, your likely 
> bottleneck is the source system (where the files come from). If the 
> files are coming from a single server, then parallelizing the load 
> won't gain you much past a certain point. You have to figure in how 
> fast you can read the file(s) off disk(s) and push the bits through 
> your network and finally onto HDFS.
> 
> 
> The best scenario is if you can parallelize the reads and have a fat 
> network pipe (10GbE or more) going into your Hadoop cluster.


We have a way to parallelize a push from the archive storage cluster to
the hadoop storage cluster.

Is there a way to target a particular storage node with a push into the
hadoop file system? The hadoop cluster nodes are 1gig attached to its
core switch and we have a 10 gig uplink to the core from the storage
archive. Say, we have 4 nodes in each storage cluster (we have more,
just a simplified example):

a0 --\                                /-- h0
a1 --+                                +-- h1
a2 --+ (A switch) -10gige- (h switch) +-- h2
a3 --/                                \-- h3

I want to be able to have a0 talk to h0 and not have h0 decide the data
belongs on h3, slowing down a3's ability to write data into h3, greatly
reducing bandwidth.

Thanks,
Kevin

> 
> 
> Regards,
> 
> 
> - Patrick
> 
> On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C 
> <ro...@pnl.gov> wrote:
>         
>         Folks,
>         
>         We plan on uploading large amounts of data on a regular basis
>         onto a Hadoop cluster, with Hbase operating on top of Hadoop.
>         Figure eventually on the order of multiple terabytes per week.
>         So - we are concerned about doing the uploads themselves as
>         fast as possible from our native Linux file system into HDFS.
>         Figure files will be in, roughly, the 1 to 300 GB range.
>         
>         Off the top of my head, I'm thinking that doing this in
>         parallel using a Java MapReduce program would work fastest. So
>         my idea would be to have a file listing all the data files
>         (full paths) to be uploaded, one per line, and then use that
>         listing file as input to a MapReduce program.
>         
>         Each Mapper would then upload one of the data files (using
>         "hadoop fs -copyFromLocal <source> <dest>") in parallel with
>         all the other Mappers, with the Mappers operating on all the
>         nodes of the cluster, spreading out the file upload across the
>         nodes.
>         
>         Does that sound like a wise way to approach this? Are there
>         better methods? Anything else out there for doing automated
>         upload in parallel? We would very much appreciate advice in
>         this area, since we believe upload speed might become a
>         bottleneck.
>         
>          - Ron Taylor
>         
>         ___________________________________________
>         Ronald Taylor, Ph.D.
>         Computational Biology & Bioinformatics Group
>         
>         Pacific Northwest National Laboratory
>         902 Battelle Boulevard
>         P.O. Box 999, Mail Stop J4-33
>         Richland, WA  99352 USA
>         Office:  509-372-6568
>         Email: ronald.taylor@pnl.gov
>         
>         
> 
> 





Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by Ted Dunning <td...@maprtech.com>.
The problem there is that HDFS isn't a first-class file system.  That means
that the nice and easy ways of mounting will lead to problems (notably NFS,
which maintains no state and therefore requires random write capabilities).

On Wed, Dec 29, 2010 at 1:16 PM, Hiller, Dean (Contractor) <
dean.hiller@broadridge.com> wrote:

> I wonder if having linux mount hdfs would help here so as people put the
> file on your linux /hdfs directory, it was actually writing to hdfs and
> not linux ;) (yeah, you still have that one machine bottle neck as the
> files come in unless that can be clustered too somehow).  Just google
> mounting hdfs from linux....something that sounds pretty cool that we
> may be using later.
>
> Later,
> Dean
>
> -----Original Message-----
> From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov]
> Sent: Tuesday, December 28, 2010 5:05 PM
> To: Fox, Kevin M; Patrick Angeles
> Cc: general@hadoop.apache.org; user@hbase.apache.org; Brown, David M JR;
> Taylor, Ronald C
> Subject: RE: What is the fastest way to get a large amount of data into
> the Hadoop HDFS file system (or Hbase)?
>
>
> Hi Kevin,
>
> So - from what Patrick and Ted are saying it sounds like we want the
> best way to parallelize a source-based push, rather than doing a
> parallelized pull through a MapReduce program. And I see that what you
> ask about below is on parallelizing a push, so we are on the same page.
>  Ron
>
> -----Original Message-----
> From: Fox, Kevin M
> Sent: Tuesday, December 28, 2010 3:39 PM
> To: Patrick Angeles
> Cc: general@hadoop.apache.org; user@hbase.apache.org; Taylor, Ronald C;
> Brown, David M JR
> Subject: Re: What is the fastest way to get a large amount of data into
> the Hadoop HDFS file system (or Hbase)?
>
> On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> > Ron,
> >
> >
> > While MapReduce can help to parallelize the load effort, your likely
> > bottleneck is the source system (where the files come from). If the
> > files are coming from a single server, then parallelizing the load
> > won't gain you much past a certain point. You have to figure in how
> > fast you can read the file(s) off disk(s) and push the bits through
> > your network and finally onto HDFS.
> >
> >
> > The best scenario is if you can parallelize the reads and have a fat
> > network pipe (10GbE or more) going into your Hadoop cluster.
>
>
> We have a way to parallelize a push from the archive storage cluster to
> the hadoop storage cluster.
>
> Is there a way to target a particular storage node with a push into the
> hadoop file system? The hadoop cluster nodes are 1gig attached to its
> core switch and we have a 10 gig uplink to the core from the storage
> archive. Say, we have 4 nodes in each storage cluster (we have more,
> just a simplified example):
>
> a0 --\                                /-- h0
> a1 --+                                +-- h1
> a2 --+ (A switch) -10gige- (h switch) +-- h2
> a3 --/                                \-- h3
>
> I want to be able to have a0 talk to h0 and not have h0 decide the data
> belongs on h3, slowing down a3's ability to write data into h3, greatly
> reducing bandwidth.
>
> Thanks,
> Kevin
>
> >
> >
> > Regards,
> >
> >
> > - Patrick
> >
> > On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C
> > <ro...@pnl.gov> wrote:
> >
> >         Folks,
> >
> >         We plan on uploading large amounts of data on a regular basis
> >         onto a Hadoop cluster, with Hbase operating on top of Hadoop.
> >         Figure eventually on the order of multiple terabytes per week.
> >         So - we are concerned about doing the uploads themselves as
> >         fast as possible from our native Linux file system into HDFS.
> >         Figure files will be in, roughly, the 1 to 300 GB range.
> >
> >         Off the top of my head, I'm thinking that doing this in
> >         parallel using a Java MapReduce program would work fastest. So
> >         my idea would be to have a file listing all the data files
> >         (full paths) to be uploaded, one per line, and then use that
> >         listing file as input to a MapReduce program.
> >
> >         Each Mapper would then upload one of the data files (using
> >         "hadoop fs -copyFromLocal <source> <dest>") in parallel with
> >         all the other Mappers, with the Mappers operating on all the
> >         nodes of the cluster, spreading out the file upload across the
> >         nodes.
> >
> >         Does that sound like a wise way to approach this? Are there
> >         better methods? Anything else out there for doing automated
> >         upload in parallel? We would very much appreciate advice in
> >         this area, since we believe upload speed might become a
> >         bottleneck.
> >
> >          - Ron Taylor
> >
> >         ___________________________________________
> >         Ronald Taylor, Ph.D.
> >         Computational Biology & Bioinformatics Group
> >
> >         Pacific Northwest National Laboratory
> >         902 Battelle Boulevard
> >         P.O. Box 999, Mail Stop J4-33
> >         Richland, WA  99352 USA
> >         Office:  509-372-6568
> >         Email: ronald.taylor@pnl.gov
> >
> >
> >
> >
>

RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Hiller, Dean (Contractor)" <de...@broadridge.com>.
I wonder if having Linux mount HDFS would help here, so that as people put
files in your Linux /hdfs directory they are actually being written to HDFS
and not to the local Linux file system ;) (yeah, you still have that one
machine bottleneck as the files come in, unless that can be clustered too
somehow).  Just google mounting HDFS from Linux... something that sounds
pretty cool that we may be using later.

Later,
Dean

-----Original Message-----
From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov] 
Sent: Tuesday, December 28, 2010 5:05 PM
To: Fox, Kevin M; Patrick Angeles
Cc: general@hadoop.apache.org; user@hbase.apache.org; Brown, David M JR;
Taylor, Ronald C
Subject: RE: What is the fastest way to get a large amount of data into
the Hadoop HDFS file system (or Hbase)?

 
Hi Kevin,

So - from what Patrick and Ted are saying it sounds like we want the
best way to parallelize a source-based push, rather than doing a
parallelized pull through a MapReduce program. And I see that what you
ask about below is on parallelizing a push, so we are on the same page.
 Ron

-----Original Message-----
From: Fox, Kevin M 
Sent: Tuesday, December 28, 2010 3:39 PM
To: Patrick Angeles
Cc: general@hadoop.apache.org; user@hbase.apache.org; Taylor, Ronald C;
Brown, David M JR
Subject: Re: What is the fastest way to get a large amount of data into
the Hadoop HDFS file system (or Hbase)?

On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> Ron,
> 
> 
> While MapReduce can help to parallelize the load effort, your likely 
> bottleneck is the source system (where the files come from). If the 
> files are coming from a single server, then parallelizing the load 
> won't gain you much past a certain point. You have to figure in how 
> fast you can read the file(s) off disk(s) and push the bits through 
> your network and finally onto HDFS.
> 
> 
> The best scenario is if you can parallelize the reads and have a fat 
> network pipe (10GbE or more) going into your Hadoop cluster.


We have a way to parallelize a push from the archive storage cluster to
the hadoop storage cluster.

Is there a way to target a particular storage node with a push into the
hadoop file system? The hadoop cluster nodes are 1gig attached to its
core switch and we have a 10 gig uplink to the core from the storage
archive. Say, we have 4 nodes in each storage cluster (we have more,
just a simplified example):

a0 --\                                /-- h0
a1 --+                                +-- h1
a2 --+ (A switch) -10gige- (h switch) +-- h2
a3 --/                                \-- h3

I want to be able to have a0 talk to h0 and not have h0 decide the data
belongs on h3, slowing down a3's ability to write data into h3, greatly
reducing bandwidth.

Thanks,
Kevin

> 
> 
> Regards,
> 
> 
> - Patrick
> 
> On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C 
> <ro...@pnl.gov> wrote:
>         
>         Folks,
>         
>         We plan on uploading large amounts of data on a regular basis
>         onto a Hadoop cluster, with Hbase operating on top of Hadoop.
>         Figure eventually on the order of multiple terabytes per week.
>         So - we are concerned about doing the uploads themselves as
>         fast as possible from our native Linux file system into HDFS.
>         Figure files will be in, roughly, the 1 to 300 GB range.
>         
>         Off the top of my head, I'm thinking that doing this in
>         parallel using a Java MapReduce program would work fastest. So
>         my idea would be to have a file listing all the data files
>         (full paths) to be uploaded, one per line, and then use that
>         listing file as input to a MapReduce program.
>         
>         Each Mapper would then upload one of the data files (using
>         "hadoop fs -copyFromLocal <source> <dest>") in parallel with
>         all the other Mappers, with the Mappers operating on all the
>         nodes of the cluster, spreading out the file upload across the
>         nodes.
>         
>         Does that sound like a wise way to approach this? Are there
>         better methods? Anything else out there for doing automated
>         upload in parallel? We would very much appreciate advice in
>         this area, since we believe upload speed might become a
>         bottleneck.
>         
>          - Ron Taylor
>         
>         ___________________________________________
>         Ronald Taylor, Ph.D.
>         Computational Biology & Bioinformatics Group
>         
>         Pacific Northwest National Laboratory
>         902 Battelle Boulevard
>         P.O. Box 999, Mail Stop J4-33
>         Richland, WA  99352 USA
>         Office:  509-372-6568
>         Email: ronald.taylor@pnl.gov
>         
>         
> 
> 




RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by Kevin Fox <Ke...@pnl.gov>.
On Tue, 2010-12-28 at 16:04 -0800, Taylor, Ronald C wrote:
> Hi Kevin,
> 
> So - from what Patrick and Ted are saying it sounds like we want the best way to parallelize a source-based push, rather than doing a parallelized pull through a MapReduce program. And I see that what you ask about below is on parallelizing a push, so we are on the same page.
>  Ron

Hi Ron,

I think there are merits to both approaches, depending on the type of
data it is. If you only ever need a subset of the data for reprocessing,
map/reducing first and then transferring might be better. If you want all
the raw data of a particular set, a push transfer may make more sense.

So we'd need solutions to both:
* parallel map/reduce on the non-HDFS storage cluster, pushing results to
the HDFS reprocessing cluster.
* parallel push of data from the non-HDFS storage cluster into the HDFS
reprocessing cluster.

Thanks,
Kevin

> 
> -----Original Message-----
> From: Fox, Kevin M 
> Sent: Tuesday, December 28, 2010 3:39 PM
> To: Patrick Angeles
> Cc: general@hadoop.apache.org; user@hbase.apache.org; Taylor, Ronald C; Brown, David M JR
> Subject: Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?
> 
> On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> > Ron,
> > 
> > 
> > While MapReduce can help to parallelize the load effort, your likely 
> > bottleneck is the source system (where the files come from). If the 
> > files are coming from a single server, then parallelizing the load 
> > won't gain you much past a certain point. You have to figure in how 
> > fast you can read the file(s) off disk(s) and push the bits through 
> > your network and finally onto HDFS.
> > 
> > 
> > The best scenario is if you can parallelize the reads and have a fat 
> > network pipe (10GbE or more) going into your Hadoop cluster.
> 
> 
> We have a way to parallelize a push from the archive storage cluster to the hadoop storage cluster.
> 
> Is there a way to target a particular storage node with a push into the hadoop file system? The hadoop cluster nodes are 1gig attached to its core switch and we have a 10 gig uplink to the core from the storage archive. Say, we have 4 nodes in each storage cluster (we have more, just a simplified example):
> 
> a0 --\                                /-- h0
> a1 --+                                +-- h1
> a2 --+ (A switch) -10gige- (h switch) +-- h2
> a3 --/                                \-- h3
> 
> I want to be able to have a0 talk to h0 and not have h0 decide the data belongs on h3, slowing down a3's ability to write data into h3, greatly reducing bandwidth.
> 
> Thanks,
> Kevin
> 
> > 
> > 
> > Regards,
> > 
> > 
> > - Patrick
> > 
> > On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C 
> > <ro...@pnl.gov> wrote:
> >         
> >         Folks,
> >         
> >         We plan on uploading large amounts of data on a regular basis
> >         onto a Hadoop cluster, with Hbase operating on top of Hadoop.
> >         Figure eventually on the order of multiple terabytes per week.
> >         So - we are concerned about doing the uploads themselves as
> >         fast as possible from our native Linux file system into HDFS.
> >         Figure files will be in, roughly, the 1 to 300 GB range.
> >         
> >         Off the top of my head, I'm thinking that doing this in
> >         parallel using a Java MapReduce program would work fastest. So
> >         my idea would be to have a file listing all the data files
> >         (full paths) to be uploaded, one per line, and then use that
> >         listing file as input to a MapReduce program.
> >         
> >         Each Mapper would then upload one of the data files (using
> >         "hadoop fs -copyFromLocal <source> <dest>") in parallel with
> >         all the other Mappers, with the Mappers operating on all the
> >         nodes of the cluster, spreading out the file upload across the
> >         nodes.
> >         
> >         Does that sound like a wise way to approach this? Are there
> >         better methods? Anything else out there for doing automated
> >         upload in parallel? We would very much appreciate advice in
> >         this area, since we believe upload speed might become a
> >         bottleneck.
> >         
> >          - Ron Taylor
> >         
> >         ___________________________________________
> >         Ronald Taylor, Ph.D.
> >         Computational Biology & Bioinformatics Group
> >         
> >         Pacific Northwest National Laboratory
> >         902 Battelle Boulevard
> >         P.O. Box 999, Mail Stop J4-33
> >         Richland, WA  99352 USA
> >         Office:  509-372-6568
> >         Email: ronald.taylor@pnl.gov
> >         
> >         
> > 
> > 
> 
> 



RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Taylor, Ronald C" <ro...@pnl.gov>.
 
Hi Kevin,

So - from what Patrick and Ted are saying it sounds like we want the best way to parallelize a source-based push, rather than doing a parallelized pull through a MapReduce program. And I see that what you ask about below is on parallelizing a push, so we are on the same page.
 Ron

-----Original Message-----
From: Fox, Kevin M 
Sent: Tuesday, December 28, 2010 3:39 PM
To: Patrick Angeles
Cc: general@hadoop.apache.org; user@hbase.apache.org; Taylor, Ronald C; Brown, David M JR
Subject: Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> Ron,
> 
> 
> While MapReduce can help to parallelize the load effort, your likely 
> bottleneck is the source system (where the files come from). If the 
> files are coming from a single server, then parallelizing the load 
> won't gain you much past a certain point. You have to figure in how 
> fast you can read the file(s) off disk(s) and push the bits through 
> your network and finally onto HDFS.
> 
> 
> The best scenario is if you can parallelize the reads and have a fat 
> network pipe (10GbE or more) going into your Hadoop cluster.


We have a way to parallelize a push from the archive storage cluster to the hadoop storage cluster.

Is there a way to target a particular storage node with a push into the hadoop file system? The hadoop cluster nodes are 1gig attached to its core switch and we have a 10 gig uplink to the core from the storage archive. Say, we have 4 nodes in each storage cluster (we have more, just a simplified example):

a0 --\                                /-- h0
a1 --+                                +-- h1
a2 --+ (A switch) -10gige- (h switch) +-- h2
a3 --/                                \-- h3

I want to be able to have a0 talk to h0 and not have h0 decide the data belongs on h3, slowing down a3's ability to write data into h3, greatly reducing bandwidth.

Thanks,
Kevin

> 
> 
> Regards,
> 
> 
> - Patrick
> 
> On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C 
> <ro...@pnl.gov> wrote:
>         
>         Folks,
>         
>         We plan on uploading large amounts of data on a regular basis
>         onto a Hadoop cluster, with Hbase operating on top of Hadoop.
>         Figure eventually on the order of multiple terabytes per week.
>         So - we are concerned about doing the uploads themselves as
>         fast as possible from our native Linux file system into HDFS.
>         Figure files will be in, roughly, the 1 to 300 GB range.
>         
>         Off the top of my head, I'm thinking that doing this in
>         parallel using a Java MapReduce program would work fastest. So
>         my idea would be to have a file listing all the data files
>         (full paths) to be uploaded, one per line, and then use that
>         listing file as input to a MapReduce program.
>         
>         Each Mapper would then upload one of the data files (using
>         "hadoop fs -copyFromLocal <source> <dest>") in parallel with
>         all the other Mappers, with the Mappers operating on all the
>         nodes of the cluster, spreading out the file upload across the
>         nodes.
>         
>         Does that sound like a wise way to approach this? Are there
>         better methods? Anything else out there for doing automated
>         upload in parallel? We would very much appreciate advice in
>         this area, since we believe upload speed might become a
>         bottleneck.
>         
>          - Ron Taylor
>         
>         ___________________________________________
>         Ronald Taylor, Ph.D.
>         Computational Biology & Bioinformatics Group
>         
>         Pacific Northwest National Laboratory
>         902 Battelle Boulevard
>         P.O. Box 999, Mail Stop J4-33
>         Richland, WA  99352 USA
>         Office:  509-372-6568
>         Email: ronald.taylor@pnl.gov
>         
>         
> 
> 



Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by Kevin Fox <Ke...@pnl.gov>.
On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> Ron,
> 
> 
> While MapReduce can help to parallelize the load effort, your likely
> bottleneck is the source system (where the files come from). If the
> files are coming from a single server, then parallelizing the load
> won't gain you much past a certain point. You have to figure in how
> fast you can read the file(s) off disk(s) and push the bits through
> your network and finally onto HDFS.
> 
> 
> The best scenario is if you can parallelize the reads and have a fat
> network pipe (10GbE or more) going into your Hadoop cluster.


We have a way to parallelize a push from the archive storage cluster to
the hadoop storage cluster.

Is there a way to target a particular storage node with a push into the
hadoop file system? The hadoop cluster nodes are 1gig attached to its
core switch and we have a 10 gig uplink to the core from the storage
archive. Say, we have 4 nodes in each storage cluster (we have more,
just a simplified example):

a0 --\                                /-- h0
a1 --+                                +-- h1
a2 --+ (A switch) -10gige- (h switch) +-- h2
a3 --/                                \-- h3

I want to be able to have a0 talk to h0 and not have h0 decide the data
belongs on h3, slowing down a3's ability to write data into h3, greatly
reducing bandwidth.

Thanks,
Kevin

> 
> 
> Regards,
> 
> 
> - Patrick
> 
> On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C
> <ro...@pnl.gov> wrote:
>         
>         Folks,
>         
>         We plan on uploading large amounts of data on a regular basis
>         onto a Hadoop cluster, with Hbase operating on top of Hadoop.
>         Figure eventually on the order of multiple terabytes per week.
>         So - we are concerned about doing the uploads themselves as
>         fast as possible from our native Linux file system into HDFS.
>         Figure files will be in, roughly, the 1 to 300 GB range.
>         
>         Off the top of my head, I'm thinking that doing this in
>         parallel using a Java MapReduce program would work fastest. So
>         my idea would be to have a file listing all the data files
>         (full paths) to be uploaded, one per line, and then use that
>         listing file as input to a MapReduce program.
>         
>         Each Mapper would then upload one of the data files (using
>         "hadoop fs -copyFromLocal <source> <dest>") in parallel with
>         all the other Mappers, with the Mappers operating on all the
>         nodes of the cluster, spreading out the file upload across the
>         nodes.
>         
>         Does that sound like a wise way to approach this? Are there
>         better methods? Anything else out there for doing automated
>         upload in parallel? We would very much appreciate advice in
>         this area, since we believe upload speed might become a
>         bottleneck.
>         
>          - Ron Taylor
>         
>         ___________________________________________
>         Ronald Taylor, Ph.D.
>         Computational Biology & Bioinformatics Group
>         
>         Pacific Northwest National Laboratory
>         902 Battelle Boulevard
>         P.O. Box 999, Mail Stop J4-33
>         Richland, WA  99352 USA
>         Office:  509-372-6568
>         Email: ronald.taylor@pnl.gov
>         
>         
> 
> 



RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Taylor, Ronald C" <ro...@pnl.gov>.
Patrick,

Thanks for the info (and quick reply). I want to make sure I understand: Presuming that the data files are coming off a set of disk drives attached to a single Linux file server, you say I need two things to optimize the transfer:

1)  a fat network pipe

2) some way of parallelizing the reads

So - I will check into network hardware, in regard to (1). But for (2), is the MapReduce method that I was thinking of, one that uses "hadoop fs -copyFromLocal" in each Mapper, a good way to go at the destination end? I believe that you were saying that it is indeed OK, but I want to double-check, since this will be a critical piece of our workflow.
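
(For concreteness, a minimal and untested sketch of the kind of per-file Mapper being described, assuming the job input is a text file with one absolute source path per line and that every task node can see those paths on a shared mount. The class name and the /incoming target directory are placeholders, not anything from this thread.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Map-only task: each input line names one file to copy into HDFS. */
public class HdfsUploadMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String srcPath = line.toString().trim();
    if (srcPath.isEmpty()) {
      return;
    }
    Configuration conf = context.getConfiguration();
    FileSystem hdfs = FileSystem.get(conf);            // the destination HDFS
    Path src = new Path(srcPath);                      // resolved against the node-local mount
    Path dst = new Path("/incoming", src.getName());   // placeholder target directory
    // Stream the file from the locally visible mount into HDFS;
    // first argument = false means do not delete the source afterwards.
    hdfs.copyFromLocalFile(false, src, dst);
    context.write(new Text(srcPath + " -> " + dst), NullWritable.get());
  }
}

(Shelling out to "hadoop fs -copyFromLocal" from each map task amounts to the same thing; the FileSystem call just avoids launching an extra JVM per file. Note that with the default TextInputFormat a small listing file lands in a single split, so all the copies run serially in one mapper; something like NLineInputFormat, or one listing file per mapper, is needed to actually get one file per map task.)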

 Ron


________________________________
From: patrickangeles@gmail.com [mailto:patrickangeles@gmail.com] On Behalf Of Patrick Angeles
Sent: Tuesday, December 28, 2010 2:27 PM
To: general@hadoop.apache.org
Cc: user@hbase.apache.org; Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
Subject: Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Ron,

While MapReduce can help to parallelize the load effort, your likely bottleneck is the source system (where the files come from). If the files are coming from a single server, then parallelizing the load won't gain you much past a certain point. You have to figure in how fast you can read the file(s) off disk(s) and push the bits through your network and finally onto HDFS.

The best scenario is if you can parallelize the reads and have a fat network pipe (10GbE or more) going into your Hadoop cluster.

Regards,

- Patrick

On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C <ro...@pnl.gov>> wrote:

Folks,

We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with Hbase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So - we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB range.

Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would work fastest. So my idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program.

Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <source> <dest>") in parallel with all the other Mappers, with the Mappers operating on all the nodes of the cluster, spreading out the file upload across the nodes.

Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.

 - Ron Taylor

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group

Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA  99352 USA
Office:  509-372-6568
Email: ronald.taylor@pnl.gov<ma...@pnl.gov>




Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by Patrick Angeles <pa...@cloudera.com>.
Ron,

While MapReduce can help to parallelize the load effort, your likely
bottleneck is the source system (where the files come from). If the files
are coming from a single server, then parallelizing the load won't gain you
much past a certain point. You have to figure in how fast you can read the
file(s) off disk(s) and push the bits through your network and finally onto
HDFS.

The best scenario is if you can parallelize the reads and have a fat network
pipe (10GbE or more) going into your Hadoop cluster.
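
(Back-of-envelope, with assumed numbers rather than anything measured: 10 TB/week averages out to roughly 10^13 bytes / 604,800 s, or about 17 MB/s sustained, which a single well-driven 1 GbE link (around 110-120 MB/s in practice) can absorb. The pain is in the bursts: a single 300 GB file over one 1 GbE link still takes on the order of 45 minutes, so parallelizing mostly buys you many concurrent transfers spread across links and spindles, not a higher weekly average.)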

Regards,

- Patrick

On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C <ro...@pnl.gov>wrote:

>
> Folks,
>
> We plan on uploading large amounts of data on a regular basis onto a Hadoop
> cluster, with Hbase operating on top of Hadoop. Figure eventually on the
> order of multiple terabytes per week. So - we are concerned about doing the
> uploads themselves as fast as possible from our native Linux file system
> into HDFS. Figure files will be in, roughly, the 1 to 300 GB range.
>
> Off the top of my head, I'm thinking that doing this in parallel using a
> Java MapReduce program would work fastest. So my idea would be to have a
> file listing all the data files (full paths) to be uploaded, one per line,
> and then use that listing file as input to a MapReduce program.
>
> Each Mapper would then upload one of the data files (using "hadoop fs
> -copyFromLocal <source> <dest>") in parallel with all the other Mappers,
> with the Mappers operating on all the nodes of the cluster, spreading out
> the file upload across the nodes.
>
> Does that sound like a wise way to approach this? Are there better methods?
> Anything else out there for doing automated upload in parallel? We would
> very much appreciate advice in this area, since we believe upload speed
> might become a bottleneck.
>
>  - Ron Taylor
>
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
>
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: ronald.taylor@pnl.gov
>
>
>

RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Taylor, Ronald C" <ro...@pnl.gov>.
Hi Dave,

Thanks for the suggestions. Glad to hear from a fellow DOE national lab person! 

We are just starting to explore all this here at Pacific Northwest Nat Lab, and what will be going into Hbase and what will be left as files in HDFS is an open question, to be empirically determined over the coming year. It will depend upon what instrument data gets put in, how the users want to analyze the data, what turns out to be practical for future growth and maintenance, etc. My lab colleagues Kevin Fox and David Brown have a lot more experience handling massive amounts of data - they are already handling hundreds of TBs in the archive cluster for EMSL, our national user facility (lots of mass spec, NMR, microscopy, and next gen sequencing machines for biology and chemistry, as you may already know). And they have a much better grip on the hardware and OS side of things. So I imagine you & the list will be hearing directly from them fairly often as questions arise.

 Ron

-----Original Message-----
From: Buttler, David [mailto:buttler1@llnl.gov] 
Sent: Monday, January 03, 2011 12:21 PM
To: user@hbase.apache.org; 'general@hadoop.apache.org'
Cc: Fox, Kevin M; Brown, David M JR
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Hi Ron,
Loading into HDFS and HBase are two different issues.  

HDFS: if you have a large number of files to load from your nfs file system into HDFS it is not clear that parallelizing the load will help.  You have two sources of bottlenecks: the nfs file system and the HDFS file system.  In your parallel example, you will likely saturate your nfs file system first.  If they are actually local files, then loading them via M/R is a non-starter as you have no control over which machine will get a map task.  Unless all of the machines have files in the same directory and you are just going to look in that directory to upload.  Then, it sounds like more of a job for a parallel shell command and less of a map/reduce command.

HBase: So far my strategy has been to get the files into HDFS first, and then write a Map job to load them into HBase.  You can try to do this and see if direct inserts into hbase are fast enough for your use case.  But, if you are going to TBs/week then you will likely want to investigate the bulk load features.  I haven't yet incorporated that into my workflow so I can't offer much advice there. Just be sure your cluster is sized appropriately.  E.g., with your compression turned on in hbase, see how much a 1 GB input file expands to inside hbase / hdfs.  That should give you a feeling for how much space you will need for your expected data load.

Dave


-----Original Message-----
From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov] 
Sent: Tuesday, December 28, 2010 2:05 PM
To: 'user@hbase.apache.org'; 'general@hadoop.apache.org'
Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
Subject: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?


Folks,

We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with Hbase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So - we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB range. 

Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would work fastest. So my idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program. 

Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <source> <dest>") in parallel with all the other Mappers, with the Mappers operating on all the nodes of the cluster, spreading out the file upload across the nodes.

Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.

  - Ron Taylor

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group

Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA  99352 USA
Office:  509-372-6568
Email: ronald.taylor@pnl.gov



RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Buttler, David" <bu...@llnl.gov>.
Right, I should have realized that you guys would be using a good parallel file system.  In that case an M/R job will be great for moving the data -- as long as you don't overload the network.  And if you are going to have the data end up in HBase you may just write a map job to directly insert into hbase, either through standard inserts or bulk loads (which should be 10x faster).
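
(A rough, untested sketch of that first option: a map-only job whose mapper writes Puts straight into HBase through the TableOutputFormat in HBase's mapreduce package. The table name "sensor_data", the column family "d", and the tab-separated input layout are placeholders, not anything agreed in this thread.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HBaseInsertJob {

  /** Turns each "rowkey TAB value" line read from HDFS into one HBase Put. */
  public static class InsertMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      Put put = new Put(Bytes.toBytes(fields[0]));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hdfs-to-hbase-insert");
    job.setJarByClass(HBaseInsertJob.class);
    job.setMapperClass(InsertMapper.class);
    job.setNumReduceTasks(0);                        // map-only: Puts go straight to the table
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "sensor_data");
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // files already sitting in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}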

If you have time to play, it might be interesting to see how hbase runs over your current file system.  Just use the file://<path to shared directory> instead of the hdfs:// url in the hbase-site.xml file.  I would stress test that first with both hbase and m/r jobs just to make sure that it behaves well, but there would be many people who would love to hear about your experience here.  I for one would like to know if I can run hbase over Lustre since we love Lustre around here.
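
(If anyone does try that, the property that takes the URL is hbase.rootdir; a minimal hbase-site.xml fragment might look like the following, with /shared/hbase standing in for the real mount point.)

  <property>
    <name>hbase.rootdir</name>
    <value>file:///shared/hbase</value>
  </property>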

Dave

-----Original Message-----
From: Kevin Fox [mailto:Kevin.Fox@pnl.gov] 
Sent: Monday, January 03, 2011 2:58 PM
To: Buttler, David
Cc: user@hbase.apache.org; 'general@hadoop.apache.org'; Brown, David M JR
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

On Mon, 2011-01-03 at 12:20 -0800, Buttler, David wrote:
> Hi Ron,
> Loading into HDFS and HBase are two different issues.  
> 
> HDFS: if you have a large number of files to load from your nfs file system into HDFS it is not clear that parallelizing the load will help. 

It's not nfs. It's a parallel file system.

>  You have two sources of bottlenecks: the nfs file system and the HDFS file system.  In your parallel example, you will likely saturate your nfs file system first.

Unlikely in this case. We're in the unusual position of our archive
cluster being faster than our hadoop cluster.

>   If they are actually local files, then loading them via M/R is a non-starter as you have no control over which machine will get a map task.

If the same files are "local" on each node, does it matter? Shouldn't
the map jobs all be scheduled in such a way as to spread out the load?

Thanks,
Kevin

>   Unless all of the machines have files in the same directory and you are just going to look in that directory to upload.  Then, it sounds like more of a job for a parallel shell command and less of a map/reduce command.
> 
> HBase: So far my strategy has been to get the files into HDFS first, and then write a Map job to load them into HBase.  You can try to do this and see if direct inserts into hbase are fast enough for your use case.  But, if you are going to TBs/week then you will likely want to investigate the bulk load features.  I haven't yet incorporated that into my workflow so I can't offer much advice there. Just be sure your cluster is sized appropriately.  E.g., with your compression turned on in hbase, see how much a 1 GB input file expands to inside hbase / hdfs.  That should give you a feeling for how much space you will need for your expected data load.
> 
> Dave
> 
> 
> -----Original Message-----
> From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov] 
> Sent: Tuesday, December 28, 2010 2:05 PM
> To: 'user@hbase.apache.org'; 'general@hadoop.apache.org'
> Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
> Subject: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?
> 
> 
> Folks,
> 
> We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with Hbase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So - we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB range. 
> 
> Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would work fastest. So my idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program. 
> 
> Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <source> <dest>") in parallel with all the other Mappers, with the Mappers operating on all the nodes of the cluster, spreading out the file upload across the nodes.
> 
> Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.
> 
>   - Ron Taylor
> 
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
> 
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: ronald.taylor@pnl.gov
> 
> 



RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by Kevin Fox <Ke...@pnl.gov>.
On Mon, 2011-01-03 at 12:20 -0800, Buttler, David wrote:
> Hi Ron,
> Loading into HDFS and HBase are two different issues.  
> 
> HDFS: if you have a large number of files to load from your nfs file system into HDFS it is not clear that parallelizing the load will help. 

It's not nfs. It's a parallel file system.

>  You have two sources of bottlenecks: the nfs file system and the HDFS file system.  In your parallel example, you will likely saturate your nfs file system first.

Unlikely in this case. We're in the unusual position of our archive
cluster being faster than our hadoop cluster.

>   If they are actually local files, then loading them via M/R is a non-starter as you have no control over which machine will get a map task.

If the same files are "local" on each node, does it matter? Shouldn't
the map jobs all be scheduled in such a way as to spread out the load?

Thanks,
Kevin

>   Unless all of the machines have files in the same directory and you are just going to look in that directory to upload.  Then, it sounds like more of a job for a parallel shell command and less of a map/reduce command.
> 
> HBase: So far my strategy has been to get the files into HDFS first, and then write a Map job to load them into HBase.  You can try to do this and see if direct inserts into hbase are fast enough for your use case.  But, if you are going to TBs/week then you will likely want to investigate the bulk load features.  I haven't yet incorporated that into my workflow so I can't offer much advice there. Just be sure your cluster is sized appropriately.  E.g., with your compression turned on in hbase, see how much a 1 GB input file expands to inside hbase / hdfs.  That should give you a feeling for how much space you will need for your expected data load.
> 
> Dave
> 
> 
> -----Original Message-----
> From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov] 
> Sent: Tuesday, December 28, 2010 2:05 PM
> To: 'user@hbase.apache.org'; 'general@hadoop.apache.org'
> Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
> Subject: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?
> 
> 
> Folks,
> 
> We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with Hbase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So - we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB range. 
> 
> Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would work fastest. So my idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program. 
> 
> Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <source> <dest>") in parallel with all the other Mappers, with the Mappers operating on all the nodes of the cluster, spreading out the file upload across the nodes.
> 
> Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.
> 
>   - Ron Taylor
> 
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
> 
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: ronald.taylor@pnl.gov
> 
> 



RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Buttler, David" <bu...@llnl.gov>.
Hi Ron,
Loading into HDFS and HBase are two different issues.  

HDFS: if you have a large number of files to load from your nfs file system into HDFS it is not clear that parallelizing the load will help.  You have two sources of bottlenecks: the nfs file system and the HDFS file system.  In your parallel example, you will likely saturate your nfs file system first.  If they are actually local files, then loading them via M/R is a non-starter as you have no control over which machine will get a map task.  Unless all of the machines have files in the same directory and you are just going to look in that directory to upload.  Then, it sounds like more of a job for a parallel shell command and less of a map/reduce command.

HBase: So far my strategy has been to get the files into HDFS first, and then write a Map job to load them into HBase.  You can try to do this and see if direct inserts into hbase are fast enough for your use case.  But, if you are going to TBs/week then you will likely want to investigate the bulk load features.  I haven't yet incorporated that into my workflow so I can't offer much advice there. Just be sure your cluster is sized appropriately.  E.g., with your compression turned on in hbase, see how much a 1 GB input file expands to inside hbase / hdfs.  That should give you a feeling for how much space you will need for your expected data load.
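
(And a similarly untested sketch of the bulk-load route mentioned above, reusing the InsertMapper from the earlier TableOutputFormat sketch -- any mapper that emits (row key, Put) pairs works. The job writes HFiles to a staging directory and LoadIncrementalHFiles then hands them to the region servers in one step. Table name, class names, and paths are placeholders.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseBulkLoadJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "sensor_data");

    Job job = new Job(conf, "hdfs-to-hbase-bulkload");
    job.setJarByClass(HBaseBulkLoadJob.class);
    job.setMapperClass(HBaseInsertJob.InsertMapper.class);  // emits (row key, Put) pairs
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input files already in HDFS
    Path hfileDir = new Path(args[1]);                      // staging directory for the HFiles
    FileOutputFormat.setOutputPath(job, hfileDir);

    // Wires in HFileOutputFormat, a total-order partitioner keyed on the
    // table's current region boundaries, and the Put-sorting reducer.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Moves the finished HFiles under the right regions; this step is fast
      // compared to pushing every cell through the normal write path.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
  }
}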

Dave


-----Original Message-----
From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov] 
Sent: Tuesday, December 28, 2010 2:05 PM
To: 'user@hbase.apache.org'; 'general@hadoop.apache.org'
Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
Subject: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?


Folks,

We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with Hbase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So - we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB range. 

Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would work fastest. So my idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program. 

Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <source> <dest>") in parallel with all the other Mappers, with the Mappers operating on all the nodes of the cluster, spreading out the file upload across the nodes.

Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.

  - Ron Taylor

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group

Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA  99352 USA
Office:  509-372-6568
Email: ronald.taylor@pnl.gov



What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Posted by "Taylor, Ronald C" <ro...@pnl.gov>.
Folks,

We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with Hbase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So - we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB range. 

Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would work fastest. So my idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program. 

Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <source> <dest>") in parallel with all the other Mappers, with the Mappers operating on all the nodes of the cluster, spreading out the file upload across the nodes.
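
To make that concrete, below is roughly the kind of map-only job I have in mind.  It is only a sketch: it assumes every listed path is visible on every node (e.g. via an NFS mount), that the new-API NLineInputFormat is available in our Hadoop version, and the class and property names are made up.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Hypothetical "distributed upload" job: args[0] is a listing file with one
// source path per line, args[1] is the destination directory in HDFS.  Each
// map task copies the files on its lines into HDFS.
public class ListingUploadJob {

  static class CopyMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    private FileSystem fs;
    private Path destDir;

    protected void setup(Context ctx) throws IOException, InterruptedException {
      fs = FileSystem.get(ctx.getConfiguration());
      destDir = new Path(ctx.getConfiguration().get("upload.dest.dir"));
    }

    protected void map(LongWritable offset, Text pathLine, Context ctx)
        throws IOException, InterruptedException {
      Path src = new Path(pathLine.toString().trim());
      // delSrc=false, overwrite=true; the source path must resolve on the
      // node this task happens to run on (e.g. a shared NFS mount).
      fs.copyFromLocalFile(false, true, src, new Path(destDir, src.getName()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("upload.dest.dir", args[1]);
    Job job = new Job(conf, "listing-upload");
    job.setJarByClass(ListingUploadJob.class);
    job.setMapperClass(CopyMapper.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);   // one file per map task
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each map task would then do the equivalent of "hadoop fs -copyFromLocal" through the FileSystem API, with the parallelism coming from however many map slots the cluster assigns to the job.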

Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.

  - Ron Taylor

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group

Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA  99352 USA
Office:  509-372-6568
Email: ronald.taylor@pnl.gov



Re: Current status of Hbasene - is the Lily project relevant?

Posted by Daniel Einspanjer <de...@mozilla.com>.
Mozilla is using it for a test project that is building a data 
warehouse on top of our Bugzilla installation.  While it is still a bit 
young, it is usable and very exciting, not only for the search 
capabilities, but also for the application-friendly extensions to HBase 
such as linked fields, whole-document versioning, field typing, etc.  
Once the data is indexed in Solr, you have that entire powerful world 
to interact with.  I am hoping that Outerthought has a chance to 
consider supporting ElasticSearch as an alternative indexing engine, 
though.  We are also doing some work with ES and it is quite amazing as well.

I remember seeing a reference to a project that created an ElasticSearch 
plugin to feed data from HBase into it.  I'm not sure where that is at 
the moment.  It would probably be worth taking a look at ES as well.

-Daniel

On 12/24/10 9:53 PM, Taylor, Ronald C wrote:
> Hello Amit,
>
> Re your question on Hbasene: I presume that you are interested in support for full-text search, in conjunction with data storage in Hbase tables. Have you considered Lily as an alternative?
>
> I don't have any personal experience with Lily - I only have seen the web site at
>     http://www.lilyproject.org/
>   and
>     http://www.lilyproject.org/lily/about/faq.html
>     http://www.lilyproject.org/lily/about/technology.html
>
>   but Lily is based on Hbase and SOLR, which is a standalone Lucene-based search server, as I'm sure you know, and Lily is open source. So ...perhaps you should check Lily out. I'm interested in Lily myself for future use, and might be looking into it in the coming year.
>
> I've appended the Outerthought email below that brought Lily to my attention. According to the email, release  1.0 of Lily is planned for March 2011; release 0.2.1 is downloadable now.
>
> Does anybody out there have any experience yet with Lily?
>
>    Cheers,
>      Ron
>
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology&  Bioinformatics Group
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: ronald.taylor@pnl.gov
>
>
>
>
> -----Original Message-----
> From: Steven Noels [mailto:stevenn@outerthought.org]
> Sent: Friday, October 29, 2010 8:24 AM
> To: user; nosql-discussion; solr-user
> Subject: Something for the weekend - Lily 0.2 is OUT ! :)
>
> Dear all,
>
> three months after the highly anticipated proof of architecture release, we're living up to our promises, and are releasing Lily 'CR' 0.2 today - a fully-distributed, highly scalable and highly available content repository, marrying best-of-breed database and search technology into a powerful, productive and easy-to-use solution for contemporary internet-scale content applications.
> For whom
>
> You're building content applications (content management, archiving, asset management, DMS, WCMS, portals, ...) that scale well, either as a product, a project or in the cloud. You need a trustworthy underlying content repository that provides a flexible and easy-to-use content model you can adapt to your requirements. You have a keen interest in NoSQL/HBase technology but needs a higher-level API, and scalable indexing and search as well.
> Foundations
>
> Lily builds further upon Apache HBase and Apache SOLR. HBase is a faithful implementation of the Google BigTable database, and provides infinite elastic scaling and high-performance access to huge amounts of data. SOLR is the server version of Lucene, the industry-standard search library. Lily joins HBase and SOLR in a single, solidly packaged content repository product, with automated sharding (making use of multiple hardware nodes to provide scaling of volume and performance) and automatic index maintenance.
>
> Lily adds a sophisticated, yet flexible and surprisingly practical content schema on top of this, providing the structuredness of more classic databases, versioning, secondary indexing, queuing: all the stuff developers care for when fixing real-world problems.
> Key features of this release
>
>     - Fully distributed: Lily has a fully-distributed architecture making
>     maximum use of all available hardware for scalability and availability.
>     ZooKeeper is used for distributed process coordination, configuration and
>     locking. Index maintenance is based on an HBase-backed RowLog mechanism
>     allowing fast but reliable updating of SOLR indexes.
>
>     - Index maintenance: Lily offers all the features and functionality of
>     SOLR, but makes index maintenance a breeze, both for interactive as-you-go
>     updating and MapReduce-based full index rebuilds
>
>     - Multi-indexers: for high-load situations, multiple indexers can work in
>     parallel and talk to a sharded SOLR setup
>
>     - REST interface: a flexible and platform-neutral access method for all
>     Lily operations using HTTP and JSON
>
>     - Improved content model: we added URI as a base Lily type as a (small)
>     indication of our interest in semantic technology
>
> More importantly, we commit ourselves to take care of API compatibility and data format layout from this release onwards - as much as humanly possible.
>
> Lily 0.2 offers the API we want to support in the final release. Lily 0.2 is our contract for content application developers, upgrading to Lily final should require them to do as little code or data changes as possible.
>
>  From where
>
> Download Lily from www.lilyproject.org. It's Apache Licensed Open Source. No strings attached.
> Enterprise support
>
> Together with this release, we're rolling out our commercial support services<http://outerthought.org/site/services/lily.html>  (and signed up a first customer, yay!) that allows you to use Lily with peace of mind. Also, this release has been fully tested and depends on the latest Cloudera Distribution for Hadoop<http://www.cloudera.com/hadoop/>  (CDH3 beta3).
> Next up
>
> Lily 1.0 is planned for March 2011, with an interim release candidate in January. We'll be working on performance enhancements, feature additions, and are happily - eagerly - awaiting your feedback and comments. We'll post a roadmap for Lily 0.3 and onwards by mid November.
> Follow us
>
> If you want to keep track of Lily's on-going development, join the Lily discussion list or follow our company Twitter @outerthought<http://twitter.com/#%21/outerthought>
> .
> Thank you
>
> I'd like to thank Bruno and Evert for their hard work so far, the HBase and SOLR community for their help, the IWT government fund for their partial financial support, and all of our early Lily adopters and enthusiasts for their much valued feedback. You guys rock!
>
> Steven.
> --
> Steven Noels
> http://outerthought.org/
> Open Source Content Applications
> Makers of Kauri, Daisy CMS and Lily
>
>
>
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Friday, December 24, 2010 8:58 AM
> To: user@hbase.apache.org
> Subject: Re: Current status of HBasene?
>
> As far as I know, HBasene is dead.  There is Lucandra (now known as Solandra) that is similar to HBasene idea, but on top of Cassandra.
>
> See
> http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
>
> Otis
> P.S.
> Coincidentally, we are looking for people who know search and "big data" stuff:
> http://sematext.com/about/jobs.html
>
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase Hadoop ecosystem search :: http://search-hadoop.com/
>
>
>
> ----- Original Message ----
>> From: amit jaiswal<am...@yahoo.com>
>> To: user@hbase.apache.org
>> Sent: Mon, November 29, 2010 11:49:12 PM
>> Subject: Current status of HBasene?
>>
>> Hi,
>>
>> I am trying to explore HBasene for using HBase as a backend for
>> lucene index  store. But it seems that the current code in github is
>> not in  working stage, and
>>
>> there is no active development either (https://github.com/akkumar/hbasene/).
>>
>> Can somebody tell its current  status? Didn't get any response on
>> hbasene mailing
>>
>> list.
>>
>> -regards
>> Amit
>>

RE: Current status of Hbasene - is the Lily project relevant?

Posted by "Taylor, Ronald C" <ro...@pnl.gov>.
Hello Amit,

Re your question on Hbasene: I presume that you are interested in support for full-text search, in conjunction with data storage in Hbase tables. Have you considered Lily as an alternative?

I don't have any personal experience with Lily - I only have seen the web site at
   http://www.lilyproject.org/
 and
   http://www.lilyproject.org/lily/about/faq.html
   http://www.lilyproject.org/lily/about/technology.html

 but Lily is based on Hbase and SOLR, which is a standalone Lucene-based search server, as I'm sure you know, and Lily is open source. So ...perhaps you should check Lily out. I'm interested in Lily myself for future use, and might be looking into it in the coming year.

I've appended the Outerthought email below that brought Lily to my attention. According to the email, release 1.0 of Lily is planned for March 2011; release 0.2.1 is downloadable now.

Does anybody out there have any experience yet with Lily?

  Cheers,
    Ron

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA  99352 USA
Office:  509-372-6568
Email: ronald.taylor@pnl.gov




-----Original Message-----
From: Steven Noels [mailto:stevenn@outerthought.org]
Sent: Friday, October 29, 2010 8:24 AM
To: user; nosql-discussion; solr-user
Subject: Something for the weekend - Lily 0.2 is OUT ! :)

Dear all,

three months after the highly anticipated proof of architecture release, we're living up to our promises, and are releasing Lily 'CR' 0.2 today - a fully-distributed, highly scalable and highly available content repository, marrying best-of-breed database and search technology into a powerful, productive and easy-to-use solution for contemporary internet-scale content applications.
For whom

You're building content applications (content management, archiving, asset management, DMS, WCMS, portals, ...) that scale well, either as a product, a project or in the cloud. You need a trustworthy underlying content repository that provides a flexible and easy-to-use content model you can adapt to your requirements. You have a keen interest in NoSQL/HBase technology but need a higher-level API, and scalable indexing and search as well.
Foundations

Lily builds further upon Apache HBase and Apache SOLR. HBase is a faithful implementation of the Google BigTable database, and provides infinite elastic scaling and high-performance access to huge amounts of data. SOLR is the server version of Lucene, the industry-standard search library. Lily joins HBase and SOLR in a single, solidly packaged content repository product, with automated sharding (making use of multiple hardware nodes to provide scaling of volume and performance) and automatic index maintenance.

Lily adds a sophisticated, yet flexible and surprisingly practical content schema on top of this, providing the structuredness of more classic databases, versioning, secondary indexing, queuing: all the stuff developers care for when fixing real-world problems.
Key features of this release

   - Fully distributed: Lily has a fully-distributed architecture making
   maximum use of all available hardware for scalability and availability.
   ZooKeeper is used for distributed process coordination, configuration and
   locking. Index maintenance is based on an HBase-backed RowLog mechanism
   allowing fast but reliable updating of SOLR indexes.

   - Index maintenance: Lily offers all the features and functionality of
   SOLR, but makes index maintenance a breeze, both for interactive as-you-go
   updating and MapReduce-based full index rebuilds

   - Multi-indexers: for high-load situations, multiple indexers can work in
   parallel and talk to a sharded SOLR setup

   - REST interface: a flexible and platform-neutral access method for all
   Lily operations using HTTP and JSON

   - Improved content model: we added URI as a base Lily type as a (small)
   indication of our interest in semantic technology

More importantly, we commit ourselves to take care of API compatibility and data format layout from this release onwards - as much as humanly possible.

Lily 0.2 offers the API we want to support in the final release. Lily 0.2 is our contract for content application developers, upgrading to Lily final should require them to do as little code or data changes as possible.

From where

Download Lily from www.lilyproject.org. It's Apache Licensed Open Source. No strings attached.
Enterprise support

Together with this release, we're rolling out our commercial support services <http://outerthought.org/site/services/lily.html> (and signed up a first customer, yay!) that allows you to use Lily with peace of mind. Also, this release has been fully tested and depends on the latest Cloudera Distribution for Hadoop <http://www.cloudera.com/hadoop/> (CDH3 beta3).
Next up

Lily 1.0 is planned for March 2011, with an interim release candidate in January. We'll be working on performance enhancements, feature additions, and are happily - eagerly - awaiting your feedback and comments. We'll post a roadmap for Lily 0.3 and onwards by mid November.
Follow us

If you want to keep track of Lily's on-going development, join the Lily discussion list or follow our company Twitter @outerthought<http://twitter.com/#%21/outerthought>
.
Thank you

I'd like to thank Bruno and Evert for their hard work so far, the HBase and SOLR community for their help, the IWT government fund for their partial financial support, and all of our early Lily adopters and enthusiasts for their much valued feedback. You guys rock!

Steven.
--
Steven Noels
http://outerthought.org/
Open Source Content Applications
Makers of Kauri, Daisy CMS and Lily




-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Friday, December 24, 2010 8:58 AM
To: user@hbase.apache.org
Subject: Re: Current status of HBasene?

As far as I know, HBasene is dead.  There is Lucandra (now known as Solandra) that is similar to HBasene idea, but on top of Cassandra.

See
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/

Otis
P.S.
Coincidentally, we are looking for people who know search and "big data" stuff: 
http://sematext.com/about/jobs.html

----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
> From: amit jaiswal <am...@yahoo.com>
> To: user@hbase.apache.org
> Sent: Mon, November 29, 2010 11:49:12 PM
> Subject: Current status of HBasene?
> 
> Hi,
> 
> I am trying to explore HBasene for using HBase as a backend for  
>lucene index  store. But it seems that the current code in github is 
>not in  working stage, and
>
> there is no active development either (https://github.com/akkumar/hbasene/).
> 
> Can somebody tell its current  status? Didn't get any response on 
>hbasene mailing
>
> list.
> 
> -regards
> Amit
> 

Re: Current status of HBasene?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
As far as I know, HBasene is dead.  There is Lucandra (now known as Solandra) 
that is similar to HBasene idea, but on top of Cassandra.

See 
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/

Otis
P.S.
Coincidentally, we are looking for people who know search and "big data" stuff: 
http://sematext.com/about/jobs.html

----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
> From: amit jaiswal <am...@yahoo.com>
> To: user@hbase.apache.org
> Sent: Mon, November 29, 2010 11:49:12 PM
> Subject: Current status of HBasene?
> 
> Hi,
> 
> I am trying to explore HBasene for using HBase as a backend for  lucene index 
> store. But it seems that the current code in github is not in  working stage, 
>and 
>
> there is no active development either (https://github.com/akkumar/hbasene/).
> 
> Can somebody tell its current  status? Didn't get any response on hbasene 
>mailing 
>
> list.
> 
> -regards
> Amit
> 

Re: Current status of HBasene?

Posted by amit jaiswal <am...@yahoo.com>.
Hi,

Ravi,
There doesn't seem to be any active development happening on HBasene, but the 
code quality is good, and I suggest you still take a look. A few changes I 
needed to make to get it working:
1. Set the commit batch size to a lower number so that documents are always 
indexed (when you are playing with only 1-2 documents).
2. If you are printing the docIds from the result, note that the current 
implementation prints segmentIds rather than docIds. A small hack is to 
store only a single document per segment.

Stack,
The project seems to have been left in the middle of a major design change. The 
design doc doesn't discuss logical partitioning of indexes in terms of segments, 
which seems to be a good approach for scalability.

Can the developers:
1. Update the design doc with the proposed design approaches, e.g. how the 
result set would expose both segment + docId, and how documents are assigned 
to segments so that index updates are possible.
2. Give the version of the last working snapshot of HBasene (one that is in 
sync with the current documentation).

-regards
Amit



----- Original Message ----
From: Stack <st...@duboce.net>
To: user@hbase.apache.org
Sent: Tue, 30 November, 2010 9:36:38 PM
Subject: Re: Current status of HBasene?

Karthik has not touched Hbasene in a while now going by commit log up
on github.  The project seems to have stagnated.
St.Ack

On Tue, Nov 30, 2010 at 7:32 AM, Veeramachaneni, Ravi
<ra...@navteq.com> wrote:
> Hi,
>
> I'm also in the process of exploring HBasene, is this get some good traction 
>and community support soon? Or should I explore other options?
>
> Thanks,
> Ravi
>
> Sent from my iPhone
>
> On Nov 30, 2010, at 12:13 AM, "Stack" <st...@duboce.net> wrote:
>
>> The HBasene author has been busy off on other projects.  I just
>> forwarded him your query below.  Lets see if he responds.
>> St.Ack
>>
>> On Mon, Nov 29, 2010 at 8:49 PM, amit jaiswal <am...@yahoo.com> wrote:
>>> Hi,
>>>
>>> I am trying to explore HBasene for using HBase as a backend for lucene index
>>> store. But it seems that the current code in github is not in working stage, 
>>>and
>>> there is no active development either (https://github.com/akkumar/hbasene/).
>>>
>>> Can somebody tell its current status? Didn't get any response on hbasene 
>>>mailing
>>> list.
>>>
>>> -regards
>>> Amit
>>>
>
>
> The information contained in this communication may be CONFIDENTIAL and is 
>intended only for the use of the recipient(s) named above.  If you are not the 
>intended recipient, you are hereby notified that any dissemination, 
>distribution, or copying of this communication, or any of its contents, is 
>strictly prohibited.  If you have received this communication in error, please 
>notify the sender and delete/destroy the original message and any copy of it 
>from your computer or paper files.
>


Re: Current status of HBasene?

Posted by Stack <st...@duboce.net>.
Karthik has not touched HBasene in a while now, going by the commit log up
on github.  The project seems to have stagnated.
St.Ack

On Tue, Nov 30, 2010 at 7:32 AM, Veeramachaneni, Ravi
<ra...@navteq.com> wrote:
> Hi,
>
> I'm also in the process of exploring HBasene, is this get some good traction and community support soon? Or should I explore other options?
>
> Thanks,
> Ravi
>
> Sent from my iPhone
>
> On Nov 30, 2010, at 12:13 AM, "Stack" <st...@duboce.net> wrote:
>
>> The HBasene author has been busy off on other projects.  I just
>> forwarded him your query below.  Lets see if he responds.
>> St.Ack
>>
>> On Mon, Nov 29, 2010 at 8:49 PM, amit jaiswal <am...@yahoo.com> wrote:
>>> Hi,
>>>
>>> I am trying to explore HBasene for using HBase as a backend for lucene index
>>> store. But it seems that the current code in github is not in working stage, and
>>> there is no active development either (https://github.com/akkumar/hbasene/).
>>>
>>> Can somebody tell its current status? Didn't get any response on hbasene mailing
>>> list.
>>>
>>> -regards
>>> Amit
>>>
>
>
> The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above.  If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited.  If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files.
>

Re: Current status of HBasene?

Posted by "Veeramachaneni, Ravi" <ra...@navteq.com>.
Hi,

I'm also in the process of exploring HBasene. Is it going to get some good traction and community support soon? Or should I explore other options?

Thanks,
Ravi

Sent from my iPhone

On Nov 30, 2010, at 12:13 AM, "Stack" <st...@duboce.net> wrote:

> The HBasene author has been busy off on other projects.  I just
> forwarded him your query below.  Lets see if he responds.
> St.Ack
> 
> On Mon, Nov 29, 2010 at 8:49 PM, amit jaiswal <am...@yahoo.com> wrote:
>> Hi,
>> 
>> I am trying to explore HBasene for using HBase as a backend for lucene index
>> store. But it seems that the current code in github is not in working stage, and
>> there is no active development either (https://github.com/akkumar/hbasene/).
>> 
>> Can somebody tell its current status? Didn't get any response on hbasene mailing
>> list.
>> 
>> -regards
>> Amit
>> 


The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above.  If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited.  If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files.

Re: Current status of HBasene?

Posted by Stack <st...@duboce.net>.
The HBasene author has been busy off on other projects.  I just
forwarded him your query below.  Let's see if he responds.
St.Ack

On Mon, Nov 29, 2010 at 8:49 PM, amit jaiswal <am...@yahoo.com> wrote:
> Hi,
>
> I am trying to explore HBasene for using HBase as a backend for lucene index
> store. But it seems that the current code in github is not in working stage, and
> there is no active development either (https://github.com/akkumar/hbasene/).
>
> Can somebody tell its current status? Didn't get any response on hbasene mailing
> list.
>
> -regards
> Amit
>