Posted to user@hbase.apache.org by Marcus Herou <ma...@tailsweep.com> on 2008/07/28 11:38:57 UTC

Multi get/put

Hi guys.

Is there a way of retrieving multiple "rows" with one server call ?
Something like MySQL's "where id in (a,b,c...)"

Or more like this.
List<SortedMap<Text,byte[]>> rows = HTable.getRows(Text[] rowKeys);

I'm building a framework around HBase which would benefit from handling
batch-wise puts and gets.

Kindly

//Marcus



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Multi get/put

Posted by Marcus Herou <ma...@tailsweep.com>.
Hi Jun and thanks!

We are in the middle of setting up our "SAN" servers, and I will run both
Bonnie++ and IOzone tests on them as soon as we are done and before we put
them into production. Another good thing about GlusterFS is that the
community around it is great and everyone is tuned in to performance, e.g.
http://www.gluster.org/docs/index.php/Guide_to_Optimizing_GlusterFS

I have used GlusterFS as a replacement for NFS for our webapps and it has
not failed me yet :) I was so impressed that I decided to use GlusterFS
without even considering Lustre, KosmosFS etc., which I have had real
trouble with in the past.

Perhaps not a fully satisfying answer, but look here:
http://www.gluster.org/docs/index.php/GlusterFS_1.3.1_-_64_Bricks_Aggregated_I/O_Benchmark

The only tricky part in the process of setting up GlusterFS is getting the
patched FUSE running on Ubuntu Hardy (not strictly needed). The patched FUSE
gives additional performance, so you will want to get it installed, but
GlusterFS works just fine with the FUSE that comes with the kernel.

The installation of GlusterFS takes less than 10 mins and configuration
perhaps 30-60 mins the first time.

I have been working with big NetApp solutions, PolyServe etc., and everyone I
speak to says the same thing: GlusterFS rocks!
Our hardware supplier, Southpole.se, are specialists in distributed computing
as well, and they have many customers at the technical universities around
Sweden which are moving from Lustre, "the industry standard", to GlusterFS.

Kindly

/Marcus



On Sun, Aug 10, 2008 at 6:44 PM, Jun Rao <ju...@almaden.ibm.com> wrote:

> Marcus,
>
> I found your discussion on distributed file systems very interesting. Could
> you shed light on how those file systems compare (HDFS, KFS, Lustre,
> GlusterFS, etc)? Do they all support locality as HDFS does? How easy is the
> setup? What about the read/write performance (both sequential and random
> I/O)? Thanks,
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
> (408)927-1886 (phone)
> (408)927-3215 (fax)
>
>
> "Marcus Herou" <ma...@tailsweep.com> wrote on 08/09/2008 04:40:46
> AM:
>
> > Hi.
> >
> > Cool! This is a much lower level and probably better approach than ours.
> We
> > have now a functional index which however only have support for primitve
> > types but not free text indexing. It can store dups of data in the index
> for
> > fast retrieval. It is mostly used as a test of howto scale indexing
> > alongside with HBase. In the end we will probably stick with Lucene.
> >
> > We will probably in the end as well subclass HRegion, HTable etc but for
> now
> > we have a system which rather uses the existing framework.
> >
> > I understand that you would like to use HDFS for storing stuff... But
> have
> > you tried GlusterFS ?
> >
> > It is so simple and really works as a normal POSIX system. We will store
> our
> > Solr based index failes in GlusterFS. Actually I think we will use
> GlusterFS
> > as storing mechanism for the HDFS as well :) Stupid but we have some
> highly
> > potential storage machines which are must faster than a bunch of local
> > machines.
> >
> > The community should really spend some time in looking at the first of my
> > knowledge clustered file system which will lower storage costs making SAN
> > commodity. Yes we have Lustre, yes we have KosmosFS but have you ever
> tried
> > to install Lustre ? Puh... Enough about GlusterFS this is a HBase mailing
> > list :)
> >
> > Kindly
> >
> > //Marcus
> >
> > On Tue, Aug 5, 2008 at 4:58 PM, Ning Li <ni...@gmail.com> wrote:
> >
> > > We have been working on supporting Lucene-based index in HBase.
> > > In a nutshell, we extend the region to support indexing on column(s).
> > >
> > > We have a working implementation of our design. An overview of our
> > > design and the preliminary performance evaluation is provided below.
> > > We welcome feedback and we would be happy to contribute the code
> > > to HBase once the major performance issue is resolved.
> > >
> > > DATA MODEL
> > > An index can be created for a column, a column family or all the
> > > columns. In the implementation, we extend the HRegion class so that
> > > it not only manages store files which stores the column values of a
> > > region, but also Lucene instances which are used to support indexing
> > > on columns.
> > >
> > > The following assumes a per-column index and in the end we'll briefly
> > > describe how per-column family index and all-column index work.
> > >
> > > UPDATING A COLUMN
> > > Upon receiving a column update request, a region not only adds the
> > > column to the cache part of the store, but also analyzes the column
> > > and adds it to the cache part of the index. Same as the store files,
> > > the Lucene index files are also written to HDFS.
> > >
> > > Following the HBase design, to avoid resource contention, a region
> > > server globally schedules the cache flush and the compaction of both
> > > the store files and the index files of all the regions on the server.
> > >
> > > QUERYING AN INDEX
> > > We add to HTable the following method to enable querying an index.
> > >    Results search(range, column, query, max_num_hits);
> > > Depending on the specified key range, a client sends a search request
> > > to one or more region servers, who call the search method of queried
> > > regions. The client will merge the results from all the queried
> regions.
> > >
> > > In the current implementation, queries are conducted on the index files
> > > stored in HDFS.
> > >
> > > SPLITTING A REGION
> > > The region split works the same way as before - in addition to creating
> > > reference files for store files, reference files are also created for
> index
> > > files in the child regions. The old parent region will be deleted once
> > > all the reference files are deleted.
> > >
> > > PERFORMANCE ISSUES
> > > Our preliminary performance experiments show that the performance
> > > of building an index is quite reasonable. However, the performance of
> > > random reads in HDFS is so poor that the search performance is
> > > dramatically worse than that on local file systems.
> > >
> > > We are exploring different ways to solve this problem. One possibility
> > > is to store a copy on local file system. On the other hand, most likely
> > > HDFS already stores a local copy...
> > >
> > > VARIATIONS
> > > As we mentioned earlier, an index can also be created for a column
> > > family or for all the columns. If an index is created for a column
> family,
> > > whenever a column is updated, the rest of the column family needs to
> > > be retrieved to re-index the column family. This adds some overhead
> > > to the indexing process. Also, it's open what the best versioning
> > > semantics is.
> > >
> >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Multi get/put

Posted by Jun Rao <ju...@almaden.ibm.com>.
Marcus,

I found your discussion on distributed file systems very interesting. Could
you shed light on how those file systems compare (HDFS, KFS, Lustre,
GlusterFS, etc)? Do they all support locality as HDFS does? How easy is the
setup? What about the read/write performance (both sequential and random
I/O)? Thanks,

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com
(408)927-1886 (phone)
(408)927-3215 (fax)


"Marcus Herou" <ma...@tailsweep.com> wrote on 08/09/2008 04:40:46
AM:

> Hi.
>
> Cool! This is a much lower level and probably better approach than ours.
We
> have now a functional index which however only have support for primitve
> types but not free text indexing. It can store dups of data in the index
for
> fast retrieval. It is mostly used as a test of howto scale indexing
> alongside with HBase. In the end we will probably stick with Lucene.
>
> We will probably in the end as well subclass HRegion, HTable etc but for
now
> we have a system which rather uses the existing framework.
>
> I understand that you would like to use HDFS for storing stuff... But
have
> you tried GlusterFS ?
>
> It is so simple and really works as a normal POSIX system. We will store
our
> Solr based index failes in GlusterFS. Actually I think we will use
GlusterFS
> as storing mechanism for the HDFS as well :) Stupid but we have some
highly
> potential storage machines which are must faster than a bunch of local
> machines.
>
> The community should really spend some time in looking at the first of my
> knowledge clustered file system which will lower storage costs making SAN
> commodity. Yes we have Lustre, yes we have KosmosFS but have you ever
tried
> to install Lustre ? Puh... Enough about GlusterFS this is a HBase mailing
> list :)
>
> Kindly
>
> //Marcus
>
> On Tue, Aug 5, 2008 at 4:58 PM, Ning Li <ni...@gmail.com> wrote:
>
> > We have been working on supporting Lucene-based index in HBase.
> > In a nutshell, we extend the region to support indexing on column(s).
> >
> > We have a working implementation of our design. An overview of our
> > design and the preliminary performance evaluation is provided below.
> > We welcome feedback and we would be happy to contribute the code
> > to HBase once the major performance issue is resolved.
> >
> > DATA MODEL
> > An index can be created for a column, a column family or all the
> > columns. In the implementation, we extend the HRegion class so that
> > it not only manages store files which stores the column values of a
> > region, but also Lucene instances which are used to support indexing
> > on columns.
> >
> > The following assumes a per-column index and in the end we'll briefly
> > describe how per-column family index and all-column index work.
> >
> > UPDATING A COLUMN
> > Upon receiving a column update request, a region not only adds the
> > column to the cache part of the store, but also analyzes the column
> > and adds it to the cache part of the index. Same as the store files,
> > the Lucene index files are also written to HDFS.
> >
> > Following the HBase design, to avoid resource contention, a region
> > server globally schedules the cache flush and the compaction of both
> > the store files and the index files of all the regions on the server.
> >
> > QUERYING AN INDEX
> > We add to HTable the following method to enable querying an index.
> >    Results search(range, column, query, max_num_hits);
> > Depending on the specified key range, a client sends a search request
> > to one or more region servers, who call the search method of queried
> > regions. The client will merge the results from all the queried
regions.
> >
> > In the current implementation, queries are conducted on the index files
> > stored in HDFS.
> >
> > SPLITTING A REGION
> > The region split works the same way as before - in addition to creating
> > reference files for store files, reference files are also created for
index
> > files in the child regions. The old parent region will be deleted once
> > all the reference files are deleted.
> >
> > PERFORMANCE ISSUES
> > Our preliminary performance experiments show that the performance
> > of building an index is quite reasonable. However, the performance of
> > random reads in HDFS is so poor that the search performance is
> > dramatically worse than that on local file systems.
> >
> > We are exploring different ways to solve this problem. One possibility
> > is to store a copy on local file system. On the other hand, most likely
> > HDFS already stores a local copy...
> >
> > VARIATIONS
> > As we mentioned earlier, an index can also be created for a column
> > family or for all the columns. If an index is created for a column
family,
> > whenever a column is updated, the rest of the column family needs to
> > be retrieved to re-index the column family. This adds some overhead
> > to the indexing process. Also, it's open what the best versioning
> > semantics is.
> >
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/


Re: Multi get/put

Posted by Marcus Herou <ma...@tailsweep.com>.
Hi.

Cool! This is a much lower-level and probably better approach than ours. We
now have a functional index which, however, only has support for primitive
types, not free-text indexing. It can store duplicates of data in the index
for fast retrieval. It is mostly used as a test of how to scale indexing
alongside HBase. In the end we will probably stick with Lucene.

We will probably also subclass HRegion, HTable etc. in the end, but for now
we have a system which uses the existing framework instead.

I understand that you would like to use HDFS for storing stuff... But have
you tried GlusterFS?

It is so simple and really works like a normal POSIX file system. We will
store our Solr-based index files in GlusterFS. Actually I think we will use
GlusterFS as the storage mechanism for HDFS as well :) Stupid, but we have
some highly capable storage machines which are much faster than a bunch of
local machines.

The community should really spend some time looking at what is, to my
knowledge, the first clustered file system that will lower storage costs and
make SAN commodity. Yes we have Lustre, yes we have KosmosFS, but have you
ever tried to install Lustre? Phew... Enough about GlusterFS, this is an
HBase mailing list :)

Kindly

//Marcus

On Tue, Aug 5, 2008 at 4:58 PM, Ning Li <ni...@gmail.com> wrote:

> We have been working on supporting Lucene-based index in HBase.
> In a nutshell, we extend the region to support indexing on column(s).
>
> We have a working implementation of our design. An overview of our
> design and the preliminary performance evaluation is provided below.
> We welcome feedback and we would be happy to contribute the code
> to HBase once the major performance issue is resolved.
>
> DATA MODEL
> An index can be created for a column, a column family or all the
> columns. In the implementation, we extend the HRegion class so that
> it not only manages store files which stores the column values of a
> region, but also Lucene instances which are used to support indexing
> on columns.
>
> The following assumes a per-column index and in the end we'll briefly
> describe how per-column family index and all-column index work.
>
> UPDATING A COLUMN
> Upon receiving a column update request, a region not only adds the
> column to the cache part of the store, but also analyzes the column
> and adds it to the cache part of the index. Same as the store files,
> the Lucene index files are also written to HDFS.
>
> Following the HBase design, to avoid resource contention, a region
> server globally schedules the cache flush and the compaction of both
> the store files and the index files of all the regions on the server.
>
> QUERYING AN INDEX
> We add to HTable the following method to enable querying an index.
>    Results search(range, column, query, max_num_hits);
> Depending on the specified key range, a client sends a search request
> to one or more region servers, who call the search method of queried
> regions. The client will merge the results from all the queried regions.
>
> In the current implementation, queries are conducted on the index files
> stored in HDFS.
>
> SPLITTING A REGION
> The region split works the same way as before - in addition to creating
> reference files for store files, reference files are also created for index
> files in the child regions. The old parent region will be deleted once
> all the reference files are deleted.
>
> PERFORMANCE ISSUES
> Our preliminary performance experiments show that the performance
> of building an index is quite reasonable. However, the performance of
> random reads in HDFS is so poor that the search performance is
> dramatically worse than that on local file systems.
>
> We are exploring different ways to solve this problem. One possibility
> is to store a copy on local file system. On the other hand, most likely
> HDFS already stores a local copy...
>
> VARIATIONS
> As we mentioned earlier, an index can also be created for a column
> family or for all the columns. If an index is created for a column family,
> whenever a column is updated, the rest of the column family needs to
> be retrieved to re-index the column family. This adds some overhead
> to the indexing process. Also, it's open what the best versioning
> semantics is.
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Multi get/put

Posted by Marcus Herou <ma...@tailsweep.com>.
This is something I would like to implement as well: a connection pool of
some sort, to improve open/close performance and to be able to hold a
connection "open" during a session, or at least a transaction (more than one
put in a row), which I guess is supported in trunk?


//Marcus

On Thu, Aug 7, 2008 at 2:15 AM, Jun Rao <ju...@almaden.ibm.com> wrote:

> In terms of performance, the biggest overhead comes from Hbase/Hadoop ipc.
> For simple queries, a search through ipc takes 3-4 times as long as that
> directly on HDFS. I guess a lot of the overhead is because of java
> reflection in ipc proxy. Does Hbase have plans to make ipc more efficient?
>
> HDFS adds another layer of overhead compared with local file system. A
> search on HDFS (on a node that has a local copy of all data) can take 10
> times as long as that on local file system. We suspect most overhead comes
> from reopening sockets in HDFS client.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
> (408)927-1886 (phone)
> (408)927-3215 (fax)
>
>
>
>
> From: stack <stack@duboce.net>
> To: hbase-user@hadoop.apache.org
> Date: 08/06/2008 01:42 PM
> Subject: Re: Multi get/put
>
> Ning Li wrote:
> >> Does you have to do a rewrite of the lucene index at compaction time?  Or
> >> just call optimize?  (I suppose its the former if you need to clean up
> >> 'References' as per below where you talk of splits)
> >> 'References' as per below where you talk of splits)
> >>
> >
> > What do you mean by "a rewrite of the lucene index"?
>
> In hbase, on split, daughters hold a reference to either the top or
> bottom half of their parent region.  References are undone by
> compactions; as part of compaction, the part of the parent referenced by
> the daughter gets written out to store files under the daughter.
> Daughters try to undo references as promptly as possible because regions
> with references are not splitable (references to references, and so on,
> would soon become unmanageble).
>
> In your description, you mentioned that daughter regions reference their
> parents' index.  When I said, 'a rewrite of the lucene index', I was
> asking, as per hbase regions, if you followed the model and wrote a new
> lucene index comprised of daughter-only content at compaction time.  Or
> do you just 'optimize' and let the references build up so the daughter
> of a daughter points all the ways up to the parent?
>
> Just wondering.
>
>
> >> Regards your 'on the other hand' above, thats a good point.  Have you
> >> verified that if a regionerver is running on a datanode, that the lucene
> >> index is written local?  Would be interesting to know.
> >>
> >
> > That's HDFS's policy. See HDFS's FSNamesystem.getAdditionalBlock.
> >
> Sorry.  Yeah, of course.
>
> So, why do you think it so slow going via HDFS FileSystem when the data
> is local?  Is it the block-orientated access or is there just a high-tax
> going via the HDFS FS interface?
>
> St.Ack
>
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Multi get/put

Posted by Ning Li <ni...@gmail.com>.
> In hbase, on split, daughters hold a reference to either the top or bottom
> half of their parent region.  References are undone by compactions; as part
> of compaction, the part of the parent referenced by the daughter gets
> written out to store files under the daughter.  Daughters try to undo
> references as promptly as possible because regions with references are not
> splitable (references to references, and so on, would soon become
> unmanageble).
>
> In your description, you mentioned that daughter regions reference their
> parents' index.  When I said, 'a rewrite of the lucene index', I was asking,
> as per hbase regions, if you followed the model and wrote a new lucene index
> comprised of daughter-only content at compaction time.  Or do you just
> 'optimize' and let the references build up so the daughter of a daughter
> points all the ways up to the parent?

As in HBase, a split is not allowed if there are references to
parent files, whether they are store files or index files.

> So, why do you think it so slow going via HDFS FileSystem when the data is
> local?  Is it the block-orientated access or is there just a high-tax going
> via the HDFS FS interface?

Because of how DFSClient.DFSInputStream is implemented, a socket
connection is opened and closed for almost every random read. We'll
experiment with reusing socket connections in DFSInputStream.

Cheers,
Ning

Re: Multi get/put

Posted by Jun Rao <ju...@almaden.ibm.com>.
stack <st...@duboce.net> wrote on 08/06/2008 05:32:09 PM:

> Jun Rao wrote:
> > In terms of performance, the biggest overhead comes from Hbase/Hadoop ipc.
> > For simple queries, a search through ipc takes 3-4 times as long as that
> > directly on HDFS. I guess a lot of the overhead is because of java
> > reflection in ipc proxy. Does Hbase have plans to make ipc more efficient?
> >
> We do.  Its a priority.  0.3.0 hopefully.
>
> > HDFS adds another layer of overhead compared with local file system. A
> > search on HDFS (on a node that has a local copy of all data) can take 10
> > times as long as that on local file system. We suspect most overhead comes
> > from reopening sockets in HDFS client.
> >
> Are you on a recent hbase Jun?  Hadoop RPC seems to be reusing
> connections in 0.17.1.  Maybe that will help.
>

Our tests were done on Hadoop 0.17.1.


> St.Ack
>
>
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > junrao@almaden.ibm.com
> > (408)927-1886 (phone)
> > (408)927-3215 (fax)
> >
> >
> > From: stack <stack@duboce.net>
> > To: hbase-user@hadoop.apache.org
> > Date: 08/06/2008 01:42 PM
> > Subject: Re: Multi get/put
> >
> > Ning Li wrote:
> >
> >>> Does you have to do a rewrite of the lucene index at compaction time?  Or
> >>> just call optimize?  (I suppose its the former if you need to clean up
> >>> 'References' as per below where you talk of splits)
> >>>
> >> What do you mean by "a rewrite of the lucene index"?
> >>
> >
> > In hbase, on split, daughters hold a reference to either the top or
> > bottom half of their parent region.  References are undone by
> > compactions; as part of compaction, the part of the parent referenced by
> > the daughter gets written out to store files under the daughter.
> > Daughters try to undo references as promptly as possible because regions
> > with references are not splitable (references to references, and so on,
> > would soon become unmanageble).
> >
> > In your description, you mentioned that daughter regions reference their
> > parents' index.  When I said, 'a rewrite of the lucene index', I was
> > asking, as per hbase regions, if you followed the model and wrote a new
> > lucene index comprised of daughter-only content at compaction time.  Or
> > do you just 'optimize' and let the references build up so the daughter
> > of a daughter points all the ways up to the parent?
> >
> > Just wondering.
> >
> >
> >>> Regards your 'on the other hand' above, thats a good point.  Have you
> >>> verified that if a regionerver is running on a datanode, that the lucene
> >>> index is written local?  Would be interesting to know.
> >>>
> >> That's HDFS's policy. See HDFS's FSNamesystem.getAdditionalBlock.
> >>
> > Sorry.  Yeah, of course.
> >
> > So, why do you think it so slow going via HDFS FileSystem when the data
> > is local?  Is it the block-orientated access or is there just a high-tax
> > going via the HDFS FS interface?
> >
> > St.Ack
> >
>


Re: Multi get/put

Posted by stack <st...@duboce.net>.
Jun Rao wrote:
> In terms of performance, the biggest overhead comes from Hbase/Hadoop ipc.
> For simple queries, a search through ipc takes 3-4 times as long as that
> directly on HDFS. I guess a lot of the overhead is because of java
> reflection in ipc proxy. Does Hbase have plans to make ipc more efficient?
>   
We do.  It's a priority.  0.3.0 hopefully.

> HDFS adds another layer of overhead compared with local file system. A
> search on HDFS (on a node that has a local copy of all data) can take 10
> times as long as that on local file system. We suspect most overhead comes
> from reopening sockets in HDFS client.
>   
Are you on a recent hbase Jun?  Hadoop RPC seems to be reusing 
connections in 0.17.1.  Maybe that will help.

St.Ack


> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
> (408)927-1886 (phone)
> (408)927-3215 (fax)
>
>
>
> From: stack <stack@duboce.net>
> To: hbase-user@hadoop.apache.org
> Date: 08/06/2008 01:42 PM
> Subject: Re: Multi get/put
>
> Ning Li wrote:
>   
>>> Does you have to do a rewrite of the lucene index at compaction time?  Or
>>> just call optimize?  (I suppose its the former if you need to clean up
>>> 'References' as per below where you talk of splits)
>>>
>> What do you mean by "a rewrite of the lucene index"?
>>     
>
> In hbase, on split, daughters hold a reference to either the top or
> bottom half of their parent region.  References are undone by
> compactions; as part of compaction, the part of the parent referenced by
> the daughter gets written out to store files under the daughter.
> Daughters try to undo references as promptly as possible because regions
> with references are not splitable (references to references, and so on,
> would soon become unmanageble).
>
> In your description, you mentioned that daughter regions reference their
> parents' index.  When I said, 'a rewrite of the lucene index', I was
> asking, as per hbase regions, if you followed the model and wrote a new
> lucene index comprised of daughter-only content at compaction time.  Or
> do you just 'optimize' and let the references build up so the daughter
> of a daughter points all the ways up to the parent?
>
> Just wondering.
>
>
>   
>>> Regards your 'on the other hand' above, thats a good point.  Have you
>>> verified that if a regionerver is running on a datanode, that the lucene
>>> index is written local?  Would be interesting to know.
>>>
>>>       
>> That's HDFS's policy. See HDFS's FSNamesystem.getAdditionalBlock.
>>
>>     
> Sorry.  Yeah, of course.
>
> So, why do you think it so slow going via HDFS FileSystem when the data
> is local?  Is it the block-orientated access or is there just a high-tax
> going via the HDFS FS interface?
>
> St.Ack
>
>
>   


Re: Multi get/put

Posted by Jun Rao <ju...@almaden.ibm.com>.
In terms of performance, the biggest overhead comes from Hbase/Hadoop ipc.
For simple queries, a search through ipc takes 3-4 times as long as that
directly on HDFS. I guess a lot of the overhead is because of java
reflection in ipc proxy. Does Hbase have plans to make ipc more efficient?

HDFS adds another layer of overhead compared with local file system. A
search on HDFS (on a node that has a local copy of all data) can take 10
times as long as that on local file system. We suspect most overhead comes
from reopening sockets in HDFS client.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com
(408)927-1886 (phone)
(408)927-3215 (fax)



                                                                           
From: stack <stack@duboce.net>
To: hbase-user@hadoop.apache.org
Date: 08/06/2008 01:42 PM
Subject: Re: Multi get/put




Ning Li wrote:
>> Does you have to do a rewrite of the lucene index at compaction time?  Or
>> just call optimize?  (I suppose its the former if you need to clean up
>> 'References' as per below where you talk of splits)
>>
>
> What do you mean by "a rewrite of the lucene index"?

In hbase, on split, daughters hold a reference to either the top or
bottom half of their parent region.  References are undone by
compactions; as part of compaction, the part of the parent referenced by
the daughter gets written out to store files under the daughter.
Daughters try to undo references as promptly as possible because regions
with references are not splitable (references to references, and so on,
would soon become unmanageble).

In your description, you mentioned that daughter regions reference their
parents' index.  When I said, 'a rewrite of the lucene index', I was
asking, as per hbase regions, if you followed the model and wrote a new
lucene index comprised of daughter-only content at compaction time.  Or
do you just 'optimize' and let the references build up so the daughter
of a daughter points all the ways up to the parent?

Just wondering.


>> Regards your 'on the other hand' above, thats a good point.  Have you
>> verified that if a regionerver is running on a datanode, that the lucene
>> index is written local?  Would be interesting to know.
>>
>
> That's HDFS's policy. See HDFS's FSNamesystem.getAdditionalBlock.
>
Sorry.  Yeah, of course.

So, why do you think it so slow going via HDFS FileSystem when the data
is local?  Is it the block-orientated access or is there just a high-tax
going via the HDFS FS interface?

St.Ack



Re: Multi get/put

Posted by stack <st...@duboce.net>.
Ning Li wrote:
>> Does you have to do a rewrite of the lucene index at compaction time?  Or
>> just call optimize?  (I suppose its the former if you need to clean up
>> 'References' as per below where you talk of splits)
>>     
>
> What do you mean by "a rewrite of the lucene index"? 

In hbase, on split, daughters hold a reference to either the top or 
bottom half of their parent region.  References are undone by 
compactions; as part of compaction, the part of the parent referenced by 
the daughter gets written out to store files under the daughter.  
Daughters try to undo references as promptly as possible because regions
with references are not splittable (references to references, and so on,
would soon become unmanageable).

In your description, you mentioned that daughter regions reference their 
parents' index.  When I said, 'a rewrite of the lucene index', I was 
asking, as per hbase regions, if you followed the model and wrote a new 
lucene index comprised of daughter-only content at compaction time.  Or 
do you just 'optimize' and let the references build up so the daughter 
of a daughter points all the way up to the parent?

Just wondering.


>> Regards your 'on the other hand' above, thats a good point.  Have you
>> verified that if a regionerver is running on a datanode, that the lucene
>> index is written local?  Would be interesting to know.
>>     
>
> That's HDFS's policy. See HDFS's FSNamesystem.getAdditionalBlock.
>   
Sorry.  Yeah, of course.

So, why do you think it is so slow going via the HDFS FileSystem when the
data is local?  Is it the block-orientated access, or is there just a high
tax going via the HDFS FS interface?

St.Ack

Re: Multi get/put

Posted by Ning Li <ni...@gmail.com>.
> How does this work with regard to TTL and cell versions?

The trunk snapshot we based the work on does not support TTL. We'll
add the support when porting to the latest version.

> Does you have to do a rewrite of the lucene index at compaction time?  Or
> just call optimize?  (I suppose its the former if you need to clean up
> 'References' as per below where you talk of splits)

What do you mean by "a rewrite of the lucene index"? Right now,
optimize is called. But we'll experiment with maybeMerge to allow more
flexible compaction policies. I.e. so we don't have to merge all the
files for every compaction. References are taken care of in the
customized Directory implementation.


> What do you mean by 'dramatic' in the above?  This is a sweet feature.  That
> its slow on first implementation is OK.  Are you thinking its so slow, its
> not functional?

Right now, the search performance is more than an order of magnitude
slower primarily because of the random read performance in HDFS...

> Regards your 'on the other hand' above, thats a good point.  Have you
> verified that if a regionerver is running on a datanode, that the lucene
> index is written local?  Would be interesting to know.

That's HDFS's policy. See HDFS's FSNamesystem.getAdditionalBlock.

Cheers,
Ning

Re: Multi get/put

Posted by stack <st...@duboce.net>.
Ning Li wrote:
> Some follow-up on the performance issues:
>   
>>> PERFORMANCE ISSUES
>>> Our preliminary performance experiments show that the performance
>>> of building an index is quite reasonable. However, the performance of
>>> random reads in HDFS is so poor that the search performance is
>>> dramatically worse than that on local file systems.
>>>
>>>       
>> What do you mean by 'dramatic' in the above?  This is a sweet feature.  That
>> its slow on first implementation is OK.  Are you thinking its so slow, its
>> not functional?
>>     
>
> On local FS, real disk IO is expensive. Lucene relies on FS cache to
> provide high search performance on local FS. Because of this, the
> following comparisons are based on warm test results.
>
> The comparison is between the local FS and a one-node HDFS. HDFS
> provides high sequential read performance but poor random read
> performance mainly because of socket overhead when data is warm.
>
> On HDFS 0.17.1, the search performance is more than an order of
> magnitude slower than that on a local FS. Even with reusing socket
> connection, the search performance is still about an order of
> magnitude slower.
>
> Since this is caused by the socket overhead in HDFS, you see similar
> results with random reads on a map file. I used HBase's
> MapFilePerformanceEvaluation. The random read performance is a bit
> less than 7 times lower than that on a local FS. This is a bit better
> than the search performance probably because a random read on a map
> file is several almost-sequential reads on the data file in HDFS.
>
> Given the above, would the search performance be acceptable?
>   
I think performance  -- an order of magnitude slower than local fs --  
is OK for now.  Slow search will be just one more reason why random-read 
performance needs to be improved.

> PS: I saw on http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
> that the random read performance on a map file improved quite a bit
> from 0.17.1 to 0.18.0. Any insight?
>   
Chatting w/ some of the fellas, they said that they've started to worry 
about performance and have been making improvements slowly.  Let me try 
and get some more specifics.  Will be back if I learn anything.

St.Ack

Re: Multi get/put

Posted by Ning Li <ni...@gmail.com>.
Some follow-up on the performance issues:

> > PERFORMANCE ISSUES
> > Our preliminary performance experiments show that the performance
> > of building an index is quite reasonable. However, the performance of
> > random reads in HDFS is so poor that the search performance is
> > dramatically worse than that on local file systems.
> >
> What do you mean by 'dramatic' in the above?  This is a sweet feature.  That
> its slow on first implementation is OK.  Are you thinking its so slow, its
> not functional?

On local FS, real disk IO is expensive. Lucene relies on FS cache to
provide high search performance on local FS. Because of this, the
following comparisons are based on warm test results.

The comparison is between the local FS and a one-node HDFS. HDFS
provides high sequential read performance but poor random read
performance mainly because of socket overhead when data is warm.

On HDFS 0.17.1, the search performance is more than an order of
magnitude slower than that on a local FS. Even with reusing socket
connection, the search performance is still about an order of
magnitude slower.

Since this is caused by the socket overhead in HDFS, you see similar
results with random reads on a map file. I used HBase's
MapFilePerformanceEvaluation. The random read performance is a bit
less than 7 times slower than that on a local FS. This is a bit better
than the search performance probably because a random read on a map
file is several almost-sequential reads on the data file in HDFS.
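
For anyone who wants to reproduce the comparison, a warm random-read loop
over a MapFile looks roughly like the sketch below (the path, key format and
read count are made up; point fs.default.name at HDFS or at the local FS to
compare the two):

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    // Rough warm random-read benchmark over an existing MapFile. Run it once
    // to warm the FS cache, then time a second pass; change fs.default.name
    // in the Configuration to compare HDFS against the local file system.
    public class MapFileRandomReads {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, "/benchmark/mapfile", conf);

        Random rnd = new Random();
        BytesWritable value = new BytesWritable();
        int reads = 100000;

        long start = System.currentTimeMillis();
        for (int i = 0; i < reads; i++) {
          // Keys are assumed to be zero-padded row numbers written at load time.
          Text key = new Text(String.format("%010d", rnd.nextInt(1000000)));
          reader.get(key, value);
        }
        long elapsed = System.currentTimeMillis() - start;

        reader.close();
        System.out.println(reads + " random reads in " + elapsed + " ms");
      }
    }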

Given the above, would the search performance be acceptable?

PS: I saw on http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
that the random read performance on a map file improved quite a bit
from 0.17.1 to 0.18.0. Any insight?

Re: Multi get/put

Posted by stack <st...@duboce.net>.
The below looks excellent Ning.

Comments interspersed.


Ning Li wrote:
> ...
>
> UPDATING A COLUMN
> Upon receiving a column update request, a region not only adds the
> column to the cache part of the store, but also analyzes the column
> and adds it to the cache part of the index. Same as the store files,
> the Lucene index files are also written to HDFS.
>
> Following the HBase design, to avoid resource contention, a region
> server globally schedules the cache flush and the compaction of both
> the store files and the index files of all the regions on the server.
>
>   
How does this work with regard to TTL and cell versions?

Do you have to do a rewrite of the lucene index at compaction time?
Or just call optimize?  (I suppose it's the former if you need to clean
up 'References' as per below where you talk of splits)

> PERFORMANCE ISSUES
> Our preliminary performance experiments show that the performance
> of building an index is quite reasonable. However, the performance of
> random reads in HDFS is so poor that the search performance is
> dramatically worse than that on local file systems.
>
> We are exploring different ways to solve this problem. One possibility
> is to store a copy on local file system. On the other hand, most likely
> HDFS already stores a local copy...
>
>   
What do you mean by 'dramatic' in the above?  This is a sweet feature.
That it's slow on first implementation is OK.  Are you thinking it's so
slow that it's not functional?

Regarding your 'on the other hand' above, that's a good point.  Have you
verified that if a regionserver is running on a datanode, the lucene
index is written locally?  Would be interesting to know.

St.Ack

Re: Multi get/put

Posted by Ning Li <ni...@gmail.com>.
We have been working on supporting Lucene-based index in HBase.
In a nutshell, we extend the region to support indexing on column(s).

We have a working implementation of our design. An overview of our
design and the preliminary performance evaluation is provided below.
We welcome feedback and we would be happy to contribute the code
to HBase once the major performance issue is resolved.

DATA MODEL
An index can be created for a column, a column family or all the
columns. In the implementation, we extend the HRegion class so that
it not only manages store files, which store the column values of a
region, but also Lucene instances which are used to support indexing
on columns.

The following assumes a per-column index and in the end we'll briefly
describe how per-column family index and all-column index work.

UPDATING A COLUMN
Upon receiving a column update request, a region not only adds the
column to the cache part of the store, but also analyzes the column
and adds it to the cache part of the index. Same as the store files,
the Lucene index files are also written to HDFS.

Following the HBase design, to avoid resource contention, a region
server globally schedules the cache flush and the compaction of both
the store files and the index files of all the regions on the server.
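
For illustration only (this is not the actual patch; the class, field names
and the use of a RAMDirectory buffer are assumptions), the per-update
indexing step looks roughly like this with the Lucene 2.x API of the time:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    // Hypothetical per-region index buffer: updates are analyzed into the
    // in-memory part of the index, mirroring how column values go into the
    // cache part of the store; a later flush writes both out (to HDFS here).
    public class ColumnIndexBuffer {
      private final RAMDirectory ramDir = new RAMDirectory();
      private final IndexWriter writer;

      public ColumnIndexBuffer() throws Exception {
        writer = new IndexWriter(ramDir, new StandardAnalyzer(), true);
      }

      // Called for each update request on an indexed column.
      public void index(String rowKey, String column, String value) throws Exception {
        Document doc = new Document();
        // Store the row key so a search hit can be mapped back to an HBase row.
        doc.add(new Field("row", rowKey, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Analyze the column value so it becomes searchable.
        doc.add(new Field(column, value, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
      }

      public void flush() throws Exception {
        writer.close(); // in the real design the index files end up in HDFS
      }
    }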

QUERYING AN INDEX
We add to HTable the following method to enable querying an index.
    Results search(range, column, query, max_num_hits);
Depending on the specified key range, a client sends a search request
to one or more region servers, who call the search method of queried
regions. The client will merge the results from all the queried regions.

In the current implementation, queries are conducted on the index files
stored in HDFS.
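
As a sketch of the client-side merge only (the Hit type and scoring are made
up; the real code lives in the HTable client), keeping the best max_num_hits
across all queried regions can be done with a bounded priority queue:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Hypothetical client-side merge of ranked hits from each queried region.
    class SearchMerger {
      static class Hit {
        final byte[] row;
        final float score;
        Hit(byte[] row, float score) { this.row = row; this.score = score; }
      }

      // Min-heap by score, so the weakest retained hit is always at the head.
      private static final Comparator<Hit> BY_SCORE = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
      };

      static List<Hit> merge(List<List<Hit>> perRegionHits, int maxNumHits) {
        PriorityQueue<Hit> best = new PriorityQueue<Hit>(maxNumHits, BY_SCORE);
        for (List<Hit> regionHits : perRegionHits) {
          for (Hit h : regionHits) {
            best.offer(h);
            if (best.size() > maxNumHits) {
              best.poll(); // drop the current weakest hit
            }
          }
        }
        List<Hit> merged = new ArrayList<Hit>(best);
        Collections.sort(merged, Collections.reverseOrder(BY_SCORE)); // best first
        return merged;
      }
    }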

SPLITTING A REGION
The region split works the same way as before - in addition to creating
reference files for store files, reference files are also created for index
files in the child regions. The old parent region will be deleted once
all the reference files are deleted.

PERFORMANCE ISSUES
Our preliminary performance experiments show that the performance
of building an index is quite reasonable. However, the performance of
random reads in HDFS is so poor that the search performance is
dramatically worse than that on local file systems.

We are exploring different ways to solve this problem. One possibility
is to store a copy on local file system. On the other hand, most likely
HDFS already stores a local copy...

VARIATIONS
As we mentioned earlier, an index can also be created for a column
family or for all the columns. If an index is created for a column family,
whenever a column is updated, the rest of the column family needs to
be retrieved to re-index the column family. This adds some overhead
to the indexing process. Also, it is an open question what the best
versioning semantics are.

Re: Multi get/put

Posted by Marcus Herou <ma...@tailsweep.com>.
Yep I will release it next week.

Kindly

//Marcus

On Mon, Aug 4, 2008 at 11:19 PM, stack <st...@duboce.net> wrote:

> Marcus Herou wrote:
>
>> ..
>> Would you call it "safe" to start developing on 0.2 if we will use the
>> code
>> in production in October ? I can live with changes of interfaces and such
>> but if the kernel of HBase itself will be unstable so there is potential
>> dataloss I'm getting a little more worried. When do you plan that 0.2 is
>> final ?
>>
>>
>>
> For October, would suggest you plan on 0.3.0.
>
> 0.2.0RC2 should be going out in the next day or so.
>
> We need to put up a proposal for folks to discuss and vote on, but chatting
> on IRC, current thought is for 0.3.0 to have a short development cycle and
> come out soon after 0.2.0.
>
>  Yesterday my first successful ORM test cases for HBase went through in
>> which
>> the batching stuff would be extremely helpful
>>
>> I have many cases where I need to batch data in and out of HBase.
>> Searching
>> is one: I coupled HBase to SOLR whenever I want to retrieve data by query.
>> HBase is only scanning which is'nt the fastest way if you have zillions of
>> rows :) Lucene is a good indexing system already but it is'nt very easy to
>> make it scale along with HBase.
>>
>> I would like to have the case that whenever I add a HBase machine I as
>> well
>> add indexing speed, so...we are building an indexing system which will use
>> HBase. HBase is great for this since the row keys are sorted. < > and =
>> queries will be piece of a cake. I will release both HBaseORM and
>> HBaseIndex
>> as OpenSource whenever I have removed the company deps.
>>
>> I would gladly contribute these stuff in a contrib source tree.
>>
>>
> Marcus, this is great stuff.  I would encourage you to do your development
> of the ORM and index out in the open (Feel free to use the hbase wiki and
> the hbase mailing lists to lay out ideas/plans and to solicit opinions).
>  From what I hear, you are not the only gentleperson trying to figure these
> issues.  Doing your dev in the open, you might get some useful feedback and
> even some help.
>
> Also, file issues against hbase for any functionality you need to make your
> ORM and index happen.
> Thanks,
> St.Ack
>
>
>
>  l Cryans <jd...@gmail.com>wrote:
>>
>>
>>
>>> Marcus,
>>>
>>> If you are currently building upon 0.2.0, the way to retrieve multiple
>>> rows
>>> is to use a scanner available from the client class HTable. The way to
>>> batch
>>> multiple rows updates is to use the BatchUpdate[ ]  version of
>>> HTable.commit
>>>
>>> Hope this helps,
>>>
>>> J-D
>>>
>>> On Mon, Jul 28, 2008 at 5:38 AM, Marcus Herou <
>>> marcus.herou@tailsweep.com
>>>
>>>
>>>> wrote:
>>>>      Hi guys.
>>>>
>>>> Is there a way of retrieving multiple "rows" with one server call ?
>>>> Something like MySQL's "where id in (a,b,c...)
>>>>
>>>> Or more like this.
>>>> List<SortedMap<Text,byte[]>> rows = HTable.getRows(Text[] rowKeys);
>>>>
>>>> I'm building a framework around HBase which would benefit of handling
>>>>
>>>>
>>> batch
>>>
>>>
>>>> wise puts and gets.
>>>>
>>>> Kindly
>>>>
>>>> //Marcus
>>>>
>>>>
>>>>
>>>> --
>>>> Marcus Herou CTO and co-founder Tailsweep AB
>>>> +46702561312
>>>> marcus.herou@tailsweep.com
>>>> http://www.tailsweep.com/
>>>> http://blogg.tailsweep.com/
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Multi get/put

Posted by stack <st...@duboce.net>.
Marcus Herou wrote:
> ..
> Would you call it "safe" to start developing on 0.2 if we will use the code
> in production in October ? I can live with changes of interfaces and such
> but if the kernel of HBase itself will be unstable so there is potential
> dataloss I'm getting a little more worried. When do you plan that 0.2 is
> final ?
>
>   
For October, would suggest you plan on 0.3.0.

0.2.0RC2 should be going out in the next day or so.

We need to put up a proposal for folks to discuss and vote on, but 
chatting on IRC, current thought is for 0.3.0 to have a short 
development cycle and come out soon after 0.2.0.

> Yesterday my first successful ORM test cases for HBase went through in which
> the batching stuff would be extremely helpful
>
> I have many cases where I need to batch data in and out of HBase. Searching
> is one: I coupled HBase to SOLR whenever I want to retrieve data by query.
> HBase is only scanning which is'nt the fastest way if you have zillions of
> rows :) Lucene is a good indexing system already but it is'nt very easy to
> make it scale along with HBase.
>
> I would like to have the case that whenever I add a HBase machine I as well
> add indexing speed, so...we are building an indexing system which will use
> HBase. HBase is great for this since the row keys are sorted. < > and =
> queries will be piece of a cake. I will release both HBaseORM and HBaseIndex
> as OpenSource whenever I have removed the company deps.
>
> I would gladly contribute these stuff in a contrib source tree.
>   
Marcus, this is great stuff.  I would encourage you to do your 
development of the ORM and index out in the open (Feel free to use the 
hbase wiki and the hbase mailing lists to lay out ideas/plans and to 
solicit opinions).  From what I hear, you are not the only gentleperson 
trying to figure these issues.  Doing your dev in the open, you might 
get some useful feedback and even some help.

Also, file issues against hbase for any functionality you need to make 
your ORM and index happen. 

Thanks,
St.Ack


> l Cryans <jd...@gmail.com>wrote:
>
>   
>> Marcus,
>>
>> If you are currently building upon 0.2.0, the way to retrieve multiple rows
>> is to use a scanner available from the client class HTable. The way to
>> batch
>> multiple rows updates is to use the BatchUpdate[ ]  version of
>> HTable.commit
>>
>> Hope this helps,
>>
>> J-D
>>
>> On Mon, Jul 28, 2008 at 5:38 AM, Marcus Herou <marcus.herou@tailsweep.com
>>     
>>> wrote:
>>>       
>>> Hi guys.
>>>
>>> Is there a way of retrieving multiple "rows" with one server call ?
>>> Something like MySQL's "where id in (a,b,c...)
>>>
>>> Or more like this.
>>> List<SortedMap<Text,byte[]>> rows = HTable.getRows(Text[] rowKeys);
>>>
>>> I'm building a framework around HBase which would benefit of handling
>>>       
>> batch
>>     
>>> wise puts and gets.
>>>
>>> Kindly
>>>
>>> //Marcus
>>>
>>>
>>>
>>> --
>>> Marcus Herou CTO and co-founder Tailsweep AB
>>> +46702561312
>>> marcus.herou@tailsweep.com
>>> http://www.tailsweep.com/
>>> http://blogg.tailsweep.com/
>>>
>>>       
>
>
>
>   


Re: Multi get/put

Posted by Tim Sell <tr...@gmail.com>.
I'd be very interested in these, once you make them public :)

~Tim.

2008/7/29 Marcus Herou <ma...@tailsweep.com>:
> OK thanks, I will try 0.2!
>
> No I was (am) using 0.1.3 but am looking in the trunk which I've noticed
> have many new cool stuff.
>
> Would you call it "safe" to start developing on 0.2 if we will use the code
> in production in October ? I can live with changes of interfaces and such
> but if the kernel of HBase itself will be unstable so there is potential
> dataloss I'm getting a little more worried. When do you plan that 0.2 is
> final ?
>
> Yesterday my first successful ORM test cases for HBase went through in which
> the batching stuff would be extremely helpful
>
> I have many cases where I need to batch data in and out of HBase. Searching
> is one: I coupled HBase to SOLR whenever I want to retrieve data by query.
> HBase is only scanning which is'nt the fastest way if you have zillions of
> rows :) Lucene is a good indexing system already but it is'nt very easy to
> make it scale along with HBase.
>
> I would like to have the case that whenever I add a HBase machine I as well
> add indexing speed, so...we are building an indexing system which will use
> HBase. HBase is great for this since the row keys are sorted. < > and =
> queries will be piece of a cake. I will release both HBaseORM and HBaseIndex
> as OpenSource whenever I have removed the company deps.
>
> I would gladly contribute these stuff in a contrib source tree.
>
> Kindly
>
> //Marcus
>
>
>
>
>
> On Mon, Jul 28, 2008 at 8:36 PM, Jean-Daniel Cryans <jd...@gmail.com>wrote:
>
>> Marcus,
>>
>> If you are currently building upon 0.2.0, the way to retrieve multiple rows
>> is to use a scanner available from the client class HTable. The way to
>> batch
>> multiple rows updates is to use the BatchUpdate[ ]  version of
>> HTable.commit
>>
>> Hope this helps,
>>
>> J-D
>>
>> On Mon, Jul 28, 2008 at 5:38 AM, Marcus Herou <marcus.herou@tailsweep.com
>> >wrote:
>>
>> > Hi guys.
>> >
>> > Is there a way of retrieving multiple "rows" with one server call ?
>> > Something like MySQL's "where id in (a,b,c...)
>> >
>> > Or more like this.
>> > List<SortedMap<Text,byte[]>> rows = HTable.getRows(Text[] rowKeys);
>> >
>> > I'm building a framework around HBase which would benefit of handling
>> batch
>> > wise puts and gets.
>> >
>> > Kindly
>> >
>> > //Marcus
>> >
>> >
>> >
>> > --
>> > Marcus Herou CTO and co-founder Tailsweep AB
>> > +46702561312
>> > marcus.herou@tailsweep.com
>> > http://www.tailsweep.com/
>> > http://blogg.tailsweep.com/
>> >
>>
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>

Re: Multi get/put

Posted by Marcus Herou <ma...@tailsweep.com>.
OK thanks, I will try 0.2!

No, I was (am) using 0.1.3, but I am looking at the trunk, which I've noticed
has a lot of new cool stuff.

Would you call it "safe" to start developing on 0.2 if we will use the code
in production in October? I can live with changes of interfaces and such,
but if the kernel of HBase itself is unstable so there is potential data
loss, I'm getting a little more worried. When do you plan for 0.2 to be
final?

Yesterday my first successful ORM test cases for HBase went through, in
which the batching stuff would be extremely helpful.

I have many cases where I need to batch data in and out of HBase. Searching
is one: I coupled HBase to Solr for whenever I want to retrieve data by
query. HBase only does scanning, which isn't the fastest way if you have
zillions of rows :) Lucene is a good indexing system already, but it isn't
very easy to make it scale along with HBase.

I would like it so that whenever I add an HBase machine I also add indexing
speed, so... we are building an indexing system which will use HBase. HBase
is great for this since the row keys are sorted: <, > and = queries will be
a piece of cake. I will release both HBaseORM and HBaseIndex as open source
once I have removed the company deps.
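
To sketch the idea (this is not the actual HBaseIndex key format, just an
illustration): if an index row key is built as paddedValue + separator +
sourceRowKey, then =, < and > predicates all become scans over one
contiguous, sorted key range.

    // Hypothetical index-row-key layout: zero-padded value, then source row key.
    public class IndexKeys {
      private static final char SEP = '|';

      static String indexKey(long value, String sourceRow) {
        return String.format("%012d", value) + SEP + sourceRow;
      }

      // "value >= 100" becomes: scan starting at this key and stop once the
      // value prefix no longer satisfies the predicate.
      static String startKeyForAtLeast(long value) {
        return String.format("%012d", value);
      }

      // "value == 100" becomes: scan from the same start key and stop as soon
      // as a row key no longer begins with the padded value prefix.
      static boolean matchesEquals(String rowKey, long value) {
        return rowKey.startsWith(String.format("%012d", value) + SEP);
      }

      public static void main(String[] args) {
        System.out.println(indexKey(100, "article-42"));   // 000000000100|article-42
        System.out.println(startKeyForAtLeast(100));       // 000000000100
        System.out.println(matchesEquals("000000000100|article-42", 100)); // true
      }
    }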

I would gladly contribute this stuff to a contrib source tree.

Kindly

//Marcus





On Mon, Jul 28, 2008 at 8:36 PM, Jean-Daniel Cryans <jd...@gmail.com>wrote:

> Marcus,
>
> If you are currently building upon 0.2.0, the way to retrieve multiple rows
> is to use a scanner available from the client class HTable. The way to
> batch
> multiple rows updates is to use the BatchUpdate[ ]  version of
> HTable.commit
>
> Hope this helps,
>
> J-D
>
> On Mon, Jul 28, 2008 at 5:38 AM, Marcus Herou <marcus.herou@tailsweep.com
> >wrote:
>
> > Hi guys.
> >
> > Is there a way of retrieving multiple "rows" with one server call ?
> > Something like MySQL's "where id in (a,b,c...)
> >
> > Or more like this.
> > List<SortedMap<Text,byte[]>> rows = HTable.getRows(Text[] rowKeys);
> >
> > I'm building a framework around HBase which would benefit of handling
> batch
> > wise puts and gets.
> >
> > Kindly
> >
> > //Marcus
> >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Multi get/put

Posted by Daniel Yu <d4...@gmail.com>.
oh, that explains a lot, thanks:)

2008/7/29 Jean-Daniel Cryans <jd...@gmail.com>

> Daniel,
>
> BatchUpdate is the 0.2.0  equivalent to 0.1.x put().
>
> J-D
>
> On Tue, Jul 29, 2008 at 12:41 PM, Daniel Yu <d4...@gmail.com> wrote:
>
> > hi J-D,
> >   how about the performance of BatchUpdate and multiple single-Update? in
> > MapReduce jobs, if we use TableReduce,
> > we only have a put() method to update the table,  i'm wondering whether
> the
> > put() method can use a BatchUpdate mechanism.  Thanks.
> >
> > 2008/7/28 Jean-Daniel Cryans <jd...@gmail.com>
> >
> > > Marcus,
> > >
> > > If you are currently building upon 0.2.0, the way to retrieve multiple
> > rows
> > > is to use a scanner available from the client class HTable. The way to
> > > batch
> > > multiple rows updates is to use the BatchUpdate[ ]  version of
> > > HTable.commit
> > >
> > > Hope this helps,
> > >
> > > J-D
> > >
> > > On Mon, Jul 28, 2008 at 5:38 AM, Marcus Herou <
> > marcus.herou@tailsweep.com
> > > >wrote:
> > >
> > > > Hi guys.
> > > >
> > > > Is there a way of retrieving multiple "rows" with one server call ?
> > > > Something like MySQL's "where id in (a,b,c...)
> > > >
> > > > Or more like this.
> > > > List<SortedMap<Text,byte[]>> rows = HTable.getRows(Text[] rowKeys);
> > > >
> > > > I'm building a framework around HBase which would benefit of handling
> > > batch
> > > > wise puts and gets.
> > > >
> > > > Kindly
> > > >
> > > > //Marcus
> > > >
> > > >
> > > >
> > > > --
> > > > Marcus Herou CTO and co-founder Tailsweep AB
> > > > +46702561312
> > > > marcus.herou@tailsweep.com
> > > > http://www.tailsweep.com/
> > > > http://blogg.tailsweep.com/
> > > >
> > >
> >
>

Re: Multi get/put

Posted by Jean-Daniel Cryans <jd...@gmail.com>.
Daniel,

BatchUpdate is the 0.2.0  equivalent to 0.1.x put().

J-D

On Tue, Jul 29, 2008 at 12:41 PM, Daniel Yu <d4...@gmail.com> wrote:

> hi J-D,
>   how about the performance of BatchUpdate and multiple single-Update? in
> MapReduce jobs, if we use TableReduce,
> we only have a put() method to update the table,  i'm wondering whether the
> put() method can use a BatchUpdate mechanism.  Thanks.
>
> 2008/7/28 Jean-Daniel Cryans <jd...@gmail.com>
>
> > Marcus,
> >
> > If you are currently building upon 0.2.0, the way to retrieve multiple
> rows
> > is to use a scanner available from the client class HTable. The way to
> > batch
> > multiple rows updates is to use the BatchUpdate[ ]  version of
> > HTable.commit
> >
> > Hope this helps,
> >
> > J-D
> >
> > On Mon, Jul 28, 2008 at 5:38 AM, Marcus Herou <
> marcus.herou@tailsweep.com
> > >wrote:
> >
> > > Hi guys.
> > >
> > > Is there a way of retrieving multiple "rows" with one server call ?
> > > Something like MySQL's "where id in (a,b,c...)
> > >
> > > Or more like this.
> > > List<SortedMap<Text,byte[]>> rows = HTable.getRows(Text[] rowKeys);
> > >
> > > I'm building a framework around HBase which would benefit of handling
> > batch
> > > wise puts and gets.
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > >
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > marcus.herou@tailsweep.com
> > > http://www.tailsweep.com/
> > > http://blogg.tailsweep.com/
> > >
> >
>

Re: Multi get/put

Posted by Daniel Yu <d4...@gmail.com>.
hi J-D,
   How does the performance of BatchUpdate compare with multiple single
updates? In MapReduce jobs, if we use TableReduce, we only have a put()
method to update the table; I'm wondering whether the put() method can use
a BatchUpdate mechanism. Thanks.

2008/7/28 Jean-Daniel Cryans <jd...@gmail.com>

> Marcus,
>
> If you are currently building upon 0.2.0, the way to retrieve multiple rows
> is to use a scanner available from the client class HTable. The way to
> batch
> multiple rows updates is to use the BatchUpdate[ ]  version of
> HTable.commit
>
> Hope this helps,
>
> J-D
>
> On Mon, Jul 28, 2008 at 5:38 AM, Marcus Herou <marcus.herou@tailsweep.com
> >wrote:
>
> > Hi guys.
> >
> > Is there a way of retrieving multiple "rows" with one server call ?
> > Something like MySQL's "where id in (a,b,c...)
> >
> > Or more like this.
> > List<SortedMap<Text,byte[]>> rows = HTable.getRows(Text[] rowKeys);
> >
> > I'm building a framework around HBase which would benefit of handling
> batch
> > wise puts and gets.
> >
> > Kindly
> >
> > //Marcus
> >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>

Re: Multi get/put

Posted by Jean-Daniel Cryans <jd...@gmail.com>.
Marcus,

If you are currently building upon 0.2.0, the way to retrieve multiple rows
is to use a scanner, available from the client class HTable. The way to batch
multiple row updates is to use the BatchUpdate[] version of HTable.commit.
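
For example, something along these lines (a rough sketch against the 0.2-era
client API; the exact constructors and overloads are from memory and may
differ slightly, and the table, column and row names are made up):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Scanner;
    import org.apache.hadoop.hbase.io.BatchUpdate;
    import org.apache.hadoop.hbase.io.RowResult;

    public class BatchSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");

        // Batched puts: one BatchUpdate per row, committed together using the
        // BatchUpdate[] overload of HTable.commit mentioned above.
        BatchUpdate[] updates = new BatchUpdate[] {
            new BatchUpdate("row1"), new BatchUpdate("row2") };
        updates[0].put("info:name", "alpha".getBytes());
        updates[1].put("info:name", "beta".getBytes());
        table.commit(updates);

        // "Multi get": open a scanner at the first key of interest and stop
        // once the row keys pass the range you care about (keys are sorted).
        Scanner scanner = table.getScanner(new String[] { "info:name" }, "row1");
        try {
          RowResult row;
          while ((row = scanner.next()) != null) {
            String key = new String(row.getRow());
            if (key.compareTo("row2") > 0) {
              break; // past the last wanted row
            }
            System.out.println(key + " -> " + row);
          }
        } finally {
          scanner.close();
        }
      }
    }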

Hope this helps,

J-D

On Mon, Jul 28, 2008 at 5:38 AM, Marcus Herou <ma...@tailsweep.com>wrote:

> Hi guys.
>
> Is there a way of retrieving multiple "rows" with one server call ?
> Something like MySQL's "where id in (a,b,c...)
>
> Or more like this.
> List<SortedMap<Text,byte[]>> rows = HTable.getRows(Text[] rowKeys);
>
> I'm building a framework around HBase which would benefit of handling batch
> wise puts and gets.
>
> Kindly
>
> //Marcus
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>