Posted to user@mahout.apache.org by Jeff Zhang <zj...@gmail.com> on 2009/11/16 09:54:03 UTC

Have an idea of leveraging HBase for machine learning

Hi all,

I have started learning HBase recently, and I found that we can use HBase for
machine learning.
In the field of machine learning, we often need to handle matrices and
vectors, which are a good fit for storage in HBase.

For example, we often have to compute the doc-term matrix in text
classification.
With HBase, we can store each document as a row, using the document id as the
row key and the term frequencies (tf) as columns.
Say we have one document A titled "love", with the content:
I love this game.

Then we can store it as one HBase row:
A: {title:love=>1,
content:I=>1, content:love=>1, content:this=>1, content:game=>1}
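To make the row layout concrete, here is a minimal sketch in Python. This is
not HBase client code; the dict simply stands in for one row keyed by
family:qualifier, with term frequencies as values:

```python
from collections import Counter

def doc_to_row(title, content):
    # Model one HBase row as {"family:qualifier": value},
    # with one qualifier per term and the tf count as the value.
    row = {}
    for term, tf in Counter(title.split()).items():
        row["title:%s" % term] = tf
    for term, tf in Counter(content.split()).items():
        row["content:%s" % term] = tf
    return row

# Document A titled "love" with content "I love this game"
row_a = doc_to_row("love", "I love this game")
print(row_a)
```

With a real HBase client, each dict entry would become one cell written to the
document's row.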


Using HBase, it would be very easy for us to compute the similarity between
documents.
Another advantage of HBase over raw text data is that it is semi-structured,
so I think programming against HBase will be easier than working with the raw
data directly.
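As a sketch of the similarity computation over such rows, here is plain cosine
similarity between two sparse tf maps (illustrative Python, independent of any
HBase API):

```python
import math

def cosine(row_a, row_b):
    # Cosine similarity between sparse tf maps ({column: count});
    # only columns present in both rows contribute to the dot product.
    dot = sum(v * row_b.get(k, 0) for k, v in row_a.items())
    norm_a = math.sqrt(sum(v * v for v in row_a.values()))
    norm_b = math.sqrt(sum(v * v for v in row_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

a = {"content:I": 1, "content:love": 1, "content:this": 1, "content:game": 1}
b = {"content:love": 1, "content:this": 1, "content:song": 1}
similarity = cosine(a, b)
```

In an HBase-backed version, a and b would be fetched as rows and the same
arithmetic applied to their columns.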

This is what I have thought of so far; some of it may not be correct. I hope
to hear ideas from the experts.


Thank you.

Jeff Zhang

Re: Have an idea of leveraging HBase for machine learning

Posted by Jeff Zhang <zj...@gmail.com>.
BTW, there's a JDBCDataModel in Taste; I think it would be convenient for
users if we provided an HBaseDataModel that uses HBase as the data store.
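To illustrate the shape such a model could take, here is a hypothetical sketch
(the class and method names below are illustrative only, not Taste's actual
Java interface): the recommender talks to a small preference-store contract,
and an HBase-backed implementation would satisfy the same contract with
Gets/Scans instead of dict lookups.

```python
class PreferenceStore:
    # Hypothetical minimal data-model contract a recommender could use.
    def user_ids(self):
        raise NotImplementedError

    def preferences_from_user(self, user_id):
        raise NotImplementedError

class InMemoryStore(PreferenceStore):
    # Stand-in backend; an HBaseDataModel would read the same data
    # from a table keyed by user id instead of this dict.
    def __init__(self, prefs):
        self._prefs = prefs  # {user_id: {item_id: rating}}

    def user_ids(self):
        return sorted(self._prefs)

    def preferences_from_user(self, user_id):
        return dict(self._prefs[user_id])

store = InMemoryStore({"u1": {"i1": 4.0, "i2": 2.0}, "u2": {"i1": 5.0}})
```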


Jeff Zhang



On Mon, Nov 16, 2009 at 4:54 PM, Jeff Zhang <zj...@gmail.com> wrote:

> Hi all,
>
> I have started learning HBase recently, and I found that we can use HBase
> for machine learning.
> In the field of machine learning, we often need to handle matrices and
> vectors, which are a good fit for storage in HBase.
>
> For example, we often have to compute the doc-term matrix in text
> classification.
> With HBase, we can store each document as a row, using the document id as
> the row key and the term frequencies (tf) as columns.
> Say we have one document A titled "love", with the content:
> I love this game.
>
> Then we can store it as one HBase row:
> A: {title:love=>1,
> content:I=>1, content:love=>1, content:this=>1, content:game=>1}
>
>
> Using HBase, it would be very easy for us to compute the similarity between
> documents.
> Another advantage of HBase over raw text data is that it is semi-structured,
> so I think programming against HBase will be easier than working with the
> raw data directly.
>
> This is what I have thought of so far; some of it may not be correct. I hope
> to hear ideas from the experts.
>
>
> Thank you.
>
> Jeff Zhang
>
>
>
>

Re: Have an idea of leveraging HBase for machine learning

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 16, 2009, at 3:54 AM, Jeff Zhang wrote:

> Hi all,
> 
> I have started learning HBase recently, and I found that we can use HBase
> for machine learning.
> In the field of machine learning, we often need to handle matrices and
> vectors, which are a good fit for storage in HBase.
>
> For example, we often have to compute the doc-term matrix in text
> classification.
> With HBase, we can store each document as a row, using the document id as
> the row key and the term frequencies (tf) as columns.
> Say we have one document A titled "love", with the content:
> I love this game.
>
> Then we can store it as one HBase row:
> A: {title:love=>1,
> content:I=>1, content:love=>1, content:this=>1, content:game=>1}
>
>
> Using HBase, it would be very easy for us to compute the similarity between
> documents.
> Another advantage of HBase over raw text data is that it is semi-structured,
> so I think programming against HBase will be easier than working with the
> raw data directly.
>
> This is what I have thought of so far; some of it may not be correct. I hope
> to hear ideas from the experts.
> 





If you check out the classification algorithms in Mahout, they have HBase as a storage option.  Feedback on them would be appreciated.

I tend to think we should stay agnostic of the underlying storage as much as possible. 

-Grant


Re: Have an idea of leveraging HBase for machine learning

Posted by Andrew Purtell <ap...@apache.org>.
Ted,

> within a column family, the number of distinct columns is nearly
> irrelevant and that the only cost is incurred by the columns that are
> actually present.  This would make HBase roughly equivalent to a file-based
> representation that stores the labels for all non-zero elements in a row.

Correct.

   - Andy




________________________________
From: Ted Dunning <te...@gmail.com>
To: mahout-dev@lucene.apache.org
Cc: mahout-user@lucene.apache.org
Sent: Mon, November 16, 2009 11:17:34 AM
Subject: Re: Have an idea of leveraging HBase for machine learning

Andrew,

This is very good news.  The rapid progress of HBase recently is really
exciting (and hard to keep up with).

The performance bounds in the sparse matrices we are liable to see are
usually forced by a very few very long row vectors (sparsity >10% in some
cases) and by the vast majority of very sparse row vectors (sparsity <<
0.1%).  One or the other of these typically causes problems in
representation and algorithms.

With HBase, I would be worried about whether it would be efficient to store
sparse rows with only a few entries.  It sounds like you are saying that,
within a column family, the number of distinct columns is nearly
irrelevant and that the only cost is incurred by the columns that are
actually present.  This would make HBase roughly equivalent to a file-based
representation that stores the labels for all non-zero elements in a row.

Is my understanding of what you said correct?

On Mon, Nov 16, 2009 at 10:45 AM, Andrew Purtell <ap...@apache.org> wrote:

> > Practically speaking, it probably isn't feasible to have an HBase column
> per matrix column
>
> Just in case that is predicated on old information: Distinguishing between
> columns and column qualifiers, it is architecturally feasible to have a
> single column family with millions of values in a row under distinct
> qualifiers. Someone with more depth in this space than I could say whether
> on the order of millions is sufficient for handling very large sparse
> matrices, and how compelling that might be. In practical terms, HBASE-1537
> (https://issues.apache.org/jira/browse/HBASE-1537) makes retrieval of large
> rows with scanners possible with current svn trunk or the upcoming release
> 0.21.0. (Chunked get of a single row is not currently under consideration.)
> Previously, due to a limitation of the implementation, trying to retrieve
> on the order of a million values stored in a column would have blown up
> either the region server or the client, due to the need to pack all the
> data into a single RPC buffer.




-- 
Ted Dunning, CTO
DeepDyve



      

Re: Have an idea of leveraging HBase for machine learning

Posted by Ted Dunning <te...@gmail.com>.
Andrew,

This is very good news.  The rapid progress of HBase recently is really
exciting (and hard to keep up with).

The performance bounds in the sparse matrices we are liable to see are
usually forced by a very few very long row vectors (sparsity >10% in some
cases) and by the vast majority of very sparse row vectors (sparsity <<
0.1%).  One or the other of these typically causes problems in
representation and algorithms.

With HBase, I would be worried about whether it would be efficient to store
sparse rows with only a few entries.  It sounds like you are saying that,
within a column family, the number of distinct columns is nearly
irrelevant and that the only cost is incurred by the columns that are
actually present.  This would make HBase roughly equivalent to a file-based
representation that stores the labels for all non-zero elements in a row.

Is my understanding of what you said correct?
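The file-based representation being compared against can be sketched simply:
keep only the labels and values of the non-zero elements of each row, so
storage cost tracks the entries actually present rather than the nominal row
length (illustrative Python, not tied to any particular storage engine):

```python
def sparse_row(dense):
    # Keep only non-zero elements as {column_index: value}.
    return {j: v for j, v in enumerate(dense) if v != 0}

dense = [0.0, 3.0, 0.0, 0.0, 1.0, 0.0]
row = sparse_row(dense)
# len(row) is 2: cost scales with the non-zeros, not with the 6 nominal slots
```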

On Mon, Nov 16, 2009 at 10:45 AM, Andrew Purtell <ap...@apache.org> wrote:

> > Practically speaking, it probably isn't feasible to have an HBase column
> per matrix column
>
> Just in case that is predicated on old information: Distinguishing between
> columns and column qualifiers, it is architecturally feasible to have a
> single column family with millions of values in a row under distinct
> qualifiers. Someone with more depth in this space than I could say whether
> on the order of millions is sufficient for handling very large sparse
> matrices, and how compelling that might be. In practical terms, HBASE-1537
> (https://issues.apache.org/jira/browse/HBASE-1537) makes retrieval of large
> rows with scanners possible with current svn trunk or the upcoming release
> 0.21.0. (Chunked get of a single row is not currently under consideration.)
> Previously, due to a limitation of the implementation, trying to retrieve
> on the order of a million values stored in a column would have blown up
> either the region server or the client, due to the need to pack all the
> data into a single RPC buffer.




-- 
Ted Dunning, CTO
DeepDyve

Re: Have an idea of leveraging HBase for machine learning

Posted by Andrew Purtell <ap...@apache.org>.
> Practically speaking, it probably isn't feasible to have an HBase column per matrix column

Just in case that is predicated on old information: Distinguishing between columns and column qualifiers, it is architecturally feasible to have a single column family with millions of values in a row under distinct qualifiers. Someone with more depth in this space than I could say whether on the order of millions is sufficient for handling very large sparse matrices, and how compelling that might be. In practical terms, HBASE-1537 (https://issues.apache.org/jira/browse/HBASE-1537) makes retrieval of large rows with scanners possible with current svn trunk or the upcoming release 0.21.0. (Chunked get of a single row is not currently under consideration.) Previously, due to a limitation of the implementation, trying to retrieve on the order of a million values stored in a column would have blown up either the region server or the client, due to the need to pack all the data into a single RPC buffer.
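The scanner-based retrieval described above amounts to reading one wide row in
bounded chunks instead of a single giant RPC response. A rough simulation of
that batching idea (plain Python, not the HBase client API):

```python
def scan_row_in_batches(row, batch_size):
    # Yield a wide row's (qualifier, value) pairs in fixed-size chunks,
    # so no single response has to carry every column at once.
    items = sorted(row.items())
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

wide_row = {"q%06d" % i: i for i in range(10)}  # 10 qualifiers in one row
batches = list(scan_row_in_batches(wide_row, batch_size=3))
# 4 chunks: three of 3 columns and one of 1
```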

> Mahout is trying to stay pretty agnostic relative to data storage methods.
> [...]
> We need to support all of those options.

Good to hear. If you have any problems with HBase, please come over to hbase-user@. 

Best regards,

   - Andy
     Committer, HBase
     Lurker, Mahout




________________________________
From: Ted Dunning <te...@gmail.com>
To: mahout-dev@lucene.apache.org
Cc: mahout-user@lucene.apache.org
Sent: Mon, November 16, 2009 9:35:14 AM
Subject: Re: Have a idea of leveraging hbase for machine learning

Jeff,

Glad to hear you are looking at Mahout.

Practically speaking, it probably isn't feasible to have an HBase column per
matrix column.  That makes storage of matrix data in HBase somewhat less
compelling, although clearly still very useful for some applications.

As Grant pointed out, Mahout is trying to stay pretty agnostic relative to
data storage methods.  Some people need to read matrices from Lucene
indexes, others from files, still others from HBase.  We need to support all
of those options.

Your suggestion about making sure that Taste supports HBase is a good one.

On Mon, Nov 16, 2009 at 12:54 AM, Jeff Zhang <zj...@gmail.com> wrote:

> Then we can store them as one hbase row:
> A: {title:love=>1,
> content:I=>1,content:love=>1,content:this=>1,content:game=>1}
>
>
> Using hbase, it will be very easy for us to compute the similarity between
> documents.
> And another  advantage of hbase compared to raw text data is that it's
> semi-structured. And I think it will be easy for programming if we use
> hbase
> rather than the raw data.
>



-- 
Ted Dunning, CTO
DeepDyve



      

Re: Have an idea of leveraging HBase for machine learning

Posted by Ted Dunning <te...@gmail.com>.
You should be able to create and attach a patch to the JIRA without being
assigned the ownership.

On Tue, Nov 17, 2009 at 11:47 PM, Jeff Zhang <zj...@gmail.com> wrote:

> Hi Ted,
>
> I have created a JIRA issue for this, but I cannot assign the task to
> myself. Do I have permission to work on it and submit a patch?
>
>
> Thank you
>
> Jeff Zhang
>
>
>
> On Tue, Nov 17, 2009 at 1:35 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Jeff,
> >
> > Glad to hear you are looking at Mahout.
> >
> > Practically speaking, it probably isn't feasible to have an HBase column
> > per
> > matrix column.  That makes storage of matrix data in HBase somewhat less
> > compelling, although clearly still very useful for some applications.
> >
> > As Grant pointed out, Mahout is trying to stay pretty agnostic relative
> to
> > data storage methods.  Some people need to read matrices from Lucene
> > indexes, others from files, still others from hbase.  We need to support
> > all
> > of those options.
> >
> > Your suggestion about making sure that Taste supports hbase is a good
> one.
> >
> > On Mon, Nov 16, 2009 at 12:54 AM, Jeff Zhang <zj...@gmail.com> wrote:
> >
> > > Then we can store them as one hbase row:
> > > A: {title:love=>1,
> > > content:I=>1,content:love=>1,content:this=>1,content:game=>1}
> > >
> > >
> > > Using hbase, it will be very easy for us to compute the similarity
> > between
> > > documents.
> > > And another  advantage of hbase compared to raw text data is that it's
> > > semi-structured. And I think it will be easy for programming if we use
> > > hbase
> > > rather than the raw data.
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Have an idea of leveraging HBase for machine learning

Posted by Jeff Zhang <zj...@gmail.com>.
Hi Ted,

I have created a JIRA issue for this, but I cannot assign the task to
myself. Do I have permission to work on it and submit a patch?


Thank you

Jeff Zhang



On Tue, Nov 17, 2009 at 1:35 AM, Ted Dunning <te...@gmail.com> wrote:

> Jeff,
>
> Glad to hear you are looking at Mahout.
>
> Practically speaking, it probably isn't feasible to have an HBase column
> per
> matrix column.  That makes storage of matrix data in HBase somewhat less
> compelling, although clearly still very useful for some applications.
>
> As Grant pointed out, Mahout is trying to stay pretty agnostic relative to
> data storage methods.  Some people need to read matrices from Lucene
> indexes, others from files, still others from hbase.  We need to support
> all
> of those options.
>
> Your suggestion about making sure that Taste supports hbase is a good one.
>
> On Mon, Nov 16, 2009 at 12:54 AM, Jeff Zhang <zj...@gmail.com> wrote:
>
> > Then we can store them as one hbase row:
> > A: {title:love=>1,
> > content:I=>1,content:love=>1,content:this=>1,content:game=>1}
> >
> >
> > Using hbase, it will be very easy for us to compute the similarity
> between
> > documents.
> > And another  advantage of hbase compared to raw text data is that it's
> > semi-structured. And I think it will be easy for programming if we use
> > hbase
> > rather than the raw data.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Have an idea of leveraging HBase for machine learning

Posted by Ted Dunning <te...@gmail.com>.
Jeff,

Glad to hear you are looking at Mahout.

Practically speaking, it probably isn't feasible to have an HBase column per
matrix column.  That makes storage of matrix data in HBase somewhat less
compelling, although clearly still very useful for some applications.

As Grant pointed out, Mahout is trying to stay pretty agnostic relative to
data storage methods.  Some people need to read matrices from Lucene
indexes, others from files, still others from HBase.  We need to support all
of those options.
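Storage agnosticism of this kind usually comes down to a thin source interface
with one adapter per backend. A hypothetical sketch follows (the names here
are illustrative, not Mahout's actual API): each backend only has to stream
sparse rows, and the algorithms never see where they came from.

```python
class MatrixSource:
    # Hypothetical contract: any backend that can stream (row_id, sparse row).
    def rows(self):
        raise NotImplementedError

class FileSource(MatrixSource):
    # A LuceneSource or HBaseSource would implement the same rows() contract
    # against an index or a table instead of text lines.
    def __init__(self, lines):
        self._lines = lines  # e.g. "doc1 love:1 game:2"

    def rows(self):
        for line in self._lines:
            row_id, *cells = line.split()
            yield row_id, {k: int(v) for k, v in (c.split(":") for c in cells)}

src = FileSource(["doc1 love:1 game:2", "doc2 love:3"])
matrix = dict(src.rows())
```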

Your suggestion about making sure that Taste supports HBase is a good one.

On Mon, Nov 16, 2009 at 12:54 AM, Jeff Zhang <zj...@gmail.com> wrote:

> Then we can store them as one hbase row:
> A: {title:love=>1,
> content:I=>1,content:love=>1,content:this=>1,content:game=>1}
>
>
> Using hbase, it will be very easy for us to compute the similarity between
> documents.
> And another  advantage of hbase compared to raw text data is that it's
> semi-structured. And I think it will be easy for programming if we use
> hbase
> rather than the raw data.
>



-- 
Ted Dunning, CTO
DeepDyve
