Posted to user@hbase.apache.org by "sreejith P. K." <sr...@nesote.com> on 2011/03/15 18:19:25 UTC

HBase schema design and retrieving values through REST interface

Hello experts,

I have a scenario as follows.
I need to maintain a huge table for a 'web crawler' project in HBase.
Basically it contains thousands of keywords, and for each keyword I need to
maintain a list of URLs (again counting in the thousands). Corresponding to
each URL, I need to store a number, which in turn represents the priority
value it holds.
Let me explain a bit. Suppose I have the keyword 'united states'; I need to
store about ten thousand URLs corresponding to that keyword, and each URL
will be holding a priority value, which is an integer. Again, I have
thousands of keywords like that. The unusual part is that I need to do the
project in PHP.

I have configured a Hadoop/HBase cluster consisting of three machines. My
plan was to design the schema with the keyword as the row key and the URLs
kept as columns in a single column family. The schema looked fine at first.
I have done a lot of research on how to retrieve the URL list given a
keyword, and managed a way out by preg-matching the XML output of the URL
http://localhost:8080/tablename/rowkey (I used the REST interface). It works
fine if the URL list has a limited number of URLs, but when it runs into the
thousands, it seems I cannot fetch the XML data at all!
Now I am in a do-or-die situation. Please correct me if my schema design
needs any changes (I do believe it should change!) and please help me
retrieve the column family values (URLs) corresponding to each row key in an
efficient way. Please also guide me on how I can do the same using the
PHP/REST interface.
Thanks in advance.

Sreejith
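
For reference, the row fetch described above can be done from PHP without
preg-matching the response. Below is a minimal sketch, assuming the REST
gateway (Stargate) runs on localhost:8080 and a table named 'keywords'; the
table and column names are illustrative, and it assumes that in the Stargate
XML the column names and cell values come back base64-encoded.

<?php
// Minimal sketch: fetch one row through the HBase REST gateway and parse
// the XML with SimpleXML instead of preg_match.
// Assumptions: gateway on localhost:8080, table 'keywords' (illustrative),
// row key 'united states'; column names and values are base64-encoded.
$table  = 'keywords';
$rowKey = rawurlencode('united states');

$ch = curl_init("http://localhost:8080/$table/$rowKey");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: text/xml'));
$xml = curl_exec($ch);
curl_close($ch);

$cellSet = new SimpleXMLElement($xml);
foreach ($cellSet->Row as $row) {
    foreach ($row->Cell as $cell) {
        $column = base64_decode((string) $cell['column']); // e.g. "urls:http://example.com"
        $value  = base64_decode((string) $cell);           // e.g. the priority number
        echo "$column => $value\n";
    }
}
?>

Note that this still pulls the whole row back in one response, so by itself
it does not address the large-row problem discussed in the replies below.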

Re: HBase schema design and retrieving values through REST interface

Posted by "sreejith P. K." <sr...@nesote.com>.
Hi Andrew,
I am new to HBase. Could you elaborate on that, and can you help me with
the schema design?


If I am wrong with the schema, please help me set up a new one. From the
table I should be able to list all URLs corresponding to any given keyword,
ordered by descending priority value. I may also need to limit the results
(for example, with a condition on the priority, like 'where priority > 30').
Thanks in advance.

Sreejith PK
Nesote Technologies (P) Ltd
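
One direction worth considering for the 'order by descending priority' and
'priority > 30' requirements is a tall table with a composite row key, rather
than one very wide row per keyword. The sketch below only illustrates that
pattern; the key layout, padding width and helper names are all made up, not
something confirmed anywhere in this thread.

<?php
// Illustrative composite row key: keyword, then an inverted zero-padded
// priority, then a hash of the URL. Because HBase stores rows in sorted
// order, a prefix scan on "keyword|" returns entries from highest to
// lowest priority, and "priority > 30" becomes a stop row.
define('MAX_PRIORITY', 99999);

function makeRowKey($keyword, $priority, $url) {
    // Inverting the priority makes higher priorities sort first.
    $inverted = sprintf('%05d', MAX_PRIORITY - $priority);
    // The full URL would be stored as a cell value; the key only carries a hash.
    return $keyword . '|' . $inverted . '|' . md5($url);
}

// Scan boundaries for: keyword 'united states', priority > 30, descending order.
// HBase scan stop rows are exclusive, so this cuts off at priority 30 and below.
$startRow = 'united states|';
$stopRow  = 'united states|' . sprintf('%05d', MAX_PRIORITY - 30);

echo makeRowKey('united states', 42, 'http://example.com/a'), "\n";
echo "scan from [$startRow] to [$stopRow]\n";
?>

A scan over that key range (for example with the batch-limited REST scanner
shown further down the thread) streams the URLs back in descending priority
order without ever materialising one huge row.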

Re: HBase schema design and retrieving values through REST interface

Posted by Stack <st...@duboce.net>.
Thank you Andrew.
St.Ack


Re: HBase schema design and retrieving values through REST interface

Posted by Andrew Purtell <ap...@apache.org>.
>  This facility is not exposed in the REST API at the moment
> (not that I know of -- please someone correct me if I'm
> wrong).

Wrong. :-)

See ScannerModel in the rest package: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/rest/model/ScannerModel.html

ScannerModel#setBatch

   - Andy
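
On the PHP side, the batch size carried by ScannerModel is supplied when the
scanner is created. The sketch below shows the stateful scanner protocol as I
understand it from the Stargate documentation: POST a scanner definition, read
its URL from the Location header, GET it until the gateway returns 204, then
DELETE it. The gateway address, table name and batch size are illustrative;
check the exact XML attributes against your HBase version.

<?php
// Rough sketch: create a REST scanner with a batch limit, then page through it.
$base  = 'http://localhost:8080';   // assumed REST gateway
$table = 'keywords';                // illustrative table name

// 1. Create the scanner. startRow/endRow attributes (base64-encoded) could be
//    added here to restrict the scan to one keyword's key range.
$scannerXml = '<Scanner batch="500"/>';   // at most 500 cells per fetch
$ch = curl_init("$base/$table/scanner");
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $scannerXml,
    CURLOPT_HTTPHEADER     => array('Content-Type: text/xml'),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HEADER         => true,        // the scanner URL comes back in Location
));
$response = curl_exec($ch);
curl_close($ch);
preg_match('/^Location:\s*(\S+)/mi', $response, $m);
$scannerUrl = $m[1];

// 2. Fetch batches until the gateway answers 204 No Content.
do {
    $ch = curl_init($scannerUrl);
    curl_setopt_array($ch, array(
        CURLOPT_HTTPHEADER     => array('Accept: text/xml'),
        CURLOPT_RETURNTRANSFER => true,
    ));
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($code == 200) {
        // $body is a CellSet document; parse it with SimpleXML as in the
        // single-row example near the top of the thread.
        echo $body, "\n";
    }
} while ($code == 200);

// 3. Release the scanner on the server.
$ch = curl_init($scannerUrl);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'DELETE');
curl_exec($ch);
curl_close($ch);
?>

Because the batch limit caps the number of cells per response, a very wide row
comes back spread over several fetches instead of one response that has to fit
in memory, which is also how a "first five hundred URLs" style limit can be
imposed from the client side.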




Re: HBase schema design and retrieving values through REST interface

Posted by Stack <st...@duboce.net>.
You can limit the return when scanning from the Java API; see
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
This facility is not exposed in the REST API at the moment (not that I know
of -- please someone correct me if I'm wrong). So yes, wide rows with
thousands of elements of some size, since they need to be composed entirely
in RAM, could bring on an OOME if the composed size exceeds the available
heap.

St.Ack



Re: HBase schema design and retrieving values through REST interface

Posted by "sreejith P. K." <sr...@nesote.com>.
With this schema, if I can limit the column family over a particular range,
I can manage everything else (like selecting the first n columns of a column
family).

Sreejith


-- 
Sreejith PK
Nesote Technologies (P) Ltd

Re: HBase schema design and retrieving values through REST interface

Posted by "sreejith P. K." <sr...@nesote.com>.
@ Jean-Daniel,

As I told, each row key contains thousands of column family values (maybe I
am wrong with the schema design). I started REST and tried to cURL
http://localhost/tablename/rowname. It seems it will work only with a limited
amount of data (maybe I can limit the cURL output), but how can I limit the
column values for a particular row?
Suppose I have two thousand URLs under a keyword and I need to fetch the
URLs but limit the result to five hundred. How is that possible?

@ tsuna,

It seems http://www.elasticsearch.org/ is using CouchDB, right?


-- 
Sreejith PK
Nesote Technologies (P) Ltd

Re: HBase schema design and retrieving values through REST interface

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Can you tell why it's not able to get the bigger rows? Why would you
try another schema if you don't even know what's going on right now?
If you have the same issue with the new schema, you're back to square
one, right?

Looking at the logs should give you some hints.

J-D


Re: HBase schema design and retrieving values through REST interface

Posted by tsuna <ts...@gmail.com>.
On Tue, Mar 15, 2011 at 10:19 AM, sreejith P. K. <sr...@nesote.com> wrote:

Have you looked at ElasticSearch?  Seems like it would do what you
want out of the box.  In your PHP app you simply need to make REST
calls with a bit of JSON here and there, and that would be all.
http://www.elasticsearch.org/

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
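
For what it is worth, the keyword-to-URL lookup with a priority filter maps
fairly directly onto that JSON search API. The sketch below only illustrates
the kind of calls involved; the node address, index, type and field names are
all made up, and it assumes the keyword field is indexed unanalyzed so an
exact term match works.

<?php
// Illustrative sketch: index one (keyword, url, priority) document, then ask
// for the URLs of a keyword with priority > 30, highest priority first.
function esRequest($method, $path, $json) {
    $ch = curl_init('http://localhost:9200' . $path);   // assumed default ES address
    curl_setopt_array($ch, array(
        CURLOPT_CUSTOMREQUEST  => $method,
        CURLOPT_POSTFIELDS     => $json,
        CURLOPT_HTTPHEADER     => array('Content-Type: application/json'),
        CURLOPT_RETURNTRANSFER => true,
    ));
    $out = curl_exec($ch);
    curl_close($ch);
    return json_decode($out, true);
}

// Index one document per (keyword, url) pair.
esRequest('PUT', '/crawler/links/' . md5('united states|http://example.com/a'),
    json_encode(array(
        'keyword'  => 'united states',
        'url'      => 'http://example.com/a',
        'priority' => 42,
    )));

// Query: URLs for 'united states' with priority > 30, sorted by priority desc.
$result = esRequest('POST', '/crawler/links/_search', json_encode(array(
    'query' => array('bool' => array('must' => array(
        array('term'  => array('keyword'  => 'united states')),
        array('range' => array('priority' => array('gt' => 30))),
    ))),
    'sort' => array(array('priority' => 'desc')),
    'size' => 500,
)));

print_r($result['hits']['hits']);
?>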