Posted to user@hbase.apache.org by "Goel, Ankur" <An...@corp.aol.com> on 2008/04/08 15:42:02 UTC

RE: HBase performance tuning

Hi Stack
     I uploaded the heritrix2-hbase-writer code here:
http://heritrix2-hbase-writer.googlecode.com/files/heritrix2.0-hbase-writer.jar

The jar is 15 MB because it bundles all the libraries needed to build
the writer code.
The actual code is split across 5 files.

Do take a look.

Thanks
-Ankur


-----Original Message-----
From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com] 
Sent: Friday, March 28, 2008 1:06 PM
To: hbase-user@hadoop.apache.org
Subject: RE: HBase performance tuning

Thanks for posting the code, Stack. One thing that I saw missing in my
code is the use of a writer pool.
I'll incorporate that in my code and make some other changes as well.
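
As a rough illustration of the writer-pool idea (not the code from either
writer; the class and method names here are made up for the example), a
fixed set of writer objects can be shared between crawler threads through
a blocking queue:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical fixed-size pool. W stands in for whatever object performs
 * the actual writes; nothing here touches a real HBase or Heritrix API.
 */
class WriterPool<W> {
    private final BlockingQueue<W> idle;

    /** The list must contain at least one writer. */
    WriterPool(List<W> writers) {
        idle = new ArrayBlockingQueue<>(writers.size(), false, writers);
    }

    /** Blocks until a writer is free, then hands it to the caller. */
    W borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for a writer", e);
        }
    }

    /** Returns a borrowed writer so another thread can reuse it. */
    void giveBack(W writer) {
        idle.add(writer);
    }

    /** Number of writers currently not in use. */
    int available() {
        return idle.size();
    }
}
```

A thread would call borrow(), write its record, and call giveBack() in a
finally block so the writer always returns to the pool.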

There shouldn't be any issues in contributing the updated code
except for converting the schema to make it column-oriented. At the
moment it's a simple RDBMS schema converted directly to an HBase schema
by substituting column name with column family.

I'll try to rework it to fit the column-oriented design. Feel
free to suggest changes if you like.
The details were given in an earlier post.
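
To make the difference concrete, here is a small sketch that uses plain
Java maps rather than any HBase API. It assumes HBase's family:qualifier
column naming; the field names (url, mime) and the family name "content"
are made up for illustration and are not the writer's actual schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SchemaSketch {
    /** RDBMS-style mapping: every former SQL column becomes its own family. */
    static Map<String, String> familyPerColumn(Map<String, String> sqlRow) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sqlRow.entrySet()) {
            // Family is the old column name; the qualifier is left empty.
            cells.put(e.getKey() + ":", e.getValue());
        }
        return cells;
    }

    /** Column-oriented mapping: related fields share one family. */
    static Map<String, String> sharedFamily(String family, Map<String, String> sqlRow) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sqlRow.entrySet()) {
            // The old column name becomes the qualifier inside the family.
            cells.put(family + ":" + e.getKey(), e.getValue());
        }
        return cells;
    }
}
```

With the first mapping a row has one family per field (url:, mime:); with
the second, the same fields sit in a single family (content:url,
content:mime), so new fields can be added without changing the table's
set of families.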

Thanks
-Ankur

-----Original Message-----
From: stack [mailto:stack@duboce.net]
Sent: Friday, March 28, 2008 11:54 AM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase performance tuning

Goel, Ankur wrote:
>  ...
>
> I'll check and let you know if the code can be contributed.
> Once I get a green light, I'll make some modifications to make it more
> generic and share it with you folks to understand how we can improve it
> further before posting.
>   

A while back, I had a go at making such a Writer: see
http://www.duboce.net/~stack/hbase-writer.tgz.  It's old, probably won't
work w/ current hbase -- I haven't tried it -- and it's for the Heritrix
1.x generation but shouldn't be hard to update.  When I left it, I was
trying to mavenize it and was going to put the needed jars -- hadoop,
etc. -- up on the Archive's build box.  Publishing such a Writer is a
little awkward given the different licenses.  Having maven pull jars
seemed like one way of working within the constraints imposed by
licensing (Archive is apparently moving toward Apache licensing which
should alleviate at least the above issue).

St.Ack

> Thanks
> -Ankur
>
>
>
> -----Original Message-----
> From: stack [mailto:stack@duboce.net]
> Sent: Thursday, March 27, 2008 10:08 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase performance tuning
>
> I have some familiarity with that crawler.
>
> Tell us more about your writer.  Is it proprietary?  If not, can we get
> it into a place where others could use it if wanted?
>
> Thanks,
> St.Ack
>
>
> Goel, Ankur wrote:
>   
>> I am crawling the web indeed, but only the sites that are present in 
>> my seedlist. The crawler used here is Heritrix 2.0 - 
>> http://webteam.archive.org/confluence/display/Heritrix/2.0.0.
>>
>> I developed a Heritrix-specific HBase writer that can be integrated 
>> with Heritrix to write the crawled content directly into HBase.
>>
>> -Ankur
>>   
>>     


RE: HBase performance tuning

Posted by "Goel, Ankur" <An...@corp.aol.com>.
 
I appreciate it :-)
Thanks for your feedback Stack.

Once I rectify the few things you mentioned, I will announce it on the
Heritrix mailing list too, as I feel that will encourage users to look
into it.

Another thing I am working on, which you might want to take a look at,
is converting this schema into a column-oriented design: maybe have a
single table instead of separate web_content and seedlist tables, and
modify the fields accordingly.


-----Original Message-----
From: stack [mailto:stack@duboce.net] 
Sent: Tuesday, April 08, 2008 10:49 PM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase performance tuning

Excellent.

I filed a couple of issues against your bundle (smile) up on googlecode.

Here are a few other notes:

Why is the src not checked in?  When I browse to the 'source' tab, there
is nothing there.

You've bundled jars that are LGPL (fastutil, archive-overlay, etc.).  
The archive ones are supposedly going to be relicensed as Apache but
I've not heard that is the case for fastutil.

Do you want to put the writer into the org.archive package rather than
an aol package or whatever package you use when developing software
outside aol-time?

It's nice that you include a tool to create tables (make a note in the
doc that this exists -- and the doc should include a description of the
schema you're using).

Is it possible to filter outlinks -- i.e. run outlinks through a
Heritrix filter (maybe it's not) -- rather than do it here inside your
writer?  Same for canonicalizing?  If not, could you add a hook to call
filters?  (Not important.)

Does it work?

Great stuff Ankur (Would suggest announcing on Heritrix list too -- you
might get feedback from there).

St.Ack




Re: HBase performance tuning

Posted by stack <st...@duboce.net>.
Excellent.

I filed a couple of issues against your bundle (smile) up on googlecode.

Here are a few other notes:

Why is the src not checked in?  When I browse to the 'source' tab, there 
is nothing there.

You've bundled jars that are LGPL (fastutil, archive-overlay, etc.).  
The archive ones are supposedly going to be relicensed as Apache but 
I've not heard that is the case for fastutil.

Do you want to put the writer into the org.archive package rather than
an aol package or whatever package you use when developing software
outside aol-time?

It's nice that you include a tool to create tables (make a note in the
doc that this exists -- and the doc should include a description of the
schema you're using).

Is it possible to filter outlinks -- i.e. run outlinks through a
Heritrix filter (maybe it's not) -- rather than do it here inside your
writer?  Same for canonicalizing?  If not, could you add a hook to call
filters?  (Not important.)

Does it work?

Great stuff Ankur (Would suggest announcing on Heritrix list too -- you 
might get feedback from there).

St.Ack


