Posted to user@nutch.apache.org by k4200 <k4...@kazu.tv> on 2013/01/12 10:09:04 UTC

Size limit for fetched pages

Hi,

I started using Nutch with HBase several days ago following the
Nutch2Tutorial shown below and it seemed to start working.
http://wiki.apache.org/nutch/Nutch2Tutorial

Today, I noticed that page contents were truncated at 64KB. Actually,
those pages are smaller than 64KB, but the contents are UTF-8, and
multi-byte characters appear as escapes like "\xE3\x81\x93" when
stored in HBase, so the size becomes almost 4 times larger
than that of the original content.

Here are my questions:
1. How can I fix this? I'm guessing that changing the block size in HBase
would fix the problem, but I don't know how. In gora.properties, perhaps?
2. After fixing the configuration, I need to fetch those
incomplete pages again. Is there an easy way to do this?

Any help would be appreciated.

Thanks,
Kaz
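
A note on the "almost 4 times larger" observation: a form like "\xE3\x81\x93" is what an escaped dump of raw bytes looks like, so the stored value may well still be 3 bytes per character, with only the displayed text growing. The sketch below illustrates that arithmetic in plain Java. The class name EscapedBytesDemo and the helper toEscaped are hypothetical and only mimic a generic \xHH escaping; they are not HBase's own code.

```java
import java.nio.charset.StandardCharsets;

public class EscapedBytesDemo {

    // Hypothetical helper: keeps printable ASCII (except backslash) as-is
    // and writes every other byte as \xHH, as a binary-safe dump typically does.
    static String toEscaped(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x20 && v < 0x7F && v != '\\') {
                sb.append((char) v);
            } else {
                sb.append(String.format("\\x%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u3053"; // HIRAGANA LETTER KO, one character
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);
        String shown = toEscaped(stored);

        System.out.println("characters:   " + text.length());
        System.out.println("UTF-8 bytes:  " + stored.length);
        System.out.println("escaped form: " + shown + " (" + shown.length() + " chars)");
    }
}
```

If the limit in play is counted in bytes, a page with fewer than 64K characters of Japanese text can still exceed 64KB once encoded as UTF-8, independent of how the bytes are displayed.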

Re: Size limit for fetched pages

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

[UPDATE]

I was completely incorrect about gora-hbase not supporting a configurable
block size within the table mapping XML file.
Currently, the configurable attributes that can be specified within gora-hbase
include the following:

String familyName  = fieldElement.getAttributeValue("name");
String compression = fieldElement.getAttributeValue("compression");
String blockCache  = fieldElement.getAttributeValue("blockCache");
String blockSize   = fieldElement.getAttributeValue("blockSize");
String bloomFilter = fieldElement.getAttributeValue("bloomFilter");
String maxVersions = fieldElement.getAttributeValue("maxVersions");
String timeToLive  = fieldElement.getAttributeValue("timeToLive");
String inMemory    = fieldElement.getAttributeValue("inMemory");

These should be specified within the <table> block of the mapping file. For
more information on this, head over to user@gora.
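
As a rough illustration only, a mapping fragment using one of these attributes might look like the following. It assumes that the attributes sit on a <family> element inside <table> (suggested by the familyName variable above), that the file is conf/gora-hbase-mapping.xml, and that the table is "webpage" with a family "f". The blockSize value is an arbitrary example, not a recommendation.

```xml
<!-- conf/gora-hbase-mapping.xml (fragment; names and values are assumptions) -->
<table name="webpage">
  <!-- blockSize is given in bytes; 131072 is only an example value -->
  <family name="f" maxVersions="1" blockSize="131072"/>
</table>
```

Note that, per the follow-up quoted below, the truncation in this thread was resolved through http.content.limit rather than through the block size.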

Thank you
Lewis

On Sun, Jan 13, 2013 at 6:49 AM, k4200 <k4...@kazu.tv> wrote:

> Hi Feng and Lewis,
>
> Thanks for your replies! I tried a few different settings and finally
> found out that increasing "http.content.limit" fixed the problem.
>
> Kaz



-- 
*Lewis*

Re: Size limit for fetched pages

Posted by k4200 <k4...@kazu.tv>.
Hi Feng and Lewis,

Thanks for your replies! I tried a few different settings and finally
found out that increasing "http.content.limit" fixed the problem.

Kaz
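
For anyone landing here later, the change would look roughly like the fragment below in conf/nutch-site.xml. The value shown is an assumed example, not the one used in this thread. As far as I know, the limit is counted in bytes, the shipped default is 65536 (which would match the 64KB cutoff), and a negative value disables truncation, but please check nutch-default.xml for your version.

```xml
<!-- conf/nutch-site.xml (fragment) -->
<property>
  <name>http.content.limit</name>
  <!-- Assumed example value: 1 MB, in bytes -->
  <value>1048576</value>
</property>
```

Pages fetched before the change keep their truncated content until they are fetched again.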

2013/1/13 Lewis John Mcgibbney <le...@gmail.com>:
> Hi Kaz,
>
> On Sat, Jan 12, 2013 at 1:09 AM, k4200 <k4...@kazu.tv> wrote:
>
>>
>> Here are the questions:
>> 1. How to fix this? I'm guessing changing the block size in HBase
>> would fix the problem, but I don't know how. gora.properties, perhaps?
>>
>
> No, such functionality is not currently implemented in gora-hbase. Usually
> the block size is retrieved from the table schema (HColumnDescriptor);
> however, there are also overrides you can specify in your
> conf/hbase-site.xml file. Please see [0] for more options. I am not
> very comfortable with HBase, but this looks like it's going in the right
> direction.
>
>
>> 2. After fixing up the configurations, I need to fetch those
>> incomplete pages again. Any easy way to do this?
>>
>
> I suppose if you know the list of batch IDs, you can re-run the crawl from the
> fetch step onwards; however, I've not attempted this myself. Please give us
> feedback on this one.
>
> Thanks
> Lewis
>
> [0] http://hbase.apache.org/book/config.files.html
>
>
>
> --
> *Lewis*

Re: Size limit for fetched pages

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Kaz,

On Sat, Jan 12, 2013 at 1:09 AM, k4200 <k4...@kazu.tv> wrote:

>
> Here are the questions:
> 1. How to fix this? I'm guessing changing the block size in HBase
> would fix the problem, but I don't know how. gora.properties, perhaps?
>

No, such functionality is not currently implemented in gora-hbase. Usually
the block size is retrieved from the table schema (HColumnDescriptor);
however, there are also overrides you can specify in your
conf/hbase-site.xml file. Please see [0] for more options. I am not
very comfortable with HBase, but this looks like it's going in the right
direction.


> 2. After fixing up the configurations, I need to fetch those
> incomplete pages again. Any easy way to do this?
>

I suppose if you know the list of batch IDs, you can re-run the crawl from the
fetch step onwards; however, I've not attempted this myself. Please give us
feedback on this one.

Thanks
Lewis

[0] http://hbase.apache.org/book/config.files.html



-- 
*Lewis*

Re: Size limit for fetched pages

Posted by feng lu <am...@gmail.com>.
Hi Kaz,

For incomplete pages, you can change the file.content.limit property in
nutch-site.xml.

Maybe you can then regenerate the URLs and fetch again.




On Sat, Jan 12, 2013 at 5:09 PM, k4200 <k4...@kazu.tv> wrote:

> Hi,
>
> I started using Nutch with HBase several days ago following the
> Nutch2Tutorial shown below and it seemed to start working.
> http://wiki.apache.org/nutch/Nutch2Tutorial
>
> Today, I noticed that page contents were cut down to 64KB. Actually,
> those pages are less than 64KB, but the contents are UTF-8, and
> multi-byte characters seem to be encoded like "\xE3\x81\x93" when
> stored in HBase, so basically the size becomes almost 4 times larger
> than that of the original content.
>
> Here are the questions:
> 1. How to fix this? I'm guessing changing the block size in HBase
> would fix the problem, but I don't know how. gora.properties, perhaps?
> 2. After fixing up the configurations, I need to fetch those
> incomplete pages again. Any easy way to do this?
>
> Any help would be appreciated.
>
> Thanks,
> Kaz
>



-- 
Don't Grow Old, Grow Up... :-)