Posted to user@nutch.apache.org by k4200 <k4...@kazu.tv> on 2013/01/12 10:09:04 UTC
Size limit for fetched pages
Hi,
I started using Nutch with HBase several days ago following the
Nutch2Tutorial shown below and it seemed to start working.
http://wiki.apache.org/nutch/Nutch2Tutorial
Today, I noticed that page contents were cut down to 64KB. Actually,
those pages are less than 64KB, but the contents are UTF-8, and
multi-byte characters seem to be encoded like "\xE3\x81\x93" when
stored in HBase, so basically the size becomes almost 4 times larger
than that of the original content.
Here are the questions:
1. How to fix this? I'm guessing changing the block size in HBase
would fix the problem, but I don't know how. gora.properties, perhaps?
2. After fixing up the configurations, I need to fetch those
incomplete pages again. Any easy way to do this?
Any help would be appreciated.
Thanks,
Kaz
Re: Size limit for fetched pages
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
[UPDATE]
I was completely incorrect about gora-hbase not supporting configurable
block size within the table mapping XML file.
Currently, the configurable attributes that can be specified within
gora-hbase include the following:
String familyName = fieldElement.getAttributeValue("name");
String compression = fieldElement.getAttributeValue("compression");
String blockCache = fieldElement.getAttributeValue("blockCache");
String blockSize = fieldElement.getAttributeValue("blockSize");
String bloomFilter = fieldElement.getAttributeValue("bloomFilter");
String maxVersions = fieldElement.getAttributeValue("maxVersions");
String timeToLive = fieldElement.getAttributeValue("timeToLive");
String inMemory = fieldElement.getAttributeValue("inMemory");
These should be specified within the <table> block of the mapping file. For
more information on this, head over to user@gora.
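As a rough illustration of where these attributes would go, here is a minimal sketch of a mapping fragment. The first attribute read in the code above is the family name, so the sketch assumes the attributes belong on a <family> element inside <table>. The file name (conf/gora-hbase-mapping.xml), the table name "webpage", the family name "f", and the block size value are assumptions for illustration only; check them against your own mapping file. Note also that, as reported elsewhere in this thread, the truncation itself was resolved through http.content.limit rather than the block size.

```xml
<!-- conf/gora-hbase-mapping.xml (fragment; file and names are assumed) -->
<gora-orm>
  <table name="webpage">
    <!-- blockSize is given in bytes; 131072 is an illustrative value,
         not a recommendation -->
    <family name="f" maxVersions="1" blockSize="131072"/>
  </table>
</gora-orm>
```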
Thank you
Lewis
On Sun, Jan 13, 2013 at 6:49 AM, k4200 <k4...@kazu.tv> wrote:
> Hi Feng and Lewis,
>
> Thanks for your replies! I tried a few different settings and finally
> found out that increasing "http.content.limit" fixed the problem.
>
> Kaz
--
*Lewis*
Re: Size limit for fetched pages
Posted by k4200 <k4...@kazu.tv>.
Hi Feng and Lewis,
Thanks for your replies! I tried a few different settings and finally
found out that increasing "http.content.limit" fixed the problem.
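For anyone hitting the same truncation, a minimal sketch of the override in conf/nutch-site.xml might look like the following. The value shown is only an example. As far as I know, the default in nutch-default.xml is 65536 bytes, which matches the 64KB cut-off described above, and a negative value disables truncation; please verify both against the nutch-default.xml shipped with your version.

```xml
<!-- conf/nutch-site.xml (fragment) -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Limit in bytes for content fetched over HTTP;
         262144 (256 KB) is an illustrative value -->
    <value>262144</value>
  </property>
</configuration>
```

Pages that were already stored in truncated form are not repaired by this change alone; they still need to be fetched again.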
Kaz
2013/1/13 Lewis John Mcgibbney <le...@gmail.com>:
> Hi Kaz,
>
> On Sat, Jan 12, 2013 at 1:09 AM, k4200 <k4...@kazu.tv> wrote:
>
>>
>> Here are the questions:
>> 1. How to fix this? I'm guessing changing the block size in HBase
>> would fix the problem, but I don't know how. gora.properties, perhaps?
>>
>
> No, such functionality is not currently implemented in gora-hbase. Usually
> the block size is retrieved from the table schema (HColumnDescriptor);
> however, there are also overrides you can specify in your
> conf/hbase-site.xml file. Please see here [0] for more options. I am not
> very comfortable with HBase, but this looks like it's going in the right
> direction.
>
>
>> 2. After fixing up the configurations, I need to fetch those
>> incomplete pages again. Any easy way to do this?
>>
>
> I suppose if you know the list of batch IDs, then you can retry from the
> fetch cycle onwards; however, I've not attempted this myself. Please give us
> feedback on this one.
>
> Thanks
> Lewis
>
> [0] http://hbase.apache.org/book/config.files.html
>
>
>
> --
> *Lewis*
Re: Size limit for fetched pages
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Kaz,
On Sat, Jan 12, 2013 at 1:09 AM, k4200 <k4...@kazu.tv> wrote:
>
> Here are the questions:
> 1. How to fix this? I'm guessing changing the block size in HBase
> would fix the problem, but I don't know how. gora.properties, perhaps?
>
No, such functionality is not currently implemented in gora-hbase. Usually
the block size is retrieved from the table schema (HColumnDescriptor);
however, there are also overrides you can specify in your
conf/hbase-site.xml file. Please see here [0] for more options. I am not
very comfortable with HBase, but this looks like it's going in the right
direction.
> 2. After fixing up the configurations, I need to fetch those
> incomplete pages again. Any easy way to do this?
>
I suppose if you know the list of batch IDs, then you can retry from the
fetch cycle onwards; however, I've not attempted this myself. Please give us
feedback on this one.
Thanks
Lewis
[0] http://hbase.apache.org/book/config.files.html
--
*Lewis*
Re: Size limit for fetched pages
Posted by feng lu <am...@gmail.com>.
Hi Kaz,
For incomplete pages, you can change the file.content.limit property in
nutch-site.xml.
After that, you can probably regenerate the URLs and fetch them again.
--
Don't Grow Old, Grow Up... :-)