
Posted to user@nutch.apache.org by Ward Loving <wa...@appirio.com> on 2013/02/07 03:44:57 UTC

Content Truncation in Nutch 2.1/MySQL

Hello:

I've got Nutch up and running except for one big problem.  It is truncating
the content of my downloaded pages at 27,000 or 28,000 bytes.  Basically it
just slices the end of my web pages off and of course that completely hoses
any downstream parsing of the tags and content that I'd like to do.

This error is driving me crazy.  I've done the typical things, like updating
http.content.limit:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

And I've updated the content setting in my gora-sql-mapping.xml to a huge
number:

<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
table="webpage">
  <primarykey column="id" length="767"/>
    <field name="baseUrl" column="baseUrl" length="512"/>
    <field name="status" column="status"/>
    <field name="prevFetchTime" column="prevFetchTime"/>
    <field name="fetchTime" column="fetchTime"/>
    <field name="fetchInterval" column="fetchInterval"/>
    <field name="retriesSinceFetch" column="retriesSinceFetch"/>
    <field name="reprUrl" column="reprUrl" length="512"/>
    <!-- <field name="content" column="content" length="65536"/>-->
    <field name="content" column="content" length="262144"/>
    <field name="contentType" column="typ" length="32"/>
    <field name="protocolStatus" column="protocolStatus"/>
    <field name="modifiedTime" column="modifiedTime"/>

    <!-- parse fields                                       -->
    <field name="title" column="title" length="512"/>
    <field name="text" column="text" length="32000"/>
    <field name="parseStatus" column="parseStatus"/>
    <field name="signature" column="signature"/>
    <field name="prevSignature" column="prevSignature"/>

    <!-- score fields                                       -->
    <field name="score" column="score"/>
    <field name="headers" column="headers"/>
    <field name="inlinks" column="inlinks"/>
    <field name="outlinks" column="outlinks"/>
    <field name="metadata" column="metadata"/>
    <field name="markers" column="markers"/>
</class>

But I'm getting no love here.   Any other ideas what could be trimming the
content?

-- 
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475

Re: Content Truncation in Nutch 2.1/MySQL

Posted by Ward Loving <wa...@appirio.com>.

Hi Lewis:

Well, I've done some additional testing and the truncation issue seems to
be isolated to the particular web server/site that I'm trying to process.
When I run the process against other sites, I'm not seeing the same issue.
I guess for processing that site I'll have to go with Plan B.

Thanks for your help.

Ward
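
One way to check, outside Nutch, whether a server is itself sending short
responses is to compare the Content-Length header it declares with the
number of bytes that actually arrive. The sketch below is only an
illustration using the Python standard library. The helper names
(missing_bytes, fetch_and_compare) are made up for this example, and it
assumes that the two sizes in the ParserJob warning correspond to the
declared and the received length, which this thread does not confirm.

```python
import http.client
import urllib.request


def missing_bytes(declared, received):
    """Bytes missing relative to the declared Content-Length.

    Returns None when the server sent no Content-Length header.
    """
    if declared is None:
        return None
    return max(int(declared) - received, 0)


def fetch_and_compare(url, timeout=30):
    """Fetch url and return (declared, received, missing)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        declared = resp.headers.get("Content-Length")
        try:
            body = resp.read()
        except http.client.IncompleteRead as exc:
            # Fewer bytes arrived than the header declared; keep what did arrive.
            body = exc.partial
    return declared, len(body), missing_bytes(declared, len(body))
```

Calling fetch_and_compare on one of the affected pages and on a page from
another site would show whether the shortfall is already present before
Nutch or MySQL are involved.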


On Sun, Feb 10, 2013 at 8:19 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> No content should be truncated if you set http.content.limit to -1 and
> leave the default settings on. It is as simple as that.
> Have you recompiled Nutch with some changes you made before continuing
> crawling?

Re: Content Truncation in Nutch 2.1/MySQL

Posted by Lewis John Mcgibbney <le...@gmail.com>.

No content should be truncated if you set http.content.limit to -1 and
leave the default settings on. It is as simple as that.
Have you recompiled Nutch with some changes you made before continuing
crawling?

On Fri, Feb 8, 2013 at 9:01 PM, Ward Loving <wa...@appirio.com> wrote:

> Well,
>
> I spoke too soon.  I ran a crawl overnight and I'm seeing all kinds of
> truncation happening again.   I can hardly find a content field in my
> database that hasn't been truncated.  I'm seeing a ton of these warning
> messages in the log:
>
> 2013-02-08 19:40:36,861 WARN  parse.ParserJob -
> http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped.
> Content of size 30220 was truncated to 29919
> 2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
> http://www.episcopalchurch.org/parish/varina-church-richmond-va
> 2013-02-08 19:40:36,861 WARN  parse.ParserJob -
> http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped.
> Content of size 29559 was truncated to 28471
> 2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
> http://www.episcopalchurch.org/parish/vauters-church-champlain-va
>
> This is sort of bizarre.  I spot checked 5 pages when I first started the
> process yesterday morning and all the content in the content fields was
> complete.  Now I'm running it again and nothing is, but I don't see the
> warning messages that anything is amiss with the data with the first couple
> of pages I fetched.  I've tried updating the following setting to false but
> it doesn't seem to help:
>
> <property>
>   <name>parser.skip.truncated</name>
>   <value>false</value>
>   <description>Boolean value for whether we should skip parsing for
> truncated documents. By default this
>   property is activated due to extremely high levels of CPU which parsing
> can sometimes take.
>   </description>
> </property>



-- 
*Lewis*

Re: Content Truncation in Nutch 2.1/MySQL

Posted by Ward Loving <wa...@appirio.com>.

Well,

I spoke too soon.  I ran a crawl overnight and I'm seeing all kinds of
truncation happening again.   I can hardly find a content field in my
database that hasn't been truncated.  I'm seeing a ton of these warning
messages in the log:

2013-02-08 19:40:36,861 WARN  parse.ParserJob -
http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped.
Content of size 30220 was truncated to 29919
2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
http://www.episcopalchurch.org/parish/varina-church-richmond-va
2013-02-08 19:40:36,861 WARN  parse.ParserJob -
http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped.
Content of size 29559 was truncated to 28471
2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
http://www.episcopalchurch.org/parish/vauters-church-champlain-va

This is sort of bizarre.  I spot-checked 5 pages when I first started the
process yesterday morning and all the content in the content fields was
complete.  Now I'm running it again and none of it is, yet I don't see any
of these warning messages for the first couple of pages I fetched.  I've
tried setting the following property to false, but it doesn't seem to help:

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for
  truncated documents. By default this property is activated due to
  extremely high levels of CPU which parsing can sometimes take.
  </description>
</property>
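
To get an overview of how much is being cut off per page, the ParserJob
warnings quoted above can be tallied with a short script. This is a minimal
sketch in Python, assuming the log entries look exactly like the ones in
this message; summarize_truncation is a hypothetical helper, not part of
Nutch, and the sample text is copied from the excerpt above.

```python
import re

# Matches the WARN entries shown above; \s+ tolerates the line wrapping
# introduced by the mail client.
TRUNCATED = re.compile(
    r"WARN\s+parse\.ParserJob\s+-\s+(\S+)\s+skipped\.\s+"
    r"Content of size (\d+) was truncated to (\d+)"
)


def summarize_truncation(log_text):
    """Return (url, size, truncated_to, difference) for each warning."""
    rows = []
    for url, size, truncated_to in TRUNCATED.findall(log_text):
        rows.append((url, int(size), int(truncated_to),
                     int(size) - int(truncated_to)))
    return rows


sample = """\
2013-02-08 19:40:36,861 WARN  parse.ParserJob -
http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped.
Content of size 30220 was truncated to 29919
2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
http://www.episcopalchurch.org/parish/varina-church-richmond-va
2013-02-08 19:40:36,861 WARN  parse.ParserJob -
http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped.
Content of size 29559 was truncated to 28471
"""

for row in summarize_truncation(sample):
    print(row)
```

For the two warnings above the differences are 301 and 1088 bytes, so the
pages are losing a varying amount off the end rather than being cut at one
fixed limit.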






On Thu, Feb 7, 2013 at 5:24 PM, Ward Loving <wa...@appirio.com> wrote:

> Yep, looks like it.  The configuration is tricky no doubt.  In my case,
> however, I think I had actually fixed the config, I just couldn't tell that
> I had resolved the issue.  I was looking at stale data.

Re: Content Truncation in Nutch 2.1/MySQL

Posted by Ward Loving <wa...@appirio.com>.

Yep, looks like it.  The configuration is tricky, no doubt.  In my case,
however, I think I had actually fixed the config; I just couldn't tell that
I had resolved the issue.  I was looking at stale data.


On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> So the problem for you is resolved?
> The main (typical) problem here is in the underlying gora-sql library and
> some rather difficult to master gora-sql-mapping.xml constraints.
> Hope all is resolved
> Lewis

Re: Content Truncation in Nutch 2.1/MySQL

Posted by Lewis John Mcgibbney <le...@gmail.com>.

So the problem for you is resolved?
The main (typical) problem here is in the underlying gora-sql library and
some rather difficult-to-master gora-sql-mapping.xml constraints.
Hope all is resolved.
Lewis
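
Since the length constraints in gora-sql-mapping.xml are easy to lose track
of, it can help to list what the file actually declares. A minimal sketch
in Python, standard library only; field_lengths is a hypothetical helper and
the embedded mapping is an abbreviated copy of the one posted at the start
of this thread. It only reports the declared lengths and does not inspect
the MySQL table itself.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the mapping from the original post.
MAPPING = """\
<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
table="webpage">
  <primarykey column="id" length="767"/>
  <field name="baseUrl" column="baseUrl" length="512"/>
  <field name="status" column="status"/>
  <field name="content" column="content" length="262144"/>
  <field name="text" column="text" length="32000"/>
</class>
"""


def field_lengths(mapping_xml):
    """Map each <field> name to its declared length (None if unspecified)."""
    root = ET.fromstring(mapping_xml)
    lengths = {}
    for field in root.iter("field"):
        length = field.get("length")
        lengths[field.get("name")] = int(length) if length is not None else None
    return lengths


for name, length in field_lengths(MAPPING).items():
    print(f"{name}: {length}")
```

Note that elsewhere in this thread complete pages only showed up after the
webpage table was dropped and the crawl restarted, so the declared lengths
and the existing table may not agree until the table has been recreated.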

On Thu, Feb 7, 2013 at 1:57 PM, Ward Loving <wa...@appirio.com> wrote:

> Alright...very good news.  I guess something I did fixed the issue.  Once I
> dropped my webpage table and restarted the process, I'm now getting
> complete pages.  The actual load of the data to that field can happen
> somewhat later than the fetch entry in the logs.  Easy to see when
> inserting data the first time around.  Not as simple to detect when you've
> loaded data previously. Thanks for your assistance.

Re: Content Truncation in Nutch 2.1/MySQL

Posted by Ward Loving <wa...@appirio.com>.

Alright...very good news.  I guess something I did fixed the issue.  Once I
dropped my webpage table and restarted the process, I started getting
complete pages.  The actual load of the data to that field can happen
somewhat later than the fetch entry in the logs.  That is easy to see when
inserting data the first time around, but not as simple to detect when
you've loaded data previously.  Thanks for your assistance.

On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> It will produce more output in the fetcher part of your hadoop.log, not in
> the parsechecker tool itself; that is why you are seeing nothing more.
> Are you still having problems with the truncation aspect?
> Lewis

Re: Content Truncation in Nutch 2.1/MySQL

Posted by Lewis John Mcgibbney <le...@gmail.com>.

It will produce more output in the fetcher part of your hadoop.log, not in
the parsechecker tool itself; that is why you are seeing nothing more.
Are you still having problems with the truncation aspect?
Lewis

On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving <wa...@appirio.com> wrote:

> Lewis:
>
>

Re: Content Truncation in Nutch 2.1/MySQL

Posted by Ward Loving <wa...@appirio.com>.

Lewis:

I've updated the following parameter to true:

<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>

But this doesn't seem to be generating much extra output.  When I run the
parsechecker against the site in question I get the following output:

---------
Url
---------------
http://www.mysite.com/webpage
---------
Metadata
---------
---------
ParseText
---------
All the Content of my page

There are just a few question marks in the parsed text that the bash shell
seems to be struggling with but the page content (under ParseText) looks
complete.




On Wed, Feb 6, 2013 at 9:53 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Can you use the parsechecker tool with fetcher.verbose overridden to true
> and the same settings on one of the (HTML?) documents giving you bother?
> The gora-sql-0.1.1-incubating module is becoming a real pain to be honest.

Re: Content Truncation in Nutch 2.1/MySQL

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Can you use the parsechecker tool with fetcher.verbose overridden to true
and the same settings on one of the (HTML?) documents giving you bother?
The gora-sql-0.1.1-incubating module is becoming a real pain to be honest.
