Posted to user@nutch.apache.org by Vikas Hazrati <vi...@knoldus.com> on 2012/05/21 13:43:46 UTC

Setting the Fetch time with a CustomFetchSchedule

Hi,

I would like to write a custom implementation of AbstractFetchSchedule
and change the fetch time based on some parameters that I get as part of
my parsing.

// something like this
datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000 +
customLogic);

Right now I have a custom URLFilter and a custom parser which extends
HtmlParseFilter. During custom parsing I come across some parameters that
would help me decide the fetch time for that URL. I would like to pass
these values to my CustomFetchSchedule.

Is there a way to do that? Can I pass them as part of the configuration?

Since I get the data I need to make this decision only as part of the
parse, would it be possible to pass this data to the FetchSchedule?

Thoughts?

Regards | Vikas

Re: Setting the Fetch time with a CustomFetchSchedule

Posted by Vikas Hazrati <vi...@knoldus.com>.
OK, the class gets called after I include it as part of the classpath.
Thanks.


Re: Setting the Fetch time with a CustomFetchSchedule

Posted by Vikas Hazrati <vi...@knoldus.com>.
Thanks Markus, I will try the classpath. I believe I did try that

> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>com.custom.CustomEventFetchScheduler</value>
> </property>

but I will give it a try again and let the group know...


RE: Setting the Fetch time with a CustomFetchSchedule

Posted by Markus Jelsma <ma...@openindex.io>.
-----Original message-----
> From:Vikas Hazrati <vi...@knoldus.com>
> Sent: Mon 28-May-2012 13:55
> To: user@nutch.apache.org
> Subject: Re: Setting the Fetch time with a CustomFetchSchedule
> 
> Thanks Markus, what I understand from the code is that I should be able to
> extract and pass meta information from my ParsePlugin and access that as a
> part of the custom fetch schedule which extends AbstractFetchSchedule.
> 
> If I create a custom fetch class as
> 
> class CustomEventFetchScheduler extends AbstractFetchSchedule { ...}
> 
> how do I include this custom class as part of my crawl cycle? I understand
> that there is no extension point for this?

Indeed, there is no extension point, so you cannot make a nice plugin. What you can do is make sure it's on the classpath and simply tell Nutch to use it via db.fetch.schedule.class; that should work just fine.

> 
> I get this -> Caused by: java.lang.RuntimeException: Plugin
> (12kdaggregator), extension point: org.apache.nutch.crawl.FetchSchedule
> does not exist.
> 
> Also, I could not successfully plug it in via nutch-site.xml by
> overriding nutch-default.xml:
> 
> 
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>com.custom.CustomEventFetchScheduler</value>
> </property>
> 
> 
> How do I include my custom logic so that it gets picked up as part of the
> crawl cycle?
> 
> Regards | Vikas
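
A minimal sketch of such a class, assuming it is compiled into a jar placed
on Nutch's classpath (package and class name follow the property value
quoted above; the override body is illustrative only):

package com.custom;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AbstractFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

public class CustomEventFetchScheduler extends AbstractFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    // Let the default schedule run first, then adjust
    // datum.setFetchTime(...) here with whatever custom logic applies.
    datum = super.setFetchSchedule(url, datum, prevFetchTime,
        prevModifiedTime, fetchTime, modifiedTime, state);
    return datum;
  }
}

With the jar on the classpath, the db.fetch.schedule.class property shown
above selects the class directly; no plugin.xml or extension point is involved.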

RE: No links to process, is the webgraph empty?

Posted by Markus Jelsma <ma...@openindex.io>.
-----Original message-----
> From:Dustine Rene Bernasor <du...@thecyberguardian.com>
> Sent: Wed 30-May-2012 05:45
> To: user@nutch.apache.org
> Subject: Re: No links to process, is the webgraph empty?
> 
> Hello,
> 
> I tried your suggestion by setting the link.ignore.xxx.xxx values to 
> false but it does not work. I tried to crawl a very small list of sites. 
> Without performing webgraph, I dumped the segment using this command:
> 
> bin/nutch readseg -dump /user/fetchdb/crawled/test/segments/20120530112254 /user/dump -nocontent -nofetch -nogenerate -noparse -noparsetext
> 
> Here's a sample entry from the dump:
> 
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: TinyMCE - Home
> Outlinks: 35
>    outlink: toUrl: http://www.tinymce.com/index.php anchor: Home
>    outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
>    outlink: toUrl: http://www.tinymce.com/download/download.php anchor: 
> Download
>    outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
>    outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php 
> anchor: Enterprise
>    outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: 
> Develop
>    outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
>    outlink: toUrl: http://www.tinymce.com/# anchor: Login
>    outlink: toUrl: http://www.tinymce.com/forum/register.php anchor: 
> Register
>    outlink: toUrl: http://www.tinymce.com/index.php anchor: Version: 3.5.1.1
>    outlink: toUrl: http://www.tinymce.com/# anchor: always the same.
>    outlink: toUrl: http://www.tinymce.com/# anchor: Easy to integrate
>    outlink: toUrl: http://www.tinymce.com/# anchor: Customizable
>    outlink: toUrl: http://www.tinymce.com/# anchor: Browserfriendly
>    outlink: toUrl: http://www.tinymce.com/# anchor: Lightweight
>    outlink: toUrl: http://www.tinymce.com/# anchor: AJAX Compatible
>    outlink: toUrl: http://www.tinymce.com/# anchor: International
>    outlink: toUrl: http://www.tinymce.com/# anchor: Open Source
>    outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor:
>    outlink: toUrl: http://www.tinymce.com/download/download.php anchor: 
> Download
>    outlink: toUrl: 
> http://www.tinymce.com/js/tinymce/jscripts/tiny_mce/license.txt anchor: 
> License
>    outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager.php 
> anchor: Learn more
>    outlink: toUrl: 
> http://www.tinymce.com/enterprise/mcimagemanager_buy.php anchor: Buy
>    outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager.php 
> anchor: Learn more
>    outlink: toUrl: 
> http://www.tinymce.com/enterprise/mcfilemanager_buy.php anchor: Buy
>    outlink: toUrl: http://www.tinymce.com/enterprise/support.php anchor: 
> ask question
>    outlink: toUrl: http://www.tinymce.com/develop/bugtracker.php anchor: 
> submit bug
>    outlink: toUrl: http://www.tinymce.com/enterprise/using.php anchor: 
> More TinyMCE Users
>    outlink: toUrl: http://www.tinymce.com/# anchor: Back to site top
>    outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
>    outlink: toUrl: http://www.tinymce.com/download/download.php anchor: 
> Download
>    outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
>    outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php 
> anchor: Enterprise
>    outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: 
> Develop
>    outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
> Content Metadata: nutch.content.digest=647e4d7705884d232ce5456145f7cb99 
> Date=Wed, 30 May 2012 03:22:43 GMT Vary=Accept-Encoding 
> Content-Length=3895 Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 
> nutch.segment.name=20120530112254 Content-Type=text/html; charset=UTF-8 
> Connection=close Server=Apache _ftk_=1338348124863
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> 
> As you can see, even in the parse data, there are no outlinks to
> external sites. (If you check the tinymce site, it has links to
> Microsoft, Facebook, etc.) So I am thinking my problem is more or less
> related to the issue described here:
> 
> https://issues.apache.org/jira/browse/NUTCH-1346

No, that is a fix for an entirely different feature that is not yet released. If external outlinks are not present, then check your URL filters and db.ignore.external.

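
If external outlinks are being stripped at parse time, the property Markus
mentions is presumably db.ignore.external.links; a nutch-site.xml sketch
that keeps external outlinks:

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>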

Re: No links to process, is the webgraph empty?

Posted by Dustine Rene Bernasor <du...@thecyberguardian.com>.
Hello,

I tried your suggestion by setting the link.ignore.xxx.xxx values to 
false but it does not work. I tried to crawl a very small list of sites. 
Without performing webgraph, I dumped the segment using this command:

bin/nutch readseg -dump /user/fetchdb/crawled/test/segments/20120530112254 /user/dump -nocontent -nofetch -nogenerate -noparse -noparsetext

Here's a sample entry from the dump:

ParseData::
Version: 5
Status: success(1,0)
Title: TinyMCE - Home
Outlinks: 35
   outlink: toUrl: http://www.tinymce.com/index.php anchor: Home
   outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
   outlink: toUrl: http://www.tinymce.com/download/download.php anchor: 
Download
   outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
   outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php 
anchor: Enterprise
   outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: 
Develop
   outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
   outlink: toUrl: http://www.tinymce.com/# anchor: Login
   outlink: toUrl: http://www.tinymce.com/forum/register.php anchor: 
Register
   outlink: toUrl: http://www.tinymce.com/index.php anchor: Version: 3.5.1.1
   outlink: toUrl: http://www.tinymce.com/# anchor: always the same.
   outlink: toUrl: http://www.tinymce.com/# anchor: Easy to integrate
   outlink: toUrl: http://www.tinymce.com/# anchor: Customizable
   outlink: toUrl: http://www.tinymce.com/# anchor: Browserfriendly
   outlink: toUrl: http://www.tinymce.com/# anchor: Lightweight
   outlink: toUrl: http://www.tinymce.com/# anchor: AJAX Compatible
   outlink: toUrl: http://www.tinymce.com/# anchor: International
   outlink: toUrl: http://www.tinymce.com/# anchor: Open Source
   outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor:
   outlink: toUrl: http://www.tinymce.com/download/download.php anchor: 
Download
   outlink: toUrl: 
http://www.tinymce.com/js/tinymce/jscripts/tiny_mce/license.txt anchor: 
License
   outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager.php 
anchor: Learn more
   outlink: toUrl: 
http://www.tinymce.com/enterprise/mcimagemanager_buy.php anchor: Buy
   outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager.php 
anchor: Learn more
   outlink: toUrl: 
http://www.tinymce.com/enterprise/mcfilemanager_buy.php anchor: Buy
   outlink: toUrl: http://www.tinymce.com/enterprise/support.php anchor: 
ask question
   outlink: toUrl: http://www.tinymce.com/develop/bugtracker.php anchor: 
submit bug
   outlink: toUrl: http://www.tinymce.com/enterprise/using.php anchor: 
More TinyMCE Users
   outlink: toUrl: http://www.tinymce.com/# anchor: Back to site top
   outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
   outlink: toUrl: http://www.tinymce.com/download/download.php anchor: 
Download
   outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
   outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php 
anchor: Enterprise
   outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: 
Develop
   outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
Content Metadata: nutch.content.digest=647e4d7705884d232ce5456145f7cb99 
Date=Wed, 30 May 2012 03:22:43 GMT Vary=Accept-Encoding 
Content-Length=3895 Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 
nutch.segment.name=20120530112254 Content-Type=text/html; charset=UTF-8 
Connection=close Server=Apache _ftk_=1338348124863
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

As you can see, even in the parse data, there are no outlinks to
external sites. (If you check the tinymce site, it has links to
Microsoft, Facebook, etc.) So I am thinking my problem is more or less
related to the issue described here:

https://issues.apache.org/jira/browse/NUTCH-1346




RE: No links to process, is the webgraph empty?

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

That depends on what you crawl: many connected/linked sites, or isolated sites. If you crawl isolated sites, then do not ignore internal links or you won't be able to build the webgraph. Keep in mind that without ignoring internal links the webgraph will become very dense.

Cheers
 
 

Re: No links to process, is the webgraph empty?

Posted by Dustine Rene Bernasor <du...@thecyberguardian.com>.
Hello,

If I understand this correctly, I need to set link.ignore.limit.page and
link.ignore.limit.domain to false, and the link.ignore.internal.xxx values
can be set to true? Or should I just set all of the link.ignore.xxx.xxx
values to false?



RE: No links to process, is the webgraph empty?

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

That's a patch for the fetcher. The error you are seeing is actually quite simple. Because you set those two link.ignore parameters to true, no links within the same domain or host are aggregated; only links from/to external hosts and domains are. This is a good setting for wide web crawls. If you restrict crawling to a few domains and they don't share links between them, then with these settings you will have no links to process.

Markus
 
 
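
A nutch-site.xml sketch for the restricted-crawl case described above,
keeping internal links so the webgraph is not empty (property names as
used in this thread):

<property>
  <name>link.ignore.internal.host</name>
  <value>false</value>
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>false</value>
</property>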

Re: No links to process, is the webgraph empty?

Posted by Dustine Rene Bernasor <du...@thecyberguardian.com>.
Hello,

I tried to read the segment containing the site which I am sure has a
link to another site, and I was surprised to find that the stored outlinks
all belong to the same domain. I came across this:

https://issues.apache.org/jira/browse/NUTCH-1346

It seems a patch is available for 1.6. I am currently using 1.2. The 
latest release for Nutch is 1.4. Would it be safe to switch directly to 
1.6?





No links to process, is the webgraph empty?

Posted by Dustine Rene Bernasor <du...@thecyberguardian.com>.
Hello,

Whenever I set link.ignore.internal.host and link.ignore.internal.domain
in nutch-site.xml to "true", I get the "No links to process, is the
webgraph empty?" error when performing LinkRank. However, if I set them to
"false", LinkRank works just fine. I have been searching for this
error but I haven't found anything conclusive so far. Btw, I have also
set both link.ignore.limit.page and link.ignore.limit.domain to
"true".

Furthermore, if I perform NodeReader on a certain page A, it says that
the page has 0 inlinks and outlinks, but I know that there is
another page B that links to A. But if I do the NodeReader on B, it says
there's 1 inlink and 1 outlink although B has links to many other sites.

I hope someone can shed light on this matter.

Thanks.

Dustine


Re: Setting the Fetch time with a CustomFetchSchedule

Posted by Vikas Hazrati <vi...@knoldus.com>.
Anyone? Any idea what could be going wrong? Is it possible to inject a
custom fetch scheduler?


Re: Setting the Fetch time with a CustomFetchSchedule

Posted by Vikas Hazrati <vi...@knoldus.com>.
Thanks Markus, what I understand from the code is that I should be able to
extract and pass meta information from my ParsePlugin and access that as a
part of the custom fetch schedule which extends AbstractFetchSchedule.

If I create a custom fetch class as

class CustomEventFetchScheduler extends AbstractFetchSchedule { ...}

how do I include this custom class as part of my crawl cycle? I understand
that there is no extension point for this?

I get this -> Caused by: java.lang.RuntimeException: Plugin
(12kdaggregator), extension point: org.apache.nutch.crawl.FetchSchedule
does not exist.

Also, I could not successfully plug it in via nutch-site.xml by
overriding nutch-default.xml:


<property>
  <name>db.fetch.schedule.class</name>
  <value>com.custom.CustomEventFetchScheduler</value>
</property>


How do I include my custom logic so that it gets picked up as part of the
crawl cycle?

Regards | Vikas


RE: error parsing some xml

Posted by Markus Jelsma <ma...@openindex.io>.
Strange, it should show the bad URL. But since you have only 9 URLs, the easiest way to go is to use the parsechecker tool on each URL.
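
A typical invocation, run once per URL (the URL below is a placeholder):

bin/nutch parsechecker http://www.example.com/some-page.xml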

 
 

Re: error parsing some xml

Posted by "Ing. Eyeris Rodriguez Rueda" <er...@uci.cu>.
I use Nutch 1.4 and Solr 3.4.
I think the error happens when parsing an XML file with this structure:
<!-- text with -- inside the comment -->
I have been reading up on this but have not found much; here is my error log.
Please help.
*************************************************************************************************
2012-05-21 10:17:53,398 INFO  fetcher.Fetcher - Fetcher: starting at 2012-05-21 10:17:53
2012-05-21 10:17:53,399 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20120521101752
2012-05-21 10:17:53,762 INFO  fetcher.Fetcher - Using queue mode : byHost
2012-05-21 10:17:53,762 INFO  fetcher.Fetcher - Fetcher: threads: 20
2012-05-21 10:17:53,762 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 2
2012-05-21 10:17:53,777 INFO  fetcher.Fetcher - QueueFeeder finished: total 9 records + hit by time limit :0
2012-05-21 10:17:53,804 WARN  parse.ParsePluginsReader - Unable to parse [null].Reason is [org.xml.sax.SAXParseException; lineNumber: 37; columnNumber: 7; The string "--" is not permitted within comments.]
2012-05-21 10:17:53,809 WARN  mapred.LocalJobRunner - job_local_0005
java.lang.RuntimeException: Parse Plugins preferences could not be loaded.
	at org.apache.nutch.parse.ParserFactory.<init>(ParserFactory.java:73)
	at org.apache.nutch.parse.ParseUtil.<init>(ParseUtil.java:53)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.<init>(Fetcher.java:581)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1075)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
****************************************************************************************************
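
Note that the warning comes from ParsePluginsReader, which loads the local
conf/parse-plugins.xml rather than a fetched page, so the malformed comment
is most likely in that configuration file. The XML specification forbids the
sequence -- inside a comment, so the first line below breaks a SAX parser
while the second is accepted:

<!-- text with -- inside the comment -->   (invalid)
<!-- text with - inside the comment -->    (valid)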





RE: error parsing some xml

Posted by Markus Jelsma <ma...@openindex.io>.
Hi

Which version do you use? It should list the troubling URL. What's the stack trace?

Cheers

 
 

error parsing some xml

Posted by "Ing. Eyeris Rodriguez Rueda" <er...@uci.cu>.
Hi all.
When I try to crawl I have a problem parsing some XML; I get the exception below. I want to know which XML file has the problem at parse time.
**************************************************************************************
WARN  parse.ParsePluginsReader - Unable to parse [null].Reason is [org.xml.sax.SAXParseException; lineNumber: 37; columnNumber: 7; The string "--" is not permitted within comments.]
***************************************************************************************
Any help would be appreciated.



RE: Setting the Fetch time with a CustomFetchSchedule

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, you can pass ParseMeta keys to the FetchSchedule as part of the CrawlDatum's metadata, as I did with:
https://issues.apache.org/jira/browse/NUTCH-1024
 
 
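
To make the mechanism concrete, a rough sketch of both halves. The filter
method, metadata key, and offset value below are hypothetical, and moving
the parse metadata into the CrawlDatum is assumed to go through the
db.parsemeta.to.crawldb property (NUTCH-779) or an equivalent step in the
CrawlDb update:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// In the HtmlParseFilter: record the value discovered during parsing.
public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  Parse parse = parseResult.get(content.getUrl());
  // Hypothetical key; the value here is an extra delay in milliseconds.
  parse.getData().getParseMeta().set("custom.fetch.offset", "86400000");
  return parseResult;
}

// In the custom FetchSchedule: read the key back from the CrawlDatum's
// metadata once it has been transferred there.
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
    long prevFetchTime, long prevModifiedTime,
    long fetchTime, long modifiedTime, int state) {
  datum = super.setFetchSchedule(url, datum, prevFetchTime,
      prevModifiedTime, fetchTime, modifiedTime, state);
  Writable offset = datum.getMetaData().get(new Text("custom.fetch.offset"));
  if (offset != null) {
    datum.setFetchTime(fetchTime + (long) datum.getFetchInterval() * 1000
        + Long.parseLong(offset.toString()));
  }
  return datum;
}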