
Posted to user@nutch.apache.org by Roberto Gardenier <r....@simgroep.nl> on 2012/05/01 13:55:28 UTC

RE: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url

Markus Jelsma,

I was notified that you have closed my JIRA ticket, changing its resolution status to Invalid.
I wonder why you closed my ticket and marked it Invalid, as I did not commit any changes or solutions?

With kind regards,
Roberto Gardenier 


-----Original message-----
From: Markus Jelsma (JIRA) [mailto:jira@apache.org] 
Sent: Tuesday 1 May 2012 13:40
To: r.gardenier@simgroep.nl
Subject: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url


     [ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1343.
--------------------------------

    Resolution: Invalid
    
> Crawl sites with hashtags in url
> --------------------------------
>
>                 Key: NUTCH-1343
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1343
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Roberto Gardenier
>            Priority: Blocker
>
> Hello,
> I'm currently trying to crawl a site which uses hashtags in its URLs. I don't seem to get any results and I'm hoping I'm just overlooking something.
> The site structure is as follows:
> http://domain.com (landingpage)
> http://domain.com/#/page1
> http://domain.com/#/page1/subpage1
> http://domain.com/#/page2
> http://domain.com/#/page2/subpage1
> and so on.
> I've pointed Nutch to http://domain.com as the start URL, and in my URL filter I've tried all kinds of rules.
> At first I thought this would be sufficient:
> +http\://domain\.com\/#
> But then I realised that # is used for comments, so I escaped it:
> +http\://domain\.com\/\#
> Still no results. So I thought I could use the asterisk for it:
> +http\://domain\.com\/*
> Still no luck. So I started trying various other regular expressions, but without success.
> I noticed the following messages in hadoop.log:
> INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
> I've researched this setting, but I don't know for sure whether it affects my problem. The property is set to false in my configs.
> I don't know if this is even related to the situation above, but maybe it helps.
> Any help is very much appreciated! I've tried googling the problem, but I couldn't find documentation or anyone else with this problem.
> Many thanks in advance. 
> With kind regards,
> Roberto Gardenier

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
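
For readers following the URL examples in the report above: the part of a URL after '#' is the fragment, and in general an HTTP client does not send it to the server. The thread itself does not give the reason for the Invalid resolution, so the following is only a minimal sketch of that general behaviour, written in Python for illustration (it is not Nutch code, and the function name request_target is hypothetical). It shows that the four hash URLs from the report all map to the same request target once the fragment is removed.

```python
from urllib.parse import urldefrag

# The hash URLs listed in the report.
hash_urls = [
    "http://domain.com/#/page1",
    "http://domain.com/#/page1/subpage1",
    "http://domain.com/#/page2",
    "http://domain.com/#/page2/subpage1",
]


def request_target(url):
    """Return the URL with its fragment removed (hypothetical helper)."""
    return urldefrag(url).url


# Collect the distinct URLs that remain once fragments are dropped.
targets = {request_target(u) for u in hash_urls}

for u in hash_urls:
    print(u, "->", request_target(u), "| fragment:", urldefrag(u).fragment)
print("distinct request targets:", len(targets))
```

If a crawler drops fragments in this way, a filter rule that mentions '#' would have nothing left to match, and pages that exist only behind a fragment would have to be rendered client-side to be seen at all. Whether that applies to a given Nutch setup is an assumption to verify against its configured URL normalizers and filters.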

Re: Hadoop not doing anything

Posted by Markus Jelsma <ma...@openindex.io>.

 Do you have running task trackers and data nodes? Which Nutch job did 
 you start? Any custom code?

 Check the logs of the four Hadoop daemons; there may be something 
 there.

 On Tue, 01 May 2012 16:26:31 +0100, Dean Pullen 
 <de...@semantico.com> wrote:
> Hi all,
>
> If this is definitely a Hadoop issue, as opposed to it being an issue
> caused by Nutch, I'll happily go ask on the Hadoop mailing list...
>
> Anyway, I'm kicking off a Nutch inject job via Hadoop 0.20.2 with 
> Nutch 1.4.
> (I'm using v 0.20.2 because this is the library version included with
> Nutch 1.4.)
>
> This is the output:
>
> 2012-05-01 16:11:06,869 INFO org.apache.hadoop.mapred.JobTracker:
> Initializing job_201205011600_0001
> 2012-05-01 16:11:06,870 INFO org.apache.hadoop.mapred.JobInProgress:
> Initializing job_201205011600_0001
> 2012-05-01 16:11:07,099 INFO org.apache.hadoop.mapred.JobInProgress:
> Input size for job job_201205011600_0001 = 47. Number of splits = 2
> 2012-05-01 16:11:07,102 INFO org.apache.hadoop.net.NetworkTopology:
> Adding a new node: /default-rack/localhost
> 2012-05-01 16:11:07,102 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201205011600_0001_m_000000 has split on
> node:/default-rack/localhost
>
>
> It then does nothing else. The Hadoop job tracker says Total
> Submissions = 1, yet states that there are, or have been, no running,
> completed or failed jobs.
>
>
> Any ideas as to what's stalling?
>
> Cheers,
>
> Dean Pullen
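
Checking the daemon logs, as suggested in the reply above, can be partly automated. The following is a minimal sketch in Python, assuming each log entry starts with the layout seen in the JobTracker excerpt in this thread (timestamp, level, logger name, message) and that any other line continues the previous entry. The names parse_entries and problems are hypothetical helpers, not part of Hadoop or Nutch.

```python
import re

# Entry layout assumed from the excerpt in this thread:
# "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger.name: message"
ENTRY_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+([\w.$]+):\s?(.*)$"
)


def parse_entries(text):
    """Group log text into (timestamp, level, logger, message) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ENTRY_START.match(line)
        if match:
            entries.append(list(match.groups()))
        elif entries:
            # Not a new entry: treat it as a wrapped or multi-line message.
            entries[-1][3] = (entries[-1][3] + " " + line).strip()
    return [tuple(entry) for entry in entries]


def problems(entries, levels=("WARN", "ERROR", "FATAL")):
    """Keep only the entries whose level suggests something went wrong."""
    return [entry for entry in entries if entry[1] in levels]


# Two entries copied from the excerpt in this thread (wrapped as posted).
sample = """\
2012-05-01 16:11:07,099 INFO org.apache.hadoop.mapred.JobInProgress:
Input size for job job_201205011600_0001 = 47. Number of splits = 2
2012-05-01 16:11:07,102 INFO org.apache.hadoop.net.NetworkTopology:
Adding a new node: /default-rack/localhost
"""

entries = parse_entries(sample)
for timestamp, level, logger, message in entries:
    print(level, logger, "|", message)
print("entries needing attention:", len(problems(entries)))
```

The excerpt posted here contains only INFO entries, so the filter returns nothing for it. The sketch is meant to be run over the full log files of each daemon, whose location depends on the installation.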

Hadoop not doing anything

Posted by Dean Pullen <de...@semantico.com>.

Hi all,

If this is definitely a Hadoop issue, as opposed to it being an issue 
caused by Nutch, I'll happily go ask on the Hadoop mailing list...

Anyway, I'm kicking off a Nutch inject job via Hadoop 0.20.2 with Nutch 
1.4.
(I'm using v 0.20.2 because this is the library version included with 
Nutch 1.4.)

This is the output:

2012-05-01 16:11:06,869 INFO org.apache.hadoop.mapred.JobTracker: 
Initializing job_201205011600_0001
2012-05-01 16:11:06,870 INFO org.apache.hadoop.mapred.JobInProgress: 
Initializing job_201205011600_0001
2012-05-01 16:11:07,099 INFO org.apache.hadoop.mapred.JobInProgress: 
Input size for job job_201205011600_0001 = 47. Number of splits = 2
2012-05-01 16:11:07,102 INFO org.apache.hadoop.net.NetworkTopology: 
Adding a new node: /default-rack/localhost
2012-05-01 16:11:07,102 INFO org.apache.hadoop.mapred.JobInProgress: 
tip:task_201205011600_0001_m_000000 has split on 
node:/default-rack/localhost


It then does nothing else. The Hadoop job tracker says Total Submissions 
= 1, yet states that there are, or have been, no running, completed or 
failed jobs.


Any ideas as to what's stalling?

Cheers,

Dean Pullen.