Posted to user@nutch.apache.org by Roberto Gardenier <r....@simgroep.nl> on 2012/05/01 13:55:28 UTC
RE: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url
Markus Jelsma,
I was notified that you have closed my JIRA ticket, changing its resolution status to Invalid.
I wonder why you closed my ticket and marked it invalid, as I did not commit any changes or solutions?
With kind regards,
Roberto Gardenier
-----Original Message-----
From: Markus Jelsma (JIRA) [mailto:jira@apache.org]
Sent: Tuesday, 1 May 2012 13:40
To: r.gardenier@simgroep.nl
Subject: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url
[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-1343.
--------------------------------
Resolution: Invalid
> Crawl sites with hashtags in url
> --------------------------------
>
> Key: NUTCH-1343
> URL: https://issues.apache.org/jira/browse/NUTCH-1343
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: Roberto Gardenier
> Priority: Blocker
>
> Hello,
> I'm currently trying to crawl a site which uses hashtags in the URLs. I don't seem to get any results and I'm hoping I'm just overlooking something.
> Site structure is as follows:
> http://domain.com (landingpage)
> http://domain.com/#/page1
> http://domain.com/#/page1/subpage1
> http://domain.com/#/page2
> http://domain.com/#/page2/subpage1
> and so on.
> I've pointed Nutch to http://domain.com as the start URL, and in my filter I've placed all kinds of rules.
> First I thought this would be sufficient:
> +http\://domain\.com\/#
> But then I realised that # is used for comments, so I escaped it:
> +http\://domain\.com\/\#
> Still no results. So I thought I could use the asterisk for it:
> +http\://domain\.com\/*
> Still no luck. So I started trying various regexes, but without success.
> I noticed the following message in hadoop.log:
> INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> I've researched this setting, but I don't know for sure whether it affects my problem. This property is set to false in my configs.
> I don't know if this is even related to the situation above, but maybe it helps.
> Any help is very much appreciated! I've tried googling the problem, but I couldn't find documentation or anyone else with this problem.
> Many thanks in advance.
> With kind regards,
> Roberto Gardenier
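A note on the URLs in the report above: the part of a URL after `#` is the fragment, which a client resolves locally and does not send to the server in an HTTP request. The following is a minimal Python sketch of that general URL behaviour only; it is not Nutch code, and whether and where Nutch 1.4 drops the fragment is not stated in this thread. The helper name `strip_fragment` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# The example URLs from the report.
urls = [
    "http://domain.com",
    "http://domain.com/#/page1",
    "http://domain.com/#/page1/subpage1",
    "http://domain.com/#/page2",
    "http://domain.com/#/page2/subpage1",
]


def strip_fragment(url):
    """Return the URL as a server would see it: fragment removed.

    Hypothetical helper for illustration; an empty path is written as "/"
    so that "http://domain.com" and "http://domain.com/" compare equal.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


# Collect the distinct URLs that remain once fragments are dropped.
distinct = {strip_fragment(u) for u in urls}

if __name__ == "__main__":
    for u in urls:
        print(u, "->", strip_fragment(u))
    print("distinct URLs without fragment:", len(distinct))
```

Under this view, all five URLs refer to the same document on the server, so a fetcher that requests pages over plain HTTP would see only the landing page, regardless of which filter rule is used.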
Re: Hadoop not doing anything
Posted by Markus Jelsma <ma...@openindex.io>.
Do you have running task trackers and data nodes? Which Nutch job did
you start? Any custom code?
Check the logs of the four Hadoop daemons; there may be something
there.
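One way to check which daemons are running is the JDK's `jps` tool, which lists running Java processes by class name. The sketch below is a hedged illustration, not part of Nutch or Hadoop: it assumes the four daemons meant here are NameNode, DataNode, JobTracker and TaskTracker, and the function name `missing_daemons` and the sample output (including the process IDs) are made up.

```python
# Assumption: these are the four daemons of a Hadoop 0.20.x setup meant above.
EXPECTED_DAEMONS = ("NameNode", "DataNode", "JobTracker", "TaskTracker")

# Made-up sample of `jps` output; the process IDs are invented.
SAMPLE_JPS_OUTPUT = """\
4301 NameNode
4417 DataNode
4630 JobTracker
4802 Jps
"""


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in jps output.

    Each jps line has the form "<pid> <class name>"; other lines are ignored.
    """
    running = set()
    for line in jps_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            running.add(fields[1])
    return [name for name in expected if name not in running]


if __name__ == "__main__":
    # To check a real machine, pass in the text printed by running `jps`.
    print("missing daemons:", missing_daemons(SAMPLE_JPS_OUTPUT))
```

In the sample, no TaskTracker is listed. With no task tracker available, a job can be initialized by the JobTracker but its tasks are never assigned, which would look like a stall.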
On Tue, 01 May 2012 16:26:31 +0100, Dean Pullen
<de...@semantico.com> wrote:
> Hi all,
>
> If this is definitely a Hadoop issue, as opposed to it being an issue
> caused by Nutch, I'll happily go ask on the Hadoop mailing list...
>
> Anyway, I'm kicking off a Nutch inject job via Hadoop 0.20.2 with
> Nutch 1.4.
> (I'm using v 0.20.2 because this is the library version included with
> Nutch 1.4.)
>
> This is the output:
>
> 2012-05-01 16:11:06,869 INFO org.apache.hadoop.mapred.JobTracker:
> Initializing job_201205011600_0001
> 2012-05-01 16:11:06,870 INFO org.apache.hadoop.mapred.JobInProgress:
> Initializing job_201205011600_0001
> 2012-05-01 16:11:07,099 INFO org.apache.hadoop.mapred.JobInProgress:
> Input size for job job_201205011600_0001 = 47. Number of splits = 2
> 2012-05-01 16:11:07,102 INFO org.apache.hadoop.net.NetworkTopology:
> Adding a new node: /default-rack/localhost
> 2012-05-01 16:11:07,102 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201205011600_0001_m_000000 has split on
> node:/default-rack/localhost
>
>
> It then does nothing else. The Hadoop job tracker says Total
> Submissions = 1, yet states that there are (or have been) no running,
> completed or failed jobs.
>
>
> Any ideas as to what's stalling?
>
> Cheers,
>
> Dean Pullen
Hadoop not doing anything
Posted by Dean Pullen <de...@semantico.com>.
Hi all,
If this is definitely a Hadoop issue, as opposed to it being an issue
caused by Nutch, I'll happily go ask on the Hadoop mailing list...
Anyway, I'm kicking off a Nutch inject job via Hadoop 0.20.2 with Nutch
1.4.
(I'm using v 0.20.2 because this is the library version included with
Nutch 1.4.)
This is the output:
2012-05-01 16:11:06,869 INFO org.apache.hadoop.mapred.JobTracker:
Initializing job_201205011600_0001
2012-05-01 16:11:06,870 INFO org.apache.hadoop.mapred.JobInProgress:
Initializing job_201205011600_0001
2012-05-01 16:11:07,099 INFO org.apache.hadoop.mapred.JobInProgress:
Input size for job job_201205011600_0001 = 47. Number of splits = 2
2012-05-01 16:11:07,102 INFO org.apache.hadoop.net.NetworkTopology:
Adding a new node: /default-rack/localhost
2012-05-01 16:11:07,102 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201205011600_0001_m_000000 has split on
node:/default-rack/localhost
It then does nothing else. The Hadoop job tracker says Total Submissions
= 1, yet states that there are (or have been) no running, completed or
failed jobs.
Any ideas as to what's stalling?
Cheers,
Dean Pullen.