Posted to user@nutch.apache.org by Ian Piper <ia...@tellura.co.uk> on 2012/01/23 08:46:52 UTC

Following .axd urls

Hi all,

I'd appreciate some guidance... can't seem to find much useful stuff on the web on this. I have set up a Nutch and Solr service that is crawling a client's site. They have a lot of pages that are accessed with urls like this:

http://[domain]/resources/consultationonstatutoryguidancefordisabilityinequalityact.aspx

The crawler is finding these urls with no problem and pulling their contents into the Solr index.

However, many of the pages at these urls also contain links to attachments, using .axd extensions. For example, this page:

http://[domain]/resources/anatozguidetolitigationfundingoptions.aspx

has this link in the body:

<p>
	12 May 2011<br />
	Download 
	<span id="internal-source-marker_0.1622281443260543">
		<a href="/medialibrary.axd?id=414405745" target="_self">
			An A to Z Guide to Litigation Funding Options 
		</a>
	</span>(PDF, 401 KB)<br />
	<span id="internal-source-marker_0.1622281443260543">
		Julian Chamberlayne, Stewarts Law and 
	</span>
	David Hartley, Abbey Legal Protection<br />From the ELA Annual Conference 2011
</p>

The problem I'm finding is that the crawler does not appear to be visiting or indexing the content of these urls. The document at the far end of the link, which has this url:

http://[domain]/medialibrary.axd?id=414405745

is actually a pdf. I am using the tika plugin, which I thought would allow for indexing pdfs.

Anyway, I'd be very grateful for some guidance about how to get Nutch to follow these links.

Thanks,


Ian.
--

Re: Following .axd urls

Posted by Ian Piper <ia...@me.com>.

Hi Lewis,

Thanks for the reply. I'm using a fetch depth of 10 (which I thought would be ample - this is not a deep site hierarchy). Here is the command I'm running:

bin/nutch crawl urls -solr [solrurl] -depth 10 -topN 5000

On 23 Jan 2012, at 16:02, Lewis John Mcgibbney wrote:

> Hi Ian,
> 
> What fetching depth are you using?
> 
> Lewis

Re: Following .axd urls

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Ian,

What fetching depth are you using?

Lewis

On Mon, Jan 23, 2012 at 7:46 AM, Ian Piper <ia...@tellura.co.uk> wrote:

> Hi all,
>
> I'd appreciate some guidance... can't seem to find much useful stuff on
> the web on this. I have set up a Nutch and Solr service that is crawling a
> client's site. They have a lot of pages that are accessed with urls like
> this:
>
> http://[domain]/resources/consultationonstatutoryguidancefordisabilityinequalityact.aspx
>
> The crawler is finding these urls with no problem and pulling their
> contents into the Solr index.
>
> However, many of the pages at these urls also contain links to
> attachments, using .axd extensions. For example, this page:
>
> http://[domain]/resources/anatozguidetolitigationfundingoptions.aspx
>
> has this link in the body:
>
> <p>
>        12 May 2011<br />
>        Download
>        <span id="internal-source-marker_0.1622281443260543">
>                <a href="/medialibrary.axd?id=414405745" target="_self">
>                        An A to Z Guide to Litigation Funding Options
>                </a>
>        </span>(PDF, 401 KB)<br />
>        <span id="internal-source-marker_0.1622281443260543">
>                Julian Chamberlayne, Stewarts Law and
>        </span>
>        David Hartley, Abbey Legal Protection<br />From the ELA Annual
> Conference 2011
> </p>
>
> The problem I'm finding is that the crawler is not apparently visiting or
> indexing the content of these urls. The document at the far end of the link
> has this url
>
> http://[domain]/medialibrary.axd?id=414405745
>
> is actually a pdf. I am using the tika plugin which I thought would allow
> for indexing pdfs.
>
> Anyway, I'd be very grateful for some guidance about how to get Nutch to
> follow these links.
>
> Thanks,
>
>
> Ian.
> --
>
>


-- 
*Lewis*

Re: Following .axd urls

Posted by Julien Nioche <li...@gmail.com>.

Having said that, if the URL filters are correct, the next step is to check
that the parser actually returns the outlink. Google for ParserChecker and
try it on the URL of the page containing the link.
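
To make that check concrete, here is a minimal sketch in Python. It is not ParserChecker and it is not how Nutch parses pages; it only illustrates what "the parser returns the outlink" means for the HTML fragment Ian posted. The host example.org stands in for the elided [domain], and OutlinkCollector and extract_outlinks are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkCollector(HTMLParser):
    """Collects the absolute target of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.outlinks.append(urljoin(self.base_url, value))


def extract_outlinks(html, base_url):
    """Return the resolved outlinks found in an HTML string."""
    collector = OutlinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.outlinks


# Assumption: example.org replaces the [domain] placeholder from the thread.
PAGE_URL = "http://example.org/resources/anatozguidetolitigationfundingoptions.aspx"

# Shortened copy of the fragment quoted in the original message.
SNIPPET = """
<p>
    Download
    <a href="/medialibrary.axd?id=414405745" target="_self">
        An A to Z Guide to Litigation Funding Options
    </a>
    (PDF, 401 KB)
</p>
"""

if __name__ == "__main__":
    for link in extract_outlinks(SNIPPET, PAGE_URL):
        print(link)
```

If the resolved media library URL shows up in the list, the link itself is parseable, and the place to look next is whatever decides which outlinks are kept for fetching.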

On 23 January 2012 16:04, Julien Nioche <li...@gmail.com> wrote:

> Hi Ian
>
>
>> The problem I'm finding is that the crawler is not apparently visiting or
>> indexing the content of these urls. The document at the far end of the link
>> has this url
>>
>> http://[domain]/medialibrary.axd?id=414405745
>>
>> is actually a pdf. I am using the tika plugin which I thought would allow
>> for indexing pdfs.
>>
>>
> don't blame parse-tika : if the URL is not fetched then it has no chance
> of being parsed then indexed
>
> check your URL filter : the link above contains a '?' which by default
> would get the URL to be filtered out
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Following .axd urls

Posted by Ian Piper <ia...@tellura.co.uk>.

On 23 Jan 2012, at 16:04, Julien Nioche wrote:

> check your URL filter : the link above contains a '?' which by default
> would get the URL to be filtered out

That was definitely the problem. Nutch is happily fetching those documents now!

Thanks very much for your help.


Ian.
--

Re: Following .axd urls

Posted by Ian Piper <ia...@me.com>.

Hi Julien,

Thanks for the message. I think you have found part of the problem - I have this in regex-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

I will try modifying this and re-running the crawl.
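
For anyone reading along, the effect of that rule can be sketched in Python. This is only an illustration of first-match filtering, under the assumption that rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL matching no rule is ignored. It is not Nutch's own code. The host example.org stands in for [domain], the '+' rule for medialibrary.axd is a hypothetical modification (the thread does not show the change actually made), and the final '+.' accept-all line is assumed rather than quoted.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides the outcome."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: the URL is ignored


# The rule quoted above, followed by an assumed accept-everything rule.
DEFAULT_RULES = r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else (assumed, not quoted in the thread)
+.
"""

# Hypothetical modification: accept the media library links before the
# character-based skip rule gets a chance to reject them.
MODIFIED_RULES = r"""
+/medialibrary\.axd\?id=\d+$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+.
"""

# Assumption: example.org replaces the [domain] placeholder from the thread.
PDF_URL = "http://example.org/medialibrary.axd?id=414405745"

if __name__ == "__main__":
    print(accepts(parse_rules(DEFAULT_RULES), PDF_URL))
    print(accepts(parse_rules(MODIFIED_RULES), PDF_URL))
```

Because the first match wins, an accept rule only helps if it sits above the skip rule. The alternative is to remove '?' and '=' from the character class, which would let every query-style URL on the site through.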


Ian.
--

On 23 Jan 2012, at 16:04, Julien Nioche wrote:

> Hi Ian
> 
> 
>> The problem I'm finding is that the crawler is not apparently visiting or
>> indexing the content of these urls. The document at the far end of the link
>> has this url
>> 
>> http://[domain]/medialibrary.axd?id=414405745
>> 
>> is actually a pdf. I am using the tika plugin which I thought would allow
>> for indexing pdfs.
>> 
>> 
> don't blame parse-tika : if the URL is not fetched then it has no chance of
> being parsed then indexed
> 
> check your URL filter : the link above contains a '?' which by default
> would get the URL to be filtered out
> 
> 
> 
> -- 
> *
> *Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com

Re: Following .axd urls

Posted by Julien Nioche <li...@gmail.com>.

Hi Ian


> The problem I'm finding is that the crawler is not apparently visiting or
> indexing the content of these urls. The document at the far end of the link
> has this url
>
> http://[domain]/medialibrary.axd?id=414405745
>
> is actually a pdf. I am using the tika plugin which I thought would allow
> for indexing pdfs.
>
>
Don't blame parse-tika: if the URL is not fetched, then it has no chance of
being parsed and then indexed.

Check your URL filter: the link above contains a '?', which by default
would cause the URL to be filtered out.



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com