Posted to user@nutch.apache.org by Teague James <te...@insystechinc.com> on 2014/01/22 17:16:43 UTC

How to Get Links With Nutch

I am trying to use Nutch to crawl a site and return all of the links that
are on a page. As a simple example, the page might look like this if its
address were www.example.com and each of the items in [brackets] were links
of some sort - relative or full URLs:

Article 1 text blah blah blah [Read more]
Download [Article 1 PDF]
Article 2 text blah blah blah [Read more]
Download [Article 2 PDF]
In partnership with [Some Partner]
[Home]|[Articles]|[Contact Us]

What I want to get is a list of all the links and destination URLs,
something like:
[Read more] /article1
[Article 1 PDF] /pdfs/article1.pdf
[Read more] /article2
[Article 2 PDF] /pdfs/article2.pdf
[Some Partner] www.somepartner.com
[Home] /home
[Articles] /articles
[Contact Us] /contact us

Note that a lot of the links are relative. I don't care whether I can get
only the relative "/article1" or the full "www.example.com/article1" and I
do not necessarily need Nutch to go to each of those links and crawl them. I
just want Nutch to report on all of the links on the page. 

Can anyone offer me any advice on how to accomplish this?


Re: How to Get Links With Nutch

Posted by Tejas Patil <te...@gmail.com>.
On Wed, Jan 22, 2014 at 10:33 PM, Teague James <te...@insystechinc.com> wrote:

> Tejas,
>
> Thanks for your response, that is exactly correct. Ultimately I want to be
> able to index the Nutch crawl with Solr to make it all searchable. After
> doing my crawl, I use:
>
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb
> crawl/linkdb -dir crawl/segments/
>
> But I do not get all of the anchors. I get some anchors as a
> comma-delimited list in the anchor field.


The HTML parser in Nutch is doing this.


> I do not get any of the outlinks.
>
I think the indexer only indexes the parsed content from the segments, so
the outlinks won't be included. That's why you see that happening.

> I did a dump with readdb of the crawldb and found that the links I want are
> there. I will take a look at doing a segment dump as you suggest.


OK. Look for the "outlinks" section in the segment dump.
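
If it helps, you can dump just the parse data (which is where the outlinks
live) with readseg. This is a minimal sketch: the segment name below is a
placeholder for whatever timestamped directory your crawl created, and the
-no* flags are the ones SegmentReader takes in 1.x, so adjust if your
version differs:

bin/nutch readseg -dump crawl/segments/<segment_name> segdump \
  -nocontent -nofetch -nogenerate -noparse -noparsetext

Each record in segdump/dump should then have an "Outlinks" block with one
"outlink: toUrl: ... anchor: ..." line per link on the page.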


> Will that
> make these outlinks available to Solr or are there additional steps I need
> to take?
>
I am not sure if there is a better way than this: write your own HTML
parser (or tweak the one provided with Nutch) so that it emits the
outlinks along with their anchor text.
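
As a quick sanity check before touching the parser, you could also point
the parsechecker tool at a single page. Assuming Nutch 1.x, it fetches and
parses the URL and prints the parse data, including the outlinks with their
anchor text:

bin/nutch parsechecker -dumpText http://www.example.com/

If the anchors you want already show up there, the parser is extracting
them and the gap is only on the indexing side.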

-----Original Message-----
>
> Correct me if I am wrong: you want the anchor text and the outlink, right?
> If you crawl the seed URL for depth 1 using Nutch 1.x and then get a
> segment dump of the segment generated after the crawl, it should have
> that information.
>
>
> On Wed, Jan 22, 2014 at 9:46 PM, Teague James
> <te...@insystechinc.com> wrote:
>
> > I am trying to use Nutch to crawl a site and return all of the links
> > that are on a page. As a simple example, the page might look like this
> > if its address were www.example.com and each of the items in
> > [brackets] were links of some sort - relative or full URLs:
> >
> > Article 1 text blah blah blah [Read more]
> > Download [Article 1 PDF]
> > Article 2 text blah blah blah [Read more]
> > Download [Article 2 PDF]
> > In partnership with [Some Partner]
> > [Home]|[Articles]|[Contact Us]
> >
> > What I want to get is a list of all the links and destination URLs,
> > something like:
> > [Read more] /article1
> > [Article 1 PDF] /pdfs/article1.pdf
> > [Read more] /article2
> > [Article 2 PDF] /pdfs/article2.pdf
> > [Some Partner] www.somepartner.com
> > [Home] /home
> > [Articles] /articles
> > [Contact Us] /contact us
> >
> > Note that a lot of the links are relative. I don't care whether I can
> > get only the relative "/article1" or the full
> > "www.example.com/article1" and I do not necessarily need Nutch to go to
> > each of those links and crawl them. I just want Nutch to report on all
> > of the links on the page.
> >
> > Can anyone offer me any advice on how to accomplish this?
> >
> >
>
>

RE: How to Get Links With Nutch

Posted by Teague James <te...@insystechinc.com>.
Tejas,

Thanks for your response, that is exactly correct. Ultimately I want to be
able to index the Nutch crawl with Solr to make it all searchable. After
doing my crawl, I use:

bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb
crawl/linkdb -dir crawl/segments/

But I do not get all of the anchors. I get some anchors as a comma-delimited
list in the anchor field. I do not get any of the outlinks.

I did a dump with readdb of the crawldb and found that the links I want are
there. I will take a look at doing a segment dump as you suggest. Will that
make these outlinks available to Solr or are there additional steps I need
to take?

-----Original Message-----

Correct me if I am wrong: you want the anchor text and the outlink, right?
If you crawl the seed URL for depth 1 using Nutch 1.x and then get a segment
dump of the segment generated after the crawl, it should have that information.


On Wed, Jan 22, 2014 at 9:46 PM, Teague James
<te...@insystechinc.com> wrote:

> I am trying to use Nutch to crawl a site and return all of the links 
> that are on a page. As a simple example, the page might look like this 
> if its address were www.example.com and each of the items in 
> [brackets] were links of some sort - relative or full URLs:
>
> Article 1 text blah blah blah [Read more]
> Download [Article 1 PDF]
> Article 2 text blah blah blah [Read more]
> Download [Article 2 PDF]
> In partnership with [Some Partner]
> [Home]|[Articles]|[Contact Us]
>
> What I want to get is a list of all the links and destination URLs, 
> something like:
> [Read more] /article1
> [Article 1 PDF] /pdfs/article1.pdf
> [Read more] /article2
> [Article 2 PDF] /pdfs/article2.pdf
> [Some Partner] www.somepartner.com
> [Home] /home
> [Articles] /articles
> [Contact Us] /contact us
>
> Note that a lot of the links are relative. I don't care whether I can
> get only the relative "/article1" or the full
> "www.example.com/article1" and I do not necessarily need Nutch to go to
> each of those links and crawl them. I just want Nutch to report on all
> of the links on the page.
>
> Can anyone offer me any advice on how to accomplish this?
>
>


Re: How to Get Links With Nutch

Posted by Tejas Patil <te...@gmail.com>.
Correct me if I am wrong: you want the anchor text and the outlink, right?
If you crawl the seed URL for depth 1 using Nutch 1.x and then get a
segment dump of the segment generated after the crawl, it should have that
information.
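
As a rough sketch, assuming the all-in-one crawl command that ships with
Nutch 1.x (directory names here are only illustrative):

bin/nutch crawl urls -dir crawl -depth 1
bin/nutch readseg -dump crawl/segments/<segment_name> segdump

The dump should then list, for each fetched page, its outlinks with toUrl
and anchor values, which you can post-process into the "[anchor] URL"
listing you described.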


On Wed, Jan 22, 2014 at 9:46 PM, Teague James <te...@insystechinc.com> wrote:

> I am trying to use Nutch to crawl a site and return all of the links that
> are on a page. As a simple example, the page might look like this if its
> address were www.example.com and each of the items in [brackets] were
> links
> of some sort - relative or full URLs:
>
> Article 1 text blah blah blah [Read more]
> Download [Article 1 PDF]
> Article 2 text blah blah blah [Read more]
> Download [Article 2 PDF]
> In partnership with [Some Partner]
> [Home]|[Articles]|[Contact Us]
>
> What I want to get is a list of all the links and destination URLs,
> something like:
> [Read more] /article1
> [Article 1 PDF] /pdfs/article1.pdf
> [Read more] /article2
> [Article 2 PDF] /pdfs/article2.pdf
> [Some Partner] www.somepartner.com
> [Home] /home
> [Articles] /articles
> [Contact Us] /contact us
>
> Note that a lot of the links are relative. I don't care whether I can get
> only the relative "/article1" or the full "www.example.com/article1" and I
> do not necessarily need Nutch to go to each of those links and crawl them.
> I just want Nutch to report on all of the links on the page.
>
> Can anyone offer me any advice on how to accomplish this?
>
>