Posted to user@nutch.apache.org by 高睿 <ga...@163.com> on 2012/12/15 04:47:55 UTC

How to extend Nutch for article crawling

Hi,

I'm looking for a framework to grab articles, and I have found Nutch 2.1. Here's my plan, with my questions at each step:
1. Add article list pages into url/seed.txt
    Here's one problem: what I actually want indexed is the article pages, not the article list pages. But if I don't allow the list pages to be indexed, Nutch will do nothing, because the list pages are the entry points. So, how can I index only the article pages and not the list pages?

2. Write a plugin to parse out the 'author', 'date', 'article body', 'headline', and maybe other information from the HTML.
    The 'Parser' plugin interface in Nutch 2.1 is:
    Parse getParse(String url, WebPage page)
    And the 'WebPage' class has some predefined attributes:
public class WebPage extends PersistentBase {
  //...
  private Utf8 baseUrl;
  // ...
  private Utf8 title;
  private Utf8 text;
  // ...
  private Map<Utf8,ByteBuffer> metadata;
  // ...
}

    So, the only field where I can put my custom attributes is 'metadata'. Is it designed for this purpose?
    BTW, the Parser in trunk looks like 'public ParseResult getParse(Content content)', which seems more reasonable to me.
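
    To make it concrete, this is roughly what I have in mind (a sketch only; it assumes the putToMetadata accessor that Gora generates for map fields, and the key name is made up):

import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import org.apache.avro.util.Utf8;
import org.apache.nutch.storage.WebPage;

public class MetadataExample {
  // Store a custom attribute such as the article author in the
  // Map<Utf8,ByteBuffer> metadata field shown above. putToMetadata is
  // assumed to be the accessor that Gora generates for map fields.
  public static void addArticleMeta(WebPage page, String key, String value)
      throws UnsupportedEncodingException {
    page.putToMetadata(new Utf8(key), ByteBuffer.wrap(value.getBytes("UTF-8")));
  }
}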

3. After the articles are indexed into Solr, another application can query it by 'date' and then store the article information into MySQL.
    My question here is: can Nutch store the articles directly into MySQL? Or can I write a plugin to customize the indexing behavior?

Is Nutch a good choice for my purpose? If not, could you suggest another good-quality framework/library?
Thanks for your help.

Regards,
Rui

Re: How to extend Nutch for article crawling

Posted by nitin hardeniya <ni...@gmail.com>.
Try http://scrapy.org/

On Sat, Dec 15, 2012 at 9:17 AM, 高睿 <ga...@163.com> wrote:

> Hi,
>
> I'm looking for a framework to grab articles, and I have found Nutch 2.1.
> Here's my plan, with my questions at each step:
> 1. Add article list pages into url/seed.txt
>     Here's one problem: what I actually want indexed is the article
> pages, not the article list pages. But if I don't allow the list pages to
> be indexed, Nutch will do nothing, because the list pages are the entry
> points. So, how can I index only the article pages and not the list pages?
>
> 2. Write a plugin to parse out the 'author', 'date', 'article body',
> 'headline', and maybe other information from the HTML.
>     The 'Parser' plugin interface in Nutch 2.1 is:
>     Parse getParse(String url, WebPage page)
>     And the 'WebPage' class has some predefined attributes:
> public class WebPage extends PersistentBase {
>   //...
>   private Utf8 baseUrl;
>   // ...
>   private Utf8 title;
>   private Utf8 text;
>   // ...
>   private Map<Utf8,ByteBuffer> metadata;
>   // ...
> }
>
>     So, the only field where I can put my custom attributes is
> 'metadata'. Is it designed for this purpose?
>     BTW, the Parser in trunk looks like: 'public ParseResult
> getParse(Content content)', which seems more reasonable to me.
>
> 3. After the articles are indexed into Solr, another application can query
> it by 'date' and then store the article information into MySQL.
>     My question here is: can Nutch store the articles directly into
> MySQL? Or can I write a plugin to customize the indexing behavior?
>
> Is Nutch a good choice for my purpose? If not, could you suggest another
> good-quality framework/library?
> Thanks for your help.
>
> Regards,
> Rui
>



-- 
~Nit
http://about.me/nitinhardeniya





RE: How to extend Nutch for article crawling

Posted by Markus Jelsma <ma...@openindex.io>.
The 1.x indexer can filter and normalize. 
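
For example, something along these lines (a sketch; the -filter/-normalize options and the exact solrindex usage may differ between 1.x releases):

# apply URL filters and normalizers while pushing segments to Solr
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20121215120000 -filter -normalize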
 
-----Original message-----
> From:Julien Nioche <li...@gmail.com>
> Sent: Mon 17-Dec-2012 15:11
> To: user@nutch.apache.org
> Subject: Re: How to extend Nutch for article crawling
> 
> Hi
> 
> See comments below
> 
> 
> > 1. Add article list pages into url/seed.txt
> >     Here's one problem: what I actually want indexed is the article
> > pages, not the article list pages. But if I don't allow the list pages
> > to be indexed, Nutch will do nothing, because the list pages are the
> > entry points. So, how can I index only the article pages and not the
> > list pages?
> >
> 
> I think that the indexer can now filter URLs but can't remember whether it
> is for 1.x only or is in 2.x as well. Anyone?
> This would work if you can find a regular expression that captures the list
> pages. Another approach would be to tweak the indexer so that it skips
> documents containing an arbitrary metadatum (e.g. skip.indexing); this
> metadatum would be set by a custom parser when processing the list pages.
> 
> I think this would be a useful feature to have anyway. URL filters use the
> URL string only, and having the option to skip based on metadata would be
> good, IMHO.
> 
> 
> >
> > 2. Write a plugin to parse out the 'author', 'date', 'article body',
> > 'headline', and maybe other information from the HTML.
> >     The 'Parser' plugin interface in Nutch 2.1 is:
> >     Parse getParse(String url, WebPage page)
> >     And the 'WebPage' class has some predefined attributes:
> > public class WebPage extends PersistentBase {
> >   //...
> >   private Utf8 baseUrl;
> >   // ...
> >   private Utf8 title;
> >   private Utf8 text;
> >   // ...
> >   private Map<Utf8,ByteBuffer> metadata;
> >   // ...
> > }
> >
> >     So, the only field where I can put my custom attributes is
> > 'metadata'. Is it designed for this purpose?
> >     BTW, the Parser in trunk looks like: 'public ParseResult
> > getParse(Content content)', which seems more reasonable to me.
> >
> 
> The extension point Parser is for low-level parsing, i.e. extracting text
> and metadata from binary formats, which is typically done by parse-tika.
> What you want to implement is an extension of ParseFilter that adds your
> own entries to the parse metadata. The creative-commons plugin should be a
> good example to get started with.
> 
> 
> >
> > 3. After the articles are indexed into Solr, another application can query
> > it by 'date' and then store the article information into MySQL.
> >     My question here is: can Nutch store the articles directly into
> > MySQL? Or can I write a plugin to customize the indexing behavior?
> >
> 
> You could use the MySQL backend in GORA (but it is broken, AFAIK) and have
> the other application use it; alternatively, you could write a custom
> indexer that sends documents directly to MySQL, but that would be a bit
> redundant. Do you need to use SOLR at all, or is the aim simply to store
> in MySQL?
> 
> 
> >
> > Is Nutch a good choice for my purpose? If not, could you suggest another
> > good-quality framework/library?
> >
> 
> You can definitely do that with Nutch. There are certainly other resources
> that could be used, but they might also need a bit of customisation anyway.
> 
> HTH
> 
> Julien
> 
> 
> -- 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 

Re: Re: How to extend Nutch for article crawling

Posted by Julien Nioche <li...@gmail.com>.
Hi

>>> The callback method in the IndexingFilter has a 'URL' parameter and
> returns NutchDocument, so it is hard to customize it to do this.
>

I did not mean the IndexingFilter, but rather using a standard URLFilter
during the indexing step.


> >>> So, it would be better to add a 'skip' ability to the IndexingFilter
> based on URL or metadata.
>         @Override
>         public NutchDocument filter(NutchDocument doc, String url, WebPage
> page)


Well, we could indeed nullify a document in an IndexingFilter based on the
metadata found in the WebPage, which would save us hacking the indexer
code. That's assuming that the rest of the chain behaves nicely in the
presence of a null.
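
As a rough sketch of that idea ('skip.indexing' is an arbitrary key,
getFromMetadata is assumed to be the Gora-generated accessor for the map
field, and the plugin.xml wiring plus the getFields()/conf boilerplate are
omitted):

import org.apache.avro.util.Utf8;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;

public class SkipMarkedPagesFilter /* implements IndexingFilter */ {
  private static final Utf8 SKIP_KEY = new Utf8("skip.indexing");

  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    // Drop the document if a custom parser marked this page (e.g. a list
    // page) as not to be indexed.
    if (page.getFromMetadata(SKIP_KEY) != null) {
      return null; // assumes the rest of the chain tolerates a null
    }
    return doc;
  }
}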

[...]


> >>> Good suggestion. I have not yet decided whether to depend on SOLR or
> not. SOLR is an amazing tool for indexing; however, I'm not quite sure
> whether it is good to store the 'content' inside it. By default, the
> 'content' field is configured to be indexed but not stored. What do you
> think?


Many people see SOLR as a distributed NoSQL datastore that also has the
added benefit of indexing + search. Whether it is a good match for your
needs depends on the sort of queries you'd want to run on it. It can
definitely store the content, and I believe that recent versions have
great compression performance, so this is not really an issue.
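
For instance, in the schema.xml shipped with Nutch the content field is
indexed but not stored by default; storing it is a one-attribute change
(a sketch; the exact field type name varies across versions):

<!-- schema.xml: keep the raw extracted text alongside the index -->
<field name="content" type="text" stored="true" indexed="true"/>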

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Re: How to extend Nutch for article crawling

Posted by 高睿 <ga...@163.com>.
Hi,

Thank you very much for your comments. My replies are inline.


At 2012-12-17 22:04:48, "Julien Nioche" <li...@gmail.com> wrote:
>Hi
>
>See comments below
>
>
>> 1. Add article list pages into url/seed.txt
>>     Here's one problem: what I actually want indexed is the article
>> pages, not the article list pages. But if I don't allow the list pages to
>> be indexed, Nutch will do nothing, because the list pages are the entry
>> points. So, how can I index only the article pages and not the list pages?
>>
>
>I think that the indexer can now filter URLs but can't remember whether it
>is for 1.x only or is in 2.x as well. Anyone?
>This would work if you can find a regular expression that captures the list
>pages. Another approach would be to tweak the indexer so that it skips
>documents containing an arbitrary metadatum (e.g. skip.indexing); this
>metadatum would be set by a custom parser when processing the list pages.
>
>I think this would be a useful feature to have anyway. URL filters use the
>URL string only, and having the option to skip based on metadata would be
>good, IMHO.
>
>
>>> The callback method in the IndexingFilter has a 'URL' parameter and returns NutchDocument, so it is hard to customize it to do this.
>>> So, it would be better to add a 'skip' ability to the IndexingFilter, based on URL or metadata.
	@Override
	public NutchDocument filter(NutchDocument doc, String url, WebPage page)

>>
>> 2. Write a plugin to parse out the 'author', 'date', 'article body',
>> 'headline', and maybe other information from the HTML.
>>     The 'Parser' plugin interface in Nutch 2.1 is:
>>     Parse getParse(String url, WebPage page)
>>     And the 'WebPage' class has some predefined attributes:
>> public class WebPage extends PersistentBase {
>>   //...
>>   private Utf8 baseUrl;
>>   // ...
>>   private Utf8 title;
>>   private Utf8 text;
>>   // ...
>>   private Map<Utf8,ByteBuffer> metadata;
>>   // ...
>> }
>>
>>     So, the only field where I can put my custom attributes is
>> 'metadata'. Is it designed for this purpose?
>>     BTW, the Parser in trunk looks like: 'public ParseResult
>> getParse(Content content)', which seems more reasonable to me.
>>
>
>The extension point Parser is for low-level parsing, i.e. extracting text
>and metadata from binary formats, which is typically done by parse-tika.
>What you want to implement is an extension of ParseFilter that adds your
>own entries to the parse metadata. The creative-commons plugin should be a
>good example to get started with.
>
>
>>> Very good point. The manual I have read does not cover this part. Currently, I have a customized Parser to parse the HTML. My parser first delegates the parse request to the existing 'HtmlParser' plugin implementation, and then extracts the detailed information. Its performance is indeed low.
>>
>>
>>
>> 3. After the articles are indexed into Solr, another application can query 
>> it by 'date' and then store the article information into MySQL.
>>     My question here is: can Nutch store the articles directly into
>> MySQL? Or can I write a plugin to customize the indexing behavior?
>>
>
>You could use the MySQL backend in GORA (but it is broken, AFAIK) and have
>the other application use it; alternatively, you could write a custom
>indexer that sends documents directly to MySQL, but that would be a bit
>redundant. Do you need to use SOLR at all, or is the aim simply to store
>in MySQL?
>
>
>
>>> Good suggestion. I have not yet decided whether to depend on SOLR or not. SOLR is an amazing tool for indexing; however, I'm not quite sure whether it is good to store the 'content' inside it. By default, the 'content' field is configured to be indexed but not stored. What do you think?
>>
>>
>> Is Nutch a good choice for my purpose? If not, could you suggest another
>> good-quality framework/library?
>>
>
>You can definitely do that with Nutch. There are certainly other resources
>that could be used, but they might also need a bit of customisation anyway.
>
>HTH
>
>Julien
>
>
>-- 
>Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com
>http://twitter.com/digitalpebble

Re: How to extend Nutch for article crawling

Posted by Julien Nioche <li...@gmail.com>.
Hi

See comments below


> 1. Add article list pages into url/seed.txt
>     Here's one problem: what I actually want indexed is the article
> pages, not the article list pages. But if I don't allow the list pages to
> be indexed, Nutch will do nothing, because the list pages are the entry
> points. So, how can I index only the article pages and not the list pages?
>

I think that the indexer can now filter URLs but can't remember whether it
is for 1.x only or is in 2.x as well. Anyone?
This would work if you can find a regular expression that captures the list
pages. Another approach would be to tweak the indexer so that it skips
documents containing an arbitrary metadatum (e.g. skip.indexing); this
metadatum would be set by a custom parser when processing the list pages.

I think this would be a useful feature to have anyway. URL filters use the
URL string only, and having the option to skip based on metadata would be
good, IMHO.
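
For example, the list pages could be excluded by a filter file like this
(regex-urlfilter syntax, applied at the indexing step; the URL patterns
are made up, and the first matching pattern wins):

# skip the list pages, keep everything else (illustrative patterns only)
-^http://www\.example\.com/news/list
+.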


>
> 2. Write a plugin to parse out the 'author', 'date', 'article body',
> 'headline', and maybe other information from the HTML.
>     The 'Parser' plugin interface in Nutch 2.1 is:
>     Parse getParse(String url, WebPage page)
>     And the 'WebPage' class has some predefined attributes:
> public class WebPage extends PersistentBase {
>   //...
>   private Utf8 baseUrl;
>   // ...
>   private Utf8 title;
>   private Utf8 text;
>   // ...
>   private Map<Utf8,ByteBuffer> metadata;
>   // ...
> }
>
>     So, the only field where I can put my custom attributes is
> 'metadata'. Is it designed for this purpose?
>     BTW, the Parser in trunk looks like: 'public ParseResult
> getParse(Content content)', which seems more reasonable to me.
>

The extension point Parser is for low-level parsing, i.e. extracting text
and metadata from binary formats, which is typically done by parse-tika.
What you want to implement is an extension of ParseFilter that adds your
own entries to the parse metadata. The creative-commons plugin should be a
good example to get started with.
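
A skeleton of such a ParseFilter might look like the following (a sketch
only: the interface signature should be checked against the 2.x tree, the
metadata key is made up, and the DOM extraction is left site-specific):

import java.nio.ByteBuffer;
import org.apache.avro.util.Utf8;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.storage.WebPage;
import org.w3c.dom.DocumentFragment;

public class ArticleParseFilter /* implements ParseFilter */ {
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Walk the DOM for site-specific fields and stash them in the page
    // metadata; 'article.author' is an illustrative key.
    String author = extractAuthor(doc);
    if (author != null) {
      page.putToMetadata(new Utf8("article.author"),
          ByteBuffer.wrap(author.getBytes()));
    }
    return parse;
  }

  private String extractAuthor(DocumentFragment doc) {
    return null; // site-specific DOM traversal goes here
  }
}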


>
> 3. After the articles are indexed into Solr, another application can query
> it by 'date' and then store the article information into MySQL.
>     My question here is: can Nutch store the articles directly into
> MySQL? Or can I write a plugin to customize the indexing behavior?
>

You could use the MySQL backend in GORA (but it is broken, AFAIK) and have
the other application use it; alternatively, you could write a custom
indexer that sends documents directly to MySQL, but that would be a bit
redundant. Do you need to use SOLR at all, or is the aim simply to store
in MySQL?
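
For reference, pointing Nutch 2.x at MySQL through Gora is a configuration
change; roughly like this, with made-up connection details and property
keys that should be verified against your Gora version (and bearing in
mind the backend is reported broken):

# nutch-site.xml: set storage.data.store.class=org.apache.gora.sql.store.SqlStore
# gora.properties: JDBC settings for the SQL store
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch
gora.sqlstore.jdbc.user=nutch
gora.sqlstore.jdbc.password=secret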


>
> Is Nutch a good choice for my purpose? If not, could you suggest another
> good-quality framework/library?
>

You can definitely do that with Nutch. There are certainly other resources
that could be used, but they might also need a bit of customisation anyway.

HTH

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble