Posted to user@nutch.apache.org by Ernesto De Santis <de...@yahoo.com.ar> on 2006/09/11 02:54:26 UTC

rss integration

Hi all

I'm trying to integrate RSS and Atom sources into my Nutch index.
I see that Nutch has an RSSParser, but it seems to index the whole
feed as a single document, right?

I want to index each item separately.
Has anybody done this? What's the best approach?

I'm thinking of writing an external process that adds Documents to the
Nutch (Lucene) index using an RSS fetcher like Rome. The drawback is
that it wouldn't be integrated with Nutch.

I don't know the Nutch core well enough to hack it, and I don't know
whether it's possible to integrate this into Nutch.

Thanks a lot!
Ernesto.

Re: rss integration

Posted by Ernesto De Santis <de...@yahoo.com.ar>.

Thanks a lot Chris, it's working.

The key was in crawl-urlfilter.txt: the item URLs all had more than
three '/' characters and contained filtered characters like '?'.

I commented out these two lines:
#-[?*!@=]
and
#-.*(/.+?)/.*?\1/.*?\1/
and now it works very well.
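
A minimal sketch of why commenting out these rules can help. It assumes the filter applies rules in order, that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is skipped. The `accepts` helper, the rule lists, and the item URL are hypothetical and only illustrate the idea; this is not the Nutch implementation.

```python
import re

def accepts(rules, url):
    """Apply rules in order; the first rule whose pattern is found in the URL decides."""
    for line in rules:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out rules are ignored
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is skipped

# Hypothetical item URL containing a query string.
url = "http://www.cnn.com/story?id=1"

before = ["-[?*!@=]", r"-.*(/.+?)/.*?\1/.*?\1/", "+^http://www.cnn.com/"]
after = ["#-[?*!@=]", r"#-.*(/.+?)/.*?\1/.*?\1/", "+^http://www.cnn.com/"]

print(accepts(before, url))  # False: the '?' rule rejects the URL first
print(accepts(after, url))   # True: only the host rule is left to match
```

Under these assumptions the order of the rules matters: a '-' rule placed above a '+' rule wins for any URL that both match.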

Thanks Chris!
Ernesto.



Re: rss integration

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Ernesto,

  You need to make sure that the links inside of the RSS files that are
getting indexed are not filtered out by your url filter. For instance, say
you had an RSS file that had the following links:

http://foo.com/news/
http://foo.bar.com/sports/
http://bar.foo.com/breaking/news/highlights

Your url filter would need to accept each of the different host names and
paths that you want to index. In your example below, I'm pretty sure that
your URL filter limits you to only those 2 hosts, rss.cnn.com and
www.cnn.com. I think it might help if you changed your filter, for example
to:

+^http://([a-z0-9]*\.)*cnn.com/

Ensure that the links present in the CNN RSS files fall within the
*.cnn.com domain; otherwise, update your url filter accordingly.
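
A quick way to see which hosts the suggested pattern covers is to try it against a few URLs. The sketch below uses Python's `re` module as a stand-in for the filter's regex engine, which is an assumption; `sub.cnn.com` is a hypothetical host, and `foo.com` comes from the example links above.

```python
import re

# The pattern from the suggested filter rule, without the leading '+'.
pattern = re.compile(r"^http://([a-z0-9]*\.)*cnn.com/")

def host_accepted(url):
    """Return True if the pattern is found at the start of the URL."""
    return pattern.search(url) is not None

for url in [
    "http://rss.cnn.com/rss/cnn_topstories.rss",
    "http://www.cnn.com/",
    "http://sub.cnn.com/page",   # hypothetical subdomain
    "http://foo.com/news/",      # outside the cnn.com domain
]:
    print(url, host_accepted(url))
```

Only the last URL is rejected here, because its host does not end in cnn.com.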

 More specific comments below:

On 9/10/06 11:23 PM, "Ernesto De Santis" <de...@yahoo.com.ar>
wrote:

> Hi Chris
> 
> Thanks for your response.
> But I can't get it to work.
> 
> It always indexes the whole channel as one Document.
> 
> These are the steps I followed (to index a CNN channel):
> 
> 1- In my seed file, I wrote just one seed:
> 
> http://rss.cnn.com/rss/cnn_topstories.rss

Good, that's the right thing to do.

> 
> 2- Include the parser:
> 
> In the file nutch-default.xml, in the plugin.includes property, I included
> the RSS parser:
>   
> <value>protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|index-url-category</value>
> 

Perfect.

> 3- Accept CNN hosts
> 
> In the file crawl-urlfilter.txt I wrote:
> +^http://rss.cnn.com/
> +^http://www.cnn.com/

See my comments above here. I think that you need to change these.

> 
> Then I ran the crawler, but I always get an index with only one Document.
> I tried a few more things, without success (like setting
> db.ignore.internal.links to false and changing the order of the MIME type
> parsers; I read about a problem with that in one of your posts).
> 
> Do you know what I'm forgetting?
> 
> How can I be sure that parse-rss is parsing any content?
> Is there any log output for that?

Yup, there should be some information in the nutch.log file. Do a grep for
"parse-rss" or "RSSParser" in the log file.

> 
> About outlinks, I don't understand what I must do with them. Do I need to do
> something with the outlinks after parse-rss has run?

Nope. Outlinks are links coming out of a page of content. So, if there are 5
links in a web page, or an RSS document, then there are 5 so-called
"Outlinks" in Nutch terminology. During the parsing phase, as content is
parsed individually, Nutch requires a parser to append any Outlinks found in
a particular piece of content and return them back to the Fetcher so that
they too can be crawled.
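
The loop described above can be sketched roughly as follows. This is a simplified illustration, not how Nutch schedules its fetches: the `PAGES` dictionary stands in for fetching and parsing, and every URL and name in it is hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (text, outlinks returned by the parser).
PAGES = {
    "http://rss.example.com/feed.rss": (
        "channel text",
        ["http://www.example.com/a", "http://www.example.com/b"],
    ),
    "http://www.example.com/a": ("story a", []),
    "http://www.example.com/b": ("story b", []),
}

def crawl(seed, url_filter):
    """Index the seed, then every outlink that passes the URL filter."""
    queue, seen, indexed = deque([seed]), {seed}, []
    while queue:
        url = queue.popleft()
        text, outlinks = PAGES[url]  # stand-in for fetch + parse
        indexed.append(url)
        for link in outlinks:
            if link not in seen and url_filter(link):
                seen.add(link)
                queue.append(link)
    return indexed

seed = "http://rss.example.com/feed.rss"
print(crawl(seed, lambda u: True))
print(crawl(seed, lambda u: u.startswith("http://rss.example.com/")))
```

With the permissive filter, the feed and both items are indexed. With the filter that only accepts the feed's host, the item links are dropped and only the feed itself is indexed, which is the "one Document" symptom.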

HTH,
  Chris

Re: rss integration

Posted by Ernesto De Santis <de...@yahoo.com.ar>.

Hi Chris

Thanks for your response.
But I can't get it to work.

It always indexes the whole channel as one Document.

These are the steps I followed (to index a CNN channel):

1- In my seed file, I wrote just one seed:

http://rss.cnn.com/rss/cnn_topstories.rss

2- Include the parser:

In the file nutch-default.xml, in the plugin.includes property, I included
the RSS parser:

<value>protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|index-url-category</value>

3- Accept CNN hosts

In the file crawl-urlfilter.txt I wrote:
+^http://rss.cnn.com/
+^http://www.cnn.com/
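
For step 2, one way to double-check the plugin.includes value is to test it against plugin ids. The sketch below assumes the value is treated as a regular expression that must match a plugin's whole id, and it uses Python's `re` module as a stand-in; `parse-pdf` is included only as an id the expression should not cover.

```python
import re

# The plugin.includes value from step 2.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|"
    "query-(basic|site|url)|summary-basic|scoring-opic|index-url-category"
)

def is_included(plugin_id):
    """Assumption: a plugin is enabled when the expression matches its whole id."""
    return re.fullmatch(PLUGIN_INCLUDES, plugin_id) is not None

for plugin_id in ["parse-rss", "parse-html", "query-site", "parse-pdf"]:
    print(plugin_id, is_included(plugin_id))
```

Under that assumption, parse-rss is enabled by this value, so the remaining suspects are the URL filter rules in step 3.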

Then I ran the crawler, but I always get an index with only one Document.
I tried a few more things, without success (like setting
db.ignore.internal.links to false and changing the order of the MIME type
parsers; I read about a problem with that in one of your posts).

Do you know what I'm forgetting?

How can I be sure that parse-rss is parsing any content?
Is there any log output for that?

About outlinks, I don't understand what I must do with them. Do I need to do
something with the outlinks after parse-rss has run?

Thanks a lot ... again.
Ernesto.


Re: rss integration

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Ernesto,

 The RSSParser in Nutch does in fact index the individual item links: they
are added as Outlinks during each iteration in which the RSSParser is
called. Both the channel text and the item text are indexed. Also, since
each Item link is added as an Outlink to the list of returned Outlinks,
Nutch is able to crawl many urls that can come out of a single RSS feed.
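
To make the idea concrete, here is a minimal sketch of a feed parser that collects the channel and item text and returns each item link as an outlink. It uses Python's standard `xml.etree.ElementTree` and a made-up RSS 2.0 feed; it illustrates the behaviour described above and is not the RSSParser code itself.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed with two items.
FEED = """<rss version="2.0"><channel>
  <title>Example channel</title>
  <link>http://example.com/</link>
  <item><title>First story</title><link>http://example.com/a</link>
    <description>About a</description></item>
  <item><title>Second story</title><link>http://example.com/b</link>
    <description>About b</description></item>
</channel></rss>"""

def parse_feed(xml_text):
    """Return (indexable text, outlinks) for one RSS document."""
    channel = ET.fromstring(xml_text).find("channel")
    parts = [channel.findtext("title", "")]
    outlinks = []
    for item in channel.findall("item"):
        parts.append(item.findtext("title", ""))
        parts.append(item.findtext("description", ""))
        link = item.findtext("link")
        if link:
            outlinks.append(link.strip())  # each item link becomes an outlink
    return " ".join(p for p in parts if p), outlinks

text, outlinks = parse_feed(FEED)
print(text)
print(outlinks)
```

The feed itself yields a single block of text, while the item pages only get fetched later if their links survive the URL filter.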

HTH,
  Chris
