Posted to user@nutch.apache.org by Ernesto De Santis <de...@yahoo.com.ar> on 2006/09/11 02:54:26 UTC
rss integration
Hi all
I'm trying to integrate RSS and Atom sources into my Nutch index.
I see that Nutch has an RSSParser, but it seems to index the whole
feed as a single document, right?
I want to index each item separately.
Has anybody done this? What's the best approach?
I'm thinking about an external process that adds Documents to the Nutch (Lucene)
index using an RSS fetcher like Rome. The downside is
that it isn't integrated with Nutch.
I don't know the Nutch core well enough to hack it, and I don't know if it is
possible to integrate this into Nutch.
Thanks a lot!
Ernesto.
__________________________________________________
Ask. Answer. Discover.
Everything you wanted to know, and what you never even imagined,
is on Yahoo! Respuestas (Beta).
Try it now!
http://www.yahoo.com.ar/respuestas
Re: rss integration
Posted by Ernesto De Santis <de...@yahoo.com.ar>.
Thanks a lot Chris, it's working.
The key was in crawl-urlfilter.txt: all the item URLs had more
than three '/' and contained filtered characters like '?'.
I commented out the lines:
#-[?*!@=]
and
#-.*(/.+?)/.*?\1/.*?\1/
and now it works very well.
Thanks Chris!
Ernesto.
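A minimal sketch of why those two rules could block feed item links. It assumes that the regex URL filter applies the first rule whose pattern is found anywhere in the URL and drops URLs that match no rule, and that the two "-" rules sit above the "+" rule in the file. It is written in Python rather than Java, and the sample URLs are hypothetical stand-ins, not links taken from the CNN feed.

```python
import re

def first_decision(rules, url):
    """Return True/False from the first rule found in url, or None if none match."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return None

def accepts(rules, url):
    """A URL is kept only if the first matching rule is a '+' rule."""
    return first_decision(rules, url) is True

# Assumed rule order: the two skip rules first, then the accept rule.
DEFAULT_RULES = [
    ("-", r"[?*!@=]"),                 # skip URLs containing these characters
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),   # skip URLs where a /segment repeats 3 times
    ("+", r"^http://([a-z0-9]*\.)*cnn.com/"),
]
# Same file with the two skip rules commented out.
RELAXED_RULES = DEFAULT_RULES[2:]

# Hypothetical item links.
SAMPLES = [
    "http://www.cnn.com/story?id=1",
    "http://www.cnn.com/a/b/a/c/a/d.html",
    "http://www.cnn.com/2006/WORLD/index.html",
]

for url in SAMPLES:
    print(url, accepts(DEFAULT_RULES, url), accepts(RELAXED_RULES, url))
```

Commenting out the skip rules lets every URL under *.cnn.com through, including query-string URLs, so it trades loop protection for coverage.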
Re: rss integration
Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Ernesto,
You need to make sure that the links inside of the RSS files that are
getting indexed are not filtered out by your url filter. For instance, say
you had an RSS file that had the following links:
http://foo.com/news/
http://foo.bar.com/sports/
http://bar.foo.com/breaking/news/highlights
You would need your URL filter to accept each of the
different host names and paths that you want to index. In your
example below, I'm pretty sure that your URL filter limits you to only
those 2 hosts, rss.cnn.com and www.cnn.com. I think that if you changed
your filter, for example to:
+^http://([a-z0-9]*\.)*cnn.com/
That might help. Make sure that the links present in the CNN RSS files fall
within the *.cnn.com domain; otherwise, update your URL filter accordingly.
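To see which hosts each filter would let through, here is a small sketch in Python (not Java), assuming a rule accepts a URL when its pattern is found in it. The edition.cnn.com URL is a hypothetical example of another CNN host; the others come from this thread.

```python
import re

# The two host-specific rules from the original filter.
ORIGINAL = [
    re.compile(r"^http://rss.cnn.com/"),
    re.compile(r"^http://www.cnn.com/"),
]
# The broader rule suggested above.
SUGGESTED = re.compile(r"^http://([a-z0-9]*\.)*cnn.com/")

def accepted_by_original(url):
    """True if either host-specific rule is found in the URL."""
    return any(p.search(url) for p in ORIGINAL)

def accepted_by_suggested(url):
    """True if the broader *.cnn.com rule is found in the URL."""
    return SUGGESTED.search(url) is not None

URLS = [
    "http://rss.cnn.com/rss/cnn_topstories.rss",
    "http://edition.cnn.com/news/",   # hypothetical extra CNN host
    "http://foo.com/news/",
]

for url in URLS:
    print(url, accepted_by_original(url), accepted_by_suggested(url))
```

Note that the dot in "cnn.com" is unescaped, so it matches any character; writing it as "cnn\.com" would be stricter.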
More specific comments below:
On 9/10/06 11:23 PM, "Ernesto De Santis" <de...@yahoo.com.ar>
wrote:
> Hi Chris
>
> Thanks for your response.
> But I can't get it to work.
>
> It always indexes the whole channel as one Document.
>
> I did these steps (to index a CNN channel):
>
> 1- Write a seed file with just one seed:
>
> http://rss.cnn.com/rss/cnn_topstories.rss
Good, that's the right thing to do.
>
> 2- Include the parser:
>
> In the file nutch-default.xml, in the plugin.includes property, I included the
> rss parser:
>
> <value>protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|index-url-category</value>
>
Perfect.
> 3- Accept CNN hosts:
>
> In the file crawl-urlfilter.txt I wrote:
> +^http://rss.cnn.com/
> +^http://www.cnn.com/
See my comments above here. I think that you need to change these.
>
> Then I run the crawler, but I always get an index with only one Document.
> I tried a few more things, without success (like setting
> db.ignore.internal.links to false and changing the order of the mimetype
> parsers; I read about a problem with that in one of your posts).
>
> Do you know what I'm forgetting?
>
> How can I be sure that parse-rss is parsing some content?
> Can I get some log output about that?
Yup, there should be some information in the nutch.log file. Do a grep for
"parse-rss" or "RSSParser" in the log file.
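The same check can be scripted. This is a minimal Python sketch of that grep. The two search strings come from the message above, while the function names and the log path in the usage note are assumptions, since the log location depends on the installation.

```python
NEEDLES = ("parse-rss", "RSSParser")

def matching_lines(lines, needles=NEEDLES):
    """Yield every line that mentions at least one of the search strings."""
    for line in lines:
        if any(needle in line for needle in needles):
            yield line.rstrip("\n")

def print_matches(path):
    """Print the matching lines of the log file at the given path."""
    with open(path, encoding="utf-8", errors="replace") as log:
        for hit in matching_lines(log):
            print(hit)
```

For example, print_matches("logs/nutch.log") would list the relevant lines if the log lives at that (hypothetical) path. No output means the parser was probably never invoked for the feed.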
>
> About outlinks, I don't understand what I must do with them. Do I need to do
> something with the outlinks after parse-rss runs?
Nope. Outlinks are links coming out of a page of content. So, if there are 5
links in a web page, or an RSS document, then there are 5 so-called
"Outlinks" in Nutch terminology. During the parsing phase, as content is
parsed individually, Nutch requires a parser to append any Outlinks found in
a particular piece of content and return them back to the Fetcher so that
they too can be crawled.
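As a toy illustration of that flow (not Nutch's actual code), the Python sketch below treats parsing as a function that returns outlinks and queues the ones that pass the URL filter for the next round. The in-memory PAGES table, the story URLs, and the depth parameter are all assumptions made up for the example.

```python
import re

# Hypothetical in-memory "web": URL -> outlinks a parser would report for it.
FEED = "http://rss.cnn.com/rss/cnn_topstories.rss"
PAGES = {
    FEED: [
        "http://www.cnn.com/story-one/",
        "http://www.cnn.com/story-two/",
        "http://foo.com/news/",
    ],
    "http://www.cnn.com/story-one/": [],
    "http://www.cnn.com/story-two/": [],
}

ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*cnn.com/")

def parse(url):
    """Stand-in for a parser: return the outlinks found in the content at url."""
    return PAGES.get(url, [])

def crawl(seeds, depth):
    """Fetch in rounds; outlinks that pass the filter are queued for the next round."""
    seen = set(seeds)
    frontier = list(seeds)
    fetched = []
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            fetched.append(url)
            for link in parse(url):
                if link not in seen and ACCEPT.search(link):
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched

print(crawl([FEED], depth=2))
```

In this sketch a single round fetches only the feed itself. The item pages appear in the second round, and only if the filter accepts them.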
HTH,
Chris
Re: rss integration
Posted by Ernesto De Santis <de...@yahoo.com.ar>.
Hi Chris
Thanks for your response.
But I can't get it to work.
It always indexes the whole channel as one Document.
I did these steps (to index a CNN channel):
1- Write a seed file with just one seed:
http://rss.cnn.com/rss/cnn_topstories.rss
2- Include the parser:
In the file nutch-default.xml, in the plugin.includes property, I included the
rss parser:
<value>protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|index-url-category</value>
3- Accept CNN hosts:
In the file crawl-urlfilter.txt I wrote:
+^http://rss.cnn.com/
+^http://www.cnn.com/
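For step 2, a quick way to check which plugin ids the plugin.includes value covers. This Python sketch assumes the value is treated as a regular expression that must match a whole plugin id. The helper name is hypothetical, and parse-pdf is used only as an example of an id that is not listed.

```python
import re

# The plugin.includes value from step 2, split across two string literals.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|"
    "query-(basic|site|url)|summary-basic|scoring-opic|index-url-category"
)

def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """True if the whole plugin id matches the plugin.includes expression."""
    return re.fullmatch(includes, plugin_id) is not None

for plugin_id in ("parse-rss", "parse-html", "query-site", "parse-pdf"):
    print(plugin_id, is_included(plugin_id))
```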
Then I run the crawler, but I always get an index with only one Document.
I tried a few more things, without success (like setting
db.ignore.internal.links to false and changing the order of the mimetype
parsers; I read about a problem with that in one of your posts).
Do you know what I'm forgetting?
How can I be sure that parse-rss is parsing some content?
Can I get some log output about that?
About outlinks, I don't understand what I must do with them. Do I need to do
something with the outlinks after parse-rss runs?
Thanks a lot ... again.
Ernesto.
Re: rss integration
Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Ernesto,
The RSSParser in Nutch does in fact index the individual item links: they
are added as Outlinks during each iteration in which the RSSParser is
called. Both the channel text and the item text are indexed. Also, since
each Item link is added as an Outlink to the list of returned Outlinks,
Nutch is able to crawl many urls that can come out of a single RSS feed.
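To make the idea concrete, here is a minimal Python sketch of pulling the item links out of an RSS 2.0 document so they can be returned as outlinks. It illustrates the concept only and is not the RSSParser implementation. The sample feed and the function name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed with one channel and two items.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <link>http://example.com/</link>
    <item><title>First</title><link>http://example.com/first</link></item>
    <item><title>Second</title><link>http://example.com/second</link></item>
  </channel>
</rss>"""

def item_outlinks(rss_text):
    """Return the <link> of every <item>, in document order."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links

print(item_outlinks(SAMPLE_FEED))
```

Each returned link is a separate URL for the crawler to fetch, which is how a single feed can lead to one indexed document per item.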
HTH,
Chris