Posted to user@nutch.apache.org by Ernesto De Santis <de...@yahoo.com.ar> on 2006/09/18 15:41:43 UTC

youtube rss failure

Hi all

I am having problems parsing the YouTube RSS feed.

This is the url:
http://youtube.com/rss/global/top_viewed_today.rss

It seems to have a problem with this line:
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">

In the log file I see:

2006-09-18 09:00:04,163 INFO  fetcher.Fetcher - fetching http://youtube.com/rss/global/top_viewed_today.rss
2006-09-18 09:00:17,265 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: xmlns
    at java.net.URL.<init>(URL.java:574)
    at java.net.URL.<init>(URL.java:464)
    at java.net.URL.<init>(URL.java:413)
    at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
    at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
    at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
    at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
    at org.apache.nutch.parse.text.TextParser.getParse(TextParser.java:47)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
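
For reference, the exception at the top of this trace can be reproduced
outside Nutch. The following is a minimal sketch, assuming the outlink
extractor handed a token such as xmlns:media="..." to java.net.URL (the log
does not show the exact string, so the token below is an assumption). The
class XmlnsUrlDemo and the method rejectionReason are hypothetical names
used only for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Illustration only: shows how java.net.URL reacts to a string whose
// "scheme" part is not a protocol it has a handler for.
class XmlnsUrlDemo {

    // Returns the exception message if java.net.URL rejects the string,
    // or null if the string is accepted.
    static String rejectionReason(String candidate) {
        try {
            new URL(candidate);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumed token: the namespace attribute from the feed's <rss> element.
        String token = "xmlns:media=\"http://search.yahoo.com/mrss\"";
        System.out.println(rejectionReason(token));
        // A plain http URL is accepted, so no reason is reported.
        System.out.println(rejectionReason("http://search.yahoo.com/mrss"));
    }
}
```

Everything before the first colon is read as the protocol, so "xmlns" is
looked up as if it were a scheme like "http", and no handler exists for it.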

Does anybody know what is wrong, or whether this is a bug?

Thanks,
Ernesto

Re: youtube rss failure

Posted by Ernesto De Santis <de...@yahoo.com.ar>.
Yes!

You are right, it works. I don't understand how you worked out what the
problem was.

What do I have to do so that the RSS feed is handled with the right MIME type?

I have another RSS channel with a similar problem: its URL does not have rss
as a path suffix. And I see this in the rss-parser plugin.xml:

<extension id="org.apache.nutch.parse.rss" name="RssParse"
           point="org.apache.nutch.parse.Parser">
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser">
      <parameter name="contentType" value="application/rss+xml"/>
      <parameter name="pathSuffix" value="rss"/>
   </implementation>
</extension>

I tried changing it, but it doesn't work.

Thanks a lot
Ernesto.


Meghna Kukreja wrote:
> Hi Ernesto,
>
> The reason you are getting that error is because the content-type
> returned is "text/plain" which calls the text parser and not the rss
> parser plugin. Just to check that it works, you can put "<plugin
> id="parse-rss" />" under mimeType for "text/plain" as the first option
> and try crawling it again.
>
> -Meghna

Re: youtube rss failure

Posted by Meghna Kukreja <om...@gmail.com>.
Hi Ernesto,

You are getting that error because the content type returned is
"text/plain", so the text parser is called instead of the RSS parser
plugin. Just to check that it works, you can put "<plugin
id="parse-rss" />" under the mimeType entry for "text/plain" as the first
option and try crawling it again.
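
The effect of that change can be pictured with a small Java sketch. It is a
simplified model, not Nutch's actual code: the class ParserChoiceSketch, the
method choose and the lookup table are hypothetical, and parse-text is
assumed to be the id of the text parser plugin. The sketch only shows that
when plugins are tried in listed order, the first id under a content type
decides which parser sees the document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of picking a parser by content type.
// None of these names come from Nutch; they exist only for illustration.
class ParserChoiceSketch {

    // Returns the first plugin id listed for the content type,
    // or null when the table has no entry for it.
    static String choose(Map<String, List<String>> table, String contentType) {
        List<String> ids = table.get(contentType);
        return (ids == null || ids.isEmpty()) ? null : ids.get(0);
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = new HashMap<>();
        table.put("application/rss+xml", Arrays.asList("parse-rss"));
        // Assumed starting point: text/plain goes to the text parser only.
        table.put("text/plain", Arrays.asList("parse-text"));
        System.out.println(choose(table, "text/plain")); // prints parse-text

        // The suggested check: list parse-rss first under text/plain.
        table.put("text/plain", Arrays.asList("parse-rss", "parse-text"));
        System.out.println(choose(table, "text/plain")); // prints parse-rss
    }
}
```

Since the feed arrives labelled as text/plain, its entry is the one that
matters here, not the one for application/rss+xml.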

-Meghna
