Posted to user@nutch.apache.org by Israel <we...@gmail.com> on 2010/08/21 01:31:41 UTC

Crawl atom, rss, xml .... I need any plugin extra?

Hello, I tried to index pages that use XML, RSS, Atom, or even RDF
(or a similar format), but errors occur. I downloaded the "parse xml"
plugin, but I don't know how to use it.

I want to index these pages:

http://cnx.org/lenses/ccotp/endorsements/atom
http://ocw.nd.edu/courselist/rss
http://openlearn.open.ac.uk/file.php/1/learningspace.xml

Do I need a plugin? I tried with rss and feed. Also, how do I configure
the "crawl-urlfilter" *.txt files and the seed (web addresses)? If you
have a plugin, could you please send it to my mail? Thank you.

I've searched the web for hours and haven't found an answer.

Re: Crawl atom, rss, xml .... I need any plugin extra?

Posted by Volli <il...@web.de>.

Concerning the page
academicearth.rss:

Its Content-Type header is not declared correctly by their
web server. That's the reason why Firefox, Opera, SRWare Iron,
and Nutch interpret it as Content-Type: text/plain. Firefox
falls back to quirks mode. Only IE8 and Safari display
a "normal feed page".
The W3C feed validator validated it but issued a "text/plain"
warning.

I saved the file as academicearth.rss in the DOCROOT of my
web server, and the page was displayed correctly in all browsers
(Content-Type: text/xml, application/xhtml+xml, standards
compliant mode, "normal feed page").
W3C feed validator: no warning concerning Content-Type.

Ask their admin.
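
The header problem described above can be checked without a browser. The following is a minimal Python sketch, not part of Nutch: the function names and the set of feed media types are assumptions chosen for illustration (the types are the ones mentioned in this thread plus application/atom+xml). `fetch_content_type` needs network access and is only defined here, not called.

```python
import urllib.request

# Media types under which a feed is usually served (illustrative list).
FEED_MEDIA_TYPES = {
    "text/xml",
    "application/xml",
    "application/rss+xml",
    "application/atom+xml",
    "application/rdf+xml",
}


def media_type(content_type):
    """Strip parameters such as '; charset=UTF-8' and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def served_as_feed(content_type):
    """True if the Content-Type header names one of the feed media types."""
    return media_type(content_type) in FEED_MEDIA_TYPES


def fetch_content_type(url):
    """Request only the headers and return the Content-Type (needs network)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type", "")


print(served_as_feed("text/plain; charset=UTF-8"))  # False
print(served_as_feed("text/xml"))                   # True
```

A feed that is delivered as text/plain fails this check even if its XML is valid, which matches the validator warning reported above.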

On 24.08.2010 02:50, Israel wrote:
> Hello volley. please help me one more time, i want to crawl this page, but
> don't generate nothing...is posible?
> http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw
> or:
> This page is available .. rss, but leave type plain text, and the nutch
> search results page shows that:
> http://academicearth.org/academicearth.rss

Re: nutch crawler ignores query string url like "...a.php?b=com_x&c=y" - SOLVED

Posted by Israel <we...@gmail.com>.

Thanks, Volli, you rule, hahaha!

Re: nutch crawler ignores query string url like "...a.php?b=com_x&c=y" - SOLVED

Posted by Volli <il...@web.de>.

Because some characters were replaced by dots in my last post,
here is the "OR CHANGE:" variant in words:
remove the question mark and the equals sign from the rule.

I don't know whether the remaining characters are allowed in
a query string. Possibly a stupid solution.

nutch crawler ignores query string url like "...a.php?b=com_x&c=y" - SOLVED

Posted by Volli <il...@web.de>.

I think it's the query string exclusion in the files
conf/regex-urlfilter.txt or conf/crawl-urlfilter.txt:

FIND:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

REPLACE:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]

OR CHANGE:
# -[?*!@=]
-[*!@]
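
To see why the default rule drops the Princeton URL, here is a minimal Python sketch of how the regex URL filter appears to behave: rules are read top to bottom, the first pattern found anywhere in the URL decides ("+" accepts, "-" rejects), and a URL that matches no rule is rejected. The function name `first_match_filter` and the trailing "+." catch-all rule are assumptions for illustration; this is not Nutch code.

```python
import re


def first_match_filter(rules, url):
    """Return True if url is accepted; the first rule whose pattern is found decides."""
    for line in rules:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject


URL = "http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw"

default_rules = ["-[?*!@=]", "+."]
changed_rules = ["# -[?*!@=]", "-[*!@]", "+."]

print(first_match_filter(default_rules, URL))  # False: "?" and "=" hit the reject rule
print(first_match_filter(changed_rules, URL))  # True: only "*", "!" and "@" are rejected
```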


On 24.08.2010 02:50, Israel wrote:
> Hello volley. please help me one more time, i want to crawl this page, but
> don't generate nothing...is posible?
>
> http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw
...

Re: Crawl atom, rss, xml .... I need any plugin extra?

Posted by Israel <we...@gmail.com>.

Hello Volli, please help me one more time. I want to crawl this page, but
nothing is generated. Is it possible?

http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw

or:

This page is available as RSS, but it is served as plain text, and the
Nutch search results page shows it like that:

http://academicearth.org/academicearth.rss

Re: Crawl atom, rss, xml .... I need any plugin extra?

Posted by Israel <we...@gmail.com>.

Great, Volli, thank you very much. Regards, Israel

Re: Crawl atom, rss, xml .... I need any plugin extra?

Posted by Volli <il...@web.de>.

Addendum to my last post:

After rereading my own post: all crawls worked with the
parse-html parser. I think you don't need to update Nutch.

If not:
==>TODO1<==
In conf/parse-plugins.xml:

--FIND:
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>

--REPLACE WITH:
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
<mimeType name="application/xml">
<plugin id="parse-html" />
</mimeType>
==>TODO1-end<==
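
To double-check which parsers a parse-plugins.xml fragment assigns to each MIME type, and in which order, the file can be read with a few lines of Python. This is only a sketch: the function name `plugin_order` is made up, and the `<parse-plugins>` wrapper element around the fragment is an assumption about the file layout.

```python
import xml.etree.ElementTree as ET

# The TODO1 result, wrapped in an assumed <parse-plugins> root element.
FRAGMENT = """
<parse-plugins>
  <mimeType name="text/xml">
    <plugin id="parse-html" />
    <plugin id="parse-rss" />
    <plugin id="feed" />
  </mimeType>
  <mimeType name="application/xml">
    <plugin id="parse-html" />
  </mimeType>
</parse-plugins>
"""


def plugin_order(xml_text):
    """Map each mimeType name to its plugin ids, in the order they are listed."""
    root = ET.fromstring(xml_text.strip())
    return {
        mime.get("name"): [plugin.get("id") for plugin in mime.findall("plugin")]
        for mime in root.findall("mimeType")
    }


for name, plugins in plugin_order(FRAGMENT).items():
    print(name, plugins)
```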


Re: Crawl atom, rss, xml .... I need any plugin extra?

Posted by Volli <il...@web.de>.

I use Nutch version 1.1 (released 06 June 2010).

I didn't install any additional plugin!

I think your XML plugin at NUTCH-185 is outdated:
"Resolution: Won't Fix" and "Affects Version/s: 0.7.2, 0.8,
0.8.1".

Check your Nutch version (and update).

In "nutch-site.xml", check at "<name>plugin.includes</name>"
whether parse-tika is included:
"...parse-(text|html|js|tika)...".

If "nutch-site.xml" is empty because you don't use it,
check for parse-tika in "nutch-default.xml" instead.
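
The plugin.includes value is a regular expression over plugin ids, so the check can be tried out quickly. The sketch below is Python, not Nutch code, and rests on assumptions: `INCLUDES` is a shortened, illustrative value rather than the complete default, `plugin_enabled` is a made-up helper, and it assumes a plugin counts as enabled when the whole id matches the expression.

```python
import re

# Shortened, illustrative plugin.includes value (not the full default).
INCLUDES = "protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-basic"


def plugin_enabled(includes, plugin_id):
    """True if the complete plugin id matches the plugin.includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


print(plugin_enabled(INCLUDES, "parse-tika"))  # True
print(plugin_enabled(INCLUDES, "parse-rss"))   # False
print(plugin_enabled(INCLUDES, "feed"))        # False
```

Extending the value to "...parse-(text|html|js|tika|rss)|feed|..." would make the last two checks succeed as well.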

-------------------------
-------------------------
TESTING and **MY** BEST MATCHES (maybe some other guys out 
there have better ones):
-------------------------
-------------------------

I've tested your links for several hours. This is my whole
journal, including all failures.

Forget my last post concerning changes in
"parse-plugins.xml" and "nutch-site.xml"!

You'll find three blocks marked "==>TODOx<==" and "==>TODOx-end<==".

1)
- http://cnx.org/lenses/ccotp/endorsements/atom:
contentType=application/xml.

Crawled with Nutch 1.1 as is.
NOTHING CHANGED IN THE *.XML FILES.
SUCCESS, BUT: some search summaries contain HTML source code
(like "<p>" or "<a>").
I checked the source code of the page: lots of entities inside
the SUBTITLE tag, declared as type="text/html" => not a
parser fault, I think!

2)
nutch-site.xml: removed parse-tika:
Error: parser not found for contentType=application/xml

3)
parse-plugins.xml: added:
<mimeType name="application/xml">
<plugin id="parse-html" />
</mimeType>
Same result as with parse-tika. Not better.

4)
parse-plugins.xml: changed
<mimeType name="application/xml">
<plugin id="parse-rss" />
</mimeType>
nutch-site.xml: added parse-rss:
No errors but no search results (empty).

5)
parse-plugins.xml: changed
<mimeType name="application/xml">
<plugin id="feed" />
</mimeType>
nutch-site.xml: added feed:
Errors, errors, errors.

======> *MY* BEST MATCH:
Page - http://cnx.org/lenses/ccotp/endorsements:
contentType=application/xml.

==>TODO1<==
Nothing.
Crawl as is. parse-tika did it.
==>TODO1-end<==
---------------------------------------------------------

1)
- http://openlearn.open.ac.uk/file.php/1/learningspace.xml:
mime-type application/rss+xml
Crawl as is. parse-tika.
Error: Can't retrieve Tika parser for mime-type 
application/rss+xml.
Makes me wonder because I found "application/rss+xml" in 
tika-mimetypes.xml.

2)
Found in parse-plugins.xml:
<mimeType name="application/rss+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
nutch-site.xml: added parse-rss but not feed:
Error:  Can't be handled as rss document. 
org.apache.commons.feedparser.FeedParserException: 
org.jdom.input.JDOMParseException: Error on line 768: The 
element type "dc:creator" must be terminated by the matching 
end-tag "</dc:creator>".

nutch-site.xml: removed parse-rss and added feed:
Error: ditto

This means: the parse-rss or feed parser would be the right one
if the page weren't corrupt. Is it? I checked the source code
with Firefox and couldn't find any error!
Line 768: <dc:creator>The Open University</dc:creator>
looks fine. Strange!

3)
Changed in parse-plugins.xml:
<mimeType name="application/rss+xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
SUCCESS!

======> *MY* BEST MATCH:
Page - http://openlearn.open.ac.uk/file.php/1/learningspace.xml:
mime-type application/rss+xml.

==>TODO2<==
In conf/parse-plugins.xml:

--FIND:
<mimeType name="application/rss+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>

--REPLACE WITH:
<mimeType name="application/rss+xml">
<plugin id="parse-html" /><!--subsequently added. parse-rss 
and feed throw unreproducible error. thread msg00666.html et 
seqq.-->
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
==>TODO2-end<==
---------------------------------------------------------
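
The journal above suggests two things: a parser is only used if its plugin is also enabled in plugin.includes, and the order inside a mimeType block decides which parser gets the document first. A minimal Python sketch of that selection follows. It models the behavior observed in these tests and is not taken from the Nutch source; the helper `candidates` and the `ENABLED` set are assumptions for illustration.

```python
def candidates(mapping, enabled, mime_type):
    """Plugins mapped to mime_type, in file order, limited to enabled plugins."""
    return [plugin for plugin in mapping.get(mime_type, []) if plugin in enabled]


MIME = "application/rss+xml"

# parse-plugins.xml before and after TODO2.
BEFORE = {MIME: ["parse-rss", "feed"]}
AFTER = {MIME: ["parse-html", "parse-rss", "feed"]}

# Assumed plugin.includes content: parse-rss was added, feed was not.
ENABLED = {"parse-text", "parse-html", "parse-js", "parse-tika", "parse-rss"}

print(candidates(BEFORE, ENABLED, MIME))  # ['parse-rss']
print(candidates(AFTER, ENABLED, MIME))   # ['parse-html', 'parse-rss']
```

With the TODO2 change, parse-html comes first, so the document no longer reaches the feed parsers that failed on line 768.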

1)
- http://ocw.nd.edu/courselist/rss:
mime-type application/rdf+xml

Crawl as is. parse-tika.
Error: Can't retrieve Tika parser for mime-type 
application/rdf+xml.
Makes me wonder because I found "application/rdf+xml" in 
tika-mimetypes.xml.

2)
nutch-site.xml: added parse-rss but not feed.
parse-plugins.xml: added:
<mimeType name="application/rdf+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
Searched for normal text: SUCCESS!
Searched for a title (displayed as anchors): No search result!

3)
<mimeType name="application/rdf+xml">
<plugin id="parse-html" />
</mimeType>
Searched for normal text: SUCCESS!
Searched for a title (displayed as anchors): SUCCESS!

======> *MY* BEST MATCH:
Page - http://ocw.nd.edu/courselist/rss:
mime-type application/rdf+xml.

==>TODO3<==
In conf/parse-plugins.xml:

--FIND:
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>

--REPLACE WITH:
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
<mimeType name="application/rdf+xml"><!--subsequently added. 
parse-tika throws error. thread msg00666.html et seqq.-->
<plugin id="parse-html" />
</mimeType>
==>TODO3-end<==
---------------------------------------------------------




Re: Crawl atom, rss, xml .... I need any plugin extra?

Posted by Israel <we...@gmail.com>.

Where do I put this?
Added to "parse-plugins.xml"

<mimeType name="application/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>

Re: Crawl atom, rss, xml .... I need any plugin extra?

Posted by Israel <we...@gmail.com>.

2010/8/21 Israel <we...@gmail.com>

>
> Thanks for your help, plese help me with this
>
> Hello, i download the parse plugin from: "
> https://issues.apache.org/jira/browse/NUTCH-185", and i don't know where
> put this:
>
>
>> Added to "parse-plugins.xml"
>>
>> <mimeType name="application/xml">
>>
>> <plugin id="parse-html" />
>>
>> <plugin id="parse-rss" />
>>
>> <plugin id="feed" />
>>
>> </mimeType>
>>
>>
>>
>

Re: Crawl atom, rss, xml .... I need any plugin extra?

Posted by Israel <we...@gmail.com>.

Thanks for your help. Please help me with this:

Hello, I downloaded the parse plugin from
"https://issues.apache.org/jira/browse/NUTCH-185", and I don't know
where to put this:

>
> Added to "parse-plugins.xml"
>
> <mimeType name="application/xml">
>
> <plugin id="parse-html" />
>
> <plugin id="parse-rss" />
>
> <plugin id="feed" />
>
> </mimeType>
>
>
>

Re: Crawl atom, rss, xml .... I need any plugin extra?

Posted by Volli <il...@web.de>.

Nutch 1.1.

I tested only with
"http://cnx.org/lenses/ccotp/endorsements/atom".

In "nutch-site.xml", I added to the "plugin.includes" property:

"...parse-(text|html|js|tika|pdf|rss)|feed|..."

(note the added "rss" and "feed"; I don't know which one did it).

I added to "parse-plugins.xml":

<mimeType name="application/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>

and to "regex-urlfilter.txt":

+^http://cnx.org/lenses/ccotp/endorsements/atom
# skip everything else
-.

---------
If you use the runbot script from
"http://wiki.apache.org/nutch/Crawl":

I created a directory "urls" and added a text file containing
"http://cnx.org/lenses/ccotp/endorsements/atom".

Then I configured the runbot script, started it with

"sh runbot"

and got the page indexed.


On 21.08.2010 01:31, Israel wrote:
> Hello, I tried to indexer these pages that use xml, rss, atom or inclusive
> rdf or the respective format ..... but errors occur, I download the "parse
> xml " plugin but I don't how to use this.
>
> I index this pages:
>
> http://cnx.org/lenses/ccotp/endorsements/atom
> http://ocw.nd.edu/courselist/rss
> http://openlearn.open.ac.uk/file.php/1/learningspace.xml
>
> I need any plugin? I tried with rss and feed... and how do I configure the
> files "crawl-urlfilter" *. txt and seed (web addresses ).... if I could
> please send to my mail if you have some plugin .... Thank you.
>
> I've searched hours and hours in the web...and I don't have answer
>