
Posted to user@nutch.apache.org by Ankit Goel <an...@gmail.com> on 2017/11/01 16:55:25 UTC

sitemap and xml crawl

Hi,
I need to crawl an XML feed that includes the URL, title and content of the articles on the site.

The documentation on the site says that bin/nutch sitemap exists, but on my Nutch 1.13 install sitemap is not a command in bin/nutch. Does Nutch support crawling sitemaps or XML links?

Regards,
Ankit Goel

Re: sitemap and xml crawl

Posted by Steven Pollock <ja...@gmail.com>.

Hi Ankit,

I've never seen an answer to any questions on this list.  (I have a few)

So I suspect it's a dead list.

Regards,

-Steve


RE: sitemap and xml crawl

Posted by Yossi Tamari <yo...@pipl.com>.

Hi Ankit,

So I guess you want to remove the parser that is configured by default (since you don't need to parse HTML at all), add the RSS parser that Markus suggested, and then you probably need to add either a custom parser for the second XML format, or an indexing filter, or both. This would depend on exactly what you are trying to achieve at the end of the crawl.

	Yossi.


RE: sitemap and xml crawl

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - Nutch has a parser for RSS and ATOM on-board:
https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/feed/FeedParser.html

You must configure it in your plugin.includes to use it.

Regards,
Markus
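
To illustrate the plugin.includes step: the property is a regular expression, and a plugin is loaded only if its id matches it. The sketch below uses Python's re module as a stand-in for that matching. The BASE_INCLUDES value is an illustrative assumption rather than a copy of any particular nutch-site.xml, and "feed" is assumed to be the id of the plugin that provides FeedParser; check both against your own installation.

```python
import re

# Illustrative plugin.includes value (assumption: replace with the value
# from your own nutch-site.xml or nutch-default.xml).
BASE_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_enabled(includes, plugin_id):
    """Return True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


def with_plugin(includes, plugin_id):
    """Append plugin_id as a new alternative unless it already matches."""
    if is_enabled(includes, plugin_id):
        return includes
    return includes + "|" + plugin_id


if __name__ == "__main__":
    updated = with_plugin(BASE_INCLUDES, "feed")
    # Paste the printed value into the plugin.includes property.
    print(updated)
```

The printed expression would go into the value of the plugin.includes property in conf/nutch-site.xml. Removing parse-html from the alternation, as suggested elsewhere in this thread, is a manual edit of the same expression.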

 
 

Re: sitemap and xml crawl

Posted by Ankit Goel <an...@gmail.com>.

Hi Yossi,
I have two kinds of RSS links, both of the form http://domain.com/rss/feed.xml. One is the standard RSS feed, which becomes the starting point for further crawling because we can pull links from it.


<item>
<title>
<![CDATA[
Article headline
]]>
</title>
<link>
article url
</link>
<pubDate> date </pubDate>
<dc:creator>
<![CDATA[ author ]]>
</dc:creator>
<description>
<![CDATA[
One line descriptor tag line
]]>
</description>
</item>
<item>
…
</item>
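
For reference, a minimal sketch of reading the fields of such an item with Python's standard library. The sample feed is hypothetical: it wraps the excerpt above in rss and channel elements, the URL and date are invented, and the dc: prefix is assumed to be bound to the standard Dublin Core namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the dc: prefix is bound to the standard Dublin Core namespace.
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical feed shaped like the <item> excerpt above (URL and date invented).
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
<title>
<![CDATA[
Article headline
]]>
</title>
<link>
http://domain.com/articles/1
</link>
<pubDate> Wed, 01 Nov 2017 16:55:25 GMT </pubDate>
<dc:creator>
<![CDATA[ author ]]>
</dc:creator>
<description>
<![CDATA[
One line descriptor tag line
]]>
</description>
</item>
</channel>
</rss>"""


def field(item, path):
    """Return the stripped text of a child element, or "" if it is absent."""
    return item.findtext(path, default="", namespaces=NAMESPACES).strip()


def parse_items(feed_xml):
    """Return one dict per <item>, with surrounding whitespace removed."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": field(item, "title"),
            "link": field(item, "link"),
            "published": field(item, "pubDate"),
            "author": field(item, "dc:creator"),
            "description": field(item, "description"),
        }
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        print(entry)
```

The XML parser unwraps the CDATA sections itself, so the only cleanup needed is stripping the whitespace around each value.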

The other one also includes the content within the XML itself, so it doesn't need further crawling.
I have standalone XML parsers in Java that I can use directly, but crawling is still an important part, because it keeps a record of all the links traversed so far.

What would you advise?

Regards,
Ankit Goel


RE: sitemap and xml crawl

Posted by Yossi Tamari <yo...@pipl.com>.

Hi Ankit,

If you are looking for a Sitemap parser, I would suggest moving to 1.14
(trunk). I've been using it, and it is probably in better shape than 1.13.
If you need to parse your own format, the answer depends on the details. Do
you need to crawl pages in this format where each page contains links in XML
that you need to crawl? Or is this more like Sitemap where the XML is just
the initial starting point?
In the second case, maybe just write something outside of Nutch that will
parse the XML and produce a seed file?
In the first case, the link you sent is not relevant. You need to implement
a
http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/Parser.html.
I haven't done that myself. My suggestion is that you take a look at
the built-in parser at
https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java.
Google found this article on developing a custom parser, which might be a
good starting point:
http://www.treselle.com/blog/apache-nutch-with-custom-parser/.

	Yossi.
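
A rough sketch of the "outside of Nutch" option, assuming the XML is an RSS feed whose item/link elements hold the article URLs. The function names and the seed.txt file name are hypothetical; a Nutch seed file is a plain text file with one URL per line, stored in a directory that is later passed to the inject step.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def extract_links(feed_xml):
    """Return the distinct http(s) URLs from every <item><link>, in feed order."""
    root = ET.fromstring(feed_xml)
    seen = set()
    links = []
    for item in root.iter("item"):
        url = item.findtext("link", default="").strip()
        parsed = urlparse(url)
        # Skip empty values, non-web schemes and URLs already collected.
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def write_seed_file(urls, seed_dir, name="seed.txt"):
    """Write one URL per line into seed_dir/name and return the file path."""
    os.makedirs(seed_dir, exist_ok=True)
    path = os.path.join(seed_dir, name)
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
    return path
```

Fetching the feed itself is left out on purpose. Any HTTP client can supply the feed_xml string, and the resulting directory can then be used as the seed directory for the crawl.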


Re: sitemap and xml crawl

Posted by Ankit Goel <an...@gmail.com>.

Hi Yossi,
So I need to make a custom parser. Where do I start? I found this link: https://wiki.apache.org/nutch/HowToMakeCustomSearch. Is this the right place, or should I be looking at creating a plugin page? Any advice would be helpful.

Thank you,
Ankit Goel


RE: sitemap and xml crawl

Posted by Yossi Tamari <yo...@pipl.com>.

Hi Ankit,

According to this: https://issues.apache.org/jira/browse/NUTCH-1465, sitemap
is a 1.14 feature.
I just checked, and the command indeed exists in 1.14. I did not test that
it works.

In general, Nutch supports crawling anything, but you might need to write
your own parser for custom formats.

	Yossi.
