Posted to dev@nutch.apache.org by kauu <ba...@gmail.com> on 2007/01/26 03:17:20 UTC

parse-rss make them items as different pages

I want to crawl RSS feeds, parse them, and index them, so that at search
time each feed item comes back as a hit just like an individual page.


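I don't know whether I am explaining this clearly.

<item>
    <title>Late-striking European blizzard causes flight delays and traffic chaos (photos)</title>
    <description>Blizzard sweeps across Europe, causing repeated flight delays
On January 24, several airliners waited at Stuttgart Airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from the runway at Munich Airport in southern Germany.
According to reports, the late-arriving blizzard has swept for two days running across central...
    </description>
    <link>http://news.sohu.com/20070125/n247833568.shtml</link>
    <category>Sohu Focus Photo News</category>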
    <author>cms@sohu.com</author>
    <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
    <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

This is one item in an RSS file.

I want Nutch to treat each item like an individual page, so that when I
search for something in this item, Nutch returns the item as a hit.

Can anyone tell me how to do this?
Any reply will be appreciated.

-- 
www.babatu.com

Re: parse-rss make them items as different pages

Posted by kauu <ba...@gmail.com>.
It's a great idea, I think.
We can't have more than one document in the index for the same feed,
because the unique key is the URL.
The only problem is how to write a separate protocol for RSS.



-- 
www.babatu.com

RE: parse-rss make them items as different pages

Posted by Alan Tanaman <al...@idna-solutions.com>.
This is a problem that we have encountered too (although in a different
context than RSS).  The problem is that the "unique key" is the URL: you
cannot have more than one document in the index with the same URL.

The way around this might be to have a separate protocol (instead of the
usual http one) that is used only for RSS feeds and that appends a
sequential number to the real URL (passing, say, 10 identical copies of
each page to parse-rss).  parse-rss would then need to extract only the
nth news item from the whole page.
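
A minimal sketch of the numbering idea above, in plain Python rather than as a Nutch protocol plugin. It does not use any Nutch API. The "#item-" suffix, the function names, and the sample feed are all assumptions made up for illustration; the thread does not say how the sequence number should be attached to the URL.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in feed; the URLs and text are made up for illustration.
FEED = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First</title><link>http://example.com/a</link></item>
  <item><title>Second</title><link>http://example.com/b</link></item>
</channel></rss>"""

SUFFIX = "#item-"  # hypothetical marker appended to the real feed URL


def numbered_urls(feed_url, feed_xml):
    """Return one synthetic URL per <item>: the feed URL plus a sequence number."""
    count = len(ET.fromstring(feed_xml).findall("./channel/item"))
    return [f"{feed_url}{SUFFIX}{n}" for n in range(count)]


def extract_nth_item(numbered_url, feed_xml):
    """Read the sequence number back from the URL and return only that item."""
    n = int(numbered_url.rpartition(SUFFIX)[2])
    item = ET.fromstring(feed_xml).findall("./channel/item")[n]
    return {child.tag: (child.text or "").strip() for child in item}
```

Each synthetic URL is distinct, so it could serve as a unique key, while the parser still receives the whole feed and picks out a single item. Whether a real crawler would preserve such a suffix through URL normalization is a separate question that this sketch does not address.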

Any comments?

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com



Re: parse-rss make them items as different pages

Posted by kauu <ba...@gmail.com>.
Can anyone tell me where and how a Nutch document is built in nutch-0.8.1?

For example, one HTML page is normally one document, but I want to split a
single document into several.

-- 
www.babatu.com

Re: parse-rss make them items as different pages

Posted by kauu <ba...@gmail.com>.
That's right.

I think we should do something when Nutch fetches a page successfully:
check whether it is an RSS feed and, if so, create as many pages as there
are items. I don't know whether that would work.
On the other hand, we could do something in the segment, as you say.

I don't know whether we can write a plugin to get this functionality.

Can anyone give me a hint?
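
The "check whether it is an RSS feed" step could look roughly like the sketch below. It is plain Python with the standard library, not a Nutch plugin, and the function name is hypothetical. It only recognizes RSS 2.0 style documents whose root element is `rss`; other feed formats are outside its scope.

```python
import xml.etree.ElementTree as ET


def count_rss_items(content):
    """Return the number of <item> elements if content is an RSS feed, else 0."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return 0  # not well-formed XML, so not treated as a feed
    if root.tag != "rss":
        return 0  # well-formed XML, but not an RSS document
    return len(root.findall("./channel/item"))
```

A result greater than zero would be the signal to create that many pages instead of one.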

-- 
www.babatu.com

Re: parse-rss make them items as different pages

Posted by kauu <ba...@gmail.com>.
That's right, but in other words, I only need to index the exact
information in a page. In practice, real-world pages contain a lot of
spam, so I just want to index the description.

-- 
www.babatu.com

Re: parse-rss make them items as different pages

Posted by sishen <ye...@gmail.com>.
On 1/26/07, Gal Nitzan <gn...@usa.net> wrote:
>
> Hi Kauu,
>
> The functionality you require doesn't exist in the current parse-rss
> plugin. I need the same functionality but it doesn't exist and I believe
> it's not a simple task.
>
> The functionality required basically is to create a page in a segment for
> each item and the URL to the crawldb.
>
> Since the data already exists in the item element there is no reason to
> "fetch" the page (item). After that the only thing left is to index it.


I don't think so.  The data in the description is not complete, so fetching
the page through the link is still needed.
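
If the linked pages do have to be fetched, the feed can at least supply the list of URLs. A minimal sketch, assuming a standard-library parse and a hypothetical helper name; the sample feed is made up, and nothing here is Nutch code.

```python
import xml.etree.ElementTree as ET

# Made-up feed: one link has stray whitespace, one is a duplicate.
FEED = """<rss version="2.0"><channel>
  <item><link>http://example.com/a </link><description>partial text...</description></item>
  <item><link>http://example.com/b</link></item>
  <item><link>http://example.com/a</link></item>
</channel></rss>"""


def item_links(feed_xml):
    """Return each item's <link>, stripped and de-duplicated, in feed order."""
    seen, links = set(), []
    for item in ET.fromstring(feed_xml).iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links
```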


RE: parse-rss make them items as different pages

Posted by Gal Nitzan <gn...@usa.net>.
Hi Kauu,

The functionality you require doesn't exist in the current parse-rss plugin. I need the same functionality but it doesn't exist and I believe it's not a simple task.

The functionality required is basically to create a page in a segment for each item and to add each item's URL to the crawldb.

Since the data already exists in the item element there is no reason to "fetch" the page (item). After that the only thing left is to index it.

Any thoughts on how to achieve that goal?
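
One way to picture the per-item idea, as a sketch only: split the feed into one record per item, keyed by the item's own link, and use the description as the content so that nothing needs to be fetched. This is plain Python with the standard library. The field names ("url", "title", "content"), the function name, and the sample feed are assumptions for illustration, and writing such records into a segment or the crawldb is not shown.

```python
import xml.etree.ElementTree as ET

# Made-up feed; the last item has no link and is skipped.
FEED = """<rss version="2.0"><channel>
  <item><title>First</title><link>http://example.com/a</link>
        <description>alpha text</description></item>
  <item><title>Second</title><link>http://example.com/b</link>
        <description>beta text</description></item>
  <item><title>No link</title><description>orphan</description></item>
</channel></rss>"""


def items_as_documents(feed_xml):
    """Yield one record per <item>, keyed by the item's own <link>."""
    for item in ET.fromstring(feed_xml).iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # without a URL there is no unique key for the record
        yield {
            "url": link,
            "title": (item.findtext("title") or "").strip(),
            "content": (item.findtext("description") or "").strip(),
        }
```

Because each record carries its own URL, two items from the same feed no longer collide on the feed's URL.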

Gal.





