
Posted to user@nutch.apache.org by Morrowwind <ne...@hotmail.com> on 2008/01/15 20:46:10 UTC

How to use Nutch to parse Web-pages!

Hi,

My project is about web page processing, and I first need to parse the web
pages to get all of their plain text.

I have finished the crawling part using Nutch, but I'm having trouble with
the parsing part. My data is in the crawldb folder. How can I extract the
plain text from the web pages and store it in a .txt file?

Could anyone give me a hint, please?

Thanks a lot.


-- 
View this message in context: http://www.nabble.com/How-to-use-Nutch-to-parse-Web-pages%21-tp14845212p14845212.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: How to use Nutch to parse Web-pages!

Posted by Morrowwind <ne...@hotmail.com>.

Thanks, Tomislav! Your reply is a big help!





Re: How to use Nutch to parse Web-pages!

Posted by Tomislav Poljak <tp...@gmail.com>.

Hi,
I think the simplest way to get the parsed text out of a segment and into a
text file is the dump option of the segment reader. (Nutch stores the parsed
text inside the segment, for example in
crawl/segments/20080107120936/parse_text.)

bin/nutch readseg -dump crawl/segments/20080107120936 dump \
    -nocontent -nofetch -nogenerate -noparse -noparsedata

This stores only the parsed text (recno/url/parsetext) of the web pages,
all in one file. If you need more control, look at the source of the
segment reader: org.apache.nutch.segment.SegmentReader
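
Since the dump puts every page into one file, a small post-processing step
can split it into one .txt file per page. The following is only a sketch,
not part of Nutch. It assumes each record in the dump starts with a
"Recno:: N" line, followed by a "URL:: ..." line and a "ParseText::" marker
before the page text; check your own dump file before relying on that
layout. The function name split_dump and the page-NNNNN.txt file names are
made up for illustration.

```shell
#!/bin/sh
# Split a readseg text dump into one .txt file per record.
# ASSUMED record layout (verify against your own dump):
#   Recno:: 0
#   URL:: http://example.com/
#
#   ParseText::
#   ...page text...
split_dump() {
    dump_file=$1   # path to the text dump written by readseg
    out_dir=$2     # hypothetical output directory for the .txt files
    mkdir -p "$out_dir"
    awk -v dir="$out_dir" '
        # A new record starts: close the previous file, pick the next name.
        /^Recno:: / {
            if (out) close(out)
            out = sprintf("%s/page-%05d.txt", dir, $2)
            in_text = 0
            next
        }
        # Keep the URL as the first line of each output file.
        /^URL:: / && out { print "URL: " substr($0, 7) > out; next }
        # Everything after this marker is page text.
        /^ParseText::$/  { in_text = 1; next }
        in_text && out   { print > out }
    ' "$dump_file"
}
```

Calling split_dump with the dump file and a target directory (for instance
"split_dump dump/dump pages", assuming the dump was written to a file named
dump inside the output directory) leaves one file per crawled page, each
beginning with its URL.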

Hope this helps,

Tomislav



Re: How to use Nutch to parse Web-pages!

Posted by Morrowwind <ne...@hotmail.com>.

Thanks!

Developer Developer wrote:
> 
> check this out
> 
> http://kuthrax.blogspot.com/2008/01/how-to-retrieve-parsed-content-from.html
> 
> 


Re: How to use Nutch to parse Web-pages!

Posted by Developer Developer <de...@gmail.com>.

check this out

http://kuthrax.blogspot.com/2008/01/how-to-retrieve-parsed-content-from.html

