Posted to user@nutch.apache.org by "Eggebrecht, Thomas (GfK Marktforschung)" <th...@gfk.com> on 2011/02/28 18:34:38 UTC

Too low performance of SegmentReader

Hi there,

I need to read some pages from segments to get the raw HTML.

I do it like:

nutch-1.2/bin/nutch readseg -get /path/to/segment http://key.value.html -nofetch -nogenerate -noparse -noparsedata -noparsetext

That works, but it takes 2 or 3 full seconds per page! My very small test environment has about 20 crawled and indexed pages on a single machine. A search over the Lucene index takes only milliseconds.
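[Editor's note: each `readseg -get` invocation starts a fresh JVM and runs a small local MapReduce job, which is where most of the 2-3 seconds goes. One way to amortize that cost is to dump the whole segment once with `readseg -dump` and look pages up in the output. A minimal sketch; the paths are illustrative:]

```shell
# Dump the entire segment in one go instead of one -get per URL,
# so the JVM/job startup cost is paid once for all pages.
nutch-1.2/bin/nutch readseg -dump /path/to/segment /tmp/segdump \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext

# In Nutch 1.x the dump lands in a plain-text file named "dump" in the
# output directory, one "Recno::" record per page, keyed by URL.
grep -n "^Recno" /tmp/segdump/dump | head
```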

Is there a way to read segments faster?
Is implementing against SegmentReader.class the right way to get the original HTML?

Best Regards
Thomas



GfK SE, Nuremberg, Germany, commercial register Nuremberg HRB 25014; Management Board: Professor Dr. Klaus L. Wübbenhorst (CEO), Pamela Knapp (CFO), Dr. Gerhard Hausruckinger, Petra Heinlein, Debra A. Pruent, Wilhelm R. Wessels; Chairman of the Supervisory Board: Dr. Arno Mahlert
This email and any attachments may contain confidential or privileged information. Please note that unauthorized copying, disclosure or distribution of the material in this email is not permitted.

Re: Too low performance of SegmentReader

Posted by Julien Nioche <li...@gmail.com>.
Hi Thomas,

It's not so much that readseg is too slow; it's that it is probably not
the right tool. readseg is meant primarily for debugging and inspecting the
content of a segment. What is it that you are trying to achieve?

If you need to put the original content in a set of files or a DB you should
do that with a custom map-reduce job, as it is done e.g. in the Nutch module
of Behemoth (
https://github.com/jnioche/behemoth/blob/master/modules/io/src/main/java/com/digitalpebble/behemoth/io/nutch/NutchSegmentConverterJob.java).
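[Editor's note: if the goal is raw HTML at interactive speed, the records can also be read straight from the segment's content SequenceFiles, with no job at all. A minimal sketch, assuming the Nutch 1.x on-disk layout (segment/content/part-00000/data) and the Hadoop 0.20-era API that Nutch 1.2 ships with; the class name SegmentContentDump is made up for illustration:]

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentDump {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One reader per part file; a fuller version would iterate
        // over every content/part-* directory in the segment.
        Path data = new Path(args[0], "content/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content = new Content();
        try {
            while (reader.next(url, content)) {
                // content.getContent() is the raw fetched bytes (the HTML).
                System.out.println(url + "\t"
                        + content.getContent().length + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}
```

[Opening the reader once and iterating keys avoids the per-call startup that makes `readseg -get` feel slow on a 20-page segment.]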


HTH

Julien

On 28 February 2011 17:34, Eggebrecht, Thomas (GfK Marktforschung) <
thomas.eggebrecht@gfk.com> wrote:




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com