Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/08/10 14:24:49 UTC

Re: Question to reduce while parsing

That doesn't sound good indeed. Perhaps the parser chokes on your truncated
PDF files, which can happen when the content limit is smaller than the
documents themselves. Are you sure all PDFs are below the 10 MB limit?
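
For reference, that limit is the http.content.limit property in
conf/nutch-site.xml; anything beyond that many bytes is cut off at fetch
time, so every document bigger than the limit ends up truncated. A rough
sketch of the 10 MB setting (10485760 bytes, i.e. the value you reported):

<property>
  <name>http.content.limit</name>
  <value>10485760</value>
  <description>Maximum number of content bytes to download per document;
  longer responses are truncated to this size.</description>
</property>

If your Nutch version already has the parser.timeout property, a single bad
document should at least not block the parse job indefinitely.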

The "reduce > reduce" lines themselves are normal LocalJobRunner status
output; they only say that the reduce phase is still running. You can apply
this patch to get visible parsing progress when running local jobs:
https://issues.apache.org/jira/browse/NUTCH-1028
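
If you are building from a source checkout, applying the patch and rebuilding
the local runtime is roughly (attachment name and paths are placeholders for
whatever you actually have):

cd /path/to/nutch-source          # source tree, not runtime/local
patch -p0 < NUTCH-1028.patch      # patch file attached to the JIRA issue
ant runtime                       # rebuild runtime/local with the patched code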

On Wednesday 10 August 2011 13:12:07 Marek Bachmann wrote:
> Hi everybody,
> 
> a parse cycle has been running on my machine for two days. I think this is
> way too long.
> The Hadoop log file contains nothing but this constantly repeating message:
> 
> 2011-08-10 11:15:08,863 INFO  mapred.LocalJobRunner - reduce > reduce
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group
> ParserStatus with nothing
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding failed
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding success
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group
> FileSystemCounters with nothing
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_READ
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group
> org.apache.hadoop.mapred.Task$Counter with bundle
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding
> COMBINE_OUTPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding SPILLED_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS
> 
> Unfortunately, I can't interpret this message. Can anybody tell me if
> this is normal?
> 
> Here are a few more details about the segment and my machine:
> 
> Content and Size of the segment:
> 
> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# ll
> total 0
> drwxr-xr-x 3 root root 23 Aug  8 15:22 content
> drwxr-xr-x 3 root root 23 Aug  8 15:22 crawl_fetch
> drwxr-xr-x 2 root root 45 Aug  8 14:56 crawl_generate
> drwxr-xr-x 2 root root 45 Aug  8 19:28 crawl_parse
> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_data
> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_text
> drwxr-xr-x 2 root root  6 Aug  8 15:34 _temporary
> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# du -h
> 8.4M    ./crawl_generate
> 9.4M    ./crawl_fetch/part-00000
> 9.4M    ./crawl_fetch
> 2.6G    ./content/part-00000
> 2.6G    ./content
> 0       ./_temporary
> 64M     ./parse_text/part-00000
> 64M     ./parse_text
> 30M     ./parse_data/part-00000
> 30M     ./parse_data
> 80M     ./crawl_parse
> 2.8G    .
> 
> System status:
> 
> top - 13:10:39 up 72 days, 23:13,  4 users,  load average: 1.53, 4.47, 5.36
> Tasks: 125 total,   1 running, 124 sleeping,   0 stopped,   0 zombie
> Cpu(s): 64.4%us,  0.4%sy,  0.0%ni, 35.2%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Mem:   8003904k total,  7944152k used,    59752k free,   100172k buffers
> Swap:   418808k total,     7916k used,   410892k free,  2807036k cached
> 
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 11697 root      20   0 4746m 4.2g  12m S  259 54.5   7683:28 java
> 
> Hope somebody can help me :)
> 
> Thanks
> 
> PS: I think there are many PDF files to process. The http content limit
> was set to 10 MB

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Question to reduce while parsing

Posted by Markus Jelsma <ma...@openindex.io>.
Also check this fix for truncated docs:
https://issues.apache.org/jira/browse/NUTCH-965
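
As far as I remember, that issue adds a parser.skip.truncated switch so that
documents cut off at the content limit are skipped instead of being handed to
the parser. The nutch-site.xml entry would look something like this (property
name as I recall it from the issue; double-check against nutch-default.xml
after applying the fix):

<property>
  <name>parser.skip.truncated</name>
  <value>true</value>
  <description>Skip parsing of documents whose content was truncated during
  fetching, instead of passing them to the parser.</description>
</property>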



On Wednesday 10 August 2011 14:52:26 Marek Bachmann wrote:
> Hi Markus,
> 
> thanks for the reply. I am sure that they are NOT all below 10 MB; some of
> them actually contain images and are much bigger. I chose 10 MB assuming it
> would be large enough for most text PDFs.
> 
> I'll stop the process and apply the patch. Hopefully it will reveal the
> issue. :)

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Question to reduce while parsing

Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Markus,

thanks for the reply. I am sure that they are NOT all below 10 MB; some of
them actually contain images and are much bigger. I chose 10 MB assuming it
would be large enough for most text PDFs.

I'll stop the process and apply the patch. Hopefully it will reveal the issue. :)
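
Roughly what I have in mind, in case it helps anyone else (segment name taken
from the listing above, paths relative to the directory holding the crawl/
folder; whether the partial parse output has to be cleared first may depend on
the Nutch version):

# stop the hung local job, then remove the partial parse output
rm -r crawl/segments/20110808145606/crawl_parse \
      crawl/segments/20110808145606/parse_data \
      crawl/segments/20110808145606/parse_text \
      crawl/segments/20110808145606/_temporary
# re-run only the parse step for that segment
bin/nutch parse crawl/segments/20110808145606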

On 10.08.2011 14:24, Markus Jelsma wrote:
> That doesn't sound good indeed. Perhaps the parser chokes on your truncated
> PDF files, which can happen when the content limit is smaller than the
> documents themselves. Are you sure all PDFs are below the 10 MB limit?
>
> You can apply this patch to get visible parsing progress when running local
> jobs: https://issues.apache.org/jira/browse/NUTCH-1028