Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/09/10 13:12:07 UTC

Question to reduce while parsing

Hi everybody,

A parse cycle has been running for two days on my machine. I think this is 
way too long.
The Hadoop log file contains nothing but this constantly repeating message:

2011-08-10 11:15:08,863 INFO  mapred.LocalJobRunner - reduce > reduce
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group ParserStatus with nothing
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding failed
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding success
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group FileSystemCounters with nothing
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_READ
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding SPILLED_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS

Unfortunately, I can't interpret this message. Can anybody tell me if 
this is normal?

Here are a few more details about the segment and my machine:

Contents and size of the segment:

root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# ll
total 0
drwxr-xr-x 3 root root 23 Aug  8 15:22 content
drwxr-xr-x 3 root root 23 Aug  8 15:22 crawl_fetch
drwxr-xr-x 2 root root 45 Aug  8 14:56 crawl_generate
drwxr-xr-x 2 root root 45 Aug  8 19:28 crawl_parse
drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_data
drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_text
drwxr-xr-x 2 root root  6 Aug  8 15:34 _temporary
root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# du -h
8.4M    ./crawl_generate
9.4M    ./crawl_fetch/part-00000
9.4M    ./crawl_fetch
2.6G    ./content/part-00000
2.6G    ./content
0       ./_temporary
64M     ./parse_text/part-00000
64M     ./parse_text
30M     ./parse_data/part-00000
30M     ./parse_data
80M     ./crawl_parse
2.8G    .

System status:

top - 13:10:39 up 72 days, 23:13,  4 users,  load average: 1.53, 4.47, 5.36
Tasks: 125 total,   1 running, 124 sleeping,   0 stopped,   0 zombie
Cpu(s): 64.4%us,  0.4%sy,  0.0%ni, 35.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8003904k total,  7944152k used,    59752k free,   100172k buffers
Swap:   418808k total,     7916k used,   410892k free,  2807036k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 11697 root      20   0 4746m 4.2g  12m S  259 54.5   7683:28 java

Hope somebody can help me :)

Thanks

PS: I think there are many PDF files to process. The HTTP content limit 
was set to 10 MB.
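
(For reference: in Nutch this limit is the http.content.limit property, set 
in conf/nutch-site.xml. A minimal sketch of the setting; the value is in 
bytes, and -1 disables the limit entirely:

<property>
  <name>http.content.limit</name>
  <!-- 10 * 1024 * 1024 bytes; the default is only 65536 (64 kB) -->
  <value>10485760</value>
</property>
)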

Re: Question to reduce while parsing

Posted by Markus Jelsma <ma...@openindex.io>.
Also check this fix for truncated docs:
https://issues.apache.org/jira/browse/NUTCH-965
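
A minimal sketch of turning that skip-truncated behaviour on once the fix is 
in; the property name (parser.skip.truncated) is my reading of the issue, so 
verify it against the actual patch:

<property>
  <!-- assumed switch from NUTCH-965: skip documents whose fetched length -->
  <!-- is shorter than the length the server declared for them            -->
  <name>parser.skip.truncated</name>
  <value>true</value>
</property>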



-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Question to reduce while parsing

Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Markus,

thanks for the reply. I am sure that they are NOT all below 10 MB; some 
of them actually contain images and are much bigger. I chose 10 MB 
assuming it would be large enough for most text PDFs.

I'll stop the process and apply the patch. Hopefully it will reveal the issue. :)
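
A rough sketch of the patch-and-rebuild cycle I mean, assuming a Nutch 1.x 
source checkout (the patch file name is just the attachment from JIRA):

cd /path/to/nutch-source       # the source tree, not runtime/local
patch -p0 < NUTCH-1028.patch   # apply the attachment downloaded from JIRA
ant runtime                    # rebuild the runtime/local that bin/nutch runs from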


Re: Question to reduce while parsing

Posted by Markus Jelsma <ma...@openindex.io>.
That doesn't sound good indeed. Perhaps the parser chokes on your truncated 
PDF files, which can happen when documents are larger than the content limit. 
Are you sure all PDFs are below the 10 MB limit?

You can apply this patch to see parsing progress when running local jobs:
https://issues.apache.org/jira/browse/NUTCH-1028
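
This isn't the actual diff from the issue, just a sketch of the general 
technique it relies on: have the parse mapper report status, so a local job 
logs per-document progress instead of only the repeating reduce line. The 
method shape below assumes Nutch's ParseSegment mapper:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.protocol.Content;

// inside the mapper (e.g. org.apache.nutch.parse.ParseSegment):
public void map(WritableComparable<?> key, Content content,
                OutputCollector<Text, ParseImpl> output, Reporter reporter)
    throws IOException {
  reporter.setStatus("parsing " + key);  // surfaces in hadoop.log for local jobs
  // ... existing parsing logic unchanged ...
}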

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350