Posted to user@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2006/03/30 03:32:42 UTC

nutch config setup to crawl/query for word/pdf files

hi there,

I am having trouble configuring Nutch 0.7 so that it
crawls and queries MS Word and PDF files correctly.

1.
I added the following lines to nutch-site.xml:

"
<!-- plugin properties -->
<property>
  <name>plugin.includes</name>
  <value>

nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword)|
index-(basic|pdf|msword)|
query-(basic|site|url|pdf|msword)

</value>
  <description>Regular expression naming plugin
directory names to
  include.  Any plugin not matching this expression is
excluded.
  In any case you need at least include the
nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain
text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
"

2.
I checked regex-urlfilter.txt and confirmed that I did
not exclude pdf or ms-word.

3.
I checked mime-types.xml; all the entries for pdf and
ms-word are there.

4.
I checked the Nutch fetch log; the pdf and ms-word
plugins are registered correctly, as follows:

"060327 204736 parsing:
C:\cygwin\jifeng\versionControl\new_dev\nutch_V07_CNI_Alfa\nutch\build\plugins\parse-msword\plugin.xml
060327 204736 impl:
point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.msword.MSWordParser
060327 204736 parsing:
C:\cygwin\jifeng\versionControl\new_dev\nutch_V07_CNI_Alfa\nutch\build\plugins\parse-pdf\plugin.xml
060327 204736 impl:
point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.pdf.PdfParser
"

I wonder if I am still missing something in the
configuration (see the single-line plugin.includes
sketch below).
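
Here is the plugin.includes value collapsed onto a single
line, in case the line breaks inside <value> end up in the
regular expression. This is only a sketch: as far as I can
tell, stock 0.7 ships index-basic and the
query-(basic|site|url) plugins rather than format-specific
index/query plugins, so I left those entries out.

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
</property>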

thanks,

Michael,


Re: fetching stuck in the middle of processing

Posted by Michael Ji <fj...@yahoo.com>.
hi Andrzej:

My Linux disk usage is as follows:
"
Filesystem           1K-blocks      Used Available
Use% Mounted on
/dev/sda3            132645028  60431892  65475076 
48% /
/dev/sda1               101089     20607     75263 
22% /boot
none                   2060348         0   2060348  
0% /dev/shm
"

I don't see any sign of running out of disk space.

thanks,

Michael

--- Andrzej Bialecki <ab...@getopt.org> wrote:

> Michael Ji wrote:
> > Hi,
> >
> > My fetching is stuck somewhere in the middle of
> > processing. When I took a look at the fetch log, I
> > got the following error messages.
> >
> > Any idea why this happens? Should I stop the current
> > session and restart crawling?
> >
> > By the way, my content limit config is set to
> > unlimited: "<name>http.content.limit</name>
> > <value>-1</value>". Could that be the reason?
> >   
> 
> These particular problems happen, among other things,
> when you run out of disk space - please check that you
> have enough disk space, also on your /tmp partition.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 



Re: fetching stuck in the middle of processing

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Ji wrote:
> Hi,
>
> My fetching is stuck somewhere in the middle of
> processing. When I took a look at the fetch log, I got
> the following error messages.
>
> Any idea why this happens? Should I stop the current
> session and restart crawling?
>
> By the way, my content limit config is set to
> unlimited: "<name>http.content.limit</name>
> <value>-1</value>". Could that be the reason?
>   

These particular problems happen, among other things, when you run out
of disk space - please check that you have enough disk space, also on
your /tmp partition.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: refetching interval

Posted by Michael Ji <fj...@yahoo.com>.
Hi Gal:

Yes, I set db.max.per.host to 1.

Another interesting thing I found: when I added debug
printing of page information in FetchListTool.java
during generation and checked the log, I saw
"...Next fetch: Fri Apr 14 19:49:3...". This webdb was
generated on April 9 and the refetch interval is set
to 1 day.

Shouldn't the "Next fetch" date be around April 10th?

Why does this happen?

thanks,

Michael,

--- Gal Nitzan <gn...@usa.net> wrote:

> 
> What about db.max.per.host? Is it set to -1?
> 
> 
> -----Original Message-----
> From: Michael Ji [mailto:fji_00@yahoo.com] 
> Sent: Monday, April 10, 2006 3:18 AM
> To: nutch-user@lucene.apache.org
> Subject: refetching interval
> 
> hi there,
> 
> I have a webdb with over 60,000 pages (checked with
> the nutch admin dumptxt command), and the refetch
> interval is set to 1 day:
> 
> <property>
>   <name>db.default.fetch.interval</name>
>   <value>1</value>
>   <description>The default number of days between
> re-fetches of a page.</description>
> </property>
> 
> But when I crawl based on this webdb the next day,
> the generate log shows only around 8,000 pages
> generated for fetching, and only about 7,500 pages
> were actually fetched.
> 
> Why does it behave like that? Shouldn't all 60,000
> pages be fetched this time?
> 
> thanks,
> 
> Michael,
> 



RE: refetching interval

Posted by Gal Nitzan <gn...@usa.net>.
What about db.max.per.host? Is it set to -1?
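
If it is limited, something along these lines in
nutch-site.xml should lift it. Just a sketch, using the
property name as you have it; the comment about what -1
means is my assumption:

<property>
  <name>db.max.per.host</name>
  <!-- assuming -1 is treated as "no per-host limit" -->
  <value>-1</value>
</property>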


-----Original Message-----
From: Michael Ji [mailto:fji_00@yahoo.com] 
Sent: Monday, April 10, 2006 3:18 AM
To: nutch-user@lucene.apache.org
Subject: refetching interval

hi there,

I have a webdb with over 60,000 pages (checked with the
nutch admin dumptxt command), and the refetch interval
is set to 1 day:

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between
re-fetches of a page.</description>
</property>

But when I crawl based on this webdb the next day, the
generate log shows only around 8,000 pages generated
for fetching, and only about 7,500 pages were actually
fetched.

Why does it behave like that? Shouldn't all 60,000
pages be fetched this time?

thanks,

Michael,




refetching interval

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

I have a webdb with over 60,000 pages (checked with the
nutch admin dumptxt command), and the refetch interval
is set to 1 day:

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between
re-fetches of a page.</description>
</property>

But when I crawl based on this webdb the next day, the
generate log shows only around 8,000 pages generated
for fetching, and only about 7,500 pages were actually
fetched.

Why does it behave like that? Shouldn't all 60,000
pages be fetched this time?

thanks,

Michael,


fetching stuck in the middle of processing

Posted by Michael Ji <fj...@yahoo.com>.
Hi,

My fetching is stuck somewhere in the middle of
processing. When I took a look at the fetch log, I got
the following error messages.

Any idea why this happens? Should I stop the current
session and restart crawling?

By the way, my content limit config is set to unlimited:
"<name>http.content.limit</name> <value>-1</value>".
Could that be the reason?
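
For reference, here is the full property as it sits in my
nutch-site.xml (a sketch; the description is my paraphrase
of the one in nutch-default.xml):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in
  bytes; a negative value means no truncation at all.
  </description>
</property>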

thanks

"
060408 072825 SEVERE error writing
output:java.io.IOException: key out of order: 33079
after 33079
java.io.IOException: key out of order: 33079 after
33079
	at
org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
	at
org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
	at
org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
	at
org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:318)
	at
org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:301)
	at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:160)
Exception in thread "main" java.lang.RuntimeException:
SEVERE error logged.  Exiting fetcher.
	at
org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:394)
	at
org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:528)
"


Re: nutch config setup to crawl/query for word/pdf files

Posted by sudhendra seshachala <su...@yahoo.com>.
Oops, my bad. I was looking at 0.8-dev.
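
In 0.8-dev the per-MIME-type parser mapping lives in
conf/parse-plugins.xml. Roughly, the relevant entries look
like the sketch below (element names as I recall them from
the 0.8-dev conf file, so double-check against your
checkout); on 0.7 there is no such file, so it can be
ignored there:

<parse-plugins>
  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
  </mimeType>
  <mimeType name="application/msword">
    <plugin id="parse-msword" />
  </mimeType>
</parse-plugins>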

Michael Ji <fj...@yahoo.com> wrote: hi Sudhendra:

I don't see a file with that name (parse-plugins.xml)
in the nutch/conf/ folder.

Should I create it myself? Is there a tutorial I could
follow to set it up?

thanks,

Michael,

--- sudhendra seshachala  wrote:

> Have you checked parse-plugins.xml in conf/?
> 
> Thanks
> 
> 
>   Sudhi Seshachala
>   http://sudhilogs.blogspot.com/
>    
> 
> 
>    





  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		

Re: nutch config setup to crawl/query for word/pdf files

Posted by Michael Ji <fj...@yahoo.com>.
hi Sudhendra:

I don't see a file with that name (parse-plugins.xml)
in the nutch/conf/ folder.

Should I create it myself? Is there a tutorial I could
follow to set it up?

thanks,

Michael,

--- sudhendra seshachala <su...@yahoo.com> wrote:

> Have you checked parse-plugins.xml in conf/?
> 
> Thanks
> 
> 
>   Sudhi Seshachala
>   http://sudhilogs.blogspot.com/
>    
> 
> 
> 			



Re: nutch config setup to crawl/query for word/pdf files

Posted by sudhendra seshachala <su...@yahoo.com>.
Have you checked parse-plugins.xml in conf/?

Thanks


  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


			