Posted to user@nutch.apache.org by Martin Kuen <ma...@gmail.com> on 2008/01/17 01:07:11 UTC

Re: Help: parsing pdf files

Hi,

what comes to my mind is that there is a setting for the maximum size of a
downloaded file.
Have a look at "nutch-default.xml" and override it in "nutch-site.xml".
PDF files tend to be quite big (compared to HTML), so this is probably the
source of your problem.
PDF files are downloaded but may get truncated - and the PDF parser cannot
handle truncated PDF files (truncated HTML files are okay). If that's the
case you should see a warning in the log file.
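
For example, a minimal override in "nutch-site.xml" might look like this
(just a sketch - the relevant property is "http.content.limit", whose
shipped default is 65536; a negative value disables truncation):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for content fetched via http, in bytes.
  A negative value means no truncation at all.</description>
</property>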

So, you should try increasing the logging level in order to see what is
happening. Have a look at "log/hadoop.log" - the logging statements there
are valuable information regarding your problem. Logging is controlled via
"conf/log4j.properties", at least if you're not running nutch in a servlet
container (you can still control logging from the same place in that case,
but I think that's rarely done). In the mentioned hadoop.log file you'll
also see which plugins are loaded.

By the way, you don't need to "mess" around with compilation in order to get
this running (just looking at the link you posted . . .).


Hope it helps,

Martin

PS: This kind of question should be asked on the nutch-user list, not dev.
I have reposted this on the user list.
PPS: I think you should subscribe to the mailing list . . . it's useful,
really ;)

On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <Kr...@sun.com> wrote:

> Hello,
>
> I want crawler to fetch pdf files also. I set the url to be
> http://localhost:8080/ and I have several html and pdf files in my
> document root.
>
> crawler is able to fetch html files but not pdf files.
> I saw
> http://www.mail-archive.com/nutch-user@incubator.apache.org/msg00344.html
>
> In <nutch_home>/nutch-site.xml, I added the following:
> ---------
> <property>
>   <name>plugin.includes</name>
>
>
> <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>description</description>
> </property>
> ---------
>
> I installed nutch 0.9 and I see all plugins including parse-pdf in
> plugins directory. So thought I don't have to do anything else.
>
> It doesn't work. Can you pls help.
>
> PS: I am not on any mailing list. Can you pls CC me on your replies.
>
> thanks,
> Krishna.
>

Re: Help: parsing pdf files

Posted by Martin Kuen <ma...@gmail.com>.
Hi,

well, honestly I cannot give you advice on that.

The native libraries come from Hadoop.
I think you should ask this question on the hadoop users list, since this is
really a question about Hadoop.

Things "should" work without these native dependencies as well - however, I
can only ASSUME this. From my point of view it would be better if you simply
included the Linux natives in your product. I don't know about the testing
procedure used for the Hadoop project, or how much effort is put into
testing the Java-only version of Hadoop on Linux. On the other hand, since
they provide a version with native bindings for Linux, I assume the
Java + native combination is tested better on Linux than the Java-only
version.

My two cents . . . you should probably try the hadoop-users mailing list.


Best Regards,

Martin


On Jan 18, 2008 10:15 PM, Krishnamohan Meduri <Kr...@sun.com>
wrote:

> Hi Martin,
>
> I install Apache nutch 0.9. Under lib/native/Linux-i386-32/ directory, I
> see some linux specific shared objects.
>
> Is it completely harmless to remove dependency on them and remove them
> eventually? The reason I am asking is that I am trying to use Apache
> nutch for a product that is supported on Windows,Linux,Solaris,HPUX.
>
> By removing them, I can confidently say that it is all Java and no
> native library dependency. I am willing to sacrifice some optimizations
> done for linux.
>
> Would greatly appreciate if you could shed some light on this.
>
> thanks so much,
> -Krishna.
>

Re: Help: parsing pdf files

Posted by Krishnamohan Meduri <Kr...@Sun.COM>.
Hi Martin,

I installed Apache nutch 0.9. Under the lib/native/Linux-i386-32/ directory, I
see some Linux-specific shared objects.

Is it completely harmless to remove the dependency on them and remove them
eventually? The reason I am asking is that I am trying to use Apache
nutch for a product that is supported on Windows, Linux, Solaris, and HP-UX.

By removing them, I can confidently say that it is all Java with no
native library dependencies. I am willing to sacrifice some optimizations
done for Linux.

Would greatly appreciate if you could shed some light on this.

thanks so much,
-Krishna.

Re: Help: parsing pdf files

Posted by Krishnamohan Meduri <Kr...@Sun.COM>.
Hi Martin,

Thanks so much for your tips.
After I enabled directory listings and created a urls list consisting of
just one url for a pdf file, it worked.

Now that I got a clue, I may bug you more :-)

Thanks
Krishna.

Martin Kuen wrote:
> Hi,
> 
> The settings "file.content.limit" and "http.content.limit" are used for 
> different protocols. If you are crawling an url like 
> "http://localhost/whatever <http://localhost/whatever>" the http plugin 
> is used for fetching (as you've already guessed). If you have a url 
> starting with "file" another plugin is used.
> "file.content.limit" is used for crawling a local disk (or if you have a 
> network drive mounted).
> "http.content.limit" is used for content that is fetched via http.
> 
> These two settings are not related to the mime-type of downloaded content.
> 
>  >2008-01-16 18:38:44,717 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes
> This is okay since you are on solaris and the "native" stuff is just 
> avail. for linux
> 
>  >2008-01-16 18:38:42,419 WARN   regex.RegexURLNormalizer - can't find
> rules for scope 'inject', using default
> The urlnormalizer is controlled via "regex-normalize.xml". This warning 
> tells you that it'll use the patterns found in this file regardless of 
> the current "scope". You can ignore this, or disable the urlnormalizer 
> plugin.
> 
> Idea:
>  >I want crawler to fetch pdf files also. I set the url to be
>  >http://localhost:8080/ and I have several html and pdf files in my
>  >document root.
> 1.) And all your pdf files have an in-link - or are they "just there"?
> Try to make a seed urls list which consists of just one url for such a 
> pdf. See if that pdf is fetched . . .
> If the pdfs don't have an in-link . . . no way to discover them via http 
> (assuming that directory listings are turned off on your server, which 
> should be the default).
> 2.) Is your crawl-depth set deep enough?
> 
> One last thing I can recommend to you is to increase the log-level.
> 
> 
> Best regards,
> 
> Martin
> 
> On Jan 17, 2008 12:15 PM, Ismael < kraziest@gmail.com 
> <ma...@gmail.com>> wrote:
> 
>     I am not sure, but I think that PDF maximum size goes with this
>     property:
> 
>     <property>
>      <name>file.content.limit</name>
>      <value>-1</value>
>      <description>The length limit for downloaded content, in bytes.
>      If this value is nonnegative (>=0), content longer than it will be
>     truncated;
>      otherwise, no truncation at all.
>      </description>
>     </property>
> 
>     2008/1/17, Krishnamohan Meduri <Krishna.Meduri@sun.com
>     <ma...@sun.com>>:
>      > Hi Martin,
>      >
>      > Thanks for the response.
>      > My pdf file size is much less than the default 65536
>      >    <name>http.content.limit</name>
>      >    <value>65536</value>
>      >
>      > Can you suggest anything else?
>      >
>      > thanks,
>      > Krishna.
>      >
>      > Martin Kuen wrote:
>      > > Hi,
>      > >
>      > > what comes to my mind is that there is a setting for the
>     maximum size of
>      > > a downloaded file.
>      > > Have a look at "nutch-default.xml" and override it in
>     "nutch-site.xml".
>      > > pdf-files tend to be quite big (compared to html). so probably
>     this is
>      > > the source of your problem.
>      > > pdf files are downloaded and may get truncated - however the
>     pdf parser
>      > > cannot handle these truncated pdf files. (truncated html files
>     are okay)
>      > > If that's the case you should see a warning in the log file.
>      > >
>      > > So, you should try to increase/modify the logging level/settings in
>      > > order to see what is happening. Have a look at
>     "log/hadoop.log". These
>      > > logging statements are valuable information regarding your
>     problem.
>      > > Logging is controlled via "conf/log4j.properties" - if you're not
>      > > running nutch in a servlet container. (ok - you still may controll
>      > > logging from the same place, but I think that's hardly done (?)
>     ). In
>      > > the mentioned hadoop.log file you'll also see which plugins are
>     loaded.
>      > >
>      > > btw. you don't need to "mess" around with compilation in order
>     to get
>      > > this running. (Just looking at the link . . .)
>      > >
>      > >
>      > > Hope it helps,
>      > >
>      > > Martin
>      > >
>      > > PS: This kind of question should be asked on the nutch-user
>     list not
>      > > dev. Reposted this on user
>      > > PPS: I think you should subscribe to the mailing list . . .
>     it's useful,
>      > > really ;)
>      > >
>      > > On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <
>     Krishna.Meduri@sun.com <ma...@sun.com>
>      > > <mailto:Krishna.Meduri@sun.com
>     <ma...@sun.com>>> wrote:
>      > >
>      > >     Hello,
>      > >
>      > >     I want crawler to fetch pdf files also. I set the url to be
>      > >     http://localhost:8080/ and I have several html and pdf
>     files in my
>      > >     document root.
>      > >
>      > >     crawler is able to fetch html files but not pdf files.
>      > >     I saw
>      > >    
>     http://www.mail-archive.com/nutch-user@incubator.apache.org/msg00344.html
>      > >    
>     <http://www.mail-archive.com/nutch-user@incubator.apache.org/msg00344.html>
>      > >
>      > >     In <nutch_home>/nutch-site.xml, I added the following:
>      > >     ---------
>      > >     <property>
>      > >       <name>plugin.includes</name>
>      > >
>      > >    
>     <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> 
>      > >
>      > >       <description>description</description>
>      > >     </property>
>      > >     ---------
>      > >
>      > >     I installed nutch 0.9 and I see all plugins including
>     parse-pdf in
>      > >     plugins directory. So thought I don't have to do anything else.
>      > >
>      > >     It doesn't work. Can you pls help.
>      > >
>      > >     PS: I am not on any mailing list. Can you pls CC me on your
>     replies.
>      > >
>      > >     thanks,
>      > >     Krishna.
>      > >
>      > >
>      >
> 
> 

Re: Help: parsing pdf files

Posted by Martin Kuen <ma...@gmail.com>.
Hi,

The settings "file.content.limit" and "http.content.limit" are used for
different protocols. If you are crawling a url like
"http://localhost/whatever", the http plugin is used for fetching (as you've
already guessed). If you have a url starting with "file", another plugin is
used.
"file.content.limit" is used when crawling a local disk (or a mounted
network drive).
"http.content.limit" is used for content that is fetched via http.

These two settings are not related to the mime-type of downloaded content.
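
As an illustrative "nutch-site.xml" sketch (the values here are assumptions -
pick whatever fits your content), raising both limits would look like:

<property>
  <name>http.content.limit</name>
  <!-- content fetched via http; -1 would disable truncation entirely -->
  <value>1048576</value>
</property>
<property>
  <name>file.content.limit</name>
  <!-- content read from local disk or a mounted network drive -->
  <value>-1</value>
</property>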

>2008-01-16 18:38:44,717 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
This is okay since you are on Solaris and the "native" stuff is only
available for Linux.

>2008-01-16 18:38:42,419 WARN  regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
The urlnormalizer is controlled via "regex-normalize.xml". This warning
tells you that it'll use the patterns found in this file regardless of the
current "scope". You can ignore this, or disable the urlnormalizer plugin.
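
For reference, a rule in "conf/regex-normalize.xml" has roughly this shape
(the rule below is purely illustrative, not one of the shipped defaults):

<regex-normalize>
  <regex>
    <!-- illustrative: strip a session-id query parameter from urls -->
    <pattern>&amp;sid=[0-9a-fA-F]+</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>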

Idea:
>I want crawler to fetch pdf files also. I set the url to be
>http://localhost:8080/ and I have several html and pdf files in my
>document root.
1.) Do all your pdf files have an in-link, or are they "just there"?
Try to make a seed urls list which consists of just one url for such a pdf
and see if that pdf is fetched . . .
If the pdfs don't have an in-link, there is no way to discover them via http
(assuming that directory listings are turned off on your server, which
should be the default).
2.) Is your crawl-depth set deep enough?

One last thing I can recommend is to increase the log level.


Best regards,

Martin

On Jan 17, 2008 12:15 PM, Ismael <kr...@gmail.com> wrote:

> I am not sure, but I think that PDF maximum size goes with this property:
>
> <property>
>  <name>file.content.limit</name>
>  <value>-1</value>
>  <description>The length limit for downloaded content, in bytes.
>  If this value is nonnegative (>=0), content longer than it will be
> truncated;
>  otherwise, no truncation at all.
>  </description>
> </property>
>
> 2008/1/17, Krishnamohan Meduri <Kr...@sun.com>:
> > Hi Martin,
> >
> > Thanks for the response.
> > My pdf file size is much less than the default 65536
> >    <name>http.content.limit</name>
> >    <value>65536</value>
> >
> > Can you suggest anything else?
> >
> > thanks,
> > Krishna.
> >
> > Martin Kuen wrote:
> > > Hi,
> > >
> > > what comes to my mind is that there is a setting for the maximum size
> of
> > > a downloaded file.
> > > Have a look at "nutch-default.xml" and override it in "nutch-site.xml
> ".
> > > pdf-files tend to be quite big (compared to html). so probably this is
> > > the source of your problem.
> > > pdf files are downloaded and may get truncated - however the pdf
> parser
> > > cannot handle these truncated pdf files. (truncated html files are
> okay)
> > > If that's the case you should see a warning in the log file.
> > >
> > > So, you should try to increase/modify the logging level/settings in
> > > order to see what is happening. Have a look at "log/hadoop.log". These
> > > logging statements are valuable information regarding your problem.
> > > Logging is controlled via "conf/log4j.properties" - if you're not
> > > running nutch in a servlet container. (ok - you still may controll
> > > logging from the same place, but I think that's hardly done (?) ). In
> > > the mentioned hadoop.log file you'll also see which plugins are
> loaded.
> > >
> > > btw. you don't need to "mess" around with compilation in order to get
> > > this running. (Just looking at the link . . .)
> > >
> > >
> > > Hope it helps,
> > >
> > > Martin
> > >
> > > PS: This kind of question should be asked on the nutch-user list not
> > > dev. Reposted this on user
> > > PPS: I think you should subscribe to the mailing list . . . it's
> useful,
> > > really ;)
> > >
> > > On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <Krishna.Meduri@sun.com
> > > <ma...@sun.com>> wrote:
> > >
> > >     Hello,
> > >
> > >     I want crawler to fetch pdf files also. I set the url to be
> > >     http://localhost:8080/ and I have several html and pdf files in my
> > >     document root.
> > >
> > >     crawler is able to fetch html files but not pdf files.
> > >     I saw
> > >
> http://www.mail-archive.com/nutch-user@incubator.apache.org/msg00344.html
> > >     <
> http://www.mail-archive.com/nutch-user@incubator.apache.org/msg00344.html>
> > >
> > >     In <nutch_home>/nutch-site.xml, I added the following:
> > >     ---------
> > >     <property>
> > >       <name>plugin.includes</name>
> > >
> > >
> <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >
> > >       <description>description</description>
> > >     </property>
> > >     ---------
> > >
> > >     I installed nutch 0.9 and I see all plugins including parse-pdf in
> > >     plugins directory. So thought I don't have to do anything else.
> > >
> > >     It doesn't work. Can you pls help.
> > >
> > >     PS: I am not on any mailing list. Can you pls CC me on your
> replies.
> > >
> > >     thanks,
> > >     Krishna.
> > >
> > >
> >
>

Re: Help: parsing pdf files

Posted by Ismael <kr...@gmail.com>.
I am not sure, but I think the maximum PDF size is controlled by this property:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

2008/1/17, Krishnamohan Meduri <Kr...@sun.com>:
> Hi Martin,
>
> Thanks for the response.
> My pdf file size is much less than the default 65536
>    <name>http.content.limit</name>
>    <value>65536</value>
>
> Can you suggest anything else?
>
> thanks,
> Krishna.
>
> Martin Kuen wrote:
> > Hi,
> >
> > what comes to my mind is that there is a setting for the maximum size of
> > a downloaded file.
> > Have a look at "nutch-default.xml" and override it in "nutch-site.xml".
> > pdf-files tend to be quite big (compared to html). so probably this is
> > the source of your problem.
> > pdf files are downloaded and may get truncated - however the pdf parser
> > cannot handle these truncated pdf files. (truncated html files are okay)
> > If that's the case you should see a warning in the log file.
> >
> > So, you should try to increase/modify the logging level/settings in
> > order to see what is happening. Have a look at "log/hadoop.log". These
> > logging statements are valuable information regarding your problem.
> > Logging is controlled via "conf/log4j.properties" - if you're not
> > running nutch in a servlet container. (ok - you still may controll
> > logging from the same place, but I think that's hardly done (?) ). In
> > the mentioned hadoop.log file you'll also see which plugins are loaded.
> >
> > btw. you don't need to "mess" around with compilation in order to get
> > this running. (Just looking at the link . . .)
> >
> >
> > Hope it helps,
> >
> > Martin
> >
> > PS: This kind of question should be asked on the nutch-user list not
> > dev. Reposted this on user
> > PPS: I think you should subscribe to the mailing list . . . it's useful,
> > really ;)
> >
> > On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <Krishna.Meduri@sun.com
> > <ma...@sun.com>> wrote:
> >
> >     Hello,
> >
> >     I want crawler to fetch pdf files also. I set the url to be
> >     http://localhost:8080/ and I have several html and pdf files in my
> >     document root.
> >
> >     crawler is able to fetch html files but not pdf files.
> >     I saw
> >     http://www.mail-archive.com/nutch-user@incubator.apache.org/msg00344.html
> >     <http://www.mail-archive.com/nutch-user@incubator.apache.org/msg00344.html>
> >
> >     In <nutch_home>/nutch-site.xml, I added the following:
> >     ---------
> >     <property>
> >       <name>plugin.includes</name>
> >
> >     <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >
> >       <description>description</description>
> >     </property>
> >     ---------
> >
> >     I installed nutch 0.9 and I see all plugins including parse-pdf in
> >     plugins directory. So thought I don't have to do anything else.
> >
> >     It doesn't work. Can you pls help.
> >
> >     PS: I am not on any mailing list. Can you pls CC me on your replies.
> >
> >     thanks,
> >     Krishna.
> >
> >
>

Re: Help: parsing pdf files

Posted by Krishnamohan Meduri <Kr...@Sun.COM>.
Hi Martin,

Thanks for the response.
My pdf file size is much smaller than the default of 65536:
   <name>http.content.limit</name>
   <value>65536</value>

Can you suggest anything else?

thanks,
Krishna.

Martin Kuen wrote:
> Hi,
> 
> what comes to my mind is that there is a setting for the maximum size of 
> a downloaded file.
> Have a look at "nutch-default.xml" and override it in "nutch-site.xml".
> pdf-files tend to be quite big (compared to html). so probably this is 
> the source of your problem.
> pdf files are downloaded and may get truncated - however the pdf parser 
> cannot handle these truncated pdf files. (truncated html files are okay) 
> If that's the case you should see a warning in the log file.
> 
> So, you should try to increase/modify the logging level/settings in 
> order to see what is happening. Have a look at "log/hadoop.log". These 
> logging statements are valuable information regarding your problem. 
> Logging is controlled via "conf/log4j.properties" - if you're not 
> running nutch in a servlet container. (ok - you still may controll 
> logging from the same place, but I think that's hardly done (?) ). In 
> the mentioned hadoop.log file you'll also see which plugins are loaded.
> 
> btw. you don't need to "mess" around with compilation in order to get 
> this running. (Just looking at the link . . .)
> 
> 
> Hope it helps,
> 
> Martin
> 
> PS: This kind of question should be asked on the nutch-user list not 
> dev. Reposted this on user
> PPS: I think you should subscribe to the mailing list . . . it's useful, 
> really ;)
> 
> On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <Krishna.Meduri@sun.com 
> <ma...@sun.com>> wrote:
> 
>     Hello,
> 
>     I want crawler to fetch pdf files also. I set the url to be
>     http://localhost:8080/ and I have several html and pdf files in my
>     document root.
> 
>     crawler is able to fetch html files but not pdf files.
>     I saw
>     http://www.mail-archive.com/nutch-user@incubator.apache.org/msg00344.html
>     <http://www.mail-archive.com/nutch-user@incubator.apache.org/msg00344.html>
> 
>     In <nutch_home>/nutch-site.xml, I added the following:
>     ---------
>     <property>
>       <name>plugin.includes</name>
> 
>     <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> 
>       <description>description</description>
>     </property>
>     ---------
> 
>     I installed nutch 0.9 and I see all plugins including parse-pdf in
>     plugins directory. So thought I don't have to do anything else.
> 
>     It doesn't work. Can you pls help.
> 
>     PS: I am not on any mailing list. Can you pls CC me on your replies.
> 
>     thanks,
>     Krishna.
> 
>