Posted to user@nutch.apache.org by Frank Huang <ka...@yahoo.com.tw> on 2006/09/01 19:17:57 UTC

Could anyone teach me how to index the title or content of PDF?

I have already read many articles about how Nutch parses the content of PDF
documents, but I'm still confused. I tried modifying
\nutch-0.7.2\conf\nutch-default.xml as follows:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|index-(basic|more)|query-(basic|site|url)|language-identifier|clustering-carrot2</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

I also added the following to /nutch-0.7.2/build.xml:

<project name="parse-pdf" default="jar-core">
  <import file="../build-plugin.xml"/>
  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-log4j"/>
    <ant target="jar" inheritall="false" dir="../lib-fontbox"/> 
  </target>
  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-log4j/*.jar" />
      <include name="**/lib-fontbox/*.jar" />  
    </fileset>
  </path>
  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-log4j"/>
    <ant target="deploy" inheritall="false" dir="../lib-fontbox"/>  
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
    <ant target="deploy" inheritall="false" dir="../protocol-file"/>
  </target>
  <!-- for junit test -->
  <mkdir dir="${build.test}/data"/>
  <copy file="sample/pdftest.pdf" todir="${build.test}/data"/>
</project>

I also put FontBox-0.1.0-dev.jar and PDFBox-0.7.3-dev-20060901.jar into
\plugins\parse-pdf and modified plugin.xml as follows:

<plugin
   id="lib-fontbox"
   name="FontBox"
   version="0.1.0-dev"
   provider-name="org.fontbox">
   <runtime>
     <library name="FontBox-0.1.0-dev.jar">
        <export name="*"/>
     </library>
   </runtime>
</plugin>

<plugin
   id="parse-pdf"
   name="Pdf Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">


   <runtime>
      <library name="parse-pdf.jar">
         <export name="*"/>
      </library>
      <library name="PDFBox-0.7.3-dev-20060901.jar"/>
      <library name="log4j-1.2.9.jar"/>
      <library name="FontBox-0.1.0-dev.jar"/>
   </runtime>

   <extension id="org.apache.nutch.parse.pdf"
              name="PdfParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="org.apache.nutch.parse.pdf.PdfParser"
                      class="org.apache.nutch.parse.pdf.PdfParser"
                      contentType="application/pdf"
                      pathSuffix=""/>

   </extension>

</plugin>
 
But when I run ./nutch crawl, it shows messages like "fetch okay,
but can't parse http://(omit...).pdf, reason: failed <omit..> content
truncated at 70709 bytes. Parser can't handle incomplete pdf file."

Could someone explain in detail how to get this working as soon as possible? Thanks.
-- 
View this message in context: http://www.nabble.com/Could-anyone-teache-me-how-to-index--the-title-or-content-of-PDF--tf2203822.html#a6102866
Sent from the Nutch - User forum at Nabble.com.

Re: Could anyone teach me how to index the title or content of PDF?

Posted by King Kong <ch...@hotmail.com>.



Frank Huang wrote:
> 
> Thanks for your help.
> 
> I crawl over HTTP and set http.content.limit in nutch-default.xml as
> follows:
> <property>
>   <name>http.content.limit</name>
>   <value>16777216</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be truncated;
>   otherwise, no truncation at all.
>   </description>
> </property>
> 
> but it still shows the same error:
> fetch okay, but can't parse http://(omit...).pdf, reason: failed
> <omit..> content truncated at 70709 bytes. Parser can't handle
> incomplete pdf file.
> 
> What did I do wrong? Thanks.
> 
> 


You must set http.content.limit to -1. As the property description says, a
negative value disables truncation entirely.
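
A minimal sketch of that override, assuming it is placed in
conf/nutch-site.xml (inside that file's existing root element) rather than
edited into nutch-default.xml. The file location and the description
wording are assumptions for illustration; only the property name and the
-1 value come from this thread:

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the existing root element. -->
<property>
  <name>http.content.limit</name>
  <!-- Negative value: no truncation, so PDFs are fetched in full. -->
  <value>-1</value>
  <description>Illustrative override: disable the download length limit
  so that parse-pdf never receives an incomplete file.</description>
</property>
```

With no limit at all, very large documents are downloaded completely, so a
large positive value may be the safer choice on an open-ended crawl.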


Re: Could anyone teach me how to index the title or content of PDF?

Posted by Frank Huang <ka...@yahoo.com.tw>.

Thanks for your help.

I crawl over HTTP and set http.content.limit in nutch-default.xml as
follows:
<property>
  <name>http.content.limit</name>
  <value>16777216</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

but it still shows the same error:
fetch okay, but can't parse http://(omit...).pdf, reason: failed
<omit..> content truncated at 70709 bytes. Parser can't handle
incomplete pdf file.

What did I do wrong? Thanks.

Tomi NA wrote:
> 
> On 9/1/06, Frank Huang <ka...@yahoo.com.tw> wrote:
> 
>> But when I run ./nutch crawl, it shows messages like "fetch okay,
>> but can't parse http://(omit...).pdf, reason: failed <omit..> content
>> truncated at 70709 bytes. Parser can't handle incomplete pdf file."
> 
> Haven't had time to go through the complete code (not sure I'd
> understand it, anyway), but this looks like you need to set
> file.content.limit to, say, 16777216. If you're crawling over http
> rather than intranet shares, the property you need to set is
> http.content.limit.
> 
> Hope it helps.
> 
> 
> t.n.a.
> 
> 


Re: Could anyone teach me how to index the title or content of PDF?

Posted by Tomi NA <he...@gmail.com>.

On 9/1/06, Frank Huang <ka...@yahoo.com.tw> wrote:

> But when I run ./nutch crawl, it shows messages like "fetch okay,
> but can't parse http://(omit...).pdf, reason: failed <omit..> content
> truncated at 70709 bytes. Parser can't handle incomplete pdf file."

Haven't had time to go through the complete code (not sure I'd
understand it, anyway), but this looks like you need to set
file.content.limit to, say, 16777216. If you're crawling over http
rather than intranet shares, the property you need to set is
http.content.limit.
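
As a rough sketch of the first suggestion, assuming the files are read
through the protocol-file plugin and that the override sits in
conf/nutch-site.xml inside its existing root element (both are assumptions,
and the description wording is illustrative, not copied from Nutch):

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the existing root element. -->
<property>
  <name>file.content.limit</name>
  <!-- 16777216 bytes = 16 * 1024 * 1024, i.e. 16 MiB per document. -->
  <value>16777216</value>
  <description>Illustrative override: raise the length limit for content
  read from local or shared file systems.</description>
</property>
```

For a crawl over HTTP the same shape would apply with http.content.limit
as the property name.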

Hope it helps.

t.n.a.