Posted to dev@nutch.apache.org by ah...@accenture.com on 2011/03/23 18:59:06 UTC

RE: Crawling PDF


Hello everybody,

I need help with my Nutch configuration. I want to crawl and index PDF files.

I tried following the configuration guide, but without success. Here are the important parts of my configuration:

_____________________________________
crawl-urlfilter.txt

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME

# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

+^http://\S*localhost:8080/examples/jsp/


# skip everything else
-.
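
As a sanity check on the rules above, here is a minimal Python sketch of how they would treat a PDF URL. It assumes the filter applies the rules in order, lets the first pattern found anywhere in the URL decide, and rejects a URL that matches no rule. The URL manual.pdf and the names RULES and accepts are made up for illustration and are not part of Nutch.

```python
import re

# Rules copied from the crawl-urlfilter.txt listing above, as (sign, pattern).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://\S*localhost:8080/examples/jsp/"),
    ("-", r"."),
]


def accepts(url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    # Hypothetical PDF URL under the accepted prefix.
    print(accepts("http://localhost:8080/examples/jsp/manual.pdf"))
```

Under these assumptions the pdf suffix is not in the skip list, so a PDF below /examples/jsp/ is not dropped by the URL filter itself.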

___________________________________________________________________

plugin.xml



   <runtime>
      <library name="parse-pdf.jar">
         <export name="*"/>
      </library>
      <library name="PDFBox-0.7.4-dev.jar"/>
      <library name="FontBox-0.2.0-dev.jar"/>
      <library name="JempBox-0.2.0-dev.jar"/>
      <library name="bcprov-jdk14-132.jar"/>
      <!-- Uncomment the following two lines after you have downloaded the
           libraries, see README.txt for more details.-->

      <library name="jai_codec.jar"/>
      <library name="jai_core.jar"/>
   </runtime>
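
To compare the libraries declared above with the jar files actually present, a small Python sketch like the following could help. It only parses the <runtime> fragment as shown here. The function names declared_libraries and missing_libraries are hypothetical helpers, not part of Nutch, and where Nutch expects the jars to live is not something this sketch knows.

```python
import xml.etree.ElementTree as ET

# The <runtime> fragment from the plugin.xml listing above.
RUNTIME_XML = """<runtime>
  <library name="parse-pdf.jar">
    <export name="*"/>
  </library>
  <library name="PDFBox-0.7.4-dev.jar"/>
  <library name="FontBox-0.2.0-dev.jar"/>
  <library name="JempBox-0.2.0-dev.jar"/>
  <library name="bcprov-jdk14-132.jar"/>
  <library name="jai_codec.jar"/>
  <library name="jai_core.jar"/>
</runtime>"""


def declared_libraries(runtime_xml):
    """Return the name attribute of every <library> element, in document order."""
    root = ET.fromstring(runtime_xml)
    return [lib.get("name") for lib in root.iter("library")]


def missing_libraries(runtime_xml, present_files):
    """Return declared library names that do not appear in present_files."""
    present = set(present_files)
    return [name for name in declared_libraries(runtime_xml) if name not in present]


if __name__ == "__main__":
    # Example: only the two JAI jars are present in the directory.
    print(missing_libraries(RUNTIME_XML, ["jai_codec.jar", "jai_core.jar"]))
```

In practice present_files could come from os.listdir() on whichever directory holds the plugin's jars.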

__________________________________________________________________


regex-urlfilter.txt


# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
________________________________________________________________________


nutch-site.xml

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</description>
</property>
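
A quick way to check that the plugin.includes value above really selects the PDF parser is to test it as a regular expression. This is a minimal Python sketch, assuming the value is matched against the whole plugin id. The name is_included is a hypothetical helper, and parse-rss is just an example of an id that is not listed.

```python
import re

# The plugin.includes value from the nutch-site.xml listing above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|"
    r"parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|"
    r"index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
    r"urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id):
    """Return True if the whole plugin id matches the plugin.includes expression."""
    # Assumption: the expression must match the entire id, not just a part of it.
    return re.fullmatch(PLUGIN_INCLUDES, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("parse-pdf", "parse-rss"):
        print(plugin_id, is_included(plugin_id))
```

Under that assumption parse-pdf is selected by this value, so the expression itself does not exclude the PDF parser.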
___________________________________________________________________________


The 2 libraries (jar files) are copied into the src/plugin/parse-pdf directory.

Please help, and thanks in advance.
























