Posted to dev@nutch.apache.org by "dibyendu ghosh (Commented) (JIRA)" <ji...@apache.org> on 2011/12/02 08:11:40 UTC

[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

    [ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161463#comment-13161463 ] 

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Tried with 1.4. It's still not working. 1.3 did not have the parsechecker option in the nutch script; 1.4 shows the following output:
===========
bash-2.00$ bin/nutch parsechecker -dumpText https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
fetching: https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
Can't fetch URL successfully
===========
This is after keeping the above-mentioned configuration setting in nutch-site.xml.
                
> tika parser of nutch 1.3 is failing to prcess pdfs
> --------------------------------------------------
>
>                 Key: NUTCH-1206
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1206
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>         Environment: Solaris/Linux/Windows
>            Reporter: dibyendu ghosh
>            Assignee: Chris A. Mattmann
>         Attachments: direct.pdf
>
>
> Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old parse-pdf parser seems able to parse older PDFs (checked with Nutch 1.2), though it cannot parse Acrobat 9.0 version PDFs. Nutch 1.3 does not have the parse-pdf plugin, and it cannot parse even the older PDFs.
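Since the report distinguishes older PDFs from Acrobat 9.0 ones, it may help to confirm which PDF version a given file declares. The following is a minimal sketch, not part of the original report. It assumes the file starts with the usual %PDF-x.y header at byte 0 (files with leading bytes before the header would need a wider search). PdfVersionProbe, versionFromHeader and probe are hypothetical names.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PdfVersionProbe {
    // Returns the "x.y" part of a leading "%PDF-x.y" header, or null if absent.
    static String versionFromHeader(byte[] head) {
        String s = new String(head, StandardCharsets.ISO_8859_1);
        if (!s.startsWith("%PDF-") || s.length() < 8) {
            return null;
        }
        return s.substring(5, 8);
    }

    // Reads up to the first 8 bytes of the file and inspects them.
    static String probe(String fileName) throws IOException {
        byte[] head = new byte[8];
        InputStream in = new FileInputStream(fileName);
        try {
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // file shorter than 8 bytes
                }
                n += r;
            }
            return versionFromHeader(Arrays.copyOf(head, n));
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        for (String fileName : args) {
            System.out.println(fileName + ": " + probe(fileName));
        }
    }
}
```

Running it as "java PdfVersionProbe direct.pdf" prints the declared version, or null when the header is not found where assumed.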
> my code (TestParse.java):
> ----------------------------
> bash-2.00$ cat TestParse.java
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.PrintStream;
> import java.util.Iterator;
> import java.util.Map;
> import java.util.Map.Entry;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.metadata.Metadata;
> import org.apache.nutch.parse.ParseResult;
> import org.apache.nutch.parse.Parse;
> import org.apache.nutch.parse.ParseStatus;
> import org.apache.nutch.parse.ParseUtil;
> import org.apache.nutch.parse.ParseData;
> import org.apache.nutch.protocol.Content;
> import org.apache.nutch.util.NutchConfiguration;
> public class TestParse {
>     private static Configuration conf = NutchConfiguration.create();
>     public TestParse() {
>     }
>     public static void main(String[] args) {
>         String filename = args[0];
>         convert(filename);
>     }
>     public static String convert(String fileName) {
>         String newName = "abc.html";
>         try {
>             System.out.println("Converting " + fileName + " to html.");
>             if (convertToHtml(fileName, newName))
>                 return newName;
>         } catch (Exception e) {
>             (new File(newName)).delete();
>             System.out.println("General exception " + e.getMessage());
>         }
>         return null;
>     }
>     private static boolean convertToHtml(String fileName, String newName)
>         throws Exception {
>         // Read the file
>         FileInputStream in = new FileInputStream(fileName);
>         byte[] buf = new byte[in.available()];
>         in.read(buf);
>         in.close();
>         // Parse the file
>         Content content = new Content("file:" + fileName, "file:" + fileName,
>                                       buf, "", new Metadata(), conf);
>         ParseResult parseResult = new ParseUtil(conf).parse(content);
>         parseResult.filter();
>         if (parseResult.isEmpty()) {
>             System.out.println("All parsing attempts failed");
>             return false;
>         }
>         Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
>         if (iterator == null) {
>             System.out.println("Cannot iterate over successful parse results");
>             return false;
>         }
>         Parse parse = null;
>         ParseData parseData = null;
>         while (iterator.hasNext()) {
>             parse = parseResult.get((Text)iterator.next().getKey());
>             parseData = parse.getData();
>             ParseStatus status = parseData.getStatus();
>             // If Parse failed then bail
>             if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
>                 System.out.println("Could not parse " + fileName + ". " +
>                             status.getMessage());
>                 return false;
>             }
>         }
>         // Start writing to newName
>         FileOutputStream fout = new FileOutputStream(newName);
>         PrintStream out = new PrintStream(fout, true, "UTF-8");
>         // Start Document
>         out.println("<html>");
>         // Start Header
>         out.println("<head>");
>         // Write Title
>         String title = parseData.getTitle();
>         if (title != null && title.trim().length() > 0) {
>             out.println("<title>" + parseData.getTitle() + "</title>");
>         }
>         // Write out Meta tags
>         Metadata metaData = parseData.getContentMeta();
>         String[] names = metaData.names();
>         for (String name : names) {
>             String[] subvalues = metaData.getValues(name);
>             // Start from an empty string; concatenating onto null would yield "null...".
>             String values = "";
>             for (String subvalue : subvalues) {
>                 values += subvalue;
>             }
>             if (values.length() > 0)
>                 out.printf("<meta name=\"%s\" content=\"%s\"/>\n",
>                            name, values);
>         }
>         out.println("<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"/>");
>         // End Meta tags
>         out.println("</head>"); // End Header
>         // Start Body
>         out.println("<body>");
>         out.print(parse.getText());
>         out.println("</body>"); // End Body
>         out.println("</html>"); // End Document
>         out.close(); // Close the file
>         return true;
>     }
> }
> ----------------------------
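Two steps in a harness like the one above can be exercised outside Nutch. The following is a minimal sketch, not part of the original report: readFully loads a whole file without relying on a single read() call filling the buffer, and joinValues concatenates a multi-valued metadata entry with a StringBuilder (in Java, appending to a String that starts as null produces the literal text "null" at the front). ParseHelpers, readFully and joinValues are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class ParseHelpers {
    // Reads the entire file into memory; Files.readAllBytes loops until EOF.
    static byte[] readFully(String fileName) throws IOException {
        return Files.readAllBytes(Paths.get(fileName));
    }

    // Concatenates all non-null values of one metadata entry, with no separator.
    static String joinValues(String[] subvalues) {
        StringBuilder sb = new StringBuilder();
        for (String subvalue : subvalues) {
            if (subvalue != null) {
                sb.append(subvalue);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(joinValues(new String[] {"text/", "html"}));
        if (args.length > 0) {
            System.out.println(readFully(args[0]).length + " bytes read");
        }
    }
}
```

The byte array from readFully could be passed to the Content constructor in place of buf, and joinValues could replace the inner loop over subvalues.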
> command:
> ======
> bash-2.00$ java -classpath conf:runtime/local/lib/nutch-1.3.jar:runtime/local/lib/hadoop-core-0.20.2.jar:runtime/local/lib/commons-logging-api-1.0.4.jar:runtime/local/lib/tika-core-0.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/oro-2.0.8.jar:. TestParse direct.pdf
> ======
> output:
> _____
> Converting direct.pdf to html.
> Oct 19, 2011 5:05:19 PM org.apache.hadoop.conf.Configuration getConfResourceAsInputStream
> INFO: found resource tika-mimetypes.xml at file:/path/to/nutch/1.3/conf/tika-mimetypes.xml
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginManifestParser parsePluginFolder
> INFO: Plugins: looking in: /path/to/nutch/1.3/runtime/local/plugins
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Plugin Auto-activation mode: [true]
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Registered Plugins:
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   the nutch core extension points (nutch-extensionpoints)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Tika Parser Plug-in (parse-tika)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Registered Extension-Points:
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Protocol (org.apache.nutch.protocol.Protocol)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch URL Filter (org.apache.nutch.net.URLFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Content Parser (org.apache.nutch.parse.Parser)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> Oct 19, 2011 5:05:20 PM org.apache.hadoop.conf.Configuration getConfResourceAsInputStream
> INFO: found resource parse-plugins.xml at file:/path/to/nutch/1.3/conf/parse-plugins.xml
> Oct 19, 2011 5:05:20 PM org.apache.nutch.parse.ParserFactory matchExtensions
> INFO: The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it  in the parse-plugins.xml file
> Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseUtil parse
> WARNING: Unable to successfully parse content file:direct.pdf of type application/pdf
> Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseResult filter
> WARNING: file:direct.pdf is not parsed successfully, filtering
> All parsing attempts failed
> _____
> my customized nutch-site.xml:
> ~~~~~~~~~~~~~~~~~~~~
> bash-2.00$ cat conf/nutch-site.xml
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <configuration>
>   <property>
>     <name>plugin.folders</name>
>     <value>runtime/local/plugins</value>
>     <description>Directories where nutch plugins are located.  Each
>     element may be a relative or absolute path.  If absolute, it is used
>     as is.  If relative, it is searched for on the classpath.</description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>parse-tika</value>
>     <description>Regular expression naming plugin directory names to
>     include. Any plugin not matching this expression is excluded.
>     </description>
>   </property>
> </configuration>
> ~~~~~~~~~~~~~~~~~~~~
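The description of plugin.includes above says the value is a regular expression and that any plugin not matching it is excluded. The following is a minimal sketch of what that implies for the value parse-tika, assuming (not verified against the Nutch source here) that the whole plugin id must match the expression. PluginIncludesCheck is a hypothetical helper, and the plugin ids passed in are only sample input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PluginIncludesCheck {
    // Returns the ids that fully match the plugin.includes expression.
    static List<String> included(String pluginIncludes, String... pluginIds) {
        Pattern includes = Pattern.compile(pluginIncludes);
        List<String> result = new ArrayList<String>();
        for (String id : pluginIds) {
            if (includes.matcher(id).matches()) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] ids = {"protocol-http", "parse-tika", "urlfilter-regex"};
        // Value from the nutch-site.xml in the report.
        System.out.println(included("parse-tika", ids));
        // A wider, purely illustrative expression.
        System.out.println(included("protocol-http|parse-tika", ids));
    }
}
```

Under that assumption only parse-tika survives the first expression, which agrees with the Registered Plugins lines in the log. A run that has to fetch a URL would presumably also need a protocol plugin in the expression.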

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira