Posted to dev@tika.apache.org by "Dave Meikle (JIRA)" <ji...@apache.org> on 2013/04/07 20:33:16 UTC
[jira] [Assigned] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
[ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dave Meikle reassigned TIKA-1098:
---------------------------------
Assignee: Dave Meikle
> not able to parse pdfs/docs/ppts using 1.1 tika parser
> --------------------------------------------------------
>
> Key: TIKA-1098
> URL: https://issues.apache.org/jira/browse/TIKA-1098
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.1
> Environment: linux redhat
> Reporter: Qian Diao
> Assignee: Dave Meikle
> Attachments: url_1763_approx-alg-notes.pdf
>
>
> Hi,
> I ran into a parsing problem when using Tika 1.1 on the attached PDF file.
> My code (Test.java):
> import java.io.File;
> import java.io.InputStream;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.html.BoilerpipeContentHandler;
> import org.apache.tika.sax.BodyContentHandler;
> import org.apache.tika.parser.html.HtmlParser;
> import de.l3s.boilerpipe.extractors.ArticleExtractor;
>
> public class Test {
>     // File names with these extensions are parsed as HTML through boilerpipe.
>     private static final String validBoilerpipeFilenameRegEx = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";
>
>     public String parseFile(File inFile) {
>         if (inFile == null || !inFile.isFile() || !inFile.canRead()) return null;
>
>         InputStream is = null;
>         String outputText = "";
>         try {
>             // Open input stream
>             is = new FileInputStream(inFile);
>             // Prepare parser (-1 disables the handler's write limit)
>             BodyContentHandler contenthandler = new BodyContentHandler(-1);
>             Metadata metadata = new Metadata();
>             metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
>             ParseContext pc = new ParseContext();
>             // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
>             if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
>                 Parser parser = new AutoDetectParser();
>                 parser.parse(is, contenthandler, metadata, pc);
>             }
>             else {
>                 Parser parser = new HtmlParser();
>                 BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
>                 parser.parse(is, bh, metadata, pc);
>             }
>             // Prepare text for write
>             outputText = contenthandler.toString();
>         } catch (Exception e) {
>             // Prints only the outermost exception, not its cause or stack trace
>             System.out.println(e);
>             return null;
>         } finally {
>             try {
>                 if (is != null)
>                     is.close();
>             } catch (Exception e) {
>                 // Ignore failures while closing the stream
>             }
>         }
>
>         return outputText;
>     }
> }
> =====output====
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> url_1763_approx-alg-notes.pdf
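
The single line of output above comes from `System.out.println(e)`, which prints only the outermost exception. The following is a minimal sketch, assuming (the report does not confirm this) that the `TikaException` wraps an underlying cause, of how the chain could be listed with the standard `Throwable.getCause()`. `CauseChain` and `describe` are hypothetical names introduced for illustration and are not part of Tika; the exceptions in `main` are stand-ins, not the real failure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: lists an exception and its causes.
public class CauseChain {

    // Returns one "class: message" line per exception, outermost first.
    public static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<String>();
        int depth = 0;
        // Cap the walk so a cyclic cause chain cannot loop forever.
        while (t != null && depth < 32) {
            lines.add(t.getClass().getName() + ": " + t.getMessage());
            t = t.getCause();
            depth++;
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in exceptions; in Test.java the argument would be the caught `e`.
        Exception inner = new java.io.IOException("stand-in root cause");
        Exception outer = new Exception("Unable to extract PDF content", inner);
        for (String line : describe(outer)) {
            System.out.println(line);
        }
    }
}
```

In the catch block of `parseFile`, calling `e.printStackTrace()` would give similar information together with the stack frames.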
Re: [jira] [Assigned] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
Posted by Dave Meikle <lo...@gmail.com>.
On 7 Apr 2013, at 19:33, Dave Meikle (JIRA) <ji...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Dave Meikle reassigned TIKA-1098:
> ---------------------------------
>
> Assignee: Dave Meikle
Oops, I didn't mean to do that; I accidentally hit the trackpad!
I think the issue was unassigned, so I have set it back to that. If it was assigned to you, please pick it back up again.
Cheers,
Dave