Posted to dev@tika.apache.org by "Rafael Ferreira (JIRA)" <ji...@apache.org> on 2018/01/22 02:37:00 UTC
[jira] [Updated] (TIKA-2543) No content extraction for application/x-webarchive format
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rafael Ferreira updated TIKA-2543:
----------------------------------
Attachment: Apache Tika – Configuring Tika.webarchive
> No content extraction for application/x-webarchive format
> ---------------------------------------------------------
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8
> Reporter: Rafael Ferreira
> Priority: Minor
> Attachments: Apache Tika – Configuring Tika.webarchive
>
>
> Steps to reproduce:
> # Using Safari, save any web page as a "webarchive" file.
> # Use Tika to extract the archive content, as in the example below.
> Expected result:
> I would expect Tika to extract the HTML content from the webarchive.
> Actual result:
> Nothing is extracted, although the correct MIME type is identified.
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>     TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>     // Page segmentation mode 11: look for content anywhere in the page, independently of orientation
>     tesseractOCRConfig.setPageSegMode("11");
>     ParseContext context = new ParseContext();
>     // Register the parser in the context so embedded documents are parsed recursively
>     context.set(Parser.class, tika.getParser());
>     context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>     try (InputStream fd = Files.newInputStream(path)) {
>         tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>     } catch (SAXException e) {
>         throw new EngineError(e);
>     }
> }
> {code}
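When reproducing the steps above, it can help to confirm that the saved file really is a Safari web archive before handing it to Tika. The sketch below is not part of the original report. It assumes the archive was written as a binary property list, which starts with the ASCII bytes bplist00; an archive converted to another plist encoding would not match. The class and method names (WebArchiveCheck, looksLikeBinaryPlist) are hypothetical, and the code uses only the JDK, not Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or of the original report.
class WebArchiveCheck {
    // Binary property lists begin with these eight ASCII bytes.
    private static final byte[] BPLIST_MAGIC =
            "bplist00".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes start with the binary plist magic.
    static boolean looksLikeBinaryPlist(byte[] data) {
        if (data.length < BPLIST_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, BPLIST_MAGIC.length), BPLIST_MAGIC);
    }

    // Reads at most the first eight bytes of the file and checks them.
    static boolean looksLikeBinaryPlist(Path file) throws IOException {
        byte[] header = new byte[BPLIST_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return looksLikeBinaryPlist(Arrays.copyOf(header, filled));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java WebArchiveCheck <file.webarchive>");
            return;
        }
        System.out.println(looksLikeBinaryPlist(Paths.get(args[0])));
    }
}
```

If the check passes and Tika still writes nothing, the problem is more likely in how the archive is parsed than in the input file itself.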
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)