Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/12/13 15:51:00 UTC
[jira] [Resolved] (TIKA-2524) Create/integrate a parser for XPS

     [ https://issues.apache.org/jira/browse/TIKA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2524.
-------------------------------
       Resolution: Fixed
         Assignee: Tim Allison
    Fix Version/s: 2.0
                   1.18

Thank you [~gagravarr] for the test file!

I added a first draft of a parser to "master" and the new branch_1x.

While the work is fresh in my mind, some notes:
* I punted on bidi calculations.  Experiments with {{testXPS_various.xps}} were, um, challenging to get right with the spaces.  I _think_ the spec suggests that we can rely on the storage order for the presentation order.  I did do some calculations to keep each canvas within its own div element.
* We could integrate the fragments information to get the right structure for paragraphs and also add table formatting, falling back to the current canvas-based method if a file doesn't happen to contain the fragment info.
* As with PDFs, we're dumping URLs at the end of each page for now.  We could improve this by matching each URL to its anchor text.  I _think_ this will be far easier with XPS than with PDF.
* We have handling of inlined images and thumbnails now.  I don't think it is possible to attach other file types, but if it is, we're not handling other file types at this point.
* I had to trust that OPCPackage was in fact a ZipPackage in order to grab files via the path in the zip file.  I don't like this.  We should try to rely on the .rels more and see if we can get the same functionality with OPCPackage even if it means putting most of this in POI, where it should go anyway. :)
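
For anyone who wants to picture the "storage order, one block per canvas" idea from the first note, here is a minimal sketch. It is not the parser that was committed. It uses only the JDK's SAX parser, and the class name {{FixedPageTextSketch}}, the method {{extract}}, and the rule of joining glyph runs with a single space are all assumptions made up for illustration. It assumes a FixedPage part has already been read into a string, reads the {{UnicodeString}} attribute of each {{Glyphs}} element in document order, and groups the text by outermost {{Canvas}}.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the Tika XPS parser.
public class FixedPageTextSketch {

    /**
     * Returns one string per outermost Canvas, in storage order.
     * A Glyphs element outside any Canvas becomes its own entry.
     * No bidi reordering is attempted.
     */
    public static List<String> extract(String fixedPageXml) throws Exception {
        List<String> blocks = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(
                new InputSource(new StringReader(fixedPageXml)),
                new DefaultHandler() {
                    private int canvasDepth = 0;           // nesting level of Canvas elements
                    private StringBuilder current = null;  // text of the open outermost Canvas

                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if ("Canvas".equals(localName)) {
                            if (canvasDepth++ == 0) {
                                current = new StringBuilder();
                            }
                        } else if ("Glyphs".equals(localName)) {
                            // Unprefixed attributes live in no namespace.
                            String text = atts.getValue("", "UnicodeString");
                            if (text == null) {
                                return;
                            }
                            if (current == null) {
                                blocks.add(text);
                            } else {
                                // Assumption: naive single-space join between runs.
                                if (current.length() > 0) {
                                    current.append(' ');
                                }
                                current.append(text);
                            }
                        }
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        if ("Canvas".equals(localName) && --canvasDepth == 0) {
                            if (current.length() > 0) {
                                blocks.add(current.toString());
                            }
                            current = null;
                        }
                    }
                });
        return blocks;
    }
}
```

Each returned string would correspond to one div in the output. The single-space join is exactly the kind of shortcut that gets spacing wrong on real files, so treat it as a placeholder rather than a recommendation.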

> Create/integrate a parser for XPS
> ---------------------------------
>
>                 Key: TIKA-2524
>                 URL: https://issues.apache.org/jira/browse/TIKA-2524
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Peter Davies
>            Assignee: Tim Allison
>              Labels: features
>             Fix For: 1.18, 2.0
>
>         Attachments: A3S3TDRXL6DN2AN3NU2OE5L7KGFY6DZA.xps, WithBiDi.xps, doc_xps.xps
>
>
> When we parse XPS files using the AutoDetectParser we always get an empty string.
> If we use DefaultDetector.detect() it correctly detects the MediaType as "application/vnd.ms-xpsdocument".
> This page
> https://tika.apache.org/1.16/formats.html
> suggests, however, that XPS (application/vnd.ms-xpsdocument) is supported.
> Our code:
> 		InputStream bis = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + "doc_xps.xps");
> 		Metadata metadata = new Metadata();
> 		BodyContentHandler handler = new BodyContentHandler();
> 		AutoDetectParser parser = new AutoDetectParser();
> 		TikaInputStream tikaStream = TikaInputStream.get(bis);
> 		parser.parse(tikaStream, handler, metadata);
> 		String parsedText = handler.toString();
> I will attach doc_xps.xps if I can.


