Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2017/12/11 20:24:00 UTC

[jira] [Comment Edited] (TIKA-2524) Create/integrate a parser for XPS

    [ https://issues.apache.org/jira/browse/TIKA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286494#comment-16286494 ] 

Nick Burch edited comment on TIKA-2524 at 12/11/17 8:23 PM:
------------------------------------------------------------

Based on the test XPS file we have in Apache POI, it probably wouldn't be too hard to knock up a parser for {{Documents/#/Pages/#.fpage}} entries which grabs the {{UnicodeString}} from any {{Glyphs}} elements. No idea how much text that'd miss though....
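
A rough sketch of that idea, using only the JDK, might look like the code below. It is not Tika or POI code. The class name {{XpsTextSketch}} and the method {{extractText}} are made up for illustration, and the entry-name pattern is just a literal reading of {{Documents/#/Pages/#.fpage}}. It assumes the XPS file is an ordinary zip whose page parts are stored whole (interleaved parts are ignored), and it emits pages in zip order, which may not be reading order.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper, not part of Tika or POI. */
public class XpsTextSketch {

    /** Returns the UnicodeString of every Glyphs element found in .fpage entries, one per line. */
    public static String extractText(InputStream xps) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input is untrusted XML.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("Glyphs".equals(localName)) {
                    // Unprefixed attributes have no namespace, hence the empty URI.
                    String run = attrs.getValue("", "UnicodeString");
                    if (run != null && !run.isEmpty()) {
                        text.append(run).append('\n');
                    }
                }
            }
        };

        try (ZipInputStream zip = new ZipInputStream(xps)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumed layout: Documents/<n>/Pages/<n>.fpage
                if (!entry.getName().matches("/?Documents/[^/]+/Pages/[^/]+\\.fpage")) {
                    continue;
                }
                // Shield the zip stream so the XML parser cannot close it between entries.
                InputStream page = new FilterInputStream(zip) {
                    @Override
                    public void close() {
                        // intentionally left open; the outer try block closes the zip
                    }
                };
                SAXParser parser = factory.newSAXParser();
                parser.parse(page, handler);
            }
        }
        return text.toString();
    }
}
```

Calling {{XpsTextSketch.extractText(stream)}} on an XPS input stream would return whatever text sits in {{UnicodeString}} attributes. Text that a page carries only as glyph indices, without a {{UnicodeString}}, would be missed, which is the open question above.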


was (Author: gagravarr):
Based on the test XPS file we have in Apache POI, it probably wouldn't be too hard to knock up a parser for {{Documents/#/Pages/#.fpage}} entries which grabs the {{{UnicodeString}}} from any {{{Glyphs}}} elements. No idea how much text that'd miss though....

> Create/integrate a parser for XPS
> ---------------------------------
>
>                 Key: TIKA-2524
>                 URL: https://issues.apache.org/jira/browse/TIKA-2524
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Peter Davies
>              Labels: features
>         Attachments: doc_xps.xps
>
>
> When we parse XPS files using the AutoParser we always get an empty string.
> If we use DefaultDetector.detect() it correctly detects the MediaType as "application/vnd.ms-xpsdocument".
> However, this page
> https://tika.apache.org/1.16/formats.html
> suggests that XPS (application/vnd.ms-xpsdocument) is supported.
> Our code:
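> 		// Needs java.io.InputStream plus org.apache.tika.io.TikaInputStream, org.apache.tika.metadata.Metadata,
> 		// org.apache.tika.parser.AutoDetectParser and org.apache.tika.sax.BodyContentHandler.
> 		Metadata metadata = new Metadata();
> 		BodyContentHandler handler = new BodyContentHandler();
> 		AutoDetectParser parser = new AutoDetectParser();
> 		// try-with-resources closes both streams once parsing finishes
> 		try (InputStream bis = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + "doc_xps.xps");
> 				TikaInputStream tikaStream = TikaInputStream.get(bis)) {
> 			parser.parse(tikaStream, handler, metadata);
> 		}
> 		// For the XPS file this comes back as an empty string
> 		String parsedText = handler.toString();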
> I will attach doc_xps.xps if I can



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)