Posted to dev@tika.apache.org by "Keith R. Bennett (JIRA)" <ji...@apache.org> on 2007/10/02 02:53:50 UTC

[jira] Updated: (TIKA-35) Extract MsOffice properties

     [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith R. Bennett updated TIKA-35:
---------------------------------

    Attachment: RereadableInputStreamTest.java
                RereadableInputStream.java

Attached are a first pass at a rereadable stream class and a basic unit test that illustrates that it works (basically ;)).

This stream class wraps the document's input stream and saves the content as the wrapped stream is read.

It supports a memory threshold; if the total size read is no more than this threshold, the data is stored in a byte[], and subsequent rereads of the stream are read from a ByteArrayInputStream. If the total size exceeds the threshold, the data is stored in a File, and subsequent passes read from a buffered FileInputStream.

If you place these files in src/main/java/org/apache/tika/utils and src/test/java/org/apache/tika/utils, you should be able to compile them and run the test.

Rereading the stream is accomplished by calling rewind().  Currently rewind() closes the input stream originally passed, but we may want to change that.
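
To make the idea concrete, here is a minimal sketch of the approach described above. It is not the attached RereadableInputStream.java; the class name RereadableSketch, its field names, and the temp-file prefix are all made up for illustration. It also assumes that rewind() drains whatever is left of the original stream before closing it, which the attached class may or may not do. Only the single-byte read() is overridden to keep it short.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical sketch of a rereadable stream; not the attached class. */
class RereadableSketch extends InputStream {
    private final int maxBytesInMemory;   // memory threshold, in bytes
    private InputStream source;           // stream read on the current pass
    private boolean firstPass = true;     // true while reading the original stream
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private File spillFile;               // set once the threshold is exceeded
    private OutputStream spillOut;        // writes to spillFile during the first pass

    RereadableSketch(InputStream original, int maxBytesInMemory) {
        this.source = original;
        this.maxBytesInMemory = maxBytesInMemory;
    }

    @Override
    public int read() throws IOException {
        int b = source.read();
        if (b != -1 && firstPass) {
            save(b);                      // copy each byte as it is first read
        }
        return b;
    }

    private void save(int b) throws IOException {
        if (spillOut == null && memory.size() >= maxBytesInMemory) {
            // One more byte would exceed the threshold: move the saved
            // bytes to a temporary file and keep writing there.
            spillFile = File.createTempFile("rereadable-sketch", ".tmp");
            spillFile.deleteOnExit();
            spillOut = new BufferedOutputStream(new FileOutputStream(spillFile));
            memory.writeTo(spillOut);
            memory = null;
        }
        if (spillOut != null) {
            spillOut.write(b);
        } else {
            memory.write(b);
        }
    }

    /** Starts the next pass from the beginning of the saved content. */
    public void rewind() throws IOException {
        if (firstPass) {
            // Assumption: drain the rest so the saved copy is complete.
            while (read() != -1) {
                // read() saves each byte as a side effect
            }
            if (spillOut != null) {
                spillOut.close();
            }
            firstPass = false;
        }
        source.close();                   // on the first rewind this closes the original stream
        source = (spillFile != null)
                ? new BufferedInputStream(new FileInputStream(spillFile))
                : new ByteArrayInputStream(memory.toByteArray());
    }

    @Override
    public void close() throws IOException {
        source.close();
        if (firstPass && spillOut != null) {
            spillOut.close();
        }
        if (spillFile != null) {
            spillFile.delete();
        }
    }
}
```

A parser could then read the wrapped stream once for the properties, call rewind(), and read it again for the full text. Whether rewind() should close the original stream, as it does in this sketch, is the open question mentioned above.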



> Extract MsOffice properties
> ---------------------------
>
>                 Key: TIKA-35
>                 URL: https://issues.apache.org/jira/browse/TIKA-35
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.1-incubator
>            Reporter: Rida Benjelloun
>            Assignee: Rida Benjelloun
>             Fix For: 0.1-incubator
>
>         Attachments: RereadableInputStream.java, RereadableInputStreamTest.java, tika35.patch, tika35.patch
>
>
> Hi,
> I have developed a patch that allows MsOffice properties extraction. I wasn't able to extract the MsOffice properties and full text from a single input stream; I always get this error: java.io.IOException: Unable to read entire header; -1 bytes read; expected 512 bytes.
> I don't know how they make it work in Nutch (any ideas?).
> To get it to work, I added a "filePath" variable to the parser class, which I populate from the ParseUtils class. I then create an input stream from the filePath or URL and use it to extract the properties, and I use the default input stream to extract the full text.
> I haven't committed this modification; I would like to have your opinions first.
> Regards.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.