
Posted to dev@tika.apache.org by "Rida Benjelloun (JIRA)" <ji...@apache.org> on 2007/09/27 19:13:51 UTC

[jira] Created: (TIKA-35) Extract MsOffice properties

Extract MsOffice properties
---------------------------

                 Key: TIKA-35
                 URL: https://issues.apache.org/jira/browse/TIKA-35
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 0.1-incubator
            Reporter: Rida Benjelloun
             Fix For: 0.1-incubator
         Attachments: tika35.patch

Hi,
I have developed a patch that allows extraction of MsOffice properties. I wasn't able to extract both the MsOffice properties and the full text from a single InputStream; I always get this error: java.io.IOException: Unable to read entire header; -1 bytes read; expected 512 bytes.
I don't know how they make it work in Nutch (any ideas?).
To get it to work, I added a "filePath" variable to the parser class, which I populate from the ParseUtils class. I then create a second InputStream from the filePath or URL and use it to extract the properties, and I use the default InputStream to extract the full text.
I haven't committed this modification; I would like to have your opinions first.
Regards.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
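
The error reported in this issue is consistent with a stream being consumed twice: once a plain InputStream has been read to its end, a second reader finds nothing left. The following is a minimal Java sketch of that behavior only; the class name SinglePassDemo and the 512-byte test buffer are illustrative assumptions, not code from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class SinglePassDemo {
    /** Reads the stream to its end and returns how many bytes were consumed. */
    static int drain(InputStream in) {
        try {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(new byte[512]);
        int firstPass = drain(in);   // stands in for the property extractor
        int secondPass = drain(in);  // stands in for the full-text extractor
        // prints "512 bytes, then 0 bytes"
        System.out.println(firstPass + " bytes, then " + secondPass + " bytes");
    }
}
```

The second pass sees an already exhausted stream, which is why some way of re-reading the data is needed.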

Re: [jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by Bertrand Delacretaz <bd...@apache.org>.

On 10/1/07, Jukka Zitting <ju...@gmail.com> wrote:

> ...I can dig up
> some of my old code and contribute it to commons-io and/or Tika...

Cool - I think it has its place in commons-io.

> ...should we still create a temporary copy of the data while parsing or
> can we rely on rereading the source of the data? A temporary copy
> introduces quite a bit of overhead, but avoids nasty problems...

I'd go for a temp copy, at least initially. As you mention, rereading
can have "interesting" side effects sometimes...

-Bertrand

Re: [jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 9/28/07, Bertrand Delacretaz <bd...@apache.org> wrote:
> On 9/28/07, kbennett <kb...@bbsinc.biz> wrote:
> > ...It would be nice if there were some implementation of BufferedReader that
> > used disk instead of memory if the readaheadLimit exceeded a threshold.  If
> > not, we may need to write our own....
>
> Agreed, a BufferedReader with "unlimited" storage on disk sounds like
> the way to go.
>
> I don't know of any existing implementation, though.

I've implemented such classes a few times before, based on support
classes (like DeferredFileOutputStream) from commons-io. I can dig up
some of my old code and contribute it to commons-io and/or Tika.
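
A rough sketch of the idea of a buffer that starts in memory and moves to disk past a threshold, written against the Java standard library only. It is not the commons-io class mentioned above; the class SpillBuffer, its methods, and the "spill" temp-file prefix are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: buffers in memory up to a threshold, then on disk. */
final class SpillBuffer implements AutoCloseable {
    private final int threshold;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path file;          // null while everything still fits in memory
    private OutputStream disk;  // opened only after spilling

    SpillBuffer(int threshold) {
        this.threshold = threshold;
    }

    void write(int b) {
        try {
            if (file == null && memory.size() >= threshold) {
                // Threshold reached: move what we have to a temporary file.
                file = Files.createTempFile("spill", ".tmp");
                disk = Files.newOutputStream(file);
                memory.writeTo(disk);
                memory = null;
            }
            if (file == null) {
                memory.write(b);
            } else {
                disk.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isInMemory() {
        return file == null;
    }

    /** Opens a fresh stream over everything written so far. */
    InputStream openStream() {
        try {
            if (file == null) {
                return new ByteArrayInputStream(memory.toByteArray());
            }
            disk.flush();
            return Files.newInputStream(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience for small data: returns all buffered bytes. */
    byte[] contents() {
        try (InputStream in = openStream()) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the temporary file, if one was created. */
    @Override
    public void close() {
        try {
            if (disk != null) {
                disk.close();
            }
            if (file != null) {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a threshold of 4, the first four bytes stay in memory and the fifth write triggers the move to disk. A real implementation would write in blocks rather than byte by byte.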

There's an interesting question about a potential optimization: If the
stream being processed is based on a File, a URI, or a byte array,
should we still create a temporary copy of the data while parsing or
can we rely on rereading the source of the data? A temporary copy
introduces quite a bit of overhead, but avoids nasty problems with
files/resources/arrays being overwritten between consecutive parsing
passes.

BR,

Jukka Zitting

Re: [jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by Bertrand Delacretaz <bd...@apache.org>.

On 9/28/07, kbennett <kb...@bbsinc.biz> wrote:

> ...It would be nice if there were some implementation of BufferedReader that
> used disk instead of memory if the readaheadLimit exceeded a threshold.  If
> not, we may need to write our own....

Agreed, a BufferedReader with "unlimited" storage on disk sounds like
the way to go.

I don't know of any existing implementation, though.

-Bertrand

Re: [jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by kbennett <kb...@bbsinc.biz>.

Rida & All -

I just did some research on mark and reset and found that, IMO, it will
not help us.  It is true that we could wrap every stream in a
BufferedReader, which is guaranteed to support mark and reset.  However,
when mark() is called, it requires a parameter specifying the readahead
limit (the number of characters to save for a possible reset() call).  Since
we are dealing with documents of arbitrary length, it would not be practical
to rely on this.  Those characters are stored in memory, so even when we
implement chunking, we would still have a memory limitation regarding
document size.
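
The readahead limit can be seen in a small Java sketch. It uses the standard BufferedReader.mark(int) and reset() calls; the class and method names (MarkResetDemo, readTwice) and the sample text are made up for illustration. Re-reading is only reliable while the number of characters read after mark() stays within the limit passed to it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class MarkResetDemo {
    /**
     * Reads up to n characters twice from the same reader by marking the
     * position first. Only safe while n <= readAheadLimit.
     */
    static String[] readTwice(Reader source, int n, int readAheadLimit) {
        try {
            BufferedReader reader = new BufferedReader(source);
            reader.mark(readAheadLimit);   // up to this many chars are kept in memory
            String first = readChars(reader, n);
            reader.reset();                // go back to the marked position
            String second = readChars(reader, n);
            return new String[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String readChars(Reader reader, int n) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while (sb.length() < n && (c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] passes = readTwice(new StringReader("document text"), 8, 64);
        // prints "document | document"
        System.out.println(passes[0] + " | " + passes[1]);
    }
}
```

For a document of unknown size there is no safe value for the limit short of the whole document, which is the memory problem described above.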

Unless we can reuse the resource identifier (file, URL, etc.) for multiple
passes, I think we'll have to save the stream in a temporary file when we
read it the first time, and then read it from that file on subsequent
passes.  I suppose that's something each parser implementation would decide
for itself.  This, of course, would not remove a size limitation, but it
would change it to be the amount of usable disk space rather than memory.
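
A minimal Java sketch of the temporary-file approach, using only java.nio.file calls from the standard library. The class TempFilePasses, its method names, and the "tika-pass" file prefix are assumptions for illustration; the byte-counting pass stands in for whatever a parser would really do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class TempFilePasses {
    /** Copies the stream to a temporary file and returns the file's path. */
    static Path spool(InputStream in) {
        try {
            Path tmp = Files.createTempFile("tika-pass", ".bin");
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One "pass" over the saved copy; here it only counts bytes. */
    static long countBytes(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            long n = 0;
            while (in.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Removes the temporary copy once all passes are done. */
    static void delete(Path file) {
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path copy = spool(new ByteArrayInputStream(new byte[1000]));
        System.out.println(countBytes(copy) + " " + countBytes(copy));
        delete(copy);
    }
}
```

Each pass opens its own stream on the saved copy, so the original stream is read exactly once.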

It would be nice if there were some implementation of BufferedReader that
used disk instead of memory if the readaheadLimit exceeded a threshold.  If
not, we may need to write our own.  On the other hand, I'm sure we're not
the first people to encounter this problem; I wonder if there are better
solutions out there already.

- Keith


-- 
View this message in context: http://www.nabble.com/-jira--Created%3A-%28TIKA-35%29-Extract-MsOffice-properties-tf4529774.html#a12942832
Sent from the Apache Tika - Development mailing list archive at Nabble.com.

Re: [jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by kbennett <kb...@bbsinc.biz>.

Rida -

Some InputStream implementations support mark and reset.  Using this, you
can set a mark and then go back to it.  We may want to use that where
possible if it looks like it's more economical to do so.  Then, in other
cases, we could save the stream's bytes in a temporary file.

I think Jukka has put a lot of thought into this issue already, such as in
this message:

http://www.nabble.com/Tika-pipelines-%28was%3A-Tika-discussions-in-Amsterdam%29-tf3691029.html#a12882886

- Keith



JIRA jira@apache.org wrote:
> 
> 
>     [
> https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530825
> ] 
> 
> Rida Benjelloun commented on TIKA-35:
> -------------------------------------
> 
> Hi Keith,
> I like the idea to save the content of the stream during the first pass. 
> Thanks
> 
> 

-- 
View this message in context: http://www.nabble.com/-jira--Created%3A-%28TIKA-35%29-Extract-MsOffice-properties-tf4529774.html#a12929882
Sent from the Apache Tika - Development mailing list archive at Nabble.com.

[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532210 ] 

Keith R. Bennett commented on TIKA-35:
--------------------------------------

Rida -

I was wrong when I said the original input stream could have been used.  I didn't see that the RereadableInputStream was being read twice.

Sorry for the error.

- Keith


> Extract MsOffice properties
> ---------------------------
>
>                 Key: TIKA-35
>                 URL: https://issues.apache.org/jira/browse/TIKA-35
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.1-incubator
>            Reporter: Rida Benjelloun
>            Assignee: Rida Benjelloun
>             Fix For: 0.1-incubator
>
>         Attachments: RereadableInputStream.java, RereadableInputStreamTest.java, tika35.patch, tika35.patch

[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532176 ] 

Keith R. Bennett commented on TIKA-35:
--------------------------------------

Chris -

As per my previous comment, we may not need the RereadableInputStream after all.

If we do keep it, though, I think for most use cases it would make sense for the default behavior to be:

If rewind() is called on the first pass, read until end of stream to save the entire stream content.

Perhaps there could be a way to override this, but I don't think we would ever want to use it.

What do you think?

By the way, this class was really a proof of concept, and was not intended to be complete.  For example, we would probably want to set a maximum number of bytes read total (as opposed to in memory), to avoid consuming too much disk space.  I would also add javadoc. ;)

- Keith



[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532000 ] 

Chris A. Mattmann commented on TIKA-35:
---------------------------------------

Hi Folks:

// Instantiate it with your stream and a memory threshold:
RereadableInputStream stream = new RereadableInputStream(aStream, 1024 * 1024);

// Force reading entire stream to place it in storage for subsequent passes:
while (stream.read() != -1) {
    // empty loop
} 

Why not use the approach suggested by Keith above to wrap the rewind method with a check to see if the stream is at the end of stream? We could require RereadableInputStream to take an optional parameter, let's call it "forceSeekOnRewind". By default, this would be set to false, but there could be a method that would set this to true, e.g., "enableForceSeekOnRewind()". Then, in the rewind method, it would first do something like:

if (forceSeekOnRewind) {
    while (read() != -1) {
        // empty loop
    }
    doRewind(); /* does the actual rewind work */
} else {
    if (EOF()) {
        /* at EOF, so go ahead and rewind */
        doRewind();
    }
    /* else do nothing */
}
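
A self-contained Java sketch of the rewind logic above. It is illustrative only and is not the RereadableInputStream attached to the issue: the class RewindableStreamSketch, the flag name forceReadOnRewind, and the demo method are hypothetical, and the saved copy is kept purely in memory to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Illustrative only; not the RereadableInputStream attached to the issue. */
final class RewindableStreamSketch extends InputStream {
    private final ByteArrayOutputStream saved = new ByteArrayOutputStream();
    private final boolean forceReadOnRewind;
    private InputStream current;      // the original stream, later the saved copy
    private boolean firstPass = true;
    private boolean atEnd = false;

    RewindableStreamSketch(InputStream in, boolean forceReadOnRewind) {
        this.current = in;
        this.forceReadOnRewind = forceReadOnRewind;
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (b == -1) {
            atEnd = true;
        } else if (firstPass) {
            saved.write(b);           // remember first-pass bytes for later passes
        }
        return b;
    }

    /** Rewinds if allowed; returns false when nothing was done. */
    boolean rewind() throws IOException {
        if (forceReadOnRewind) {
            while (read() != -1) {
                // drain the rest so the saved copy is complete
            }
        } else if (!atEnd) {
            return false;             // not at end of stream: do nothing
        }
        if (firstPass) {
            current.close();
            firstPass = false;
        }
        current = new ByteArrayInputStream(saved.toByteArray());
        atEnd = false;
        return true;
    }

    /** Demo: read 3 bytes, call rewind(), then count what is still readable. */
    static int readableAfterEarlyRewind(byte[] data, boolean force) {
        try (RewindableStreamSketch in =
                new RewindableStreamSketch(new ByteArrayInputStream(data), force)) {
            for (int i = 0; i < 3; i++) {
                in.read();
            }
            in.rewind();
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a 10-byte input, readableAfterEarlyRewind returns 10 with the flag set (the stream is drained, then rewound) and 7 without it (the early rewind is ignored and reading simply continues).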


Cheers, 
 Chris

   


[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532241 ] 

Rida Benjelloun commented on TIKA-35:
-------------------------------------

Keith - 
This is done. 
Thanks


[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531840 ] 

Rida Benjelloun commented on TIKA-35:
-------------------------------------

Hi Keith,
Thanks for this contribution. I will test it for extracting Office properties.


[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531559 ] 

Rida Benjelloun commented on TIKA-35:
-------------------------------------

I have implemented a method in the Utils class that copies the InputStream into memory.


[jira] Updated: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rida Benjelloun updated TIKA-35:
--------------------------------

    Attachment: tika35.patch

New patch; removed the absolute directory name in the header.
Thanks, Keith.


[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532211 ] 

Keith R. Bennett commented on TIKA-35:
--------------------------------------

Rida -

Please close the RereadableInputStream after you're finished using it.  That deletes the temporary file, if one was created.  Without this, the user's disk could be filled up while processing documents.

Sorry, I should have included the close() call in the unit test I provided.  (Would you modify that too please?)
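
A small Java sketch of why the close() call matters, assuming a stream backed by a temporary file. The class SelfCleaningStream and the "reread" file prefix are hypothetical and do not reflect how the attached class is implemented; the point is only that try-with-resources guarantees the cleanup runs.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical wrapper: closing the stream also deletes its backing temp file. */
final class SelfCleaningStream extends FilterInputStream {
    private final Path file;

    SelfCleaningStream(Path file) throws IOException {
        super(Files.newInputStream(file));
        this.file = file;
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            Files.deleteIfExists(file);   // runs even if closing the stream fails
        }
    }

    /** Demo: returns true if the temp file is gone after try-with-resources. */
    static boolean tempFileRemovedAfterUse() {
        try {
            Path tmp = Files.createTempFile("reread", ".tmp");
            Files.write(tmp, new byte[] {1, 2, 3});
            try (InputStream in = new SelfCleaningStream(tmp)) {
                while (in.read() != -1) {
                    // consume the stream
                }
            }
            return !Files.exists(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the stream is never closed, the temporary file stays on disk, which is the disk-filling risk described above.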

- Keith



[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530822 ] 

Keith R. Bennett commented on TIKA-35:
--------------------------------------

Rida -

The big question is: do we support the ability of parser implementations to make multiple passes over a stream?  If so, then we need to incorporate this cleanly into the architectural design.  Possible solutions are:

1) Save the contents of the stream during the first pass.  Or, if the stream supports it, use mark() and reset().
2) Pass to the Parsers a URL instead of an InputStream so that we can create a stream multiple times.  This is simpler, but runs the risk of the resource changing between stream instantiations, though.
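
Option 2 could look roughly like the following Java sketch, in which the parser receives something that can open a fresh stream for each pass instead of a single InputStream. The StreamSource interface and the TwoPassSketch class are hypothetical names, not part of Tika; with a java.net.URL the source could be written as url::openStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class TwoPassSketch {
    /** Hypothetical abstraction: something that can open a fresh stream on demand. */
    interface StreamSource {
        InputStream open() throws IOException;
    }

    /** Opens the source once per pass and returns the byte count seen in each. */
    static long[] parseTwice(StreamSource source) {
        try {
            long[] sizes = new long[2];
            for (int pass = 0; pass < 2; pass++) {
                try (InputStream in = source.open()) {
                    while (in.read() != -1) {
                        sizes[pass]++;
                    }
                }
            }
            return sizes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[2048];
        long[] sizes = parseTwice(() -> new ByteArrayInputStream(data));
        System.out.println(sizes[0] + " " + sizes[1]);
    }
}
```

With an in-memory array both passes see the same data. With a URL or file, nothing prevents the resource from changing between the two open() calls, which is the risk noted in option 2.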

IMO it would not be a good idea to put a resource identifier in the Parser class, even temporarily -- this is the reverse direction from our goal of making the parsers stateless.

Instead, we could start discussing (or should I say continue to discuss?) how to support multiple passes cleanly in the architecture.

Thanks,
Keith

P.S. For anyone having trouble applying Rida's patch, passing the "-p5" option to patch worked for me.


[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531453 ] 

Jukka Zitting commented on TIKA-35:
-----------------------------------

Re: multiple passes; I'd rather achieve that with buffered streams and mark()/reset(). This keeps the required coupling with the client to a minimum.


[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531901 ] 

Keith R. Bennett commented on TIKA-35:
--------------------------------------

Rida -

You're welcome.  This class is functional, but not 100% robust or complete.  In particular, if you call rewind() before reaching end of stream on the first pass, only those bytes already read will be saved to the buffer (memory or disk).  So if the first user of the stream may not read the whole stream, I'd suggest forcing the initial pass to read the whole stream by doing something like:

// Instantiate it with your stream and a memory threshold:
RereadableInputStream stream = new RereadableInputStream(aStream, 1024 * 1024);

// Force reading entire stream to place it in storage for subsequent passes:
while (stream.read() != -1) {
    // empty loop
}

// Rewind the stream so that the next use of the stream will begin at the beginning of the stream,
// and read from the stored copy:
stream.rewind();

- Keith



[jira] Updated: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rida Benjelloun updated TIKA-35:
--------------------------------

    Attachment: tika35.patch


[jira] Updated: (TIKA-35) Extract MsOffice properties

Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith R. Bennett updated TIKA-35:
---------------------------------

    Attachment: RereadableInputStreamTest.java
                RereadableInputStream.java

Attached are a first pass at a rereadable stream class and a basic unit test that illustrates that it works (basically ;)).

This stream class wraps the document's input stream and saves its content as the wrapped stream is read.

It supports a memory threshold; if the total size read is no more than this threshold, the data is stored in a byte[], and subsequent rereads of the stream are served from a ByteArrayInputStream.  If the total size exceeds the threshold, the data is stored in a File, and subsequent passes read from a buffered FileInputStream.

If you place these files in src/main/java/org/apache/tika/utils and src/test/java/org/apache/tika/utils, you should be able to compile them and run the test.

Rereading the stream is accomplished by calling rewind().  Currently rewind() closes the input stream originally passed, but we may want to change that.
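
The behaviour described above can be sketched roughly as follows. This is not the attached RereadableInputStream; RewindableStreamSketch and all of its fields are hypothetical names, and details such as the temp-file prefix are assumptions made only for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch of a stream that saves what is read on the first pass.
class RewindableStreamSketch extends InputStream {
    private final int maxBytesInMemory;
    private InputStream current;        // stream currently being read
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private File spillFile;             // set once the threshold is exceeded
    private OutputStream spillOut;
    private boolean firstPass = true;

    RewindableStreamSketch(InputStream source, int maxBytesInMemory) {
        this.current = source;
        this.maxBytesInMemory = maxBytesInMemory;
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (firstPass && b != -1) {
            save(b);                    // only the first pass is recorded
        }
        return b;
    }

    private void save(int b) throws IOException {
        if (spillOut == null && memory.size() >= maxBytesInMemory) {
            // One more byte would exceed the threshold: move to a temp file.
            spillFile = File.createTempFile("rewindable-", ".tmp");
            spillFile.deleteOnExit();
            spillOut = new BufferedOutputStream(new FileOutputStream(spillFile));
            memory.writeTo(spillOut);
            memory = null;
        }
        if (spillOut != null) {
            spillOut.write(b);
        } else {
            memory.write(b);
        }
    }

    // Closes the stream being read (the original source on the first pass)
    // and restarts from the saved copy. Only bytes already read are replayed.
    public void rewind() throws IOException {
        current.close();
        if (firstPass) {
            firstPass = false;
            if (spillOut != null) {
                spillOut.close();       // flush saved bytes to disk
            }
        }
        current = (spillFile != null)
                ? new BufferedInputStream(new FileInputStream(spillFile))
                : new ByteArrayInputStream(memory.toByteArray());
    }

    @Override
    public void close() throws IOException {
        current.close();
        if (spillOut != null) {
            spillOut.close();
        }
        if (spillFile != null) {
            spillFile.delete();
        }
    }
}
```

As in the description, rewind() here closes the stream originally passed in. Keeping that stream open would need an extra flag and a way to continue saving after a rewind, which this sketch leaves out.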




[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531981 ] 

Rida Benjelloun commented on TIKA-35:
-------------------------------------

Keith - 
I noted this problem. I will force reading the entire stream before calling the rewind() method.
Thanks for the suggestion.



[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532175 ] 

Keith R. Bennett commented on TIKA-35:
--------------------------------------

Rida -

I saw your use of RereadableInputStream in MSExtractor.  The instance you are creating is passed to extractText() as an InputStream, not a RereadableInputStream, so the rereading functionality is never used.  Therefore, the original input stream could have been used instead.
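
To illustrate the point, a second pass only sees data if something repositions the stream between the two consumers. The sketch below uses java.io.ByteArrayInputStream and its reset() method as a stand-in for RereadableInputStream and rewind(); TwoPassSketch, consume() and parseTwice() are hypothetical names, and the consumers merely count bytes where a real parser would extract properties or text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration; ByteArrayInputStream.reset() stands in for rewind().
class TwoPassSketch {

    // Stand-in for a parser: reads the stream to its end and counts the bytes.
    static int consume(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // Runs two passes over the same data and returns the byte count of each.
    static int[] parseTwice(byte[] document) throws IOException {
        ByteArrayInputStream stream = new ByteArrayInputStream(document);
        int first = consume(stream);    // first pass, e.g. properties
        stream.reset();                 // without this, the second pass reads nothing
        int second = consume(stream);   // second pass, e.g. full text
        return new int[] { first, second };
    }
}
```

The caller that owns the stream has to keep a reference of the rewindable type and do the repositioning itself; a method that only receives an InputStream cannot.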

Can you explain the original problem in detail?

Thanks.




[jira] Assigned: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rida Benjelloun reassigned TIKA-35:
-----------------------------------

    Assignee: Rida Benjelloun


[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530825 ] 

Rida Benjelloun commented on TIKA-35:
-------------------------------------

Hi Keith,
I like the idea of saving the content of the stream during the first pass.
Thanks



[jira] Closed: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rida Benjelloun closed TIKA-35.
-------------------------------



[jira] Commented: (TIKA-35) Extract MsOffice properties

Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531611 ] 

Keith R. Bennett commented on TIKA-35:
--------------------------------------

Rida, have you changed the code to copy an MS Word document's input stream into memory?  That would work well for 99% of input documents, but a really large one could bring down the JVM.



[jira] Resolved: (TIKA-35) Extract MsOffice properties

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rida Benjelloun resolved TIKA-35.
---------------------------------

    Resolution: Fixed

SVN commit
