Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/02/19 19:08:48 UTC

Re-using a TikaStream

If I finish parsing a TikaStream, can I re-use the stream (before it is closed)?  I know you said that there is some magic behind the scenes where it spools it to a file.  Can I just call reset() to start from the beginning?

Peter


Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>

Re: Re-using a TikaStream

Posted by Nick Burch <ap...@gagravarr.org>.

On Mon, 1 Mar 2021, Tim Allison wrote:
> detectors should return the stream reset to the beginning.

I agree - needs to be ready for the parser to then process

> Parsers, IIRC, should return the stream fully(?) read but not closed.

Not always - if the parser wanted a File then it may not have touched the 
stream.

Equally if the parser can't handle the file (eg it starts reading, finds a 
version number that indicates it isn't able to handle it and gives up), 
then the stream won't have been fully read.

Nick

RE: Re-using a TikaStream

Posted by Nick Burch <ap...@gagravarr.org>.

On Mon, 1 Mar 2021, Peter Kronenberg wrote:
> But the issue is that different parsers return the stream in different 
> states.  Sometimes the stream is all used up (although not closed). And 
> other times, the stream has been re-set to the beginning where it can be 
> re-used.  Is this expected behavior?

If the Parser actually wants a File, and triggers the stream to be spooled 
to disk, then if you supplied a TikaInputStream the stream will still be 
sat at the start of the file. That's because it was read once, reset for 
use again, but the stream was then never touched, just the backing file 
used

If the Parser really wanted a stream, it will likely read most/all of the 
stream, so the stream should be positioned at the end (or perhaps close to 
it). Depending on how you constructed the stream, the stream class etc, it 
may or may not be rewindable / resettable for another subsequent read

Nick
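
For a plain java.io stream, "rewindable" comes down to mark/reset support. The following is a minimal JDK-only sketch of that check, not Tika code; the class name MarkResetSketch and its helper methods are made up for illustration. BufferedInputStream drops the mark once the read limit is reached, so the sketch passes a limit one byte larger than the data.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetSketch {

    /** Reads the stream to the end and returns the number of bytes seen. */
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    /**
     * Drains the stream, rewinds it with reset(), and drains it again.
     * Fails fast if the stream does not advertise mark/reset support.
     */
    static long[] readTwice(InputStream in, int readLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream does not support mark/reset");
        }
        in.mark(readLimit);      // remember the current position
        long first = drain(in);  // a "parser" consuming the whole stream
        in.reset();              // rewind to the marked position
        long second = drain(in);
        return new long[] {first, second};
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000];
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            long[] counts = readTwice(in, data.length + 1);
            System.out.println(counts[0] + " / " + counts[1]);
        }
    }
}
```

The cost is memory: the buffered stream has to hold everything read since mark(), which is why this only suits streams known to be small.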

RE: Re-using a TikaStream

Posted by Nick Burch <ap...@gagravarr.org>.

On Fri, 26 Feb 2021, Peter Kronenberg wrote:
> For most audio files, using the AudioParser, the buffer is still at the 
> beginning.  Even though there is no text extraction, I would think that 
> Tika still needs to read through the stream. The MP3Parser consumes the 
> stream, but the MP4Parser does not

IIRC the MP4 parsing library we use needs a File not a Stream, so we have 
to spool everything to disk

> The OCR parser also leaves the pointer at the beginning.  It definitely 
> consumes the stream, so it must be resetting it.

OCR needs a file to call out to Tesseract with, so has to spool the stream 
to disk

> So what is going on?  And now I get back to my original question, which 
> is, what is the best way to consistently be able to re-use the stream?

Forcing Tika to spool to disk is probably the only way to be sure, assuming 
you don't have enough memory to always buffer everything in RAM

Nick
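
The general idea of spooling can be sketched with the JDK alone: copy the incoming stream to a temp file once, then open a fresh stream over that file for every pass. This is only an analogue of what spooling buys you, not how Tika implements it; SpoolToDiskSketch and spool() are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolToDiskSketch {

    /** Copies the whole stream into a new temp file and returns its path. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".bin");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "example payload".getBytes(StandardCharsets.UTF_8);
        Path spooled = spool(new ByteArrayInputStream(data));
        try {
            // Each pass gets its own stream, so no pass depends on
            // where the previous consumer left the read position.
            for (int pass = 1; pass <= 2; pass++) {
                try (InputStream in = Files.newInputStream(spooled)) {
                    System.out.println("pass " + pass + ": " + in.readAllBytes().length + " bytes");
                }
            }
        } finally {
            Files.deleteIfExists(spooled); // the caller owns temp-file cleanup
        }
    }
}
```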

RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

That’s not what I’m seeing.  The AudioParser returns the stream at the beginning.  Maybe it’s because there was nothing to parse.  It just returns metadata.  But the MP4Parser returns the stream fully consumed, even though, again, it only returns meta-data.

Since right now, I’m dealing with audio and video, I need consistent behavior.   What I’m doing is running it through Tika first, and then based on the file type, doing additional processing.

I’m going to try using the RereadableInputStream

From: Tim Allison <ta...@apache.org>
Sent: Monday, March 1, 2021 10:31 AM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org; lfcnassif@gmail.com
Subject: Re: Re-using a TikaStream

detectors should return the stream reset to the beginning.

Parsers, IIRC, should return the stream fully(?) read but not closed.

On Mon, Mar 1, 2021 at 10:29 AM Tim Allison <ta...@apache.org> wrote:
Reusing streams after parsing hasn't been something I've done before...

This is not expected behavior.  Parsers should all behave the same.

On Mon, Mar 1, 2021 at 10:24 AM Peter Kronenberg <pe...@torch.ai> wrote:
After more testing, it seems that it has nothing to do with TikaInputStream.  I just passed in a BufferedInputStream to the parsers.  I see that the first thing the AutoDetectParser does is to convert it to a TikaInputStream.  So maybe TIS is being leveraged at a lower level, but there is no reason for me to use the TIS at my level.
But the issue is that different parsers return the stream in different states.  Sometimes the stream is all used up (although not closed). And other times, the stream has been re-set to the beginning where it can be re-used.  Is this expected behavior?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 26, 2021 10:03 PM
To: tallison@apache.org
Cc: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

But as I said, this doesn’t seem to work with all parsers.  So let’s say I pass in an MP4 file which uses the MP4Parser and then I want to re-use the stream afterward.  How can I guarantee consistent behavior, no matter which parser gets used?

From: Tim Allison <ta...@apache.org>
Sent: Friday, February 26, 2021 3:17 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org; lfcnassif@gmail.com
Subject: Re: Re-using a TikaStream

The stream.available() call comes from ProxyInputStream.  We don't modify that in TikaInputStream...maybe we should.

TikaInputStream wraps an incoming InputStream in a BufferedInputStream if it doesn't already support mark/reset (markSupported()).

So, as long as you're happy with the performance and potential limitations of BufferedInputStream, go with TikaInputStream.

Note that some parsers have to spool to disk.  TikaInputStream takes care of this for you.
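
Because available() is only an estimate of what can be read without blocking, it is a shaky way to tell how much of a stream a parser consumed. A hedged, JDK-only alternative is to wrap the stream in a counting filter before handing it over. PositionProbeSketch and CountingInputStream below are hypothetical names, and the counter deliberately ignores skip() and reset(), so it reports bytes read rather than a true position.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PositionProbeSketch {

    /** Counts bytes handed out by read(); skip() and reset() are not tracked. */
    static final class CountingInputStream extends FilterInputStream {
        private long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = super.read(b, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }

        long bytesRead() {
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];
        try (CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(data))) {
            in.read(new byte[4]); // stand-in for a consumer that stops early
            System.out.println("consumed " + in.bytesRead() + " of " + data.length + " bytes");
        }
    }
}
```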

On Fri, Feb 26, 2021 at 1:01 PM Peter Kronenberg <pe...@torch.ai> wrote:
I think I figured this out.  It seems to depend on what parser is used.  Not sure if this just has to do with inconsistent implementations, or there is some reason behind it.

For most audio files, using the AudioParser, the buffer is still at the beginning.  Even though there is no text extraction, I would think that Tika still needs to read through the stream.
The MP3Parser consumes the stream, but the MP4Parser does not

The OCR parser also leaves the pointer at the beginning.  It definitely consumes the stream, so it must be resetting it.

So what is going on?  And now I get back to my original question, which is, what is the best way to consistently be able to re-use the stream?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 26, 2021 12:18 PM
To: user@tika.apache.org; tallison@apache.org
Cc: lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

So is this guaranteed, expected behavior?

With a BufferedInputStream – I expect this

try (BufferedInputStream stream = new BufferedInputStream(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s%n", stream.available());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s%n", stream.available());
}

before - bytes available: 10546620
after - bytes available: 0

But with a TikaInputStream, I get this

Note that I’m purposely creating a FileInputStream first in order to hide the file information from the TikaInputStream, since in my normal use case, I’m dealing with a regular InputStream, not reading from a file.

try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
}

before - bytes available: 10546620, position: 0
after - bytes available: 10546620, position: 0

From: Peter Kronenberg <pe...@torch.ai>
Sent: Thursday, February 25, 2021 11:28 AM
To: user@tika.apache.org; tallison@apache.org
Cc: lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

Or reading from the cloud, either Google or AWS, in which case I also get a stream.   I know what the file name is, but can’t really use it

From: Peter Kronenberg <pe...@torch.ai>
Sent: Thursday, February 25, 2021 11:19 AM
To: tallison@apache.org
Cc: lfcnassif@gmail.com; user@tika.apache.org
Subject: RE: Re-using a TikaStream

With a stream.  I am reading arbitrary streams and one of the goals is to figure out what it is. So there is no file backing it.

From: Tim Allison <ta...@apache.org>
Sent: Thursday, February 25, 2021 11:11 AM
To: Peter Kronenberg <pe...@torch.ai>
Cc: lfcnassif@gmail.com; user@tika.apache.org
Subject: Re: Re-using a TikaStream

Are you initializing w a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <pe...@torch.ai> wrote:
But how is TikaInputStream allowing me to re-use the stream without me doing anything special?   Is it automatically spooling to disk as needed?

I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for the most reasonable solution.  I don’t know how big the streams are that I’ll be processing.  Obviously, if they’re big, then keeping them in memory is not reasonable and disk is the only option.  But for smaller streams, if it can do it all in memory, that’s obviously better.  And for my use case, I don’t *always* have to re-read the stream.
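
The "memory for small streams, disk for large ones" trade-off can be sketched with the JDK alone. This is an illustration of the idea under stated assumptions, not Tika's implementation: ThresholdBufferSketch, Reopenable and buffer() are hypothetical names, and the sketch needs Java 11 or later for readNBytes(int).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ThresholdBufferSketch {

    /** Hands out a fresh stream over the same bytes on every call. */
    interface Reopenable {
        InputStream open() throws IOException;
    }

    /**
     * Keeps the data in memory if it fits within maxBytesInMemory,
     * otherwise spills all of it to a temp file in a single pass.
     * Deleting the temp file is left to the caller in this sketch.
     */
    static Reopenable buffer(InputStream in, int maxBytesInMemory) throws IOException {
        // Read one byte past the limit to learn whether the stream is larger.
        byte[] head = in.readNBytes(maxBytesInMemory + 1);
        if (head.length <= maxBytesInMemory) {
            return () -> new ByteArrayInputStream(head);
        }
        Path tmp = Files.createTempFile("spill-", ".bin");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            out.write(head);    // what was already read
            in.transferTo(out); // the rest of the stream
        }
        return () -> Files.newInputStream(tmp);
    }

    public static void main(String[] args) throws IOException {
        Reopenable source = buffer(new ByteArrayInputStream(new byte[1000]), 100);
        for (int pass = 1; pass <= 2; pass++) {
            try (InputStream in = source.open()) {
                System.out.println("pass " + pass + ": " + in.readAllBytes().length + " bytes");
            }
        }
    }
}
```

Either way the source stream is read only once; the second and later passes come from the in-memory copy or the temp file.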

From: Tim Allison <ta...@apache.org>
Sent: Thursday, February 25, 2021 5:48 AM
To: user@tika.apache.org
Cc: lfcnassif@gmail.com
Subject: Re: Re-using a TikaStream

My $0.02 would be to use TikaInputStream because that gets a lot more use and is battle-tested.  Within the last year or so, we started using RereadableInputStream in one of the Microsoft format parsers so it is also getting some use now.

If you absolutely can't afford to spool to disk, then give RereadableInputStream a try.

The inputstreamfactories, in my mind, are somewhat work-arounds for other use cases, e.g. retrying/batch etc.

On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <pe...@torch.ai> wrote:
So this might be moot, because it seems that TikaInputStream is already doing some magic and I’m not sure how.
I was able to re-use the stream without doing anything special after a call to parse.  And in fact, I displayed stream.available() and stream.getPosition() before and after the call to parse, and the full stream was still available at position 0.  What is TikaInputStream doing to make this happen?

Just for some additional context, what I’m doing is running the file through Tika and then, depending on the file type, I want to do some additional non-tika processing.  I thought that once the Tika parse was done, the stream would be used up.

What is going on?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Tuesday, February 23, 2021 10:00 AM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

I just found the RereadableInputStream.  This looks more like what I was thinking.  Is there any reason not to use it?  What are the Tika best practices?  Pros/Cons of each approach?  If RereadableInputStream works as it’s supposed to, I’m not sure I see the advantage of InputStreamFactory

From: Peter Kronenberg <pe...@torch.ai>
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnassif@gmail.com
Cc: user@tika.apache.org
Subject: RE: Re-using a TikaStream

Oh ok.  I didn’t realize I needed to write my own class to implement it. I  was looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I misunderstood him, but it sounds like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save to disk.  I’m not starting from a File, but from a stream.  So if I need to read the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass

(try/catch removed for simplicity)

TikaInputStream tis = TikaInputStream.get(inputStream);
File file = tis.getFile();   // extra pass: spools the stream to a temp file
tis = TikaInputStream.get(new MyInputStreamFactory(file));
// first real pass
InputStream is = tis.getInputStreamFactory().getInputStream();
// second real pass

From: Luís Filipe Nassif <lf...@gmail.com>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Re-using a TikaStream

Something like:

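import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.io.InputStreamFactory;

// Opens a fresh stream over the same file each time it is asked.
class MyInputStreamFactory implements InputStreamFactory {

    private final File file;

    public MyInputStreamFactory(File file) {
        this.file = file;
    }

    @Override
    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}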

in your client code:

Parser parser = new AutoDetectParser();
TikaInputStream tis =  TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());

when you need to reuse the stream (into your parser):

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
   //(...)
   TikaInputStream tis = TikaInputStream.get(stream);
   if(tis.hasInputStreamFactory()){
        try(InputStream is = tis.getInputStreamFactory().getInputStream()){
              //consume the new stream
        }
   }else
       throw new IOException("not a reusable inputStream");
 }

Of course this is useful if you are not processing files, e.g. reading files from the cloud or sockets.

Regards,
Luis

On Mon, Feb 22, 2021 at 7:18 PM, Peter Kronenberg <pe...@torch.ai> wrote:
I sent this question late on Friday.  Sending it again.  Can you provide a little more information on how to use the InputStreamFactory?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 19, 2021 5:10 PM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

There appear to be 2 InputStreamFactory classes: in tika-server-core and tika-io.  The one in server.core is the only one with a concrete class.
I’m not quite sure I see how to use this.
Normally, I create a TikaInputStream with TikaInputStream.get(InputStream).  How do I create it from an InputStreamFactory?
TikaInputStream.getInputStreamFactory() only returns a factory if the TikaInputStream was created from a factory.
Is there a good example of how this is used?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 19, 2021 4:57 PM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

Thanks.  I thought that TikaInputStream already automatically saved to disk to allow re-reading.

From: Luís Filipe Nassif <lf...@gmail.com>
Sent: Friday, February 19, 2021 3:44 PM
To: user@tika.apache.org
Subject: Re: Re-using a TikaStream

You could call TikaInputStream.getPath() at the beginning of your parser; it will spool the stream to a temp file if it is not already file-based. After consuming the original InputStream, create a new one from that temp file.

If you are using 2.0.0-ALPHA, there is:

https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java

Use with the new methods from TikaInputStream:
public static TikaInputStream get(InputStreamFactory factory)
public InputStreamFactory getInputStreamFactory()

Hope this helps,
Luis

On Fri, Feb 19, 2021 at 4:09 PM, Peter Kronenberg <pe...@torch.ai> wrote:
If I finish parsing a TikaStream, can I re-use the stream (before it is closed)?  I know you said that there is some magic behind the scenes where it spools it to a file.  Can I just call reset() to start from the beginning?

Peter


Re: Re-using a TikaStream

Posted by Tim Allison <ta...@apache.org>.

detectors should return the stream reset to the beginning.

Parsers, IIRC, should return the stream fully(?) read but not closed.

On Mon, Mar 1, 2021 at 10:29 AM Tim Allison <ta...@apache.org> wrote:

> Reusing streams after parsing hasn't been something I've done before...
>
> This is not expected behavior.  Parsers should all behave the same.
>
> On Mon, Mar 1, 2021 at 10:24 AM Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
>> After more testing, it seems that it has nothing to do with
>> TikaInputStream.  I just passed in a BufferedInputStream to the parsers.  I
>> see that the first thing the AutoDetactParser does is to convert it to a
>> TikaInputStream.  So maybe TIS is being leveraged at a lower level, but
>> there no reason for me to use the TIS at my level.
>>
>> But the issue is that different parsers return the stream in different
>> states.  Sometimes the stream is all used up (although not closed). And
>> other times, the stream has been re-set to the beginning where it can be
>> re-used.  Is this expected behavior?
>>
>>
>>
>>
>>
>>
>>
>> *From:* Peter Kronenberg <pe...@torch.ai>
>> *Sent:* Friday, February 26, 2021 10:03 PM
>> *To:* tallison@apache.org
>> *Cc:* user@tika.apache.org; lfcnassif@gmail.com
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>>
>>
>> But as I said, this doesn’t seem to work with all parsers.    So let’s
>> say I pass in an MP4 file which uses the MP4Parser and then I want to
>> re-use the stream afterward.  How can I guarantee consistent beahvor, no
>> matter which paser gets used?
>>
>>
>>
>> *From:* Tim Allison <ta...@apache.org>
>> *Sent:* Friday, February 26, 2021 3:17 PM
>> *To:* Peter Kronenberg <pe...@torch.ai>
>> *Cc:* user@tika.apache.org; lfcnassif@gmail.com
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> The stream.available() call comes from ProxyInputStream.  We don't modify
>> that in TikaInputStream...maybe we should.
>>
>>
>>
>> TikaInputStream wraps an incoming InputStream in a BufferedInputStream if
>> it doesn't supportMark already.
>>
>>
>>
>> So, as long as you're happy with the performance and potential
>> limitations of BufferedInputStream, go with TikaInputStream.
>>
>>
>>
>> Note that some parsers have to spool to disk.  TikaInputStream takes care
>> of this for you.
>>
>>
>>
>> On Fri, Feb 26, 2021 at 1:01 PM Peter Kronenberg <
>> peter.kronenberg@torch.ai> wrote:
>>
>> I think I figured this out.  It seems to depend on what parser is used.
>> Not sure if this just has to do with inconsistent implementations, or there
>> is some reason behind it.
>>
>>
>>
>> For most audio files, using the AudioParser, the buffer is still at the
>> beginning.  Even though there is no text extraction, I would think that
>> Tika still needs to read through the stream.
>>
>> The MP3Parser consumes the stream, but the MP4Parser does not
>>
>>
>>
>> The OCR parser also leaves the pointer at the beginning.  It definitely
>> consumes the stream, so it must be resetting it.
>>
>>
>>
>> So what is going on.  And now I get back to my original question, which
>> is, what is the best way to consistently be able to re-use the stream?
>>
>>
>>
>> *From:* Peter Kronenberg <pe...@torch.ai>
>> *Sent:* Friday, February 26, 2021 12:18 PM
>> *To:* user@tika.apache.org; tallison@apache.org
>> *Cc:* lfcnassif@gmail.com
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>>
>>
>> So is this guaranteed, expected behavior?
>>
>>
>>
>> With a BufferedInputStream – I expect this
>>
>>
>>
>>
>> *try *(BufferedInputStream stream = *new *BufferedInputStream(*new *FileInputStream(file)))
>> {
>>     System.*out*.printf(*"before - bytes available: %s"*,
>> stream.available());
>>     parser.parse(stream, handler, metadata, parseContext);
>>     System.*out*.printf(*"after - bytes available: %s%n"*,
>> stream.available());
>> }
>>
>>
>>
>> before - bytes available: 10546620
>>
>> after - bytes available: 0
>>
>>
>>
>>
>>
>>
>>
>> But with a TikaInputStream, I get this
>>
>>
>>
>> Note that I’m purposing creating a FileInputStream first in order to hide the file information from the TikaInputStream, since in my normal use case, I’m dealing with a regular InputStream, not reading from a file
>>
>>
>> *try *(TikaInputStream stream = TikaInputStream.*get*(*new *FileInputStream(file))) {
>>     System.*out*.printf(*"before - bytes available: %s, position: %s%n"*, stream.available(), stream.getPosition());
>>     parser.parse(stream, handler, metadata, parseContext);
>>     System.*out*.printf(*"after - bytes available: %s, position: %s%n"*, stream.available(), stream.getPosition());
>> }
>>
>>
>>
>> before - bytes available: 10546620, position: 0
>>
>> after - bytes available: 10546620, position: 0
>>
>>
>>
>>
>>
>> *From:* Peter Kronenberg <pe...@torch.ai>
>> *Sent:* Thursday, February 25, 2021 11:28 AM
>> *To:* user@tika.apache.org; tallison@apache.org
>> *Cc:* lfcnassif@gmail.com
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>>
>>
>> Or reading from the cloud, either Google or AWS, in which case I also get
>> a stream.   I know what the file name is, but can’t really use it
>>
>>
>>
>> *From:* Peter Kronenberg <pe...@torch.ai>
>> *Sent:* Thursday, February 25, 2021 11:19 AM
>> *To:* tallison@apache.org
>> *Cc:* lfcnassif@gmail.com; user@tika.apache.org
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>>
>>
>> With a stream.  I am reading arbitrary streams and one of the goals is to
>> figure out what it is. So there is no file backing it.
>>
>>
>>
>> *From:* Tim Allison <ta...@apache.org>
>> *Sent:* Thursday, February 25, 2021 11:11 AM
>> *To:* Peter Kronenberg <pe...@torch.ai>
>> *Cc:* lfcnassif@gmail.com; user@tika.apache.org
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> Are you initializing w a file or a stream?
>>
>>
>>
>> On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <
>> peter.kronenberg@torch.ai> wrote:
>>
>> But how is TikaInputStream allowing me to re-use the stream without me
>> doing anything special?   Is it automatically spooling to disk as needed?
>>
>>
>>
>> I wouldn’t say that I can’t afford to spool to disk.  I’m just looking
>> for the most reasonable solution.  I don’t know how big the streams are
>> that I’ll be processing.  Obviously, if they’re big, the keeping them in
>> memory is not reasonable and disk is the only option.  But for smaller
>> streams, if it can do it all in memory, that’s obviously better.  And for
>> my use case, I don’t **always** have to re-read the stream.
>>
>>
>>
>> *From:* Tim Allison <ta...@apache.org>
>> *Sent:* Thursday, February 25, 2021 5:48 AM
>> *To:* user@tika.apache.org
>> *Cc:* lfcnassif@gmail.com
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> My $0.02 would be to use TikaInputStream because that gets a lot more use
>> and is battle-tested.  Within the last year or so, we started using
>> RereadableInputStream in one of the Microsoft format parsers so it is also
>> getting some use now.
>>
>>
>>
>> If you absolutely can't afford to spool to disk, then give
>> RereadableInputStream a try.
>>
>>
>>
>> The inputstreamfactories, in my mind, are somewhat work-arounds for other
>> use cases, e.g. retrying/batch etc.
>>
>>
>>
>> On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <
>> peter.kronenberg@torch.ai> wrote:
>>
>> So this might be moot, because it seems that TikaInputStream is already
>> doing some magic and I’m not sure how.
>>
>> I was able to re-use the stream without doing anything special after a
>> call to parse.  And in fact, I displayed stream.available() and
>> stream.position() before and after the call to parse, and the full stream
>> was still available at position 0.  What is TikaInputStream doing to make
>> this happen?
>>
>>
>>
>> Just for some additional context, what I’m doing is running the file
>> through Tika and then, depending on the file type, I want to do some
>> additional non-tika processing.  I thought that once the Tika parse was
>> done, the stream would be used up.
>>
>>
>>
>> What is going on?
>>
>>
>>
>>
>>
>> *From:* Peter Kronenberg <pe...@torch.ai>
>> *Sent:* Tuesday, February 23, 2021 10:00 AM
>> *To:* user@tika.apache.org; lfcnassif@gmail.com
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>>
>>
>> I just found the RereadableInputStream.  This looks more like what I was
>> thinking.  Is there any reason not to use it?  What are the Tika best
>> practices?  Pros/Cons of each approach?  If RereadableInputStream works as
>> it’s supposed to, I’m not sure I see the advantage of InputStreamFactory
>>
>>
>>
>> *From:* Peter Kronenberg <pe...@torch.ai>
>> *Sent:* Monday, February 22, 2021 8:30 PM
>> *To:* lfcnassif@gmail.com
>> *Cc:* user@tika.apache.org
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>>
>>
>> Oh ok.  I didn’t realize I needed to write my own class to implement it.
>> I  was looking for some sort of existing framework.
>>
>>
>>
>> What is the purpose of the 2 InputStreamFactory classes:
>>
>>
>>
>> I was re-reading some emails with Nick Burch back around Dec 22-23 and
>> maybe I mis-understood him, but it sounds like he was saying that
>> TiksInputStream was smart enough to automatically spool the stream to disk
>> to allow re-use.
>>
>>
>>
>> It seems to me that I need an extra pass through the data in order to
>> save to disk.  I’m not starting from a File, but from a stream.  So if I
>> need to read the stream twice, I really have to pass through the data 3
>> times, correct?
>>
>> Unless there is a way to save to disk during the first pass
>>
>>
>>
>> (try/catch removed for simplicity)
>>
>>
>>
>> tis = TikaInputSream.get(InputStream);
>>
>> file = tis.getFile();   ç extra pass
>>
>> tis =  TikaInputStream.get(new MyInputStreamFactory(file));
>>
>> // first real pass
>>
>> InputStream is = tis.getInputStreamFactory().getInputStream()
>>
>> // second real pass
>>
>> }
>>
>>
>>
>>
>>
>>
>>
>> *From:* Luís Filipe Nassif <lf...@gmail.com>
>> *Sent:* Monday, February 22, 2021 5:42 PM
>> *To:* Peter Kronenberg <pe...@torch.ai>
>> *Cc:* user@tika.apache.org
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> Something like:
>>
>>
>>
>> class MyInputStreamFactory implements InputStreamFactory{
>>
>>
>>
>>     private File file;
>>
>>
>>
>>     public  MyInputStreamFactory(File file){
>>
>>         this.file = file;
>>
>>     }
>>
>>
>>
>>     public InputStream getInputStream(){
>>
>>         return new FileInputStream(file);
>>
>>     }
>>
>> }
>>
>>
>>
>> in your client code:
>>
>>
>>
>> Parser parser = new AutoDetectParser();
>>
>> TikaInputStream tis =  TikaInputStream.get(new
>> MyInputStreamFactory(file));
>>
>> parser.parse(tis, new ToTextContentHandler(), new Metadata(), new
>> ParseContext());
>>
>>
>>
>> when you need to reuse the stream (into your parser):
>>
>>
>>
>> public void parse(InputStream stream, ContentHandler handler, Metadata
>> metadata, ParseContext context)
>>             throws IOException, SAXException, TikaException {
>>
>>    //(...)
>>
>>    TikaInputStream tis = TikaInputStream.get(stream);
>>
>>    if(tis.hasInputStreamFactory()){
>>
>>         try(InputStream is =
>> tis.getInputStreamFactory().getInputStream()){
>>
>>               //consume the new stream
>>
>>         }
>>
>>    }else
>>
>>        throw new IOException("not a reusable inputStream");
>>
>>  }
>>
>>
>>
>> Of course this is useful if you are not processing files, e.g. reading
>> files from the cloud or sockets.
>>
>>
>>
>> Regards,
>>
>> Luis
>>
>>
>>
>>
>>
>> Em seg., 22 de fev. de 2021 às 19:18, Peter Kronenberg <
>> peter.kronenberg@torch.ai> escreveu:
>>
>> I sent this question late on Friday.  Sending it again.  Can you provide
>> a little more information on how to use the InputStreamFactory?
>>
>>
>>
>> *From:* Peter Kronenberg <pe...@torch.ai>
>> *Sent:* Friday, February 19, 2021 5:10 PM
>> *To:* user@tika.apache.org; lfcnassif@gmail.com
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>>
>>
>> There appear to be 2 InputStreamFactory classes: in tika-server-core and
>> tika-io.  The one in server.core is the only one with a concrete class.
>>
>> I’m not quite sure I see how to use this.
>>
>> Normally, I create a TikaInputStream with
>> TikaInputStream.get(InputStream).  How do I create it from an
>> InputStreamFactory?
>>
>> TikaInputStream.getInputStreamFactory() only returns a factory if the
>> TikaInputStream was created from a factory.
>>
>> Is there a good example of how this is used
>>
>>
>>
>> *From:* Peter Kronenberg <pe...@torch.ai>
>> *Sent:* Friday, February 19, 2021 4:57 PM
>> *To:* user@tika.apache.org; lfcnassif@gmail.com
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> Thanks.  I thought that TikaInputStream already automatically saved to
>> disk to allow re-reading.
>>
>>
>>
>> *From:* Luís Filipe Nassif <lf...@gmail.com>
>> *Sent:* Friday, February 19, 2021 3:44 PM
>> *To:* user@tika.apache.org
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> You could call TikaInputStream.getPath() at the beginning of your parser,
>> it will spool to file if not file based. After consuming the original
>> inputStream, create a new one from the temp file created.
>>
>>
>>
>> If you are using 2.0.0-ALPHA, there is:
>>
>>
>>
>>
>> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
>>
>>
>>
>> Use with the new methods from TikaInputStream:
>>
>> public static TikaInputStream get(InputStreamFactory factory)
>>
>> public InputStreamFactory getInputStreamFactory()
>>
>>
>>
>> Hope this helps,
>>
>> Luis
>>
>>
>>
>> Em sex., 19 de fev. de 2021 às 16:09, Peter Kronenberg <
>> peter.kronenberg@torch.ai> escreveu:
>>
>> If I finish parsing a TikaStream, can I re-use the stream (before it is
>> closed)?  I know you said that there is some magic behind the scenes where
>> it spools it to a file.  Can I just call reset() to start from the
>> beginning?
>>
>>
>>
>> Peter
>>
>>
>>
>>
>>
>> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>>
>> *C: 703.887.5623*
>>
>> [image: Torch AI] <http://www.torch.ai/>
>>
>> 4303 W. 119th St., Leawood, KS 66209
>> <https://www.google.com/maps/search/4303+W.+119th+St.,+Leawood,+KS+66209?entry=gmail&source=g>
>> WWW.TORCH.AI <http://www.torch.ai/>
>>
>>
>>
>>
>>
>>

Re: Re-using a TikaStream

Posted by Tim Allison <ta...@apache.org>.

Reusing streams after parsing hasn't been something I've done before...

This is not expected behavior.  Parsers should all behave the same.

On Mon, Mar 1, 2021 at 10:24 AM Peter Kronenberg <pe...@torch.ai>
wrote:

> After more testing, it seems that it has nothing to do with
> TikaInputStream.  I just passed in a BufferedInputStream to the parsers.  I
> see that the first thing the AutoDetectParser does is to convert it to a
> TikaInputStream.  So maybe TIS is being leveraged at a lower level, but
> there is no reason for me to use the TIS at my level.
>
> But the issue is that different parsers return the stream in different
> states.  Sometimes the stream is all used up (although not closed). And
> other times, the stream has been re-set to the beginning where it can be
> re-used.  Is this expected behavior?
>
>
>
>
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Friday, February 26, 2021 10:03 PM
> *To:* tallison@apache.org
> *Cc:* user@tika.apache.org; lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> But as I said, this doesn’t seem to work with all parsers.    So let’s say
> I pass in an MP4 file which uses the MP4Parser and then I want to re-use
> the stream afterward.  How can I guarantee consistent behavior, no matter
> which parser gets used?
>
>
>
> *From:* Tim Allison <ta...@apache.org>
> *Sent:* Friday, February 26, 2021 3:17 PM
> *To:* Peter Kronenberg <pe...@torch.ai>
> *Cc:* user@tika.apache.org; lfcnassif@gmail.com
> *Subject:* Re: Re-using a TikaStream
>
>
>
> The stream.available() call comes from ProxyInputStream.  We don't modify
> that in TikaInputStream...maybe we should.
>
>
>
> TikaInputStream wraps an incoming InputStream in a BufferedInputStream if
> it doesn't already support mark/reset (that is, if markSupported() returns
> false).
>
>
>
> So, as long as you're happy with the performance and potential limitations
> of BufferedInputStream, go with TikaInputStream.
>
>
>
> Note that some parsers have to spool to disk.  TikaInputStream takes care
> of this for you.
>
>
>
> On Fri, Feb 26, 2021 at 1:01 PM Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
> I think I figured this out.  It seems to depend on what parser is used.
> Not sure if this just has to do with inconsistent implementations, or there
> is some reason behind it.
>
>
>
> For most audio files, using the AudioParser, the buffer is still at the
> beginning.  Even though there is no text extraction, I would think that
> Tika still needs to read through the stream.
>
> The MP3Parser consumes the stream, but the MP4Parser does not
>
>
>
> The OCR parser also leaves the pointer at the beginning.  It definitely
> consumes the stream, so it must be resetting it.
>
>
>
> So what is going on.  And now I get back to my original question, which
> is, what is the best way to consistently be able to re-use the stream?
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Friday, February 26, 2021 12:18 PM
> *To:* user@tika.apache.org; tallison@apache.org
> *Cc:* lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> So is this guaranteed, expected behavior?
>
>
>
> With a BufferedInputStream – I expect this
>
>
>
>
> try (BufferedInputStream stream = new BufferedInputStream(new FileInputStream(file))) {
>     System.out.printf("before - bytes available: %s%n", stream.available());
>     parser.parse(stream, handler, metadata, parseContext);
>     System.out.printf("after - bytes available: %s%n", stream.available());
> }
>
>
>
> before - bytes available: 10546620
>
> after - bytes available: 0
>
>
>
>
>
>
>
> But with a TikaInputStream, I get this
>
>
>
> Note that I’m purposely creating a FileInputStream first in order to hide the file information from the TikaInputStream, since in my normal use case, I’m dealing with a regular InputStream, not reading from a file.
>
>
> try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
>     System.out.printf("before - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
>     parser.parse(stream, handler, metadata, parseContext);
>     System.out.printf("after - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
> }
>
>
>
> before - bytes available: 10546620, position: 0
>
> after - bytes available: 10546620, position: 0
>
>
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Thursday, February 25, 2021 11:28 AM
> *To:* user@tika.apache.org; tallison@apache.org
> *Cc:* lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Or reading from the cloud, either Google or AWS, in which case I also get
> a stream.   I know what the file name is, but can’t really use it
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Thursday, February 25, 2021 11:19 AM
> *To:* tallison@apache.org
> *Cc:* lfcnassif@gmail.com; user@tika.apache.org
> *Subject:* RE: Re-using a TikaStream
>
>
>
> With a stream.  I am reading arbitrary streams and one of the goals is to
> figure out what it is. So there is no file backing it.
>
>
>
> *From:* Tim Allison <ta...@apache.org>
> *Sent:* Thursday, February 25, 2021 11:11 AM
> *To:* Peter Kronenberg <pe...@torch.ai>
> *Cc:* lfcnassif@gmail.com; user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> Are you initializing w a file or a stream?
>
>
>
> On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
> But how is TikaInputStream allowing me to re-use the stream without me
> doing anything special?   Is it automatically spooling to disk as needed?
>
>
>
> I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for
> the most reasonable solution.  I don’t know how big the streams are that
> I’ll be processing.  Obviously, if they’re big, then keeping them in memory
> is not reasonable and disk is the only option.  But for smaller streams, if
> it can do it all in memory, that’s obviously better.  And for my use case,
> I don’t **always** have to re-read the stream.
>
>
>
> *From:* Tim Allison <ta...@apache.org>
> *Sent:* Thursday, February 25, 2021 5:48 AM
> *To:* user@tika.apache.org
> *Cc:* lfcnassif@gmail.com
> *Subject:* Re: Re-using a TikaStream
>
>
>
> My $0.02 would be to use TikaInputStream because that gets a lot more use
> and is battle-tested.  Within the last year or so, we started using
> RereadableInputStream in one of the Microsoft format parsers so it is also
> getting some use now.
>
>
>
> If you absolutely can't afford to spool to disk, then give
> RereadableInputStream a try.
>
>
>
> The inputstreamfactories, in my mind, are somewhat work-arounds for other
> use cases, e.g. retrying/batch etc.
>
>
>
> On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
> So this might be moot, because it seems that TikaInputStream is already
> doing some magic and I’m not sure how.
>
> I was able to re-use the stream without doing anything special after a
> call to parse.  And in fact, I displayed stream.available() and
> stream.getPosition() before and after the call to parse, and the full stream
> was still available at position 0.  What is TikaInputStream doing to make
> this happen?
>
>
>
> Just for some additional context, what I’m doing is running the file
> through Tika and then, depending on the file type, I want to do some
> additional non-tika processing.  I thought that once the Tika parse was
> done, the stream would be used up.
>
>
>
> What is going on?
>
>
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Tuesday, February 23, 2021 10:00 AM
> *To:* user@tika.apache.org; lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> I just found the RereadableInputStream.  This looks more like what I was
> thinking.  Is there any reason not to use it?  What are the Tika best
> practices?  Pros/Cons of each approach?  If RereadableInputStream works as
> it’s supposed to, I’m not sure I see the advantage of InputStreamFactory
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Monday, February 22, 2021 8:30 PM
> *To:* lfcnassif@gmail.com
> *Cc:* user@tika.apache.org
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Oh ok.  I didn’t realize I needed to write my own class to implement it. I
>  was looking for some sort of existing framework.
>
>
>
> What is the purpose of the 2 InputStreamFactory classes:
>
>
>
> I was re-reading some emails with Nick Burch back around Dec 22-23 and
> maybe I mis-understood him, but it sounds like he was saying that
> TikaInputStream was smart enough to automatically spool the stream to disk
> to allow re-use.
>
>
>
> It seems to me that I need an extra pass through the data in order to save
> to disk.  I’m not starting from a File, but from a stream.  So if I need to
> read the stream twice, I really have to pass through the data 3 times,
> correct?
>
> Unless there is a way to save to disk during the first pass
>
>
>
> (try/catch removed for simplicity)
>
>
>
> tis = TikaInputStream.get(InputStream);
>
> file = tis.getFile();   <== extra pass
>
> tis =  TikaInputStream.get(new MyInputStreamFactory(file));
>
> // first real pass
>
> InputStream is = tis.getInputStreamFactory().getInputStream()
>
> // second real pass
>
> }
>
>
>
>
>
>
>
> *From:* Luís Filipe Nassif <lf...@gmail.com>
> *Sent:* Monday, February 22, 2021 5:42 PM
> *To:* Peter Kronenberg <pe...@torch.ai>
> *Cc:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> Something like:
>
>
>
> class MyInputStreamFactory implements InputStreamFactory{
>
>
>
>     private File file;
>
>
>
>     public  MyInputStreamFactory(File file){
>
>         this.file = file;
>
>     }
>
>
>
>     public InputStream getInputStream(){
>
>         return new FileInputStream(file);
>
>     }
>
> }
>
>
>
> in your client code:
>
>
>
> Parser parser = new AutoDetectParser();
>
> TikaInputStream tis =  TikaInputStream.get(new MyInputStreamFactory(file));
>
> parser.parse(tis, new ToTextContentHandler(), new Metadata(), new
> ParseContext());
>
>
>
> when you need to reuse the stream (into your parser):
>
>
>
> public void parse(InputStream stream, ContentHandler handler, Metadata
> metadata, ParseContext context)
>             throws IOException, SAXException, TikaException {
>
>    //(...)
>
>    TikaInputStream tis = TikaInputStream.get(stream);
>
>    if(tis.hasInputStreamFactory()){
>
>         try(InputStream is = tis.getInputStreamFactory().getInputStream()){
>
>               //consume the new stream
>
>         }
>
>    }else
>
>        throw new IOException("not a reusable inputStream");
>
>  }
>
>
>
> Of course this is useful if you are not processing files, e.g. reading
> files from the cloud or sockets.
>
>
>
> Regards,
>
> Luis
>
>
>
>
>
> Em seg., 22 de fev. de 2021 às 19:18, Peter Kronenberg <
> peter.kronenberg@torch.ai> escreveu:
>
> I sent this question late on Friday.  Sending it again.  Can you provide a
> little more information on how to use the InputStreamFactory?
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Friday, February 19, 2021 5:10 PM
> *To:* user@tika.apache.org; lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> There appear to be 2 InputStreamFactory classes: in tika-server-core and
> tika-io.  The one in server.core is the only one with a concrete class.
>
> I’m not quite sure I see how to use this.
>
> Normally, I create a TikaInputStream with
> TikaInputStream.get(InputStream).  How do I create it from an
> InputStreamFactory?
>
> TikaInputStream.getInputStreamFactory() only returns a factory if the
> TikaInputStream was created from a factory.
>
> Is there a good example of how this is used
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Friday, February 19, 2021 4:57 PM
> *To:* user@tika.apache.org; lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Thanks.  I thought that TikaInputStream already automatically saved to
> disk to allow re-reading.
>
>
>
> *From:* Luís Filipe Nassif <lf...@gmail.com>
> *Sent:* Friday, February 19, 2021 3:44 PM
> *To:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> You could call TikaInputStream.getPath() at the beginning of your parser,
> it will spool to file if not file based. After consuming the original
> inputStream, create a new one from the temp file created.
>
>
>
> If you are using 2.0.0-ALPHA, there is:
>
>
>
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
>
>
>
> Use with the new methods from TikaInputStream:
>
> public static TikaInputStream get(InputStreamFactory factory)
>
> public InputStreamFactory getInputStreamFactory()
>
>
>
> Hope this helps,
>
> Luis
>
>
>
> Em sex., 19 de fev. de 2021 às 16:09, Peter Kronenberg <
> peter.kronenberg@torch.ai> escreveu:
>
> If I finish parsing a TikaStream, can I re-use the stream (before it is
> closed)?  I know you said that there is some magic behind the scenes where
> it spools it to a file.  Can I just call reset() to start from the
> beginning?
>
>
>
> Peter
>
>
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
> *C: 703.887.5623*
>
> [image: Torch AI] <http://www.torch.ai/>
>
> 4303 W. 119th St., Leawood, KS 66209
> <https://www.google.com/maps/search/4303+W.+119th+St.,+Leawood,+KS+66209?entry=gmail&source=g>
> WWW.TORCH.AI <http://www.torch.ai/>
>
>
>
>
>
>

RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

After more testing, it seems that it has nothing to do with TikaInputStream.  I just passed in a BufferedInputStream to the parsers.  I see that the first thing the AutoDetectParser does is to convert it to a TikaInputStream.  So maybe TIS is being leveraged at a lower level, but there is no reason for me to use the TIS at my level.
But the issue is that different parsers return the stream in different states.  Sometimes the stream is all used up (although not closed). And other times, the stream has been re-set to the beginning where it can be re-used.  Is this expected behavior?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 26, 2021 10:03 PM
To: tallison@apache.org
Cc: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

But as I said, this doesn't seem to work with all parsers.  So let's say I pass in an MP4 file which uses the MP4Parser and then I want to re-use the stream afterward.  How can I guarantee consistent behavior, no matter which parser gets used?

From: Tim Allison <ta...@apache.org>>
Sent: Friday, February 26, 2021 3:17 PM
To: Peter Kronenberg <pe...@torch.ai>>
Cc: user@tika.apache.org<ma...@tika.apache.org>; lfcnassif@gmail.com<ma...@gmail.com>
Subject: Re: Re-using a TikaStream

The stream.available() call comes from ProxyInputStream.  We don't modify that in TikaInputStream...maybe we should.

TikaInputStream wraps an incoming InputStream in a BufferedInputStream if it doesn't already support mark/reset (that is, if markSupported() returns false).

So, as long as you're happy with the performance and potential limitations of BufferedInputStream, go with TikaInputStream.
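
For readers following along, here is a minimal sketch of the mark/reset behavior that BufferedInputStream provides, using only the JDK. It is not Tika code: the class name MarkResetSketch and its methods are made up for illustration, and it only assumes that the wrapping works roughly as described above. The main limitation to keep in mind is that reset() is only guaranteed to work while no more than the readlimit passed to mark() has been read.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of Tika.
final class MarkResetSketch {

    /** Peeks at the first headerLength bytes, rewinds, then reads the whole stream. */
    static String peekThenReadAll(InputStream raw, int headerLength) throws IOException {
        // Wrap only when the incoming stream cannot mark/reset on its own.
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);

        in.mark(headerLength);                        // remember the current position
        byte[] header = in.readNBytes(headerLength);  // stays within the read limit
        in.reset();                                   // back to the marked position

        byte[] everything = in.readAllBytes();        // the header bytes are seen again
        return new String(header, StandardCharsets.US_ASCII)
                + "|" + new String(everything, StandardCharsets.US_ASCII);
    }

    static String demo() {
        byte[] data = "%PDF-1.7 body".getBytes(StandardCharsets.US_ASCII);
        // Simulate a plain stream that does not support mark/reset.
        InputStream plain = new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
        try {
            return peekThenReadAll(plain, 4);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This covers the "peek at a header, then hand the stream on" case. It does not help once a consumer has read far past the mark, which is where spooling comes in.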

Note that some parsers have to spool to disk.  TikaInputStream takes care of this for you.

On Fri, Feb 26, 2021 at 1:01 PM Peter Kronenberg <pe...@torch.ai>> wrote:
I think I figured this out.  It seems to depend on what parser is used.  Not sure if this just has to do with inconsistent implementations, or there is some reason behind it.

For most audio files, using the AudioParser, the buffer is still at the beginning.  Even though there is no text extraction, I would think that Tika still needs to read through the stream.
The MP3Parser consumes the stream, but the MP4Parser does not

The OCR parser also leaves the pointer at the beginning.  It definitely consumes the stream, so it must be resetting it.

So what is going on.  And now I get back to my original question, which is, what is the best way to consistently be able to re-use the stream?

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Friday, February 26, 2021 12:18 PM
To: user@tika.apache.org<ma...@tika.apache.org>; tallison@apache.org<ma...@apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

So is this guaranteed, expected behavior?

With a BufferedInputStream - I expect this

try (BufferedInputStream stream = new BufferedInputStream(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s%n", stream.available());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s%n", stream.available());
}

before - bytes available: 10546620
after - bytes available: 0

But with a TikaInputStream, I get this

Note that I'm purposely creating a FileInputStream first in order to hide the file information from the TikaInputStream, since in my normal use case, I'm dealing with a regular InputStream, not reading from a file.

try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
}

before - bytes available: 10546620, position: 0
after - bytes available: 10546620, position: 0

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Thursday, February 25, 2021 11:28 AM
To: user@tika.apache.org<ma...@tika.apache.org>; tallison@apache.org<ma...@apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

Or reading from the cloud, either Google or AWS, in which case I also get a stream.   I know what the file name is, but can't really use it

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Thursday, February 25, 2021 11:19 AM
To: tallison@apache.org<ma...@apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>; user@tika.apache.org<ma...@tika.apache.org>
Subject: RE: Re-using a TikaStream

With a stream.  I am reading arbitrary streams and one of the goals is to figure out what it is. So there is no file backing it.

From: Tim Allison <ta...@apache.org>>
Sent: Thursday, February 25, 2021 11:11 AM
To: Peter Kronenberg <pe...@torch.ai>>
Cc: lfcnassif@gmail.com<ma...@gmail.com>; user@tika.apache.org<ma...@tika.apache.org>
Subject: Re: Re-using a TikaStream

Are you initializing w a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <pe...@torch.ai>> wrote:
But how is TikaInputStream allowing me to re-use the stream without me doing anything special?   Is it automatically spooling to disk as needed?

I wouldn't say that I can't afford to spool to disk.  I'm just looking for the most reasonable solution.  I don't know how big the streams are that I'll be processing.  Obviously, if they're big, then keeping them in memory is not reasonable and disk is the only option.  But for smaller streams, if it can do it all in memory, that's obviously better.  And for my use case, I don't *always* have to re-read the stream.
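
The "memory for small streams, disk for big ones" idea can be sketched with plain JDK calls. This is only an illustration under stated assumptions: ThresholdBuffer, its constructor arguments, and the temp-file prefix are hypothetical names, and this is not how TikaInputStream or RereadableInputStream are implemented. It reads up to a threshold into memory and spills to a temp file only when the stream turns out to be larger.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika.
final class ThresholdBuffer implements AutoCloseable {
    private final byte[] inMemory; // non-null when the content fit under the threshold
    private final Path onDisk;     // non-null when the content was spilled to a temp file

    ThresholdBuffer(InputStream source, int maxBytesInMemory) throws IOException {
        // Read one byte more than the threshold to learn whether the stream is larger.
        byte[] head = source.readNBytes(maxBytesInMemory + 1);
        if (head.length <= maxBytesInMemory) {
            inMemory = head;
            onDisk = null;
        } else {
            inMemory = null;
            onDisk = Files.createTempFile("spill-", ".bin");
            try (OutputStream out = Files.newOutputStream(onDisk)) {
                out.write(head);        // bytes already pulled from the source
                source.transferTo(out); // the remainder, copied straight to disk
            }
        }
    }

    boolean isInMemory() {
        return inMemory != null;
    }

    /** Each call returns a fresh stream positioned at the first byte. */
    InputStream open() throws IOException {
        return inMemory != null ? new ByteArrayInputStream(inMemory) : Files.newInputStream(onDisk);
    }

    @Override
    public void close() throws IOException {
        if (onDisk != null) {
            Files.deleteIfExists(onDisk);
        }
    }

    private static byte[] readAll(ThresholdBuffer buffer) throws IOException {
        try (InputStream in = buffer.open()) {
            return in.readAllBytes();
        }
    }

    static String demo() {
        byte[] hello = "hello".getBytes(StandardCharsets.US_ASCII);
        try (ThresholdBuffer small = new ThresholdBuffer(new ByteArrayInputStream(hello), 10);
             ThresholdBuffer large = new ThresholdBuffer(new ByteArrayInputStream(new byte[20]), 10)) {
            String text = new String(readAll(small), StandardCharsets.US_ASCII);
            String again = new String(readAll(small), StandardCharsets.US_ASCII);
            return small.isInMemory() + ":" + text + ":" + again
                    + " " + large.isInMemory() + ":" + readAll(large).length + ":" + readAll(large).length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is that the source is always read to the end up front, even when a second pass turns out not to be needed.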

From: Tim Allison <ta...@apache.org>>
Sent: Thursday, February 25, 2021 5:48 AM
To: user@tika.apache.org<ma...@tika.apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>
Subject: Re: Re-using a TikaStream

My $0.02 would be to use TikaInputStream because that gets a lot more use and is battle-tested.  Within the last year or so, we started using RereadableInputStream in one of the Microsoft format parsers so it is also getting some use now.

If you absolutely can't afford to spool to disk, then give RereadableInputStream a try.

The inputstreamfactories, in my mind, are somewhat work-arounds for other use cases, e.g. retrying/batch etc.

On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <pe...@torch.ai>> wrote:
So this might be moot, because it seems that TikaInputStream is already doing some magic and I'm not sure how.
I was able to re-use the stream without doing anything special after a call to parse.  And in fact, I displayed stream.available() and stream.getPosition() before and after the call to parse, and the full stream was still available at position 0.  What is TikaInputStream doing to make this happen?

Just for some additional context, what I'm doing is running the file through Tika and then, depending on the file type, I want to do some additional non-tika processing.  I thought that once the Tika parse was done, the stream would be used up.

What is going on?

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Tuesday, February 23, 2021 10:00 AM
To: user@tika.apache.org<ma...@tika.apache.org>; lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

I just found the RereadableInputStream.  This looks more like what I was thinking.  Is there any reason not to use it?  What are the Tika best practices?  Pros/Cons of each approach?  If RereadableInputStream works as it's supposed to, I'm not sure I see the advantage of InputStreamFactory

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnassif@gmail.com<ma...@gmail.com>
Cc: user@tika.apache.org<ma...@tika.apache.org>
Subject: RE: Re-using a TikaStream

Oh ok.  I didn't realize I needed to write my own class to implement it. I  was looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes:

I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I mis-understood him, but it sounds like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save to disk.  I'm not starting from a File, but from a stream.  So if I need to read the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass

(try/catch removed for simplicity)

tis = TikaInputStream.get(InputStream);
file = tis.getFile();   <== extra pass
tis =  TikaInputStream.get(new MyInputStreamFactory(file));
// first real pass
InputStream is = tis.getInputStreamFactory().getInputStream()
// second real pass
}
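
On the question of saving to disk during the first pass: one generic way to avoid the extra pass is a "tee" stream that copies every byte to a file while the first consumer reads it. The sketch below uses only the JDK. SpoolingTeeInputStream is a hypothetical name, not a Tika class, and the copy is only complete if the first consumer reads all the way to the end of the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika: copies bytes to 'copy' as they are read.
final class SpoolingTeeInputStream extends FilterInputStream {
    private final OutputStream copy;

    SpoolingTeeInputStream(InputStream in, OutputStream copy) {
        super(in);
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            copy.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Read instead of skipping so that skipped bytes still reach the copy.
        byte[] scratch = new byte[8192];
        long skipped = 0;
        while (skipped < n) {
            int r = read(scratch, 0, (int) Math.min(scratch.length, n - skipped));
            if (r < 0) {
                break;
            }
            skipped += r;
        }
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // rewinding would write the same bytes to the copy twice
    }

    @Override
    public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            copy.close();
        }
    }

    static String demo() {
        byte[] data = "first pass data".getBytes(StandardCharsets.US_ASCII);
        try {
            Path spool = Files.createTempFile("tee-", ".bin");
            try {
                String firstPass;
                try (InputStream in = new SpoolingTeeInputStream(
                        new ByteArrayInputStream(data), Files.newOutputStream(spool))) {
                    firstPass = new String(in.readAllBytes(), StandardCharsets.US_ASCII);
                }
                // Second pass comes from the spooled file, not from the original stream.
                String secondPass = new String(Files.readAllBytes(spool), StandardCharsets.US_ASCII);
                return firstPass + "|" + secondPass;
            } finally {
                Files.deleteIfExists(spool);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape the data is read from the source once and from disk once, so two logical passes cost two reads rather than three.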

From: Luís Filipe Nassif <lf...@gmail.com>>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <pe...@torch.ai>>
Cc: user@tika.apache.org<ma...@tika.apache.org>
Subject: Re: Re-using a TikaStream

Something like:

class MyInputStreamFactory implements InputStreamFactory{

    private File file;

    public  MyInputStreamFactory(File file){
        this.file = file;
    }

    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}

in your client code:

Parser parser = new AutoDetectParser();
TikaInputStream tis =  TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());

when you need to reuse the stream (into your parser):

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
   //(...)
   TikaInputStream tis = TikaInputStream.get(stream);
   if(tis.hasInputStreamFactory()){
        try(InputStream is = tis.getInputStreamFactory().getInputStream()){
              //consume the new stream
        }
   }else
       throw new IOException("not a reusable inputStream");
 }

Of course this is useful if you are not processing files, e.g. reading files from the cloud or sockets.
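
To make the factory idea concrete without depending on a particular Tika version, here is a JDK-only sketch of the same pattern: instead of rewinding one stream, each pass asks a factory for a brand new stream. StreamSource is a hypothetical stand-in that only mirrors the shape of InputStreamFactory as used above, and ReopenSketch and its methods are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration class, not part of Tika.
final class ReopenSketch {

    /** Stand-in for a stream factory: every call must return a new, unread stream. */
    @FunctionalInterface
    interface StreamSource {
        InputStream open() throws IOException;
    }

    /** First pass: counts every byte of a freshly opened stream. */
    static long countBytes(StreamSource source) throws IOException {
        try (InputStream in = source.open()) {
            byte[] buf = new byte[4096];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        }
    }

    /** Second pass: starts again from byte 0 because it opens its own stream. */
    static String firstLine(StreamSource source) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(source.open(), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    static String demo() {
        try {
            Path file = Files.createTempFile("reopen-", ".txt");
            try {
                Files.write(file, "title\nbody\n".getBytes(StandardCharsets.UTF_8));
                // The source could just as well re-issue a cloud download or reopen a socket.
                StreamSource source = () -> Files.newInputStream(file);
                return countBytes(source) + "|" + firstLine(source);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because no consumer depends on where a previous consumer left the stream, the result is the same no matter how much of the stream the first pass read.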

Regards,
Luis

Em seg., 22 de fev. de 2021 às 19:18, Peter Kronenberg <pe...@torch.ai>> escreveu:
I sent this question late on Friday.  Sending it again.  Can you provide a little more information on how to use the InputStreamFactory?

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Friday, February 19, 2021 5:10 PM
To: user@tika.apache.org<ma...@tika.apache.org>; lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

There appear to be 2 InputStreamFactory classes: in tika-server-core and tika-io.  The one in server.core is the only one with a concrete class.
I'm not quite sure I see how to use this.
Normally, I create a TikaInputStream with TikaInputStream.get(InputStream).  How do I create it from an InputStreamFactory?
TikaInputStream.getInputStreamFactory() only returns a factory if the TikaInputStream was created from a factory.
Is there a good example of how this is used

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Friday, February 19, 2021 4:57 PM
To: user@tika.apache.org<ma...@tika.apache.org>; lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

Thanks.  I thought that TikaInputStream already automatically saved to disk to allow re-reading.

From: Luís Filipe Nassif <lf...@gmail.com>>
Sent: Friday, February 19, 2021 3:44 PM
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: Re: Re-using a TikaStream

You could call TikaInputStream.getPath() at the beginning of your parser; it will spool to a file if the stream is not file-based. After consuming the original InputStream, create a new one from the temp file that was created.

If you are using 2.0.0-ALPHA, there is:

https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java

Use with the new methods from TikaInputStream:
public static TikaInputStream get(InputStreamFactory factory)
public InputStreamFactory getInputStreamFactory()

Hope this helps,
Luis

On Fri, Feb 19, 2021 at 4:09 PM, Peter Kronenberg <pe...@torch.ai> wrote:
If I finish parsing a TikaStream, can I re-use the stream (before it is closed)?  I know you said that there is some magic behind the scenes where it spools it to a file.  Can I just call reset() to start from the beginning?

Peter

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>

RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

But as I said, this doesn’t seem to work with all parsers.  So let’s say I pass in an MP4 file, which uses the MP4Parser, and then I want to re-use the stream afterward.  How can I guarantee consistent behavior, no matter which parser gets used?
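
One parser-independent option, in the spirit of the spool-to-file suggestion made elsewhere in this thread, is to copy the incoming stream to a temporary file once and open a fresh stream for each consumer. The following is only a sketch using the JDK: SpooledSource and its methods are hypothetical names, not Tika API, and it assumes that always paying for one full copy to disk is acceptable.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of Tika): spools a stream to a temp file once,
// then hands out an independent stream for every consumer.
final class SpooledSource implements AutoCloseable {

    private final Path tempFile;

    private SpooledSource(Path tempFile) {
        this.tempFile = tempFile;
    }

    // The single pass over the original stream happens here.
    static SpooledSource spool(InputStream in) {
        try {
            Path tmp = Files.createTempFile("spooled-", ".bin");
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return new SpooledSource(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Each call starts at byte 0, whatever earlier consumers did.
    InputStream open() {
        try {
            return Files.newInputStream(tempFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a stream to the end and closes it (readAllBytes needs Java 9+).
    static String readAll(InputStream in) {
        try (InputStream s = in) {
            return new String(s.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes the temp file; call only after all opened streams are closed.
    @Override
    public void close() {
        try {
            Files.deleteIfExists(tempFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "stand-in for an MP4".getBytes(StandardCharsets.UTF_8);
        try (SpooledSource source = spool(new ByteArrayInputStream(data))) {
            String first = readAll(source.open());   // e.g. the Tika parse
            String second = readAll(source.open());  // e.g. the non-Tika step
            System.out.println(first.equals(second));
        }
    }
}
```

With this shape, the state in which a given parser leaves its stream no longer matters, because every consumer gets its own stream. The trade-off is that small inputs are written to disk even when they could have stayed in memory.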

From: Tim Allison <ta...@apache.org>
Sent: Friday, February 26, 2021 3:17 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org; lfcnassif@gmail.com
Subject: Re: Re-using a TikaStream

The stream.available() call comes from ProxyInputStream.  We don't modify that in TikaInputStream...maybe we should.

TikaInputStream wraps an incoming InputStream in a BufferedInputStream if it doesn't already support mark/reset.

So, as long as you're happy with the performance and potential limitations of BufferedInputStream, go with TikaInputStream.

Note that some parsers have to spool to disk.  TikaInputStream takes care of this for you.

On Fri, Feb 26, 2021 at 1:01 PM Peter Kronenberg <pe...@torch.ai>> wrote:
I think I figured this out.  It seems to depend on what parser is used.  Not sure if this just has to do with inconsistent implementations, or there is some reason behind it.

For most audio files, using the AudioParser, the buffer is still at the beginning.  Even though there is no text extraction, I would think that Tika still needs to read through the stream.
The MP3Parser consumes the stream, but the MP4Parser does not

The OCR parser also leaves the pointer at the beginning.  It definitely consumes the stream, so it must be resetting it.

So what is going on?  And now I get back to my original question: what is the best way to consistently be able to re-use the stream?

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Friday, February 26, 2021 12:18 PM
To: user@tika.apache.org<ma...@tika.apache.org>; tallison@apache.org<ma...@apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

So is this guaranteed, expected behavior?

With a BufferedInputStream – I expect this

try (BufferedInputStream stream = new BufferedInputStream(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s%n", stream.available());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s%n", stream.available());
}

before - bytes available: 10546620
after - bytes available: 0

But with a TikaInputStream, I get this

Note that I’m purposely creating a FileInputStream first in order to hide the file information from the TikaInputStream, since in my normal use case I’m dealing with a regular InputStream, not reading from a file.

try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
}

before - bytes available: 10546620, position: 0
after - bytes available: 10546620, position: 0

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Thursday, February 25, 2021 11:28 AM
To: user@tika.apache.org<ma...@tika.apache.org>; tallison@apache.org<ma...@apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

Or reading from the cloud, either Google or AWS, in which case I also get a stream.   I know what the file name is, but can’t really use it

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Thursday, February 25, 2021 11:19 AM
To: tallison@apache.org<ma...@apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>; user@tika.apache.org<ma...@tika.apache.org>
Subject: RE: Re-using a TikaStream

With a stream.  I am reading arbitrary streams and one of the goals is to figure out what it is. So there is no file backing it.

From: Tim Allison <ta...@apache.org>>
Sent: Thursday, February 25, 2021 11:11 AM
To: Peter Kronenberg <pe...@torch.ai>>
Cc: lfcnassif@gmail.com<ma...@gmail.com>; user@tika.apache.org<ma...@tika.apache.org>
Subject: Re: Re-using a TikaStream

Are you initializing with a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <pe...@torch.ai>> wrote:
But how is TikaInputStream allowing me to re-use the stream without me doing anything special?   Is it automatically spooling to disk as needed?

I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for the most reasonable solution.  I don’t know how big the streams are that I’ll be processing.  Obviously, if they’re big, then keeping them in memory is not reasonable and disk is the only option.  But for smaller streams, if it can do it all in memory, that’s obviously better.  And for my use case, I don’t *always* have to re-read the stream.

From: Tim Allison <ta...@apache.org>>
Sent: Thursday, February 25, 2021 5:48 AM
To: user@tika.apache.org<ma...@tika.apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>
Subject: Re: Re-using a TikaStream

My $0.02 would be to use TikaInputStream because that gets a lot more use and is battle-tested.  Within the last year or so, we started using RereadableInputStream in one of the Microsoft format parsers so it is also getting some use now.

If you absolutely can't afford to spool to disk, then give RereadableInputStream a try.

The inputstreamfactories, in my mind, are somewhat work-arounds for other use cases, e.g. retrying/batch etc.

On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <pe...@torch.ai>> wrote:
So this might be moot, because it seems that TikaInputStream is already doing some magic and I’m not sure how.
I was able to re-use the stream without doing anything special after a call to parse.  And in fact, I displayed stream.available() and stream.getPosition() before and after the call to parse, and the full stream was still available at position 0.  What is TikaInputStream doing to make this happen?

Just for some additional context, what I’m doing is running the file through Tika and then, depending on the file type, I want to do some additional non-tika processing.  I thought that once the Tika parse was done, the stream would be used up.

What is going on?

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Tuesday, February 23, 2021 10:00 AM
To: user@tika.apache.org<ma...@tika.apache.org>; lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

I just found the RereadableInputStream.  This looks more like what I was thinking.  Is there any reason not to use it?  What are the Tika best practices?  Pros/Cons of each approach?  If RereadableInputStream works as it’s supposed to, I’m not sure I see the advantage of InputStreamFactory

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnassif@gmail.com<ma...@gmail.com>
Cc: user@tika.apache.org<ma...@tika.apache.org>
Subject: RE: Re-using a TikaStream

Oh ok.  I didn’t realize I needed to write my own class to implement it. I  was looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch from around Dec 22-23 and maybe I misunderstood him, but it sounds like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save to disk.  I’m not starting from a File, but from a stream.  So if I need to read the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass

(try/catch removed for simplicity)

tis = TikaInputStream.get(inputStream);
file = tis.getFile();   // <== extra pass
tis = TikaInputStream.get(new MyInputStreamFactory(file));
// first real pass
InputStream is = tis.getInputStreamFactory().getInputStream();
// second real pass

From: Luís Filipe Nassif <lf...@gmail.com>>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <pe...@torch.ai>>
Cc: user@tika.apache.org<ma...@tika.apache.org>
Subject: Re: Re-using a TikaStream

Something like:

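import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.io.InputStreamFactory;

// Hands out a fresh stream over the same file on every call.
class MyInputStreamFactory implements InputStreamFactory {

    private final File file;

    public MyInputStreamFactory(File file) {
        this.file = file;
    }

    // FileInputStream's constructor can throw, so declare IOException
    @Override
    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}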

in your client code:

Parser parser = new AutoDetectParser();
TikaInputStream tis =  TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());

when you need to reuse the stream (inside your parser):

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
   //(...)
   TikaInputStream tis = TikaInputStream.get(stream);
   if(tis.hasInputStreamFactory()){
        try(InputStream is = tis.getInputStreamFactory().getInputStream()){
              //consume the new stream
        }
   }else
       throw new IOException("not a reusable inputStream");
 }

Of course this is useful if you are not processing files, e.g. reading files from the cloud or sockets.

Regards,
Luis

On Mon, Feb 22, 2021 at 7:18 PM, Peter Kronenberg <pe...@torch.ai> wrote:
I sent this question late on Friday.  Sending it again.  Can you provide a little more information on how to use the InputStreamFactory?

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Friday, February 19, 2021 5:10 PM
To: user@tika.apache.org<ma...@tika.apache.org>; lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

There appear to be 2 InputStreamFactory classes: in tika-server-core and tika-io.  The one in server.core is the only one with a concrete class.
I’m not quite sure I see how to use this.
Normally, I create a TikaInputStream with TikaInputStream.get(InputStream).  How do I create it from an InputStreamFactory?
TikaInputStream.getInputStreamFactory() only returns a factory if the TikaInputStream was created from a factory.
Is there a good example of how this is used?

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Friday, February 19, 2021 4:57 PM
To: user@tika.apache.org<ma...@tika.apache.org>; lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

Thanks.  I thought that TikaInputStream already automatically saved to disk to allow re-reading.

From: Luís Filipe Nassif <lf...@gmail.com>>
Sent: Friday, February 19, 2021 3:44 PM
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: Re: Re-using a TikaStream

You could call TikaInputStream.getPath() at the beginning of your parser; it will spool to a file if the stream is not file-based. After consuming the original InputStream, create a new one from the temp file that was created.

If you are using 2.0.0-ALPHA, there is:

https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java

Use with the new methods from TikaInputStream:
public static TikaInputStream get(InputStreamFactory factory)
public InputStreamFactory getInputStreamFactory()

Hope this helps,
Luis

On Fri, Feb 19, 2021 at 4:09 PM, Peter Kronenberg <pe...@torch.ai> wrote:
If I finish parsing a TikaStream, can I re-use the stream (before it is closed)?  I know you said that there is some magic behind the scenes where it spools it to a file.  Can I just call reset() to start from the beginning?

Peter

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>

Re: Re-using a TikaStream

Posted by Tim Allison <ta...@apache.org>.

The stream.available() call comes from ProxyInputStream.  We don't modify
that in TikaInputStream...maybe we should.

TikaInputStream wraps an incoming InputStream in a BufferedInputStream if
it doesn't already support mark/reset (that is, if markSupported() returns false).

So, as long as you're happy with the performance and potential limitations
of BufferedInputStream, go with TikaInputStream.

Note that some parsers have to spool to disk.  TikaInputStream takes care
of this for you.
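
To make the mark/reset point concrete, here is a small JDK-only sketch of the wrapping idea described above: a stream that cannot mark is wrapped in a BufferedInputStream, marked, read to the end, and then reset. MarkResetSketch and its methods are hypothetical names for illustration and are not taken from the Tika source. The read limit passed to mark() is presumably one of the "potential limitations" mentioned above: once a consumer reads well past it, reset() is no longer guaranteed to succeed.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of mark/reset on a wrapped stream (not Tika code).
final class MarkResetSketch {

    // A stream that cannot mark/reset, standing in for a socket or cloud stream.
    static InputStream unmarkable(byte[] data) {
        ByteArrayInputStream src = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return src.read();
            }
        };
    }

    // Wrap only when the stream cannot already mark/reset.
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Marks the start, reads to the end, resets, and reads to the end again.
    // Only safe while the content fits within readLimit bytes.
    static String[] readTwice(InputStream raw, int readLimit) {
        try (InputStream in = ensureMarkSupported(raw)) {
            in.mark(readLimit);
            String first = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            in.reset();
            String second = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            return new String[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "small payload".getBytes(StandardCharsets.UTF_8);
        String[] passes = readTwice(unmarkable(data), 1024);
        System.out.println(passes[0].equals(passes[1]));
    }
}
```

This works for inputs that fit comfortably in the buffer. For large inputs, spooling to disk is the more dependable route.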

On Fri, Feb 26, 2021 at 1:01 PM Peter Kronenberg <pe...@torch.ai>
wrote:

> I think I figured this out.  It seems to depend on what parser is used.
> Not sure if this just has to do with inconsistent implementations, or there
> is some reason behind it.
>
>
>
> For most audio files, using the AudioParser, the buffer is still at the
> beginning.  Even though there is no text extraction, I would think that
> Tika still needs to read through the stream.
>
> The MP3Parser consumes the stream, but the MP4Parser does not
>
>
>
> The OCR parser also leaves the pointer at the beginning.  It definitely
> consumes the stream, so it must be resetting it.
>
>
>
> So what is going on?  And now I get back to my original question, which
> is, what is the best way to consistently be able to re-use the stream?
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Friday, February 26, 2021 12:18 PM
> *To:* user@tika.apache.org; tallison@apache.org
> *Cc:* lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> So is this guaranteed, expected behavior?
>
>
>
> With a BufferedInputStream – I expect this
>
>
>
>
> try (BufferedInputStream stream = new BufferedInputStream(new FileInputStream(file))) {
>     System.out.printf("before - bytes available: %s%n", stream.available());
>     parser.parse(stream, handler, metadata, parseContext);
>     System.out.printf("after - bytes available: %s%n", stream.available());
> }
>
>
>
> before - bytes available: 10546620
>
> after - bytes available: 0
>
>
>
>
>
>
>
> But with a TikaInputStream, I get this
>
>
>
> Note that I’m purposely creating a FileInputStream first in order to hide the file information from the TikaInputStream, since in my normal use case I’m dealing with a regular InputStream, not reading from a file.
>
>
> try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
>     System.out.printf("before - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
>     parser.parse(stream, handler, metadata, parseContext);
>     System.out.printf("after - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
> }
>
>
>
> before - bytes available: 10546620, position: 0
>
> after - bytes available: 10546620, position: 0
>
>
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Thursday, February 25, 2021 11:28 AM
> *To:* user@tika.apache.org; tallison@apache.org
> *Cc:* lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Or reading from the cloud, either Google or AWS, in which case I also get
> a stream.   I know what the file name is, but can’t really use it
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Thursday, February 25, 2021 11:19 AM
> *To:* tallison@apache.org
> *Cc:* lfcnassif@gmail.com; user@tika.apache.org
> *Subject:* RE: Re-using a TikaStream
>
>
>
> With a stream.  I am reading arbitrary streams and one of the goals is to
> figure out what it is. So there is no file backing it.
>
>
>
> *From:* Tim Allison <ta...@apache.org>
> *Sent:* Thursday, February 25, 2021 11:11 AM
> *To:* Peter Kronenberg <pe...@torch.ai>
> *Cc:* lfcnassif@gmail.com; user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> Are you initializing with a file or a stream?
>
>
>
> On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
> But how is TikaInputStream allowing me to re-use the stream without me
> doing anything special?   Is it automatically spooling to disk as needed?
>
>
>
> I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for
> the most reasonable solution.  I don’t know how big the streams are that
> I’ll be processing.  Obviously, if they’re big, then keeping them in memory
> is not reasonable and disk is the only option.  But for smaller streams, if
> it can do it all in memory, that’s obviously better.  And for my use case,
> I don’t *always* have to re-read the stream.
>
>
>
> *From:* Tim Allison <ta...@apache.org>
> *Sent:* Thursday, February 25, 2021 5:48 AM
> *To:* user@tika.apache.org
> *Cc:* lfcnassif@gmail.com
> *Subject:* Re: Re-using a TikaStream
>
>
>
> My $0.02 would be to use TikaInputStream because that gets a lot more use
> and is battle-tested.  Within the last year or so, we started using
> RereadableInputStream in one of the Microsoft format parsers so it is also
> getting some use now.
>
>
>
> If you absolutely can't afford to spool to disk, then give
> RereadableInputStream a try.
>
>
>
> The inputstreamfactories, in my mind, are somewhat work-arounds for other
> use cases, e.g. retrying/batch etc.
>
>
>
> On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
> So this might be moot, because it seems that TikaInputStream is already
> doing some magic and I’m not sure how.
>
> I was able to re-use the stream without doing anything special after a
> call to parse.  And in fact, I displayed stream.available() and
> stream.getPosition() before and after the call to parse, and the full stream
> was still available at position 0.  What is TikaInputStream doing to make
> this happen?
>
>
>
> Just for some additional context, what I’m doing is running the file
> through Tika and then, depending on the file type, I want to do some
> additional non-tika processing.  I thought that once the Tika parse was
> done, the stream would be used up.
>
>
>
> What is going on?
>
>
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Tuesday, February 23, 2021 10:00 AM
> *To:* user@tika.apache.org; lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> I just found the RereadableInputStream.  This looks more like what I was
> thinking.  Is there any reason not to use it?  What are the Tika best
> practices?  Pros/Cons of each approach?  If RereadableInputStream works as
> it’s supposed to, I’m not sure I see the advantage of InputStreamFactory
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Monday, February 22, 2021 8:30 PM
> *To:* lfcnassif@gmail.com
> *Cc:* user@tika.apache.org
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Oh ok.  I didn’t realize I needed to write my own class to implement it. I
>  was looking for some sort of existing framework.
>
>
>
> What is the purpose of the 2 InputStreamFactory classes?
>
>
>
> I was re-reading some emails with Nick Burch from around Dec 22-23 and
> maybe I misunderstood him, but it sounds like he was saying that
> TikaInputStream was smart enough to automatically spool the stream to disk
> to allow re-use.
>
>
>
> It seems to me that I need an extra pass through the data in order to save
> to disk.  I’m not starting from a File, but from a stream.  So if I need to
> read the stream twice, I really have to pass through the data 3 times,
> correct?
>
> Unless there is a way to save to disk during the first pass
>
>
>
> (try/catch removed for simplicity)
>
>
>
> tis = TikaInputStream.get(inputStream);
> file = tis.getFile();   // <== extra pass
> tis = TikaInputStream.get(new MyInputStreamFactory(file));
> // first real pass
> InputStream is = tis.getInputStreamFactory().getInputStream();
> // second real pass
>
>
>
>
>
>
>
> *From:* Luís Filipe Nassif <lf...@gmail.com>
> *Sent:* Monday, February 22, 2021 5:42 PM
> *To:* Peter Kronenberg <pe...@torch.ai>
> *Cc:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> Something like:
>
>
>
> class MyInputStreamFactory implements InputStreamFactory {
>
>     private final File file;
>
>     public MyInputStreamFactory(File file) {
>         this.file = file;
>     }
>
>     // FileInputStream's constructor can throw, so declare IOException
>     public InputStream getInputStream() throws IOException {
>         return new FileInputStream(file);
>     }
> }
>
>
>
> in your client code:
>
>
>
> Parser parser = new AutoDetectParser();
>
> TikaInputStream tis =  TikaInputStream.get(new MyInputStreamFactory(file));
>
> parser.parse(tis, new ToTextContentHandler(), new Metadata(), new
> ParseContext());
>
>
>
> when you need to reuse the stream (inside your parser):
>
>
>
> public void parse(InputStream stream, ContentHandler handler, Metadata
> metadata, ParseContext context)
>             throws IOException, SAXException, TikaException {
>
>    //(...)
>
>    TikaInputStream tis = TikaInputStream.get(stream);
>
>    if(tis.hasInputStreamFactory()){
>
>         try(InputStream is = tis.getInputStreamFactory().getInputStream()){
>
>               //consume the new stream
>
>         }
>
>    }else
>
>        throw new IOException("not a reusable inputStream");
>
>  }
>
>
>
> Of course this is useful if you are not processing files, e.g. reading
> files from the cloud or sockets.
>
>
>
> Regards,
>
> Luis
>
>
>
>
>
> On Mon, Feb 22, 2021 at 7:18 PM, Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
> I sent this question late on Friday.  Sending it again.  Can you provide a
> little more information on how to use the InputStreamFactory?
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Friday, February 19, 2021 5:10 PM
> *To:* user@tika.apache.org; lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> There appear to be 2 InputStreamFactory classes: in tika-server-core and
> tika-io.  The one in server.core is the only one with a concrete class.
>
> I’m not quite sure I see how to use this.
>
> Normally, I create a TikaInputStream with
> TikaInputStream.get(InputStream).  How do I create it from an
> InputStreamFactory?
>
> TikaInputStream.getInputStreamFactory() only returns a factory if the
> TikaInputStream was created from a factory.
>
> Is there a good example of how this is used?
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Friday, February 19, 2021 4:57 PM
> *To:* user@tika.apache.org; lfcnassif@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Thanks.  I thought that TikaInputStream already automatically saved to
> disk to allow re-reading.
>
>
>
> *From:* Luís Filipe Nassif <lf...@gmail.com>
> *Sent:* Friday, February 19, 2021 3:44 PM
> *To:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> You could call TikaInputStream.getPath() at the beginning of your parser;
> it will spool to a file if the stream is not file-based. After consuming the
> original InputStream, create a new one from the temp file that was created.
>
>
>
> If you are using 2.0.0-ALPHA, there is:
>
>
>
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
>
>
>
> Use with the new methods from TikaInputStream:
>
> public static TikaInputStream get(InputStreamFactory factory)
>
> public InputStreamFactory getInputStreamFactory()
>
>
>
> Hope this helps,
>
> Luis
>
>
>
> On Fri, Feb 19, 2021 at 4:09 PM, Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
> If I finish parsing a TikaStream, can I re-use the stream (before it is
> closed)?  I know you said that there is some magic behind the scenes where
> it spools it to a file.  Can I just call reset() to start from the
> beginning?
>
>
>
> Peter
>
>
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
> *C: 703.887.5623*
>
> [image: Torch AI] <http://www.torch.ai/>
>
> 4303 W. 119th St., Leawood, KS 66209
> WWW.TORCH.AI <http://www.torch.ai/>
>
>
>
>
>
>

RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

I think I figured this out.  It seems to depend on what parser is used.  Not sure if this just has to do with inconsistent implementations, or there is some reason behind it.

For most audio files, using the AudioParser, the buffer is still at the beginning.  Even though there is no text extraction, I would think that Tika still needs to read through the stream.
The MP3Parser consumes the stream, but the MP4Parser does not

The OCR parser also leaves the pointer at the beginning.  It definitely consumes the stream, so it must be resetting it.

So what is going on?  And now I get back to my original question: what is the best way to consistently be able to re-use the stream?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 26, 2021 12:18 PM
To: user@tika.apache.org; tallison@apache.org
Cc: lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

So is this guaranteed, expected behavior?

With a BufferedInputStream – I expect this


try (BufferedInputStream stream = new BufferedInputStream(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s%n", stream.available());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s%n", stream.available());
}

before - bytes available: 10546620
after - bytes available: 0



But with a TikaInputStream, I get this


Note that I’m purposely creating a FileInputStream first in order to hide the file information from the TikaInputStream, since in my normal use case I’m dealing with a regular InputStream, not reading from a file.

try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
}

before - bytes available: 10546620, position: 0
after - bytes available: 10546620, position: 0
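
Because available() is specified by the JDK only as an estimate of the bytes readable without blocking, the before/after figures above are an indirect measure of what a parser did. A more direct check is to count the bytes actually pulled from the underlying source. The sketch below is one way to do that with the JDK alone. ReadCounter is a hypothetical name, it does not count bytes skipped via skip(), and wrapping the raw stream in it before handing it to TikaInputStream.get(...) is an untested assumption about how it would be used here.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical diagnostic wrapper: counts bytes read from the wrapped stream.
final class ReadCounter extends FilterInputStream {

    private long bytesRead;

    ReadCounter(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = super.read(b, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    long bytesRead() {
        return bytesRead;
    }

    // Simulates a consumer that wants `wanted` bytes and reports what was counted.
    static long countAfterReading(byte[] data, int wanted) {
        try (ReadCounter counter = new ReadCounter(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[wanted];
            int off = 0;
            while (off < wanted) {
                int n = counter.read(buf, off, wanted - off);
                if (n < 0) {
                    break; // source exhausted before `wanted` bytes arrived
                }
                off += n;
            }
            return counter.bytesRead();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countAfterReading(new byte[100], 10));
    }
}
```

A count equal to the source length after parse() would indicate that the source was read in full even if the position reported afterwards is 0. A count of zero would indicate that it was never touched.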


From: Peter Kronenberg <pe...@torch.ai>>
Sent: Thursday, February 25, 2021 11:28 AM
To: user@tika.apache.org<ma...@tika.apache.org>; tallison@apache.org<ma...@apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>
Subject: RE: Re-using a TikaStream

Or reading from the cloud, either Google or AWS, in which case I also get a stream.   I know what the file name is, but can’t really use it

From: Peter Kronenberg <pe...@torch.ai>>
Sent: Thursday, February 25, 2021 11:19 AM
To: tallison@apache.org<ma...@apache.org>
Cc: lfcnassif@gmail.com<ma...@gmail.com>; user@tika.apache.org<ma...@tika.apache.org>
Subject: RE: Re-using a TikaStream

With a stream.  I am reading arbitrary streams and one of the goals is to figure out what it is. So there is no file backing it.

From: Tim Allison <ta...@apache.org>
Sent: Thursday, February 25, 2021 11:11 AM
To: Peter Kronenberg <pe...@torch.ai>
Cc: lfcnassif@gmail.com; user@tika.apache.org
Subject: Re: Re-using a TikaStream

Are you initializing with a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <pe...@torch.ai> wrote:
But how is TikaInputStream allowing me to re-use the stream without me doing anything special?  Is it automatically spooling to disk as needed?

I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for the most reasonable solution.  I don’t know how big the streams are that I’ll be processing.  Obviously, if they’re big, then keeping them in memory is not reasonable and disk is the only option.  But for smaller streams, if it can do it all in memory, that’s obviously better.  And for my use case, I don’t *always* have to re-read the stream.
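
The trade-off described above (memory for small streams, disk for large ones) can be sketched with the JDK alone. This is only an illustration under stated assumptions, not Tika's implementation: the class `ThresholdBuffer`, its methods, and the `maxBytesInMemory` parameter are hypothetical names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: buffers a stream in memory up to a limit, then spills to a temp file. */
public class ThresholdBuffer {

    private final byte[] memory;   // non-null when the data fit in memory
    private final Path file;       // non-null when the data spilled to disk

    private ThresholdBuffer(byte[] memory, Path file) {
        this.memory = memory;
        this.file = file;
    }

    /** Reads the source exactly once, choosing memory or disk based on its size. */
    public static ThresholdBuffer of(InputStream in, int maxBytesInMemory) throws IOException {
        ByteArrayOutputStream head = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (head.size() + n > maxBytesInMemory) {
                // Too big: write what we have plus the remainder to a temp file.
                Path tmp = Files.createTempFile("threshold-", ".bin");
                try (OutputStream out = Files.newOutputStream(tmp)) {
                    head.writeTo(out);
                    out.write(chunk, 0, n);
                    in.transferTo(out);
                }
                return new ThresholdBuffer(null, tmp);
            }
            head.write(chunk, 0, n);
        }
        return new ThresholdBuffer(head.toByteArray(), null);
    }

    public boolean inMemory() {
        return memory != null;
    }

    /** Returns a fresh stream positioned at the start; may be called repeatedly. */
    public InputStream open() throws IOException {
        return inMemory() ? new ByteArrayInputStream(memory) : Files.newInputStream(file);
    }

    /** Deletes the temp file, if one was created. */
    public void delete() throws IOException {
        if (file != null) {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        ThresholdBuffer buf = of(new ByteArrayInputStream(new byte[100]), 1024);
        try (InputStream first = buf.open(); InputStream second = buf.open()) {
            System.out.println(buf.inMemory() + " " + first.readAllBytes().length
                    + " " + second.readAllBytes().length);
        } finally {
            buf.delete();
        }
    }
}
```

The source is consumed only once either way; whether a second pass is ever needed does not change the cost of the first one, apart from the copy itself.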

From: Tim Allison <ta...@apache.org>
Sent: Thursday, February 25, 2021 5:48 AM
To: user@tika.apache.org
Cc: lfcnassif@gmail.com
Subject: Re: Re-using a TikaStream

My $0.02 would be to use TikaInputStream because that gets a lot more use and is battle-tested.  Within the last year or so, we started using RereadableInputStream in one of the Microsoft format parsers so it is also getting some use now.

If you absolutely can't afford to spool to disk, then give RereadableInputStream a try.

The InputStreamFactory classes, in my mind, are more of a workaround for other use cases, e.g. retrying/batch etc.

On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <pe...@torch.ai> wrote:
So this might be moot, because it seems that TikaInputStream is already doing some magic and I’m not sure how.
I was able to re-use the stream without doing anything special after a call to parse.  And in fact, I displayed stream.available() and stream.getPosition() before and after the call to parse, and the full stream was still available at position 0.  What is TikaInputStream doing to make this happen?

Just for some additional context, what I’m doing is running the file through Tika and then, depending on the file type, I want to do some additional non-Tika processing.  I thought that once the Tika parse was done, the stream would be used up.

What is going on?


From: Peter Kronenberg <pe...@torch.ai>
Sent: Tuesday, February 23, 2021 10:00 AM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

I just found the RereadableInputStream.  This looks more like what I was thinking.  Is there any reason not to use it?  What are the Tika best practices?  Pros/Cons of each approach?  If RereadableInputStream works as it’s supposed to, I’m not sure I see the advantage of InputStreamFactory

From: Peter Kronenberg <pe...@torch.ai>
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnassif@gmail.com
Cc: user@tika.apache.org
Subject: RE: Re-using a TikaStream

Oh ok.  I didn’t realize I needed to write my own class to implement it. I was looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I misunderstood him, but it sounds like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save to disk.  I’m not starting from a File, but from a stream.  So if I need to read the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass.

(try/catch removed for simplicity)

tis = TikaInputStream.get(inputStream);
file = tis.getFile();   // <== extra pass: spools the stream to a temp file
tis = TikaInputStream.get(new MyInputStreamFactory(file));
// first real pass
InputStream is = tis.getInputStreamFactory().getInputStream();
// second real pass
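
Saving to disk during the first pass, rather than in a separate one, can be sketched with a "tee" wrapper built on the JDK's FilterInputStream. This is an assumption-laden illustration, not something Tika provides under this name: `TeeDemo` is a hypothetical class, and the copy is complete only if the consumer reads the stream to the end without calling skip().

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: copies every byte the consumer reads into a side channel. */
public class TeeDemo extends FilterInputStream {

    private final OutputStream copy;

    public TeeDemo(InputStream in, OutputStream copy) {
        super(in);
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            copy.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n);
        }
        return n;
    }

    // mark/reset would duplicate bytes in the copy, so do not advertise support.
    @Override
    public boolean markSupported() {
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tee-", ".bin");
        try {
            byte[] data = "first pass".getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = Files.newOutputStream(tmp);
                 InputStream in = new TeeDemo(new ByteArrayInputStream(data), out)) {
                in.readAllBytes();   // first real pass, spooled as a side effect
            }
            // Second pass reads the spooled copy instead of the original source.
            try (InputStream again = Files.newInputStream(tmp)) {
                System.out.println(new String(again.readAllBytes(), StandardCharsets.UTF_8));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With this shape, reading the data twice costs two passes rather than three, at the price of a consumer that must read everything on the first pass.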



From: Luís Filipe Nassif <lf...@gmail.com>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Re-using a TikaStream

Something like:

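import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.io.InputStreamFactory;

class MyInputStreamFactory implements InputStreamFactory {

    private final File file;

    public MyInputStreamFactory(File file) {
        this.file = file;
    }

    // Opens a fresh stream over the same file on every call.
    @Override
    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}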

in your client code:

Parser parser = new AutoDetectParser();
TikaInputStream tis =  TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());

when you need to reuse the stream (inside your parser):

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
   //(...)
   TikaInputStream tis = TikaInputStream.get(stream);
   if(tis.hasInputStreamFactory()){
        try(InputStream is = tis.getInputStreamFactory().getInputStream()){
              //consume the new stream
        }
   }else
       throw new IOException("not a reusable inputStream");
 }

Of course this is useful if you are not processing files, e.g. reading files from the cloud or sockets.

Regards,
Luis


On Mon, Feb 22, 2021 at 7:18 PM Peter Kronenberg <pe...@torch.ai> wrote:
I sent this question late on Friday.  Sending it again.  Can you provide a little more information on how to use the InputStreamFactory?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 19, 2021 5:10 PM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

There appear to be 2 InputStreamFactory classes: one in tika-server-core and one in tika-io.  The one in tika-server-core is the only one with a concrete class.
I’m not quite sure I see how to use this.
Normally, I create a TikaInputStream with TikaInputStream.get(InputStream).  How do I create it from an InputStreamFactory?
TikaInputStream.getInputStreamFactory() only returns a factory if the TikaInputStream was created from a factory.
Is there a good example of how this is used?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 19, 2021 4:57 PM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

Thanks.  I thought that TikaInputStream already automatically saved to disk to allow re-reading.

From: Luís Filipe Nassif <lf...@gmail.com>
Sent: Friday, February 19, 2021 3:44 PM
To: user@tika.apache.org
Subject: Re: Re-using a TikaStream

You could call TikaInputStream.getPath() at the beginning of your parser; it will spool to a file if the stream is not file-based. After consuming the original InputStream, create a new one from the temp file that was created.
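
The general idea (spool once, then open as many fresh streams over the temp file as needed) looks roughly like this with only the JDK. It is a sketch of the concept rather than of Tika's internals; the class `SpoolToTempFile` and its `spool` method are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolToTempFile {

    /** Copies the whole stream into a new temp file and returns its path. */
    public static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".bin");
        // createTempFile already created the file, so the copy must replace it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        InputStream source =
                new ByteArrayInputStream("some bytes".getBytes(StandardCharsets.UTF_8));
        Path tmp = spool(source);
        try {
            // Each pass opens its own independent stream over the spooled copy.
            for (int pass = 1; pass <= 2; pass++) {
                try (InputStream in = Files.newInputStream(tmp)) {
                    System.out.println("pass " + pass + ": " + in.readAllBytes().length + " bytes");
                }
            }
        } finally {
            Files.deleteIfExists(tmp); // the caller owns cleanup of the temp file
        }
    }
}
```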

If you are using 2.0.0-ALPHA, there is:

https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java

Use with the new methods from TikaInputStream:
public static TikaInputStream get(InputStreamFactory factory)
public InputStreamFactory getInputStreamFactory()

Hope this helps,
Luis

On Fri, Feb 19, 2021 at 4:09 PM Peter Kronenberg <pe...@torch.ai> wrote:
If I finish parsing a TikaStream, can I re-use the stream (before it is closed)?  I know you said that there is some magic behind the scenes where it spools it to a file.  Can I just call reset() to start from the beginning?

Peter


Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>

RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

So is this guaranteed, expected behavior?

With a BufferedInputStream – I expect this


try (BufferedInputStream stream = new BufferedInputStream(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s%n", stream.available());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s%n", stream.available());
}

before - bytes available: 10546620
after - bytes available: 0



But with a TikaInputStream, I get this


Note that I’m purposely creating a FileInputStream first in order to hide the file information from the TikaInputStream, since in my normal use case I’m dealing with a regular InputStream, not reading from a file.

try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s, position: %s%n", stream.available(), stream.getPosition());
}

before - bytes available: 10546620, position: 0
after - bytes available: 10546620, position: 0
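
For comparison, a plain JDK stream can also be rewound, but only if it is marked before being consumed and the mark limit covers everything that gets read. A minimal sketch, assuming the payload is small enough to buffer; `MarkResetDemo` and `readTwice` are made-up names for illustration and have nothing to do with Tika:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkResetDemo {

    /** Reads the stream fully, rewinds it with reset(), and reads it again. */
    public static String[] readTwice(InputStream raw, int markLimit) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(markLimit);                       // remember the start position
        String first = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        in.reset();                               // back to the marked position
        String second = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        return new String[] {first, second};
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        String[] passes = readTwice(new ByteArrayInputStream(data), data.length + 1);
        System.out.println(passes[0].equals(passes[1]));
    }
}
```

Without the mark() call, or with a mark limit smaller than the amount read, reset() may throw an IOException, which is why an unmarked BufferedInputStream simply ends up exhausted after a parse.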


RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

Or reading from the cloud, either Google or AWS, in which case I also get a stream.   I know what the file name is, but can’t really use it

RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

With a stream.  I am reading arbitrary streams and one of the goals is to figure out what it is. So there is no file backing it.

From: Tim Allison <ta...@apache.org>
Sent: Thursday, February 25, 2021 11:11 AM
To: Peter Kronenberg <pe...@torch.ai>
Cc: lfcnassif@gmail.com; user@tika.apache.org
Subject: Re: Re-using a TikaStream

Are you initializing with a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <pe...@torch.ai> wrote:
But how is TikaInputStream allowing me to re-use the stream without me doing anything special?  Is it automatically spooling to disk as needed?

I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for the most reasonable solution.  I don’t know how big the streams are that I’ll be processing.  Obviously, if they’re big, then keeping them in memory is not reasonable and disk is the only option.  But for smaller streams, if it can do it all in memory, that’s obviously better.  And for my use case, I don’t *always* have to re-read the stream.

From: Tim Allison <ta...@apache.org>
Sent: Thursday, February 25, 2021 5:48 AM
To: user@tika.apache.org
Cc: lfcnassif@gmail.com
Subject: Re: Re-using a TikaStream

My $0.02 would be to use TikaInputStream because that gets a lot more use and is battle-tested.  Within the last year or so, we started using RereadableInputStream in one of the Microsoft format parsers so it is also getting some use now.

If you absolutely can't afford to spool to disk, then give RereadableInputStream a try.

The InputStreamFactory classes, in my mind, are more of a workaround for other use cases, e.g. retrying/batch etc.

On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <pe...@torch.ai> wrote:
So this might be moot, because it seems that TikaInputStream is already doing some magic and I’m not sure how.
I was able to re-use the stream without doing anything special after a call to parse.  And in fact, I displayed stream.available() and stream.getPosition() before and after the call to parse, and the full stream was still available at position 0.  What is TikaInputStream doing to make this happen?

Just for some additional context, what I’m doing is running the file through Tika and then, depending on the file type, I want to do some additional non-Tika processing.  I thought that once the Tika parse was done, the stream would be used up.

What is going on?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Tuesday, February 23, 2021 10:00 AM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

I just found the RereadableInputStream.  This looks more like what I was thinking.  Is there any reason not to use it?  What are the Tika best practices?  Pros/Cons of each approach?  If RereadableInputStream works as it’s supposed to, I’m not sure I see the advantage of InputStreamFactory

From: Peter Kronenberg <pe...@torch.ai>
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnassif@gmail.com
Cc: user@tika.apache.org
Subject: RE: Re-using a TikaStream

Oh ok.  I didn’t realize I needed to write my own class to implement it. I was looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I misunderstood him, but it sounds like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save to disk.  I’m not starting from a File, but from a stream.  So if I need to read the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass

(try/catch removed for simplicity)

TikaInputStream tis = TikaInputStream.get(inputStream);
File file = tis.getFile();   // <== extra pass (spools the stream to a temp file)
tis = TikaInputStream.get(new MyInputStreamFactory(file));
// first real pass
InputStream is = tis.getInputStreamFactory().getInputStream();
// second real pass
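On the question of saving to disk during the first pass: one general technique is a "tee" stream that writes every byte to a side file as the consumer reads it. The sketch below uses only the JDK; TeeSketch is a hypothetical class, not something Tika provides. The copy is only complete if the first consumer reads to the end of the stream without calling skip(), which a parser is not guaranteed to do.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical tee stream: bytes handed to the consumer are also written to a sink. */
final class TeeSketch extends FilterInputStream {
    private final OutputStream sink;

    TeeSketch(InputStream in, OutputStream sink) {
        super(in);
        this.sink = sink;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            sink.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            sink.write(buf, off, n);
        }
        return n;
    }

    @Override
    public boolean markSupported() {
        return false; // a reset would replay bytes and duplicate them in the copy
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            sink.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path copy = Files.createTempFile("tee-sketch-", ".tmp");
        byte[] data = "first pass".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new TeeSketch(new ByteArrayInputStream(data), Files.newOutputStream(copy))) {
            in.readAllBytes(); // stands in for the first real pass, e.g. a parse
        }
        // The second pass reads the copy that was written during the first one.
        System.out.println(new String(Files.readAllBytes(copy), StandardCharsets.UTF_8));
        Files.deleteIfExists(copy);
    }
}
```

The design choice here is to trade robustness for one fewer pass: spooling up front always yields a complete file, while the tee copy depends on how much the first reader consumed.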

From: Luís Filipe Nassif <lf...@gmail.com>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Re-using a TikaStream

Something like:

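import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.io.InputStreamFactory;

class MyInputStreamFactory implements InputStreamFactory {

    private final File file;

    public MyInputStreamFactory(File file) {
        this.file = file;
    }

    // Each call opens a fresh stream over the same file.
    @Override
    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}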

in your client code:

Parser parser = new AutoDetectParser();
TikaInputStream tis =  TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());

when you need to reuse the stream (inside your parser):

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    //(...)
    TikaInputStream tis = TikaInputStream.get(stream);
    if (tis.hasInputStreamFactory()) {
        // ask the factory for a brand-new stream instead of rewinding the old one
        try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
            // consume the new stream
        }
    } else {
        throw new IOException("not a reusable inputStream");
    }
}

Of course this is useful if you are not processing files, e.g. reading files from the cloud or sockets.
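For sources that are not files, the same idea can be sketched without Tika on the classpath. StreamSource below is a hypothetical stand-in with the same shape as InputStreamFactory (one method that opens a fresh stream); in real code the lambda might re-issue an HTTP GET or re-open a cloud object, which is an assumption and not something shown in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ReopenSketch {

    /** Hypothetical stand-in for a stream factory: each call opens a new stream. */
    @FunctionalInterface
    interface StreamSource {
        InputStream open() throws IOException;
    }

    /** Reads the source twice, opening a new stream for each pass. */
    static String[] readTwice(StreamSource source) throws IOException {
        String[] passes = new String[2];
        for (int i = 0; i < passes.length; i++) {
            try (InputStream in = source.open()) {
                passes[i] = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
        return passes;
    }

    public static void main(String[] args) throws IOException {
        byte[] remote = "pretend this came from a bucket".getBytes(StandardCharsets.UTF_8);
        // The byte array stands in for a remote object that can be fetched again.
        StreamSource source = () -> new ByteArrayInputStream(remote);
        for (String pass : readTwice(source)) {
            System.out.println(pass);
        }
    }
}
```

Nothing is cached in this arrangement; every pass pays the cost of fetching the data again, which is the trade-off against spooling to disk.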

Regards,
Luis

On Mon, Feb 22, 2021 at 19:18, Peter Kronenberg <pe...@torch.ai> wrote:
I sent this question late on Friday.  Sending it again.  Can you provide a little more information on how to use the InputStreamFactory?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 19, 2021 5:10 PM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

There appear to be 2 InputStreamFactory classes: one in tika-server-core and one in tika-io.  The one in tika-server-core is the only one with a concrete class.
I’m not quite sure I see how to use this.
Normally, I create a TikaInputStream with TikaInputStream.get(InputStream).  How do I create it from an InputStreamFactory?
TikaInputStream.getInputStreamFactory() only returns a factory if the TikaInputStream was created from a factory.
Is there a good example of how this is used?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 19, 2021 4:57 PM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

Thanks.  I thought that TikaInputStream already automatically saved to disk to allow re-reading.

From: Luís Filipe Nassif <lf...@gmail.com>
Sent: Friday, February 19, 2021 3:44 PM
To: user@tika.apache.org
Subject: Re: Re-using a TikaStream

You could call TikaInputStream.getPath() at the beginning of your parser; it will spool the stream to a temporary file if it is not already file-based. After consuming the original InputStream, create a new one from that temp file.
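The spool-then-reopen pattern described above can be illustrated with the JDK alone. This is a minimal sketch of the general idea, not Tika's implementation; SpoolSketch and spoolToTempFile are hypothetical names, and Java 9+ is assumed for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: spool a stream once, then re-open the file for each pass. */
final class SpoolSketch {

    /** Copies the whole stream into a new temp file and returns its path. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-sketch-", ".tmp");
        // REPLACE_EXISTING is needed because createTempFile has already created the file.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        InputStream original = new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8));
        Path spooled = spoolToTempFile(original);
        try {
            // Each pass opens its own stream over the spooled file.
            for (int pass = 1; pass <= 2; pass++) {
                try (InputStream in = Files.newInputStream(spooled)) {
                    String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    System.out.println("pass " + pass + ": " + text);
                }
            }
        } finally {
            Files.deleteIfExists(spooled);
        }
    }
}
```

Spooling has to happen before anything else consumes the stream, which matches the advice to do it at the beginning of the parser.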

If you are using 2.0.0-ALPHA, there is:

https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java

Use with the new methods from TikaInputStream:
public static TikaInputStream get(InputStreamFactory factory)
public InputStreamFactory getInputStreamFactory()

Hope this helps,
Luis

On Fri, Feb 19, 2021 at 16:09, Peter Kronenberg <pe...@torch.ai> wrote:
If I finish parsing a TikaStream, can I re-use the stream (before it is closed)?  I know you said that there is some magic behind the scenes where it spools it to a file.  Can I just call reset() to start from the beginning?

Peter

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>
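
On the reset() question in the original message above: with plain java.io streams, reset() only rewinds to the most recent mark(), and only while no more than the mark's read limit has been consumed. The JDK-only sketch below shows that contract on a BufferedInputStream. MarkResetSketch and peek are hypothetical names, Java 11+ is assumed for readNBytes, and whether a given Tika parser leaves the stream in a resettable state is a separate matter this sketch does not settle.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class MarkResetSketch {

    /** Peeks at up to n leading bytes and leaves the stream where it started. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(n); // remember the current position; valid for the next n bytes
        try {
            return in.readNBytes(n); // never reads past the mark's read limit
        } finally {
            in.reset(); // rewind to the mark
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 rest of file".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] magic = peek(in, 4);
        System.out.println(new String(magic, StandardCharsets.US_ASCII));
        // The full content is still available after the peek.
        System.out.println(new String(in.readAllBytes(), StandardCharsets.US_ASCII));
    }
}
```

Reading past the limit given to mark() may invalidate the mark, so reset() after a full read is not something to rely on unless the stream was marked with a large enough limit beforehand.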

Re: Re-using a TikaStream

Posted by Tim Allison <ta...@apache.org>.

Are you initializing with a file or a stream?


RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

But how is TikaInputStream allowing me to re-use the stream without me doing anything special?   Is it automatically spooling to disk as needed?

I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for the most reasonable solution.  I don’t know how big the streams are that I’ll be processing.  Obviously, if they’re big, then keeping them in memory is not reasonable and disk is the only option.  But for smaller streams, if it can do it all in memory, that’s obviously better.  And for my use case, I don’t *always* have to re-read the stream.


Re: Re-using a TikaStream

Posted by Tim Allison <ta...@apache.org>.

My $0.02 would be to use TikaInputStream because that gets a lot more use
and is battle-tested.  Within the last year or so, we started using
RereadableInputStream in one of the Microsoft format parsers so it is also
getting some use now.

If you absolutely can't afford to spool to disk, then give
RereadableInputStream a try.

The inputstreamfactories, in my mind, are somewhat work-arounds for other
use cases, e.g. retrying/batch etc.


RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

So this might be moot, because TikaInputStream already seems to be doing some magic, and I’m not sure how.
I was able to re-use the stream after a call to parse without doing anything special.  In fact, I printed stream.available() and stream.position() before and after the call to parse, and the full stream was still available at position 0.  What is TikaInputStream doing to make this happen?

For some additional context: I’m running the file through Tika and then, depending on the file type, I want to do some additional non-Tika processing.  I thought that once the Tika parse was done, the stream would be used up.

What is going on?
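
A rough way to picture what may be happening, going by the explanation given elsewhere in this thread: if the parser asks for a file, the stream is copied to a temp file once and later reads come from a fresh stream over that file, so the caller still sees the whole content at position 0. The sketch models that idea with plain java.io and java.nio only (Java 9 or later for readAllBytes). SpoolingStream, its getPath() method and fileBasedParse are made-up names for illustration; this is not Tika's implementation and it uses no Tika classes.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical model of a stream that can spool itself to disk. Not Tika code.
class SpoolingStream extends FilterInputStream {
    private Path spooled; // set once the content has been copied to a temp file

    SpoolingStream(InputStream in) {
        super(in);
    }

    // Copies the wrapped stream to a temp file on first use, then swaps the
    // wrapped stream for a fresh one over that file, positioned at byte 0.
    Path getPath() throws IOException {
        if (spooled == null) {
            Path tmp = Files.createTempFile("spool-", ".tmp");
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            in.close();
            in = Files.newInputStream(tmp);
            spooled = tmp;
        }
        return spooled;
    }
}

class SpoolingDemo {
    // Stands in for a parser that only wants a file and never reads the stream.
    static long fileBasedParse(SpoolingStream stream) throws IOException {
        return Files.size(stream.getPath());
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        try (SpoolingStream stream = new SpoolingStream(new ByteArrayInputStream(data))) {
            System.out.println("parsed bytes: " + fileBasedParse(stream));
            // The caller can still read everything from the start.
            byte[] again = stream.readAllBytes();
            System.out.println("re-read bytes: " + again.length);
        }
    }
}
```

A parser that reads the stream directly, instead of asking for the file, would leave it at or near the end, which would match the mixed behaviour reported in this thread.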


From: Peter Kronenberg <pe...@torch.ai>
Sent: Tuesday, February 23, 2021 10:00 AM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

I just found RereadableInputStream, which looks more like what I had in mind.  Is there any reason not to use it?  What are the Tika best practices, and what are the pros and cons of each approach?  If RereadableInputStream works as it’s supposed to, I’m not sure I see the advantage of InputStreamFactory.

From: Peter Kronenberg <pe...@torch.ai>
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnassif@gmail.com
Cc: user@tika.apache.org
Subject: RE: Re-using a TikaStream

Oh, OK.  I didn’t realize I needed to write my own class to implement it.  I was looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch from around Dec 22-23, and maybe I misunderstood him, but it sounds like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save it to disk.  I’m not starting from a File, but from a stream.  So if I need to read the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass?

(catch blocks removed for simplicity)

try (TikaInputStream spooling = TikaInputStream.get(inputStream)) {
    File file = spooling.getFile();   // <== extra pass: spools to a temp file
    TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
    // first real pass: read from tis
    InputStream is = tis.getInputStreamFactory().getInputStream();
    // second real pass: read from is
}



From: Luís Filipe Nassif <lf...@gmail.com>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Re-using a TikaStream

Something like:

class MyInputStreamFactory implements InputStreamFactory{

    private File file;

    public  MyInputStreamFactory(File file){
        this.file = file;
    }

    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}

in your client code:

Parser parser = new AutoDetectParser();
TikaInputStream tis =  TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());

when you need to reuse the stream (into your parser):

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
   //(...)
   TikaInputStream tis = TikaInputStream.get(stream);
   if(tis.hasInputStreamFactory()){
        try(InputStream is = tis.getInputStreamFactory().getInputStream()){
              //consume the new stream
        }
   }else
       throw new IOException("not a reusable inputStream");
 }

Of course this is useful if you are not processing files, e.g. reading files from the cloud or sockets.

Regards,
Luis


On Mon, 22 Feb 2021 at 19:18, Peter Kronenberg <pe...@torch.ai> wrote:
I sent this question late on Friday.  Sending it again.  Can you provide a little more information on how to use the InputStreamFactory?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 19, 2021 5:10 PM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

There appear to be 2 InputStreamFactory classes: one in tika-server-core and one in tika-io.  The one in tika-server-core is the only one with a concrete implementation.
I’m not quite sure I see how to use this.
Normally, I create a TikaInputStream with TikaInputStream.get(InputStream).  How do I create it from an InputStreamFactory?
TikaInputStream.getInputStreamFactory() only returns a factory if the TikaInputStream was created from a factory.
Is there a good example of how this is used?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Friday, February 19, 2021 4:57 PM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: RE: Re-using a TikaStream

Thanks.  I thought that TikaInputStream already automatically saved to disk to allow re-reading.

From: Luís Filipe Nassif <lf...@gmail.com>
Sent: Friday, February 19, 2021 3:44 PM
To: user@tika.apache.org
Subject: Re: Re-using a TikaStream

You could call TikaInputStream.getPath() at the beginning of your parser; it will spool the stream to a file if it is not already file-based. After consuming the original InputStream, create a new one from the temp file that was created.

If you are using 2.0.0-ALPHA, there is:

https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java

Use with the new methods from TikaInputStream:
public static TikaInputStream get(InputStreamFactory factory)
public InputStreamFactory getInputStreamFactory()

Hope this helps,
Luis

On Fri, 19 Feb 2021 at 16:09, Peter Kronenberg <pe...@torch.ai> wrote:
If I finish parsing a TikaStream, can I re-use the stream (before it is closed)?  I know you said that there is some magic behind the scenes where it spools it to a file.  Can I just call reset() to start from the beginning?

Peter


RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

I just found RereadableInputStream, which looks more like what I had in mind.  Is there any reason not to use it?  What are the Tika best practices, and what are the pros and cons of each approach?  If RereadableInputStream works as it’s supposed to, I’m not sure I see the advantage of InputStreamFactory.

RE: Re-using a TikaStream

Posted by Nick Burch <ap...@gagravarr.org>.

On Tue, 23 Feb 2021, Peter Kronenberg wrote:
> I was re-reading some emails with Nick Burch from around Dec 22-23, and
> maybe I misunderstood him, but it sounds like he was saying that
> TikaInputStream was smart enough to automatically spool the stream to
> disk to allow re-use.

If a parser knows it is going to need a File, or knows it will need to
re-read the data multiple times, it can tell TikaInputStream, which will
then save the stream to a temp file. If you as the caller know this, you
can force it with a getFile() / getPath() call.

If spooling to a local file is expensive but restarting the stream is
cheap, then an InputStreamFactory can be used instead. Typically that's
with cloud storage or the like.

Nick
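
The trade-off described above can be sketched without Tika: when a source is cheap to reopen, each pass simply asks for a new stream instead of spooling anything to disk. ReopenableSource is a hypothetical stand-in for the role InputStreamFactory plays, and the in-memory byte array stands in for a cloud object; neither is Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a source that can be reopened cheaply. Not Tika API.
interface ReopenableSource {
    InputStream open() throws IOException;
}

class TwoPassDemo {
    // First pass: count the bytes in the source.
    static long countBytes(ReopenableSource source) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = source.open()) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    // Second pass: a brand-new stream, so reading starts at byte 0 again.
    static int firstByte(ReopenableSource source) throws IOException {
        try (InputStream in = source.open()) {
            return in.read();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] remoteObject = "abc".getBytes(StandardCharsets.UTF_8);
        ReopenableSource source = () -> new ByteArrayInputStream(remoteObject);
        System.out.println(countBytes(source));
        System.out.println((char) firstByte(source));
    }
}
```

Nothing is written to disk here; the cost is that the source is fetched once per pass, which only pays off when reopening is cheaper than spooling.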

RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

Oh, OK.  I didn’t realize I needed to write my own class to implement it.  I was looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch from around Dec 22-23, and maybe I misunderstood him, but it sounds like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save it to disk.  I’m not starting from a File, but from a stream.  So if I need to read the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass?

(catch blocks removed for simplicity)

try (TikaInputStream spooling = TikaInputStream.get(inputStream)) {
    File file = spooling.getFile();   // <== extra pass: spools to a temp file
    TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
    // first real pass: read from tis
    InputStream is = tis.getInputStreamFactory().getInputStream();
    // second real pass: read from is
}
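
On the question of saving to disk during the first pass: one generic way to avoid the extra pass, shown here with plain java.io rather than any Tika class, is to wrap the source so that every byte the first reader consumes is also written to a temp file. TeeToFileStream and reopen() are hypothetical names. The idea is similar in spirit to a re-readable stream, but the sketch makes no claim about how Tika's RereadableInputStream is implemented. It needs Java 9 or later for readAllBytes.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a Tika class: copies bytes to a temp file as they are read.
class TeeToFileStream extends InputStream {
    private final InputStream source;
    private final Path copy;
    private final OutputStream out;
    private boolean finished;

    TeeToFileStream(InputStream source) throws IOException {
        this.source = source;
        this.copy = Files.createTempFile("tee-", ".tmp");
        this.copy.toFile().deleteOnExit();
        this.out = new BufferedOutputStream(Files.newOutputStream(copy));
    }

    @Override
    public int read() throws IOException {
        int b = source.read();
        if (b != -1) {
            out.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = source.read(buf, off, len);
        if (n > 0) {
            out.write(buf, off, n);
        }
        return n;
    }

    // Drains whatever the first reader left unread, then opens the on-disk copy.
    InputStream reopen() throws IOException {
        if (!finished) {
            byte[] buf = new byte[8192];
            while (read(buf, 0, buf.length) != -1) {
                // read() above has already written these bytes to the temp file
            }
            close();
        }
        return Files.newInputStream(copy);
    }

    @Override
    public void close() throws IOException {
        if (!finished) {
            finished = true;
            out.close();
            source.close();
        }
    }
}

class TeeDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = "first pass data".getBytes(StandardCharsets.UTF_8);
        TeeToFileStream tee = new TeeToFileStream(new ByteArrayInputStream(data));
        byte[] head = new byte[5];
        int n = tee.read(head); // first pass, only partly consumed
        try (InputStream second = tee.reopen()) {
            System.out.println(n + " bytes read first, "
                    + second.readAllBytes().length + " bytes on re-read");
        }
    }
}
```

Because the class extends InputStream directly, skip() falls back to reading, so skipped bytes still reach the temp file, and mark/reset is reported as unsupported rather than silently duplicating data in the copy.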



Re: Re-using a TikaStream

Posted by Luís Filipe Nassif <lf...@gmail.com>.

Something like:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.io.InputStreamFactory;

class MyInputStreamFactory implements InputStreamFactory {

    private final File file;

    public MyInputStreamFactory(File file) {
        this.file = file;
    }

    // Opens a fresh stream over the same file on every call.
    // FileInputStream's constructor throws a checked FileNotFoundException,
    // so the method has to declare IOException.
    @Override
    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}

in your client code:

Parser parser = new AutoDetectParser();
TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());

when you need to reuse the stream (inside your parser):

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    // (...)
    TikaInputStream tis = TikaInputStream.get(stream);
    if (tis.hasInputStreamFactory()) {
        try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
            // consume the new stream
        }
    } else {
        throw new IOException("not a reusable inputStream");
    }
}

Of course this is useful if you are not processing files, e.g. reading
files from the cloud or sockets.

Regards,
Luis


RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

I sent this question late on Friday.  Sending it again.  Can you provide a little more information on how to use the InputStreamFactory?

RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

There appear to be 2 InputStreamFactory classes: one in tika-server-core and one in tika-io.  The one in tika-server-core is the only one with a concrete implementation.
I’m not quite sure I see how to use this.
Normally, I create a TikaInputStream with TikaInputStream.get(InputStream).  How do I create it from an InputStreamFactory?
TikaInputStream.getInputStreamFactory() only returns a factory if the TikaInputStream was created from a factory.
Is there a good example of how this is used?

RE: Re-using a TikaStream

Posted by Peter Kronenberg <pe...@torch.ai>.

Thanks.  I thought that TikaInputStream already automatically saved to disk to allow re-reading.

Re: Re-using a TikaStream

Posted by Luís Filipe Nassif <lf...@gmail.com>.

You could call TikaInputStream.getPath() at the beginning of your parser;
it will spool the stream to a file if it is not already file-based. After
consuming the original InputStream, create a new one from the temp file
that was created.

If you are using 2.0.0-ALPHA, there is:

https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java

Use with the new methods from TikaInputStream:
public static TikaInputStream get(InputStreamFactory factory)
public InputStreamFactory getInputStreamFactory()

Hope this helps,
Luis
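
On the original question of calling reset(): in plain java.io, reset() returns to the most recent mark(), and only on streams that report markSupported(). A minimal sketch of that contract, using the hypothetical helper peek() to look at the first bytes of a stream and then rewind it (Java 9 or later for readNBytes):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkResetDemo {
    // Reads up to n bytes and then rewinds, leaving the stream where it started.
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream does not support mark/reset");
        }
        in.mark(n); // remember the current position
        try {
            byte[] buf = new byte[n];
            int read = in.readNBytes(buf, 0, n);
            return Arrays.copyOf(buf, read);
        } finally {
            in.reset(); // back to the marked position
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 ...".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream adds mark/reset support to streams that lack it.
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            String magic = new String(peek(in, 4), StandardCharsets.US_ASCII);
            System.out.println(magic + ", next byte: " + (char) in.read());
        }
    }
}
```

This is enough for sniffing a header. Rewinding after a full parse is a different matter, which is where spooling to a temp file or an InputStreamFactory comes in.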
