Posted to dev@tika.apache.org by kevin slote <ks...@gmail.com> on 2013/09/30 16:36:34 UTC

problem with the inputstream after calling the detect(InputStream in) method

Hi, I have been using Tika for a while now without any problems, and I am
a big fan of the software.  I wanted to do my part and report what I
suspect might be a bug.

My code uses two libraries, JavaMail and java-libpst, and I am unit
testing with Dumbster.  The last unit test I built with Dumbster sends the
email and checks that all of the attachments were appended correctly; this
test failed.  After some nitty-gritty debugging, I discovered that if I
put a System.out.println(in.read()); directly before the call to Tika, it
printed the correct value on the console.  However, if I put the same
statement after the call to Tika, it printed -1.

public void sendAsEmail(PSTMessage email, String parent, String dir)
        throws IOException, MessagingException, PSTException {
    String subject = email.getSubject();
    String to = primaryRecipientsEmail(email);
    String from = email.getSenderEmailAddress();
    if (!isValidEmailAddress(from)) {
        from = "emptyFromString@placeholder.com";
    }
    Properties props = new Properties();
    props.put("mail.transport.protocol", "smtp");
    props.put("mail.smtp.host", "localhost");
    props.put("mail.smtp.auth", "false");
    props.put("mail.debug", "false");
    props.put("mail.smtp.port", "3025"); // change back to 25

    Session session = Session.getDefaultInstance(props);

    Transport transport = session.getTransport("smtp");
    transport.connect();

    Message message = new MimeMessage(session);
    message.addHeader("Parent-Info", parent);
    message.addHeader("directory", dir);
    message.setSubject(subject);
    messageBodyPart.setText(email.getBody());
    multipart.addBodyPart(messageBodyPart);
    message.setFrom(new InternetAddress(from));
    message.setRecipients(Message.RecipientType.TO,
            InternetAddress.parse(to));

    try {
        String transportHeaders = email.getTransportMessageHeaders();
        String[] headers = parseTransporHeaders(transportHeaders);
        for (String header : headers) {
            messageBodyPart.addHeaderLine(header);
            multipart.addBodyPart(messageBodyPart);
        }
    } catch (Exception e) {
        log.info("missing chunk is transport headers: " + e);
    }
    try {
        if (email.hasAttachments()) {
            int attachmentIndex = 0;
            while (attachmentIndex < email.getNumberOfAttachments()) {
                PSTAttachment attachment = email.getAttachment(attachmentIndex);
                InputStream in = attachment.getFileInputStream();
                if (attachment.getAttachMethod() != PSTAttachment.ATTACHMENT_METHOD_EMBEDDED
                        && attachment.getAttachMethod() != PSTAttachment.ATTACHMENT_METHOD_OLE) {

                    String filename = attachment.getFilename();
                    // Here is where I called Tika, for use in a method that
                    // has since been deprecated.
                    String mime = tika.detect(in);
                    messageBodyPart = new MimeBodyPart();
                    messageBodyPart.attachFile(file);
                    messageBodyPart.setFileName(filename);
                    multipart.addBodyPart(messageBodyPart);

                } else {
                    log.info("not base 64 file: " + attachment.getFilename());
                }
                in.close();
                attachmentIndex++;
            }
        }
    } catch (Exception e) {
        log.info("failed attaching file to " + e);
    }
    message.setContent(multipart);
    transport.sendMessage(message, message.getAllRecipients());
    transport.close();
}

Following the advice of Ken Krugler, I thought I would share this on this
list to see whether it is an error in my code or an issue in Tika.

Re: problem with the inputstream after calling the detect(InputStream in) method

Posted by kevin slote <ks...@gmail.com>.
Well, there was no error during runtime, it was just that the data was
erased.  After debugging it with a print statement,
System.out.println(in.read());,  I discovered that the InputStream was
being erased after I called the detect(InputStream in) method.


On Mon, Sep 30, 2013 at 11:19 AM, Sergey Beryozkin <sb...@gmail.com> wrote:

> Hi
>
> On 30/09/13 15:49, kevin slote wrote:
>
>> Ok, thanks.  That was my problem.  Also, I read your book and enjoyed it
>> very much.  Is this the forum where I could bring up an issue I found with
>> the Tika-JAX-RS server?
>>
> What kind of issue are you seeing?
>
> Sergey
>
>> On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting <ju...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote <ks...@gmail.com> wrote:
>>>
>>>> InputStream in= attachment.getFileInputStream();
>>>> [...]
>>>> String mime = tika.detect(in);
>>>
>>> See the javadocs [1]: "If the document stream supports the mark
>>> feature, then the stream is marked and reset to the original position
>>> before this method returns"
>>>
>>> I believe the stream you're using does not support the mark feature
>>> (see [2]), which makes it impossible for Tika to restore the original
>>> state of the stream once type detection is done.
>>>
>>> Using BufferedInputStream [3] should fix your problem:
>>>
>>>      InputStream in= new
>>> BufferedInputStream(attachment.getFileInputStream());
>>>
>>> [1]
>>> http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
>>> [2]
>>> http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
>>> [3]
>>> http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>>>
>>> BR,
>>>
>>> Jukka Zitting
>>>
>>
>

Re: problem with the inputstream after calling the detect(InputStream in) method

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi
On 30/09/13 15:49, kevin slote wrote:
> Ok, thanks.  That was my problem.  Also, I read your book and enjoyed it
> very much.  Is this the forum where I could bring up an issue I found with
> the Tika-JAX-RS server?
>
What kind of issue are you seeing?

Sergey
>
> On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting <ju...@gmail.com> wrote:
>
>> Hi,
>>
>> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote <ks...@gmail.com> wrote:
>>> InputStream in= attachment.getFileInputStream();
>>> [...]
>>> String mime = tika.detect(in);
>>
>> See the javadocs [1]: "If the document stream supports the mark
>> feature, then the stream is marked and reset to the original position
>> before this method returns"
>>
>> I believe the stream you're using does not support the mark feature
>> (see [2]), which makes it impossible for Tika to restore the original
>> state of the stream once type detection is done.
>>
>> Using BufferedInputStream [3] should fix your problem:
>>
>>      InputStream in= new
>> BufferedInputStream(attachment.getFileInputStream());
>>
>> [1]
>> http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
>> [2]
>> http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
>> [3]
>> http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>>
>> BR,
>>
>> Jukka Zitting
>>
>


Re: problem with the inputstream after calling the detect(InputStream in) method

Posted by kevin slote <ks...@gmail.com>.
Ok, thanks.  That was my problem.  Also, I read your book and enjoyed it
very much.  Is this the forum where I could bring up an issue I found with
the Tika-JAX-RS server?


On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote <ks...@gmail.com> wrote:
> > InputStream in= attachment.getFileInputStream();
> > [...]
> > String mime = tika.detect(in);
>
> See the javadocs [1]: "If the document stream supports the mark
> feature, then the stream is marked and reset to the original position
> before this method returns"
>
> I believe the stream you're using does not support the mark feature
> (see [2]), which makes it impossible for Tika to restore the original
> state of the stream once type detection is done.
>
> Using BufferedInputStream [3] should fix your problem:
>
>     InputStream in= new
> BufferedInputStream(attachment.getFileInputStream());
>
> [1]
> http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
> [2]
> http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
> [3]
> http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>
> BR,
>
> Jukka Zitting
>

Re: problem with the inputstream after calling the detect(InputStream in) method

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Sep 30, 2013 at 10:36 AM, kevin slote <ks...@gmail.com> wrote:
> InputStream in= attachment.getFileInputStream();
> [...]
> String mime = tika.detect(in);

See the javadocs [1]: "If the document stream supports the mark
feature, then the stream is marked and reset to the original position
before this method returns"

I believe the stream you're using does not support the mark feature
(see [2]), which makes it impossible for Tika to restore the original
state of the stream once type detection is done.

Using BufferedInputStream [3] should fix your problem:

    InputStream in= new BufferedInputStream(attachment.getFileInputStream());

[1] http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
[2] http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
[3] http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html

BR,

Jukka Zitting
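
The behaviour described in this thread can be reproduced with the JDK
alone. The following is only a rough sketch, not Tika's actual code:
peek() is a hypothetical stand-in for a detector that marks and resets
the stream only when markSupported() returns true, and withoutMark() is a
hypothetical stream that, like the attachment stream is believed to do,
reports no mark support. The class and method names are made up for
illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {

    /** Hypothetical stream that does not support mark/reset (assumption: similar to the attachment stream). */
    public static InputStream withoutMark(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readlimit) { /* ignored: mark is not supported */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /**
     * Hypothetical stand-in for a detector: reads up to n bytes and restores
     * the stream position only if the stream supports mark/reset.
     */
    public static int peek(InputStream in, int n) throws IOException {
        boolean canReset = in.markSupported();
        if (canReset) {
            in.mark(n);
        }
        byte[] buf = new byte[n];
        int total = 0;
        int r;
        while (total < n && (r = in.read(buf, total, n - total)) != -1) {
            total += r;
        }
        if (canReset) {
            in.reset();
        }
        return total;
    }

    /** Peeks at the stream, then returns whatever read() yields afterwards. */
    public static int firstByteAfterPeek(InputStream in) throws IOException {
        peek(in, 8);
        return in.read();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {10, 20, 30, 40};
        // No mark support: the peek consumes the data, so read() returns -1.
        System.out.println(firstByteAfterPeek(withoutMark(data)));
        // Wrapped in BufferedInputStream: the position is restored, so read() returns 10.
        System.out.println(firstByteAfterPeek(new BufferedInputStream(withoutMark(data))));
    }
}
```

Running main prints -1 for the unwrapped stream and 10 for the wrapped
one, which matches the -1 reported in the original post and the effect of
the suggested BufferedInputStream wrapper. Calling in.markSupported() on
the attachment stream before detection is a quick way to check whether
this is what is happening.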