Posted to dev@tika.apache.org by kevin slote <ks...@gmail.com> on 2013/09/30 16:36:34 UTC
problem with the inputstream after calling the detect(InputStream in) method
Hi, I have been using Tika for a while now without any problems and I am a
big fan of the software. I wanted to do my part and report what I suspect
might be a bug.
My code uses two libraries, JavaMail and java-libpst, and I am unit
testing with Dumbster. The last unit test I built with Dumbster checks that
all of the attachments are appended correctly when I send the email, and it
failed. After some nitty-gritty debugging, I discovered that if I put a
System.out.println(in.read()); directly before the call to Tika,
it printed the correct number on the console. However, if I used the
same statement after the call to Tika, it printed -1.
public void sendAsEmail(PSTMessage email, String parent, String dir)
        throws IOException, MessagingException, PSTException {
    String subject = email.getSubject();
    String to = primaryRecipientsEmail(email);
    String from = email.getSenderEmailAddress();
    if (!isValidEmailAddress(from)) {
        from = "emptyFromString@placeholder.com";
    }
    Properties props = new Properties();
    props.put("mail.transport.protocol", "smtp");
    props.put("mail.smtp.host", "localhost");
    props.put("mail.smtp.auth", "false");
    props.put("mail.debug", "false");
    props.put("mail.smtp.port", "3025"); // change back to 25
    Session session = Session.getDefaultInstance(props);
    Transport transport = session.getTransport("smtp");
    transport.connect();
    Message message = new MimeMessage(session);
    message.addHeader("Parent-Info", parent);
    message.addHeader("directory", dir);
    message.setSubject(subject);
    messageBodyPart.setText(email.getBody());
    multipart.addBodyPart(messageBodyPart);
    message.setFrom(new InternetAddress(from));
    message.setRecipients(Message.RecipientType.TO,
            InternetAddress.parse(to));
    try {
        String transportHeaders = email.getTransportMessageHeaders();
        String[] headers = parseTransporHeaders(transportHeaders);
        for (String header : headers) {
            messageBodyPart.addHeaderLine(header);
            multipart.addBodyPart(messageBodyPart);
        }
    } catch (Exception e) {
        log.info("missing chunk is transport headers: " + e);
    }
    try {
        if (email.hasAttachments()) {
            int attachmentIndex = 0;
            while (attachmentIndex < email.getNumberOfAttachments()) {
                PSTAttachment attachment = email.getAttachment(attachmentIndex);
                InputStream in = attachment.getFileInputStream();
                if (attachment.getAttachMethod() != PSTAttachment.ATTACHMENT_METHOD_EMBEDDED
                        && attachment.getAttachMethod() != PSTAttachment.ATTACHMENT_METHOD_OLE) {
                    String filename = attachment.getFilename();
                    // Here is where I called Tika, for use in a method that
                    // has since been deprecated.
                    String mime = tika.detect(in);
                    messageBodyPart = new MimeBodyPart();
                    messageBodyPart.attachFile(file);
                    messageBodyPart.setFileName(filename);
                    multipart.addBodyPart(messageBodyPart);
                } else {
                    log.info("not base 64 file: " + attachment.getFilename());
                }
                in.close();
                attachmentIndex++;
            }
        }
    } catch (Exception e) {
        log.info("failed attaching file to " + e);
    }
    message.setContent(multipart);
    transport.sendMessage(message, message.getAllRecipients());
    transport.close();
}
Following the advice of Ken Krugler, I thought I would share this on this
list to see if it was an error in my code or an issue in Tika.
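A note on the diagnostic described above: System.out.println(in.read()) consumes one byte each time it runs, so the check itself moves the stream position. A minimal sketch of a non-consuming alternative follows, using the JDK's PushbackInputStream. The class name PeekDemo and the sample bytes are illustrative assumptions, not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

final class PeekDemo {

    /** Returns the next byte without consuming it, or -1 at end of stream. */
    static int peek(PushbackInputStream in) {
        try {
            int b = in.read();
            if (b != -1) {
                in.unread(b); // put the byte back so later reads still see it
            }
            return b;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample data standing in for an attachment stream (assumption).
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(new byte[] {7, 8}));
        System.out.println(peek(in));   // prints 7
        System.out.println(peek(in));   // prints 7 again: nothing was consumed
        System.out.println(in.read());  // prints 7: the first real read
    }
}
```

For this to help, the rest of the method would have to read from the PushbackInputStream wrapper rather than from the stream it wraps.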
Re: problem with the inputstream after calling the detect(InputStream
in) method
Posted by kevin slote <ks...@gmail.com>.
Well, there was no error during runtime, it was just that the data was
erased. After debugging it with a print statement,
System.out.println(in.read());, I discovered that the InputStream was
being erased after I called the detect(InputStream in) method.
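A return value of -1 from read() means the stream has reached its end, not that the underlying data was deleted: the bytes were consumed by an earlier reader. When the same content has to be read more than once, one option is to drain the stream into memory once and hand out a fresh stream per consumer. The sketch below assumes the attachment is small enough to hold in memory; ReplayableContent is a hypothetical helper, not a Tika or java-libpst class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ReplayableContent {

    private final byte[] bytes;

    private ReplayableContent(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Drains the source once; the caller remains responsible for closing it. */
    static ReplayableContent drain(InputStream source) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = source.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new ReplayableContent(out.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Each call returns an independent stream positioned at the first byte. */
    ByteArrayInputStream open() {
        return new ByteArrayInputStream(bytes);
    }

    int size() {
        return bytes.length;
    }

    public static void main(String[] args) {
        ReplayableContent content =
                drain(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        // One stream could go to type detection, another to the mail attachment.
        System.out.println(content.open().read()); // prints 1
        System.out.println(content.open().read()); // prints 1
    }
}
```

The trade-off is memory: the whole attachment is held as a byte array, which may not suit very large files.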
On Mon, Sep 30, 2013 at 11:19 AM, Sergey Beryozkin <sb...@gmail.com> wrote:
> Hi
>
> On 30/09/13 15:49, kevin slote wrote:
>
>> Ok, thanks. That was my problem. Also, I read your book and enjoyed it
>> very much. Is this the forum where I could bring up an issue I found with
>> the Tika-JAX-RS server?
>
> What kind of issue are you seeing?
>
> Sergey
Re: problem with the inputstream after calling the detect(InputStream
in) method
Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi
On 30/09/13 15:49, kevin slote wrote:
> Ok, thanks. That was my problem. Also, I read your book and enjoyed it
> very much. Is this the forum where I could bring up an issue I found with
> the Tika-JAX-RS server?
>
What kind of issue are you seeing?
Sergey
Re: problem with the inputstream after calling the detect(InputStream
in) method
Posted by kevin slote <ks...@gmail.com>.
Ok, thanks. That was my problem. Also, I read your book and enjoyed it
very much. Is this the forum where I could bring up an issue I found with
the Tika-JAX-RS server?
On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote <ks...@gmail.com> wrote:
> > InputStream in= attachment.getFileInputStream();
> > [...]
> > String mime = tika.detect(in);
>
> See the javadocs [1]: "If the document stream supports the mark
> feature, then the stream is marked and reset to the original position
> before this method returns"
>
> I believe the stream you're using does not support the mark feature
> (see [2]), which makes it impossible for Tika to restore the original
> state of the stream once type detection is done.
>
> Using BufferedInputStream [3] should fix your problem:
>
> InputStream in= new
> BufferedInputStream(attachment.getFileInputStream());
>
> [1]
> http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
> [2]
> http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
> [3]
> http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>
> BR,
>
> Jukka Zitting
>
Re: problem with the inputstream after calling the detect(InputStream
in) method
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Mon, Sep 30, 2013 at 10:36 AM, kevin slote <ks...@gmail.com> wrote:
> InputStream in= attachment.getFileInputStream();
> [...]
> String mime = tika.detect(in);
See the javadocs [1]: "If the document stream supports the mark
feature, then the stream is marked and reset to the original position
before this method returns"
I believe the stream you're using does not support the mark feature
(see [2]), which makes it impossible for Tika to restore the original
state of the stream once type detection is done.
Using BufferedInputStream [3] should fix your problem:
InputStream in= new BufferedInputStream(attachment.getFileInputStream());
[1] http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
[2] http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
[3] http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
BR,
Jukka Zitting
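
The behaviour described in the javadoc quote above can be reproduced with JDK classes alone. In the sketch below, sniff is a hypothetical stand-in for a detector (it is not Tika's implementation), and NoMarkStream is an assumed wrapper that simulates a stream without mark support. The same bytes are read once through the plain stream and once through a BufferedInputStream around it.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class MarkResetDemo {

    /** Simulates a stream that does not support mark/reset. */
    static final class NoMarkStream extends FilterInputStream {
        NoMarkStream(InputStream in) {
            super(in);
        }

        @Override
        public boolean markSupported() {
            return false;
        }

        @Override
        public synchronized void mark(int readlimit) {
            // intentionally a no-op: marking is not supported
        }

        @Override
        public synchronized void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }
    }

    /**
     * Hypothetical detector stand-in: reads up to prefixLength bytes and
     * rewinds only when the stream supports mark/reset.
     */
    static int sniff(InputStream in, int prefixLength) throws IOException {
        boolean canRewind = in.markSupported();
        if (canRewind) {
            in.mark(prefixLength);
        }
        int consumed = 0;
        while (consumed < prefixLength && in.read() != -1) {
            consumed++;
        }
        if (canRewind) {
            in.reset();
        }
        return consumed;
    }

    /** Sniffs the stream, then returns the next byte the caller would see. */
    static int firstByteAfterSniff(InputStream in) {
        try {
            sniff(in, 64);
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {104, 101, 108, 108, 111}; // "hello" in ASCII
        InputStream plain = new NoMarkStream(new ByteArrayInputStream(data));
        InputStream buffered =
                new BufferedInputStream(new NoMarkStream(new ByteArrayInputStream(data)));
        System.out.println(firstByteAfterSniff(plain));    // prints -1
        System.out.println(firstByteAfterSniff(buffered)); // prints 104
    }
}
```

The plain stream is left at its end because the bytes read during sniffing cannot be replayed, while the buffered wrapper keeps them in its internal buffer and rewinds on reset().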