Posted to user@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/06/30 21:05:06 UTC
RE: Stack Overflow Question
DefaultHandler is effectively a NullHandler; it doesn't store or do anything.
Try BodyContentHandler or ToXMLContentHandler or maybe WriteOutContentHandler.
If you want to write out each embedded file as a binary, try implementing EmbeddedResourceHandler.
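A minimal sketch of the point above, using only the JDK's SAX classes rather than Tika itself. It shows that a plain DefaultHandler discards every event and that its toString() is just the default object reference, while a handler that overrides characters() keeps the text. The names TextCollectingHandler and HandlerDemo are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not a Tika class.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep the character data that DefaultHandler would silently drop.
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class HandlerDemo {
    // Feeds the XML to the handler through the JDK SAX parser and
    // returns whatever the handler reports from toString().
    static String parseWith(DefaultHandler handler, String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xml = "<body><p>embed_4</p></body>";
        // Prints an object reference such as org.xml.sax.helpers.DefaultHandler@...
        System.out.println(parseWith(new DefaultHandler(), xml));
        // Prints the collected character data.
        System.out.println(parseWith(new TextCollectingHandler(), xml));
    }
}
```

In Tika, BodyContentHandler and ToXMLContentHandler play the role of the collecting handler: their toString() returns the accumulated output.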
QUOTE:
http://stackoverflow.com/questions/24495504/unable-tp-read-zipfile-using-apache-tika?sem=2
I am using Apache Tika 1.5 to parse the contents of a zip file.
Here's my sample code:
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
InputStream stream = null;
try {
    stream = TikaInputStream.get(new File(zipFilePath));
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
try {
    parser.parse(stream, handler, metadata, context);
    logger.info("Content:\t" + handler.toString());
} catch (IOException e) {
    e.printStackTrace();
} catch (SAXException e) {
    e.printStackTrace();
} catch (TikaException e) {
    e.printStackTrace();
} finally {
    try {
        stream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
In the logger statement all I see is org.xml.sax.helpers.DefaultHandler@5bd8e367.
I am missing something but can't figure out what; I'm looking for some help.
-----Original Message-----
From: yeshwanth kumar [mailto:yeshwanth43@gmail.com]
Sent: Monday, June 30, 2014 1:28 PM
To: dev@tika.apache.org
Subject: Stack Overflow Question
Unable tp read zipfile using Apache Tika
http://stackoverflow.com/q/24495504/1899893?sem=2
RE: Stack Overflow Question
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Good to hear. Let us know if you have any other questions or when you run into surprises.
From: yeshwanth kumar [mailto:yeshwanth43@gmail.com]
Sent: Tuesday, July 01, 2014 10:23 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question
Hi Tim,
I forgot to change BodyContentHandler to ToXMLContentHandler in RecursiveMetadataParser; I had changed it only in my
calling method.
Now I am getting the entire document in the structure you described.
Thanks a ton.
-yeshwanth
On Tue, Jul 1, 2014 at 7:16 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
Hmmm….
When I use the ToXMLContentHandler on the test doc submitted with TIKA-1329, I see this:
<div class="embedded" id="embed4.zip" />
<div class="package-entry"><h1>embed4.zip</h1>
<div class="embedded" id="embed4.txt" />
<div class="package-entry"><h1>embed4.txt</h1>
<p>embed_4</p>
</div>
</div>
</div>
</div>
That’s a text file inside of a zip file that is itself embedded. I could see doing some parsing on the XML to scrape out the <div class="package-entry"> contents and grab the file name from the <h1> element.
If I committed TIKA-1329, would that be of any use to you? That returns a list of metadata objects. There is one metadata object per embedded file. The text content of each file can be retrieved from each metadata object by this key: "tika:content".
Best,
Tim
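The XML-scraping idea in the message above could be sketched roughly as follows, using only the JDK's DOM parser. This is an illustration under assumptions, not the TIKA-1329 approach: it assumes the handler output is well-formed XML, that each file is wrapped in a div with class "package-entry", and that the file name is the first h1 directly inside it. PackageEntryScraper is a hypothetical name.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not a Tika class.
class PackageEntryScraper {
    // Maps each package-entry's <h1> file name to the text of its direct <p> children.
    static Map<String, String> scrape(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        Map<String, String> contentByName = new LinkedHashMap<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!"package-entry".equals(div.getAttribute("class"))) {
                continue;
            }
            String name = null;
            StringBuilder text = new StringBuilder();
            // Only look at direct children so nested entries keep their own content.
            for (Node child = div.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (!(child instanceof Element)) {
                    continue;
                }
                String tag = ((Element) child).getTagName();
                if ("h1".equals(tag) && name == null) {
                    name = child.getTextContent();
                } else if ("p".equals(tag)) {
                    if (text.length() > 0) {
                        text.append('\n');
                    }
                    text.append(child.getTextContent());
                }
            }
            if (name != null) {
                contentByName.put(name, text.toString());
            }
        }
        return contentByName;
    }
}
```

A container entry such as a nested zip ends up with an empty string, and its inner files appear as separate keys. If two entries share a file name, the later one overwrites the earlier one in this sketch.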
From: yeshwanth kumar [mailto:yeshwanth43@gmail.com]
Sent: Tuesday, July 01, 2014 9:00 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question
The output is the same even with ToXMLContentHandler.
On Tue, Jul 1, 2014 at 5:59 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
Did you try the ToXMLContentHandler?
From: yeshwanth kumar [mailto:yeshwanth43@gmail.com]
Sent: Monday, June 30, 2014 4:50 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question
Hi Tim,
I tried every way I could think of.
Instead of reading the entire zip file, I parsed the individual zip entries,
but even then I hit exceptions such as:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@37ba3e33
Caused by: java.io.IOException: Invalid header signature; read 0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
org.apache.tika.exception.TikaException: Unable to unpack document stream
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a
org.apache.tika.exception.TikaException: Error creating OOXML extractor
Any suggestions regarding these issues?
Thanks,
yeshwanth
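One way to read the "Invalid header signature" message above is to turn the two 64-bit values back into the bytes as they would sit in the file. The sketch below assumes the signature was read as a little-endian long; that assumption is at least consistent with the expected value, which then decodes to the well-known OLE2 magic bytes D0 CF 11 E0 A1 B1 1A E1. HeaderSignature is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Tika or POI.
class HeaderSignature {
    // Returns the 8 bytes in file order, assuming the value was read little-endian.
    static byte[] fileOrderBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Interprets the file-order bytes as ASCII text.
    static String asAscii(long signature) {
        return new String(fileOrderBytes(signature), StandardCharsets.US_ASCII);
    }

    static String asHex(long signature) {
        StringBuilder hex = new StringBuilder();
        for (byte b : fileOrderBytes(signature)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(asHex(0xE11AB1A1E011CFD0L));   // D0 CF 11 E0 A1 B1 1A E1
        System.out.println(asAscii(0x725020706968736EL)); // nship Pr
    }
}
```

Under that assumption, the bytes actually read are plain ASCII text ("nship Pr"), which suggests the stream handed to the OfficeParser did not start at the beginning of an OLE2 file, either because it was not an OLE2 document at all or because the stream had already been partly consumed.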
On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar <ye...@gmail.com> wrote:
Hi Tim,
Thanks for sharing the resources, but I am unable to figure out how to implement this in my code.
What I don't understand is the flow and the recursive steps. When I ran the RecursiveMetadataParser,
it still gave the same kind of output: file names combined with the content of the files.
I am totally confused.
On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. <ta...@mitre.org> wrote:
Or use the ToXMLContentHandler and parse the XML?
From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: user@tika.apache.org
Subject: RE: Stack Overflow Question
Might want to look into RecursiveMetadata Parser
http://wiki.apache.org/tika/RecursiveMetadata
Or
https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true
From: yeshwanth kumar [mailto:yeshwanth43@gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question
Hi Tim,
Thanks for the quick reply.
I changed the content handler to BodyContentHandler and got an exception about exceeding the maximum character limit,
so I passed -1 to the BodyContentHandler constructor.
Now there is another problem: the file names and the content are both present in the string returned from handler.toString().
How can I map a file name to its content?
Thanks,
yeshwanth
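For the question of mapping a file name to its content, one possible shape is to walk the zip entries yourself and keep a map keyed by entry name. The sketch below uses only java.util.zip, not Tika, and simply decodes each entry as UTF-8 text; that decoding step is a stand-in for whatever per-entry extraction is actually wanted. ZipEntryText and sampleZip are hypothetical names, and the sample archive is built in memory purely for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; plain java.util.zip, no Tika involved.
class ZipEntryText {
    // Reads every non-directory entry and maps its name to its bytes decoded as UTF-8.
    static Map<String, String> readEntries(InputStream in) {
        Map<String, String> contentByName = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the archive.
                while ((n = zip.read(buffer)) != -1) {
                    bytes.write(buffer, 0, n);
                }
                contentByName.put(entry.getName(),
                        new String(bytes.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return contentByName;
    }

    // Builds a two-entry archive in memory so the example needs no files on disk.
    static byte[] sampleZip() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("alpha".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docs/b.txt"));
            zip.write("beta".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> entries = readEntries(new ByteArrayInputStream(sampleZip()));
        entries.forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

Buffering each entry fully before processing it means whatever consumes the bytes cannot disturb the position of the underlying zip stream, at the cost of holding one entry in memory at a time.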