Posted to dev@jmeter.apache.org by Anthony Johnson <an...@gmail.com> on 2012/11/03 21:39:58 UTC

RE: Add Apache Tika in JMeter to extract text from various file type

Would JMeter be able to pull URLs out of these media types using this
library? If so, what a great way to find broken links in documents.

From: Milamber
Sent: 11/3/2012 3:22 PM
To: JMeter Dev List
Subject: Add Apache Tika in JMeter to extract text from various file type

Hello,

I am currently working on adding Apache Tika 1.2 [1] to JMeter to improve
functional testing.

With Tika, you can extract the text from various documents, such as MS
Office (Word, Excel, PowerPoint; 97-2003 and 2007-2010 (OpenXML)),
OpenOffice (Writer, Calc, Impress), HTML, gzip, jar/zip files (list of
contents), and some "multimedia" files like mp3, mp4, flv, etc.

In JMeter, Tika could be used by the View Results Tree to view the text
data of these files, by the Regular Expression Extractor to capture text
from them, and by the Response Assertion to assert on the data.

The drawback: Apache Tika requires either one big jar (25 MB) or a lot of
jar files (see below). With all jars in the binary package, the new size
of the tgz is 45 MB (the JMeter 2.8 tgz is 23 MB).

The question: do you agree to add Tika (and the new capability to "extract
text from documents") to JMeter, given the new binary size?

Secondary question: which is the better way? 1/ Add only tika-app.jar
(which includes all dependencies) [2], or 2/ add several jar files
(tika-core, tika-parsers, etc. + dependencies) [3]

Milamber


[1] http://tika.apache.org/

[2] One jar:
+tika-app.version                = 1.2
+tika-app.jar                    = tika-app-${tika-app.version}.jar
+tika-app.loc                    = ${maven2.repo}/org/apache/tika/tika-app/${tika-app.version}
+tika-app.md5                    = e0ec70c80a6f3b113d8ac1c12a33338f
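
To make it easier to see where such an entry ends up pointing, here is a
minimal sketch of how the ${...} placeholders combine into a download
location. It is not Ant's own property resolution, only a simplified
stand-in. The class name PropertyExpander and the repository URL
https://repo.example.org/maven2 are assumptions for illustration; the real
value of maven2.repo comes from the build and is not shown in this mail.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: expands ${name} placeholders from a property map.
final class PropertyExpander {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${name} with its value; unknown names are left as-is. */
    static String expand(String value, Map<String, String> props) {
        String current = value;
        // Repeat so that values which themselves contain placeholders get resolved.
        for (int pass = 0; pass < 10; pass++) {
            Matcher m = PLACEHOLDER.matcher(current);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                String replacement = props.getOrDefault(m.group(1), m.group(0));
                m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            }
            m.appendTail(out);
            String next = out.toString();
            if (next.equals(current)) {
                break; // nothing left to expand
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        // Assumed repository root, for illustration only.
        props.put("maven2.repo", "https://repo.example.org/maven2");
        props.put("tika-app.version", "1.2");
        props.put("tika-app.jar", "tika-app-${tika-app.version}.jar");
        props.put("tika-app.loc",
                "${maven2.repo}/org/apache/tika/tika-app/${tika-app.version}");
        System.out.println(expand("${tika-app.loc}/${tika-app.jar}", props));
    }
}
```

With the assumed repository root, the jar would be looked up under
org/apache/tika/tika-app/1.2/tika-app-1.2.jar.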

[3] Several jars (I still need to check whether any jar is missing):

+tika-core.version                = 1.2
+tika-core.jar                    = tika-core-${tika-core.version}.jar
+tika-core.loc                    = ${maven2.repo}/org/apache/tika/tika-core/${tika-core.version}
+tika-core.md5                    = 17cfec5a9b28b323375de0692ce5ecb1
+
+tika-parsers.version                = 1.2
+tika-parsers.jar                    = tika-parsers-${tika-parsers.version}.jar
+tika-parsers.loc                    = ${maven2.repo}/org/apache/tika/tika-parsers/${tika-parsers.version}
+tika-parsers.md5                    = a15b071726358fd195d5c4b0625cdfb5
+
+netcdf.version                = 4.2-min
+netcdf.jar                    = netcdf-${netcdf.version}.jar
+netcdf.loc                    = ${maven2.repo}/edu/ucar/netcdf/${netcdf.version}
+netcdf.md5                    = eb00b40b0511f0fc1dfcfc9cb89e3c53
+
+apache-mime4j-core.version                = 0.7.2
+apache-mime4j-core.jar                    = apache-mime4j-core-${apache-mime4j-core.version}.jar
+apache-mime4j-core.loc                    = ${maven2.repo}/org/apache/james/apache-mime4j-core/${apache-mime4j-core.version}
+apache-mime4j-core.md5                    = 88f799546eca803c53eee01a4ce5edcd
+
+apache-mime4j-dom.version                = 0.7.2
+apache-mime4j-dom.jar                    = apache-mime4j-dom-${apache-mime4j-dom.version}.jar
+apache-mime4j-dom.loc                    = ${maven2.repo}/org/apache/james/apache-mime4j-dom/${apache-mime4j-dom.version}
+apache-mime4j-dom.md5                    = dedc747b5c367fbd7f8a7235d1d7cbee
+
+commons-compress.version                = 1.4.1
+commons-compress.jar                    = commons-compress-${commons-compress.version}.jar
+commons-compress.loc                    = ${maven2.repo}/org/apache/commons/commons-compress/${commons-compress.version}
+commons-compress.md5                    = 7f7ff9255a831325f38a170992b70073
+
+pdfbox.version                = 1.7.0
+pdfbox.jar                    = pdfbox-${pdfbox.version}.jar
+pdfbox.loc                    = ${maven2.repo}/org/apache/pdfbox/pdfbox/${pdfbox.version}
+pdfbox.md5                    = da9ff2f1b43dc92b15fe3ba39a1cddcd
+
+fontbox.version                = 1.7.0
+fontbox.jar                    = fontbox-${fontbox.version}.jar
+fontbox.loc                    = ${maven2.repo}/org/apache/pdfbox/fontbox/${fontbox.version}
+fontbox.md5                    = 9e03f94d92af257facb148c138af22fa
+
+jempbox.version                = 1.7.0
+jempbox.jar                    = jempbox-${jempbox.version}.jar
+jempbox.loc                    = ${maven2.repo}/org/apache/pdfbox/jempbox/${jempbox.version}
+jempbox.md5                    = 69dfbd6872c29f89a4df1179dd54b44e
+
+poi.version                = 3.8
+poi.jar                    = poi-${poi.version}.jar
+poi.loc                    = ${maven2.repo}/org/apache/poi/poi/${poi.version}
+poi.md5                    = 5c915f48922046c71121fd7021aa23cb
+
+poi-scratchpad.version                = 3.8
+poi-scratchpad.jar                    = poi-scratchpad-${poi-scratchpad.version}.jar
+poi-scratchpad.loc                    = ${maven2.repo}/org/apache/poi/poi-scratchpad/${poi-scratchpad.version}
+poi-scratchpad.md5                    = 7427b6b9e53dcee57d382ba022efc3be
+
+poi-ooxml.version                = 3.8
+poi-ooxml.jar                    = poi-ooxml-${poi-ooxml.version}.jar
+poi-ooxml.loc                    = ${maven2.repo}/org/apache/poi/poi-ooxml/${poi-ooxml.version}
+poi-ooxml.md5                    = 8f147b248f078799c24c8714f185b1a8
+
+geronimo-stax-api_1.0_spec.version                = 1.0.1
+geronimo-stax-api_1.0_spec.jar                    = geronimo-stax-api_1.0_spec-${geronimo-stax-api_1.0_spec.version}.jar
+geronimo-stax-api_1.0_spec.loc                    = ${maven2.repo}/org/apache/geronimo/specs/geronimo-stax-api_1.0_spec/${geronimo-stax-api_1.0_spec.version}
+geronimo-stax-api_1.0_spec.md5                    = b7c2a715cd3d1c43dc4ccfae426e8e2e
+
+tagsoup.version                = 1.2.1
+tagsoup.jar                    = tagsoup-${tagsoup.version}.jar
+tagsoup.loc                    = ${maven2.repo}/org/ccil/cowan/tagsoup/tagsoup/${tagsoup.version}
+tagsoup.md5                    = ae73a52cdcbec10cd61d9ef22fab5936
+
+asm.version                = 3.1
+asm.jar                    = asm-${asm.version}.jar
+asm.loc                    = ${maven2.repo}/org/ow2/util/asm/asm/${asm.version}
+asm.md5                    = b1a36e247bf18fb4da46ce3a54627d1b
+
+isoparser.version                = 1.0-RC-1
+isoparser.jar                    = isoparser-${isoparser.version}.jar
+isoparser.loc                    = ${maven2.repo}/com/googlecode/mp4parser/isoparser/${isoparser.version}
+isoparser.md5                    = b0444fde2290319c9028564c3c3ff1ab
+
+metadata-extractor.version                = 2.4.0-beta-1
+metadata-extractor.jar                    = metadata-extractor-${metadata-extractor.version}.jar
+metadata-extractor.loc                    = ${maven2.repo}/com/drewnoakes/metadata-extractor/${metadata-extractor.version}
+metadata-extractor.md5                    = 6e0ad2f0fe78047cb34ec056b39633d3
+
+boilerpipe.version                = 1.1.0
+boilerpipe.jar                    = boilerpipe-${boilerpipe.version}.jar
+boilerpipe.loc                    = ${maven2.repo}/de/l3s/boilerpipe/boilerpipe/${boilerpipe.version}
+boilerpipe.md5                    = 0616568083786d0f49e2cb07a5d09fe4
+
+rome.version                = 0.9
+rome.jar                    = rome-${rome.version}.jar
+rome.loc                    = ${maven2.repo}/rome/rome/${rome.version}
+rome.md5                    = 19589699b01c59ccb4d5e61e4c78b311
+
+vorbis-java-core.version                = 0.1
+vorbis-java-core.jar                    = vorbis-java-core-${vorbis-java-core.version}.jar
+vorbis-java-core.loc                    = ${maven2.repo}/org/gagravarr/vorbis-java-core/${vorbis-java-core.version}
+vorbis-java-core.md5                    = b88115be2754cb6883e652ba68ca46c8
+
+juniversalchardet.version                = 1.0.3
+juniversalchardet.jar                    = juniversalchardet-${juniversalchardet.version}.jar
+juniversalchardet.loc                    = ${maven2.repo}/com/googlecode/juniversalchardet/juniversalchardet/${juniversalchardet.version}
+juniversalchardet.md5                    = d9ea0a9a275336c175b343f2e4cd8f27
+
+xz.version                = 1.1
+xz.jar                    = xz-${xz.version}.jar
+xz.loc                    = ${maven2.repo}/org/tukaani/xz/${xz.version}
+xz.md5                    = 4d0ba9643c8f3f7c6721be3a1286da1c
+
+dom4j.version                 = 1.6.1
+dom4j.jar                = dom4j-${dom4j.version}.jar
+dom4j.loc                = ${maven2.repo}/dom4j/dom4j/${dom4j.version}
+dom4j.md5                = 4d8f51d3fe3900efc6e395be48030d6d
+
+xmlbeans.version                 = 2.6.0
+xmlbeans.jar                = xmlbeans-${xmlbeans.version}.jar
+xmlbeans.loc                = ${maven2.repo}/org/apache/xmlbeans/xmlbeans/${xmlbeans.version}
+xmlbeans.md5                = 6591c08682d613194dacb01e95c78c2c
+
+poi-ooxml-schemas.version                 = 3.8
+poi-ooxml-schemas.jar                = poi-ooxml-schemas-${poi-ooxml-schemas.version}.jar
+poi-ooxml-schemas.loc                = ${maven2.repo}/org/apache/poi/poi-ooxml-schemas/${poi-ooxml-schemas.version}
+poi-ooxml-schemas.md5                = 7ebcffdc4d82b2b8cbc6464d4543cd07
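
Each entry above pairs a jar with an .md5 value. As a rough illustration of
what checking a downloaded jar against such a value could look like, here is
a small sketch using only the JDK. It is not how the JMeter build performs
the check; the class name Md5Check and its methods are made up for this
example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: compares a file's MD5 with an expected hex string.
final class Md5Check {

    /** Returns the MD5 digest of the data as 32 lowercase hex characters. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Reads the whole file into memory, which is acceptable for jars of this size. */
    static boolean matches(Path jar, String expectedMd5) throws IOException {
        return md5Hex(Files.readAllBytes(jar)).equalsIgnoreCase(expectedMd5.trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage: java Md5Check <path-to-jar> <expected-md5>
            System.out.println(matches(Paths.get(args[0]), args[1]) ? "OK" : "MISMATCH");
        } else {
            System.out.println(md5Hex("hello".getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

For example, running it with the path to a downloaded tika-app-1.2.jar and
the tika-app.md5 value from [2] would print OK or MISMATCH.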

Re: Add Apache Tika in JMeter to extract text from various file type

Posted by Milamber <mi...@apache.org>.

On 03/11/2012 20:39, Anthony Johnson wrote:
> Would jmeter be able to pull out URLs from these media types using this
> library? What a great way to find broken links in documents if so.

Apache Tika can extract all URL links (website URLs, embedded
images, etc.) with the LinkContentHandler [1] class; see [2] for an
example on the "HTML" document type.
I have successfully tested URL extraction with my development build of
JMeter+Tika (in a viewer in the View Results Tree) on several document
types (docx, odt, ods, html, pdf); doc does not seem to work.
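
Some background on why one handler can cover several formats: Tika parsers
report document content as XHTML SAX events, so a handler that watches for
link-bearing elements sees the same kind of events whatever the source
format. The sketch below is not Tika's LinkContentHandler. It is a stand-in
written against the JDK's own SAX classes so that it runs without Tika on
the classpath. The class name LinkCollector, the helper extractLinks, the
set of elements it watches, and the example.org URLs are all assumptions
for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a link-collecting SAX handler.
final class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A parser that is not namespace-aware delivers the element name in qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        String attribute = null;
        if ("a".equals(name) || "link".equals(name)) {
            attribute = "href";
        } else if ("img".equals(name) || "script".equals(name) || "iframe".equals(name)) {
            attribute = "src";
        }
        if (attribute != null) {
            String target = atts.getValue(attribute);
            if (target != null && !target.isEmpty()) {
                links.add(target);
            }
        }
    }

    List<String> getLinks() {
        return links;
    }

    /** Parses well-formed XHTML and returns link targets in document order. */
    static List<String> extractLinks(String xhtml) {
        try {
            LinkCollector collector = new LinkCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.getLinks();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body>"
                + "<a href=\"http://example.org/page\">a page</a>"
                + "<img src=\"http://example.org/logo.png\"/>"
                + "</body></html>";
        for (String link : extractLinks(xhtml)) {
            System.out.println(link);
        }
    }
}
```

In a real JMeter integration, the SAX events would come from a Tika parser
rather than from an XHTML string, and the collected URLs could then be
requested to look for broken links.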

Now, if we want to add a "URL extractor from Document", we need to think
about the best way to add it to JMeter.

Milamber

[1] http://tika.apache.org/1.2/api/index.html?org/apache/tika/sax/LinkContentHandler.html
[2] http://chrisjordan.ca/post/15219674437/parsing-html-with-apache-tika