Posted to dev@jmeter.apache.org by Milamber <mi...@apache.org> on 2012/11/03 20:23:58 UTC

Add Apache Tika in JMeter to extract text from various file type

Hello,

I am currently working on adding Apache Tika 1.2 [1] to JMeter to improve
functional tests.

With Tika, you can extract the text from various documents, such as MS
Office (Word, Excel, PowerPoint 97-2003, 2007-2010 (OpenXML)), OpenOffice
(Writer, Calc, Impress), HTML, Gz, jar/zip files (list of contents), and
some "multimedia" files like mp3, mp4, flv, etc.

In JMeter, Tika can be used by the View Results Tree to view the text
data of these files, by the Regular Expression Extractor to capture text
from these files, and by the Response Assertion to assert on the data.

The drawback is that Apache Tika requires either one big jar (25 MB) or a
lot of jar files (see below). With all jars in the binary package, the new
size (for tgz) is 45 MB (JMeter 2.8 tgz: 23 MB).

The question: do you agree to add Tika (and the new capability to "extract
text from Document") to JMeter with the new binary size?

Secondary question: which is the better way? 1/ Add only tika-app.jar (which
includes all dependencies) [2], or 2/ Add several jar files (tika-core,
tika-parsers, etc. + dependencies) [3]

Milamber


[1] http://tika.apache.org/

[2] One Jar :
+tika-app.version                = 1.2
+tika-app.jar                    = tika-app-${tika-app.version}.jar
+tika-app.loc                    = ${maven2.repo}/org/apache/tika/tika-app/${tika-app.version}
+tika-app.md5                    = e0ec70c80a6f3b113d8ac1c12a33338f

[3] Several jars (I must check whether any jar is missing)

+tika-core.version                = 1.2
+tika-core.jar                    = tika-core-${tika-core.version}.jar
+tika-core.loc                    = ${maven2.repo}/org/apache/tika/tika-core/${tika-core.version}
+tika-core.md5                    = 17cfec5a9b28b323375de0692ce5ecb1
+
+tika-parsers.version                = 1.2
+tika-parsers.jar                    = tika-parsers-${tika-parsers.version}.jar
+tika-parsers.loc                    = ${maven2.repo}/org/apache/tika/tika-parsers/${tika-parsers.version}
+tika-parsers.md5                    = a15b071726358fd195d5c4b0625cdfb5
+
+
+tika-parsers.version                = 1.2
+tika-parsers.jar                    = tika-parsers-${tika-parsers.version}.jar
+tika-parsers.loc                    = ${maven2.repo}/org/apache/tika/tika-parsers/${tika-parsers.version}
+tika-parsers.md5                    = a15b071726358fd195d5c4b0625cdfb5
+
+netcdf.version                = 4.2-min
+netcdf.jar                    = netcdf-${netcdf.version}.jar
+netcdf.loc                    = ${maven2.repo}/edu/ucar/netcdf/${netcdf.version}
+netcdf.md5                    = eb00b40b0511f0fc1dfcfc9cb89e3c53
+
+apache-mime4j-core.version                = 0.7.2
+apache-mime4j-core.jar                    = apache-mime4j-core-${apache-mime4j-core.version}.jar
+apache-mime4j-core.loc                    = ${maven2.repo}/org/apache/james/apache-mime4j-core/${apache-mime4j-core.version}
+apache-mime4j-core.md5                    = 88f799546eca803c53eee01a4ce5edcd
+
+apache-mime4j-dom.version                = 0.7.2
+apache-mime4j-dom.jar                    = apache-mime4j-dom-${apache-mime4j-dom.version}.jar
+apache-mime4j-dom.loc                    = ${maven2.repo}/org/apache/james/apache-mime4j-dom/${apache-mime4j-dom.version}
+apache-mime4j-dom.md5                    = dedc747b5c367fbd7f8a7235d1d7cbee
+
+commons-compress.version                = 1.4.1
+commons-compress.jar                    = commons-compress-${commons-compress.version}.jar
+commons-compress.loc                    = ${maven2.repo}/org/apache/commons/commons-compress/${commons-compress.version}
+commons-compress.md5                    = 7f7ff9255a831325f38a170992b70073
+
+pdfbox.version                = 1.7.0
+pdfbox.jar                    = pdfbox-${pdfbox.version}.jar
+pdfbox.loc                    = ${maven2.repo}/org/apache/pdfbox/pdfbox/${pdfbox.version}
+pdfbox.md5                    = da9ff2f1b43dc92b15fe3ba39a1cddcd
+
+fontbox.version                = 1.7.0
+fontbox.jar                    = fontbox-${fontbox.version}.jar
+fontbox.loc                    = ${maven2.repo}/org/apache/pdfbox/fontbox/${fontbox.version}
+fontbox.md5                    = 9e03f94d92af257facb148c138af22fa
+
+jempbox.version                = 1.7.0
+jempbox.jar                    = jempbox-${jempbox.version}.jar
+jempbox.loc                    = ${maven2.repo}/org/apache/pdfbox/jempbox/${jempbox.version}
+jempbox.md5                    = 69dfbd6872c29f89a4df1179dd54b44e
+
+poi.version                = 3.8
+poi.jar                    = poi-${poi.version}.jar
+poi.loc                    = ${maven2.repo}/org/apache/poi/poi/${poi.version}
+poi.md5                    = 5c915f48922046c71121fd7021aa23cb
+
+poi-scratchpad.version                = 3.8
+poi-scratchpad.jar                    = poi-scratchpad-${poi-scratchpad.version}.jar
+poi-scratchpad.loc                    = ${maven2.repo}/org/apache/poi/poi-scratchpad/${poi-scratchpad.version}
+poi-scratchpad.md5                    = 7427b6b9e53dcee57d382ba022efc3be
+
+poi-ooxml.version                = 3.8
+poi-ooxml.jar                    = poi-ooxml-${poi-ooxml.version}.jar
+poi-ooxml.loc                    = ${maven2.repo}/org/apache/poi/poi-ooxml/${poi-ooxml.version}
+poi-ooxml.md5                    = 8f147b248f078799c24c8714f185b1a8
+
+geronimo-stax-api_1.0_spec.version                = 1.0.1
+geronimo-stax-api_1.0_spec.jar                    = geronimo-stax-api_1.0_spec-${geronimo-stax-api_1.0_spec.version}.jar
+geronimo-stax-api_1.0_spec.loc                    = ${maven2.repo}/org/apache/geronimo/specs/geronimo-stax-api_1.0_spec/${geronimo-stax-api_1.0_spec.version}
+geronimo-stax-api_1.0_spec.md5                    = b7c2a715cd3d1c43dc4ccfae426e8e2e
+
+tagsoup.version                = 1.2.1
+tagsoup.jar                    = tagsoup-${tagsoup.version}.jar
+tagsoup.loc                    = ${maven2.repo}/org/ccil/cowan/tagsoup/tagsoup/${tagsoup.version}
+tagsoup.md5                    = ae73a52cdcbec10cd61d9ef22fab5936
+
+asm.version                = 3.1
+asm.jar                    = asm-${asm.version}.jar
+asm.loc                    = ${maven2.repo}/org/ow2/util/asm/asm/${asm.version}
+asm.md5                    = b1a36e247bf18fb4da46ce3a54627d1b
+
+isoparser.version                = 1.0-RC-1
+isoparser.jar                    = isoparser-${isoparser.version}.jar
+isoparser.loc                    = ${maven2.repo}/com/googlecode/mp4parser/isoparser/${isoparser.version}
+isoparser.md5                    = b0444fde2290319c9028564c3c3ff1ab
+
+metadata-extractor.version                = 2.4.0-beta-1
+metadata-extractor.jar                    = metadata-extractor-${metadata-extractor.version}.jar
+metadata-extractor.loc                    = ${maven2.repo}/com/drewnoakes/metadata-extractor/${metadata-extractor.version}
+metadata-extractor.md5                    = 6e0ad2f0fe78047cb34ec056b39633d3
+
+boilerpipe.version                = 1.1.0
+boilerpipe.jar                    = boilerpipe-${boilerpipe.version}.jar
+boilerpipe.loc                    = ${maven2.repo}/de/l3s/boilerpipe/boilerpipe/${boilerpipe.version}
+boilerpipe.md5                    = 0616568083786d0f49e2cb07a5d09fe4
+
+rome.version                = 0.9
+rome.jar                    = rome-${rome.version}.jar
+rome.loc                    = ${maven2.repo}/rome/rome/${rome.version}
+rome.md5                    = 19589699b01c59ccb4d5e61e4c78b311
+
+vorbis-java-core.version                = 0.1
+vorbis-java-core.jar                    = vorbis-java-core-${vorbis-java-core.version}.jar
+vorbis-java-core.loc                    = ${maven2.repo}/org/gagravarr/vorbis-java-core/${vorbis-java-core.version}
+vorbis-java-core.md5                    = b88115be2754cb6883e652ba68ca46c8
+
+juniversalchardet.version                = 1.0.3
+juniversalchardet.jar                    = juniversalchardet-${juniversalchardet.version}.jar
+juniversalchardet.loc                    = ${maven2.repo}/com/googlecode/juniversalchardet/juniversalchardet/${juniversalchardet.version}
+juniversalchardet.md5                    = d9ea0a9a275336c175b343f2e4cd8f27
+
+xz.version                = 1.1
+xz.jar                    = xz-${xz.version}.jar
+xz.loc                    = ${maven2.repo}/org/tukaani/xz/${xz.version}
+xz.md5                    = 4d0ba9643c8f3f7c6721be3a1286da1c
+
+dom4j.version                 = 1.6.1
+dom4j.jar                = dom4j-${dom4j.version}.jar
+dom4j.loc                = ${maven2.repo}/dom4j/dom4j/${dom4j.version}
+dom4j.md5                = 4d8f51d3fe3900efc6e395be48030d6d
+
+xmlbeans.version                 = 2.6.0
+xmlbeans.jar                = xmlbeans-${xmlbeans.version}.jar
+xmlbeans.loc                = ${maven2.repo}/org/apache/xmlbeans/xmlbeans/${xmlbeans.version}
+xmlbeans.md5                = 6591c08682d613194dacb01e95c78c2c
+
+poi-ooxml.version                 = 3.8
+poi-ooxml.jar                = poi-ooxml-${poi-ooxml.version}.jar
+poi-ooxml.loc                = ${maven2.repo}/org/apache/poi/poi-ooxml/${poi-ooxml.version}
+poi-ooxml.md5                = 8f147b248f078799c24c8714f185b1a8
+
+poi-ooxml-schemas.version                 = 3.8
+poi-ooxml-schemas.jar                = poi-ooxml-schemas-${poi-ooxml-schemas.version}.jar
+poi-ooxml-schemas.loc                = ${maven2.repo}/org/apache/poi/poi-ooxml-schemas/${poi-ooxml-schemas.version}
+poi-ooxml-schemas.md5                = 7ebcffdc4d82b2b8cbc6464d4543cd07
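
The .md5 entries in the lists above look like checksums for the downloaded jars. Assuming the build compares each downloaded file against its listed value (an assumption; the message does not say so), such a check could be sketched roughly as follows. Md5Check and its methods are hypothetical names, and the sample input in main is an empty byte array, not one of the jars listed above.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: compares the MD5 of some bytes with an expected hex string. */
final class Md5Check {

    private Md5Check() {
    }

    /** Returns the MD5 digest of the data as 32 lowercase hex characters. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                // Mask to 0..255 so negative bytes are not sign-extended.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True if the data hashes to the expected value (case and surrounding spaces ignored). */
    static boolean matches(byte[] data, String expectedHex) {
        return md5Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) {
        // In a real build the bytes would come from the downloaded jar file.
        System.out.println(md5Hex(new byte[0]));
    }
}
```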




Re: Add Apache Tika in JMeter to extract text from various file type

Posted by sebb <se...@gmail.com>.
On 5 November 2012 20:05, Philippe Mouawad <ph...@gmail.com> wrote:
> But wouldn't this make setup more complex and error-prone?
> See the nightly build experience: a lot of people miss the fact that they must copy
> the lib folder in the first zip.
>
> It would not work out of the box anymore as it does now.

JMeter would work, provided that the missing features were not used.

> Isn't it too much work for just a size concern?

I don't think so otherwise I would not have raised the issue.

> Sebb, what do you mean by catching exceptions?

Exactly that.

AIUI only two jars are needed to use Tika; I assume that the other
jars are referenced automatically from tika-core or tika-parsers.
We just need to catch whatever error is generated when Tika cannot
load the required jars.
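
As a rough illustration of this approach (not code from the JMeter source), the sketch below wraps an extraction call and catches the error raised when a class cannot be loaded. On a JVM, a class that was present at compile time but is missing at run time usually surfaces as NoClassDefFoundError, which is an Error (a subclass of LinkageError) rather than an Exception, so that is what the sketch catches. The class OptionalExtraction, the method extractOrExplain, and the Supplier used in place of a real Tika call are all hypothetical.

```java
import java.util.function.Supplier;

/** Hypothetical wrapper: run an optional feature and degrade gracefully if its jars are absent. */
final class OptionalExtraction {

    private OptionalExtraction() {
    }

    /**
     * Runs the given extractor (a stand-in for a real Tika call) and returns its text.
     * If a required class cannot be loaded, returns an explanatory message instead.
     */
    static String extractOrExplain(Supplier<String> extractor) {
        try {
            return extractor.get();
        } catch (LinkageError e) {
            // NoClassDefFoundError extends LinkageError, so it is caught here.
            return "Document text unavailable, a required jar seems to be missing: " + e;
        }
    }

    public static void main(String[] args) {
        // Normal case: the extractor works.
        System.out.println(extractOrExplain(() -> "extracted text"));
        // Simulated missing jar: the extractor fails while loading a class.
        System.out.println(extractOrExplain(() -> {
            throw new NoClassDefFoundError("org/apache/tika/Tika");
        }));
    }
}
```

The try/catch costs essentially nothing when no error is thrown, so the check only matters on the failure path.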

> Is it only the first time or on every call? If so, wouldn't it negatively
> impact performance?

There would be no performance impact if the required jars are present,
unlike if we used dynamic loading.

If some jars are missing, then some functionality would not work.
This is similar to what already happens if someone uses a 3rd party
add-on and forgets to install the jar.
However, hopefully we could improve the error reporting in the Tika case.

> Regards
> Philippe
> --
> Cordialement.
> Philippe Mouawad.

Re: Add Apache Tika in JMeter to extract text from various file type

Posted by Philippe Mouawad <ph...@gmail.com>.
But wouldn't this make setup more complex and error-prone?
See the nightly build experience: a lot of people miss the fact that they must copy
the lib folder in the first zip.

It would not work out of the box anymore as it does now.
Isn't it too much work for just a size concern?

Sebb, what do you mean by catching exceptions?
Is it only the first time or on every call? If so, wouldn't it negatively
impact performance?
Regards
Philippe


-- 
Cordialement.
Philippe Mouawad.

Re: Add Apache Tika in JMeter to extract text from various file type

Posted by sebb <se...@gmail.com>.
On 5 November 2012 14:00, Milamber <mi...@apache.org> wrote:
>
>
> Le 05/11/2012 11:26, sebb a ecrit :
>
>> On 3 November 2012 19:23, Milamber<mi...@apache.org>  wrote:
>>>
>>> Hello,
>>>
>>> Currently, I work to add Apache Tika 1.2 [1] in JMeter to improve
>>> functional
>>> tests.
>>>
>>> With Tika, you can extract the text form various documents, like MS
>>> Office
>>> (Word, Excel, PowerPoint 97-2003, 2007-2010 (openxml), OpenOffice
>>> (writer,
>>> calc, impress), HTML, Gz, jar/zip files (list of content), and some
>>> "multimedia" files like mp3, mp4, flv, etc.
>>>
>>> In JMeter, Tika can be used by the View Results Tree to view the text
>>> data
>>> of this files, Regular extractor to catch some text from this files and
>>> Response assertion to assert on the data.
>>>
>>> The inconvenient is: Apache Tika requires a big jar (25Mb) or a lot of
>>> jar
>>> files (see below). With all jars in the binary package, the new size (for
>>> tgz) is 45 Mb (JMeter 2.8 tgz : 23Mb)
>>>
>>> The question: are you agree to add Tika (and new capability to "extract
>>> text
>>> from Document") in JMeter with the new binary size?
>>>
>>> Secondary question: what the good way? : 1/ Add only tika-app.jar (which
>>> include all dependencies) [2], or 2/ Add several jar files (tika-core,
>>> tika-parser, etc + dependencies) [3]
>>
>> I'm concerned that using Tika would double the size of JMeter.
>> Although the extra features would be useful, I suspect that most test
>> cases won't need the extra functionality.
>>
>> Would it be possible to make the Tika jars optional?
>> i.e. add the functionality, but if the jars are not present it is
>> disabled.
>
>
> Yes seems possible via a dynamic class control / loading
>
>
>
>>
>> If we accept that developers must download Tika, then it should be
>> easy enough to structure the add-on so that JMeter can fail gracefully
>> if the jars are missing.
>> But ideally developers would not need to download all the jars either.
>
>
> Currently, to compile the "tika" elements, we must have only these jars :
> tika-core.jar
> tika-parsers.jar

That would be fine.

> To the binary release, we needs had these jars (full list):
> apache-mime4j-core.jar
> apache-mime4j-dom.jar
> asm.jar
> aspectjrt.jar
> boilerpipe.jar
> commons-compress.jar
> dom4j.jar
> fontbox.jar
> geronimo-stax-api_1.0_spec.jar
> gson.jar
> isoparser.jar
> jempbox.jar
> juniversalchardet.jar
> log4j.jar
> metadata-extractor.jar
> netcdf.jar
> pdfbox.jar
> poi-ooxml-schemas.jar
> poi-ooxml.jar
> poi-scratchpad.jar
> poi.jar
> rome.jar
> slf4j-api.jar
> slf4j-log4j12.jar
> tagsoup.jar
> tika-core.jar
> tika-parsers.jar
> tika-xmp.jar
> vorbis-java-core.jar
> vorbis-java-tika.jar
> xmlbeans.jar
> xmpcore.jar
> xz.jar
>
> Or only the tika-app.jar (25Mb)
>
>
> So, we can add the "tika" functionalities with dynamic class loading, add
> some warning messages to indicate the download of tika-app.jar if you want
> have the tika behavior
>
> For View Results Tree, when the "Document" combo list is choosed: a message
> in Response data to indicate the missing tika-app.jar (with some indication
> where download it)
>
> For RegExp and Response Assertion, if missing tika-app.jar, a warning dialog
> to show the message when the radio button "Response as a Document" is
> selected
>
> And in all cases, a warning message in jmeter.log.

Rather than use dynamic class loading, would it not be possible to
just catch the Exceptions that are thrown when the jars are missing?

If the code builds OK with just tika-core.jar and tika-parsers.jar
this should be sufficient.

>
>
>
>>
>

Re: Add Apache Tika in JMeter to extract text from various file type

Posted by Milamber <mi...@apache.org>.

Le 05/11/2012 11:26, sebb a ecrit :
> On 3 November 2012 19:23, Milamber<mi...@apache.org>  wrote:
>> Hello,
>>
>> Currently, I work to add Apache Tika 1.2 [1] in JMeter to improve functional
>> tests.
>>
>> With Tika, you can extract the text form various documents, like MS Office
>> (Word, Excel, PowerPoint 97-2003, 2007-2010 (openxml), OpenOffice (writer,
>> calc, impress), HTML, Gz, jar/zip files (list of content), and some
>> "multimedia" files like mp3, mp4, flv, etc.
>>
>> In JMeter, Tika can be used by the View Results Tree to view the text data
>> of this files, Regular extractor to catch some text from this files and
>> Response assertion to assert on the data.
>>
>> The inconvenient is: Apache Tika requires a big jar (25Mb) or a lot of jar
>> files (see below). With all jars in the binary package, the new size (for
>> tgz) is 45 Mb (JMeter 2.8 tgz : 23Mb)
>>
>> The question: are you agree to add Tika (and new capability to "extract text
>> from Document") in JMeter with the new binary size?
>>
>> Secondary question: what the good way? : 1/ Add only tika-app.jar (which
>> include all dependencies) [2], or 2/ Add several jar files (tika-core,
>> tika-parser, etc + dependencies) [3]
> I'm concerned that using Tika would double the size of JMeter.
> Although the extra features would be useful, I suspect that most test
> cases won't need the extra functionality.
>
> Would it be possible to make the Tika jars optional?
> i.e. add the functionality, but if the jars are not present it is disabled.

Yes, that seems possible via dynamic class checking / loading.


>
> If we accept that developers must download Tika, then it should be
> easy enough to structure the add-on so that JMeter can fail gracefully
> if the jars are missing.
> But ideally developers would not need to download all the jars either.

Currently, to compile the "tika" elements, we need only these jars:
tika-core.jar
tika-parsers.jar

For the binary release, we need to add these jars (full list):
apache-mime4j-core.jar
apache-mime4j-dom.jar
asm.jar
aspectjrt.jar
boilerpipe.jar
commons-compress.jar
dom4j.jar
fontbox.jar
geronimo-stax-api_1.0_spec.jar
gson.jar
isoparser.jar
jempbox.jar
juniversalchardet.jar
log4j.jar
metadata-extractor.jar
netcdf.jar
pdfbox.jar
poi-ooxml-schemas.jar
poi-ooxml.jar
poi-scratchpad.jar
poi.jar
rome.jar
slf4j-api.jar
slf4j-log4j12.jar
tagsoup.jar
tika-core.jar
tika-parsers.jar
tika-xmp.jar
vorbis-java-core.jar
vorbis-java-tika.jar
xmlbeans.jar
xmpcore.jar
xz.jar

Or only the tika-app.jar (25Mb)


So, we can add the "tika" functionality with dynamic class loading, and
add some warning messages telling the user to download tika-app.jar to
get the Tika behavior.

For View Results Tree, when "Document" is chosen in the combo list: a
message in Response data indicating that tika-app.jar is missing (with
some indication of where to download it).

For RegExp and Response Assertion, if tika-app.jar is missing: a warning
dialog showing the message when the radio button "Response as a
Document" is selected.

And in all cases, a warning message in jmeter.log.
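
The availability check described above could be sketched as follows. This is only an illustration under stated assumptions, not the actual JMeter code: TikaAvailability and its methods are hypothetical names, java.util.logging stands in for whatever logger writes to jmeter.log, and org.apache.tika.Tika (the facade class in tika-core) is used as the probe class.

```java
import java.util.logging.Logger;

/** Hypothetical helper: checks whether the optional Tika classes are on the classpath. */
final class TikaAvailability {

    private static final Logger LOG = Logger.getLogger(TikaAvailability.class.getName());

    // Probe class: the Tika facade shipped in tika-core (and bundled in tika-app).
    static final String TIKA_FACADE_CLASS = "org.apache.tika.Tika";

    private TikaAvailability() {
    }

    /** Returns true if the named class can be located; the class is not initialized. */
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, TikaAvailability.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns a user-facing warning (also logged) if Tika is absent, or null if it is present. */
    static String missingTikaWarning() {
        if (isClassAvailable(TIKA_FACADE_CLASS)) {
            return null;
        }
        String msg = "tika-app.jar is missing; see http://tika.apache.org/ to download it";
        LOG.warning(msg);
        return msg;
    }

    public static void main(String[] args) {
        String warning = missingTikaWarning();
        System.out.println(warning == null ? "Tika is available" : warning);
    }
}
```

A GUI element could call missingTikaWarning() when the "Document" option is selected and show the returned text in the Response data pane or in a dialog.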




>


Re: Add Apache Tika in JMeter to extract text from various file type

Posted by sebb <se...@gmail.com>.
On 3 November 2012 19:23, Milamber <mi...@apache.org> wrote:
> Hello,
>
> Currently, I work to add Apache Tika 1.2 [1] in JMeter to improve functional
> tests.
>
> With Tika, you can extract the text form various documents, like MS Office
> (Word, Excel, PowerPoint 97-2003, 2007-2010 (openxml), OpenOffice (writer,
> calc, impress), HTML, Gz, jar/zip files (list of content), and some
> "multimedia" files like mp3, mp4, flv, etc.
>
> In JMeter, Tika can be used by the View Results Tree to view the text data
> of this files, Regular extractor to catch some text from this files and
> Response assertion to assert on the data.
>
> The inconvenient is: Apache Tika requires a big jar (25Mb) or a lot of jar
> files (see below). With all jars in the binary package, the new size (for
> tgz) is 45 Mb (JMeter 2.8 tgz : 23Mb)
>
> The question: are you agree to add Tika (and new capability to "extract text
> from Document") in JMeter with the new binary size?
>
> Secondary question: what the good way? : 1/ Add only tika-app.jar (which
> include all dependencies) [2], or 2/ Add several jar files (tika-core,
> tika-parser, etc + dependencies) [3]

I'm concerned that using Tika would double the size of JMeter.
Although the extra features would be useful, I suspect that most test
cases won't need the extra functionality.

Would it be possible to make the Tika jars optional?
i.e. add the functionality, but if the jars are not present it is disabled.

If we accept that developers must download Tika, then it should be
easy enough to structure the add-on so that JMeter can fail gracefully
if the jars are missing.
But ideally developers would not need to download all the jars either.

Re: Add Apache Tika in JMeter to extract text from various file type

Posted by Milamber <mi...@apache.org>.

Le 03/11/2012 20:10, Philippe Mouawad a ecrit :
> Hello Milamber,
> My answers below.
>
> Regards
> Philippe
>
> On Sat, Nov 3, 2012 at 8:23 PM, Milamber<mi...@apache.org>  wrote:
>
>> Hello,
>>
>> Currently, I work to add Apache Tika 1.2 [1] in JMeter to improve
>> functional tests.
>>
>
>> With Tika, you can extract the text form various documents, like MS Office
>> (Word, Excel, PowerPoint 97-2003, 2007-2010 (openxml), OpenOffice (writer,
>> calc, impress), HTML, Gz, jar/zip files (list of content), and some
>> "multimedia" files like mp3, mp4, flv, etc.
>>
>> In JMeter, Tika can be used by the View Results Tree to view the text data
>> of this files, Regular extractor to catch some text from this files and
>> Response assertion to assert on the data.
>>
>> The inconvenient is: Apache Tika requires a big jar (25Mb) or a lot of jar
>> files (see below). With all jars in the binary package, the new size (for
>> tgz) is 45 Mb (JMeter 2.8 tgz : 23Mb)
>>
>> The question: are you agree to add Tika (and new capability to "extract
>> text from Document") in JMeter with the new binary size?
>>
> I agree but we should check impact on JMeter performance  and if it's
> important warn clearly about it and when to use it.

The performance impact occurs only when the option "Body as a
Document" is selected in the Response Assertion or RegExp element.
In View Results Tree, it occurs only when the "Document" viewer is
selected (and the VRT isn't recommended for load tests).

For load tests (as opposed to functional tests), we can add a few
sentences to the docs/wiki recommending against using the "Body as
Document" option.


>
>
>> Secondary question: what the good way? : 1/ Add only tika-app.jar (which
>> include all dependencies) [2], or 2/ Add several jar files (tika-core,
>> tika-parser, etc + dependencies) [3]
>>
>> I would see:
>     - Tika (core + modules) if available
>     - Dependencies
>
> But  not Tika+dependencies in one JAR.
> If first option not possible, reference all tika modules + dependencies

It's possible. tika-(core|parsers) + list of dependencies below.



>
>
>> Milamber
>>
>>
>> [1] http://tika.apache.org/
>>
>> [2] One Jar :
>> +tika-app.version                = 1.2
>> +tika-app.jar                    = tika-app-${tika-app.version}.jar
>> +tika-app.loc                    = ${maven2.repo}/org/apache/tika/tika-app/${tika-app.version}
>> +tika-app.md5                    = e0ec70c80a6f3b113d8ac1c12a33338f
>>
>> [3] Several Jars (i must check if jar is missing)
>>
>> +tika-core.version                = 1.2
>> +tika-core.jar                    = tika-core-${tika-core.version}.jar
>> +tika-core.loc                    = ${maven2.repo}/org/apache/tika/tika-core/${tika-core.version}
>> +tika-core.md5                    = 17cfec5a9b28b323375de0692ce5ecb1
>> +
>> +tika-parsers.version                = 1.2
>> +tika-parsers.jar                    = tika-parsers-${tika-parsers.version}.jar
>> +tika-parsers.loc                    = ${maven2.repo}/org/apache/tika/tika-parsers/${tika-parsers.version}
>> +tika-parsers.md5                    = a15b071726358fd195d5c4b0625cdfb5
>> +
>> +
>> +tika-parsers.version                = 1.2
>> +tika-parsers.jar                    = tika-parsers-${tika-parsers.version}.jar
>> +tika-parsers.loc                    = ${maven2.repo}/org/apache/tika/tika-parsers/${tika-parsers.version}
>> +tika-parsers.md5                    = a15b071726358fd195d5c4b0625cdfb5
>> +
>> +netcdf.version                = 4.2-min
>> +netcdf.jar                    = netcdf-${netcdf.version}.jar
>> +netcdf.loc                    = ${maven2.repo}/edu/ucar/netcdf/${netcdf.version}
>> +netcdf.md5                    = eb00b40b0511f0fc1dfcfc9cb89e3c53
>> +
>> +apache-mime4j-core.version                = 0.7.2
>> +apache-mime4j-core.jar                    = apache-mime4j-core-${apache-mime4j-core.version}.jar
>> +apache-mime4j-core.loc                    = ${maven2.repo}/org/apache/james/apache-mime4j-core/${apache-mime4j-core.version}
>> +apache-mime4j-core.md5                    = 88f799546eca803c53eee01a4ce5edcd
>> +
>> +apache-mime4j-dom.version                = 0.7.2
>> +apache-mime4j-dom.jar                    = apache-mime4j-dom-${apache-mime4j-dom.version}.jar
>> +apache-mime4j-dom.loc                    = ${maven2.repo}/org/apache/james/apache-mime4j-dom/${apache-mime4j-dom.version}
>> +apache-mime4j-dom.md5                    = dedc747b5c367fbd7f8a7235d1d7cbee
>> +
>> +commons-compress.version                = 1.4.1
>> +commons-compress.jar                    = commons-compress-${commons-compress.version}.jar
>> +commons-compress.loc                    = ${maven2.repo}/org/apache/commons/commons-compress/${commons-compress.version}
>> +commons-compress.md5                    = 7f7ff9255a831325f38a170992b70073
>> +
>> +pdfbox.version                = 1.7.0
>> +pdfbox.jar                    = pdfbox-${pdfbox.version}.jar
>> +pdfbox.loc                    = ${maven2.repo}/org/apache/pdfbox/pdfbox/${pdfbox.version}
>> +pdfbox.md5                    = da9ff2f1b43dc92b15fe3ba39a1cddcd
>> +
>> +fontbox.version                = 1.7.0
>> +fontbox.jar                    = fontbox-${fontbox.version}.jar
>> +fontbox.loc                    = ${maven2.repo}/org/apache/pdfbox/fontbox/${fontbox.version}
>> +fontbox.md5                    = 9e03f94d92af257facb148c138af22fa
>> +
>> +jempbox.version                = 1.7.0
>> +jempbox.jar                    = jempbox-${jempbox.version}.jar
>> +jempbox.loc                    = ${maven2.repo}/org/apache/**
>> pdfbox/jempbox/${jempbox.**version}
>> +jempbox.md5                    = 69dfbd6872c29f89a4df1179dd54b4**4e
>> +
>> +poi.version                = 3.8
>> +poi.jar                    = poi-${poi.version}.jar
>> +poi.loc                    = ${maven2.repo}/org/apache/poi/**
>> poi/${poi.version}
>> +poi.md5                    = 5c915f48922046c71121fd7021aa23**cb
>> +
>> +poi-scratchpad.version                = 3.8
>> +poi-scratchpad.jar                    = poi-scratchpad-${poi-**
>> scratchpad.version}.jar
>> +poi-scratchpad.loc                    = ${maven2.repo}/org/apache/poi/**
>> poi-scratchpad/${poi-**scratchpad.version}
>> +poi-scratchpad.md5                    = 7427b6b9e53dcee57d382ba022efc3**
>> be
>> +
>> +poi-ooxml.version                = 3.8
>> +poi-ooxml.jar                    = poi-ooxml-${poi-ooxml.version}**.jar
>> +poi-ooxml.loc                    = ${maven2.repo}/org/apache/poi/**
>> poi-ooxml/${poi-ooxml.version}
>> +poi-ooxml.md5                    = 8f147b248f078799c24c8714f185b1**a8
>> +
>> +geronimo-stax-api_1.0_spec.**version                = 1.0.1
>> +geronimo-stax-api_1.0_spec.**jar                    =
>> geronimo-stax-api_1.0_spec-${**geronimo-stax-api_1.0_spec.**version}.jar
>> +geronimo-stax-api_1.0_spec.**loc                    =
>> ${maven2.repo}/org/apache/**geronimo/specs/geronimo-stax-**
>> api_1.0_spec/${geronimo-stax-**api_1.0_spec.version}
>> +geronimo-stax-api_1.0_spec.**md5                    =
>> b7c2a715cd3d1c43dc4ccfae426e8e**2e
>> +
>> +tagsoup.version                = 1.2.1
>> +tagsoup.jar                    = tagsoup-${tagsoup.version}.jar
>> +tagsoup.loc                    = ${maven2.repo}/org/ccil/cowan/**
>> tagsoup/tagsoup/${tagsoup.**version}
>> +tagsoup.md5                    = ae73a52cdcbec10cd61d9ef22fab59**36
>> +
>> +asm.version                = 3.1
>> +asm.jar                    = asm-${asm.version}.jar
>> +asm.loc                    = ${maven2.repo}/org/ow2/util/**
>> asm/asm/${asm.version}
>> +asm.md5                    = b1a36e247bf18fb4da46ce3a54627d**1b
>> +
>> +isoparser.version                = 1.0-RC-1
>> +isoparser.jar                    = isoparser-${isoparser.version}**.jar
>> +isoparser.loc                    = ${maven2.repo}/com/googlecode/**
>> mp4parser/isoparser/${**isoparser.version}
>> +isoparser.md5                    = b0444fde2290319c9028564c3c3ff1**ab
>> +
>> +metadata-extractor.version                = 2.4.0-beta-1
>> +metadata-extractor.jar                    = metadata-extractor-${metadata-
>> **extractor.version}.jar
>> +metadata-extractor.loc                    = ${maven2.repo}/com/drewnoakes/
>> **metadata-extractor/${metadata-**extractor.version}
>> +metadata-extractor.md5                    = 6e0ad2f0fe78047cb34ec056b39633
>> **d3
>> +
>> +boilerpipe.version                = 1.1.0
>> +boilerpipe.jar                    = boilerpipe-${boilerpipe.**
>> version}.jar
>> +boilerpipe.loc                    = ${maven2.repo}/de/l3s/**
>> boilerpipe/boilerpipe/${**boilerpipe.version}
>> +boilerpipe.md5                    = 0616568083786d0f49e2cb07a5d09f**e4
>> +
>> +rome.version                = 0.9
>> +rome.jar                    = rome-${rome.version}.jar
>> +rome.loc                    = ${maven2.repo}/rome/rome/${**rome.version}
>> +rome.md5                    = 19589699b01c59ccb4d5e61e4c78b3**11
>> +
>> +vorbis-java-core.version                = 0.1
>> +vorbis-java-core.jar                    = vorbis-java-core-${vorbis-**
>> java-core.version}.jar
>> +vorbis-java-core.loc                    = ${maven2.repo}/org/gagravarr/**
>> vorbis-java-core/${vorbis-**java-core.version}
>> +vorbis-java-core.md5                    = b88115be2754cb6883e652ba68ca46*
>> *c8
>> +
>> +juniversalchardet.version                = 1.0.3
>> +juniversalchardet.jar                    = juniversalchardet-${**
>> juniversalchardet.version}.jar
>> +juniversalchardet.loc                    = ${maven2.repo}/com/googlecode/
>> **juniversalchardet/**juniversalchardet/${**juniversalchardet.version}
>> +juniversalchardet.md5                    = d9ea0a9a275336c175b343f2e4cd8f
>> **27
>> +
>> +xz.version                = 1.1
>> +xz.jar                    = xz-${xz.version}.jar
>> +xz.loc                    = ${maven2.repo}/org/tukaani/xz/**${xz.version}
>> +xz.md5                    = 4d0ba9643c8f3f7c6721be3a1286da**1c
>> +
>> +dom4j.version                 = 1.6.1
>> +dom4j.jar                = dom4j-${dom4j.version}.jar
>> +dom4j.loc                = ${maven2.repo}/dom4j/dom4j/${**dom4j.version}
>> +dom4j.md5                = 4d8f51d3fe3900efc6e395be48030d**6d
>> +
>> +xmlbeans.version                 = 2.6.0
>> +xmlbeans.jar                = xmlbeans-${xmlbeans.version}.**jar
>> +xmlbeans.loc                = ${maven2.repo}/org/apache/**
>> xmlbeans/xmlbeans/${xmlbeans.**version}
>> +xmlbeans.md5                = 6591c08682d613194dacb01e95c78c**2c
>> +
>> +poi-ooxml.version                 = 3.8
>> +poi-ooxml.jar                = poi-ooxml-${poi-ooxml.version}**.jar
>> +poi-ooxml.loc                = ${maven2.repo}/org/apache/poi/**
>> poi-ooxml/${poi-ooxml.version}
>> +poi-ooxml.md5                = 8f147b248f078799c24c8714f185b1**a8
>> +
>> +poi-ooxml-schemas.version                 = 3.8
>> +poi-ooxml-schemas.jar                = poi-ooxml-schemas-${poi-ooxml-**
>> schemas.version}.jar
>> +poi-ooxml-schemas.loc                = ${maven2.repo}/org/apache/poi/**
>> poi-ooxml-schemas/${poi-ooxml-**schemas.version}
>> +poi-ooxml-schemas.md5                = 7ebcffdc4d82b2b8cbc6464d4543cd**07
>>
>>
>>
>>
>


Re: Add Apache Tika in JMeter to extract text from various file type

Posted by Shmuel Krakower <sh...@gmail.com>.
Cool. Is there a Bugzilla entry for this, so I can follow it and try a build
that has this functionality? I already have a use case in mind and may be
able to provide some feedback.
On 2012 11 4 15:17, "Philippe Mouawad" <ph...@gmail.com> wrote:

> On Sun, Nov 4, 2012 at 2:09 PM, Milamber <mi...@apache.org> wrote:
>
> >
> >
> > On 03/11/2012 23:47, Shmuel Krakower wrote:
> >
> >> Hi Philippe
> >> if you are concerned about performance with the assertion, maybe as a
> >> starter it would be better to begin with a separate assertion component
> >> for documents? Only this one will use Tika, which will make it harder for
> >> users to misuse Tika when it is not needed.
> >
> > There are no performance issues when JMeter runs a test if the radio
> > button "Body as a Document" is not selected (in Response Assertion or
> > RegExp extractor).
> >
> > Adding several new elements (Tika regexp, Tika Assertion) for only one new
> > radio button in the "Response Field to check" section (in the current
> > Regexp extractor and Response Assertion) does not seem necessary.
> > I can add a tooltip on the "Body as a Document" radio button plus a
> > warning in the component reference. Does that seem a good compromise?
>
> +1 for me.
>
> > Milamber
> >
> >> Regarding size of JMeter, it shouldn't be a concern.
> >>
> >> Overall sounds like a nice upgrade to JMeter capabilities.
> >>
> >> Best,
> >> Shmuel.
> >> On 2012 11 3 22:11, "Philippe Mouawad" <ph...@gmail.com> wrote:
> >>
> >>> Hello Milamber,
> >>> My answers below.
> >>>
> >>> Regards
> >>> Philippe
> >>>
> >>> On Sat, Nov 3, 2012 at 8:23 PM, Milamber <mi...@apache.org> wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I am currently working on adding Apache Tika 1.2 [1] to JMeter to
> >>>> improve functional tests.
> >>>>
> >>>> With Tika, you can extract the text from various documents, like MS
> >>>> Office (Word, Excel, PowerPoint 97-2003, 2007-2010 (openxml)),
> >>>> OpenOffice (writer, calc, impress), HTML, Gz, jar/zip files (list of
> >>>> content), and some "multimedia" files like mp3, mp4, flv, etc.
> >>>>
> >>>> In JMeter, Tika can be used by the View Results Tree to view the text
> >>>> data of these files, by the Regular Expression Extractor to catch some
> >>>> text from these files, and by the Response Assertion to assert on the
> >>>> data.
> >>>>
> >>>> The drawback is: Apache Tika requires a big jar (25 MB) or a lot of jar
> >>>> files (see below). With all jars in the binary package, the new size
> >>>> (for tgz) is 45 MB (JMeter 2.8 tgz: 23 MB)
> >>>>
> >>>> The question: do you agree to add Tika (and the new capability to
> >>>> "extract text from Document") to JMeter with the new binary size?
> >>>
> >>> I agree, but we should check the impact on JMeter performance and, if
> >>> it is significant, warn clearly about it and about when to use it.
> >>>
> >>>> Secondary question: what is the best way? 1/ Add only tika-app.jar
> >>>> (which includes all dependencies) [2], or 2/ Add several jar files
> >>>> (tika-core, tika-parsers, etc. + dependencies) [3]
> >>>
> >>> I would see:
> >>>     - Tika (core + modules) if available
> >>>     - Dependencies
> >>>
> >>> But not Tika+dependencies in one JAR.
> >>> If the first option is not possible, reference all Tika modules +
> >>> dependencies
> >>>
> >>>> Milamber
> >>>>
> >>>> [1] http://tika.apache.org/
> >>>>
> >>> --
> >>> Cordialement.
> >>> Philippe Mouawad.
> >>>
> >>>
> >
> >
>
>
> --
> Cordialement.
> Philippe Mouawad.
>

Re: Add Apache Tika in JMeter to extract text from various file type

Posted by Philippe Mouawad <ph...@gmail.com>.
On Sun, Nov 4, 2012 at 2:09 PM, Milamber <mi...@apache.org> wrote:

>
>
> On 03/11/2012 23:47, Shmuel Krakower wrote:
>
>> Hi Philippe
>> if you are concerned about performance with the assertion, maybe as a
>> starter it would be better to begin with a separate assertion component
>> for documents? Only this one will use Tika, which will make it harder for
>> users to misuse Tika when it is not needed.
>
> There are no performance issues when JMeter runs a test if the radio
> button "Body as a Document" is not selected (in Response Assertion or
> RegExp extractor).
>
> Adding several new elements (Tika regexp, Tika Assertion) for only one new
> radio button in the "Response Field to check" section (in the current
> Regexp extractor and Response Assertion) does not seem necessary.
> I can add a tooltip on the "Body as a Document" radio button plus a
> warning in the component reference. Does that seem a good compromise?

+1 for me.
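
To illustrate the compromise discussed above (no extraction cost unless "Body as a Document" is selected), here is a minimal sketch in Java. Everything in it is hypothetical: ResponseTextSelector is not a JMeter class, the bodyAsDocument flag merely stands in for the radio button, and the extractor is a stub where a real implementation might delegate to Tika.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical helper for illustration only; not a JMeter class.
class ResponseTextSelector {
    // Stand-in for the costly document-to-text step (e.g. a Tika-backed one)
    private final Function<byte[], String> documentExtractor;
    private int extractorCalls = 0;

    ResponseTextSelector(Function<byte[], String> documentExtractor) {
        this.documentExtractor = documentExtractor;
    }

    // Returns the text an assertion or extractor would inspect.
    // The costly extractor runs only when bodyAsDocument is true.
    String textToCheck(byte[] body, boolean bodyAsDocument) {
        if (!bodyAsDocument) {
            return new String(body, StandardCharsets.UTF_8);
        }
        extractorCalls++;
        return documentExtractor.apply(body);
    }

    int extractorCalls() {
        return extractorCalls;
    }

    public static void main(String[] args) {
        ResponseTextSelector selector = new ResponseTextSelector(
                bytes -> "text extracted from " + bytes.length + " bytes");
        byte[] body = "plain body".getBytes(StandardCharsets.UTF_8);

        System.out.println(selector.textToCheck(body, false)); // prints "plain body"
        System.out.println(selector.extractorCalls());         // prints 0
        System.out.println(selector.textToCheck(body, true));  // prints "text extracted from 10 bytes"
        System.out.println(selector.extractorCalls());         // prints 1
    }
}
```

With the option off, the extractor is never called, which is the behaviour the tooltip and component-reference warning would describe to users.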

> Milamber
>
>> Regarding size of JMeter, it shouldn't be a concern.
>>
>> Overall sounds like a nice upgrade to JMeter capabilities.
>>
>> Best,
>> Shmuel.
>> On 2012 11 3 22:11, "Philippe Mouawad" <ph...@gmail.com> wrote:
>>
>>> Hello Milamber,
>>> My answers below.
>>>
>>> Regards
>>> Philippe
>>>
>>> On Sat, Nov 3, 2012 at 8:23 PM, Milamber <mi...@apache.org> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am currently working on adding Apache Tika 1.2 [1] to JMeter to
>>>> improve functional tests.
>>>>
>>>> With Tika, you can extract the text from various documents, like MS
>>>> Office (Word, Excel, PowerPoint 97-2003, 2007-2010 (openxml)),
>>>> OpenOffice (writer, calc, impress), HTML, Gz, jar/zip files (list of
>>>> content), and some "multimedia" files like mp3, mp4, flv, etc.
>>>>
>>>> In JMeter, Tika can be used by the View Results Tree to view the text
>>>> data of these files, by the Regular Expression Extractor to catch some
>>>> text from these files, and by the Response Assertion to assert on the
>>>> data.
>>>>
>>>> The drawback is: Apache Tika requires a big jar (25 MB) or a lot of jar
>>>> files (see below). With all jars in the binary package, the new size
>>>> (for tgz) is 45 MB (JMeter 2.8 tgz: 23 MB)
>>>>
>>>> The question: do you agree to add Tika (and the new capability to
>>>> "extract text from Document") to JMeter with the new binary size?
>>>
>>> I agree, but we should check the impact on JMeter performance and, if
>>> it is significant, warn clearly about it and about when to use it.
>>>
>>>> Secondary question: what is the best way? 1/ Add only tika-app.jar
>>>> (which includes all dependencies) [2], or 2/ Add several jar files
>>>> (tika-core, tika-parsers, etc. + dependencies) [3]
>>>
>>> I would see:
>>>     - Tika (core + modules) if available
>>>     - Dependencies
>>>
>>> But not Tika+dependencies in one JAR.
>>> If the first option is not possible, reference all Tika modules +
>>> dependencies
>>>
>>>> Milamber
>>>>
>>>> [1] http://tika.apache.org/
>>>>
>>>> +poi-ooxml.version                 = 3.8
>>>> +poi-ooxml.jar                = poi-ooxml-${poi-ooxml.version}****.jar
>>>> +poi-ooxml.loc                = ${maven2.repo}/org/apache/poi/****
>>>> poi-ooxml/${poi-ooxml.version}
>>>> +poi-ooxml.md5                = 8f147b248f078799c24c8714f185b1****a8
>>>> +
>>>> +poi-ooxml-schemas.version                 = 3.8
>>>> +poi-ooxml-schemas.jar                = poi-ooxml-schemas-${poi-ooxml-*
>>>> ***
>>>> schemas.version}.jar
>>>> +poi-ooxml-schemas.loc                = ${maven2.repo}/org/apache/poi/*
>>>> ***
>>>> poi-ooxml-schemas/${poi-ooxml-****schemas.version}
>>>> +poi-ooxml-schemas.md5                =
>>>>
>>> 7ebcffdc4d82b2b8cbc6464d4543cd****07
>>>
>>>>
>>>>
>>>>
>>>>
>>> --
>>> Cordialement.
>>> Philippe Mouawad.
>>>
>>>
>
>


-- 
Cordialement.
Philippe Mouawad.

Re: Add Apache Tika in JMeter to extract text from various file type

Posted by Milamber <mi...@apache.org>.

On 03/11/2012 23:47, Shmuel Krakower wrote:
> Hi Philippe,
> if you are concerned about performance with the assertion, maybe as a
> start it would be better to begin with a separate assertion component
> for documents? Only that one would use Tika, which would make it harder
> for users to misuse Tika when it is not needed.

There are no performance issues when JMeter runs a test if the "Body as
a Document" radio button is not selected (in the Response Assertion or
the Regular Expression Extractor).

Adding several new elements (a Tika regexp extractor, a Tika assertion)
for only one new radio button in the "Response Field to check" section
(of the current Regular Expression Extractor and Response Assertion)
does not seem necessary.
I can add a tooltip on the "Body as a Document" radio button, plus a
warning in the component reference. Does that seem like a good compromise?
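
To illustrate why the default path stays cheap, here is a minimal sketch of the idea, not the actual patch. The class and method names are hypothetical, and a plain `Function` stands in for the Tika-backed converter so the example runs without any extra jar. The only point is that the extractor is invoked solely when the "Body as a Document" option is selected.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Sketch only: these names are hypothetical, not JMeter or Tika API. */
public class BodyAsDocumentSketch {

    /** Stand-in for a Tika-backed converter (raw bytes in, extracted text out). */
    private final Function<byte[], String> documentExtractor;

    /** Counts how often the (potentially expensive) extractor was invoked. */
    private int extractorCalls = 0;

    public BodyAsDocumentSketch(Function<byte[], String> documentExtractor) {
        this.documentExtractor = documentExtractor;
    }

    /** Returns the text an assertion or regexp extractor would be applied to. */
    public String textToCheck(byte[] responseBody, boolean bodyAsDocument) {
        if (!bodyAsDocument) {
            // Default path: plain decoding, the document extractor is never touched.
            return new String(responseBody, StandardCharsets.UTF_8);
        }
        // Only this branch pays the cost of document parsing.
        extractorCalls++;
        return documentExtractor.apply(responseBody);
    }

    public int getExtractorCalls() {
        return extractorCalls;
    }
}
```

With the flag set to false, `getExtractorCalls()` stays at 0 no matter how many samples are processed, which is the behaviour described above.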

Milamber


>
> Regarding size of JMeter, it shouldn't be a concern.
>
> Overall sounds like a nice upgrade to JMeter capabilities.
>
> Best,
> Shmuel.
> On 3 Nov 2012 22:11, "Philippe Mouawad" <ph...@gmail.com> wrote:
>
>> Hello Milamber,
>> My answers below.
>>
>> Regards
>> Philippe
>>
>> On Sat, Nov 3, 2012 at 8:23 PM, Milamber <mi...@apache.org> wrote:
>>
>>> Hello,
>>>
>>> Currently, I am working on adding Apache Tika 1.2 [1] to JMeter to
>>> improve functional tests.
>>>
>>> With Tika, you can extract the text from various documents, like MS
>>> Office (Word, Excel, PowerPoint 97-2003, 2007-2010 (openxml)),
>>> OpenOffice (writer, calc, impress), HTML, Gz, jar/zip files (list of
>>> content), and some "multimedia" files like mp3, mp4, flv, etc.
>>>
>>> In JMeter, Tika can be used by the View Results Tree to view the text
>>> data of these files, by the Regular Expression Extractor to capture
>>> text from these files, and by the Response Assertion to assert on the
>>> data.
>>>
>>> The drawback is: Apache Tika requires a big jar (25 MB) or a lot of
>>> jar files (see below). With all jars in the binary package, the new
>>> size (for tgz) is 45 MB (JMeter 2.8 tgz: 23 MB).
>>>
>>> The question: do you agree to add Tika (and the new capability to
>>> "extract text from Document") to JMeter with the new binary size?
>>>
>> I agree, but we should check the impact on JMeter performance and, if
>> it is significant, warn clearly about it and about when to use it.
>>
>>> Secondary question: which is the better way? 1/ Add only tika-app.jar
>>> (which includes all dependencies) [2], or 2/ Add several jar files
>>> (tika-core, tika-parsers, etc. + dependencies) [3]
>>>
>> I would prefer:
>>     - Tika (core + modules) if available
>>     - Dependencies
>>
>> But not Tika + dependencies in one JAR.
>> If the first option is not possible, reference all Tika modules +
>> dependencies.
>>
>>> Milamber
>>>
>>>
>>> [1] http://tika.apache.org/
>>>
>>
>> --
>> Cordialement.
>> Philippe Mouawad.
>>



Re: Add Apache Tika in JMeter to extract text from various file type

Posted by Shmuel Krakower <sh...@gmail.com>.
Hi Philippe,
if you are concerned about performance with the assertion, maybe as a
start it would be better to begin with a separate assertion component
for documents? Only that one would use Tika, which would make it harder
for users to misuse Tika when it is not needed.

Regarding the size of JMeter, it shouldn't be a concern.

Overall, this sounds like a nice upgrade to JMeter's capabilities.

Best,
Shmuel.
On 3 Nov 2012 22:11, "Philippe Mouawad" <ph...@gmail.com> wrote:

> Hello Milamber,
> My answers below.
>
> Regards
> Philippe
>
> On Sat, Nov 3, 2012 at 8:23 PM, Milamber <mi...@apache.org> wrote:
>
> > Hello,
> >
> > Currently, I am working on adding Apache Tika 1.2 [1] to JMeter to
> > improve functional tests.
> >
> > With Tika, you can extract the text from various documents, like MS
> > Office (Word, Excel, PowerPoint 97-2003, 2007-2010 (openxml)),
> > OpenOffice (writer, calc, impress), HTML, Gz, jar/zip files (list of
> > content), and some "multimedia" files like mp3, mp4, flv, etc.
> >
> > In JMeter, Tika can be used by the View Results Tree to view the text
> > data of these files, by the Regular Expression Extractor to capture
> > text from these files, and by the Response Assertion to assert on the
> > data.
> >
> > The drawback is: Apache Tika requires a big jar (25 MB) or a lot of
> > jar files (see below). With all jars in the binary package, the new
> > size (for tgz) is 45 MB (JMeter 2.8 tgz: 23 MB).
> >
> > The question: do you agree to add Tika (and the new capability to
> > "extract text from Document") to JMeter with the new binary size?
> >
>
> I agree, but we should check the impact on JMeter performance and, if
> it is significant, warn clearly about it and about when to use it.
>
> > Secondary question: which is the better way? 1/ Add only tika-app.jar
> > (which includes all dependencies) [2], or 2/ Add several jar files
> > (tika-core, tika-parsers, etc. + dependencies) [3]
> >
>
> I would prefer:
>
>    - Tika (core + modules) if available
>    - Dependencies
>
> But not Tika + dependencies in one JAR.
> If the first option is not possible, reference all Tika modules +
> dependencies.
>
> > Milamber
> >
> >
> > [1] http://tika.apache.org/
> >
>
>
> --
> Cordialement.
> Philippe Mouawad.
>

Re: Add Apache Tika in JMeter to extract text from various file type

Posted by Philippe Mouawad <ph...@gmail.com>.
Hello Milamber,
My answers below.

Regards
Philippe

On Sat, Nov 3, 2012 at 8:23 PM, Milamber <mi...@apache.org> wrote:

> Hello,
>
> Currently, I am working on adding Apache Tika 1.2 [1] to JMeter to improve
> functional tests.
>
> With Tika, you can extract the text from various documents, like MS Office
> (Word, Excel, PowerPoint 97-2003, 2007-2010 (openxml)), OpenOffice (writer,
> calc, impress), HTML, Gz, jar/zip files (list of content), and some
> "multimedia" files like mp3, mp4, flv, etc.
>
> In JMeter, Tika can be used by the View Results Tree to view the text data
> of these files, by the Regular Expression Extractor to capture text from
> these files, and by the Response Assertion to assert on the data.
>
> The drawback is: Apache Tika requires a big jar (25 MB) or a lot of jar
> files (see below). With all jars in the binary package, the new size (for
> tgz) is 45 MB (JMeter 2.8 tgz: 23 MB).
>
> The question: do you agree to add Tika (and the new capability to "extract
> text from Document") to JMeter with the new binary size?
>

I agree, but we should check the impact on JMeter performance and, if it
is significant, warn clearly about it and about when to use it.


> Secondary question: which is the better way? 1/ Add only tika-app.jar
> (which includes all dependencies) [2], or 2/ Add several jar files
> (tika-core, tika-parsers, etc. + dependencies) [3]
>

I would prefer:

   - Tika (core + modules) if available
   - Dependencies

But not Tika + dependencies in one JAR.
If the first option is not possible, reference all Tika modules + dependencies.
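
Whichever option is chosen, each jar is described by the same four keys (`.version`, `.jar`, `.loc`, `.md5`) in the entries quoted below. As a rough sketch of how such an entry can be read, the code below expands the `${...}` references and joins `.loc` and `.jar` into a download URL, then computes an MD5 to compare with the `.md5` value. This assumes the build joins the two keys that way; the class, its method names and the repository URL used in the usage note are hypothetical and only serve the illustration.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: hypothetical helper, not part of the JMeter build. */
public class JarEntrySketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /**
     * Expands ${name} references using props. Unknown names are left untouched.
     * Circular references are not detected in this sketch.
     */
    public static String expand(String value, Map<String, String> props) {
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = props.get(m.group(1));
            // Keep the placeholder as-is when undefined, otherwise expand nested references.
            replacement = (replacement == null) ? m.group(0) : expand(replacement, props);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Assumed convention: the download URL is "<name>.loc" + "/" + "<name>.jar". */
    public static String downloadUrl(String name, Map<String, String> props) {
        return expand("${" + name + ".loc}/${" + name + ".jar}", props);
    }

    /** Lowercase hex MD5 of the given bytes, in the same form as the .md5 values. */
    public static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

For example, with `maven2.repo` set to a placeholder such as `https://repo.example.org/maven2` and the three `tika-app` keys loaded into the map, `downloadUrl("tika-app", props)` yields a URL ending in `/org/apache/tika/tika-app/1.2/tika-app-1.2.jar`. With separate jars, the same lookup is simply repeated once per entry.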


> Milamber
>
>
> [1] http://tika.apache.org/
>
> [2] One Jar :
> +tika-app.version                = 1.2
> +tika-app.jar                    = tika-app-${tika-app.version}.**jar
> +tika-app.loc                    = ${maven2.repo}/org/apache/**
> tika/tika-app/${tika-app.**version}
> +tika-app.md5                    = e0ec70c80a6f3b113d8ac1c12a3333**8f
>
> [3] Several Jars (i must check if jar is missing)
>
> +tika-core.version                = 1.2
> +tika-core.jar                    = tika-core-${tika-core.version}**.jar
> +tika-core.loc                    = ${maven2.repo}/org/apache/**
> tika/tika-core/${tika-core.**version}
> +tika-core.md5                    = 17cfec5a9b28b323375de0692ce5ec**b1
> +
> +tika-parsers.version                = 1.2
> +tika-parsers.jar                    = tika-parsers-${tika-parsers.**
> version}.jar
> +tika-parsers.loc                    = ${maven2.repo}/org/apache/**
> tika/tika-parsers/${tika-**parsers.version}
> +tika-parsers.md5                    = a15b071726358fd195d5c4b0625cdf**b5
> +
> +
> +tika-parsers.version                = 1.2
> +tika-parsers.jar                    = tika-parsers-${tika-parsers.**
> version}.jar
> +tika-parsers.loc                    = ${maven2.repo}/org/apache/**
> tika/tika-parsers/${tika-**parsers.version}
> +tika-parsers.md5                    = a15b071726358fd195d5c4b0625cdf**b5
> +
> +netcdf.version                = 4.2-min
> +netcdf.jar                    = netcdf-${netcdf.version}.jar
> +netcdf.loc                    = ${maven2.repo}/edu/ucar/**
> netcdf/${netcdf.version}
> +netcdf.md5                    = eb00b40b0511f0fc1dfcfc9cb89e3c**53
> +
> +apache-mime4j-core.version                = 0.7.2
> +apache-mime4j-core.jar                    = apache-mime4j-core-${apache-*
> *mime4j-core.version}.jar
> +apache-mime4j-core.loc                    = ${maven2.repo}/org/apache/**
> james/apache-mime4j-core/${**apache-mime4j-core.version}
> +apache-mime4j-core.md5                    = 88f799546eca803c53eee01a4ce5ed
> **cd
> +
> +apache-mime4j-dom.version                = 0.7.2
> +apache-mime4j-dom.jar                    = apache-mime4j-dom-${apache-**
> mime4j-dom.version}.jar
> +apache-mime4j-dom.loc                    = ${maven2.repo}/org/apache/**
> james/apache-mime4j-dom/${**apache-mime4j-dom.version}
> +apache-mime4j-dom.md5                    = dedc747b5c367fbd7f8a7235d1d7cb
> **ee
> +
> +commons-compress.version                = 1.4.1
> +commons-compress.jar                    = commons-compress-${commons-**
> compress.version}.jar
> +commons-compress.loc                    = ${maven2.repo}/org/apache/**
> commons/commons-compress/${**commons-compress.version}
> +commons-compress.md5                    = 7f7ff9255a831325f38a170992b700*
> *73
> +
> +pdfbox.version                = 1.7.0
> +pdfbox.jar                    = pdfbox-${pdfbox.version}.jar
> +pdfbox.loc                    = ${maven2.repo}/org/apache/**
> pdfbox/pdfbox/${pdfbox.**version}
> +pdfbox.md5                    = da9ff2f1b43dc92b15fe3ba39a1cdd**cd
> +
> +fontbox.version                = 1.7.0
> +fontbox.jar                    = fontbox-${fontbox.version}.jar
> +fontbox.loc                    = ${maven2.repo}/org/apache/**
> pdfbox/fontbox/${fontbox.**version}
> +fontbox.md5                    = 9e03f94d92af257facb148c138af22**fa
> +
> +jempbox.version                = 1.7.0
> +jempbox.jar                    = jempbox-${jempbox.version}.jar
> +jempbox.loc                    = ${maven2.repo}/org/apache/**
> pdfbox/jempbox/${jempbox.**version}
> +jempbox.md5                    = 69dfbd6872c29f89a4df1179dd54b4**4e
> +
> +poi.version                = 3.8
> +poi.jar                    = poi-${poi.version}.jar
> +poi.loc                    = ${maven2.repo}/org/apache/poi/poi/${poi.version}
> +poi.md5                    = 5c915f48922046c71121fd7021aa23cb
> +
> +poi-scratchpad.version                = 3.8
> +poi-scratchpad.jar                    = poi-scratchpad-${poi-scratchpad.version}.jar
> +poi-scratchpad.loc                    = ${maven2.repo}/org/apache/poi/poi-scratchpad/${poi-scratchpad.version}
> +poi-scratchpad.md5                    = 7427b6b9e53dcee57d382ba022efc3be
> +
> +poi-ooxml.version                = 3.8
> +poi-ooxml.jar                    = poi-ooxml-${poi-ooxml.version}.jar
> +poi-ooxml.loc                    = ${maven2.repo}/org/apache/poi/poi-ooxml/${poi-ooxml.version}
> +poi-ooxml.md5                    = 8f147b248f078799c24c8714f185b1a8
> +
> +geronimo-stax-api_1.0_spec.version                = 1.0.1
> +geronimo-stax-api_1.0_spec.jar                    = geronimo-stax-api_1.0_spec-${geronimo-stax-api_1.0_spec.version}.jar
> +geronimo-stax-api_1.0_spec.loc                    = ${maven2.repo}/org/apache/geronimo/specs/geronimo-stax-api_1.0_spec/${geronimo-stax-api_1.0_spec.version}
> +geronimo-stax-api_1.0_spec.md5                    = b7c2a715cd3d1c43dc4ccfae426e8e2e
> +
> +tagsoup.version                = 1.2.1
> +tagsoup.jar                    = tagsoup-${tagsoup.version}.jar
> +tagsoup.loc                    = ${maven2.repo}/org/ccil/cowan/tagsoup/tagsoup/${tagsoup.version}
> +tagsoup.md5                    = ae73a52cdcbec10cd61d9ef22fab5936
> +
> +asm.version                = 3.1
> +asm.jar                    = asm-${asm.version}.jar
> +asm.loc                    = ${maven2.repo}/org/ow2/util/asm/asm/${asm.version}
> +asm.md5                    = b1a36e247bf18fb4da46ce3a54627d1b
> +
> +isoparser.version                = 1.0-RC-1
> +isoparser.jar                    = isoparser-${isoparser.version}.jar
> +isoparser.loc                    = ${maven2.repo}/com/googlecode/mp4parser/isoparser/${isoparser.version}
> +isoparser.md5                    = b0444fde2290319c9028564c3c3ff1ab
> +
> +metadata-extractor.version                = 2.4.0-beta-1
> +metadata-extractor.jar                    = metadata-extractor-${metadata-extractor.version}.jar
> +metadata-extractor.loc                    = ${maven2.repo}/com/drewnoakes/metadata-extractor/${metadata-extractor.version}
> +metadata-extractor.md5                    = 6e0ad2f0fe78047cb34ec056b39633d3
> +
> +boilerpipe.version                = 1.1.0
> +boilerpipe.jar                    = boilerpipe-${boilerpipe.version}.jar
> +boilerpipe.loc                    = ${maven2.repo}/de/l3s/boilerpipe/boilerpipe/${boilerpipe.version}
> +boilerpipe.md5                    = 0616568083786d0f49e2cb07a5d09fe4
> +
> +rome.version                = 0.9
> +rome.jar                    = rome-${rome.version}.jar
> +rome.loc                    = ${maven2.repo}/rome/rome/${rome.version}
> +rome.md5                    = 19589699b01c59ccb4d5e61e4c78b311
> +
> +vorbis-java-core.version                = 0.1
> +vorbis-java-core.jar                    = vorbis-java-core-${vorbis-java-core.version}.jar
> +vorbis-java-core.loc                    = ${maven2.repo}/org/gagravarr/vorbis-java-core/${vorbis-java-core.version}
> +vorbis-java-core.md5                    = b88115be2754cb6883e652ba68ca46c8
> +
> +juniversalchardet.version                = 1.0.3
> +juniversalchardet.jar                    = juniversalchardet-${juniversalchardet.version}.jar
> +juniversalchardet.loc                    = ${maven2.repo}/com/googlecode/juniversalchardet/juniversalchardet/${juniversalchardet.version}
> +juniversalchardet.md5                    = d9ea0a9a275336c175b343f2e4cd8f27
> +
> +xz.version                = 1.1
> +xz.jar                    = xz-${xz.version}.jar
> +xz.loc                    = ${maven2.repo}/org/tukaani/xz/${xz.version}
> +xz.md5                    = 4d0ba9643c8f3f7c6721be3a1286da1c
> +
> +dom4j.version                 = 1.6.1
> +dom4j.jar                = dom4j-${dom4j.version}.jar
> +dom4j.loc                = ${maven2.repo}/dom4j/dom4j/${dom4j.version}
> +dom4j.md5                = 4d8f51d3fe3900efc6e395be48030d6d
> +
> +xmlbeans.version                 = 2.6.0
> +xmlbeans.jar                = xmlbeans-${xmlbeans.version}.jar
> +xmlbeans.loc                = ${maven2.repo}/org/apache/xmlbeans/xmlbeans/${xmlbeans.version}
> +xmlbeans.md5                = 6591c08682d613194dacb01e95c78c2c
> +
> +poi-ooxml.version                 = 3.8
> +poi-ooxml.jar                = poi-ooxml-${poi-ooxml.version}.jar
> +poi-ooxml.loc                = ${maven2.repo}/org/apache/poi/poi-ooxml/${poi-ooxml.version}
> +poi-ooxml.md5                = 8f147b248f078799c24c8714f185b1a8
> +
> +poi-ooxml-schemas.version                 = 3.8
> +poi-ooxml-schemas.jar                = poi-ooxml-schemas-${poi-ooxml-schemas.version}.jar
> +poi-ooxml-schemas.loc                = ${maven2.repo}/org/apache/poi/poi-ooxml-schemas/${poi-ooxml-schemas.version}
> +poi-ooxml-schemas.md5                = 7ebcffdc4d82b2b8cbc6464d4543cd07
>
>
>
>


-- 
Regards.
Philippe Mouawad.