Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/09/18 19:19:09 UTC

RE: How to exclude a mimetype in tika?

Speaking of which...last time I went looking for an example of an up-to-date tika config file, it was hard to find (thank you, jboss and https://wiki.csc.calpoly.edu/DocuCategMontano/browser/Parser/tika-config.xml).

Should I add a DefaultTikaConfigDumper to the examples module that would write out the default Tika config for the current version of Tika, so that people can generate it and then modify it?

Or, did I just plain miss an already existing example on our website/wiki?

Best,

            Tim


-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Thursday, September 18, 2014 12:56 PM
To: user@tika.apache.org
Cc: keeblerh@yahoo.com
Subject: Re: How to exclude a mimetype in tika?

+1 Tim, I believe so?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Thursday, September 18, 2014 7:45 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Cc: "keeblerh@yahoo.com" <ke...@yahoo.com>
Subject: FW: How to exclude a mimetype in tika?

>Tika Colleagues (Tika'ers, Tikis?),
>
>Is this the right answer:
>
>Drop the relevant parsers from the tika.config file and make sure to
>point solr to this file in your solr request handler definition:
><str name="tika.config">/my/path/to/tika.config</str>?
>
>  I only have experience as a programmatic user of Tika and would use a
>DocumentSelector, but would the above work?
>
>-----Original Message-----
>From: keeblerh [mailto:keeblerh@yahoo.com]
>Sent: Thursday, September 18, 2014 10:15 AM
>To: solr-user@lucene.apache.org
>Subject: Re: How to exclude a mimetype in tika?
>
>eShard wrote
>> Good afternoon,
>> I'm using solr 4.0 Final
>> I need movies "hidden" in zip files that need to be excluded from the
>> index.
>> I can't filter movies on the crawler because then I would have to
>> exclude all zip files.
>> I was told I can have tika skip the movies.
>> the details are escaping me at this point.
>> How do I exclude a file in the tika configuration?
>> I assume it's something I add in the update/extract handler but I'm not
>> sure.
>> 
>> Thanks,
>
>I am having the same issue.  I need to exclude some mime types from the
>zip files and am using Solr 4.8.  Did you ever get an answer to this?  Thanks.
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/How-to-exclude-a-mimetype-in-tika-tp4127168p4159676.html
>Sent from the Solr - User mailing list archive at Nabble.com.
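
For the programmatic route mentioned above (a DocumentSelector), the core decision is a per-document yes/no based on the MIME type. A minimal, stdlib-only sketch of that decision follows. MimeTypeExcluder and its select(String) method are hypothetical names used for illustration, not Tika classes. In Tika itself the equivalent check would presumably sit inside a DocumentSelector implementation's select(Metadata) method, and this assumes the content type of the embedded file is actually available at that point, which may not hold for every container format.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Tika): decides whether an embedded
// document should be parsed, based on its declared MIME type.
final class MimeTypeExcluder {
    // Lowercase MIME type prefixes to skip, e.g. "video/" or "video/mp4".
    private final List<String> excludedPrefixes;

    MimeTypeExcluder(List<String> excludedPrefixes) {
        this.excludedPrefixes = List.copyOf(excludedPrefixes);
    }

    // Returns true if the document should be parsed, false if it should be skipped.
    boolean select(String contentType) {
        if (contentType == null) {
            return true; // unknown type: parse rather than silently drop
        }
        String normalized = contentType.toLowerCase(Locale.ROOT).trim();
        for (String prefix : excludedPrefixes) {
            if (normalized.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        MimeTypeExcluder excluder = new MimeTypeExcluder(List.of("video/"));
        System.out.println(excluder.select("video/mp4"));       // false
        System.out.println(excluder.select("application/pdf")); // true
    }
}
```

Matching on a prefix rather than an exact type is a design choice made here so that one entry such as "video/" covers all movie formats; exact types can still be listed in full.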


Re: How to exclude a mimetype in tika?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Config dumper would be most appreciated in tika-examples!
