Posted to solr-user@lucene.apache.org by keeblerh <ke...@yahoo.com> on 2014/09/18 16:14:39 UTC
Re: How to exclude a mimetype in tika?
eShard wrote
> Good afternoon,
> I'm using Solr 4.0 Final.
> I have movies "hidden" in zip files that need to be excluded from the
> index.
> I can't filter movies on the crawler because then I would have to exclude
> all zip files.
> I was told I can have Tika skip the movies.
> The details are escaping me at this point.
> How do I exclude a file in the Tika configuration?
> I assume it's something I add in the update/extract handler, but I'm not
> sure.
>
> Thanks,
I am having the same issue. I need to exclude some mime types from the zip
files and am using Solr 4.8. Did you ever get an answer to this? Thanks.
--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-exclude-a-mimetype-in-tika-tp4127168p4159676.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to exclude a mimetype in tika?
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
A config dumper would be most appreciated in tika-examples!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Thursday, September 18, 2014 10:19 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Cc: "keeblerh@yahoo.com" <ke...@yahoo.com>
Subject: RE: How to exclude a mimetype in tika?
>Speaking of which...last time I went looking for an example of an
>up-to-date tika config file, it was hard to find (thank you, jboss and
>https://wiki.csc.calpoly.edu/DocuCategMontano/browser/Parser/tika-config.xml).
>
>Should I add a DefaultTikaConfigDumper to the examples module that would
>dump a default tika config with the current version of Tika so that
>people can dump it and then modify it?
>
>Or, did I just plain miss an already existing example on our website/wiki?
>
>Best,
>
> Tim
RE: How to exclude a mimetype in tika?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Speaking of which...last time I went looking for an example of an up-to-date tika config file, it was hard to find (thank you, jboss and https://wiki.csc.calpoly.edu/DocuCategMontano/browser/Parser/tika-config.xml).
Should I add a DefaultTikaConfigDumper to the examples module that would dump a default tika config with the current version of Tika so that people can dump it and then modify it?
Or, did I just plain miss an already existing example on our website/wiki?
Best,
Tim
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Thursday, September 18, 2014 12:56 PM
To: user@tika.apache.org
Cc: keeblerh@yahoo.com
Subject: Re: How to exclude a mimetype in tika?
+1 Tim, I believe so?
Re: How to exclude a mimetype in tika?
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
+1 Tim, I believe so?
-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Thursday, September 18, 2014 7:45 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Cc: "keeblerh@yahoo.com" <ke...@yahoo.com>
Subject: FW: How to exclude a mimetype in tika?
>Tika Colleagues (Tika'ers, Tikis?),
>
>Is this the right answer:
>
>Drop the relevant parsers from the tika.config file and make sure to
>point solr to this file in your solr request handler definition:
><str name="tika.config">/my/path/to/tika.config</str>?
>
> I only have experience as a programmatic user of Tika and would use a
>DocumentSelector, but would the above work?
FW: How to exclude a mimetype in tika?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Tika Colleagues (Tika'ers, Tikis?),
Is this the right answer:
Drop the relevant parsers from the tika.config file and make sure to point solr to this file in your solr request handler definition: <str name="tika.config">/my/path/to/tika.config</str>?
I only have experience as a programmatic user of Tika and would use a DocumentSelector, but would the above work?
Re: How to exclude a mimetype in tika?
Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
Which crawler are you using?
RE: How to exclude a mimetype in tika?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
One option (I think--answer is untested!) is to remove the parsers you don't want from the tika config file. Make sure to specify the tika.config file parameter in your ExtractingRequestHandler in Solr (https://wiki.apache.org/solr/ExtractingRequestHandler).
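For reference, the handler definition in solrconfig.xml might look roughly like the sketch below. The handler name and class follow the stock Solr 4.x example configuration, the uprefix default is only illustrative, and the path is a placeholder. Like the advice above, this is untested against a zip file containing movies.

```xml
<!-- solrconfig.xml (sketch): ExtractingRequestHandler pointed at a custom Tika config. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Illustrative default only; keep whatever your handler already defines. -->
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Placeholder path: replace with the location of your edited Tika config file. -->
  <str name="tika.config">/my/path/to/tika.config</str>
</requestHandler>
```

Note that tika.config sits next to the defaults list rather than inside it, since it is read when the handler is initialized, not per request.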
In response to this question, I just added an example to tika trunk (TIKA-1418) for how to dump the current tika config (org.apache.tika.example.DumpTikaConfigExample). Users can use the dumped config file to make modifications. The last time I looked for a tika config file, examples were difficult to find.
An example from the dumper is here:
https://issues.apache.org/jira/secure/attachment/12670000/tika-config-SNAPSHOT-1.7_20140919.xml
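A trimmed config might then look something like the sketch below. It assumes the layout the dumped file uses (a parsers block with one parser entry per class). The particular parsers kept here are only an illustration, so please check the class names against the config dumped from your own Tika version. The idea is to keep the package parser so that zip files are still opened, keep the document parsers you want, and leave out the video parsers.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Tika config (sketch): only the parsers listed here are used. -->
<properties>
  <parsers>
    <!-- Keep the container parser so zip archives are still unpacked. -->
    <parser class="org.apache.tika.parser.pkg.PackageParser"/>
    <!-- Keep the document parsers you want indexed (illustrative selection). -->
    <parser class="org.apache.tika.parser.pdf.PDFParser"/>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser"/>
    <parser class="org.apache.tika.parser.txt.TXTParser"/>
    <!-- Deliberately no entry for org.apache.tika.parser.mp4.MP4Parser
         or any other audio/video parser. -->
  </parsers>
</properties>
```

Whether the movie entries inside the zip are then skipped entirely, or still show up by file name in the extracted text, is something to verify on a test document.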
Let me know if the above recommendation works!
Happy extraction!
Best,
Tim