Posted to solr-user@lucene.apache.org by Surendra <cs...@gmail.com> on 2011/06/20 13:59:11 UTC
Re: upgrading to Tika 0.9 on Solr 1.4.1
Mattmann, Chris A (388J) <chris.a.mattmann <at> jpl.nasa.gov> writes:
>
> Hi Jo,
>
> You may consider checking out Tika trunk, where we recently have a Tika
> JAX-RS web service [1] committed as part of the tika-server module. You
> could probably wire DIH into it and accomplish the same thing.
>
> Cheers,
> Chris
>
> [1] https://issues.apache.org/jira/browse/TIKA-593
>
> On Feb 24, 2011, at 12:42 PM, jo wrote:
>
> >
> > I have tried the steps indicated here:
> > http://wiki.apache.org/solr/ExtractingRequestHandler
> >
> > and when I try to parse a document nothing happens, not even an error. I
> > have copied the jar files everywhere, and still nothing. Can anyone give me
> > the steps to upgrade just Tika? By the way, Solr 1.4.1 currently has Tika 0.4.
> >
> > thank you
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.nabble.com/upgrading-to-Tika-0-9-on-Solr-1-4-1-tp2570526p2570526.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann <at> nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Hey Chris
I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib)
after building them from the source provided by Tika. Now I have an issue with
this. I am extracting PDF content using Solr. I have set fmap.content to
"attr_content" in the configurable params, so that I can see the entire
extracted document. After the Tika update I am not able to see attr_content
in the search results. When I restore the old 0.4 Tika jars, attr_content
appears again. I didn't find any exceptions in the console. Is this a known
behavior that someone has already faced? Can you guide me to resolve this?
-- Surendra
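For readers following along: the fmap.content mapping described above is typically set in the defaults of the extracting request handler in solrconfig.xml. The fragment below is a minimal sketch, not the poster's actual configuration. The handler path /update/extract and the uprefix value are assumptions taken from the stock Solr example setup; only the fmap.content=attr_content mapping comes from the message.

```xml
<!-- solrconfig.xml: sketch of an ExtractingRequestHandler (Solr Cell) definition.
     Assumption: the handler is registered at /update/extract, as in the Solr example config. -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Map the body text extracted by Tika ("content") to the attr_content field. -->
    <str name="fmap.content">attr_content</str>
    <!-- Assumption: prefix fields unknown to the schema with attr_ so they match a dynamic field. -->
    <str name="uprefix">attr_</str>
  </lst>
</requestHandler>
```

The same mapping can also be passed per request as a URL parameter (fmap.content=attr_content) instead of being fixed in the defaults.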
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by bing <JS...@hotmail.com>.
Hi, all,
I tried to upgrade from Tika 0.8 to Tika 0.10 on Solr 3.3.0, following similar
steps, but failed.
1. Replace the following jars in /contrib/extraction/:
   fontbox-1.6.0, jempbox-1.6.0, pdfbox-1.6.0, tika-core-0.10,
   tika-parsers-0.10
2. Copy all the jars in /contrib/langid/* from Solr 3.5.0
3. Copy /dist/apache-solr-langid-3.5.0 from Solr 3.5.0
4. Configure solrconfig.xml in Solr 3.3.0, adding the following lib directives
   and the definition of updateRequestProcessorChain:
<lib dir="../../contrib/langid/lib" />
<lib dir="../../dist/" regex="apache-solr-langid-\d.*\.jar" />
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text,title,author</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Errors: (typical errors when factory is not found)
org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory'
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389)
at
Has anyone tried something similar before? Please advise. Thank you.
Best Regards,
Bing
--
View this message in context: http://lucene.472066.n3.nabble.com/upgrading-to-Tika-0-9-on-Solr-1-4-1-tp2570526p3772177.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by Surendra <cs...@gmail.com>.
I have upgraded my Solr distribution to 3.2, along with the jars referenced by
my application (in particular, the application that calls Solr was still using
the 1.4.1 Solr jar, which caused the javabin exception). I also updated
pdfbox/jempbox/fontbox to the latest versions and Tika to 0.9, which fixed
things for me!
-- Surendranadh
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Glad it worked out!
Cheers,
Chris
On Jun 22, 2011, at 5:14 AM, Surendra wrote:
> Hi Chris, Andreas
>
> I have upgraded to Solr 3.2 and everything seems fine now. I will have to
> integrate this into my application and watch for any further issues. Thanks
> again for your patience and time.
>
> --Surendra
>
>
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by Surendra <cs...@gmail.com>.
Hi Chris, Andreas
I have upgraded to Solr 3.2 and everything seems fine now. I will have to
integrate this into my application and watch for any further issues. Thanks
again for your patience and time.
--Surendra
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by Andreas Kemkes <a5...@yahoo.com>.
We are successfully extracting PDF content with Solr 3.1 and Tika 0.9.
Replace
  fontbox-1.3.1.jar jempbox-1.3.1.jar pdfbox-1.3.1.jar tika-core-0.8.jar
  tika-parsers-0.8.jar
with
  fontbox-1.4.0.jar jempbox-1.4.0.jar pdfbox-1.4.0.jar tika-core-0.9.jar
  tika-parsers-0.9.jar
I'm not entirely certain whether a recompile of Solr was necessary.
Andreas
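Swapped jars only take effect if Solr actually loads the directory they live in. As a hedged sketch, the lib directives below show how the extraction contrib is commonly put on the classpath in solrconfig.xml. The paths and the apache-solr-cell jar name pattern are assumptions based on the example distribution layout, not something stated in this message.

```xml
<!-- solrconfig.xml: load the extraction contrib and its dependencies
     (Tika, PDFBox, FontBox, JempBox) from the contrib directory.
     Assumption: paths follow the example layout, relative to the core's instanceDir. -->
<lib dir="../../contrib/extraction/lib" />
<!-- Assumption: the Solr Cell jar sits in dist/ under the usual apache-solr-cell name. -->
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
```

When replacing jars, the old versions should be removed from the directory rather than left next to the new ones, so that only one version of each library ends up on the classpath.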
________________________________
From: Surendra <cs...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tue, June 21, 2011 5:18:31 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi Andreas
I tried Solr 3.1 as well as 3.2, but I was not able to overcome these issues
with the newer versions either. I need the query attr_content:* to return
results (with 1.4.1 this works), which is not happening. It indexes fine in
3.1, but in 3.2 I get the following error:
Invalid version or the data in not in 'javabin' format
--Surendra
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by Surendra <cs...@gmail.com>.
Hi Andreas
I tried Solr 3.1 as well as 3.2, but I was not able to overcome these issues
with the newer versions either. I need the query attr_content:* to return
results (with 1.4.1 this works), which is not happening. It indexes fine in
3.1, but in 3.2 I get the following error:
Invalid version or the data in not in 'javabin' format
--Surendra
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Surendra,
Thanks. Besides replacing the tika-*-0.9.jar files, you also need to replace the dependency jar files for the other libs as well since they have been upgraded. It's also possible that b/c of API changes, Solr 1.4.1 won't work with Tika 0.9 without modifying the ExtractingRequestHandler code...
Cheers,
Chris
On Jun 21, 2011, at 12:28 AM, Surendra wrote:
> Hi Chris
>
> I did a proper checkout of Tika 0.9, built the jars as described at
> http://tika.apache.org/0.9/gettingstarted.html, and replaced the existing
> Tika 0.4 jars with the 0.9 jars. I don't see any difference. The documents are
> getting indexed, but the fmap.content field (attr_content) is still not
> available for me. Am I missing something? Meanwhile, I'm digging further into
> this issue. Any further help would be great! Thanks for your time.
>
> -- Surendra
>
>
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by Surendra <cs...@gmail.com>.
Hi Chris
I did a proper checkout of Tika 0.9, built the jars as described at
http://tika.apache.org/0.9/gettingstarted.html, and replaced the existing
Tika 0.4 jars with the 0.9 jars. I don't see any difference. The documents are
getting indexed, but the fmap.content field (attr_content) is still not
available for me. Am I missing something? Meanwhile, I'm digging further into
this issue. Any further help would be great! Thanks for your time.
-- Surendra
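One general thing worth ruling out when a mapped field is indexed but missing from results (a sanity check only, not a confirmed cause of the problem in this thread): a field is returned in search results only if it is declared as stored in schema.xml. The sketch below assumes a dynamic field covers attr_content; the field type name "text" is a placeholder for whatever text type your schema defines.

```xml
<!-- schema.xml: a field must be stored="true" to be returned in search results.
     Assumption: a field type named "text" exists in your schema; substitute your own. -->
<dynamicField name="attr_*" type="text" indexed="true" stored="true" multiValued="true"/>
```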
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by Andreas Kemkes <a5...@yahoo.com>.
I've unsuccessfully attempted to go down this road - there are API changes, some
of which I was able to solve by taking code snippets from Solr 3.1. Some
extraction-related tests wouldn't pass (look for 'Solr 1.4.1 and Tika 0.9 -
some tests not passing' in the archive). Ultimately, I decided that the then
newly released Solr 3.1 was the less rocky route. Not sure if that is an option
for you.
Andreas
________________________________
From: "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>
To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
Sent: Mon, June 20, 2011 7:18:34 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi Surendra,
On Jun 20, 2011, at 4:59 AM, Surendra wrote:
> Hey Chris
>
> I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib)
> after building them from the source provided by Tika. Now I have an issue with
> this. I am extracting PDF content using Solr. I have set fmap.content to
> "attr_content" in the configurable params, so that I can see the entire
> extracted document. After the Tika update I am not able to see attr_content
> in the search results. When I restore the old 0.4 Tika jars, attr_content
> appears again. I didn't find any exceptions in the console. Is this a known
> behavior that someone has already faced? Can you guide me to resolve this?
I don't think you can simply add a new tika-core-0.9 and tika-parsers-0.9 to
extraction/lib -- I think you'll need to replace the set of prior Tika jars in
there. Have a look here to see what jars you would need to replace, HTH:
http://tika.apache.org/0.9/gettingstarted.html
Cheers,
Chris
Re: upgrading to Tika 0.9 on Solr 1.4.1
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Surendra,
On Jun 20, 2011, at 4:59 AM, Surendra wrote:
> Hey Chris
>
> I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib)
> after building them from the source provided by Tika. Now I have an issue with
> this. I am extracting PDF content using Solr. I have set fmap.content to
> "attr_content" in the configurable params, so that I can see the entire
> extracted document. After the Tika update I am not able to see attr_content
> in the search results. When I restore the old 0.4 Tika jars, attr_content
> appears again. I didn't find any exceptions in the console. Is this a known
> behavior that someone has already faced? Can you guide me to resolve this?
I don't think you can simply add a new tika-core-0.9 and tika-parsers-0.9 to extraction/lib -- I think you'll need to replace the set of prior Tika jars in there. Have a look here to see what jars you would need to replace, HTH:
http://tika.apache.org/0.9/gettingstarted.html
Cheers,
Chris