Posted to solr-user@lucene.apache.org by Surendra <cs...@gmail.com> on 2011/06/20 13:59:11 UTC

Re: upgrading to Tika 0.9 on Solr 1.4.1

Mattmann, Chris A (388J) <chris.a.mattmann <at> jpl.nasa.gov> writes:

> 
> Hi Jo,
> 
> You may consider checking out Tika trunk, where we recently committed a Tika
> JAX-RS web service [1] as part of the tika-server module. You could probably
> wire DIH into it and accomplish the same thing.
> 
> Cheers,
> Chris
> 
> [1] https://issues.apache.org/jira/browse/TIKA-593
> 
> On Feb 24, 2011, at 12:42 PM, jo wrote:
> 
> > 
> > I have tried the steps indicated here:
> > http://wiki.apache.org/solr/ExtractingRequestHandler
> > 
> > and when I try to parse a document nothing happens, not even an error. I have
> > copied the jar files everywhere, and still nothing. Can anyone give me the
> > steps to upgrade just Tika? By the way, Solr 1.4.1 currently ships with Tika 0.4.
> > 
> > thank you
> > 
> > 
> > -- 
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/upgrading-to-Tika-0-9-on-Solr-1-4-1-tp2570526p2570526.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann <at> nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Hey Chris

I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib)
after building them from the source provided by Tika, and now I have an issue.
I am extracting PDF content with Solr. I have set fmap.content to "attr_content"
in the configurable params, so that field holds the entire extracted document.
After the Tika update, attr_content no longer appears in the search results.
When I restore the old Tika 0.4 jars, attr_content appears again. I didn't find
any exceptions in the console. Is this a known behavior that someone has already
faced? Can you guide me to resolve this?

-- Surendra
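
For readers following along, the request described above can be sketched as below. This is only an illustration: the base URL and the document id "doc1" are assumptions, not details from the message. The parameters literal.id, fmap.content and commit are standard ExtractingRequestHandler parameters. Note that attr_content also has to be a stored field in schema.xml to show up in search results.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance at the default example address.
SOLR_BASE = "http://localhost:8983/solr"


def extract_url(doc_id, content_field="attr_content"):
    """URL that posts a file to /update/extract and maps Tika's 'content' field."""
    params = {
        "literal.id": doc_id,            # hypothetical document id
        "fmap.content": content_field,   # rename extracted 'content' to attr_content
        "commit": "true",
    }
    return SOLR_BASE + "/update/extract?" + urlencode(params)


def check_url(content_field="attr_content"):
    """URL for a query that matches any document with a value in the mapped field."""
    params = {"q": content_field + ":*", "fl": "id," + content_field, "rows": "1"}
    return SOLR_BASE + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(extract_url("doc1"))
    print(check_url())
```

If the second URL returns no documents after a successful extract request, the extracted text is not reaching attr_content, which matches the symptom reported here.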






Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by bing <JS...@hotmail.com>.
Hi, all, 

I tried to upgrade Tika 0.8 to Tika 0.10 on Solr 3.3.0, following similar
steps, but failed.

1. Replace the following jars in /contrib/extraction/ 
fontbox-1.6.0, jempbox-1.6.0, pdfbox-1.6.0, tika-core-0.10,
tika-parsers-0.10;

2. Copy all the jars in /contrib/langid/* from solr3.5.0 

3. Copy /dist/apache-solr-langid-3.5.0 from solr3.5.0

4. Configure solrconfig.xml in solr3.3.0, adding the following lib and
definition of updateRequestProcessorChain.

  <lib dir="../../contrib/langid/lib" />
  <lib dir="../../dist/" regex="apache-solr-langid-\d.*\.jar" />

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">text,title,author</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.fallback">en</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>


Errors:  (typical errors when factory is not found)

org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory'
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389)
    at 

Has anyone tried something similar before? Please advise. Thank you.

Best Regards, 
Bing 
    

--
View this message in context: http://lucene.472066.n3.nabble.com/upgrading-to-Tika-0-9-on-Solr-1-4-1-tp2570526p3772177.html
Sent from the Solr - User mailing list archive at Nabble.com.
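
One way to narrow down an "Error loading class" failure like the one above is to check whether any jar in the directories named by the <lib> entries actually contains the class. The sketch below does that with the Python standard library. The function name is made up for illustration, and the directories in the usage example are simply the ones from the configuration above, resolved relative to wherever the script runs.

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, lib_dirs):
    """Return the jars under lib_dirs that contain the given fully qualified class."""
    # A class a.b.C is stored in a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for lib_dir in lib_dirs:
        for jar in sorted(Path(lib_dir).glob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as archive:
                    if entry in archive.namelist():
                        hits.append(jar)
            except zipfile.BadZipFile:
                # Skip files that are named *.jar but are not valid archives.
                continue
    return hits


if __name__ == "__main__":
    wanted = ("org.apache.solr.update.processor."
              "LangDetectLanguageIdentifierUpdateProcessorFactory")
    for jar in jars_containing(wanted, ["../../contrib/langid/lib", "../../dist"]):
        print(jar)
```

An empty result suggests the <lib> paths do not resolve to the copied jars. A hit only shows that the class file is present; it can still fail to load if it depends on classes that the running Solr version does not provide.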

Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by Surendra <cs...@gmail.com>.
I have upgraded my Solr distribution to 3.2 and also the Solr jars referenced
by my application (the Solr jar in my application, which calls Solr, was still
1.4.1, and that was causing the javabin exception). I also updated
pdfbox/jempbox/fontbox to the latest versions and Tika to 0.9, which made
things work for me!

-- Surendranadh



Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Glad it worked out!

Cheers,
Chris

On Jun 22, 2011, at 5:14 AM, Surendra wrote:

> Hi Chris, Andreas
> 
> I have upgraded to Solr 3.2 and everything seems fine now. I still have to
> integrate this into my application and watch for any further issues. Again,
> thanks for your patience and time.
> 
> --Surendra
> 
> 




Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by Surendra <cs...@gmail.com>.
Hi Chris, Andreas

I have upgraded to Solr 3.2 and everything seems fine now. I still have to
integrate this into my application and watch for any further issues. Again,
thanks for your patience and time.

--Surendra



Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by Andreas Kemkes <a5...@yahoo.com>.
We are successfully extracting PDF content with Solr 3.1 and Tika 0.9.

Replace
fontbox-1.3.1.jar jempbox-1.3.1.jar pdfbox-1.3.1.jar tika-core-0.8.jar 
tika-parsers-0.8.jar 

with
 
fontbox-1.4.0.jar jempbox-1.4.0.jar pdfbox-1.4.0.jar tika-core-0.9.jar 
tika-parsers-0.9.jar 

I'm not entirely certain whether a recompile of Solr was necessary.
Andreas
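
The swap listed above can be sketched as a dry run that only reports what would change. The jar names are copied from the list; the function name and both directory arguments are assumptions for illustration, and nothing is deleted or copied.

```python
from pathlib import Path

# Old -> new jar names, copied from the list above.
REPLACEMENTS = {
    "fontbox-1.3.1.jar": "fontbox-1.4.0.jar",
    "jempbox-1.3.1.jar": "jempbox-1.4.0.jar",
    "pdfbox-1.3.1.jar": "pdfbox-1.4.0.jar",
    "tika-core-0.8.jar": "tika-core-0.9.jar",
    "tika-parsers-0.8.jar": "tika-parsers-0.9.jar",
}


def plan_jar_swap(lib_dir, new_jar_dir):
    """Return (to_delete, to_copy, missing) without touching any file.

    lib_dir     -- Solr's extraction lib directory (hypothetical path)
    new_jar_dir -- directory holding the freshly built or downloaded jars
    """
    lib_dir, new_jar_dir = Path(lib_dir), Path(new_jar_dir)
    to_delete, to_copy, missing = [], [], []
    for old_name, new_name in REPLACEMENTS.items():
        old_path = lib_dir / old_name
        if old_path.exists():
            to_delete.append(old_path)      # old jar that should be removed
        new_path = new_jar_dir / new_name
        if new_path.exists():
            to_copy.append(new_path)        # new jar ready to be copied in
        else:
            missing.append(new_name)        # new jar not found yet
    return to_delete, to_copy, missing
```

Reviewing the three lists before changing anything helps avoid ending up with old and new versions of the same library side by side in the lib directory.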



________________________________
From: Surendra <cs...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tue, June 21, 2011 5:18:31 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

Hi Andreas
I tried Solr 3.1 as well as 3.2. I was not able to overcome these issues with
the newer versions either. I need the query attr_content:* to return results
(with 1.4.1 this works), which is not happening. Indexing works in 3.1, but in
3.2 I get the following error:
Invalid version or the data in not in 'javabin' format
--Surendra

Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by Surendra <cs...@gmail.com>.
Hi Andreas
I tried Solr 3.1 as well as 3.2. I was not able to overcome these issues with
the newer versions either. I need the query attr_content:* to return results
(with 1.4.1 this works), which is not happening. Indexing works in 3.1, but in
3.2 I get the following error:
Invalid version or the data in not in 'javabin' format
--Surendra




Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Surendra,

Thanks. Besides replacing the tika-*-0.9.jar files, you also need to replace the dependency jars for the other libraries, since those have been upgraded as well. It's also possible that, because of API changes, Solr 1.4.1 won't work with Tika 0.9 without modifying the ExtractingRequestHandler code...

Cheers,
Chris
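
A quick way to spot leftover dependency jars is to group the jar file names in the lib directory by artifact and flag any artifact present in more than one version. This is a rough sketch using only file names; the function name and the naming pattern (name-version.jar, with the version starting with a digit) are assumptions and will not cover every jar.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <artifact>-<version>.jar, version starts with a digit.
JAR_NAME = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_mixed_versions(jar_names):
    """Map each artifact that appears in several versions to its sorted versions."""
    versions = defaultdict(set)
    for jar_name in jar_names:
        match = JAR_NAME.match(jar_name)
        if match:
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


if __name__ == "__main__":
    import os
    import sys
    # Usage: python check_jars.py path/to/extraction/lib
    print(find_mixed_versions(os.listdir(sys.argv[1])))
```

An empty result means no artifact is duplicated by name. It does not prove that the remaining versions are compatible with each other or with the Solr release in use.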

On Jun 21, 2011, at 12:28 AM, Surendra wrote:

> Hi Chris
> 
> I did a proper checkout of Tika 0.9, built the jars as described at
> "http://tika.apache.org/0.9/gettingstarted.html", and replaced the existing
> Tika 0.4 jars with the 0.9 jars. I don't see any difference. The documents are
> getting indexed, but fmap.content (attr_content) is still not available for
> me. Am I missing something? In the meantime I'm digging further into this
> issue. Any further help would be great! Thanks for your time...
> 
> -- Surendra
> 
> 




Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by Surendra <cs...@gmail.com>.
Hi Chris

I did a proper checkout of Tika 0.9, built the jars as described at
"http://tika.apache.org/0.9/gettingstarted.html", and replaced the existing
Tika 0.4 jars with the 0.9 jars. I don't see any difference. The documents are
getting indexed, but fmap.content (attr_content) is still not available for
me. Am I missing something? In the meantime I'm digging further into this
issue. Any further help would be great! Thanks for your time...

-- Surendra



Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by Andreas Kemkes <a5...@yahoo.com>.
I've unsuccessfully attempted to go down this road. There are API changes, some
of which I was able to resolve by taking code snippets from Solr 3.1. Some
extraction-related tests wouldn't pass (look for 'Solr 1.4.1 and Tika 0.9 -
some tests not passing' in the archive). Ultimately, I decided that the then
newly released Solr 3.1 was the less rocky route. Not sure if that is an option
for you.

Andreas



________________________________
From: "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>
To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
Sent: Mon, June 20, 2011 7:18:34 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

Hi Surendra,

On Jun 20, 2011, at 4:59 AM, Surendra wrote:

> Hey Chris
> 
> I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib)
> after building them from the source provided by Tika, and now I have an issue.
> I am extracting PDF content with Solr. I have set fmap.content to "attr_content"
> in the configurable params, so that field holds the entire extracted document.
> After the Tika update, attr_content no longer appears in the search results.
> When I restore the old Tika 0.4 jars, attr_content appears again. I didn't find
> any exceptions in the console. Is this a known behavior that someone has already
> faced? Can you guide me to resolve this?

I don't think you can simply add a new tika-core-0.9 and tika-parsers-0.9 to 
extraction/lib -- I think you'll need to replace the set of prior Tika jars in 
there. Have a look here to see what jars you would need to replace, HTH:

http://tika.apache.org/0.9/gettingstarted.html

Cheers,
Chris


Re: upgrading to Tika 0.9 on Solr 1.4.1

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Surendra,

On Jun 20, 2011, at 4:59 AM, Surendra wrote:

> Hey Chris
> 
> I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib)
> after building them from the source provided by Tika, and now I have an issue.
> I am extracting PDF content with Solr. I have set fmap.content to "attr_content"
> in the configurable params, so that field holds the entire extracted document.
> After the Tika update, attr_content no longer appears in the search results.
> When I restore the old Tika 0.4 jars, attr_content appears again. I didn't find
> any exceptions in the console. Is this a known behavior that someone has already
> faced? Can you guide me to resolve this?

I don't think you can simply add a new tika-core-0.9 and tika-parsers-0.9 to extraction/lib -- I think you'll need to replace the set of prior Tika jars in there. Have a look here to see what jars you would need to replace, HTH:

http://tika.apache.org/0.9/gettingstarted.html

Cheers,
Chris
