Posted to solr-user@lucene.apache.org by sweety <sw...@yahoo.com> on 2013/12/21 20:06:39 UTC

indexing .docx using solrj

I am trying to index a .docx file using SolrJ, following this example:
http://wiki.apache.org/solr/ContentStreamUpdateRequestExample

My code is:
import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.*;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class rich_index {

    public static void main(String[] args) {
        try {
            // Solr Cell can also index MS Office file types (2003 and 2007 versions).
            String fileName = "C:\\solr\\document\\src\\test1\\contract.docx";
            // This will be the unique id used by Solr to index the file contents.
            String solrId = "contract.docx";

            indexFilesSolrCell(fileName, solrId);
        } catch (Exception ex) {
            System.out.println(ex.toString());
        }
    }

    public static void indexFilesSolrCell(String fileName, String solrId)
            throws IOException, SolrServerException {

        String urlString = "http://localhost:8080/solr/document";
        SolrServer solr = new HttpSolrServer(urlString);

        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");

        up.addFile(new File(fileName), "text");

        // Request parameters for the extracting handler.
        up.setParam("literal.id", solrId);
        up.setParam("uprefix", "ignored_");
        up.setParam("fmap.content", "contents");

        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(up);

        QueryResponse rsp = solr.query(new SolrQuery("*:*"));

        System.out.println(rsp);
    }
}
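
For reference, the setParam calls above end up as ordinary HTTP query parameters on the /update/extract handler. The following is only a minimal sketch using the JDK (it is not SolrJ and does not send the file body); the class name ExtractUrlBuilder and its build method are hypothetical, introduced here for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SolrJ: assembles the URL of an extract request.
public class ExtractUrlBuilder {

    /** Appends /update/extract and the URL-encoded parameters to the core URL. */
    public static String build(String coreUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(coreUrl).append("/update/extract");
        char sep = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                sb.append(sep)
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                sep = '&';
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "contract.docx");
        params.put("uprefix", "ignored_");
        params.put("fmap.content", "contents");
        params.put("commit", "true");
        System.out.println(build("http://localhost:8080/solr/document", params));
    }
}
```

Comparing such a URL with the params={...} entry that Solr writes to its log is a quick way to confirm that the request reached the handler with the intended parameters.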



These are my logs:
Dec 22, 2013 12:27:58 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [document] webapp=/solr path=/update/extract params={fmap.content=contents&waitSearcher=true&commit=true&uprefix=ignored_&literal.id=contract.docx&wt=javabin&version=2&softCommit=false} {} 0 0
Dec 22, 2013 12:27:58 AM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: *org/apache/xml/serialize/BaseMarkupSerializer*
	at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:651)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:364)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)
	at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)

To resolve this I added xerces.jar to the build path; it contains the
org/apache/xml/serialize/BaseMarkupSerializer class, but the error is still
not resolved.
What is the problem?


*Solrconfig:*
<requestHandler name="/update/extract" 
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="map.Last-Modified">last_modified</str>
<str name="fmap.content">contents</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>

</lst>
</requestHandler>

*schema:*
<fields> 

<field name="doc_id" type="uuid" indexed="true" stored="true" default="NEW"
multiValued="false"/>
<field name="id" type="integer" indexed="true" stored="true" required="true"
multiValued="false"/>
<field name="contents" type="text" indexed="true" stored="true"
multiValued="false"/>
<field name="author" type="title_text" indexed="true" stored="true"
multiValued="true"/>
<field name="title" type="title_text" indexed="true" stored="true"/>
<field name="date_modified" type="date" indexed="true" stored="true"
multivalued="true"/>
</fields>



--
View this message in context: http://lucene.472066.n3.nabble.com/indexing-docx-using-solrj-tp4107737.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: indexing .docx using solrj

Posted by Andrea Gazzarini <a....@gmail.com>.
The error you were getting is a LinkageError: put simply, a class that was
available at compile time is not there at runtime (a very simplistic
definition, because put that way it sounds like a
ClassNotFoundException, which it isn't).

The class (and the jar) is probably there somewhere; what is wrong is the
classloader that owns it. That's why I asked you for the content of the lib
folders: classes loaded from solr.home/lib have a different classloader
from classes / jars in the Tomcat libs (the two have a parent / child
relationship).
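
One way to check this kind of thing is to ask the JVM which classloader and which jar a class actually comes from. What follows is a minimal sketch using only the JDK; the class name ClassLocator is made up for illustration, and the result is only meaningful when the code runs inside the JVM and webapp you want to inspect (running it from Eclipse tells you about the Eclipse classpath, not about Tomcat).

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class is loaded from.
public class ClassLocator {

    /** Describes the loader and code location of a class, or why it is not visible. */
    public static String describe(String className, ClassLoader loader) {
        try {
            // 'false' avoids running static initializers; we only want to resolve the class.
            Class<?> c = Class.forName(className, false, loader);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            String location = (src == null || src.getLocation() == null)
                    ? "<bootstrap or unknown>"
                    : src.getLocation().toString();
            return className + " loaded by " + c.getClassLoader() + " from " + location;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " NOT visible: " + e;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        System.out.println(describe("org.apache.xml.serialize.BaseMarkupSerializer", loader));
    }
}
```

If the class is reported as not visible from the webapp's context classloader even though the jar exists on disk, the jar is most likely sitting in a folder that this classloader (or its parents) does not read.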

That is (more or less) the theory. In practice, I honestly don't know
why the error disappeared after a reboot (maybe because in W\\\ the
"restart" strategy is a well-known panacea? :D )

Anyway, nice to hear that the problem is solved.

Best,
Andrea

Re: indexing .docx using solrj

Posted by sweety <sw...@yahoo.com>.
It is working now; I just restarted the computer.
But I still don't understand the reason for the error.
Thank you for your efforts, though.




Re: indexing .docx using solrj

Posted by Ahmet Arslan <io...@yahoo.com>.
It looks like the jars on the classpath are mixed up. A clean, fresh installation from scratch is recommended.
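
Mixed jars often show up as the same class being packaged in more than one jar. Below is a rough sketch for spotting that in a single lib folder, using only the JDK zip API; DuplicateClassScanner is a hypothetical name, and the folder to scan is passed in as an argument since the right location depends on the installation.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists class entries that occur in more than one jar of a folder.
public class DuplicateClassScanner {

    /** Maps each duplicated class entry to the names of the jars that contain it. */
    public static Map<String, List<String>> findDuplicates(File dir) throws IOException {
        Map<String, List<String>> owners = new TreeMap<>();
        File[] files = dir.listFiles();
        if (files == null) {
            return owners; // not a directory, or not readable
        }
        Arrays.sort(files); // stable, name-ordered output
        for (File f : files) {
            if (!f.isFile() || !f.getName().endsWith(".jar")) {
                continue;
            }
            try (ZipFile zip = new ZipFile(f)) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    String name = entries.nextElement().getName();
                    if (name.endsWith(".class")) {
                        List<String> jars = owners.get(name);
                        if (jars == null) {
                            jars = new ArrayList<>();
                            owners.put(name, jars);
                        }
                        jars.add(f.getName());
                    }
                }
            }
        }
        // Drop every class that was seen in only one jar.
        Iterator<Map.Entry<String, List<String>>> it = owners.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue().size() < 2) {
                it.remove();
            }
        }
        return owners;
    }

    public static void main(String[] args) throws IOException {
        for (Map.Entry<String, List<String>> e : findDuplicates(new File(args[0])).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Duplicates are not always harmful, but two different versions of the same library in one folder are a common source of linkage errors.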





Re: indexing .docx using solrj

Posted by sweety <sw...@yahoo.com>.
Yes, I copied all the jars from contrib/extraction to solr/lib.
Now it cannot find the POI jar; as mentioned in my post above, it shows a
new error.




Re: indexing .docx using solrj

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi,

Did you add jar files under contrib/extraction/lib too?

Since you are already using SolrJ, consider the approach described at http://searchhub.org/2012/02/14/indexing-with-solrj/






Re: indexing .docx using solrj

Posted by sweety <sw...@yahoo.com>.
solr: 4.2
tomcat: 7.0
jdk1.7.0.45

I have created the Solr home in C:\solr, set through the Java options:
-Dsolr.solr.home=C:\solr

C:\solr\lib contains:

the Tika jars; actually, I pasted all the jars from the Solr 4.2 dist and
contrib folders into C:\solr\lib

tomcat/lib contains:
all the jars that came with the installation.





Re: indexing .docx using solrj

Posted by Andrea Gazzarini <a....@gmail.com>.
Ok, then please tell us a bit more about your context:

- versions (solr / java / tomcat)
- where are tika libs? In solr.home lib or in tomcat lib?

Re: indexing .docx using solrj

Posted by sweety <sw...@yahoo.com>.
The jar is already there, in the lib folder of the Solr home.




Re: indexing .docx using solrj

Posted by Andrea Gazzarini <ag...@apache.org>.
That jar is not for your (Eclipse) compiler but for Tomcat. You should
have that jar available in Tomcat or (better) in the lib folder of your
solr.home.

Eclipse doesn't need to recognise it.

Re: indexing .docx using solrj

Posted by sweety <sw...@yahoo.com>.
I have added that jar to the build path,
but I get the same error.
Why is Eclipse not recognising that jar?

The logs also show this:
Caused by: java.lang.NoClassDefFoundError: org/apache/xml/serialize/BaseMarkupSerializer
	at org.apache.solr.handler.extraction.ExtractingRequestHandler.newLoader(ExtractingRequestHandler.java:117)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:63)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
	... 16 more
Caused by: java.lang.ClassNotFoundException: org.apache.xml.serialize.BaseMarkupSerializer
	at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1688)
	at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1533)
	... 22 more






Re: indexing .docx using solrj

Posted by Andrea Gazzarini <a....@gmail.com>.
That class seems to be in the xercesImpl jar... it is probably a dependency of Tika
or a required lib of the underlying parser used for that kind of document.
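
To confirm which jar in a folder actually ships that class, something like the following could be used. It is only a sketch built on the JDK zip API; JarClassFinder is a made-up name, and both the folder and the class name are supplied by the caller.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipFile;

// Hypothetical helper: finds the jars in a folder that contain a given class.
public class JarClassFinder {

    /** Returns every .jar directly under dir that has an entry for className. */
    public static List<File> findJarsContaining(File dir, String className) throws IOException {
        // org.example.Foo -> org/example/Foo.class
        String entry = className.replace('.', '/') + ".class";
        List<File> hits = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files == null) {
            return hits; // not a directory, or not readable
        }
        for (File f : files) {
            if (f.isFile() && f.getName().endsWith(".jar")) {
                try (ZipFile zip = new ZipFile(f)) {
                    if (zip.getEntry(entry) != null) {
                        hits.add(f);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarClassFinder <lib folder> <fully qualified class name>
        for (File jar : findJarsContaining(new File(args[0]), args[1])) {
            System.out.println(jar.getAbsolutePath());
        }
    }
}
```

An empty result for the folder that the server actually reads would mean the jar has to be copied there; the build path of the client project does not matter for this error.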

Andrea