Posted to solr-user@lucene.apache.org by Kevin Miller <ke...@oktax.state.ok.us> on 2009/08/18 22:20:24 UTC

Using Solr Cell to index a Word Document

I am using the Solr nightly build from 8/11/09.  I have set the text field in the solrconfig.xml file to be stored.  When I index an MS Word document and search for a word in its text, the result comes back in XML format.  The text field shows the text of the document, but some areas of the document are FORMDROPDOWNs.  Is there some way to retrieve the information that was entered into the FORMDROPDOWNs?  The text field contains the following information (I have added in parentheses the actual data from the MS Word document for the FORMDROPDOWNs):

<arr name="text">
<str>
       	 OKLAHOMA TAX COMMISSION  
  
  	FISCAL IMPACT STATEMENT AND/OR ADMINISTRATIVE IMPACT STATEMENT
  	FIRST REGULAR SESSION, FIFTY-SECOND OKLAHOMA LEGISLATURE
  
  
  DATE OF IMPACT STATEMENT:	 May 21, 2009
   
  BILL NUMBER:  HB 1097	
  
  STATUS AND DATE OF BILL:	  FORMDROPDOWN(Enrolled Bill)    05/20/2009 
  
  AUTHORS:	House	  FORMTEXT   Dank 			Senate	Brogdon
  
  TAX TYPE (S):    All   SUBJECT:      FORMDROPDOWN(Credit)   
  
  PROPOSAL:	  FORMDROPDOWN(New Law)     
  
  This measure creates a nine (9) member task force to study tax credits.  The measure also includes provisions of procedures and duties for the task force and directs the task force to produce a final written report for the Speaker, the Governor and the Pro Tempore.
  
  
  EFFECTIVE DATE:	August 21, 2009 (Assuming sine die is May 22, 2009)
  
  REVENUE IMPACT: 
  
  Insert dollar amount (plus or minus) of the expected change in state revenues due to this proposed legislation.
  
  FY 09:	None		
  FY 10:	None  FORMTEXT    
  
  ADMINISTRATIVE IMPACT:
  
  Insert the estimated cost or savings to the Tax Commission due to this proposed legislation.
  
  FY 10:	None
  
  
    	  	lrh 
  DATE				DIVISION DIRECTOR
  
                                                                                         
  DATE				XXXXX XXXXXX, ECONOMIST
  
                                     
  DATE				FOR THE COMMISSION
   
</str>
</arr>

Kevin Miller
Web Services

Re: Using Solr Cell to index a Word Document

Posted by Mark Miller <ma...@gmail.com>.
Solr defers to Tika for this. Tika uses getParagraphText() from the POI 
WordExtractor class:

http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html

POI appears to be in limbo and I'm not seeing anything in WordExtractor 
that looks like it might help you.

I'd inquire at the Tika project though.
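
In the meantime, it may help to at least find which indexed documents contain unresolved form-field markers. The following is only a rough sketch, assuming the response shape shown in the original message (an <arr name="text"> element wrapping a <str>). The function name find_form_markers and the sample string are made up for illustration and are not part of Solr, Tika, or POI. It does not recover the selected values; it only reports each marker together with the label text that precedes it on the same line.

```python
import re
import xml.etree.ElementTree as ET

# Field-code markers that show up in the extracted text instead of the form values.
FIELD_MARKER = re.compile(r"\b(FORMDROPDOWN|FORMTEXT)\b")


def find_form_markers(solr_xml):
    """Return (marker, preceding label) pairs found in the stored 'text' field.

    Hypothetical helper: assumes the field is returned as
    <arr name="text"><str>...</str></arr>, as in the original message.
    """
    root = ET.fromstring(solr_xml)
    hits = []
    for arr in root.iter("arr"):
        if arr.get("name") != "text":
            continue
        for s in arr.iter("str"):
            for line in (s.text or "").splitlines():
                for m in FIELD_MARKER.finditer(line):
                    # Collapse runs of spaces/tabs in the text before the marker.
                    label = " ".join(line[:m.start()].split())
                    hits.append((m.group(1), label))
    return hits


# Made-up sample modelled on the excerpt in the original message.
sample = """<doc>
<arr name="text">
<str>
  STATUS AND DATE OF BILL:    FORMDROPDOWN    05/20/2009
  AUTHORS:  House    FORMTEXT   Dank    Senate  Brogdon
  PROPOSAL:   FORMDROPDOWN
</str>
</arr>
</doc>"""

for marker, label in find_form_markers(sample):
    print(marker, "after:", label)
```

A list like this could be used to flag documents for manual review until the extraction side can return the selected values.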


-- 
- Mark

http://www.lucidimagination.com


