Posted to solr-user@lucene.apache.org by Kevin Miller <ke...@oktax.state.ok.us> on 2009/08/18 22:20:24 UTC
Using Solr Cell to index a Word Document
I am using the Solr nightly build from 8/11/09. I have set the text field in the solrconfig.xml file to be stored. I index an MS Word document, and when I search for a word in the text of the document, the result comes back in XML format. The text field shows the text of the document, but some areas of the document are FORMDROPDOWNs. Is there some way to retrieve the information that was entered into the FORMDROPDOWNs? The text field contains the following (I have added, in parentheses, the actual data from the MS Word document for the FORMDROPDOWNs):
<arr name="text">
<str>
OKLAHOMA TAX COMMISSION
FISCAL IMPACT STATEMENT AND/OR ADMINISTRATIVE IMPACT STATEMENT
FIRST REGULAR SESSION, FIFTY-SECOND OKLAHOMA LEGISLATURE
DATE OF IMPACT STATEMENT: May 21, 2009
BILL NUMBER: HB 1097
STATUS AND DATE OF BILL: FORMDROPDOWN(Enrolled Bill) 05/20/2009
AUTHORS: House FORMTEXT Dank Senate Brogdon
TAX TYPE (S): All SUBJECT: FORMDROPDOWN(Credit)
PROPOSAL: FORMDROPDOWN(New Law)
This measure creates a nine (9) member task force to study tax credits. The measure also includes provisions of procedures and duties for the task force and directs the task force to produce a final written report for the Speaker, the Governor and the Pro Tempore.
EFFECTIVE DATE: August 21, 2009 (Assuming sine die is May 22, 2009)
REVENUE IMPACT:
Insert dollar amount (plus or minus) of the expected change in state revenues due to this proposed legislation.
FY 09: None
FY 10: None FORMTEXT
ADMINISTRATIVE IMPACT:
Insert the estimated cost or savings to the Tax Commission due to this proposed legislation.
FY 10: None
lrh
DATE DIVISION DIRECTOR
DATE XXXXX XXXXXX, ECONOMIST
DATE FOR THE COMMISSION
</str>
</arr>
Kevin Miller
Web Services
Re: Using Solr Cell to index a Word Document
Posted by Mark Miller <ma...@gmail.com>.
Solr defers to Tika for this. Tika uses getParagraphText() from the POI
WordExtractor class:
http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html
POI appears to be in limbo and I'm not seeing anything in WordExtractor
that looks like it might help you.
I'd inquire at the Tika project though.
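In the meantime, it may help to know which indexed documents are affected. The following is only a minimal sketch, not anything Solr or Tika provides: it scans the stored text for the field-code names seen in the output above, assuming they appear literally as FORMDROPDOWN and FORMTEXT. The function name find_form_markers is hypothetical.

```python
import re

# Field-code names that appear in place of the form values in the
# extracted text (assumption: they show up literally, in upper case).
FIELD_MARKERS = re.compile(r"\b(FORMDROPDOWN|FORMTEXT)\b")


def find_form_markers(text):
    """Return (line_number, marker, line) for each form-field marker found."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in FIELD_MARKERS.finditer(line):
            hits.append((line_number, match.group(1), line.strip()))
    return hits


if __name__ == "__main__":
    # Sample lines taken from the stored text field shown in the question.
    sample = (
        "BILL NUMBER: HB 1097\n"
        "STATUS AND DATE OF BILL: FORMDROPDOWN 05/20/2009\n"
        "AUTHORS: House FORMTEXT Dank Senate Brogdon\n"
    )
    for line_number, marker, line in find_form_markers(sample):
        print(line_number, marker, line)
```

This only locates the placeholders; it cannot recover the selected values, since those are not present in the extracted text.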
--
- Mark
http://www.lucidimagination.com