Posted to solr-user@lucene.apache.org by Alex Cougarman <ac...@bwc.org> on 2012/08/30 15:25:06 UTC

Extract footer/header text out of Word docs

Hi. Is it possible to specifically extract footer/header and body text out of a Word document using Solr? In other words, we'd like to index/store those items in different Solr fields.

Also, is it possible to search on specific styles within a Word document? Can these attributes be indexed? Thanks.

Sincerely,
Alex

Re: Extract footer/header text out of Word docs

Posted by Lance Norskog <go...@gmail.com>.

Tika generates a block-structured stream of events for the document.
It would be cool to have an alternate Tika processor in the DIH that
generates this stream as XML. You could then use the XPath tools to
grab whatever you want.
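
A rough sketch of the XPath idea, assuming the Tika events have already been
serialized to XHTML by some other step. It uses only Python's standard library,
and the sample markup and the names select_parts and text_of are made up for
illustration; nothing here is an existing DIH processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed XHTML of the kind Tika emits for a Word document.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="header"><p>Quarterly report</p></div>
<p>First body paragraph.</p>
<p>Second body paragraph.</p>
<div class="footer"><p>Page footer</p></div>
</body>
</html>"""

# Prefix used in the path expressions for the XHTML namespace.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def text_of(elements):
    """Return the text of each element with runs of whitespace collapsed."""
    return [" ".join("".join(e.itertext()).split()) for e in elements]


def select_parts(xhtml):
    """Pick header, footer, and top-level body paragraphs with XPath-style paths."""
    root = ET.fromstring(xhtml)
    return {
        "header": text_of(root.findall(".//x:div[@class='header']", NS)),
        "footer": text_of(root.findall(".//x:div[@class='footer']", NS)),
        # Paragraphs that are direct children of <body>, i.e. not inside a div.
        "body": text_of(root.findall("./x:body/x:p", NS)),
    }


print(select_parts(XHTML))
```

ElementTree only supports a subset of XPath, so a full XPath engine may be
needed for anything more elaborate than class and position filters.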

On Fri, Aug 31, 2012 at 4:25 AM, Erick Erickson <er...@gmail.com> wrote:
> You can also move the Tika processing off Solr to the client and perhaps have
> more control there. I haven't tried this particular thing, so....
>
> see: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
>
> Best
> Erick
>



-- 
Lance Norskog
goksron@gmail.com

Re: Extract footer/header text out of Word docs

Posted by Erick Erickson <er...@gmail.com>.

You can also move the Tika processing off Solr to the client and perhaps have
more control there. I haven't tried this particular thing, so....

see: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
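
A minimal sketch of what the client side might look like once the header, body,
and footer text have been extracted there. The linked article uses SolrJ from
Java; this uses Python's standard library instead, purely to show the shape.
The URL, the core name, and the field names header_txt, body_txt, and
footer_txt are assumptions that would have to match your own installation and
schema, and it assumes a Solr version whose /update handler accepts a JSON
array of documents.

```python
import json
import urllib.request

# Assumed values: adjust the host, core name, and field names to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_update_request(doc_id, header, body, footer, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request with one field per part."""
    doc = {
        "id": doc_id,
        "header_txt": header,
        "body_txt": body,
        "footer_txt": footer,
    }
    payload = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_update_request("doc-1", "Quarterly report", "Body text.", "Page footer")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually index, pass req to urllib.request.urlopen() against a running Solr.
```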

Best
Erick

On Thu, Aug 30, 2012 at 9:35 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> Tika can do this but Solr doesn't use the BoilerpipeContentHandler. Perhaps it should be made configurable which content handler Solr uses and, in the case of the BoilerpipeContentHandler, which extractor implementation to use.
>

RE: Extract footer/header text out of Word docs

Posted by Markus Jelsma <ma...@openindex.io>.

Tika can do this but Solr doesn't use the BoilerpipeContentHandler. Perhaps it should be made configurable which content handler Solr uses and, in the case of the BoilerpipeContentHandler, which extractor implementation to use.
 
-----Original message-----
> From:Otis Gospodnetic <ot...@yahoo.com>
> Sent: Thu 30-Aug-2012 15:30
> To: solr-user@lucene.apache.org
> Subject: Re: Extract footer/header text out of Word docs
> 
> Hi Alex,
> 
> I think you may get better help on the Tika mailing list - Solr uses Tika to parse rich text docs and extract text from them.  I don't know if Tika can figure out what's from a header and a footer...
> 
> Otis 
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 
> 
> 
> 

Re: FW: Extract footer/header text out of Word docs

Posted by Nick Burch <ap...@gagravarr.org>.

On Thu, 30 Aug 2012, Alex Cougarman wrote:
>> Hi. Is it possible to specifically extract footer/header and body text 
>> out of a Word document using Solr? In other words, we'd like to 
>> index/store those items in different Solr fields.

As long as they have a suitable style applied, yes, Tika will be able to
tell you.

If you run Tika against this sample document from POI:
https://svn.apache.org/repos/asf/poi/trunk/test-data/document/HeaderFooterUnicode.doc

You can see the headers and footers in the xhtml:


<body><div class="header"><p>This is a simple header, with a € euro symbol 
in it.</p>
</div>
<p>This is a fairly simple word document, over two pages, with headers and 
footers.</p>

(snip)

<p>This is page two. <i>Les Précieuses ridicules. </i>The end.
</p>
<div class="footer"><p>The footer, with Molière, has Unicode in it.
</p>
</div>
</body></html>


Just filter on the footer and header classes on the surrounding DIVs.
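
One way that filtering might look, as a sketch only: it walks the XHTML as a
stream of tag and text events and routes the text by the class of the enclosing
DIV. It is written against Python's standard library rather than Tika's own
handler API, the sample is the excerpt above with the snipped part left out,
and PartSplitter and split_parts are made-up names.

```python
from html.parser import HTMLParser


class PartSplitter(HTMLParser):
    """Route text into 'header', 'footer', or 'body' based on the enclosing div."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = {"header": [], "body": [], "footer": []}
        self._stack = []  # one entry per open <div>: "header", "footer", or None
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag == "div":
            cls = dict(attrs).get("class")
            self._stack.append(cls if cls in ("header", "footer") else None)

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not self._in_body or not data.strip():
            return
        # The innermost header/footer div wins; otherwise the text is body text.
        target = next((c for c in reversed(self._stack) if c), "body")
        self.parts[target].append(" ".join(data.split()))


def split_parts(xhtml):
    """Return one whitespace-normalized string per part of the document."""
    parser = PartSplitter()
    parser.feed(xhtml)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.parts.items()}


SAMPLE = """<body><div class="header"><p>This is a simple header, with a € euro symbol
in it.</p>
</div>
<p>This is a fairly simple word document, over two pages, with headers and
footers.</p>
<p>This is page two. <i>Les Précieuses ridicules. </i>The end.
</p>
<div class="footer"><p>The footer, with Molière, has Unicode in it.
</p>
</div>
</body></html>"""

print(split_parts(SAMPLE))
```

Each of the three resulting strings could then go into its own Solr field.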

Nick

FW: Extract footer/header text out of Word docs

Posted by Alex Cougarman <ac...@bwc.org>.

Dear friends. Sorry, I posted to Solr. Any ideas on this question?

Sincerely,
Alex 


-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: 30 August 2012 4:28 PM
To: solr-user@lucene.apache.org
Subject: Re: Extract footer/header text out of Word docs

Hi Alex,

I think you may get better help on the Tika mailing list - Solr uses Tika to parse rich text docs and extract text from them.  I don't know if Tika can figure out what's from a header and a footer...

Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 




Re: Extract footer/header text out of Word docs

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Alex,

I think you may get better help on the Tika mailing list - Solr uses Tika to parse rich text docs and extract text from them.  I don't know if Tika can figure out what's from a header and a footer...

Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 



----- Original Message -----
> From: Alex Cougarman <ac...@bwc.org>
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Cc: 
> Sent: Thursday, August 30, 2012 9:25 AM
> Subject: Extract footer/header text out of Word docs
> 
> Hi. Is it possible to specifically extract footer/header and body text out of a 
> Word document using Solr? In other words, we'd like to index/store those 
> items in different Solr fields.
> 
> Also, is it possible to search on specific styles within a Word document? Can 
> these attributes be indexed? Thanks.
> 
> Sincerely,
> Alex
>