Posted to solr-user@lucene.apache.org by Alex Cougarman <ac...@bwc.org> on 2012/08/30 15:25:06 UTC
Extract footer/header text out of Word docs
Hi. Is it possible to specifically extract footer/header and body text out of a Word document using Solr? In other words, we'd like to index/store those items in different Solr fields.
Also, is it possible to search on specific styles within a Word document? Can these attributes be indexed? Thanks.
Sincerely,
Alex
Re: Extract footer/header text out of Word docs
Posted by Lance Norskog <go...@gmail.com>.
Tika generates a block-structured stream of events for the document.
It would be cool to have an alternate Tika processor in the DIH that
generates this stream as XML. You could then use the XPath tools to
grab whatever you want.
On Fri, Aug 31, 2012 at 4:25 AM, Erick Erickson <er...@gmail.com> wrote:
> You can also move the Tika processing off Solr to the client and perhaps have
> more control there. I haven't tried this particular thing, so....
>
> see: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
>
> Best
> Erick
--
Lance Norskog
goksron@gmail.com
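A rough sketch of the XPath idea above, in Python rather than inside the DIH. It assumes the Tika event stream has already been serialized as XHTML, that the elements live in the usual XHTML namespace, and that header and footer regions are marked with class attributes on a div. The sample document, the `select_text` helper, and the paths are illustrative only, not something Solr or Tika ships.

```python
import xml.etree.ElementTree as ET

# Assumption: the serialized output uses the standard XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical, hand-written stand-in for Tika's XHTML output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="header"><p>Header text</p></div>
<p>Body text</p>
<div class="footer"><p>Footer text</p></div>
</body></html>"""


def select_text(xhtml, path):
    """Return the whitespace-normalized text of every element matching path."""
    root = ET.fromstring(xhtml)
    return [
        " ".join("".join(el.itertext()).split())
        for el in root.findall(path, XHTML_NS)
    ]


header = select_text(SAMPLE, ".//x:div[@class='header']")
footer = select_text(SAMPLE, ".//x:div[@class='footer']")
body = select_text(SAMPLE, "./x:body/x:p")
```

ElementTree only supports a subset of XPath, which is enough for attribute filters like these; a full XPath engine would be needed for anything more elaborate.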
Re: Extract footer/header text out of Word docs
Posted by Erick Erickson <er...@gmail.com>.
You can also move the Tika processing off Solr to the client and perhaps have
more control there. I haven't tried this particular thing, so....
see: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
Best
Erick
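A minimal sketch of what the client-side route could look like once the header, body, and footer text have been extracted outside Solr. The field names (`header_txt`, `body_txt`, `footer_txt`) and both helper functions are hypothetical and would have to match your own schema. The sketch only builds a JSON update payload; actually posting it is left out, since the endpoint and the accepted JSON shape depend on the Solr version.

```python
import json


def build_solr_doc(doc_id, header, body, footer):
    """Map the three extracted regions onto separate (hypothetical) Solr fields."""
    return {
        "id": doc_id,
        "header_txt": header,
        "body_txt": body,
        "footer_txt": footer,
    }


def to_update_payload(docs):
    """Serialize documents as a JSON array, keeping non-ASCII text readable."""
    return json.dumps(list(docs), ensure_ascii=False)


payload = to_update_payload(
    [build_solr_doc("doc-1", "Header text", "Body text", "Footer text")]
)
```

Keeping the regions in separate fields is what allows them to be searched or stored independently, which is what the original question asks for.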
On Thu, Aug 30, 2012 at 9:35 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> Tika can do this but Solr doesn't use the BoilerpipeContentHandler. Perhaps it should be made configurable which content handler Solr uses and, in the case of the BoilerpipeContentHandler, which extractor implementation to use.
RE: Extract footer/header text out of Word docs
Posted by Markus Jelsma <ma...@openindex.io>.
Tika can do this but Solr doesn't use the BoilerpipeContentHandler. Perhaps it should be made configurable which content handler Solr uses and, in the case of the BoilerpipeContentHandler, which extractor implementation to use.
-----Original message-----
> From:Otis Gospodnetic <ot...@yahoo.com>
> Sent: Thu 30-Aug-2012 15:30
> To: solr-user@lucene.apache.org
> Subject: Re: Extract footer/header text out of Word docs
>
> Hi Alex,
>
> I think you may get better help on the Tika mailing list - Solr uses Tika to parse rich text docs and extract text from them. I don't know if Tika can figure out what's from a header and a footer...
>
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
Re: FW: Extract footer/header text out of Word docs
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 30 Aug 2012, Alex Cougarman wrote:
>> Hi. Is it possible to specifically extract footer/header and body text
>> out of a Word document using Solr? In other words, we'd like to
>> index/store those items in different Solr fields.
As long as they have a suitable style applied, yes, Tika will be able to
tell you.
If you run Tika against this sample document from POI:
https://svn.apache.org/repos/asf/poi/trunk/test-data/document/HeaderFooterUnicode.doc
You can see the headers and footers in the xhtml:
<body><div class="header"><p>This is a simple header, with a € euro symbol
in it.</p>
</div>
<p>This is a fairly simple word document, over two pages, with headers and
footers.</p>
(snip)
<p>This is page two. <i>Les Précieuses ridicules. </i>The end.
</p>
<div class="footer"><p>The footer, with Molière, has Unicode in it.
</p>
</div>
</body></html>
Just filter on the footer and header classes on the surrounding DIVs.
Nick
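The filtering step can be sketched in Python with only the standard library. This is an illustration under assumptions, not Tika's or Solr's own API: the `HeaderFooterSplitter` class and `split_regions` function are made-up names, and the sample is the XHTML quoted above with the snipped part left out. Text inside a div with class "header" or "footer" goes to that region, and everything else is treated as body.

```python
from html.parser import HTMLParser

# The XHTML from the message above, minus the snipped section.
SAMPLE = """<body><div class="header"><p>This is a simple header, with a € euro symbol
in it.</p>
</div>
<p>This is a fairly simple word document, over two pages, with headers and
footers.</p>
<p>This is page two. <i>Les Précieuses ridicules. </i>The end.
</p>
<div class="footer"><p>The footer, with Molière, has Unicode in it.
</p>
</div>
</body></html>"""


class HeaderFooterSplitter(HTMLParser):
    """Sort text into header, footer, or body by the enclosing div class."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = {"header": [], "footer": [], "body": []}
        self._divs = []  # class of each currently open <div>, or None

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._divs.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "div" and self._divs:
            self._divs.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse line breaks and runs of spaces
        if not text:
            return
        # The innermost enclosing header/footer div decides; otherwise body.
        region = next(
            (c for c in reversed(self._divs) if c in ("header", "footer")),
            "body",
        )
        self.parts[region].append(text)


def split_regions(xhtml):
    """Return a dict with one whitespace-normalized string per region."""
    parser = HeaderFooterSplitter()
    parser.feed(xhtml)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.parts.items()}


regions = split_regions(SAMPLE)
```

Each of the three strings in `regions` could then be sent to its own Solr field. Note that any text outside the header and footer divs, including a title in a head element if one is present, ends up in the body bucket with this simple approach.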
FW: Extract footer/header text out of Word docs
Posted by Alex Cougarman <ac...@bwc.org>.
Dear friends. Sorry, I posted to Solr. Any ideas on this question?
Sincerely,
Alex
-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: 30 August 2012 4:28 PM
To: solr-user@lucene.apache.org
Subject: Re: Extract footer/header text out of Word docs
Hi Alex,
I think you may get better help on the Tika mailing list - Solr uses Tika to parse rich text docs and extract text from them. I don't know if Tika can figure out what's from a header and a footer...
Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
Re: Extract footer/header text out of Word docs
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Alex,
I think you may get better help on the Tika mailing list - Solr uses Tika to parse rich text docs and extract text from them. I don't know if Tika can figure out what's from a header and a footer...
Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm