Posted to java-user@lucene.apache.org by Oystein Reigem <oy...@aksis.uib.no> on 2007/03/13 15:59:20 UTC

Highlighting of original documents

Hi,

I want to implement full-text search on a collection of documents. I am 
trying to figure out which system is the better choice: eXist, Lucene, or 
some combination of the two. I have some knowledge of eXist, but don't 
know much about Lucene.

I'd like to display the result of a search as a list of 
excerpts/snippets with highlighted search words. When the user clicks an 
item in the result list to bring up the document in full, I'd like to 
have search words highlighted in the full document as well.

The document collection is very diverse. There are pure text documents 
and well-formed XML and HTML documents, but unfortunately also HTML 
documents that are not quite well-formed, Word documents and PDFs. Many 
of the formats go beyond what eXist and Lucene can handle, and I realise 
some conversion, or text extraction, is necessary. As far as I know 
Lucene can only index and search pure text (and fields), so the 
documents must be run through appropriate filters extracting the text 
(and field values). Afterwards fulltext search is possible.

But what about highlighting? I know it is possible to get highlighting 
in the pure text version, but what about the original document, when the 
original document is something other than pure text, e.g., a simple XML 
document? Is it at all possible to get the search words tagged in the 
XML document?

I assume not, but ask anyway. :-)

Cheers,

- Øystein -


-- 
Øystein Reigem, The department of culture, language and information technology (Aksis), Allegt 27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47 55 58 94 70. E-mail: <oy...@aksis.uib.no>. Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64. Home e-mail: <or...@broadpark.no>. Aksis home page: <www.aksis.uib.no>.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Highlighting of original documents

Posted by Oystein Reigem <oy...@aksis.uib.no>.
Mark Miller wrote:

> Depends on the work you want to do. If you want to highlight a simple 
> XML doc the approach would be to extract all of the text elements and 
> run them through the highlighter and then correctly update them. That 
> would be mostly simple DOM manipulation.

OK.

I guess there will be some details that need special attention. One case 
that springs to mind is the occurrence of words that in the original 
document are broken up by encoding, like "en<hyphen/>coding" or 
"<em>mid</em>range".
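
One way to handle such split words, sketched here under assumptions rather 
than taken from Lucene: concatenate all text nodes in document order, 
remember where each node starts in the concatenated string, search that 
string, and map each match back to (node, offset) pieces. The class and 
method names (SpanningHighlighter, locate, wrap) and the <hit> element are 
made up for illustration, and a plain case-insensitive substring search 
stands in for a real analyzer and highlighter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class SpanningHighlighter {

    /** The part of one match that lies inside a single text node (hypothetical helper type). */
    public static final class Segment {
        public final Text node;
        public final int start; // offset within the node, inclusive
        public final int end;   // offset within the node, exclusive
        Segment(Text node, int start, int end) { this.node = node; this.start = start; this.end = end; }
    }

    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Searches the concatenated text of the document and maps each match back to text nodes. */
    public static List<Segment> locate(Document doc, String term) {
        List<Text> nodes = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        StringBuilder all = new StringBuilder();
        collect(doc.getDocumentElement(), nodes, starts, all);

        String haystack = all.toString();
        int n = term.length();
        List<Segment> segments = new ArrayList<>();
        for (int from = 0; n > 0 && from + n <= haystack.length(); from++) {
            // Assumption: substring match, ignoring case; no tokenizing or stemming.
            if (!haystack.regionMatches(true, from, term, 0, n)) continue;
            int to = from + n;
            for (int i = 0; i < nodes.size(); i++) {
                int nodeStart = starts.get(i);
                int nodeEnd = nodeStart + nodes.get(i).getLength();
                int s = Math.max(from, nodeStart);
                int e = Math.min(to, nodeEnd);
                if (s < e) segments.add(new Segment(nodes.get(i), s - nodeStart, e - nodeStart));
            }
            from = to - 1; // continue after this match
        }
        return segments;
    }

    /** Wraps every segment in a <hit> element, splitting text nodes where needed. */
    public static void wrap(Document doc, List<Segment> segments) {
        // Work backwards so that splitting a node never shifts offsets that are still needed.
        for (int i = segments.size() - 1; i >= 0; i--) {
            Segment seg = segments.get(i);
            Text middle = seg.start == 0 ? seg.node : seg.node.splitText(seg.start);
            int length = seg.end - seg.start;
            if (length < middle.getLength()) middle.splitText(length);
            Element hit = doc.createElement("hit");
            middle.getParentNode().replaceChild(hit, middle);
            hit.appendChild(middle);
        }
    }

    private static void collect(Node node, List<Text> nodes, List<Integer> starts, StringBuilder all) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Text) { // plain text and CDATA sections
                nodes.add((Text) c);
                starts.add(all.length());
                all.append(((Text) c).getData());
            } else {
                collect(c, nodes, starts, all);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<p>The <em>mid</em>range model uses en<hyphen/>coding.</p>");
        wrap(doc, locate(doc, "midrange"));
        System.out.println(doc.getElementsByTagName("hit").getLength() + " highlighted pieces");
    }
}
```

A match that crosses an element boundary ends up as several <hit> elements, 
one per text node, so the document stays well-formed. Note that gluing text 
nodes together blindly can also join words that only look adjacent, for 
example across two paragraphs, so a real version would probably insert a 
separator at block-level boundaries.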

> The same approach should work with any format but the difficulty in 
> modifying the text may increase. If you can pull the text out 
> appropriately it would seem you could put it back in though, or modify 
> it in place as you might with the DOM.

Do you know if tools (classes) for "appropriate" extraction from "my" 
file formats already exist in Lucene? I.e., something that not only 
extracts the text, but also keeps track of its position in the original?

I saw POI <http://jakarta.apache.org/poi/> mentioned in a posting on 
this list. Perhaps a solution for Word documents can be based on POI.

- Øystein -





Re: Highlighting of original documents

Posted by Mark Miller <ma...@gmail.com>.
Depends on the work you want to do. If you want to highlight a simple 
XML doc the approach would be to extract all of the text elements and 
run them through the highlighter and then correctly update them. That 
would be mostly simple DOM manipulation. The same approach should work 
with any format but the difficulty in modifying the text may increase. 
If you can pull the text out appropriately it would seem you could put 
it back in though, or modify it in place as you might with the DOM.

- Mark
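
For the simple case described above, a minimal sketch might look like the 
following. It is an illustration under assumptions, not Lucene's API: a 
whole-word, case-insensitive regular expression stands in for the Lucene 
highlighter, and the class name DomHighlighter and the <hit> element are 
invented for the example. Only the JDK's built-in DOM and transformer 
classes are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class DomHighlighter {

    /** Returns the XML with every whole-word match of term wrapped in a <hit> element. */
    public static String highlight(String xml, String term) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Assumption: a regex replaces the real highlighter (no analyzer, no stemming).
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);

        // Collect the text nodes first so the tree is not modified while it is walked.
        List<Text> textNodes = new ArrayList<>();
        collectTextNodes(doc.getDocumentElement(), textNodes);
        for (Text text : textNodes) {
            wrapMatches(doc, text, pattern);
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    private static void collectTextNodes(Node node, List<Text> out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                out.add((Text) c);
            } else {
                collectTextNodes(c, out);
            }
        }
    }

    /** Replaces one text node by a sequence of plain text and <hit> elements. */
    private static void wrapMatches(Document doc, Text text, Pattern pattern) {
        String value = text.getData();
        Node parent = text.getParentNode();
        Matcher m = pattern.matcher(value);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                parent.insertBefore(doc.createTextNode(value.substring(last, m.start())), text);
            }
            Element hit = doc.createElement("hit");
            hit.setTextContent(m.group());
            parent.insertBefore(hit, text);
            last = m.end();
        }
        if (last == 0) {
            return; // no match, leave the node untouched
        }
        if (last == value.length()) {
            parent.removeChild(text);
        } else {
            text.setData(value.substring(last)); // keep only the unmatched tail
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(highlight("<doc><p>Lucene can index text. Try lucene!</p></doc>", "lucene"));
    }
}
```

Because each text node is handled on its own, this version cannot find a 
word that is split across elements; it only covers the simple case.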

