Posted to java-user@lucene.apache.org by Mag Gam <ma...@gmail.com> on 2006/09/08 00:37:38 UTC

Highlighter Example

Hey

Does anyone have a search result highlighter example?

I have various documents (PDF, DOC, TXT, PPT), and I would like to show a
highlight, similar to how Google does it...

tia

Re: Highlighter Example

Posted by Shane Perry <sh...@lingotek.com>.
Not sure if this is of interest, but there is an open source
project called File2XLIFF4j on SourceForge.net
(http://file2xliff4j.sourceforge.net/).  The project converts many
common file formats to XLIFF.  It may be useful for getting a common
format, highlighting, and then recreating the original file with its
formatting.

Erik Hatcher wrote:
> There are test cases in the Highlighter codebase that exercise it and 
> show its use, as well as a few examples of it in the "Lucene in 
> Action" codebase.
>
> These examples output plain text with some prefix and suffix 
> surrounding the highlighted terms.  Highlighting text in a PDF is 
> possible, I'm pretty sure, but I don't think the same would be easily 
> possible with Microsoft document formats.  I'm not sure if you are 
> asking for these document types to be highlighted or just a plain text 
> representation of them, though.
>
>     Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Highlighter Example

Posted by Tom Emerson <tr...@gmail.com>.
Autonomy's KeyView is an alternative to Stellent. It does not cover all of
the file formats that Stellent does, though many of them are probably not
interesting for most applications. When I last looked at it, it did not
handle mail archives, though there was a plan to add it. I found it more
stable than Stellent, and it has a JNI interface that works quite well. It
is still quite expensive, however.

PDFBox works, but we found it to be really, really slow.

YMMV,

     -tree

-- 
Tom Emerson
tremerson@gmail.com
http://www.dreamersrealm.net/~tree

Re: Highlighter Example

Posted by Till Kinstler <ki...@gbv.de>.
Mark Miller wrote:
> Highlighting a PDF document, last time I looked (quite a while ago),
> involves supplying an XML file that describes offsets for highlighting.
> You can specify the file in the URL.

PDFBox (http://www.pdfbox.org/), which is also convenient for parsing
PDFs, can generate those XML files through its class PDFHighlighter
(http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFHighlighter.html).
There is a page describing the various options for highlighting PDFs
with PDFBox: http://www.pdfbox.org/userguide/highlighting.html.
Unfortunately, highlighting through these XML files does not seem to
work in the Acrobat Reader plugin for Linux.

Till

-- 
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinstler@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de


Re: Highlighter Example

Posted by Daniel Noll <da...@nuix.com.au>.
Dejan Nenov wrote:
> Second that - I was a client of Stellent - the libs work great but are
> expensive. To see Stellent in action - get a copy of the free X1 desktop
> search or the X1 server (Lucene based).

I would say that the libs work great but are slow.

One problem is that they don't provide a Java API.  The "Java" API they 
provide is sample code which calls a native executable, not even a JNI 
library.  So you pay the penalty of that native app starting up every 
time you extract a document.

If all you want is the plain text, for many document types it's actually 
fairly fast, and beats having to write code for every document type 
yourself (or locating libraries to do it for you.)  But as soon as you 
want the marked up text, it becomes a completely different story.  We 
benchmarked it to be something like 10 times slower to handle markup 
than handling raw text and metadata.  Most of this extra time was spent 
parsing the XML it outputs, which is often far more verbose than it 
needs to be for the amount of formatting it actually contains.

Daniel
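
The cost of verbose markup described above can be illustrated with a
small sketch. It assumes a made-up XML shape (doc, p and run elements
with bold/italic attributes), not Stellent's actual output format, and
uses the JDK's SAX parser to collect the character data in one
streaming pass without building a tree. The class name MarkupText is
hypothetical, and the sketch does not measure whether streaming would
change the 10x figure quoted above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: pulls the text out of verbose XML markup in a single pass. */
final class MarkupText {

    /** Returns the concatenated character data of the document, ignoring all tags. */
    static String extractText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Called for each run of text between tags; may fire several times per element.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalArgumentException("Could not parse markup", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumed markup shape: far more tag text than content, as described above.
        String xml = "<doc><p>"
                + "<run bold=\"false\" italic=\"false\">Hello </run>"
                + "<run bold=\"true\" italic=\"false\">world</run>"
                + "</p></doc>";
        System.out.println(extractText(xml)); // prints "Hello world"
    }
}
```

If the formatting itself is needed, the same handler could also override
startElement and read the attributes there, at the price of more work
per tag.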


-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902


RE: Highlighter Example

Posted by Dejan Nenov <de...@jollyobject.com>.
Second that - I was a client of Stellent - the libs work great but are
expensive. To see Stellent in action - get a copy of the free X1 desktop
search or the X1 server (Lucene based).
Another alternative is KeyView from Verity - now Autonomy.

-----Original Message-----
From: mark harwood [mailto:markharw00d@yahoo.co.uk] 
Sent: Friday, September 08, 2006 1:27 AM
To: java-user@lucene.apache.org
Subject: Re: Highlighter Example

If you have a budget for this stuff then Stellent provide tools for parsing
multiple document types and also have a viewer that can display documents
with their original formatting, plus your highlights. See
http://www.stellent.com/en/products/outside_in/viewer_tech/index.htm

I don't work for Stellent and haven't used it but I do know this stuff is
hard to do and they are the only ones I'm aware of trying to provide tools
to cover all document types which is why I mention it. If anyone has any
other similar recommendations I would be interested to hear them.



Re: Highlighter Example

Posted by mark harwood <ma...@yahoo.co.uk>.
If you have a budget for this stuff then Stellent provide tools for parsing multiple document types and also have a viewer that can display documents with their original formatting, plus your highlights. See http://www.stellent.com/en/products/outside_in/viewer_tech/index.htm

I don't work for Stellent and haven't used it but I do know this stuff is hard to do and they are the only ones I'm aware of trying to provide tools to cover all document types which is why I mention it. If anyone has any other similar recommendations I would be interested to hear them.


----- Original Message ----
From: Mark Miller <ma...@gmail.com>
To: java-user@lucene.apache.org
Sent: Friday, 8 September, 2006 2:02:47 AM
Subject: Re: Highlighter Example

Highlighting a PDF document, last time I looked (quite a while ago),
involves supplying an XML file that describes offsets for highlighting.
You can specify the file in the URL. You can also do simple highlighting
by passing in a list of words to be highlighted, but this does not even
catch minor differences, like singular to plural.

If someone knows more about using the Lucene highlighter to highlight
PDFs then please speak up. I think I will have to get into this soon.

- Mark



Re: Highlighter Example

Posted by Mark Miller <ma...@gmail.com>.
Highlighting a PDF document, last time I looked (quite a while ago),
involves supplying an XML file that describes offsets for highlighting.
You can specify the file in the URL. You can also do simple highlighting
by passing in a list of words to be highlighted, but this does not even
catch minor differences, like singular to plural.

If someone knows more about using the Lucene highlighter to highlight
PDFs then please speak up. I think I will have to get into this soon.

- Mark
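
A rough sketch of the two points above: computing character offsets for
matches, and why an exact word list misses plurals. It is plain Java
over already-extracted text. The class OffsetFinder, the Span type and
the "drop one trailing s" normalizer are illustrative assumptions, not a
real stemmer and not the XML format the PDF viewer expects. Offsets in
extracted text may also differ from the positions a viewer uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds character offsets of a search term in extracted text. */
final class OffsetFinder {

    /** Start offset and length of one match, counted in characters. */
    static final class Span {
        final int start;
        final int length;

        Span(int start, int length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "start=" + start + ",length=" + length;
        }
    }

    /** Crude normalizer (an assumption, not a real stemmer): lower-case, drop one trailing "s". */
    static String normalize(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    /** Returns a span for every word equal to the term, optionally after normalizing both. */
    static List<Span> find(String text, String term, boolean normalizeWords) {
        String target = normalizeWords ? normalize(term) : term.toLowerCase(Locale.ROOT);
        List<Span> spans = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            String word = normalizeWords ? normalize(m.group())
                                         : m.group().toLowerCase(Locale.ROOT);
            if (word.equals(target)) {
                spans.add(new Span(m.start(), m.end() - m.start()));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "One document here, two documents there.";
        // Exact matching finds only the singular form.
        System.out.println(find(text, "document", false));
        // Normalizing both sides also picks up the plural.
        System.out.println(find(text, "document", true));
    }
}
```

Running the query through the same analysis as the indexed text is what
closes the singular/plural gap; the normalizer here only stands in for
that step.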

Mag Gam wrote:
> Thanks for the quick response, Erik. I will be getting my LIA book back
> very soon, I forgot it at a destination :-(
>
> Let's assume there is a document called "hello.pdf" and it has the
> content "this is hello.pdf. It uses Acrobat"
>
> When I perform a search for "Acrobat", I want hello.pdf to show up,
> along with the snippet 'It uses <highlight>Acrobat</highlight>'
>
> something like that.
>
> tia


Re: Highlighter Example

Posted by Mag Gam <ma...@gmail.com>.
Thanks for the quick response, Erik. I will be getting my LIA book back very
soon, I forgot it at a destination :-(

Let's assume there is a document called "hello.pdf" and it has the content
"this is hello.pdf. It uses Acrobat"

When I perform a search for "Acrobat", I want hello.pdf to show up, along
with the snippet 'It uses <highlight>Acrobat</highlight>'

something like that.

tia
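
The behaviour asked for above can be sketched in plain Java, assuming the
PDF text has already been extracted to a string. This is not the Lucene
Highlighter API. The class name SnippetFinder and the naive sentence
split are assumptions made for illustration. It returns the first
sentence that contains the search term, with the term wrapped in
<highlight> tags.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: picks the first sentence containing a term and marks the term. */
final class SnippetFinder {

    /** Returns the first matching sentence with each whole-word match tagged, if any. */
    static Optional<String> snippet(String text, String term) {
        // \b limits matches to whole words; quote() treats the term literally.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        // Naive sentence split: break after '.', '!' or '?' when whitespace follows,
        // so the dot inside "hello.pdf" does not end a sentence.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            Matcher m = p.matcher(sentence);
            if (m.find()) {
                // replaceAll resets the matcher; $0 is the matched text itself.
                return Optional.of(m.replaceAll("<highlight>$0</highlight>"));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String content = "this is hello.pdf. It uses Acrobat";
        // prints: It uses <highlight>Acrobat</highlight>
        System.out.println(snippet(content, "Acrobat").orElse("(no match)"));
    }
}
```

In a real search application the document name would come from the hit
list and the text from a stored field or a re-extraction of the file.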



On 9/7/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
> There are test cases in the Highlighter codebase that exercise it and
> show its use, as well as a few examples of it in the "Lucene in
> Action" codebase.
>
> These examples output plain text with some prefix and suffix
> surrounding the highlighted terms.  Highlighting text in a PDF is
> possible, I'm pretty sure, but I don't think the same would be easily
> possible with Microsoft document formats.  I'm not sure if you are
> asking for these document types to be highlighted or just a plain
> text representation of them, though.
>
>         Erik

Re: Highlighter Example

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
There are test cases in the Highlighter codebase that exercise it and  
show its use, as well as a few examples of it in the "Lucene in  
Action" codebase.

These examples output plain text with some prefix and suffix  
surrounding the highlighted terms.  Highlighting text in a PDF is  
possible, I'm pretty sure, but I don't think the same would be easily  
possible with Microsoft document formats.  I'm not sure if you are  
asking for these document types to be highlighted or just a plain  
text representation of them, though.

	Erik
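
To make the "prefix and suffix surrounding the highlighted terms" output
concrete, here is a minimal plain-Java sketch. It is a stand-in for the
idea only and does not use the Lucene Highlighter classes. The class
name TermMarker and the <b>/</b> markers are assumptions chosen for
illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a highlighter: wraps each whole-word match in a prefix and suffix. */
final class TermMarker {
    private final Pattern pattern;
    private final String prefix;
    private final String suffix;

    TermMarker(String term, String prefix, String suffix) {
        // \b limits matches to whole words; quote() treats the term literally.
        this.pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        this.prefix = prefix;
        this.suffix = suffix;
    }

    /** Returns the text with every match surrounded by the configured prefix and suffix. */
    String mark(String text) {
        Matcher m = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the unmatched text, then the match wrapped in the markers.
            out.append(text, last, m.start())
               .append(prefix).append(m.group()).append(suffix);
            last = m.end();
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        TermMarker marker = new TermMarker("lucene", "<b>", "</b>");
        // prints: <b>Lucene</b> in Action covers the <b>Lucene</b> highlighter.
        System.out.println(marker.mark("Lucene in Action covers the Lucene highlighter."));
    }
}
```

The original casing of each match is kept, and only the markers are
added, which is the plain-text behaviour described above.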

On Sep 7, 2006, at 6:37 PM, Mag Gam wrote:

> Hey
>
> Does anyone have a search result highlighter example?
>
> I have various documents (PDF, DOC, TXT, PPT), and I would like to show a
> highlight, similar to how Google does it...
>
> tia

