Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2012/08/15 14:07:04 UTC

Cached page (like google) with hits highlighted

Hello Everyone!

I am up and running with my Nutch 1.4 / Solr 3.3 architecture and am looking
to add a few new features.

My users want the ability to view their Solr results as XHTML with the hits
highlighted in the document.  So a Word document/PDF would be converted to an
XHTML version first.

I see that Tika can produce XHTML, but I don't see a way to integrate that
with the parsing that Nutch does in the parse-tika plugin.  It seems like the
results sent to Solr for the "content" field are just the text of the
document.

Is there a way to do this?
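
For the highlighting half of the question, here is a minimal sketch that assumes an XHTML version of the document is already available as a DOM tree. It wraps every case-insensitive occurrence of a search term in an em element. The class name HitHighlighter, the helper highlightXml and the "hit" CSS class are made up for illustration; nothing here comes from Nutch, Solr or Tika, only the JDK's DOM and JAXP APIs are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class HitHighlighter {

    /** Wraps each case-insensitive occurrence of term in <em class="hit">. */
    static void highlight(Node root, String term) {
        if (term.isEmpty()) {
            return;
        }
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        // Collect text nodes first so the tree is not modified while it is walked.
        List<Text> texts = new ArrayList<>();
        collectText(root, texts);
        for (Text text : texts) {
            String value = text.getNodeValue();
            Matcher m = pattern.matcher(value);
            if (!m.find()) {
                continue;
            }
            Document doc = text.getOwnerDocument();
            Node parent = text.getParentNode();
            int last = 0;
            do {
                if (m.start() > last) {
                    parent.insertBefore(doc.createTextNode(value.substring(last, m.start())), text);
                }
                Element em = doc.createElement("em");
                em.setAttribute("class", "hit");
                em.setTextContent(m.group());
                parent.insertBefore(em, text);
                last = m.end();
            } while (m.find());
            if (last < value.length()) {
                parent.insertBefore(doc.createTextNode(value.substring(last)), text);
            }
            parent.removeChild(text); // the original text node is now fully replaced
        }
    }

    private static void collectText(Node node, List<Text> out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                out.add((Text) c);
            } else {
                collectText(c, out);
            }
        }
    }

    /** Hypothetical convenience wrapper: parse a string, highlight, serialize back. */
    static String highlightXml(String xml, String term) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            highlight(doc, term);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling HitHighlighter.highlightXml("<p>Nutch and nutch</p>", "nutch") should wrap both spellings. Matching is plain substring matching, so stemming or phrase queries from the search side are not reflected here.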

Thanks!



--
View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Cached page (like google) with hits highlighted

Posted by webdev1977 <we...@gmail.com>.

tika-app (the GUI) gives me back the XHTML just fine.  I'm not sure what is
going on here; maybe it is not stored properly in the DocumentFragment upon
parsing?



--
View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001449.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Cached page (like google) with hits highlighted

Posted by Markus Jelsma <ma...@openindex.io>.

No, it doesn't come with Nutch. You can download Tika 1.2 or build trunk from source.

Code looks fine. But you might want to check the headings plugin, it uses the NodeWalker to make things easier:
http://svn.apache.org/viewvc/nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java?revision=1349233&view=markup
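
A minimal sketch of the kind of walk the headings plugin performs, written against the plain JDK DOM API so that it runs without Nutch on the classpath. It does not use Nutch's NodeWalker; the stack-based loop is only a stand-in for the idea of visiting every node instead of just the top level. The names HeadingCollector, headings and headingsOf are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class HeadingCollector {

    /** Pre-order, stack-based walk; returns the text of every h1..h6 element under root. */
    static List<String> headings(Node root) {
        List<String> found = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && node.getNodeName().toLowerCase(Locale.ROOT).matches("h[1-6]")) {
                found.add(node.getTextContent().trim());
                continue; // heading text captured, no need to descend into it
            }
            // Push children in reverse so they are popped in document order.
            NodeList children = node.getChildNodes();
            for (int i = children.getLength() - 1; i >= 0; i--) {
                stack.push(children.item(i));
            }
        }
        return found;
    }

    /** Hypothetical convenience wrapper for trying the walk on a string. */
    static List<String> headingsOf(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return headings(root);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In a parse filter, the DocumentFragment passed to filter() would take the place of root.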

 
 
-----Original message-----
> From:webdev1977 <we...@gmail.com>
> Sent: Wed 15-Aug-2012 19:00
> To: user@nutch.apache.org
> Subject: RE: Cached page (like google) with hits highlighted
> 
> Does the 1.4 version of nutch have tika-app?  Also..maybe I am not using the
> DocumentFragment object properly?  Below is a summary version of my code:
> 
> import org.w3c.dom.Node;
> import org.w3c.dom.NodeList;
> 
> public ParseResult filter(Content content, ParseResult parseResult,
>            HTMLMetaTags metaTags, DocumentFragment doc) {
> 
>    // Prints only the direct children of the fragment.
>    NodeList children = doc.getChildNodes();
>    for (int x = 0; x < children.getLength(); x++) {
>      Node child = children.item(x);
>      System.out.println("xml node name: " + child.getNodeName());
>      System.out.println("xml node value: " + child.getNodeValue());
>      System.out.println("xml text content: " + child.getTextContent());
>    }
> 
>    return parseResult;
> }
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001440.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

RE: Cached page (like google) with hits highlighted

Posted by webdev1977 <we...@gmail.com>.

Does the 1.4 version of nutch have tika-app?  Also..maybe I am not using the
DocumentFragment object properly?  Below is a summary version of my code:

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public ParseResult filter(Content content, ParseResult parseResult,
           HTMLMetaTags metaTags, DocumentFragment doc) {

   // Prints only the direct children of the fragment.
   NodeList children = doc.getChildNodes();
   for (int x = 0; x < children.getLength(); x++) {
     Node child = children.item(x);
     System.out.println("xml node name: " + child.getNodeName());
     System.out.println("xml node value: " + child.getNodeValue());
     System.out.println("xml text content: " + child.getTextContent());
   }

   return parseResult;
}
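
One thing worth ruling out: the loop above looks only at the direct children of the fragment, and getTextContent() on the top-level element returns the concatenated text of everything beneath it. That combination would look like "one html node with all the text" even if nested markup were present. The sketch below descends recursively and prints one line per element or non-blank text node, indented by depth. It uses only the JDK DOM API; FragmentDump and dumpXml are made-up names, and whether the fragment Nutch hands to the filter really contains nested elements is something this does not settle.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class FragmentDump {

    /** Appends one line per element or non-blank text node, indented by depth. */
    static void dump(Node node, int depth, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                indent(depth, out);
                out.append('<').append(child.getNodeName()).append(">\n");
                dump(child, depth + 1, out); // descend instead of stopping at the top level
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    indent(depth, out);
                    out.append(text).append('\n');
                }
            }
        }
    }

    private static void indent(int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    /** Hypothetical helper: builds a DocumentFragment from a string and dumps it. */
    static String dumpXml(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            DocumentFragment frag = d.createDocumentFragment();
            frag.appendChild(d.getDocumentElement()); // moves the root element into the fragment
            StringBuilder out = new StringBuilder();
            dump(frag, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(dumpXml("<html><body><h1>Title</h1><p>Some text</p></body></html>"));
    }
}
```

If this kind of dump still shows a single element with only text inside it, the markup is being lost before the filter is called rather than in the filter itself.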



--
View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001440.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Cached page (like google) with hits highlighted

Posted by Markus Jelsma <ma...@openindex.io>.

Hmm, I would also expect PDF and office documents to have at least paragraph and heading tags in Tika's XHTML representation. You can check whether that's true with java -jar tika-app-<version>.jar -x <URL>. I think the option was -x; use --help to see all options.
 
 
-----Original message-----
> From:webdev1977 <we...@gmail.com>
> Sent: Wed 15-Aug-2012 18:22
> To: user@nutch.apache.org
> Subject: RE: Cached page (like google) with hits highlighted
> 
> Thanks Markus!
> 
> So after some testing and walking the DocumentFragment, I see that all I get
> is one node:
> <html>
> some content here and here
> </html>
> 
> I guess I expected to see more from a PDF/word document (like H1 tags, etc)
> that would help make the xhtml format more readable.
> 
> Am I missing something? Do I have to do anything special to the
> DocumentFragment to format it?
> 
> Thanks!
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001434.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

RE: Cached page (like google) with hits highlighted

Posted by webdev1977 <we...@gmail.com>.

PDF2XHTML is already being loaded by the PDF parser.  Something is not adding
it to the DocumentFragment, however, and I can't seem to find out where.

Any other ideas?  I don't want to run Tika separately during the parse step
to get the XHTML (seems silly), but I will if I absolutely have to.
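
If running Tika separately does turn out to be necessary, the capturing side could look roughly like the sketch below. It builds a SAX ContentHandler from the JDK's TransformerHandler that serializes every event it receives into a string. The assumption, not exercised here, is that this handler would be passed to Tika's Parser.parse(InputStream, ContentHandler, Metadata, ParseContext); in the sketch a few hand-written SAX events stand in for the parser so the code runs without Tika. XhtmlCapture, xhtmlHandler, emitSample and sample are made-up names.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class XhtmlCapture {

    /** Returns a SAX handler that serializes every event it receives into the given writer. */
    static TransformerHandler xhtmlHandler(StringWriter out) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            handler.setResult(new StreamResult(out));
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stand-in for a parser (assumption): emits a few hand-written SAX events. */
    static void emitSample(ContentHandler h) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        h.startDocument();
        h.startElement("", "html", "html", none);
        h.startElement("", "h1", "h1", none);
        char[] title = "Title".toCharArray();
        h.characters(title, 0, title.length);
        h.endElement("", "h1", "h1");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    /** Runs the stand-in parser against the capturing handler and returns the markup. */
    static String sample() {
        StringWriter out = new StringWriter();
        try {
            emitSample(xhtmlHandler(out));
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

The resulting string could then be stored in its own field next to the plain-text content, which is a design choice for the indexing side and not something the parse step dictates.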



--
View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4003801.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Cached page (like google) with hits highlighted

Posted by Markus Jelsma <ma...@openindex.io>.

Tika has a PDF2XHTML.java in the PDF parser, but I think the standard PDFParser.java is what gets executed for the MIME type. In TikaParser.java we ask TikaConfig for the parser of a given MIME type. To quickly test whether it works like that, you can try hacking TikaParser to load PDF2XHTML instead of getting the parser via TikaConfig.

You can also use CompositeParser.setParsers(Map<MediaType, Parser> parsers) in Tika, reached via TikaConfig.getParser(), to map the PDF2XHTML parser to the PDF MIME type. From reading the code, I think that should work.
 
 
-----Original message-----
> From:webdev1977 <we...@gmail.com>
> Sent: Thu 16-Aug-2012 12:51
> To: user@nutch.apache.org
> Subject: Re: Cached page (like google) with hits highlighted
> 
> Thanks Julien and Markus for all your help.
> 
> I poked around the code some more yesterday and it seems like the markup is
> just not getting in the DocumentFragment.  All I get (for word and pdf) is
> just one html tag with the text of the document in between.  Maybe something
> is not using parse-tika properly (somewhere in the nutch implementation of
> the parser?)
> 
> The same two documents give me tons of markup using the tika-app gui.  The
> versions are the same.  I am out of ideas, anyone, anyone? 
> 
> Thanks!
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001593.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

Re: Cached page (like google) with hits highlighted

Posted by webdev1977 <we...@gmail.com>.

Thanks Julien and Markus for all your help.

I poked around the code some more yesterday and it seems like the markup is
just not getting into the DocumentFragment.  All I get (for Word and PDF) is
just one html tag with the text of the document in between.  Maybe something
is not using parse-tika properly (somewhere in the Nutch implementation of
the parser)?

The same two documents give me tons of markup using the tika-app GUI.  The
versions are the same.  I am out of ideas, anyone, anyone?

Thanks!



--
View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001593.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Cached page (like google) with hits highlighted

Posted by Julien Nioche <li...@gmail.com>.

Sorry I had missed your previous comments.

On 16 August 2012 09:32, Julien Nioche <li...@gmail.com> wrote:

> You need to use parse-tika, however the underlying parser for pdf does not
> currently generate much markup, the Word one does IIRC.
>
> Why don't you try Tika standalone with its GUI to explore what is given
> per mime-type?
>
> Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Cached page (like google) with hits highlighted

Posted by Julien Nioche <li...@gmail.com>.

You need to use parse-tika, however the underlying parser for pdf does not
currently generate much markup, the Word one does IIRC.

Why don't you try Tika standalone with its GUI to explore what is given per
mime-type?

Julien


On 15 August 2012 17:19, webdev1977 <we...@gmail.com> wrote:

> Thanks Markus!
>
> So after some testing and walking the DocumentFragment, I see that all I
> get
> is one node:
> <html>
> some content here and here
> </html>
>
> I guess I expected to see more from a PDF/word document (like H1 tags, etc)
> that would help make the xhtml format more readable.
>
> Am I missing something? Do I have to do anything special to the
> DocumentFragment to format it?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001434.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

RE: Cached page (like google) with hits highlighted

Posted by webdev1977 <we...@gmail.com>.

Thanks Markus!

So after some testing and walking the DocumentFragment, I see that all I get
is one node:
<html>
some content here and here
</html>

I guess I expected to see more from a PDF/word document (like H1 tags, etc)
that would help make the xhtml format more readable.

Am I missing something? Do I have to do anything special to the
DocumentFragment to format it?

Thanks!



--
View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001434.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Cached page (like google) with hits highlighted

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

You can catch the XML in a parse filter by walking over the DocumentFragment that is passed to it. It should contain the proper markup.
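
As a rough illustration of "catching the XML", the sketch below turns the children of a DocumentFragment back into a markup string with the JDK's identity Transformer. Inside a real parse filter the fragment would be the one passed to filter(); here a hypothetical helper (roundTrip) builds one from a string so the code runs on its own. FragmentSerializer and toXhtml are made-up names, and what to do with the resulting string (for example storing it in parse metadata) is left open.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class FragmentSerializer {

    /** Serializes each child of the fragment in turn and concatenates the results. */
    static String toXhtml(Node fragment) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            for (Node child = fragment.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                t.transform(new DOMSource(child), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical helper: builds a fragment from a string, then serializes it again. */
    static String roundTrip(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            DocumentFragment frag = d.createDocumentFragment();
            frag.appendChild(d.getDocumentElement()); // moves the root element into the fragment
            return toXhtml(frag);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The children are serialized one by one rather than handing the fragment itself to DOMSource, which sidesteps any question of how a given Transformer implementation treats fragment nodes.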

Cheers,

 
 
-----Original message-----
> From:webdev1977 <we...@gmail.com>
> Sent: Wed 15-Aug-2012 14:09
> To: user@nutch.apache.org
> Subject: Cached page (like google) with hits highlighted
> 
> Hello Everyone!
> 
> I am up and running with my nutch 1.4 /solr 3.3  architecture and am looking
> to add a few new features.  
> 
> My users want the ability to view their solr results as xhtml with the hits
> highlighted in the document.  So a word document/pdf would become an XHTML
> version first.
> 
> I see that Tika can produce XHTML but I don't see a way to integrate that
> with the parsing that nutch does in the parse-tika plugin.  Seems like the
> results sent to solr for the "content" field are just the text of the
> document.  
> 
> Is there a way to do this?
> 
> Thanks!
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>