Posted to user@nutch.apache.org by "Avni, Itamar" <It...@verint.com> on 2009/12/16 16:37:55 UTC

Extracting Essence of Page and Indexing only when Changed

Hi all,

It's my first project with Nutch, so be gentle with me :-)

1) I want Nutch (1.0) to index only the essence of the current URL.

I plugged in a new implementation of org.apache.nutch.parse.Parser, which calls Parse.setText with the essence content of the page being processed. This Parse is then set on the returned ParseResult.

(Extracting the essence of the page is done with Neko, using heuristics such as "Node is not script", etc. Not the most amazing thing, but it pretty much does the job)

I rejected the idea of doing it in an implementation of org.apache.nutch.indexer.IndexingFilter, because by the time IndexingFilter.filter is called, the Parse argument it receives already holds the text as a plain string (how would I extract the essence at that point?).

2) When indexing the same URL again, I want to index the page only if its essence has changed.

I need to compare the current essence of the page with the one indexed in a previous run, in order to recognize a change.

I can't do it by checking the segments, because I don't know how the content is stored inside them.

So, I want to plug in a new org.apache.nutch.indexer.IndexingFilter, which will load the previously indexed text as a NutchDocument and compare it to the current essence I'm holding in Parse.getText.

How can I do it? How do I load a NutchDocument from the underlying Lucene repository, given its URL?

Thanks

Itamar Avni

This electronic message may contain proprietary and confidential information of Verint Systems Inc., its affiliates and/or subsidiaries.
The information is intended to be for the use of the individual(s) or
entity(ies) named above. If you are not the intended recipient (or authorized to receive this e-mail for the intended recipient), you may not use, copy, disclose or distribute to anyone this message or any information contained in this message. If you have received this electronic message in error, please notify us by replying to this e-mail.
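
One way to approach the change check in point 2 of the post above, without reloading the full previous text, is to keep a digest of the essence per URL and compare digests between runs. The sketch below is only an illustration under stated assumptions: the class name EssenceChangeCheck, the in-memory map, and the choice of MD5 are hypothetical and are not part of Nutch or of the poster's plugin. A real setup would need to persist the digests somewhere between crawls. Nutch's own org.apache.nutch.crawl package also has Signature implementations (for example MD5Signature) that may be worth a look before rolling your own.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Nutch class: remembers one digest per URL.
class EssenceChangeCheck {
    // Assumption: an in-memory map stands in for whatever persistent store is used.
    private final Map<String, String> previousDigests = new HashMap<>();

    // Returns the MD5 digest of the essence text as a lowercase hex string.
    static String digest(String essence) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] hash = md5.digest(essence.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest not available", e);
        }
    }

    // Records the new digest and reports whether it differs from the previous one.
    // A URL seen for the first time counts as changed.
    boolean hasChanged(String url, String essence) {
        String current = digest(essence);
        String previous = previousDigests.put(url, current);
        return !current.equals(previous);
    }

    public static void main(String[] args) {
        EssenceChangeCheck check = new EssenceChangeCheck();
        String url = "http://example.com/";
        System.out.println(check.hasChanged(url, "first version of the essence"));
        System.out.println(check.hasChanged(url, "first version of the essence"));
        System.out.println(check.hasChanged(url, "second version of the essence"));
    }
}
```

Inside an indexing step, the result of hasChanged could decide whether the document is passed on or skipped. How that maps onto IndexingFilter.filter in Nutch 1.0 is left open here.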

RE: Extracting Essence of Page and Indexing only when Changed

Posted by BELLINI ADAM <mb...@msn.com>.

Yes, I guess you could.

Look in the bin folder of your Nutch installation; you will find the nutch script there. Inside it you can see the Java class behind each command, for example:

elif [ "$COMMAND" = "segread" ] ; then
  echo "[DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead."
  CLASS=org.apache.nutch.segment.SegmentReader

So the class for reading segments is org.apache.nutch.segment.SegmentReader.
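
Since bin/nutch simply launches that class with the command-line arguments, one possible way to do the same from Java is to call its main method with the same arguments. The sketch below is an assumption-laden illustration, not a documented Nutch API: the class ReadSegFromJava and its methods are hypothetical, the segment path and flags are the ones from the readseg command elsewhere in this thread, and reflection is used only so that the example compiles without the Nutch jars. With Nutch on the classpath you could call SegmentReader.main directly. Note that a main method may exit the JVM on bad arguments, so this suits a small tool better than a long-running process.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper, not part of Nutch.
class ReadSegFromJava {
    // Class name taken from the bin/nutch script quoted above.
    static final String SEGMENT_READER = "org.apache.nutch.segment.SegmentReader";

    // Builds the same argument list as
    // "bin/nutch readseg -dump <segmentDir> <outputDir> <flags...>".
    static String[] dumpArgs(String segmentDir, String outputDir, String... flags) {
        List<String> args = new ArrayList<>(Arrays.asList("-dump", segmentDir, outputDir));
        args.addAll(Arrays.asList(flags));
        return args.toArray(new String[0]);
    }

    // Invokes SegmentReader.main(args) if the class can be found.
    // Returns false when Nutch is not on the classpath.
    static boolean run(String[] args) {
        Class<?> reader;
        try {
            reader = Class.forName(SEGMENT_READER);
        } catch (ClassNotFoundException e) {
            return false;
        }
        try {
            Method main = reader.getMethod("main", String[].class);
            main.invoke(null, (Object) args);
            return true;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Could not run " + SEGMENT_READER, e);
        }
    }

    public static void main(String[] argv) {
        String[] args = dumpArgs("crawl_folder/segments/20091001145126/", "dump_folder",
                "-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext");
        if (!run(args)) {
            System.err.println("Nutch is not on the classpath; nothing was read.");
        }
    }
}
```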


> From: Itamar.Avni@verint.com
> To: nutch-user@lucene.apache.org
> Date: Wed, 16 Dec 2009 18:43:51 +0200
> Subject: RE: Extracting Essence of Page and Indexing only when Changed
> 
> Thanks 
> 
> Thanks BELLINI ADAM
> 
> Is there a way to do it in java?
> 
> Itamar Avni

RE: Extracting Essence of Page and Indexing only when Changed

Posted by "Avni, Itamar" <It...@verint.com>.

Thanks 

Thanks BELLINI ADAM

Is there a way to do it in java?

Itamar Avni


-----Original Message-----
From: BELLINI ADAM [mailto:mbellil@msn.com] 
Sent: Wednesday, December 16, 2009 6:35 PM
To: nutch-user@lucene.apache.org
Subject: RE: Extracting Essence of Page and Indexing only when Changed


Hi,
Note that you can extract the content of the page by reading the segment.

Type readseg to see the options. To dump only the content, use this command:

./bin/nutch readseg -dump crawl_folder/segments/20091001145126/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext

Hope this helps.

RE: Extracting Essence of Page and Indexing only when Changed

Posted by BELLINI ADAM <mb...@msn.com>.

Hi,
Note that you can extract the content of the page by reading the segment.

Type readseg to see the options. To dump only the content, use this command:

./bin/nutch readseg -dump crawl_folder/segments/20091001145126/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext

Hope this helps.