Posted to solr-user@lucene.apache.org by aidahaj <ai...@gmail.com> on 2009/04/24 18:13:47 UTC

Solr index

Hi,
I'm using Nutch to crawl a list of web sites.
Solr is my index (Nutch-1.0 integration with Solr).
I'm working on detecting web site defacement (that is, any change in the
text of a web page).
I want to know whether Solr gives me a way to detect changes to the
Documents in the index before committing, or whether it keeps a log file or
something similar (the text that has changed between two points in time).
I'd appreciate your help. Thanks a lot.
-- 
View this message in context: http://www.nabble.com/Solr-index-tp23219842p23219842.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr index

Posted by aidahaj <ai...@gmail.com>.
Thanks a lot,
I have had a look at these classes.
But what I want to do exactly is to detect whether a Document (in the Solr
index) has changed when I recrawl a site with Nutch.
I don't want to block duplicates, but to detect whether a Document has
changed and extract the changes to a file without writing them over the old
Document.
After that I decide whether to overwrite the Document or to keep both the
old one and the new one.
I hope this is more precise.
Thanks, and please excuse my poor English.
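
The workflow described above (compare the recrawled text with the indexed text, and extract the changes without overwriting the old Document) could be sketched roughly as follows. This is only an illustration in Python using the standard difflib module; it is not part of Nutch or Solr, and the function name extract_changes and its label parameter are hypothetical. It assumes the old and new page text have already been fetched as plain strings.

```python
import difflib


def extract_changes(old_text: str, new_text: str, label: str = "page") -> str:
    """Return a unified diff between the indexed and the recrawled text.

    Hypothetical helper: an empty string means no textual change.
    Neither input is modified, so the old version stays intact.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"{label} (indexed)",
        tofile=f"{label} (recrawled)",
        lineterm="",  # lines were split without their newlines
    )
    return "\n".join(diff)


if __name__ == "__main__":
    old = "Welcome to our site\nContact us"
    new = "Welcome to our site\nHacked by someone"
    changes = extract_changes(old, new, label="http://example.com/")
    if changes:
        # The diff could be written to a separate file here instead,
        # leaving the decision to overwrite the old Document for later.
        print(changes)
```

Since the diff is returned rather than applied, the choice between overwriting the old Document and keeping both versions can be made afterwards.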


-- 
View this message in context: http://www.nabble.com/Solr-index-tp23219842p23254601.html


Re: Solr index

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Solr 1.4 (trunk) has similar functionality.

http://wiki.apache.org/solr/Deduplication

On Fri, Apr 24, 2009 at 9:53 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

>
> Hi,
>
> Solr doesn't include such functionality.  But Nutch has:
>
> [otis@localhost src]$ ff \*Signature\*java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./java/org/apache/nutch/crawl/SignatureFactory.java
> ./java/org/apache/nutch/crawl/MD5Signature.java
> ./java/org/apache/nutch/crawl/Signature.java
> ./java/org/apache/nutch/crawl/TextProfileSignature.java
> ./java/org/apache/nutch/crawl/SignatureComparator.java
>
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr index

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

Solr doesn't include such functionality.  But Nutch has:

[otis@localhost src]$ ff \*Signature\*java
./test/org/apache/nutch/crawl/TestSignatureFactory.java
./java/org/apache/nutch/crawl/SignatureFactory.java
./java/org/apache/nutch/crawl/MD5Signature.java
./java/org/apache/nutch/crawl/Signature.java
./java/org/apache/nutch/crawl/TextProfileSignature.java
./java/org/apache/nutch/crawl/SignatureComparator.java
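
The general idea behind a signature class is to reduce a page's text to a short digest, so that two crawls of the same URL can be compared cheaply. Below is a minimal sketch of that idea in Python with the standard hashlib module. It does not reproduce the actual Nutch classes listed above; the function names, the whitespace normalization, and the dictionary of stored signatures are all assumptions made for illustration.

```python
import hashlib
from typing import Dict


def text_signature(text: str) -> str:
    """Return an MD5 hex digest of the whitespace-normalized text.

    Assumption: collapsing whitespace is enough normalization for this sketch.
    """
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def has_changed(url: str, new_text: str, stored: Dict[str, str]) -> bool:
    """Report whether new_text differs from the signature stored for url.

    Hypothetical helper: a URL with no stored signature is reported as
    unchanged, and the stored signatures are never modified here.
    """
    old_signature = stored.get(url)
    if old_signature is None:
        return False
    return old_signature != text_signature(new_text)


if __name__ == "__main__":
    stored = {"http://example.com/": text_signature("Welcome to our site")}
    print(has_changed("http://example.com/", "Welcome  to our site", stored))
    print(has_changed("http://example.com/", "Hacked by someone", stored))
```

An exact digest like this flags any textual change at all; a fuzzier signature would be needed to ignore small edits.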

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch