Posted to user@nutch.apache.org by Arun Kumar Sharma <sh...@yahoo.co.in> on 2005/12/16 09:04:43 UTC

Is there any way to check that no duplicate url get inserted through "WebDBInjector"

Hi,
I have a list of URLs that may contain duplicates. I want to make sure that no duplicate URL is inserted through WebDBInjector. Is there any way to achieve this using Nutch functionality?
Answer awaited anxiously...


Regards,
 
Arun Kumar Sharma (Tech Lead -Java/J2EE)
Mob: +91.981.529.5761





Re: Is there any way to check that no duplicate url get inserted through "WebDBInjector"

Posted by Arun Kaundal <ar...@gmail.com>.
I am new to these things. Can you let me know in detail where this filter
is found and what Mercator is? I want to use it with Java. Is there a way
I can use this functionality from Java to remove duplicate URLs from the
urls.txt file?
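
For the urls.txt case asked about here, a minimal Java sketch may help. It is not part of Nutch; it assumes the file holds one URL per line, fits in memory, and that two URLs count as duplicates only when they are the same string after trimming whitespace. The class name UrlListDedup and the output file name urls-unique.txt are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not a Nutch class): drops repeated lines from a URL list. */
class UrlListDedup {

    /** Returns the URLs in first-seen order, with blank lines and exact repeats dropped. */
    static List<String> dedupe(List<String> lines) {
        // LinkedHashSet ignores repeated entries and keeps insertion order.
        Set<String> seen = new LinkedHashSet<>();
        for (String line : lines) {
            String url = line.trim();
            if (!url.isEmpty()) {
                seen.add(url);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) throws IOException {
        // File names are assumptions; pass your own as arguments.
        Path in = Paths.get(args.length > 0 ? args[0] : "urls.txt");
        Path out = Paths.get(args.length > 1 ? args[1] : "urls-unique.txt");
        List<String> unique = dedupe(Files.readAllLines(in, StandardCharsets.UTF_8));
        Files.write(out, unique, StandardCharsets.UTF_8);
        System.out.println(unique.size() + " unique URLs written to " + out);
    }
}
```

An in-memory set like this is exact and is usually enough for a seed list. It does not normalize URLs, so http://example.com and http://example.com/ are treated as different entries.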


Re: Is there any way to check that no duplicate url get inserted through "WebDBInjector"

Posted by Zhao Loen <le...@gmail.com>.
1. Bloom filter
A highly efficient algorithm for eliminating duplicate URLs.

2. Disk-based hash table
Mercator uses this approach.
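
To illustrate the first option, here is a minimal Bloom filter sketch in Java. It is not code from Nutch or Mercator; the class name UrlBloomFilter, the sizing parameters, and the choice of hash functions are all assumptions made for the example. A Bloom filter is probabilistic: it never misses a URL that was added, but it can occasionally report a new URL as already seen.

```java
import java.util.BitSet;

/** Illustrative Bloom filter for URL strings (hypothetical class, not a Nutch API). */
class UrlBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    UrlBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Marks the URL as seen; returns true only if it was definitely not seen before. */
    boolean add(String url) {
        boolean isNew = false;
        for (int idx : indexes(url)) {
            if (!bits.get(idx)) {
                isNew = true;   // at least one bit was unset, so the URL is new
                bits.set(idx);
            }
        }
        return isNew;
    }

    /** Returns false if the URL was never added; true means "probably added". */
    boolean mightContain(String url) {
        for (int idx : indexes(url)) {
            if (!bits.get(idx)) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive numHashes bit positions from two base hashes.
    private int[] indexes(String url) {
        int h1 = url.hashCode();
        int h2 = fnv1aStyle(url);
        int[] out = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            out[i] = Math.floorMod(h1 + i * h2, numBits);
        }
        return out;
    }

    // FNV-1a-style hash over the string's UTF-16 code units.
    private static int fnv1aStyle(String s) {
        int h = 0x811C9DC5;
        for (int k = 0; k < s.length(); k++) {
            h ^= s.charAt(k);
            h *= 0x01000193;
        }
        return h;
    }
}
```

A caller would invoke add(url) for each URL in the list and keep only those for which it returns true. Because of false positives, a few distinct URLs may be dropped; when the list fits in memory, an exact set avoids that.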




--
Search whenever you want

Re: Is there any way to check that no duplicate url get inserted through "WebDBInjector"

Posted by Transbuerg Tian <ac...@gmail.com>.
If you want more information about Mercator, visit:
http://research.compaq.com/SRC/mercator/


Home Page of the Mercator Web Crawler

Welcome to the Mercator home page. Mercator is a web crawler built by
researchers at Compaq's Systems Research Center.

Why did you choose the name "Mercator"?

*Gerardus Mercator, 1512-1594. Flemish cartographer whose most important
innovation was a map, later known as the Mercator projection, on which
parallels and meridians are rendered as straight lines spaced so as to
produce at any point an accurate ratio of latitude to longitude. Mercator
also introduced the term atlas for a collection of maps.* --Encyclopædia
Britannica

Our crawler, like the famous cartographer, aims at producing "maps" of the
known (virtual) world that accurately depict its dimensions.



http://domolo.oicp.net/bbs/dispbbs.asp?boardid=29&id=56&star=1#56

Re: Is there any way to check that no duplicate url get inserted through "WebDBInjector"

Posted by Arun Kaundal <ar...@gmail.com>.
That may be the case if we run Nutch only once on a crawl directory. What I
am doing is running the Nutch Crawl tool on an already existing crawl
directory, after modifying CrawlTool slightly: if the db directory already
exists, it neither creates it nor returns an error message. But after doing
this, the contents of files such as "db\webdb\linksByMD5" are nearly triple
what they were after a single run. How is it possible to run Nutch more than
once on the same crawl directory? Do you think I am going wrong somewhere in
my approach?
Answer awaited...



Re: Is there any way to check that no duplicate url get inserted through "WebDBInjector"

Posted by Stefan Groschupf <sg...@media-style.com>.
The web DB itself handles duplicate URLs by ignoring them.
So if you inject yahoo.in 100 times, the web DB will still have only one
entry.

