Posted to user@nutch.apache.org by mina <ta...@gmail.com> on 2011/09/24 17:37:42 UTC

how do recrawl sites and filesystems?

Hi all,
I want to recrawl sites with a short interval and the file system with a long
interval.
That is, sites should have a crawl period different from the file system's crawl period.
How can I do that?

--
View this message in context: http://lucene.472066.n3.nabble.com/how-do-recrawl-sites-and-filesystems-tp3364532p3364532.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: how do recrawl sites and filesystems?

Posted by mina <ta...@gmail.com>.
On Sat, Sep 24, 2011 at 9:23 AM, tahere ganjiyar
<ta...@gmail.com> wrote:

> Thanks for your answer.
> How can I use Jira?
> I don't know it.
> Please help me.

PruneIndexTool doesn't work?

Posted by Patricio Galeas <pg...@yahoo.de>.
Hello,


I execute the PruneIndexTool to remove some unwanted URLs from my index, but it doesn't work.

I run:
bin/nutch org.apache.nutch.tools.PruneIndexTool /nutch/local/my_crawl/index -queries queries.txt -output pruned.txt

where:
queries.txt has the following entries:
site:topsy.com
site:osdir.com
site:www.cez.cz
site:biblecourses.com
site:bbftv.tv
site:autoavangarde.org
site:www.volkswagen.com
site:premiere21.com

After executing the command, pruned.txt contains a lot of URLs from the pruned sites, but when I run a new query all pruned sites are still in the results.

What am I doing wrong?

Thanks
Patricio

Re: how do recrawl sites and filesystems?

Posted by mina <ta...@gmail.com>.
Thank you, I will use this solution.

On Sat, Sep 24, 2011 at 1:35 PM, lewis john mcgibbney [via Lucene] <
ml-node+s472066n3365096h95@n3.nabble.com> wrote:

> Hi Mina,
>
> Off the top of my head, I can't remember whether you need to register with JIRA
> first. After that, you can simply create an issue with the option in the top
> right-hand corner.
>
> The resulting process will enable you to create an issue, which should be
> accompanied by a description and a suggestion for how the issue is to be
> solved. If you have the time and resources to submit a patch which other
> devs can use, then please feel free.
>
> Lewis

Re: how do recrawl sites and filesystems?

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Mina,

Off the top of my head, I can't remember whether you need to register with JIRA
first. After that, you can simply create an issue with the option in the top
right-hand corner.

The resulting process will enable you to create an issue, which should be
accompanied by a description and a suggestion for how the issue is to be
solved. If you have the time and resources to submit a patch which other
devs can use, then please feel free.

Lewis

On Sat, Sep 24, 2011 at 6:32 PM, mina <ta...@gmail.com> wrote:

> How should I use this?
>
> On Sat, Sep 24, 2011 at 9:46 AM, Markus Jelsma-2 [via Lucene] <
> ml-node+s472066n3364686h85@n3.nabble.com> wrote:
>
> > No need to send multiple messages. Here's Nutch's Jira issue tracker:
> > https://issues.apache.org/jira/browse/NUTCH
> >
> > > Thanks for your answer.
> > > How can I use Jira?
> > > I don't know it.
> > > Please help me.




-- 
*Lewis*

Re: how do recrawl sites and filesystems?

Posted by mina <ta...@gmail.com>.
How should I use this?

On Sat, Sep 24, 2011 at 9:46 AM, Markus Jelsma-2 [via Lucene] <
ml-node+s472066n3364686h85@n3.nabble.com> wrote:

> No need to send multiple messages. Here's Nutch's Jira issue tracker:
> https://issues.apache.org/jira/browse/NUTCH
>
> > Thanks for your answer.
> > How can I use Jira?
> > I don't know it.
> > Please help me.

Re: how do recrawl sites and filesystems?

Posted by Markus Jelsma <ma...@openindex.io>.
No need to send multiple messages. Here's Nutch's Jira issue tracker:
https://issues.apache.org/jira/browse/NUTCH

> Thanks for your answer.
> How can I use Jira?
> I don't know it.
> Please help me.

Re: how do recrawl sites and filesystems?

Posted by mina <ta...@gmail.com>.
Thanks for your answer.
How can I use Jira?
I don't know it.
Please help me.



helpME

Posted by mina <ta...@gmail.com>.
Thanks for your answer.

How can I use Jira?

I don't know it.

Please help me.



Re: how do recrawl sites and filesystems?

Posted by Markus Jelsma <ma...@openindex.io>.
This is not possible with the supplied fetch schedulers, so you need a
customized scheduler.
In Jira there's an open issue for supporting different intervals depending on
MIME type; it is very similar to what you ask. It extends
AdaptiveFetchSchedule and allows for setting different increment and decrement
rates per MIME type.
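
A minimal sketch of the idea behind such a customized scheduler, written in
Python for brevity rather than against Nutch's Java API: pick the re-fetch
interval from the URL scheme, so web pages come due sooner than filesystem
entries. The names INTERVAL_BY_SCHEME, DEFAULT_INTERVAL, fetch_interval and
next_fetch_time, as well as all interval values, are hypothetical and not part
of Nutch; a real implementation would presumably extend one of Nutch's fetch
schedule classes, such as the AdaptiveFetchSchedule mentioned above.

```python
from urllib.parse import urlparse

DAY = 24 * 60 * 60  # one day in seconds

# Hypothetical intervals: the thread gives no concrete values.
INTERVAL_BY_SCHEME = {
    "http": 1 * DAY,    # web sites: short interval
    "https": 1 * DAY,
    "file": 30 * DAY,   # file system: long interval
}
DEFAULT_INTERVAL = 7 * DAY  # fallback for any other scheme


def fetch_interval(url: str) -> int:
    """Return the re-fetch interval in seconds for a URL, chosen by scheme."""
    scheme = urlparse(url).scheme.lower()
    return INTERVAL_BY_SCHEME.get(scheme, DEFAULT_INTERVAL)


def next_fetch_time(url: str, last_fetch: int) -> int:
    """Return the earliest time (epoch seconds) at which the URL is due again."""
    return last_fetch + fetch_interval(url)
```

With these assumed values, a file:// URL fetched at time t would be due again
at t + 30 days, while an http:// URL would be due after one day.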

> Hi all,
> I want to recrawl sites with a short interval and the file system with a long
> interval.
> That is, sites should have a crawl period different from the file system's crawl period.
> How can I do that?