Posted to user@nutch.apache.org by Asier Martínez <ax...@gmail.com> on 2011/01/12 23:01:08 UTC
How store only home page of domains but crawl all the pages to detect
all different domains
Hi to all,
here is my problem. I want to crawl "all" the pages (to a certain depth
limit, you know) of certain domains/subdomains to detect them, but only
store the home pages of the domains (I don't have the list of the domains).
Is there an easy way to do this, or do I have to change the source code
of some plugin? Where can I start looking?
Thanks in advance,
Re: How store only home page of domains but crawl all the pages to detect all different domains
Posted by Charan K <ch...@gmail.com>.
In that case, you can generate from one database and do the db update to a different crawl db.
On Jan 15, 2011, at 10:06 AM, "Marseld Dedgjonaj" <ma...@ikubinfo.com> wrote:
> Hi,
> Thanks for your response.
> If I set -depth 1, this will work only for the first crawl.
> But since the initial URLs are very dynamic web pages whose content changes every hour,
> I need to crawl the initial URLs continuously (only the initial URLs).
>
>
> Best Regards,
> Marseldi
> -----Original Message-----
> From: alxsss@aim.com [mailto:alxsss@aim.com]
> Sent: Saturday, January 15, 2011 6:58 PM
> To: user@nutch.apache.org
> Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains
>
>
> Can't you do this by specifying -depth 1 in the crawl command?
>
> -----Original Message-----
> From: Marseld Dedgjonaj <ma...@ikubinfo.com>
> To: user <us...@nutch.apache.org>; markus.jelsma <ma...@openindex.io>; 'Asier Martínez' <ax...@gmail.com>
> Sent: Sat, Jan 15, 2011 3:44 am
> Subject: RE: How store only home page of domains but crawl all the pages to detect all different domains
>
> Hi Markus,
>
> I am also interested in using a different regex-urlfilter for the Generate
> step, because I need to crawl only the homepages of 10 websites continuously
> and index all the links on those homepages, without crawling recursively.
> I think it can be done by putting only these 10 websites in the
> regex-urlfilter file used for generate, and using the default regex-urlfilter
> in the other steps (fetch, updatedb, invertlinks, index).
> As I see, you said it is possible to have different nutch-site configs for
> each step. How can I configure one nutch-site config file for the generate
> step and another for the other steps?
> Should I change code for this, or is it just a configuration trick?
>
> Please help me with this. I really need it.
>
> Best regards,
> Marseld
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: Thursday, January 13, 2011 1:51 PM
> To: Asier Martínez
> Cc: user@nutch.apache.org
> Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains
>
> Hi,
>
> You will need to create different versions of regex-urlfilter.txt for the
> different jobs. You can have different nutch-site configs where each has a
> different setting for urlfilter.regex.file, pointing to the relevant
> regex-urlfilter file. Or you can just copy regex-urlfilter-<JOB>.txt to
> regex-urlfilter.txt before executing that job.
>
> Cheers,
>
> On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
>> Thank you, Markus, for your input. The homepage part I have "solved"
>> in my crawler in Python, but I found that Nutch works much faster than
>> my original crawler, which was based on the Twisted library.
>> And I want to learn more :-).
>>
>> I didn't know about different URL filters for fetching, updating, etc.
>> Where can I change those filters?
>>
>> Thank you,
>>
>> 2011/1/12 Markus Jelsma <ma...@openindex.io>:
>>> Hi,
>>>
>>> This is rather tricky. You can crawl a lot but index a little if you use
>>> different URL filters for fetching, updating the db, and indexing, so
>>> that part is rather easy.
>>>
>>> The question is how to define a home page in the URL filters. For this
>>> website it's /, for another it's /home.html, another redirects to
>>> subdomain.domain.extension, and yet another redirects to a
>>> language-based URL.
>>>
>>> Cheers,
>>>
>>>> Hi to all,
>>>> here is my problem. I want to crawl "all" the pages (to a certain depth
>>>> limit, you know) of certain domains/subdomains to detect them, but only
>>>> store the home pages of the domains (I don't have the list of the domains).
>>>> Is there an easy way to do this, or do I have to change the source code
>>>> of some plugin? Where can I start looking?
>>>>
>>>> Thanks in advance,
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
RE: How store only home page of domains but crawl all the pages to detect all different domains
Posted by Marseld Dedgjonaj <ma...@ikubinfo.com>.
Hi,
Thanks for your response.
If I set -depth 1, this will work only for the first crawl.
But since the initial URLs are very dynamic web pages whose content changes every hour,
I need to crawl the initial URLs continuously (only the initial URLs).
Best Regards,
Marseldi
-----Original Message-----
From: alxsss@aim.com [mailto:alxsss@aim.com]
Sent: Saturday, January 15, 2011 6:58 PM
To: user@nutch.apache.org
Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains
Can't you do this by specifying -depth 1 in the crawl command?
-----Original Message-----
From: Marseld Dedgjonaj <ma...@ikubinfo.com>
To: user <us...@nutch.apache.org>; markus.jelsma <ma...@openindex.io>; 'Asier Martínez' <ax...@gmail.com>
Sent: Sat, Jan 15, 2011 3:44 am
Subject: RE: How store only home page of domains but crawl all the pages to detect all different domains
Hi Markus,
I am also interested in using a different regex-urlfilter for the Generate
step, because I need to crawl only the homepages of 10 websites continuously
and index all the links on those homepages, without crawling recursively.
I think it can be done by putting only these 10 websites in the
regex-urlfilter file used for generate, and using the default regex-urlfilter
in the other steps (fetch, updatedb, invertlinks, index).
As I see, you said it is possible to have different nutch-site configs for
each step. How can I configure one nutch-site config file for the generate
step and another for the other steps?
Should I change code for this, or is it just a configuration trick?
Please help me with this. I really need it.
Best regards,
Marseld
-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
Sent: Thursday, January 13, 2011 1:51 PM
To: Asier Martínez
Cc: user@nutch.apache.org
Subject: Re: How store only home page of domains but crawl all the pages to
detect all different domains
Hi,
You will need to create different versions of regex-urlfilter.txt for the
different jobs. You can have different nutch-site configs where each has a
different setting for urlfilter.regex.file, pointing to the relevant
regex-urlfilter file. Or you can just copy regex-urlfilter-<JOB>.txt to
regex-urlfilter.txt before executing that job.
Cheers,
On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> Thank you, Markus, for your input. The homepage part I have "solved"
> in my crawler in Python, but I found that Nutch works much faster
> than my original crawler, which was based on the Twisted library.
> And I want to learn more :-).
>
> I didn't know about different URL filters for fetching, updating, etc.
> Where can I change those filters?
>
> Thank you,
>
> 2011/1/12 Markus Jelsma <ma...@openindex.io>:
> > Hi,
> >
> > This is rather tricky. You can crawl a lot but index a little if you use
> > different URL filters for fetching, updating the db, and indexing, so
> > that part is rather easy.
> >
> > The question is how to define a home page in the URL filters. For this
> > website it's /, for another it's /home.html, another redirects to
> > subdomain.domain.extension, and yet another redirects to a
> > language-based URL.
> >
> > Cheers,
> >
> >> Hi to all,
> >> here is my problem. I want to crawl "all" the pages (to a certain depth
> >> limit, you know) of certain domains/subdomains to detect them, but only
> >> store the home pages of the domains (I don't have the list of the domains).
> >> Is there an easy way to do this, or do I have to change the source code
> >> of some plugin? Where can I start looking?
> >>
> >> Thanks in advance,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: How store only home page of domains but crawl all the pages to detect
all different domains
Posted by al...@aim.com.
Can't you do this by specifying -depth 1 in the crawl command?
-----Original Message-----
From: Marseld Dedgjonaj <ma...@ikubinfo.com>
To: user <us...@nutch.apache.org>; markus.jelsma <ma...@openindex.io>; 'Asier Martínez' <ax...@gmail.com>
Sent: Sat, Jan 15, 2011 3:44 am
Subject: RE: How store only home page of domains but crawl all the pages to detect all different domains
Hi Markus,
I am also interested in using a different regex-urlfilter for the Generate
step, because I need to crawl only the homepages of 10 websites continuously
and index all the links on those homepages, without crawling recursively.
I think it can be done by putting only these 10 websites in the
regex-urlfilter file used for generate, and using the default regex-urlfilter
in the other steps (fetch, updatedb, invertlinks, index).
As I see, you said it is possible to have different nutch-site configs for
each step. How can I configure one nutch-site config file for the generate
step and another for the other steps?
Should I change code for this, or is it just a configuration trick?
Please help me with this. I really need it.
Best regards,
Marseld
-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
Sent: Thursday, January 13, 2011 1:51 PM
To: Asier Martínez
Cc: user@nutch.apache.org
Subject: Re: How store only home page of domains but crawl all the pages to
detect all different domains
Hi,
You will need to create different versions of regex-urlfilter.txt for the
different jobs. You can have different nutch-site configs where each has a
different setting for urlfilter.regex.file, pointing to the relevant
regex-urlfilter file. Or you can just copy regex-urlfilter-<JOB>.txt to
regex-urlfilter.txt before executing that job.
Cheers,
On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> Thank you, Markus, for your input. The homepage part I have "solved"
> in my crawler in Python, but I found that Nutch works much faster
> than my original crawler, which was based on the Twisted library.
> And I want to learn more :-).
>
> I didn't know about different URL filters for fetching, updating, etc.
> Where can I change those filters?
>
> Thank you,
>
> 2011/1/12 Markus Jelsma <ma...@openindex.io>:
> > Hi,
> >
> > This is rather tricky. You can crawl a lot but index a little if you use
> > different URL filters for fetching, updating the db, and indexing, so
> > that part is rather easy.
> >
> > The question is how to define a home page in the URL filters. For this
> > website it's /, for another it's /home.html, another redirects to
> > subdomain.domain.extension, and yet another redirects to a
> > language-based URL.
> >
> > Cheers,
> >
> >> Hi to all,
> >> here is my problem. I want to crawl "all" the pages (to a certain depth
> >> limit, you know) of certain domains/subdomains to detect them, but only
> >> store the home pages of the domains (I don't have the list of the domains).
> >> Is there an easy way to do this, or do I have to change the source code
> >> of some plugin? Where can I start looking?
> >>
> >> Thanks in advance,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
RE: How store only home page of domains but crawl all the pages to detect all different domains
Posted by Marseld Dedgjonaj <ma...@ikubinfo.com>.
Hi Markus,
I am also interested in using a different regex-urlfilter for the Generate
step, because I need to crawl only the homepages of 10 websites continuously
and index all the links on those homepages, without crawling recursively.
I think it can be done by putting only these 10 websites in the
regex-urlfilter file used for generate, and using the default regex-urlfilter
in the other steps (fetch, updatedb, invertlinks, index).
As I see, you said it is possible to have different nutch-site configs for
each step. How can I configure one nutch-site config file for the generate
step and another for the other steps?
Should I change code for this, or is it just a configuration trick?
Please help me with this. I really need it.
Best regards,
Marseld
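[A generate-only regex-urlfilter file along the lines Marseld describes might look like the following. The two site names are illustrative placeholders, not his actual list; the format is Nutch's standard one, where the first matching +/- rule wins and a trailing "-." rejects everything else:]

```
# keep only the homepages of the whitelisted sites
+^http://(www\.)?siteone\.example/$
+^http://(www\.)?sitetwo\.example/$
# reject everything else
-.
```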
-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
Sent: Thursday, January 13, 2011 1:51 PM
To: Asier Martínez
Cc: user@nutch.apache.org
Subject: Re: How store only home page of domains but crawl all the pages to
detect all different domains
Hi,
You will need to create different versions of regex-urlfilter.txt for the
different jobs. You can have different nutch-site configs where each has a
different setting for urlfilter.regex.file, pointing to the relevant
regex-urlfilter file. Or you can just copy regex-urlfilter-<JOB>.txt to
regex-urlfilter.txt before executing that job.
Cheers,
On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> Thank you, Markus, for your input. The homepage part I have "solved"
> in my crawler in Python, but I found that Nutch works much faster
> than my original crawler, which was based on the Twisted library.
> And I want to learn more :-).
>
> I didn't know about different URL filters for fetching, updating, etc.
> Where can I change those filters?
>
> Thank you,
>
> 2011/1/12 Markus Jelsma <ma...@openindex.io>:
> > Hi,
> >
> > This is rather tricky. You can crawl a lot but index a little if you use
> > different URL filters for fetching, updating the db, and indexing, so
> > that part is rather easy.
> >
> > The question is how to define a home page in the URL filters. For this
> > website it's /, for another it's /home.html, another redirects to
> > subdomain.domain.extension, and yet another redirects to a
> > language-based URL.
> >
> > Cheers,
> >
> >> Hi to all,
> >> here is my problem. I want to crawl "all" the pages (to a certain depth
> >> limit, you know) of certain domains/subdomains to detect them, but only
> >> store the home pages of the domains (I don't have the list of the domains).
> >> Is there an easy way to do this, or do I have to change the source code
> >> of some plugin? Where can I start looking?
> >>
> >> Thanks in advance,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: How store only home page of domains but crawl all the pages to detect all different domains
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
You will need to create different versions of regex-urlfilter.txt for the
different jobs. You can have different nutch-site configs where each has a
different setting for urlfilter.regex.file, pointing to the relevant
regex-urlfilter file. Or you can just copy regex-urlfilter-<JOB>.txt to
regex-urlfilter.txt before executing that job.
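[The copy approach can be sketched in a few lines of shell. The conf location and the per-job file names (regex-urlfilter-generate.txt, regex-urlfilter-index.txt) are assumptions for illustration, not Nutch defaults, and the bin/nutch job invocations are left as comments:]

```shell
# Keep one regex-urlfilter file per job and copy the right one into place
# before running that job. CONF and the per-job file names are assumptions.
CONF=${CONF:-conf}
mkdir -p "$CONF"
# demo contents; in practice these files already hold your per-job rules
printf '%s\n' '+^http://(www\.)?example\.com/$' '-.' > "$CONF/regex-urlfilter-generate.txt"
printf '%s\n' '+.' > "$CONF/regex-urlfilter-index.txt"

cp "$CONF/regex-urlfilter-generate.txt" "$CONF/regex-urlfilter.txt"
# bin/nutch generate crawl/crawldb crawl/segments   # generate now sees the strict rules
cp "$CONF/regex-urlfilter-index.txt" "$CONF/regex-urlfilter.txt"
# bin/nutch invertlinks / index jobs                # these now see the permissive rules
echo "active filter now: $(cat "$CONF/regex-urlfilter.txt")"
```

[Note that this trick assumes the jobs run sequentially; two jobs running at once would race on the shared regex-urlfilter.txt, which is where the per-job nutch-site configs with urlfilter.regex.file are the cleaner option.]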
Cheers,
On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> Thank you, Markus, for your input. The homepage part I have "solved"
> in my crawler in Python, but I found that Nutch works much faster
> than my original crawler, which was based on the Twisted library.
> And I want to learn more :-).
>
> I didn't know about different URL filters for fetching, updating, etc.
> Where can I change those filters?
>
> Thank you,
>
> 2011/1/12 Markus Jelsma <ma...@openindex.io>:
> > Hi,
> >
> > This is rather tricky. You can crawl a lot but index a little if you use
> > different URL filters for fetching, updating the db, and indexing, so
> > that part is rather easy.
> >
> > The question is how to define a home page in the URL filters. For this
> > website it's /, for another it's /home.html, another redirects to
> > subdomain.domain.extension, and yet another redirects to a
> > language-based URL.
> >
> > Cheers,
> >
> >> Hi to all,
> >> here is my problem. I want to crawl "all" the pages (to a certain depth
> >> limit, you know) of certain domains/subdomains to detect them, but only
> >> store the home pages of the domains (I don't have the list of the domains).
> >> Is there an easy way to do this, or do I have to change the source code
> >> of some plugin? Where can I start looking?
> >>
> >> Thanks in advance,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: How store only home page of domains but crawl all the pages to
detect all different domains
Posted by Asier Martínez <ax...@gmail.com>.
Thank you, Markus, for your input. The homepage part I have "solved"
in my crawler in Python, but I found that Nutch works much faster
than my original crawler, which was based on the Twisted library.
And I want to learn more :-).
I didn't know about different URL filters for fetching, updating, etc.
Where can I change those filters?
Thank you,
2011/1/12 Markus Jelsma <ma...@openindex.io>:
> Hi,
>
> This is rather tricky. You can crawl a lot but index a little if you use
> different URL filters for fetching, updating the db, and indexing, so that
> part is rather easy.
>
> The question is how to define a home page in the URL filters. For this
> website it's /, for another it's /home.html, another redirects to
> subdomain.domain.extension, and yet another redirects to a language-based
> URL.
>
> Cheers,
>
>> Hi to all,
>> here is my problem. I want to crawl "all" the pages (to a certain depth
>> limit, you know) of certain domains/subdomains to detect them, but only
>> store the home pages of the domains (I don't have the list of the domains).
>> Is there an easy way to do this, or do I have to change the source code
>> of some plugin? Where can I start looking?
>>
>> Thanks in advance,
>
Re: How store only home page of domains but crawl all the pages to detect all different domains
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
This is rather tricky. You can crawl a lot but index a little if you use
different URL filters for fetching, updating the db, and indexing, so that
part is rather easy.
The question is how to define a home page in the URL filters. For this
website it's /, for another it's /home.html, another redirects to
subdomain.domain.extension, and yet another redirects to a language-based
URL.
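[The difficulty Markus describes can be seen with a toy filter. The regex and the three URLs below are made up for illustration: a rule that keeps only URLs whose path is exactly "/" works for the first and third site, but silently drops a site whose real home page lives at /home.html:]

```shell
# Naive "homepage" rule: keep only URLs whose path is exactly "/".
homepage_re='^https?://[^/]+/$'
urls='http://example.com/
http://example.org/home.html
http://de.example.net/'
kept=$(printf '%s\n' "$urls" | grep -E "$homepage_re")
printf '%s\n' "$kept"
# http://example.org/home.html is dropped even though it is that site's home page
```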
Cheers,
> Hi to all,
> here is my problem. I want to crawl "all" the pages (to a certain depth
> limit, you know) of certain domains/subdomains to detect them, but only
> store the home pages of the domains (I don't have the list of the domains).
> Is there an easy way to do this, or do I have to change the source code
> of some plugin? Where can I start looking?
>
> Thanks in advance,