Posted to user@nutch.apache.org by Peyman Mohajerian <mo...@gmail.com> on 2011/11/06 21:35:30 UTC

crawling a subdomain

Hi Guys,

Let's say my input file is:
http://www.xyz.com/stuff

and I have thousands of these URLs in my input. How do I configure
Nutch to also crawl this subdomain for each input:
http://abc.xyz.com/stuff

I don't want to just replace 'www' with 'abc'; I want to crawl both.

Thanks
Peyman

Re: crawling a subdomain

Posted by Peyman Mohajerian <mo...@gmail.com>.
That is correct, and of course the new URLs are based on replacing some
parameter in the original list of URLs, e.g. 'www' with 'abc', the
opposite of filtering. I think I have to modify the source code for
this; if so, my guess is the Injector class would be the best place? Of
course, ideally I don't want to add my own customization!!

On Sun, Nov 6, 2011 at 11:22 PM, Sergey A Volkov
<se...@gmail.com> wrote:
> If I understand correctly,
> you can run the inject job on your crawldb with a new input file
> containing the new URLs; the old URLs will still be in the crawldb.
>
> On Mon 07 Nov 2011 10:15:26 AM MSK, Peyman Mohajerian wrote:
>>
>> Thanks Sergey,
>> I don't think I was clear on the issue. The subdomain I'm speaking of
>> won't be found by the crawler; I have to somehow add it. Starting from
>> my original input URL, http://www.xyz.com/stuff, there is absolutely
>> no way the crawler would know about http://abc.xyz.com/stuff, so I
>> have to add the subdomain dynamically.
>> I also don't have the option of actually adding
>> 'http://abc.xyz.com/stuff' to my input file (a bit of an extra
>> convolution I don't want to bore you with!!).
>>
>> Thanks,
>> Peyman
>>
>> On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov
>> <se...@gmail.com>  wrote:
>>>
>>> Hi!
>>>
>>> I think you should use a urlfilter-regex pattern like
>>> "http://\w+\.xyz\.com/stuff.*" instead of urlfilter-domain and set
>>> db.ignore.external.links to false. This will work, but it is quite
>>> slow if you have many regexes.
>>>
>>> You may also try adding xyz.com to domain-suffixes.xml, but this may
>>> cause some side effects. I have never tested this, just looked at the
>>> DomainURLFilter source, so it's probably not a really good idea.
>>>
>>> Sergey Volkov
>>>
>>> On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>> Let's say my input file is:
>>>> http://www.xyz.com/stuff
>>>>
>>>> and I have thousands of these URLs in my input. How do I configure
>>>> Nutch to also crawl this subdomain for each input:
>>>> http://abc.xyz.com/stuff
>>>>
>>>> I don't want to just replace 'www' with 'abc'; I want to crawl both.
>>>>
>>>> Thanks
>>>> Peyman
>>>
>>>
>>>
>
>
>

Re: crawling a subdomain

Posted by Sergey A Volkov <se...@gmail.com>.
If I understand correctly,
you can run the inject job on your crawldb with a new input file
containing the new URLs; the old URLs will still be in the crawldb.
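
A rough sketch of that workflow, assuming the original seeds live in
urls/seed.txt (the paths and the sed rewrite rule are illustrative
assumptions, not part of Nutch):

    # Derive the abc.* twins from the original seed list.
    mkdir -p urls_abc
    sed 's|http://www\.|http://abc.|' urls/seed.txt > urls_abc/seed.txt

    # Inject them; inject is additive, so existing crawldb entries
    # are kept and the new URLs are merged in.
    bin/nutch inject crawl/crawldb urls_abc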

On Mon 07 Nov 2011 10:15:26 AM MSK, Peyman Mohajerian wrote:
> Thanks Sergey,
> I don't think I was clear on the issue. The subdomain I'm speaking of
> won't be found by the crawler; I have to somehow add it. Starting from
> my original input URL, http://www.xyz.com/stuff, there is absolutely
> no way the crawler would know about http://abc.xyz.com/stuff, so I
> have to add the subdomain dynamically.
> I also don't have the option of actually adding
> 'http://abc.xyz.com/stuff' to my input file (a bit of an extra
> convolution I don't want to bore you with!!).
>
> Thanks,
> Peyman
>
> On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov
> <se...@gmail.com>  wrote:
>> Hi!
>>
>> I think you should use a urlfilter-regex pattern like
>> "http://\w+\.xyz\.com/stuff.*" instead of urlfilter-domain and set
>> db.ignore.external.links to false. This will work, but it is quite
>> slow if you have many regexes.
>>
>> You may also try adding xyz.com to domain-suffixes.xml, but this may
>> cause some side effects. I have never tested this, just looked at the
>> DomainURLFilter source, so it's probably not a really good idea.
>>
>> Sergey Volkov
>>
>> On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote:
>>>
>>> Hi Guys,
>>>
>>> Let's say my input file is:
>>> http://www.xyz.com/stuff
>>>
>>> and I have thousands of these URLs in my input. How do I configure
>>> Nutch to also crawl this subdomain for each input:
>>> http://abc.xyz.com/stuff
>>>
>>> I don't want to just replace 'www' with 'abc'; I want to crawl both.
>>>
>>> Thanks
>>> Peyman
>>
>>
>>



Re: crawling a subdomain

Posted by Mathijs Homminga <ma...@gmail.com>.
You could write your own simple parse plugin that generates abc.xyz.com/stuff as an outlink of www.xyz.com/stuff, which is then crawled in one of the subsequent crawl cycles.
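
A minimal sketch of such a filter, assuming the Nutch 1.x
HtmlParseFilter extension point (the class name and the hard-coded
www-to-abc rewrite are illustrative assumptions; check the exact
signatures against your Nutch version):

    import java.net.MalformedURLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.parse.ParseText;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    // Adds the abc.* twin of each www.xyz.com page as an extra
    // outlink, so it is fetched in a later crawl cycle.
    public class SubdomainOutlinkFilter implements HtmlParseFilter {

      private Configuration conf;

      public ParseResult filter(Content content, ParseResult parseResult,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        String url = content.getUrl();
        if (!url.startsWith("http://www.xyz.com/")) {
          return parseResult;        // only rewrite pages on the www host
        }
        Parse parse = parseResult.get(url);
        ParseData data = parse.getData();
        Outlink[] old = data.getOutlinks();
        Outlink[] outlinks = new Outlink[old.length + 1];
        System.arraycopy(old, 0, outlinks, 0, old.length);
        try {
          // append http://abc.xyz.com/... as an additional outlink
          outlinks[old.length] =
              new Outlink(url.replaceFirst("//www\\.", "//abc."), "");
        } catch (MalformedURLException e) {
          return parseResult;        // leave the parse untouched
        }
        ParseData newData = new ParseData(data.getStatus(), data.getTitle(),
            outlinks, data.getContentMeta(), data.getParseMeta());
        parseResult.put(url, new ParseText(parse.getText()), newData);
        return parseResult;
      }

      public void setConf(Configuration conf) { this.conf = conf; }

      public Configuration getConf() { return conf; }
    }

Like any Nutch plugin, it would also need a plugin.xml descriptor and an
entry in the plugin.includes property in nutch-site.xml before Nutch
picks it up.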

Mathijs Homminga

On Nov 7, 2011, at 7:15, Peyman Mohajerian <mo...@gmail.com> wrote:

> Thanks Sergey,
> I don't think I was clear on the issue. The subdomain I'm speaking of
> won't be found by the crawler; I have to somehow add it. Starting from
> my original input URL, http://www.xyz.com/stuff, there is absolutely
> no way the crawler would know about http://abc.xyz.com/stuff, so I
> have to add the subdomain dynamically.
> I also don't have the option of actually adding
> 'http://abc.xyz.com/stuff' to my input file (a bit of an extra
> convolution I don't want to bore you with!!).
> 
> Thanks,
> Peyman
> 
> On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov
> <se...@gmail.com> wrote:
>> Hi!
>> 
>> I think you should use a urlfilter-regex pattern like
>> "http://\w+\.xyz\.com/stuff.*" instead of urlfilter-domain and set
>> db.ignore.external.links to false. This will work, but it is quite
>> slow if you have many regexes.
>> 
>> You may also try adding xyz.com to domain-suffixes.xml, but this may
>> cause some side effects. I have never tested this, just looked at the
>> DomainURLFilter source, so it's probably not a really good idea.
>> 
>> Sergey Volkov
>> 
>> On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote:
>>> 
>>> Hi Guys,
>>> 
>>> Let's say my input file is:
>>> http://www.xyz.com/stuff
>>> 
>>> and I have thousands of these URLs in my input. How do I configure
>>> Nutch to also crawl this subdomain for each input:
>>> http://abc.xyz.com/stuff
>>> 
>>> I don't want to just replace 'www' with 'abc'; I want to crawl both.
>>> 
>>> Thanks
>>> Peyman
>> 
>> 
>> 
> 

Re: crawling a subdomain

Posted by Peyman Mohajerian <mo...@gmail.com>.
Thanks Sergey,
I don't think I was clear on the issue. The subdomain I'm speaking of
won't be found by the crawler; I have to somehow add it. Starting from
my original input URL, http://www.xyz.com/stuff, there is absolutely
no way the crawler would know about http://abc.xyz.com/stuff, so I
have to add the subdomain dynamically.
I also don't have the option of actually adding
'http://abc.xyz.com/stuff' to my input file (a bit of an extra
convolution I don't want to bore you with!!).

Thanks,
Peyman

On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov
<se...@gmail.com> wrote:
> Hi!
>
> I think you should use a urlfilter-regex pattern like
> "http://\w+\.xyz\.com/stuff.*" instead of urlfilter-domain and set
> db.ignore.external.links to false. This will work, but it is quite
> slow if you have many regexes.
>
> You may also try adding xyz.com to domain-suffixes.xml, but this may
> cause some side effects. I have never tested this, just looked at the
> DomainURLFilter source, so it's probably not a really good idea.
>
> Sergey Volkov
>
> On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote:
>>
>> Hi Guys,
>>
>> Let's say my input file is:
>> http://www.xyz.com/stuff
>>
>> and I have thousands of these URLs in my input. How do I configure
>> Nutch to also crawl this subdomain for each input:
>> http://abc.xyz.com/stuff
>>
>> I don't want to just replace 'www' with 'abc'; I want to crawl both.
>>
>> Thanks
>> Peyman
>
>
>

Re: crawling a subdomain

Posted by Sergey A Volkov <se...@gmail.com>.
Hi!

I think you should use a urlfilter-regex pattern like
"http://\w+\.xyz\.com/stuff.*" instead of urlfilter-domain and set
db.ignore.external.links to false. This will work, but it is quite
slow if you have many regexes.
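
A minimal sketch of that setup; rule order matters in regex-urlfilter.txt
(the first matching pattern wins), so the catch-all reject goes last:

    # conf/regex-urlfilter.txt: accept one-label subdomains of xyz.com
    # under /stuff, reject everything else
    +^http://\w+\.xyz\.com/stuff
    -.

    <!-- conf/nutch-site.xml: keep following links that leave the
         current host, so abc.xyz.com is not dropped outright -->
    <property>
      <name>db.ignore.external.links</name>
      <value>false</value>
    </property>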

You may also try adding xyz.com to domain-suffixes.xml, but this may
cause some side effects. I have never tested this, just looked at the
DomainURLFilter source, so it's probably not a really good idea.

Sergey Volkov

On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote:
> Hi Guys,
>
> Let's say my input file is:
> http://www.xyz.com/stuff
>
> and I have thousands of these URLs in my input. How do I configure
> Nutch to also crawl this subdomain for each input:
> http://abc.xyz.com/stuff
>
> I don't want to just replace 'www' with 'abc'; I want to crawl both.
>
> Thanks
> Peyman