Posted to dev@nutch.apache.org by Cihad Guzel <cg...@gmail.com> on 2015/07/04 14:56:53 UTC

GSOC2015 - Sitemap crawler roadmap problems

Hi Lewis,

Talat and I talked about the architecture for sitemap support. We think
the problem can be solved within the Nutch life cycle; we don't want to
build a separate life cycle for sitemap crawling.

However, I have run into the following problems:

If the sitemap file is very large, it cannot be fetched and parsed; it
times out. I worked around this temporarily, for parsing by raising the
timeout value in nutch-site.xml, and for fetching by testing with a
small file. That is not a good solution.
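
For reference, the temporary workaround looks roughly like this in
conf/nutch-site.xml (http.timeout and http.content.limit are standard
Nutch properties; the values below are only illustrative):

    <!-- Illustrative values only: raising these masks the problem for
         moderately large sitemaps but does not scale to arbitrary sizes. -->
    <property>
      <name>http.timeout</name>
      <value>30000</value><!-- milliseconds; the default is 10000 -->
    </property>
    <property>
      <name>http.content.limit</name>
      <value>10485760</value><!-- bytes; the small default (65536) truncates large sitemaps -->
    </property>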

Moreover, as you know, sitemap files have some special tags such as
"loc", "lastmod", "changefreq" and "priority". These are parsed by my
parse plugin, and I want to record them to the crawldb, but the Parse
object doesn't support metadata or similar fields. It only has an
outlink array, which isn't enough for recording metadata.

I want to record each URL in the sitemap file together with its own metadata.
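
To make this concrete, here is a rough sketch of the kind of structure I
need per outlink (a hypothetical class; nothing like it exists in the
current Parse API, which only holds plain outlinks):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical: an outlink that carries its own sitemap metadata,
    // instead of the bare URLs the Parse outlink array holds today.
    public class SitemapOutlink {
      private final String url;  // from <loc>
      private final Map<String, String> metadata = new HashMap<String, String>();

      public SitemapOutlink(String url) {
        this.url = url;
      }

      public void putMeta(String key, String value) {
        metadata.put(key, value);  // e.g. "lastmod", "changefreq", "priority"
      }

      public String getUrl() { return url; }
      public Map<String, String> getMetadata() { return metadata; }
    }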

I reviewed all the patches and comments on NUTCH-1465; it contains
solutions to some of the same problems, but there a new, separate job
was created for sitemap crawling.

Could you point me in the right direction?

Thanks.

Re: GSOC2015 - Sitemap crawler roadmap problems

Posted by Cihad Guzel <cg...@gmail.com>.
Hi

My work is progressing. My code is integrated into the Nutch life cycle:
sitemap files can now be injected and parsed. As you know, sitemap files
contain tags such as lastmod, priority and changefreq. First I put the
tag values into metadata; then I update the last-modified and
fetch-interval fields of the WebPage according to those tags. But I
haven't used the priority tag yet. I want to calculate a new score from
the priority for the URLs that come from a sitemap. However, while
sitemap URLs carry a priority value, ordinary web page URLs don't, which
creates an inconsistency. How do you think this should be implemented?
A rough sketch of what I have in mind follows.
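
Here is the mapping I am considering (a hypothetical helper, not part of
the attached patch; the changefreq values and the 0.5 default priority
come from the sitemaps.org protocol, while the scaling formula itself is
only a proposal):

    // Sketch only: maps sitemap tags onto fetch scheduling and scoring.
    public class SitemapTagMapper {

      /** Convert a <changefreq> value to a fetch interval in seconds. */
      public static int fetchIntervalSeconds(String changefreq) {
        String freq = (changefreq == null) ? "" : changefreq.toLowerCase();
        if (freq.equals("always"))  return 60;               // re-fetch very often
        if (freq.equals("hourly"))  return 60 * 60;
        if (freq.equals("daily"))   return 24 * 60 * 60;
        if (freq.equals("weekly"))  return 7 * 24 * 60 * 60;
        if (freq.equals("monthly")) return 30 * 24 * 60 * 60;
        if (freq.equals("never"))   return 365 * 24 * 60 * 60;
        return 30 * 24 * 60 * 60;   // no/unknown value: use the default interval
      }

      /** Scale an existing score by <priority>, treating 0.5 as neutral. */
      public static float adjustScore(float currentScore, Float priority) {
        float p = (priority == null) ? 0.5f : priority.floatValue();
        return currentScore * (p / 0.5f);   // >0.5 boosts, <0.5 demotes
      }
    }

With adjustScore(), a URL that never appeared in any sitemap would be
scored exactly as if it had the protocol default priority of 0.5, which
should remove the inconsistency between the two kinds of URLs.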

I have attached the latest code as a patch to this email.


Re: GSOC2015 - Sitemap crawler roadmap problems

Posted by Cihad Guzel <cg...@gmail.com>.
Hi Lewis.

Thanks for your suggestions. I will think about them.

Re: GSOC2015 - Sitemap crawler roadmap problems

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Cihad,
I'll take a look tonight.
My understanding is that this would be implemented as part of core and not
as a plugin. Within a plugin we can, at times, have access to less verbose
data structures. That is of course not always the case, but generally
speaking we see more issues with appropriate access to the correct data
structures, depending on which interfaces we extend. We then have the
issue of dependency management.
I'll have a look through the various links you have sent and then write
back here in due course.
Apologies about the delay.
Thanks

-- 
*Lewis*

Re: GSOC2015 - Sitemap crawler roadmap problems

Posted by Cihad Guzel <cg...@gmail.com>.
Hi,

I have found a patch for my metadata problem [1]. However, the problem
isn't solved for 2.x [2], so I guess I need to solve it myself.

[1] https://issues.apache.org/jira/browse/NUTCH-1622
[2] https://issues.apache.org/jira/browse/NUTCH-1816
