Posted to dev@nutch.apache.org by Aamir Khan <sy...@gmail.com> on 2012/04/03 06:45:51 UTC

GSoC : Web page scraper plugin

Hi,

My name is Aamir Khan. I'm a 3rd year computer science student at IIT
Roorkee. I would like to participate in GSoC 2012.

The project of web scraping at
https://issues.apache.org/jira/browse/NUTCH-978 looks good to me. I
understood the basic concept of the project, but as I'm new to Nutch it will
take some time to understand it fully in the context of Nutch.

I'm looking forward to guidance from your side on how I should go about
submitting a proposal for GSoC.

Thanks in advance!





-- 
Aamir Khan | 3rd Year  | Computer Science & Engineering | IIT Roorkee

Re: GSoC : Web page scraper plugin

Posted by Aamir Khan <sy...@gmail.com>.
On Tue, Apr 3, 2012 at 4:45 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Aamir,
>
>
> On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan <sy...@gmail.com> wrote:
>
>>
>> Exactly, I will have the full summer to understand and get up to speed. But
>> since my knowledge is very limited, my proposal won't be too good.. :)
>>
>
> This doesn't need to be the case. In fact it is crucial that the
> submission is of a reasonable quality. The original issue was pretty well
> discussed iirc, and additionally there is also some code uploaded by the
> original author so you could have a look at that over the next few days
> before making a crack at the submission. I can say one thing for sure
> though, this issue might need to be branded more generically... just now
> Nutch would benefit more from a generically oriented plugin for scraping
> various parts of html. The original author had a use case driven approach
> to this issue which meant he had to extract very specific content from news
> sites... this may not suit you, and certainly isn't absolutely everyone's
> cup of tea within the community. It would be great if you could discuss
> both in your application and on the Jira thread how the issue could be
> opened up, subsequently enabling more Nutch users to benefit... as you are
> stepping up to apply here, how you wish to do this is entirely your own
> choice so I would take the positives from the flexibility you have here and
> focus on them within your submission. Does this sound reasonable?
>

Sounds good to me. Where can I chat with the Nutch developers? Not many
people are present on the IRC channel #nutch. BTW, I created a rough draft
with my personal bio and other necessary information and submitted it to
google-melange [1]. I will update the project schedule soon, preferably
after having some discussions.

[1] http://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2012/syst3mw0rm/9001

>
> I look forward to seeing any progress you have and will seriously consider
> stepping up to be a potential mentor as it was me that added the issue to
> the GSoC list of projects.
>

That would be great!

>
> Thank you
>
> Lewis
>
>
>


-- 
Aamir Khan | 3rd Year  | Computer Science & Engineering | IIT Roorkee

Re: GSoC : Web page scraper plugin

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Aamir,

On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan <sy...@gmail.com> wrote:

>
> Exactly, I will have the full summer to understand and get up to speed. But
> since my knowledge is very limited, my proposal won't be too good.. :)
>

This doesn't need to be the case. In fact it is crucial that the
submission is of a reasonable quality. The original issue was pretty well
discussed iirc, and additionally there is also some code uploaded by the
original author so you could have a look at that over the next few days
before making a crack at the submission. I can say one thing for sure
though, this issue might need to be branded more generically... just now
Nutch would benefit more from a generically oriented plugin for scraping
various parts of html. The original author had a use case driven approach
to this issue which meant he had to extract very specific content from news
sites... this may not suit you, and certainly isn't absolutely everyone's
cup of tea within the community. It would be great if you could discuss
both in your application and on the Jira thread how the issue could be
opened up, subsequently enabling more Nutch users to benefit... as you are
stepping up to apply here, how you wish to do this is entirely your own
choice so I would take the positives from the flexibility you have here and
focus on them within your submission. Does this sound reasonable?

I look forward to seeing any progress you have and will seriously consider
stepping up to be a potential mentor as it was me that added the issue to
the GSoC list of projects.

Thank you

Lewis

Re: GSoC : Web page scraper plugin

Posted by Aamir Khan <sy...@gmail.com>.
On Tue, Apr 3, 2012 at 4:31 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Aamir,
>
> Please excuse me not getting back to you off-list, the message is in my
> drafts and I got distracted yesterday.
>

No problem.

>
> At this stage if you intend on applying for the issue then I would advise
> you to get registered with GSoC, and begin writing up a publicly viewable
> draft submission. You have until the 6th to do so, so plenty of time.
>
> On Tue, Apr 3, 2012 at 5:45 AM, Aamir Khan <sy...@gmail.com> wrote:
>
>>
>> The project of web scraping at
>> https://issues.apache.org/jira/browse/NUTCH-978 looks good to me. I
>> understood the basic concept of the project but as I'm new to Nutch it will
>> take some time to understand it fully in the context of Nutch.
>>
>
> Well you have the summer to get up to speed with Nutch right? So I
> wouldn't necessarily worry too much about this just now. Just get your
> submission ready and we will take it from there.
>

Exactly, I will have the full summer to understand and get up to speed. But
since my knowledge is very limited, my proposal won't be too good.. :)

>
>> I'm looking forward to guidance from your side on how I should go about
>> submitting a proposal for GSoC.
>>
>
> If you feel you need help with any aspect of the issue or the submission
> then please get on to user@ and we will try to help out as much as we can over
> there. In the meantime please see here [0] for guidance on your application
> submission. There is plenty of documentation and guidance over there.
>

Sure.

>
> Thanks and again apologies for not getting back to you yesterday.
>

No problem.. :)

>
> Lewis
>
> [0] http://community.apache.org/gsoc.html
>
>
>>
>> Thanks in advance!
>>
>>
>>
>>
>>
>> --
>> Aamir Khan | 3rd Year  | Computer Science & Engineering | IIT Roorkee
>>
>>
>>
>>
>
>
> --
> *Lewis*
>
>


-- 
Aamir Khan | 3rd Year  | Computer Science & Engineering | IIT Roorkee

Re: GSoC : Web page scraper plugin

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Aamir,

Please excuse me not getting back to you off-list, the message is in my
drafts and I got distracted yesterday.

At this stage if you intend on applying for the issue then I would advise
you to get registered with GSoC, and begin writing up a publicly viewable
draft submission. You have until the 6th to do so, so plenty of time.

On Tue, Apr 3, 2012 at 5:45 AM, Aamir Khan <sy...@gmail.com> wrote:

>
> The project of web scraping at
> https://issues.apache.org/jira/browse/NUTCH-978 looks good to me. I
> understood the basic concept of the project but as I'm new to Nutch it will
> take some time to understand it fully in the context of Nutch.
>

Well you have the summer to get up to speed with Nutch right? So I wouldn't
necessarily worry too much about this just now. Just get your submission
ready and we will take it from there.

>
> I'm looking forward to guidance from your side on how I should go about
> submitting a proposal for GSoC.
>

If you feel you need help with any aspect of the issue or the submission
then please get on to user@ and we will try to help out as much as we can over there.
In the meantime please see here [0] for guidance on your application
submission. There is plenty of documentation and guidance over there.

Thanks and again apologies for not getting back to you yesterday.

Lewis

[0] http://community.apache.org/gsoc.html


>
> Thanks in advance!
>
>
>
>
>
> --
> Aamir Khan | 3rd Year  | Computer Science & Engineering | IIT Roorkee
>
>
>
>


-- 
*Lewis*
