Posted to user@nutch.apache.org by Anand Bhagwat <ab...@gmail.com> on 2013/03/11 11:53:16 UTC

How to identify seed URL for a given record from Webpage

Hi,
Is there any way to identify the seed URL for a record in the WebPage table?
What I am trying to find out is the origin of a given record. I know
there are inlinks and outlinks, but is there an alternative way?

-Anand.

How to identify seed URL for a given record from Webpage

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Anand,
A couple of things.
1. Good to hear you solved your problem. I will be working on the Jira
issue regardless.
2. If you are interested in contributing your code to the Nutch community,
it would be welcome. You can open a Jira issue and upload it, or
alternatively you can open a wiki page and embed your code there. It's up to
you.
Thanks for the feedback anyway.
Lewis

On Wednesday, March 13, 2013, Anand Bhagwat <ab...@gmail.com> wrote:
> Hi Lewis,
> I looked at the JIRA issue you mentioned and it's a little different from
> what I was looking for. What I need is a way to associate the seed URL with
> all the records which are derived from that URL. So I added seedUrl and its
> value to the metadata column during the inject phase if it is null, and
> later on, in the updatedb phase, I propagated it to subsequent outlinks /
> new records. So now all the records and any future child records will have
> the same seedUrl as one of their metadata entries.
>
> I was looking for a plugin which I could use, but in this case I did not
> find a suitable one.
>
> Regards,
> Anand.

-- 
*Lewis*

Re: How to identify seed URL for a given record from Webpage

Posted by Anand Bhagwat <ab...@gmail.com>.

Hi Lewis,
I looked at the JIRA issue you mentioned and it's a little different from
what I was looking for. What I need is a way to associate the seed URL with
all the records which are derived from that URL. So I added seedUrl and its
value to the metadata column during the inject phase if it is null, and
later on, in the updatedb phase, I propagated it to subsequent outlinks /
new records. So now all the records and any future child records will have
the same seedUrl as one of their metadata entries.
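
A minimal sketch of the approach described above, written as plain Java so
that it runs on its own. It does not use the Nutch or Gora APIs: PageRecord,
the in-memory table, and the inject/update methods are hypothetical
stand-ins for the WebPage row, the backing store, and the inject and
updatedb phases. Only the "seedUrl" metadata key comes from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one crawl record; this is NOT Nutch's WebPage class.
class PageRecord {
    final String url;
    final Map<String, String> metadata = new HashMap<>();
    final List<String> outlinks = new ArrayList<>();

    PageRecord(String url) {
        this.url = url;
    }
}

class SeedUrlPropagation {
    // Metadata key named after the one mentioned in the thread.
    static final String SEED_KEY = "seedUrl";

    // Inject step: a seed record is its own origin, unless a value is already set.
    static PageRecord inject(Map<String, PageRecord> table, String seedUrl) {
        PageRecord page = table.computeIfAbsent(seedUrl, PageRecord::new);
        page.metadata.putIfAbsent(SEED_KEY, seedUrl);
        return page;
    }

    // Update step: copy the parent's seed to each outlink record that has none yet.
    static void update(Map<String, PageRecord> table, PageRecord parent) {
        String seed = parent.metadata.get(SEED_KEY);
        if (seed == null) {
            return; // nothing to propagate
        }
        for (String link : parent.outlinks) {
            PageRecord child = table.computeIfAbsent(link, PageRecord::new);
            child.metadata.putIfAbsent(SEED_KEY, seed);
        }
    }

    public static void main(String[] args) {
        Map<String, PageRecord> table = new HashMap<>();
        PageRecord seed = inject(table, "http://example.org/");
        seed.outlinks.add("http://example.org/a");
        update(table, seed);

        // A second round: the child record passes the same seed on to its own outlink.
        PageRecord child = table.get("http://example.org/a");
        child.outlinks.add("http://example.org/a/b");
        update(table, child);

        System.out.println(table.get("http://example.org/a/b").metadata.get(SEED_KEY));
    }
}
```

Because the sketch only sets the key when it is absent, a page reachable
from several seeds keeps the seed that reached it first. Whether that is the
right rule depends on the crawl; it is a choice made for this illustration.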

I was looking for a plugin which I could use, but in this case I did not
find a suitable one.

Regards,
Anand.

On 13 March 2013 22:40, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Anand,
> The first step is to look at the issue over on NUTCH-1533.
> If you feel like addressing anything there, then please do.
> This particular issue has nothing to do with Gora or Hadoop, so you will
> not need to look at any of the code there.
> I will also be working on that issue when I get some time.
> Thanks
> Lewis

Re: How to identify seed URL for a given record from Webpage

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Anand,
The first step is to look at the issue over on NUTCH-1533.
If you feel like addressing anything there, then please do.
This particular issue has nothing to do with Gora or Hadoop, so you will
not need to look at any of the code there.
I will also be working on that issue when I get some time.
Thanks
Lewis

On Mon, Mar 11, 2013 at 9:44 PM, Anand Bhagwat <ab...@gmail.com> wrote:

> I would love to work on it, but I am new to all the frameworks being used
> here: Apache Hadoop, Apache Gora and Nutch itself. I am going through the
> source code of Nutch 2. But as you said, with a little bit of help I think
> I would be able to contribute.
>
> -Anand.
>
>

Re: How to identify seed URL for a given record from Webpage

Posted by Anand Bhagwat <ab...@gmail.com>.

I would love to work on it, but I am new to all the frameworks being used
here: Apache Hadoop, Apache Gora and Nutch itself. I am going through the
source code of Nutch 2. But as you said, with a little bit of help I think
I would be able to contribute.

-Anand.

On 12 March 2013 09:14, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Are you interested in working on implementing NUTCH-1533?
> I would be happy to work on this as well.
> Lewis

Re: How to identify seed URL for a given record from Webpage

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Are you interested in working on implementing NUTCH-1533?
I would be happy to work on this as well.
Lewis

On Mon, Mar 11, 2013 at 7:39 PM, Anand Bhagwat <ab...@gmail.com> wrote:

> Thanks for the information. I guess using the batch ID is a good idea.

-- 
*Lewis*

Re: How to identify seed URL for a given record from Webpage

Posted by Anand Bhagwat <ab...@gmail.com>.

Thanks for the information. I guess using the batch ID is a good idea.

On 11 March 2013 21:50, Lewis John Mcgibbney <le...@gmail.com> wrote:

> There are numerous ways to do this.
> * You can assign some metadata to each URL when injecting and
> bootstrapping the system.
> * You could embed some meta tags or another distinguishing feature in the
> URLs and use the facilities (existing or available in Jira) to identify
> these pages.
> * You may also be able to attach the original batchId to all original seed
> URLs. [0]
> I imagine that all of the above require you to adapt the source code at
> some stage... this is why we don't release 2.x binaries.
> I recently opened an issue which could easily be adapted for the third
> point above. [0]
> Kiran's contribution porting the metadata plugins to Nutch 2.x would
> probably enable you to address point 2, I would imagine.
>
> [0] https://issues.apache.org/jira/browse/NUTCH-1533
>
>
> --
> *Lewis*
>

Re: How to identify seed URL for a given record from Webpage

Posted by Lewis John Mcgibbney <le...@gmail.com>.

There are numerous ways to do this.
* You can assign some metadata to each URL when injecting and
bootstrapping the system.
* You could embed some meta tags or another distinguishing feature in the
URLs and use the facilities (existing or available in Jira) to identify
these pages.
* You may also be able to attach the original batchId to all original seed
URLs. [0]
I imagine that all of the above require you to adapt the source code at
some stage... this is why we don't release 2.x binaries.
I recently opened an issue which could easily be adapted for the third
point above. [0]
Kiran's contribution porting the metadata plugins to Nutch 2.x would
probably enable you to address point 2, I would imagine.

[0] https://issues.apache.org/jira/browse/NUTCH-1533
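
As an illustration of the first option above, here is a rough sketch of how
per-URL metadata might be read from a seed list at inject time. It assumes
a seed line of the form URL, then tab-separated key=value pairs; that layout
is an assumption here and should be checked against the injector
documentation for your Nutch version. SeedLine and its parse method are
hypothetical names for this example and are not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for one parsed seed-list line; not a Nutch class.
class SeedLine {
    final String url;
    final Map<String, String> metadata;

    SeedLine(String url, Map<String, String> metadata) {
        this.url = url;
        this.metadata = metadata;
    }

    // Parses "<url>\t<key>=<value>\t<key>=<value>..." (assumed layout).
    static SeedLine parse(String line) {
        String[] fields = line.trim().split("\t");
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            // Split on the first '=' only, so values may themselves contain '='.
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // skip fields without a key
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            meta.put(key, value);
        }
        return new SeedLine(fields[0].trim(), meta);
    }

    public static void main(String[] args) {
        SeedLine seed = parse("http://example.org/\tseedUrl=http://example.org/\tbatch=first");
        System.out.println(seed.url + " " + seed.metadata);
    }
}
```

Metadata attached this way only covers the injected seed records
themselves; carrying it over to pages discovered later still needs a
propagation step during the update phase.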

On Mon, Mar 11, 2013 at 3:53 AM, Anand Bhagwat <ab...@gmail.com> wrote:

> Hi,
> Is there any way to identify the seed URL for a record in the WebPage table?
> What I am trying to find out is the origin of a given record. I know
> there are inlinks and outlinks, but is there an alternative way?
>
> -Anand.
>

-- 
*Lewis*