Posted to user@nutch.apache.org by Adriana Farina <ad...@gmail.com> on 2013/05/10 11:26:20 UTC

Nutch 2.1 seed list

Hello,

I'm using Nutch 2.1 on top of Hadoop 1.0.4, with HBase 0.90.4 as the storage
system. I run Nutch in distributed mode.

I need to associate an id with each url in the Nutch seed list and to
store this information in HBase. I think that I have to create a new column
family in HBase and modify the Gora and HBase configuration files in the
Nutch conf folder.

I also think I need to modify the Nutch code, but I don't know which
classes I have to modify. I googled a bit but didn't find any
documentation; I've searched the code but wasn't able to solve my
problem.

Can anybody help me?

Thank you!


-- 
Adriana Farina

Re: Nutch 2.1 seed list

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Thanks Feng!


Renato M.

2013/5/14 feng lu <am...@gmail.com>:
> Yes, the id will be automatically stored in HBase, and the outlinks
> extracted from a seed url will not have any of this information. The
> information is stored as part of the metadata of the current url.

Re: Nutch 2.1 seed list

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Thanks Feng!!!


Renato M.

2013/5/15 Adriana Farina <ad...@gmail.com>:
> Thank you very much!

Re: Nutch 2.1 seed list

Posted by Adriana Farina <ad...@gmail.com>.
Thank you very much!

2013/5/14 feng lu <am...@gmail.com>

> Yes, the id will be automatically stored in HBase, and the outlinks
> extracted from a seed url will not have any of this information. The
> information is stored as part of the metadata of the current url.



-- 
Adriana Farina

Re: Nutch 2.1 seed list

Posted by feng lu <am...@gmail.com>.
Yes, the id will be automatically stored in HBase, and the outlinks
extracted from a seed url will not have any of this information. The
information is stored as part of the metadata of the current url.
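
To look at the stored value, you need the row key of the seed url. The sketch
below is only an illustration in Python, not Nutch code: it mimics the
reversed-url key layout that, as far as I know, Nutch 2.x uses for HBase rows
(reversed host, then protocol, then optional port, then path). The function
name reverse_url is made up for this example. I believe the default Gora
mapping keeps the metadata in the "mtdt" column family of the "webpage" table,
but please check gora-hbase-mapping.xml in your own conf folder.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a reversed-host row key such as 'com.example.www:http/'.

    Assumption: this mirrors the key layout used by Nutch 2.x; it is a
    stand-alone illustration, not the Nutch implementation.
    """
    parts = urlsplit(url)
    # Reverse the host labels: www.example.com -> com.example.www
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}"
    # The port is only part of the key when the url names one explicitly.
    if parts.port is not None:
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


print(reverse_url("http://www.example.com/"))  # com.example.www:http/
```

With that key you can fetch the row from the HBase shell and check that the id
appears only on the seed row and not on rows created from its outlinks.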




On Fri, May 10, 2013 at 10:59 PM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hi Feng,
>
> So this means I could put any type of information for the seed urls but
> what about the ones fetched in the next cycles? They won't have any of this
> information right?
> And where is this information stored? As part of the fetched or the parsed
> information?
> Thanks!
>
> Renato M.



-- 
Don't Grow Old, Grow Up... :-)

Re: Nutch 2.1 seed list

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Feng,

So this means I could put any type of information for the seed urls but
what about the ones fetched in the next cycles? They won't have any of this
information right?
And where is this information stored? As part of the fetched or the parsed
information?
Thanks!

Renato M.
On May 10, 2013 9:46 AM, "Adriana Farina" <ad...@gmail.com>
wrote:

> And will the ids be automatically stored in HBase?

Re: Nutch 2.1 seed list

Posted by Adriana Farina <ad...@gmail.com>.
And will the ids be automatically stored in HBase?


2013/5/10 feng lu <am...@gmail.com>

> Hi Adriana,
>
> You can add metadata to each seed url like this:
>
> http://www.example.com  id=123
> http://www.example.com  id=456
>
> Each CrawlDatum can hold many metadata entries; you can use them to store
> any information about a url.



-- 
Adriana Farina

Re: Nutch 2.1 seed list

Posted by feng lu <am...@gmail.com>.
Hi Adriana,

You can add metadata to each seed url like this:

http://www.example.com  id=123
http://www.example.com  id=456

Each CrawlDatum can hold many metadata entries; you can use them to store
any information about a url.
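
To make the seed format concrete, here is a rough Python sketch of how one
seed line breaks down into a url and its metadata. It is an illustration only,
not the Nutch injector itself. It assumes the url and each name=value pair are
separated by a tab character, which is what Nutch expects as far as I know
(the archive may show the tab as spaces). The function name parse_seed_line,
the example.org url and the "category" key are made up for the example.

```python
def parse_seed_line(line):
    """Split one seed-list line into (url, metadata dict).

    Assumption: fields are tab-separated and every field after the url
    has the form name=value. Fields without '=' are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    url = fields[0].strip()
    metadata = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep and key.strip():
            metadata[key.strip()] = value.strip()
    return url, metadata


# Hypothetical seed lines; "\t" is the tab between the url and each pair.
seeds = [
    "http://www.example.com/\tid=123",
    "http://www.example.org/\tid=456\tcategory=news",
]
for seed in seeds:
    print(parse_seed_line(seed))
```

Each name=value pair ends up as one entry in the metadata kept for that url,
so several pairs can be attached to the same seed.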








-- 
Don't Grow Old, Grow Up... :-)