
Posted to user@nutch.apache.org by varunpandeyengg <va...@gmail.com> on 2012/03/03 05:36:00 UTC

Nutch with Letor

Hey Guys,

I am new to Nutch. I am part of an IR research team and need to set up a
crawl of Microsoft's LETOR dataset
(http://research.microsoft.com/en-us/projects/mslr/) with Nutch. After
googling for a while, I didn't find any tutorial or help. Could anyone guide
me on this?

I am using Nutch 1.4 on Ubuntu 11.10 & Eclipse 3.7.

So far I have been able to crawl the public web from my Nutch setup integrated
with Eclipse...

Thanks in advance.

-
Varun

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-with-Letor-tp3795402p3795402.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch 1.4 with Hadoop - how does Nutch know where Hadoop is running

Posted by shlomi java <sh...@gmail.com>.

The Hadoop jar ships with default XML configuration files inside.
In core-default.xml you'll find the fs.default.name property.
In mapred-default.xml you'll find the mapred.job.tracker property.

You can simply override them in your nutch-site.xml with your desired
values.
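
As a rough illustration (not taken from the Nutch or Hadoop docs), the Python
sketch below generates the kind of override block you could paste into
nutch-site.xml. The host names and ports are made-up placeholders, and
build_site_overrides is a hypothetical helper name; only the two property
names come from the message above.

```python
import xml.etree.ElementTree as ET


def build_site_overrides(properties):
    """Return a Hadoop-style <configuration> XML string for name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent only exists on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Placeholder addresses (assumptions): replace with your own cluster's values.
overrides = {
    "fs.default.name": "hdfs://namenode.example.com:9000",
    "mapred.job.tracker": "jobtracker.example.com:9001",
}
site_xml = build_site_overrides(overrides)
print(site_xml)
```

Each name/value pair ends up as a <property> element inside <configuration>,
which is the same layout the *-default.xml files inside the Hadoop jar use.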

SJ

On Tue, Mar 20, 2012 at 1:03 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> This is not a Nutch thing. A Nutch job, any job, is submitted to the
> Hadoop Jobtracker. It knows where the cluster is and what config is to be
> used. The bin/nutch script does little more than submit the job to the
> tracker with job-specific parameters.
>
>
>
> On Tue, 20 Mar 2012 10:51:41 +0000, Dean Pullen <de...@semantico.com>
> wrote:
>
>> Hi all,
>>
>> An odd question, but I can't work out how Nutch 1.4 actually knows
>> where Hadoop is running.
>>
>> Usually I copy Hadoop over the top of Nutch, but if we want to put
>> Hadoop somewhere else, and on a different port etc, where is the Nutch
>> config to alter these settings to point to the non-default Hadoop?
>>
>> Regards,
>>
>> Dean.
>>
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>

Re: Nutch 1.4 with Hadoop - how does Nutch know where Hadoop is running

Posted by Markus Jelsma <ma...@openindex.io>.

 It doesn't really. It'll look for hadoop on the path and Hadoop will 
 then take care of it. You can submit jobs on any properly installed 
 Hadoop node or external client if they share important parts of the 
 config such as jobtracker address.
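
A simplified sketch of that lookup, written in Python rather than the shell
that bin/nutch actually uses, so treat it as a paraphrase of the idea and not
the script's real code. build_submit_command is a hypothetical helper, and the
job file name, class and arguments are illustrative assumptions.

```python
import shutil
from typing import Callable, List, Optional


def build_submit_command(
    job_file: str,
    main_class: str,
    args: List[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Build a `hadoop jar` command line, failing if hadoop is not on PATH."""
    hadoop = which("hadoop")
    if hadoop is None:
        raise FileNotFoundError("hadoop executable not found on PATH")
    # From here on, Hadoop's own client config decides which jobtracker is used.
    return [hadoop, "jar", job_file, main_class, *args]


# Illustrative values only; fake_which stands in for a real PATH lookup.
fake_which = lambda name: "/opt/hadoop/bin/" + name
cmd = build_submit_command(
    "nutch-1.4.job",
    "org.apache.nutch.crawl.Injector",
    ["crawl/crawldb", "urls"],
    which=fake_which,
)
print(" ".join(cmd))
```

Nothing in the command names a host or port: the jobtracker address comes from
the configuration of whichever Hadoop client is found on the path.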

 On Tue, 20 Mar 2012 10:59:28 +0000, Dean Pullen 
 <de...@semantico.com> wrote:
> Thanks for your reply.
>
> I understand what you've said, but how does Nutch know where the
> Hadoop jobtracker is running?
>
> Regards,
>
> Dean.
>
> On 20/03/2012 11:03, Markus Jelsma wrote:
>> This is not a Nutch thing. A Nutch job, any job, is submitted to the 
>> Hadoop Jobtracker. It knows where the cluster is and what config is to 
>> be used. The bin/nutch script does little more than submit the job 
>> to the tracker with job-specific parameters.
>>
>>
>> On Tue, 20 Mar 2012 10:51:41 +0000, Dean Pullen 
>> <de...@semantico.com> wrote:
>>> Hi all,
>>>
>>> An odd question, but I can't work out how Nutch 1.4 actually knows
>>> where Hadoop is running.
>>>
>>> Usually I copy Hadoop over the top of Nutch, but if we want to put
>>> Hadoop somewhere else, and on a different port etc, where is the Nutch
>>> config to alter these settings to point to the non-default Hadoop?
>>>
>>> Regards,
>>>
>>> Dean.
>>

-- 
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350

Re: Nutch 1.4 with Hadoop - how does Nutch know where Hadoop is running

Posted by Dean Pullen <de...@semantico.com>.

Thanks for your reply.

I understand what you've said, but how does Nutch know where the Hadoop 
jobtracker is running?

Regards,

Dean.

On 20/03/2012 11:03, Markus Jelsma wrote:
> This is not a Nutch thing. A Nutch job, any job, is submitted to the 
> Hadoop Jobtracker. It knows where the cluster is and what config is to 
> be used. The bin/nutch script does little more than submit the job 
> to the tracker with job-specific parameters.
>
>
> On Tue, 20 Mar 2012 10:51:41 +0000, Dean Pullen 
> <de...@semantico.com> wrote:
>> Hi all,
>>
>> An odd question, but I can't work out how Nutch 1.4 actually knows
>> where Hadoop is running.
>>
>> Usually I copy Hadoop over the top of Nutch, but if we want to put
>> Hadoop somewhere else, and on a different port etc, where is the Nutch
>> config to alter these settings to point to the non-default Hadoop?
>>
>> Regards,
>>
>> Dean.
>

Re: Nutch 1.4 with Hadoop - how does Nutch know where Hadoop is running

Posted by Markus Jelsma <ma...@openindex.io>.

 This is not a Nutch thing. A Nutch job, any job, is submitted to the 
 Hadoop Jobtracker. It knows where the cluster is and what config is to 
 be used. The bin/nutch script does little more than submit the job to 
 the tracker with job-specific parameters.

 On Tue, 20 Mar 2012 10:51:41 +0000, Dean Pullen 
 <de...@semantico.com> wrote:
> Hi all,
>
> An odd question, but I can't work out how Nutch 1.4 actually knows
> where Hadoop is running.
>
> Usually I copy Hadoop over the top of Nutch, but if we want to put
> Hadoop somewhere else, and on a different port etc, where is the Nutch
> config to alter these settings to point to the non-default Hadoop?
>
> Regards,
>
> Dean.

-- 
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350

Nutch 1.4 with Hadoop - how does Nutch know where Hadoop is running

Posted by Dean Pullen <de...@semantico.com>.

Hi all,

An odd question, but I can't work out how Nutch 1.4 actually knows where 
Hadoop is running.

Usually I copy Hadoop over the top of Nutch, but if we want to put 
Hadoop somewhere else, and on a different port etc, where is the Nutch 
config to alter these settings to point to the non-default Hadoop?

Regards,

Dean.

Re: Nutch with Letor

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Yes, you can crawl the web as well as localhost. It's the same mechanism
(plugin) that you would use within Nutch: protocol-http for HTTP.

Follow the Nutch tutorial for crawling the web.

Lewis

On Fri, Mar 9, 2012 at 3:44 PM, varunpandeyengg
<va...@gmail.com> wrote:

> I really appreciate your help... Thanks a lot. I think you are right and
> this is a "NO GO"...
>
> I read somewhere that Nutch can crawl the WWW as well as localhost. I really
> wanted to see how to do that.
>
> Although my question has no answer, I am marking yours as correct...
> Thanks again.
>
>
> --
> Varun
>
>



-- 
*Lewis*

Re: Nutch with Letor

Posted by varunpandeyengg <va...@gmail.com>.

I really appreciate your help... Thanks a lot. I think you are right and this
is a "NO GO"...

I read somewhere that Nutch can crawl the WWW as well as localhost. I really
wanted to see how to do that.

Although my question has no answer, I am marking yours as correct...
Thanks again.


--
Varun



Re: Nutch with Letor

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Having read a bit about the purpose of these datasets, and the reason why
they are produced, I don't think this is something that is going to work
with Nutch.

1) the datasets are machine learning data
2) the data consists of query-URL pairs represented by IDs
3) further to this, the datasets consist of feature vectors accompanied by
relevance judgement labels
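
To make the three points above concrete, here is a minimal Python sketch that
parses one row, assuming the whitespace-separated layout
"<label> qid:<id> <feature>:<value> ..." with an optional trailing "#" comment.
The sample row is made up, and LetorRow and parse_letor_line are hypothetical
names, not part of any dataset tooling.

```python
from typing import Dict, NamedTuple


class LetorRow(NamedTuple):
    relevance: int             # relevance judgement label
    query_id: str              # query identifier (an ID, not the query text)
    features: Dict[int, float]  # feature index -> feature value


def parse_letor_line(line: str) -> LetorRow:
    # Drop an optional trailing "# ..." comment, then split on whitespace.
    body = line.split("#", 1)[0].split()
    relevance = int(body[0])
    key, query_id = body[1].split(":", 1)
    if key != "qid":
        raise ValueError(f"expected a qid field, got {body[1]!r}")
    features = {}
    for token in body[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return LetorRow(relevance, query_id, features)


sample = "2 qid:10 1:0.5 2:3 3:0 # made-up example row"
row = parse_letor_line(sample)
print(row.relevance, row.query_id, row.features)
# prints: 2 10 {1: 0.5, 2: 3.0, 3: 0.0}
```

Under that assumed layout a row holds only a label, an ID and numbers, so
there is no page or link in it for a crawler to fetch.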

The purpose of this is to use a ranking algorithm (learning to rank)
to infer relative certainties (rankings) from across the dataset corpus.

Nutch does not do this. We work on graph datasets (e.g. the web), which
is not machine learning, so you might wish to look elsewhere for your
answers... possibly Apache Mahout.

Alternatively get in touch with the people at Microsoft and pass your
queries there

mailto:letor@microsoft.com

It took me around an hour to download the dataset and in all honesty I was
really disappointed that there was no sure way of navigating the
dataset effectively... so I understand your frustration, having pondered
over this one for a week or so.

Lewis

On Fri, Mar 9, 2012 at 1:35 PM, varunpandeyengg
<va...@gmail.com> wrote:

> I think you are right about the wiki. Regarding the dataset, yes, I tried...
> but nothing worked out.
> Frankly speaking, I just don't know where to start. I am guessing at random
> things.
>
> --
> Varun
>

-- 
*Lewis*

Re: Nutch with Letor

Posted by varunpandeyengg <va...@gmail.com>.

lewis john mcgibbney wrote
> 
> Well I can categorically say that you haven't found anything on the Nutch
> wiki regarding this dataset simply because it doesn't exist.
> 
> I'm assuming that over the duration of the week you've actually tried using
> Nutch to crawl and score/rank entities from within the Dataset???
> -- 
> *Lewis*
> 

I think you are right about the wiki. Regarding the dataset, yes, I tried...
but nothing worked out.
Frankly speaking, I just don't know where to start. I am guessing at random
things.

--
Varun


Re: Nutch with Letor

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Well I can categorically say that you haven't found anything on the Nutch
wiki regarding this dataset simply because it doesn't exist.

I'm assuming that over the duration of the week you've actually tried using
Nutch to crawl and score/rank entities from within the Dataset???

On Fri, Mar 9, 2012 at 12:29 PM, varunpandeyengg
<va...@gmail.com> wrote:

> I tried searching for tutorials or wiki pages covering LETOR and Nutch
> together. I even tried to find some other standard dataset integration with
> Nutch, but didn't have any success. It has been a week now and this has
> started to bother me... Help!!!
>
> I am waiting for the results...
>
> --
> Varun
>

-- 
*Lewis*

Re: Nutch with Letor

Posted by varunpandeyengg <va...@gmail.com>.

lewis john mcgibbney wrote
> 
> Hi Varun,
> 
> Apologies for taking so long to get back to you.
> I'm trying this out today (later), and will post some results/findings when
> I get them.
> 
> Have you read any of the tutorials or publications from the research
> community regarding the LETOR datasets?
> 
> Thanks
> 
> -- 
> *Lewis*
> 

Hey Lewis,

I tried searching for tutorials or wiki pages covering LETOR and Nutch
together. I even tried to find some other standard dataset integration with
Nutch, but didn't have any success. It has been a week now and this has
started to bother me... Help!!!

I am waiting for the results... 

--
Varun


Re: Nutch with Letor

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Varun,

Apologies for taking so long to get back to you.
I'm trying this out today (later), and will post some results/findings when
I get them.

Have you read any of the tutorials or publications from the research
community regarding the LETOR datasets?

Thanks

On Sat, Mar 3, 2012 at 4:36 AM, varunpandeyengg
<va...@gmail.com> wrote:

> Hey Guys,
>
> I am new to Nutch. I am part of an IR research team and need to set up a
> crawl of Microsoft's LETOR dataset
> (http://research.microsoft.com/en-us/projects/mslr/) with Nutch. After
> googling for a while, I didn't find any tutorial or help. Could anyone guide
> me on this?
>
> I am using Nutch 1.4 on Ubuntu 11.10 & Eclipse 3.7.
>
> So far I have been able to crawl the public web from my Nutch setup integrated
> with Eclipse...
>
> Thanks in advance.
>
> -
> Varun
>



-- 
*Lewis*