Posted to dev@nutch.apache.org by an...@orbita1.ru on 2006/05/05 14:26:02 UTC

to count the number of pages from each domain

We tried to develop a solution that counts the number of pages from each
domain.

Our plan was as follows:

.map - input: k - UTF8 (URL of the page), v - CrawlDatum; output: k - UTF8
(domain of the page), v - UrlAndPage, a class implementing Writable that
holds the page URL and its CrawlDatum

.reduce - input: k - UTF8 (domain of the page), v - an iterator over a list
of UrlAndPage; output: k - UTF8 (URL of the page), v - CrawlDatum

.in the map function we parsed the domain from the URL, created a UrlAndPage
structure, and put the pair into the OutputCollector

.in reduce we counted how many elements the iterator's list contained, stored
that count in each CrawlDatum, and then emitted new (URL, CrawlDatum) pairs
to the OutputCollector
 

The following problem arose: as far as we can tell, the input and output
types of map and reduce must be the same, but in our case they differed, and
that caused an error like this:

060505 183200 task_0104_m_000000_3 java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage

060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:366)

060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)

060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:129)

060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:755)

060505 183200 task_0104_m_000000_3 Caused by: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage

060505 183200 task_0104_m_000000_3      at java.lang.Class.newInstance0(Class.java:335)

060505 183200 task_0104_m_000000_3      at java.lang.Class.newInstance(Class.java:303)

060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:364)
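For what it's worth, an InstantiationException from Class.newInstance() on a nested class such as PostUpdateFilter$UrlAndPage usually means the class is a non-static inner class, or lacks a public no-arg constructor; reflection-based Writable deserialization requires both a static nested (or top-level) class and a public no-arg constructor. A sketch of a declaration that satisfies this - the fields and the byte-array stand-in for the serialized CrawlDatum are illustrative, and the org.apache.hadoop.io.Writable interface clause is omitted so the snippet compiles standalone:

```java
import java.io.*;

public class PostUpdateFilter {
    // Declared static and public with a public no-arg constructor so that
    // Class.newInstance(), as called from JobConf, can instantiate it.
    // Real code would add "implements org.apache.hadoop.io.Writable".
    public static class UrlAndPage {
        public String url = "";
        public byte[] datum = new byte[0]; // placeholder for serialized CrawlDatum

        public UrlAndPage() {} // required for reflective instantiation

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeInt(datum.length);
            out.write(datum);
        }

        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            datum = new byte[in.readInt()];
            in.readFully(datum);
        }
    }
}
```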

 

We concluded that Hadoop does not allow different input/output types for map
and reduce, so we chose another scheme that runs two jobs: the first job has
the map function, the second has the reduce task, and each job has its own
classes for its input and output parameters. Together the new map and reduce
do the same as described above.

 

 

We'd like to ask your advice on which approach is best for tasks like these.
Is the second way good? Are there other ways to do this better?





Re: Merging segments

Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Fellows wrote:
> That's great.
>
> Well, my follow up to that then is: 
>
> Will the new tool allow any form of "diff'ing"
> segments? In practice this would allow you to run a
>   

No, it does only two things - merging and slicing. That's already one 
too many... ;)

> crawl on a series of sites one week. Then run another
> crawl on the same sites a week or so later. Diff the
> segments and allow users to search on changes within
> the search domain.
>   

Interesting concept, but I think it would be better implemented as a 
variant of de-duplication, rather than segment content manipulation.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Merging segments

Posted by Chris Fellows <cc...@sbcglobal.net>.
That's great.

Well, my follow up to that then is: 

Will the new tool allow any form of "diff'ing"
segments? In practice this would allow you to run a
crawl on a series of sites one week. Then run another
crawl on the same sites a week or so later. Diff the
segments and allow users to search on changes within
the search domain.

--- Andrzej Bialecki <ab...@getopt.org> wrote:

> Chris Fellows wrote:
> > Hello,
> >
> > So the last discussion on merging segments was back in
> > Jan. Has there been any progress in this direction?
> > What would be the benefit of being able to merge
> > segments? Would being able to merge segments open up
> > new functionality options or is merging just a
> > convenience? Also, what's the estimate for how involved
> > merge functionality development is?
> >   
> 
> Relief is on the way. Fine folks at houxou.com have sponsored the
> development of a brand-new SegmentMerger + slicer, and decided to donate
> it to the project - big thanks!
> 
> I'm running some final tests, and will commit it today/tomorrow.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 


Re: Merging segments

Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Fellows wrote:
> Hello,
>
> So the last discussion on merging segments was back in
> Jan. Has there been any progress in this direction?
> What would be the benefit of being able to merge
> segments? Would being able to merge segments open up
> new functionality options or is merging just a
> convenience? Also, what's the estimate for how involved
> merge functionality development is?
>   

Relief is on the way. Fine folks at houxou.com have sponsored the 
development of a brand-new SegmentMerger + slicer, and decided to donate 
it to the project - big thanks!

I'm running some final tests, and will commit it today/tomorrow.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Merging segments

Posted by Chris Fellows <cc...@sbcglobal.net>.
Hello,

So the last discussion on merging segments was back in
Jan. Has there been any progress in this direction?
What would be the benefit of being able to merge
segments? Would being able to merge segments open up
new functionality options or is merging just a
convenience? Also, what's the estimate for how involved
merge functionality development is?

Regards,

- Chris


Re: to count the number of pages from each domain

Posted by Andrzej Bialecki <ab...@getopt.org>.
anton@orbita1.ru wrote:
> We concluded that Hadoop does not allow different input/output types for
> map and reduce, so we chose another scheme that runs two jobs: the first
> job has the map function, the second has the reduce task, and each job has
> its own classes for its input and output parameters. Together the new map
> and reduce do the same as described above.
>   

You can use ObjectWritable to pass any type of Writable inside it. This 
way you can mix/match different input/output types easily. The overhead 
of this wrapping is probably still smaller than submitting another job 
just to change the types...

Please take a look at Indexer.java, where this trick is used.
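The idea behind ObjectWritable can be illustrated with a self-contained stand-in. This is not the real Hadoop class: Payload here mimics the Writable interface, and UrlPayload is a made-up example type; the point is only that the wrapper records the concrete type on the wire, so one job can declare a single value class while emitting mixed types.

```java
import java.io.*;

// Minimal stand-in for the ObjectWritable trick: serialize the concrete
// class name first, then the payload, so readers can reconstruct any type.
public class ObjectWrapper {
    public interface Payload {          // stand-in for Hadoop's Writable
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Example payload type (illustrative, not a Nutch class).
    public static class UrlPayload implements Payload {
        public String url = "";
        public UrlPayload() {}          // no-arg constructor for reflection
        public void write(DataOutput out) throws IOException { out.writeUTF(url); }
        public void readFields(DataInput in) throws IOException { url = in.readUTF(); }
    }

    private Payload instance;

    public ObjectWrapper() {}
    public ObjectWrapper(Payload instance) { this.instance = instance; }

    public Payload get() { return instance; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(instance.getClass().getName()); // record concrete type
        instance.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        try { // instantiate the recorded type, then let it read its own fields
            instance = (Payload) Class.forName(in.readUTF())
                                      .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IOException(e);
        }
        instance.readFields(in);
    }
}
```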

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com