Posted to user@nutch.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2014/12/01 02:57:23 UTC

Re: Processing Pages in Pairs

Perhaps you could use something similar to the urlmeta plugin: pass the content extracted from Page A along to Page B, then prevent Page A from being indexed so that only Page B (which by then will hold the information from both pages) reaches your storage. The advantage of this approach is that you end up with one document per "entity" in Solr, which is what you want. The downside is that you need to transfer all the extracted metadata from Page A to Page B, which inverts the logical order: instead of aggregating Page B's content into Page A, it happens the other way around. I think the end result will be the same, although you will need to write a little more code.
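
To make the merge step concrete, here is a minimal sketch in plain Java. It is not Nutch code: documents are modelled as simple maps, and the class name, the method name and the "pageA_" field prefix are all hypothetical. It only shows the idea of folding the metadata carried over from Page A into Page B's fields without overwriting Page B's own values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not a Nutch plugin: documents are plain maps.
class PairMergeSketch {

    // Assumed prefix that keeps Page A's fields from clashing with Page B's.
    static final String PREFIX = "pageA_";

    // Returns a new map holding Page B's fields plus the prefixed fields
    // that were carried over from Page A. Neither input map is modified.
    static Map<String, String> mergeIntoPageB(Map<String, String> pageBFields,
                                              Map<String, String> carriedFromPageA) {
        Map<String, String> merged = new LinkedHashMap<>(pageBFields);
        for (Map.Entry<String, String> e : carriedFromPageA.entrySet()) {
            merged.put(PREFIX + e.getKey(), e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Metadata extracted while parsing Page A (example values only).
        Map<String, String> carried = new LinkedHashMap<>();
        carried.put("title", "Summary for entity 42");

        // Fields extracted from Page B itself (example values only).
        Map<String, String> pageB = new LinkedHashMap<>();
        pageB.put("url", "http://example.com/detail/42");
        pageB.put("title", "Detail for entity 42");

        System.out.println(mergeIntoPageB(pageB, carried));
    }
}
```

In a real setup the carried metadata would have to travel with the outlink from Page A to Page B, which is the part urlmeta can serve as a model for.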

To prevent Page A from being indexed, you could write your own indexing plugin that implements your logic and returns null when a document should not be indexed. Keep in mind that you still want Page A to be crawled and parsed, just not indexed. Take a look at https://github.com/jorgelbg/mimetype-filter for an example. That plugin works on the MIME type property, which is not what you want, but it illustrates the process.
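
The "return null to skip indexing" idea looks roughly like the sketch below. It is plain Java rather than a real Nutch plugin: the document is a simple map, and the rule used to recognise Page A (a URL containing "/summary/") is purely an assumption for illustration. A real plugin would implement Nutch's IndexingFilter interface instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in, not a Nutch plugin: documents are plain maps.
class PairIndexingSketch {

    // Assumed rule for recognising Page A; replace with your own logic.
    static final Predicate<String> IS_PAGE_A = url -> url.contains("/summary/");

    // Returns null when the document must not be indexed (Page A),
    // otherwise returns the document unchanged (Page B and everything else).
    static Map<String, String> filter(Map<String, String> doc, Predicate<String> isPageA) {
        String url = doc.get("url");
        if (url != null && isPageA.test(url)) {
            return null; // crawled and parsed already, but dropped at indexing time
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> pageA = new HashMap<>();
        pageA.put("url", "http://example.com/summary/42");

        Map<String, String> pageB = new HashMap<>();
        pageB.put("url", "http://example.com/detail/42");

        System.out.println("Page A indexed: " + (filter(pageA, IS_PAGE_A) != null));
        System.out.println("Page B indexed: " + (filter(pageB, IS_PAGE_A) != null));
    }
}
```

The filter only decides what reaches the index, so Page A still goes through fetching and parsing as usual.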

Hope it helps,

----- Original Message -----
From: "Iain Lopata" <il...@hotmail.com>
To: user@nutch.apache.org
Sent: Saturday, November 29, 2014 5:59:02 PM
Subject: RE: Processing Pages in Pairs

Thanks Marcus, your pointers are very helpful.

I have looked at BlockJoins.  Since there is a 1-to-1 relationship between the pairs of pages I need to process, I think BlockJoins would add unnecessary complexity to the queries. A custom update processor appears to me to be the better option.

I have found a couple of useful examples that may help others tackling similar problems.

First, I am going to try using the links-extractor indexing plugin found at https://github.com/jorgelbg/links-extractor to ensure that I have a reference to "Page A" at the time I index "Page B".

Second, I am going to start with the solr-field-update UpdateRequestProcessor found at https://github.com/guardian/solr-field-update as a template, but will modify the lookup approach to use the inlink from the link extractor.

I will still need to build the custom parser for vCard, unless anyone has one they can share.  I plan to do this based on ez-vcard found at https://code.google.com/p/ez-vcard/wiki/ReadingVCards#3_Differences_between_Ezvcard_and_reader_classes

Plenty to do, but I think you have me headed in the right direction - and certainly seems better than hacking the map/reduce processing in the Nutch indexer.

Thanks again


-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Wednesday, November 26, 2014 1:39 PM
To: user@nutch.apache.org
Subject: RE: Processing Pages in Pairs

Using Solr BlockJoins would probably be the easiest these days unless you really need to process them in Nutch. If you still want to process them simultaneously you can write a custom Solr UpdateRequestProcessor plugin and build the logic there.
 
-----Original message-----
> From:Lewis John Mcgibbney <le...@gmail.com>
> Sent: Wednesday 26th November 2014 0:10
> To: user@nutch.apache.org
> Subject: Re: Processing Pages in Pairs
> 
> Hi Iain,
> 
> On Tue, Nov 25, 2014 at 2:44 PM, <us...@nutch.apache.org> wrote:
> 
> >
> >
> > What would you recommend in this situation?  Are there other options 
> > that I am missing?
> 
> 
> I think that our good friend Markus has previously provided some 
> insight into the technical implementation of a task which may be 
> synonymous with what you are trying to achieve.
> http://www.mail-archive.com/user%40nutch.apache.org/msg04695.html
> Sounds pretty hands on to me, it would be difficult to keep your 
> version of Nutch up-to-date with trunk if you were doing that.
> hth
> Lewis
> 

