Posted to user@nutch.apache.org by SUJIT PAL <su...@comcast.net> on 2012/02/04 19:59:34 UTC

One WebPage to many NutchDocuments

Hi,

I have a requirement to break up incoming pages into sections so sections can be searched independently, sort of like a "search inside this document" functionality.

For this, I have a custom parser plugin that parses the incoming page (XML) into the top level document and multiple sections. The sections are parsed and put into the metadata as a structured JSON string (list of maps).
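
(Just to make the setup concrete, here is a minimal sketch of the kind of helper such a parser plugin might use to build that JSON string. The "u_sections" key matches the code further down; the Section fields and the hand-rolled serialization are only assumptions for illustration, not actual Nutch or application code.)

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper used inside the custom parser plugin: collects the
// parsed sections as a list of maps and serializes them to the JSON string
// stored under a metadata key such as "u_sections".
public class SectionMetadataBuilder {

  private final List<Map<String, String>> sections =
      new ArrayList<Map<String, String>>();

  public void addSection(String id, String title, String text) {
    Map<String, String> section = new LinkedHashMap<String, String>();
    section.put("id", id);
    section.put("title", title);
    section.put("text", text);
    sections.add(section);
  }

  // Naive serialization to keep the sketch dependency-free; a real plugin
  // would use a JSON library and escape values properly.
  public String toJson() {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < sections.size(); i++) {
      if (i > 0) sb.append(",");
      sb.append("{");
      int j = 0;
      for (Map.Entry<String, String> e : sections.get(i).entrySet()) {
        if (j++ > 0) sb.append(",");
        sb.append("\"").append(e.getKey()).append("\":\"")
          .append(e.getValue().replace("\"", "\\\"")).append("\"");
      }
      sb.append("}");
    }
    return sb.append("]").toString();
  }
}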

I've been looking at IndexingFilters, but the filter method is 1-to-1, i.e., you pass in the WebPage and the NutchDocument built so far, and get back a NutchDocument. What I would like to do is pass in a WebPage and get back a list of NutchDocuments which can be passed to the SolrWriter. So now I am looking at doing something like this in IndexerReducer.java, in the reduce method:

// last line in reduce
context.write(key, doc);
// now handle sections
if (conf.getBoolean("mycompany.indexing.page.explode", false)) {
  if (page.getMetaData().get("u_sections") != null) {
    NutchDocument sectionDoc = new NutchDocument();
    // parse the JSON and populate sectionDoc with a combination of
    // the JSON values and the parent values
    context.write(sectionKey, sectionDoc);
  }
}

Problem is, this means changing the Nutch source code (rather than extending it). So I was thinking of having an additional hook here which uses an array of IndexingPageExploders (similar to IndexingFilters, but which return a List<NutchDocument> instead of a single NutchDocument). Then the code becomes slightly more generic, like so:

context.write(key, doc);
if (conf.getBoolean("mycompany.indexing.page.explode", false)) {
  for (IndexingPageExploder exploder : exploders) {
    List<NutchDocument> explodedDocs = exploder.explode(doc, page);
    for (NutchDocument explodedDoc : explodedDocs) {
      // somevalue would be a per-section id, so each exploded doc gets a unique key
      String newKey = key + "-" + somevalue;
      context.write(newKey, explodedDoc);
    }
  }
}

and the actual code to do the application specific stuff could live in the PageExploder implementation.
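
For concreteness, this is roughly the extension point I have in mind (the interface name is mine from above; the exact signature and how it would be registered as a plugin extension point are open, so treat this as a sketch rather than existing Nutch code):

import java.util.List;

import org.apache.hadoop.conf.Configurable;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;

// Proposed (not existing) extension point, analogous to IndexingFilter, that
// turns one parent document into zero or more section-level documents.
public interface IndexingPageExploder extends Configurable {

  /**
   * @param parentDoc the NutchDocument built by the normal indexing filters
   * @param page      the backing WebPage, whose metadata holds the JSON sections
   * @return extra documents to index, one per section (possibly empty)
   */
  List<NutchDocument> explode(NutchDocument parentDoc, WebPage page);
}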

Is this a reasonable approach given my problem description? I don't want to (a) break the pages up outside Nutch and feed them in one at a time, since I don't want my content store (Nutch's Cassandra backend) to contain separate sections, or (b) build another stage to do the explosion, since that would require another pass over the input.

Thanks for any suggestions/pointers you can provide,

Sujit



Re: One WebPage to many NutchDocuments

Posted by SUJIT PAL <su...@comcast.net>.
Hi Markus,

Actually my plan is to give the sections "synthetic" URLs (based on the parent URL and the section id) which don't actually resolve to anything. The sections only need to be searchable; they don't need to point anywhere. The content part of the application would use the section ID to XPath into the document and display the appropriate section.
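
Roughly like this (the fragment format is just an example of what such a key could look like, not necessarily what the application uses):

// Hypothetical: build a non-resolving, per-section key/URL from the parent
// URL and the section id, e.g. "http://host/docs/1234.xml#section-3".
static String sectionKey(String parentUrl, String sectionId) {
  return parentUrl + "#section-" + sectionId;
}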

I didn't quite understand your comment about the dynamic fields approach. My understanding is that dynamic fields are just a naming pattern, so anything with a particular name pattern would be written out to Solr as a particular field type. So in this case, assuming a document had 10 sections, I would be writing out 1 Solr record, which would contain 10 dynamic fields, say section_0 through section_9. Now, assuming a query (within record #1) matched section_1 and section_2, I want to be able to show them as separate results in search (ideally without doing anything special on the front end). I don't think this can be done with this approach, at least from what I understand of dynamic fields. I'm probably missing something; I would appreciate it if you could provide some more pointers on this approach.

Thanks,
-sujit

On Feb 4, 2012, at 1:16 PM, Markus Jelsma wrote:

> Hi,
> 
> I didn't give it too much thought but I guess my first endeavour would be 
> making an indexing filter that produces dynamic fields for Solr that contain 
> sections of the content field. This also maintains a single URL per document 
> which makes sense.
> 
> Producing multiple documents and keeping a single URL for a document could be 
> possible if we use anchors but this breaks other semantics such as 
> deduplication.
> 
> Cheers
> 


Re: One WebPage to many NutchDocuments

Posted by Markus Jelsma <ma...@apache.org>.
Hi,

I didn't give it too much thought but I guess my first endeavour would be 
making an indexing filter that produces dynamic fields for Solr that contain 
sections of the content field. This also maintains a single URL per document 
which makes sense.
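
Roughly what I mean (an untested sketch; how you obtain the section texts and the field naming are up to you, and the Solr schema would need a matching dynamicField pattern such as section_*):

import java.util.List;

import org.apache.nutch.indexer.NutchDocument;

// Rough idea only: one Solr record per page, each section in its own
// dynamically named field (section_0, section_1, ...).
public class SectionFields {

  public static void addSectionFields(NutchDocument doc, List<String> sectionTexts) {
    int i = 0;
    for (String text : sectionTexts) {
      // assumes schema.xml has a dynamicField rule matching "section_*"
      doc.add("section_" + i, text);
      i++;
    }
  }
}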

Producing multiple documents and keeping a single URL for a document could be 
possible if we use anchors but this breaks other semantics such as 
deduplication.

Cheers

> Actually, I've been thinking about this, and it would probably be better to
> just create a separate stage to populate the index with the subpages,
> outside of Nutch but using the Nutch infrastructure. Cons are an
> additional pass over the index, but that should be manageable.
> 
> So... nevermind, I guess :-). Sorry for wasting bandwidth.
> 
> -sujit
> 

Re: One WebPage to many NutchDocuments

Posted by SUJIT PAL <su...@comcast.net>.
Actually, I've been thinking about this, and it would probably be better to just create a separate stage to populate the index with the subpages, outside of Nutch but using the Nutch infrastructure. Cons are an additional pass over the index, but that should be manageable.
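
For the record, the separate stage would be something small along these lines (a sketch using SolrJ; HttpSolrServer here, older SolrJ versions use CommonsHttpSolrServer, and the field names, Solr URL, and the source of the sections are placeholders):

import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch of a standalone post-indexing stage: takes the stored sections for a
// parent page and posts one Solr document per section with a synthetic id.
public class SectionIndexerStage {

  public static void indexSections(String solrUrl, String parentUrl,
      List<String> sectionTexts) throws Exception {
    SolrServer solr = new HttpSolrServer(solrUrl);
    int i = 0;
    for (String text : sectionTexts) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", parentUrl + "#section-" + i); // synthetic, non-resolving key
      doc.addField("url", parentUrl);
      doc.addField("content", text);
      solr.add(doc);
      i++;
    }
    solr.commit();
  }
}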

So... nevermind, I guess :-). Sorry for wasting bandwidth.

-sujit
