Posted to dev@nutch.apache.org by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk> on 2009/11/24 11:57:13 UTC

Plugin Development Help

Hi All,

I think I have just about finished my plugin (Nutch 1.0), which adds extra
metadata during parsing. The problem I am having is that it doesn't seem to be
adding the data to the system (checked via Luke and readseg). I looked at the
wiki, but it seems to be written for 0.9 and the syntax looks different.

{code}
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Metadata metadata = new Metadata();
    // parse the content
    DocumentFragment root;
    String docTrans;
    try {
      byte[] contentInOctets = content.getContent();
      String input = new String(contentInOctets);
      XSLTSimpleTransform DocTransform = new XSLTSimpleTransform();
      docTrans = DocTransform.doTransform(input);
      Parse parse = parseResult.get(content.getUrl());
      metadata = parse.getData().getParseMeta();
      metadata.add("filter_html_data", docTrans);
    } catch (Exception e) {
      e.printStackTrace(LogUtil.getWarnStream(LOG));
    }
    return parseResult;
  }
{code}

Cheers,

Dave
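
A side note on the snippet above: `new String(contentInOctets)` decodes the page with the JVM's default charset, which may not match the page's real encoding. The sketch below shows one way to decode with an explicit charset and fall back to UTF-8. The class name `ContentDecoding`, the method `decode`, and the UTF-8 fallback are all assumptions made for illustration, not part of the original plugin or of Nutch, and `StandardCharsets` needs Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentDecoding {

    // Hypothetical helper: decode raw bytes with a named charset.
    // Falls back to UTF-8 when the name is null, malformed, or unsupported.
    static String decode(byte[] raw, String charsetName) {
        Charset cs = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                cs = Charset.forName(charsetName);
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: keep the UTF-8 fallback.
            }
        }
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decoding with the right charset round-trips the text.
        System.out.println(decode(raw, "UTF-8"));
        // Decoding with the wrong charset silently produces different characters.
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}
```

Where the charset name would come from in a real parse filter (HTTP headers, a meta tag, detection) is left open here.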

Re: Plugin Development Help

Posted by David Stuart <da...@progressivealliance.co.uk>.
Sorry, I meant it doesn't get to doc.add.

David

On 24 Nov 2009, at 11:27, "david.stuart@progressivealliance.co.uk" <david.stuart@progressivealliance.co.uk> wrote:

> I thought I did, but I thought that before I ran bin/nutch index (or
> solrindex) it would be stored somewhere. It does seem to be getting to the
> doc.add bit, which makes me think the variable is empty.
> {code}
>     public void addIndexBackendOptions(Configuration conf) {
>       LOG.warn("+_+_You called me _+_+");
>       LuceneWriter.addFieldOptions("html_filter_data", STORE.YES,
>           INDEX.UNTOKENIZED, conf);
>     }
>
>     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>         CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>       LOG.warn("________________________FILTER_______________________");
>       String html_filter_data = parse.getData().getMeta("html_filter_data");
>       if (html_filter_data != null){
>           LOG.warn("________________________Adding filter data_______________________");
>           doc.add("html_filter_data", html_filter_data);
>       }
>       return doc;
>     }
> {code}

Re: Plugin Development Help

Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
Sorry keep pressing

But I don't quite understand how the metadata is passed from the parse step to
the indexing step. If, in my
public ParseResult filter...

I do this:
        Parse parse = parseResult.get(content.getUrl());
        metadata = parse.getData().getParseMeta();
        metadata.add("filter_html_data", docTrans);

and then return:
return parseResult;

is the data passed by reference into parseResult? I ask because when I try to
retrieve it in
public NutchDocument filter...

by doing:
      String html_filter_data = parse.getData().getMeta("html_filter_data");
      LOG.warn(html_filter_data);
      if (html_filter_data != null){
          LOG.warn("________________________Adding filter data_______________________");
          doc.add("html_filter_data", html_filter_data);
      }
I never reach the add, because the variable html_filter_data is empty.

Any ideas?

Thanks for your help
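
Going only by the two snippets in this message, one thing stands out: the parse filter writes under "filter_html_data", while the indexing filter reads "html_filter_data". Whether that is the whole story in the real plugin is an assumption. The sketch below uses a plain java.util.Map as a stand-in for Nutch's Metadata (the class `MetadataKeyCheck` and its methods are hypothetical, not Nutch API) just to show how a lookup under a different key comes back null even though the value was stored.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataKeyCheck {

    // Key used when writing, as in the parse filter snippet.
    static final String WRITTEN_KEY = "filter_html_data";
    // Key used when reading, as in the indexing filter snippet.
    static final String READ_KEY = "html_filter_data";

    // Stand-in for the parse filter: stores the value under the given key.
    static void store(Map<String, String> meta, String key, String value) {
        meta.put(key, value);
    }

    // Stand-in for the indexing filter: returns null when the key is absent.
    static String lookup(Map<String, String> meta, String key) {
        return meta.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        store(meta, WRITTEN_KEY, "<transformed/>");

        // Reading with the other spelling finds nothing.
        System.out.println("mismatched key: " + lookup(meta, READ_KEY));
        // Reading with the same key that was written finds the value.
        System.out.println("matching key:   " + lookup(meta, WRITTEN_KEY));
    }
}
```

If this does turn out to be the cause, declaring the field name once as a shared constant and using it in both filters would keep the two sides from drifting apart.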




Re: Plugin Development Help

Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
Sorry, it's supposed to say "would be stored somewhere it DOESN'T seem to be
getting to the doc.add bit which"


Re: Plugin Development Help

Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
I thought I did, but I thought that before I ran bin/nutch index (or solrindex)
it would be stored somewhere. It does seem to be getting to the doc.add bit,
which makes me think the variable is empty.
{code}
    public void addIndexBackendOptions(Configuration conf) {
      LOG.warn("+_+_You called me _+_+");
      LuceneWriter.addFieldOptions("html_filter_data", STORE.YES,
          INDEX.UNTOKENIZED, conf);
    }

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      LOG.warn("________________________FILTER_______________________");
      String html_filter_data = parse.getData().getMeta("html_filter_data");
      if (html_filter_data != null){
          LOG.warn("________________________Adding filter data_______________________");
          doc.add("html_filter_data", html_filter_data);
      }
      return doc;
    }
{code}
On 24 November 2009 at 12:05 Andrzej Bialecki <ab...@getopt.org> wrote:

> Did you declare that you are adding this field in the 
> IndexingFilter.addIndexBackendOptions(..) ? See how other indexing 
> plugins do this.

Re: Plugin Development Help

Posted by Andrzej Bialecki <ab...@getopt.org>.
Did you declare that you are adding this field in the 
IndexingFilter.addIndexBackendOptions(..) ? See how other indexing 
plugins do this.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com