Posted to dev@nutch.apache.org by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk> on 2009/11/24 11:57:13 UTC
Plugin Development Help
Hi All,
I think I have just about finished my plugin (Nutch 1.0), which adds extra metadata
during parsing. The problem I am having is that it doesn't seem to be adding the
data to the system (checked via Luke or readseg). I looked at the wiki, but it seems
to be for 0.9 and the syntax looks different.
{code}
public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  Metadata metadata = new Metadata();
  // parse the content
  DocumentFragment root;
  String docTrans;
  try {
    byte[] contentInOctets = content.getContent();
    String input = new String(contentInOctets);
    XSLTSimpleTransform DocTransform = new XSLTSimpleTransform();
    docTrans = DocTransform.doTransform(input);
    Parse parse = parseResult.get(content.getUrl());
    metadata = parse.getData().getParseMeta();
    metadata.add("filter_html_data", docTrans);
  } catch (Exception e) {
    e.printStackTrace(LogUtil.getWarnStream(LOG));
  }
  return parseResult;
}
{code}
Cheers,
Dave
Re: Plugin Development Help
Posted by David Stuart <da...@progressivealliance.co.uk>.
Sorry, I meant it doesn't get to doc.add.
David
Re: Plugin Development Help
Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
Sorry, I keep pressing send.
But I don't quite understand how the metadata is passed from the parse to the
index. If in my
public ParseResult filter...
I do this
Parse parse = parseResult.get(content.getUrl());
metadata = parse.getData().getParseMeta();
metadata.add("filter_html_data", docTrans);
and then return
return parseResult;
is the data passed by reference into parseResult? Because when I try to
retrieve it in
public NutchDocument filter...
by doing
String html_filter_data = parse.getData().getMeta("html_filter_data");
LOG.warn(html_filter_data);
if (html_filter_data != null) {
  LOG.warn("________________________Adding filter data_______________________");
  doc.add("html_filter_data", html_filter_data);
}
I never reach the add because the variable html_filter_data is empty.
Any ideas?
Thanks for your help.
Re: Plugin Development Help
Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
Sorry, it's supposed to say "would be stored somewhere it DOESN'T seem to be
getting to the doc.add bit which"
Re: Plugin Development Help
Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
I thought I did, but I thought that before I ran bin/nutch index (or solrindex) it
would be stored somewhere. It does seem to be getting to the doc.add bit, which
makes me think the variable is empty.
{code}
public void addIndexBackendOptions(Configuration conf) {
  LOG.warn("+_+_You called me _+_+");
  LuceneWriter.addFieldOptions("html_filter_data", STORE.YES,
      INDEX.UNTOKENIZED, conf);
}

public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
  LOG.warn("________________________FILTER_______________________");
  String html_filter_data = parse.getData().getMeta("html_filter_data");
  if (html_filter_data != null) {
    LOG.warn("________________________Adding filter data_______________________");
    doc.add("html_filter_data", html_filter_data);
  }
  return doc;
}
{code}
Re: Plugin Development Help
Posted by Andrzej Bialecki <ab...@getopt.org>.
Did you declare that you are adding this field in the
IndexingFilter.addIndexBackendOptions(..) ? See how other indexing
plugins do this.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
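A note on the by-reference question raised in this thread, offered as a sketch rather than a confirmed diagnosis. In Java, a getter normally hands back a reference to the same object, so a value added through it should be visible later, assuming getParseMeta() returns the live metadata object and not a copy. What does stand out in the posted snippets is that the parse filter writes the key "filter_html_data" while the indexing filter reads "html_filter_data". The example below uses small hypothetical stand-in classes (SimpleMetadata and SimpleParseData are invented for illustration and are not the Nutch classes) to show both effects in isolation.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataKeyDemo {

    // Hypothetical stand-in for a metadata container; NOT the Nutch Metadata class.
    static class SimpleMetadata {
        private final Map<String, String> values = new HashMap<>();

        void add(String name, String value) {
            values.put(name, value);
        }

        String get(String name) {
            return values.get(name);
        }
    }

    // Hypothetical stand-in for parse data that owns a metadata object.
    static class SimpleParseData {
        private final SimpleMetadata parseMeta = new SimpleMetadata();

        // Returns the same object every time, not a copy (an assumption of this sketch).
        SimpleMetadata getParseMeta() {
            return parseMeta;
        }

        String getMeta(String name) {
            return parseMeta.get(name);
        }
    }

    // Writes a value under writeKey through the returned reference,
    // then reads it back under readKey.
    static String lookup(String writeKey, String readKey) {
        SimpleParseData data = new SimpleParseData();
        SimpleMetadata metadata = data.getParseMeta();
        metadata.add(writeKey, "transformed document");
        return data.getMeta(readKey);
    }

    public static void main(String[] args) {
        // Same key on both sides: the value added through the reference is found.
        System.out.println(lookup("html_filter_data", "html_filter_data"));
        // Keys as posted in the thread: the read misses and yields null.
        System.out.println(lookup("filter_html_data", "html_filter_data"));
    }
}
```

If the key mismatch is the cause here, using one shared constant for the field name in both filters would rule it out. This sketch says nothing about whether the segment also needs to be re-parsed before indexing picks up new parse metadata.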