
Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/16 02:21:14 UTC

modifying parse implementation

Hello,

In my quest to create a custom parser, I have modified ParseImpl to
hold another ParseText called features, as follows:

  public ParseImpl(String text, String features, ParseData data) {
    this(new ParseText(text), new ParseText(features), data, true);
  }

  public ParseImpl(ParseText text, ParseText features, ParseData data,
      boolean isCanonical) {
    this.text = text;
    this.data = data;
    this.features = features;
    this.isCanonical = isCanonical;
  }

  public String getFeatures() {
    return this.features.getText();
  }


and although I create the ParseImpl in HtmlParser.java like this:

ParseResult parseResult = ParseResult.createParseResult(
    content.getUrl(), new ParseImpl(text, features, parseData));

I get an error when indexing. parse.getText() returns the correct text,
but if I call parse.getFeatures() in the index-basic plugin I get:

SolrIndexer: starting at 2011-07-16 03:06:54
java.io.IOException: Job failed!


I am getting a much better understanding of how Nutch works. I don't
think my approach of butchering HtmlParser and ParseImpl is the best
one, and I am sure all of this can be put inside another plugin.

Best Regards,
C.B.

Re: modifying parse implementation

Posted by Joye <ma...@gmail.com>.

Hello,

You could put the features into ParseData by calling

parseData.getParseMeta().set("features", valueOfFeatures);

When you want to use it, call parseData.getParseMeta().get("features")
to get it back out, much as you would with a Java Map.

There is no need to call the setter method. :-)

Regards,
Joey
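As a rough illustration of why the metadata route needs no changes to the base classes: a string-to-string bag that serializes all of its entries generically will carry a new "features" key without any new read or write code. The sketch below is only a stand-in written for this thread; MetaBag and roundTrip are hypothetical names, not Nutch's Metadata class, and it uses plain java.io instead of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a metadata bag; NOT Nutch's Metadata class. */
class MetaBag {
  private final Map<String, String> entries = new LinkedHashMap<>();

  void set(String name, String value) { entries.put(name, value); }

  String get(String name) { return entries.get(name); }

  /** Writes every entry, so new keys need no extra serialization code. */
  void write(DataOutput out) throws IOException {
    out.writeInt(entries.size());
    for (Map.Entry<String, String> e : entries.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeUTF(e.getValue());
    }
  }

  /** Reads the entries back in the same order they were written. */
  void readFields(DataInput in) throws IOException {
    entries.clear();
    int count = in.readInt();
    for (int i = 0; i < count; i++) {
      String name = in.readUTF();
      String value = in.readUTF();
      entries.put(name, value);
    }
  }

  /** Serializes to bytes and reads a fresh copy back, as a job step would. */
  static MetaBag roundTrip(MetaBag original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      MetaBag copy = new MetaBag();
      copy.readFields(new DataInputStream(
          new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Setting "features" on such a bag in the parser and reading it back after a round trip is all that is needed; nothing in the class itself has to know about the new key.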

On 07/17/2011 04:55 AM, Cam Bazz wrote:
> Hello,
>
> I did not understand ParseData.parseData -
>
> In ParseData there are getContentMeta and getParseMeta
>
> There is also a getMeta(String string) - it appears that there is no
> setter for this.
>
> There is also setParseMeta, but it appears content meta is not settable.
>
> Best Regards,
> C.B.

Re: modifying parse implementation

Posted by Cam Bazz <ca...@gmail.com>.

Hello,

I did not understand what you meant by ParseData.parseData.

In ParseData there are getContentMeta and getParseMeta.

There is also a getMeta(String string), but it appears that there is no
setter for it.

There is also setParseMeta, but it appears that content meta is not settable.

Best Regards,
C.B.




On Sat, Jul 16, 2011 at 3:43 AM, Joye <ma...@gmail.com> wrote:
> Hello,
>
> Because the ParseImpl implements the interface of Writable and it will be
> serialized and deserialized when transferring among namenode and datanodes
> in hadoop. So, if you add a property in any class implements "Writable", you
> should add the read and write code for the new property in read and write
> functions of ParseImpl class, which tells nutch how to do when serializing
> and deserializing ParseImpl class.
>
> P.S. For the "features" is a string, so you could put it into
> ParseData.parseData (it's Map structure), without any changes in base
> classes of nutch.
>
> Regards,
> Joey

Re: modifying parse implementation

Posted by Joye <ma...@gmail.com>.

Hello,

ParseImpl implements the Writable interface, so it is serialized and
deserialized whenever it is transferred between nodes in Hadoop. If you
add a field to any class that implements Writable, you also have to add
read and write code for the new field to the read and write functions of
that class (here ParseImpl), which tell Nutch how to serialize and
deserialize it.
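To make the point about read and write code concrete, here is a minimal sketch of a class with an added field that survives a serialize/deserialize round trip. It is an assumption-laden stand-in, not the real ParseImpl: FeatureParse and roundTrip are hypothetical names, the methods only mirror the shape of Hadoop's Writable (write and readFields), and writeUTF is used for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a Writable parse object; NOT Nutch's ParseImpl. */
class FeatureParse {
  private String text = "";
  private String features = "";   // the newly added field

  FeatureParse() {}               // empty instance to be filled by readFields

  FeatureParse(String text, String features) {
    this.text = text;
    this.features = features;
  }

  String getText() { return text; }

  String getFeatures() { return features; }

  /** Same shape as Writable.write(DataOutput). */
  void write(DataOutput out) throws IOException {
    out.writeUTF(text);
    out.writeUTF(features);       // without this line the field is never stored
  }

  /** Same shape as Writable.readFields(DataInput). */
  void readFields(DataInput in) throws IOException {
    text = in.readUTF();
    features = in.readUTF();      // must be read in the order it was written
  }

  /** Serializes to bytes and reads a fresh copy back, as a job step would. */
  static FeatureParse roundTrip(FeatureParse original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      FeatureParse copy = new FeatureParse();
      copy.readFields(new DataInputStream(
          new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

If the two lines handling features were left out, the copy would come back with the field unset even though the original object had it, which is the kind of mismatch described above.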

P.S. Since "features" is a string, you could put it into
ParseData.parseData (it is a Map-like structure) without making any
changes to the base classes of Nutch.

Regards,
Joey

