Posted to user@nutch.apache.org by Safdar Kureishy <sa...@gmail.com> on 2012/08/29 21:23:01 UTC

Need to transfer Parse metadata obtained in HtmlParseFilter.filter() to the CrawlDb

Hi,

I've built a custom HtmlParseFilter and am doing custom language
identification in the filter() API. Here, I am able to set the relevant
lang id properties on a ParseResult object via getParseMeta().put("LangId",
id). I am also able to retrieve these properties in my custom
ScoringFilter, for use during distributeScoreToOutlinks(). However, what I
also need is to persist this data as metadata in the relevant CrawlDb
record (i.e., in the CrawlDatum.getMetadata() data structure). My goal
is to be able to write custom Hadoop jobs that gather
language distribution statistics directly from the CrawlDb (without having
to do any joins on the ParseText, Content, ParseData types). The only way I
see this being possible is if each URL's CrawlDatum also has the lang-id
in its metadata.
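
As a rough illustration of the goal described above, the per-language count
could be sketched in plain Java as follows. This is only a model of the
aggregation, not a Hadoop job and not Nutch code: each CrawlDb record is
represented by an ordinary java.util.Map of its metadata, and the names
LangDistributionSketch and countByLanguage are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: counts records per language from per-record metadata maps.
// A real job would read CrawlDatum records from the CrawlDb instead.
class LangDistributionSketch {

    // Returns language -> number of records; records lacking the key count as "unknown".
    static Map<String, Long> countByLanguage(List<Map<String, String>> records, String key) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map<String, String> metadata : records) {
            String lang = metadata.getOrDefault(key, "unknown");
            counts.merge(lang, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample metadata, standing in for CrawlDb records.
        List<Map<String, String>> records = List.of(
            Map.of("LangId", "en"),
            Map.of("LangId", "fr"),
            Map.of("LangId", "en"),
            Map.of("title", "no language set"));
        System.out.println(countByLanguage(records, "LangId"));
    }
}
```

The point of the sketch is that the count needs nothing but the metadata of
each record, which is why having the lang-id in the CrawlDatum would avoid
any join against the parse output.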

This is turning out to be a challenge. I first tried transferring the parse
properties in my custom ScoringFilter.distributeScoreToOutlinks() API,
because that API offers access to the ParseResult as well as an "adjust"
CrawlDatum parameter for updating the CrawlDb (according to the Javadocs).
However, doing that does not update the CrawlDb. Then, in the newsgroup
archives, I stumbled upon a thread about the "db.max.outlinks.per.page"
property being used by the ParseOutputFormat class to do exactly the same
property transfer at a different stage of the crawl cycle, but that doesn't
work either.

So, I'm writing to the newsgroup hoping someone could give me specific
advice on which API I should override, or which configuration setting I
should change, so as to transfer custom parse-time metadata to the CrawlDb.

Thanks in advance.

Cheers,
Safdar

RE: Need to transfer Parse metadata obtained in HtmlParseFilter.filter() to the CrawlDb

Posted by Safdar Kureishy <sa...@gmail.com>.

Hi Markus,

I had mentioned in my email that I had already tried that parameter, but it
didn't work. Is that the only way to achieve this? Can I add code in some
plugin for this somewhere?

Thanks,
Safdar
On Aug 29, 2012 11:18 PM, "Markus Jelsma" <ma...@openindex.io>
wrote:

> Hi
>
> Check the db.parsemeta.to.crawldb parameter. It'll send your parse meta
> keys to the CrawlDatum meta data.
>
> Cheers

RE: Need to transfer Parse metadata obtained in HtmlParseFilter.filter() to the CrawlDb

Posted by Markus Jelsma <ma...@openindex.io>.

Hi

Check the db.parsemeta.to.crawldb parameter. It'll send your parse meta keys to the CrawlDatum metadata.
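
To the best of my knowledge, db.parsemeta.to.crawldb takes a comma-separated
list of parse metadata keys (for example LangId) and is normally overridden
in conf/nutch-site.xml. The plain-Java sketch below only models that idea of
copying the listed keys from the parse metadata into the record's metadata.
It is not Nutch's implementation: the names ParseMetaTransferSketch,
parseKeys and transfer are hypothetical, and ordinary java.util maps stand in
for Nutch's own metadata classes.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a "copy the configured keys" step; not Nutch code.
class ParseMetaTransferSketch {

    // Splits a comma-separated key list, trimming whitespace and dropping empty entries.
    static Set<String> parseKeys(String csv) {
        Set<String> keys = new LinkedHashSet<>();
        if (csv == null) {
            return keys;
        }
        for (String part : csv.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                keys.add(trimmed);
            }
        }
        return keys;
    }

    // Returns a new map holding only the configured keys that exist in parseMeta.
    static Map<String, String> transfer(Map<String, String> parseMeta, String csv) {
        Map<String, String> datumMeta = new HashMap<>();
        for (String key : parseKeys(csv)) {
            String value = parseMeta.get(key);
            if (value != null) {
                datumMeta.put(key, value);
            }
        }
        return datumMeta;
    }

    public static void main(String[] args) {
        // Made-up parse metadata; "LangId" is the key used in the original question.
        Map<String, String> parseMeta = Map.of("LangId", "en", "title", "Example page");
        System.out.println(transfer(parseMeta, "LangId, Charset"));
    }
}
```

In this model, a key that is listed but absent from the parse metadata is
simply skipped, and keys that are not listed never reach the record.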

Cheers

 
 