Posted to dev@nutch.apache.org by Renxia Wang <re...@usc.edu> on 2015/02/23 00:36:52 UTC

How to read metadata/content of an URL in URLFilter?

Hi

I want to develop a URLFilter which takes a URL, reads its metadata or
even the fetched content, and then uses some duplicate detection algorithms
to determine whether it is a duplicate of any URL in the batch. However, the
only parameter passed into the URLFilter is the URL. Is it possible to get
the data I want for that input URL in the URLFilter?

Thanks,

Zhique

Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
Yes, I tried that out, but that method takes only the URL as input. The
problem is how to get the data for that URL locally.


Re: How to read metadata/content of an URL in URLFilter?

Posted by Nagarjun Pola <np...@usc.edu>.
I have just started looking into this and found that the URLFilter
interface has a method named "filter", which I think is our point of
interest. Maybe you should look at how to use this method in your plugin.





Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
Thanks. I will take a look at that.


Re: How to read metadata/content of an URL in URLFilter?

Posted by Jiaxin Ye <ji...@usc.edu>.
You are absolutely right! I am just throwing out ideas :) If you are looking
at local data, org.apache.nutch.segment.SegmentReader may be helpful, I
guess, as all the parsed content is located there.
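
For reference, here is a minimal sketch of that idea (a direct lookup rather
than the SegmentReader API itself): in Nutch 1.x a segment's parse_data is a
Hadoop MapFile keyed by URL, so the parsed metadata for a single URL can be
read locally. The segment path and URL in main() are made-up examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.util.NutchConfiguration;

public class ParseDataLookup {

  // Returns the ParseData stored for one URL in a segment, or null if absent.
  public static ParseData lookup(String segmentDir, String url) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path parseData = new Path(segmentDir, ParseData.DIR_NAME); // .../parse_data
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, parseData, conf);
    try {
      // Pick the part-file the key was partitioned into, then look it up there.
      Writable entry = MapFileOutputFormat.getEntry(readers,
          new HashPartitioner<Text, ParseData>(), new Text(url), new ParseData());
      return (ParseData) entry;
    } finally {
      for (MapFile.Reader reader : readers) {
        reader.close();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical segment path and URL, purely for illustration.
    ParseData pd = lookup("crawl/segments/20150222214214", "http://example.com/");
    if (pd != null) {
      System.out.println(pd.getParseMeta()); // the parsed metadata
    }
  }
}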


Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
Thank you for your suggestion. I will take a look at that. There is a
URLUtil class in Nutch's source code, but I wonder whether that one will
send a request to the URL again to get the data. Since the URL's metadata
has already been downloaded, it would be better if we could get the data
locally.


Re: How to read metadata/content of an URL in URLFilter?

Posted by Jiaxin Ye <ji...@usc.edu>.
Hey,

I haven't started working on the deduplication yet, but if I were you I
would use the Tika library to retrieve the MIME type and metadata. The code
is presented in the Tika book. Why not try that out? :)
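
For example, a self-contained sketch with Tika's AutoDetectParser, assuming
the fetched content has already been saved to a local file (the command-line
argument is just an illustration):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaProbe {
  public static void main(String[] args) throws Exception {
    try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
      Metadata metadata = new Metadata();
      BodyContentHandler handler = new BodyContentHandler(-1); // no size limit
      new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
      System.out.println("MIME type: " + metadata.get("Content-Type"));
      for (String name : metadata.names()) { // everything Tika detected
        System.out.println(name + " = " + metadata.get(name));
      }
    }
  }
}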

Best,
Jiaxin


Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
Thanks. That's what I was trying to figure out, but I didn't know which
class to use to get the path to the data files. Thanks for pointing it out.

On Sunday, February 22, 2015, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> In the constructor of your URLFilter, why not consider passing
> in a NutchConfiguration object, and then reading the path to, e.g.,
> the LinkDb from the config. Then have a private member variable
> for the LinkDbReader (maybe static initialized for efficiency)
> and use that in your interface method.
>
> Cheers,
> Chris
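
A minimal sketch of the approach quoted above, assuming the URLFilter
interface's filter(String) method and a LinkDbReader(Configuration, Path)
constructor; the "mylinkdb.path" property name and the pass-through policy
are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.net.URLFilter;

public class LinkDbAwareURLFilter implements URLFilter {

  private Configuration conf;
  // Shared across instances in the same JVM, as suggested above.
  private static volatile LinkDbReader linkDbReader;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  // Lazily opens the LinkDbReader the first time it is needed.
  private LinkDbReader reader() throws Exception {
    if (linkDbReader == null) {
      synchronized (LinkDbAwareURLFilter.class) {
        if (linkDbReader == null) {
          // Hypothetical property holding the linkdb path (e.g. set in nutch-site.xml).
          Path linkDb = new Path(conf.get("mylinkdb.path", "crawl/linkdb"));
          linkDbReader = new LinkDbReader(conf, linkDb);
        }
      }
    }
    return linkDbReader;
  }

  @Override
  public String filter(String urlString) {
    try {
      Inlinks inlinks = reader().getInlinks(new Text(urlString));
      // Placeholder policy: a real filter would run its duplicate-detection
      // logic over the inlink data here and return null to reject the URL.
      return urlString;
    } catch (Exception e) {
      return urlString; // on any error, let the URL through
    }
  }
}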

Re: [MASSMAIL]Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
Thanks, Jorge, for the useful information.

So, since multiple URLFilter instances are created during crawling, is there
any way to share data among them? Something like a hashmap may be useful for
my purpose, duplicate detection. Or should I use an external in-memory
database?
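
(For a single JVM, the simplest version of the hashmap idea would be a
static concurrent set, as in the sketch below; in a distributed MapReduce
crawl each task runs in its own JVM, so truly global sharing would indeed
need an external store. All names here are illustrative.)

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SeenSignatures {

  // Survives across URLFilter instances created within the same JVM only.
  private static final Set<String> SEEN = ConcurrentHashMap.newKeySet();

  // Returns true the first time a signature is seen, false on repeats.
  public static boolean markIfNew(String signature) {
    return SEEN.add(signature);
  }
}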

I have also failed to get the path of the linkdb/segments/crawldb. Here is
what I did:

I implemented the setConf and getConf methods in the URLFilter, and then
passed the conf to the LinkDbReader/CrawlDbReader/SegmentReader with a path.
There are many properties in the conf, like mapred.input.dir, but it
sometimes points to a specific version of the linkdb and sometimes it points
to segments. I also tried to just hard-code the path, but it throws a null
pointer exception. I am particularly interested in reading the parse_data in
segments, since I need the parsed metadata. Any thoughts on getting this to
work?

Thanks,

Zhique




Re: [MASSMAIL]Re: How to read metadata/content of an URL in URLFilter?

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.
My two cents on the topic: 

The URLFilter family of plugins is handled by the URLFilters class, which
gets instantiated in several places in the source code, including the
Fetcher and the Injector. The URLFilters class uses the
PluginRepository.get() method to load the plugins, and this method does use
a cache based on the UUID of the NutchConfiguration object passed as an
argument. The generated UUID can be found inside the config object under the
"nutch.conf.uuid" key. From what I can see in the NutchConfiguration class,
each time the create() method is called a new instance of the Configuration
class is created and a new UUID is generated; the new UUID causes a cache
miss, so a new PluginRepository is created and cached.
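
A small sketch illustrating that behavior, using only
PluginRepository.get(Configuration) and the "nutch.conf.uuid" key mentioned
above:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

public class PluginCacheDemo {
  public static void main(String[] args) {
    Configuration a = NutchConfiguration.create();
    Configuration b = NutchConfiguration.create();
    // Each create() call generates a fresh UUID, so these differ.
    System.out.println(a.get("nutch.conf.uuid"));
    System.out.println(b.get("nutch.conf.uuid"));
    // Same config -> cache hit (same repository); new config -> cache miss.
    System.out.println(PluginRepository.get(a) == PluginRepository.get(a)); // true
    System.out.println(PluginRepository.get(a) == PluginRepository.get(b)); // false
  }
}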


Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
I logged the instance ID and got this result:

2015-02-22 21:42:15,972 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
423250256
2015-02-22 21:42:24,782 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
2015-02-22 21:42:24,795 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
2015-02-22 21:42:24,804 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
...
2015-02-22 21:42:25,039 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
2015-02-22 21:42:25,041 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
2015-02-22 21:42:28,282 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1240209240
2015-02-22 21:42:28,292 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1240209240
...
2015-02-22 21:42:28,487 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1240209240
2015-02-22 21:42:28,489 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1240209240
2015-02-22 21:42:43,984 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1818924295
2015-02-22 21:42:44,090 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1818924295
...
2015-02-22 21:42:53,404 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1818924295
2015-02-22 21:44:08,533 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
908006650
2015-02-22 21:44:08,544 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
908006650
...
2015-02-22 21:44:10,418 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
908006650
2015-02-22 21:44:10,420 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
908006650
2015-02-22 21:44:14,467 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
619451848
2015-02-22 21:44:14,478 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
619451848
...
2015-02-22 21:44:15,643 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
619451848
2015-02-22 21:44:15,644 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
619451848
2015-02-22 21:44:26,189 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1343455839
2015-02-22 21:44:28,501 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1343455839
...
2015-02-22 21:45:29,707 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1343455839

As the URL filters are called in the injector and the crawldb update, I grepped:
➜  local git:(trunk) ✗ grep 'Injector: starting at' logs/hadoop.log
2015-02-22 21:42:14,896 INFO  crawl.Injector - Injector: starting at 2015-02-22 21:42:14

Which means URLFilter ID 423250256 is the one created in the injector.

➜  local git:(trunk) ✗ grep 'CrawlDb update: starting at' logs/hadoop.log
2015-02-22 21:42:25,951 INFO  crawl.CrawlDb - CrawlDb update: starting at
2015-02-22 21:42:25
2015-02-22 21:44:11,208 INFO  crawl.CrawlDb - CrawlDb update: starting at
2015-02-22 21:44:11

Here is what is confusing: there are 6 unique URLFilter IDs after the
injector, while there are only two CrawlDb updates.


Re: How to read metadata/content of an URL in URLFilter?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Cool, good test. I thought the Nutch plugin system cached instances
of plugins - I am not sure if it creates a new one each time. Are you
sure you don't have the same URLFilter instance that is just called on
different datasets and thus produces different counts?

Either way, you should simply proceed with the filters in whatever
form they are working in (cached or not).


Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
I just added a counter in my URLFilter, and proved that the URLFilter
instances in each fetching cycle are different.

Sample logs:
2015-02-22 21:07:10,636 INFO  exactdup.ExactDupURLFilter - Processed 69 links
2015-02-22 21:07:10,638 INFO  exactdup.ExactDupURLFilter - Processed 70 links
2015-02-22 21:07:10,640 INFO  exactdup.ExactDupURLFilter - Processed 71 links
2015-02-22 21:07:10,641 INFO  exactdup.ExactDupURLFilter - Processed 72 links
2015-02-22 21:07:10,643 INFO  exactdup.ExactDupURLFilter - Processed 73 links
2015-02-22 21:07:10,645 INFO  exactdup.ExactDupURLFilter - Processed 74 links
2015-02-22 21:07:10,647 INFO  exactdup.ExactDupURLFilter - Processed 75 links
2015-02-22 21:07:10,649 INFO  exactdup.ExactDupURLFilter - Processed 76 links
2015-02-22 21:07:10,650 INFO  exactdup.ExactDupURLFilter - Processed 77 links
2015-02-22 21:07:13,835 INFO  exactdup.ExactDupURLFilter - Processed 1 links
2015-02-22 21:07:13,850 INFO  exactdup.ExactDupURLFilter - Processed 2 links
2015-02-22 21:07:13,865 INFO  exactdup.ExactDupURLFilter - Processed 3 links
2015-02-22 21:07:13,878 INFO  exactdup.ExactDupURLFilter - Processed 4 links
2015-02-22 21:07:13,889 INFO  exactdup.ExactDupURLFilter - Processed 5 links
2015-02-22 21:07:13,899 INFO  exactdup.ExactDupURLFilter - Processed 6 links

Is this behavior configurable?
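
In case anyone wants to reproduce the experiment, here is a minimal sketch
of the counting filter. The class name and log wording are illustrative, not
the actual ExactDupURLFilter source, and it assumes the Nutch 1.x URLFilter
contract of filter()/setConf()/getConf():

import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CountingURLFilter implements URLFilter {

  private static final Logger LOG =
      LoggerFactory.getLogger(CountingURLFilter.class);

  // Per-instance counter: it restarts from 1 whenever Nutch constructs a
  // new filter instance, which is exactly the reset visible in the logs.
  private final AtomicInteger processed = new AtomicInteger();

  private Configuration conf;

  @Override
  public String filter(String urlString) {
    LOG.info("Processed {} links", processed.incrementAndGet());
    return urlString; // accept everything; return null to reject a URL
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}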



Re: How to read metadata/content of an URL in URLFilter?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
That’s one way - for sure - but what I was implying is that
you can train (read: feed data into) your model (read: algorithm)
using previously crawled information. So no, I wasn’t implying
machine learning.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++








Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
Hi Prof Mattmann,

You are saying "train" and "model", are we expected to use machine learning
algorithms to train model for duplication detection?

Thanks,

Renxia


Re: How to read metadata/content of an URL in URLFilter?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
There is nothing stating in your assignment that you can’t
use *previously* crawled data to train your model - you
should have at least 2 full sets of this.

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
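
To make "use previously crawled data" concrete, here is a hedged sketch of
the seeding step. It assumes you have dumped an earlier CrawlDb with
bin/nutch readdb <crawldb> -dump <dir> and post-processed the dump into one
url<TAB>signature pair per line; the file name and column layout are my own
convention, not a Nutch format:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class SignatureSeeder {

  // Loads the content signatures seen in a previous crawl so that a
  // duplicate detector can be seeded with them before the next crawl.
  public static Set<String> loadSeenSignatures(String path) throws IOException {
    Set<String> seen = new HashSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(path));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split("\t");
        if (fields.length == 2) {
          seen.add(fields[1]); // second column holds the signature
        }
      }
    } finally {
      in.close();
    }
    return seen;
  }
}

Your duplicate-detection code can then treat any page whose signature is
already in the set as a duplicate.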








Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
Hi Majisha,

From the comments in the URLFilter interface source code, the filter is
called by the Injector and the CrawlDb updater, which means that by the
time the filter runs, the URL you are processing may already have been
crawled and its data stored locally.
You may want to take a look at this article, which illustrates the workflow
of Nutch, although it is written for Nutch 1.4:
http://www.atlantbh.com/apache-nutch-overview/

Thanks,

Renxia
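
For reference, the extension point itself is tiny. The following is a
paraphrase of org.apache.nutch.net.URLFilter as of Nutch 1.x (double-check
your version for the exact declaration); it also shows why only the URL
string reaches the filter:

import org.apache.hadoop.conf.Configurable;

import org.apache.nutch.plugin.Pluggable;

public interface URLFilter extends Pluggable, Configurable {

  /** The name of the extension point. */
  public final static String X_POINT_ID = URLFilter.class.getName();

  /**
   * Returns the URL (possibly modified) if it passes the filter,
   * or null if the URL should be rejected.
   */
  public String filter(String urlString);
}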


Re: How to read metadata/content of an URL in URLFilter?

Posted by Majisha Parambath <pa...@usc.edu>.
My understanding is that the LinkDB or CrawlDB will contain the results of
previously fetched and parsed pages.
However, if we want to get the contents of a URL/page in the URL filtering
stage (*which is not yet fetched*), is there any util in Nutch that we can
use to fetch the contents of the page?

Thanks and regards,
*Majisha Namath Parambath*
*Graduate Student, M.S in Computer Science*
*Viterbi School of Engineering*
*University of Southern California, Los Angeles*
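
One possibility, sketched under the assumption that the Nutch 1.x protocol
API (ProtocolFactory, Protocol.getProtocolOutput) behaves as shown, would be
to invoke a protocol plugin directly; note that a network fetch per filtered
URL would be very slow, so treat this as an illustration only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.util.NutchConfiguration;

public class PageFetchSketch {

  // Fetches the raw bytes of a page by asking the protocol layer for the
  // plugin (http, file, ...) that matches the URL's scheme.
  public static byte[] fetchRaw(String url) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Protocol protocol = new ProtocolFactory(conf).getProtocol(url);
    ProtocolOutput output =
        protocol.getProtocolOutput(new Text(url), new CrawlDatum());
    Content content = output.getContent();
    return content == null ? null : content.getContent();
  }
}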


Re: How to read metadata/content of an URL in URLFilter?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
I believe the Plugin system caches plugins, but you will need
to confirm (haven’t looked in a long time).


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
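
A quick way to check the caching claim (a sketch; it assumes
PluginRepository.get(Configuration) is the cached entry point to the plugin
system, as in Nutch 1.x):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

public class PluginCacheCheck {

  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // If the repository is cached per Configuration, both calls should
    // return the very same object.
    PluginRepository first = PluginRepository.get(conf);
    PluginRepository second = PluginRepository.get(conf);
    System.out.println("cached: " + (first == second));
  }
}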








Re: How to read metadata/content of an URL in URLFilter?

Posted by Renxia Wang <re...@usc.edu>.
Is there only one instance of a plugin across all fetch cycles? I am
assuming that when the job starts, a plugin instance is initialized and then
reused in every fetch cycle. Is that correct?


Re: How to read metadata/content of an URL in URLFilter?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
In the constructor of your URLFilter, why not consider passing
in a NutchConfiguration object, and then reading the path to, e.g.,
the LinkDb from the config. Then have a private member variable
for the LinkDbReader (maybe static initialized for efficiency)
and use that in your interface method.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
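
A sketch of that suggestion follows. The property name
urlfilter.exactdup.linkdb is made up for illustration, and the
LinkDbReader(Configuration, Path) constructor and getInlinks(Text) method
are the Nutch 1.x signatures as best I recall, so verify them against your
checkout:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.net.URLFilter;

public class LinkDbAwareURLFilter implements URLFilter {

  // Lazily initialized once and shared, so the LinkDb is not reopened for
  // every filter instance.
  private static LinkDbReader linkDbReader;

  private Configuration conf;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    synchronized (LinkDbAwareURLFilter.class) {
      if (linkDbReader == null) {
        try {
          Path linkDb = new Path(conf.get("urlfilter.exactdup.linkdb",
              "crawl/linkdb"));
          linkDbReader = new LinkDbReader(conf, linkDb);
        } catch (Exception e) {
          throw new RuntimeException("Could not open LinkDb", e);
        }
      }
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public String filter(String urlString) {
    try {
      Inlinks inlinks = linkDbReader.getInlinks(new Text(urlString));
      // A URL with recorded inlinks has been seen in a previous cycle; a
      // real filter would run its duplicate-detection logic here.
      if (inlinks != null) {
        return null; // reject as a (possible) duplicate
      }
    } catch (IOException e) {
      // on read errors, fail open and keep the URL
    }
    return urlString;
  }
}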





