Posted to dev@oodt.apache.org by "Verma, Rishi (388J)" <Ri...@jpl.nasa.gov> on 2012/04/26 01:36:00 UTC

Registering a custom ProductCrawler with cas-crawler

Hi all,

I wrote a custom cas-crawler ProductCrawler, but I'm having difficulty registering it with cas-crawler.

I created a product crawler by extending StdProductCrawler, and I've added this product-crawler name to crawler config files (following the example of StdProductCrawler):
* crawler/policy/crawler-beans.xml
* crawler/policy/cmd-line-option-beans.xml

However, after running the command below, I can see that my custom product crawler (called LabCASProductCrawler) is not available. An ingest attempt also reports that no bean named "LabCASProductCrawler" is defined:
> bash-3.2$ ./crawler_launcher --printSupportedCrawlers
ProductCrawlers:
  Id: StdProductCrawler
  Id: MetExtractorProductCrawler
  Id: AutoDetectProductCrawler

> ./crawler_launcher --crawlerId LabCASProductCrawler --filemgrUrl http://localhost:9000 --productPath /data/staging/HGHAGA9 --failureDir /tmp/failed_ingest --metFileExtension met --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
Failed to parse options : No bean named 'LabCASProductCrawler' is defined

I noticed that files like crawler-config.xml and cmd-line-option-beans.xml reference crawler config files stored in the cas-crawler JAR. Looking into this further, it seems to me that the crawler is pre-loading config files directly from that JAR, and that these override my config changes:
* crawler/lib/cas-crawler-0.3.jar:org/apache/oodt/cas/crawl/crawler-beans.xml
* crawler/lib/cas-crawler-0.3.jar:org/apache/oodt/cas/crawl/crawler-config.xml
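
One way to narrow this down is to first confirm that the bean is actually declared in the locally edited policy file. The sketch below assumes the bean is declared Spring-style with an id="..." attribute; the helper name has_bean is made up for illustration and is not part of OODT:

```shell
# Hypothetical helper: succeeds if FILE ($1) declares a bean whose id is BEAN_ID ($2).
# Assumes Spring-style XML, i.e. <bean id="BEAN_ID" ...>.
has_bean() {
  grep -q "id=\"$2\"" "$1"
}

# Check the locally edited policy file (path taken from the list above).
policy=crawler/policy/crawler-beans.xml
if [ -f "$policy" ] && has_bean "$policy" LabCASProductCrawler; then
  echo "bean is defined in $policy"
else
  echo "bean not found in $policy"
fi
```

If the bean is present locally but the crawler still does not list it, the crawler is probably loading a different file. Running `unzip -l crawler/lib/cas-crawler-0.3.jar` lists the config copies bundled in the JAR.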

So two questions:
1. Am I editing the correct policy files, in order to register my custom product crawler with cas-crawler?
2. It seems the cas-crawler JAR contains crawler config files that take greater precedence than the ones available for editing under crawler/policy. Is there a way around this?

Thanks!
rishi

Re: Registering a custom ProductCrawler with cas-crawler

Posted by Brian Foster <ho...@me.com>.
Hey Rishi,

Do what Chris said AND add a CmdLineAction to cmd-line-action.xml to run your crawler (should just be a copy/paste and change some names).

-Brian

On Apr 25, 2012, at 10:12 PM, "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> wrote:

> Hey Rishi,
> 
> I think you need to change the actionRepo, something akin to this:
> 
> 1. Edit your crawler_launcher script:
> - make sure that the -Dorg.apache.oodt.cas.crawl.bean.repo is set, e.g., to 
> something like file:/path/to/crawler/policy/crawler-config.xml
> 
> 2. Make sure that /path/to/crawler/policy/crawler-config.xml has the configuration
> that you are trying to override (e.g., your new bean definitions).
> 
> HTH!
> 
> Cheers,
> Chris
> 

Re: Registering a custom ProductCrawler with cas-crawler

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Rishi,

I think you need to change the actionRepo, something akin to this:

1. Edit your crawler_launcher script:
 - make sure that the -Dorg.apache.oodt.cas.crawl.bean.repo property is set, e.g., to
something like file:/path/to/crawler/policy/crawler-config.xml

2. Make sure that /path/to/crawler/policy/crawler-config.xml has the configuration
that you are trying to override (e.g., your new bean definitions).
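
A minimal sketch of step 1, assuming crawler_launcher is a shell script that starts the JVM with a java command line. Only the property name comes from this thread; the helper name bean_repo_flag is hypothetical and /path/to/... is a placeholder:

```shell
# Hypothetical helper: builds the JVM system-property flag that points
# cas-crawler at a local bean repository file ($1) via a file: URL.
# The helper itself is illustrative and not part of OODT.
bean_repo_flag() {
  printf '%s' "-Dorg.apache.oodt.cas.crawl.bean.repo=file:$1"
}

# Print the flag for the policy file mentioned above.
bean_repo_flag /path/to/crawler/policy/crawler-config.xml
echo
```

The resulting flag would go on the java command line inside crawler_launcher, before the main class name, so that the JVM sees it as a system property.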

HTH!

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Registering a custom ProductCrawler with cas-crawler

Posted by Sheryl John <sh...@gmail.com>.
Ah, interesting.
Thanks for sharing, Rishi!

On Fri, Apr 27, 2012 at 11:37 AM, Verma, Rishi (388J) <
Rishi.Verma@jpl.nasa.gov> wrote:

> Hey All,
>
> Chris and I had a lively discussion over IM about whether to write a
> custom crawler or use actionIds/precondId based extension points.
>
> We thought it would be useful to share, so I've made it available on the
> OODT wiki:
> https://cwiki.apache.org/confluence/display/OODT/2012/04/27/Custom+crawling+-+when+to+or+when+not+to+write+your+own+ProductCrawler
>
>
> Thanks!
> rishi
>


-- 
-Sheryl

Re: Registering a custom ProductCrawler with cas-crawler

Posted by "Verma, Rishi (388J)" <Ri...@jpl.nasa.gov>.
Hey All,

Chris and I had a lively discussion over IM about whether to write a
custom crawler or use actionIds/precondId based extension points.

We thought it would be useful to share, so I've made it available on the
OODT wiki:
https://cwiki.apache.org/confluence/display/OODT/2012/04/27/Custom+crawling+-+when+to+or+when+not+to+write+your+own+ProductCrawler


Thanks!
rishi

On 4/26/12 1:25 PM, "Verma, Rishi (388J)" <Ri...@jpl.nasa.gov> wrote:

>Per Chris' suggestion, I'm looking at making a custom pre-ingest action or
>pre-ingest comparator instead of creating a whole new ProductCrawler. This
>might be a more lightweight solution.
>
>However, thanks for the tips in any case Brian and Chris!
>
>rishi
>


Re: Registering a custom ProductCrawler with cas-crawler

Posted by "Verma, Rishi (388J)" <Ri...@jpl.nasa.gov>.
Per Chris' suggestion, I'm looking at making a custom pre-ingest action or
pre-ingest comparator instead of creating a whole new ProductCrawler. This
might be a more lightweight solution.

However, thanks for the tips in any case Brian and Chris!

rishi

On 4/26/12 2:06 AM, "Brian Foster" <ho...@me.com> wrote:

>Never mind. It looks like you are using 0.3 instead of trunk; what I
>added applies to the trunk crawler.
>
>-Brian
>


Re: Registering a custom ProductCrawler with cas-crawler

Posted by Brian Foster <ho...@me.com>.
Never mind. It looks like you are using 0.3 instead of trunk; what I added applies to the trunk crawler.

-Brian
