Posted to dev@nutch.apache.org by Scott Green <sm...@gmail.com> on 2007/01/15 03:40:00 UTC

How can I get one plugin's root dir

Hi,

I need to load some resources from my plugin's sub-directory. Is there
an available method to get a given plugin's root directory?
Thanks

- scott

Re: How can I get one plugin's root dir

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Scott,

I should have read your original post in more detail.  I was assuming
you were just trying to get the root directory of the plugin, not
loading resources during a MR job.  I agree with Andrzej's approach
if this is to be used during a MR job.  Sorry for the confusion.

Dennis Kubes


Re: How can I get one plugin's root dir

Posted by Scott Green <sm...@gmail.com>.
Thanks Andrzej and Doug!

I will try both in my later work and evaluate them.


Re: How can I get one plugin's root dir

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> The reason is that if you pack this file into your job JAR, the job jar 
> would become very large (presumably this 40MB is already compressed?). 
> Job jar needs to be copied to each tasktracker for each task, so you 
> will experience performance hit just because of the size of the job jar 
> ... whereas if this file sits on DFS and is highly replicated, its 
> content will always be available locally.

Note that the job jar is copied into HDFS with a highish replication 
(10?), and that it is only copied to each tasktracker node once per 
*job*, not per task.  So it's only faster to manage this yourself if you 
have a sequence of jobs that share this data, and if the time to 
re-replicate it per job is significant.
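
To put rough numbers on the difference between per-task and per-job copying, here is a back-of-the-envelope sketch. The 40 MB jar size comes from this thread; the node and task counts are made-up assumptions, and the estimate ignores the initial upload of the jar into HDFS and its replication there.

```java
/** Back-of-the-envelope estimate of job jar traffic; all cluster figures are hypothetical. */
class JobJarTransferEstimate {

    /** Traffic in MB if the jar were shipped once for every task. */
    static long perTaskCopyMb(long jarMb, long tasks) {
        return jarMb * tasks;
    }

    /** Traffic in MB if the jar is shipped once per job to each tasktracker node. */
    static long perJobCopyMb(long jarMb, long nodes) {
        return jarMb * nodes;
    }

    public static void main(String[] args) {
        long jarMb = 40;   // size mentioned in the thread
        long nodes = 10;   // assumed number of tasktracker nodes
        long tasks = 500;  // assumed number of tasks in one job

        System.out.println("copied per task: " + perTaskCopyMb(jarMb, tasks) + " MB");
        System.out.println("copied per job:  " + perJobCopyMb(jarMb, nodes) + " MB");
    }
}
```

With these assumed figures the per-task model would move 20000 MB per job and the per-job model 400 MB, which is why the distinction matters for a 40 MB jar.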

Doug

Re: How can I get one plugin's root dir

Posted by Andrzej Bialecki <ab...@getopt.org>.
Scott Green wrote:
> Thank you for the detailed explanation, Andrzej.
>
> My plugin contains one language model (a configuration file) whose size
> is 40M. Could you please suggest where the model file should be
> put?
> a) put it into the nutch/conf dir, like the "regex-urlfilter.txt" file
> b) put it into the plugin's jar package.

From a purely theoretical point of view, either way should work fine
- the content of the conf/ dir is packed into the job jar too.

One comment though, and I hope I'm not confusing you too much ;) If the
file is that large, AND you execute your jobs using
jobtracker/tasktrackers, AND you run on Hadoop DFS, you may want to do
exactly the opposite of what I advocated ;) I.e. keep this file in a
well-known external location on DFS, where it's accessible to all tasks.
You should also set its replication factor equal to the number of
datanodes, and then load this file directly from DFS. Even then, you
wouldn't use java.io.File, but FileSystem.open(Path).

The reason is that if you pack this file into your job JAR, the job jar
would become very large (presumably this 40MB is already compressed?).
The job jar needs to be copied to each tasktracker for each task, so you
will experience a performance hit just because of the size of the job jar
... whereas if this file sits on DFS and is highly replicated, its
content will always be available locally.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How can I get one plugin's root dir

Posted by Scott Green <sm...@gmail.com>.
Thank you for the detailed explanation, Andrzej.

My plugin contains one language model (a configuration file) whose size
is 40M. Could you please suggest where the model file should be
put?
 a) put it into the nutch/conf dir, like the "regex-urlfilter.txt" file
 b) put it into the plugin's jar package.


Re: How can I get one plugin's root dir

Posted by Andrzej Bialecki <ab...@getopt.org>.
Scott Green wrote:
> Well, why do all resources need to be packed?

Because when you run Nutch on a Hadoop cluster, Hadoop requires that all 
job resources be packed into a job JAR, which is then submitted to each 
tasktracker as a part of the job. So, if you want to run in non-local 
mode you have to build the nutch-xxx.job JAR ("ant job" target).

Apparently you are running in so-called "local" mode, where these issues
are quite muddy - but as soon as you try to execute it on a cluster your
method will stop working.


> The build result might look like:
>
> xxx-plugin
>  `--- conf
>  `--- web
>  `--- xxx-plugin.jar
>  `--- deps.jar
>  `-- plugin.xml

Again: in the "local" mode this may work, but these unpacked plugins are 
not available for jobs executing on a Hadoop cluster.

>
>> Now, you may have tested your method and found that it does indeed work
>> - but the reason is a bit obscure: the bin/nutch and bin/hadoop scripts
>> add your build/ directory to the classpath, so that you can locally test
>> the latest versions of the code without creating the *.job file.
>> However, when you run your code on a Hadoop cluster your local build/
>> directory is no longer accessible, and your method will mysteriously
>> fail - or even worse, you may get a different version of a resource from
>> an older version of the build/ directory found on Hadoop tasktracker
>> nodes ...
>
> If you pack everything into jar(s), it is still possible that the jar
> on a Hadoop tasktracker node is an old version, right?

No. The job jar is always up to date, because it is sent with every job.

But if you don't get the resources from this jar, and instead rely on 
using java.io.File-s, you may pick some old cruft from the local build/ 
directory that you may have accidentally deployed to your tasktrackers ...
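
The point about java.io.File can be tried out with nothing but the JDK. The sketch below is only an illustration: it builds a throwaway jar as a stand-in for a packed plugin (the class name JarResourceDemo and the entry name conf/model.txt are invented for the example), then shows that the entry is readable as a stream through the class loader but does not exist as a file on disk.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

class JarResourceDemo {

    /** Builds a temporary jar holding a single text entry (hypothetical stand-in for a plugin jar). */
    static Path buildJar(String entryName, String content) throws IOException {
        Path jar = Files.createTempFile("demo-plugin", ".jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out)) {
            jos.putNextEntry(new JarEntry(entryName));
            jos.write(content.getBytes(StandardCharsets.UTF_8));
            jos.closeEntry();
        }
        return jar;
    }

    /** Returns { URL protocol, result of File.exists() on the URL path, content read via stream }. */
    static String[] probe(String entryName, String content) {
        try {
            Path jar = buildJar(entryName, content);
            try (URLClassLoader loader =
                     new URLClassLoader(new URL[] { jar.toUri().toURL() }, null)) {
                URL url = loader.getResource(entryName);
                if (url == null) {
                    throw new IOException("entry not found: " + entryName);
                }
                // The URL path points inside the jar, so it is not a real file path.
                boolean existsAsFile = new File(url.getPath()).exists();

                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (InputStream in = loader.getResourceAsStream(entryName)) {
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = in.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                }
                String viaStream = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                return new String[] { url.getProtocol(), String.valueOf(existsAsFile), viaStream };
            } finally {
                Files.deleteIfExists(jar);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] r = probe("conf/model.txt", "hello");
        System.out.println("protocol=" + r[0] + " existsAsFile=" + r[1] + " content=" + r[2]);
    }
}
```

The resource URL uses the jar: protocol rather than file:, so code that turns a resource location into a java.io.File finds nothing once the resources are packed, while stream-based access keeps working.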

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How can I get one plugin's root dir

Posted by Scott Green <sm...@gmail.com>.
On 1/16/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> Scott Green wrote:
> > Hi Sami
> >
> > On 1/16/07, Sami Siren <ss...@gmail.com> wrote:
> >> Scott Green wrote:
> >> > Thanks Dennis! Your method should work.
> >> >
> >> > And I really hope there is a direct method, say getPluginRootDir(),
> >> > in the plugin implementation.
> >>
> >> I'd recommend taking the path shown by Andrzej because IMO it's bad design
> >> to depend on the plugin system from a plugin.
> >
> > Your reasoning is not clear to me.
> >
> > The getPluginRootDir() method mentioned above should expose the
> > absolute path of xxx-plugin in the example below.
> >
> > plugins
> >  `-xxx-plugin
> >          `------ lib
> >          `------ conf
> >          `------ src
> >          `------ web (only for web plugin)
> >          `------ plugin.xml
> >          `------ build.xml
>
> Ok. Now imagine that all plugins are packed together in a Jar file (as
> is the case with Nutch). Is your method still going to work? Nope.
> getPluginRootDir() may still return some non-null value (not sure about
> that), but the resources are not available as files because they are
> packed into a Jar.

Well, why do all resources need to be packed?
The build result might look like:

xxx-plugin
  `--- conf
  `--- web
  `--- xxx-plugin.jar
  `--- deps.jar
  `-- plugin.xml

> Now, you may have tested your method and found that it does indeed work
> - but the reason is a bit obscure: the bin/nutch and bin/hadoop scripts
> add your build/ directory to the classpath, so that you can locally test
> the latest versions of the code without creating the *.job file.
> However, when you run your code on a Hadoop cluster your local build/
> directory is no longer accessible, and your method will mysteriously
> fail - or even worse, you may get a different version of a resource from
> an older version of the build/ directory found on Hadoop tasktracker
> nodes ...

If you pack everything into jar(s), it is still possible that the jar
on a Hadoop tasktracker node is an old version, right?

> >
> > Andrzej's idea is limited(?) since I cannot get resources from the conf dir.
>
> Absolutely not - that's how the whole Configuration system works.

Re: How can I get one plugin's root dir

Posted by Andrzej Bialecki <ab...@getopt.org>.
Scott Green wrote:
> Hi Sami
>
> On 1/16/07, Sami Siren <ss...@gmail.com> wrote:
>> Scott Green wrote:
>> > Thanks Dennis! Your method should work.
>> >
>> > And I really hope there is a direct method, say getPluginRootDir(),
>> > in the plugin implementation.
>>
>> I'd recommend taking the path shown by Andrzej because IMO it's bad design
>> to depend on the plugin system from a plugin.
>
> Your reasoning is not clear to me.
>
> The getPluginRootDir() method mentioned above should expose the
> absolute path of xxx-plugin in the example below.
>
> plugins
>  `-xxx-plugin
>          `------ lib
>          `------ conf
>          `------ src
>          `------ web (only for web plugin)
>          `------ plugin.xml
>          `------ build.xml

Ok. Now imagine that all plugins are packed together in a Jar file (as 
is the case with Nutch). Is your method still going to work? Nope. 
getPluginRootDir() may still return some non-null value (not sure about 
that), but the resources are not available as files because they are 
packed into a Jar.

Now, you may have tested your method and found that it does indeed work 
- but the reason is a bit obscure: the bin/nutch and bin/hadoop scripts 
add your build/ directory to the classpath, so that you can locally test 
the latest versions of the code without creating the *.job file. 
However, when you run your code on a Hadoop cluster your local build/ 
directory is no longer accessible, and your method will mysteriously 
fail - or even worse, you may get a different version of a resource from 
an older version of the build/ directory found on Hadoop tasktracker 
nodes ...

>
> Andrzej's idea is limited(?) since I cannot get resources from the conf dir.

Absolutely not - that's how the whole Configuration system works.


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How can I get one plugin's root dir

Posted by Scott Green <sm...@gmail.com>.
Hi Sami

On 1/16/07, Sami Siren <ss...@gmail.com> wrote:
> Scott Green wrote:
> > Thanks Dennis! Your method should work.
> >
> > And I really hope there is a direct method, say getPluginRootDir(),
> > in the plugin implementation.
>
> I'd recommend taking the path shown by Andrzej because IMO it's bad design
> to depend on the plugin system from a plugin.

Your reasoning is not clear to me.

The getPluginRootDir() method mentioned above should expose the
absolute path of xxx-plugin in the example below.

plugins
  `-xxx-plugin
          `------ lib
          `------ conf
          `------ src
          `------ web (only for web plugin)
          `------ plugin.xml
          `------ build.xml

Andrzej's idea is limited(?) since I cannot get resources from the conf dir.



Re: How can I get one plugin's root dir

Posted by Sami Siren <ss...@gmail.com>.
Scott Green wrote:
> Thanks Dennis! Your method should work.
>
> And I really hope there is a direct method, say getPluginRootDir(),
> in the plugin implementation.

I'd recommend taking the path shown by Andrzej because IMO it's bad design
to depend on the plugin system from a plugin.

--
 Sami Siren



Re: How can I get one plugin's root dir

Posted by Scott Green <sm...@gmail.com>.
Thanks Dennis! Your method should work.

And I really hope there is a direct method, say getPluginRootDir(),
in the plugin implementation.



Re: How can I get one plugin's root dir

Posted by Dennis Kubes <nu...@dragonflymc.com>.
You can get the PluginRepository and then from there get the plugin
descriptor and its path.  From there you can reach resources inside the
plugin folder.  Replace parse-html with your plugin id.

     Configuration conf = NutchConfiguration.create();
     PluginRepository rep = PluginRepository.get(conf);
     PluginDescriptor desc = rep.getPluginDescriptor("parse-html");
     String path = desc.getPluginPath();
     System.out.println(path);


Dennis Kubes


Re: How can I get one plugin's root dir

Posted by Scott Green <sm...@gmail.com>.
Hi,

I want to propose a somewhat cleaner plugin directory structure:

xxx-plugin
           `------ lib
           `------ conf
           `------ src
           `------ web (only for web plugin)
           `------ plugin.xml
           `------ build.xml

Take the urlfilter-regex plugin as an example: the configuration file
"regex-urlfilter.txt" would be put in its conf/ dir. Does this make
sense?


Re: How can I get one plugin's root dir

Posted by Andrzej Bialecki <ab...@getopt.org>.
Scott Green wrote:
> Can someone give an answer? I don't think it is a good idea to put all
> configuration/resources under the "conf" dir.
>
> On 1/15/07, Scott Green <sm...@gmail.com> wrote:
>> Hi,
>>
>> I need to load some resources from my plugin's sub-directory. Is there
>> an available method to get a given plugin's root directory?
>> Thanks

You need to make sure that this resource is packaged into the plugin jar 
(just see how it's done in other plugins). Then you should be able to 
access it through the ClassLoader that loaded this plugin, e.g.

package a.b.c;

public class MyPlugin {
...
    InputStream is = MyPlugin.class.getResourceAsStream("myResource.txt");
...
}
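
For readers who want something they can compile as-is, here is a minimal, JDK-only sketch of the same lookup. The helper class ClasspathResourceReader is hypothetical, and String.class is used only because its class file is certain to be on the classpath; in a plugin the anchor would be the plugin class and the name would be a resource packaged next to it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ClasspathResourceReader {

    /**
     * Reads a resource resolved relative to the package of 'anchor',
     * i.e. the same lookup as anchor.getResourceAsStream(name).
     */
    static byte[] read(Class<?> anchor, String name) {
        try (InputStream in = anchor.getResourceAsStream(name)) {
            if (in == null) {
                // getResourceAsStream signals a missing resource with null, not an exception.
                throw new IllegalArgumentException("resource not found: " + name);
            }
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for MyPlugin.class and "myResource.txt".
        byte[] bytes = read(String.class, "String.class");
        System.out.println("read " + bytes.length + " bytes");
    }
}
```

The null check matters in practice: a resource that was not packaged into the jar shows up as a null stream, which is easy to miss until the first read fails.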

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How can I get one plugin's root dir

Posted by Scott Green <sm...@gmail.com>.
Can someone give an answer? I don't think it is a good idea to put all
configuration/resources under the "conf" dir.
