Posted to user@nutch.apache.org by Shaya Potter <sp...@gmail.com> on 2012/08/26 05:18:21 UTC

running main() in plugins?

I'm trying to run the main function in HtmlParser (just to test how 
Nutch's parser works compared to others) and I can't seem to figure out 
how to get it to run.

http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup

when I run it naively, I get an error

Exception in thread "main" java.lang.RuntimeException: org.apache.nutch.parse.HtmlParseFilter not found.
     at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)

in looking at HtmlParseFilters, I see that it throws the runtime 
exception if it can't find any HtmlParseFilter classes. however, I can't 
seem to figure out how to make it able to find them (I see the jars in 
the plugins dir, but do they have to be registered?). could the main() in 
HtmlParser ever work as is?

any pointers would be appreciated.

thanks.

Re: running main() in plugins?

Posted by Sourajit Basak <so...@gmail.com>.

If you wish to just check the parser, use this command

$ nutch parsechecker -dumpText <url>

This should work out of the box without any modification.


Re: running main() in plugins?

Posted by Ye T Thet <ye...@gmail.com>.

Shaya,

You might want to look at the contents of bin/nutch.

When I want to test the parser, I call

./bin/nutch org.apache.nutch.parse.ParserChecker http://domain.com/document.html

One point to note is that bin/nutch loads the configuration before calling the
classes. You would need the configuration set properly for most of the
commands to work.

Hope it helps,

Ye


Re: running main() in plugins?

Posted by Shaya Potter <sp...@gmail.com>.

quick test is obviously wrong (I edited it to get rid of cruft for my 
situation and forgot to change the return type)

make the return type String and everything's fine.


Re: running main() in plugins?

Posted by Shaya Potter <sp...@gmail.com>.

didn't know about Boilerpipe before, so I coded up a quick test (I think this 
is correct, but please point out errors if I'm doing something boneheaded)

static public JResult tika_extract(String htmltext) {
    InputStream input = new ByteArrayInputStream(htmltext.getBytes());
    BoilerpipeContentHandler handler =
        new BoilerpipeContentHandler(new BodyContentHandler());
    //BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try {
        new HtmlParser().parse(input, handler, metadata, new ParseContext());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (SAXException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (TikaException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    return handler.toTextDocument().getContent();
}

and it seems to give reasonable results (sometimes a little better than 
readability, sometimes a little worse), but the major problem is that it 
throws stack exceptions in cyberneko often enough (i.e. for many, if not 
all, of the nytimes pages I pointed it at).


RE: running main() in plugins?

Posted by Markus Jelsma <ma...@openindex.io>.

See: https://issues.apache.org/jira/browse/NUTCH-961

 
 

Re: running main() in plugins?

Posted by Shaya Potter <sp...@gmail.com>.

It could be that the "magic" (i.e. analysis) that Nutch is doing in the 
background gets rid of most of the cruft. I'm just playing around on my 
own trying to see how I can get the best text to analyze, and in many 
cases there's a lot of cruft, so I was wondering if Nutch did anything 
to remove said cruft (headers, footers, sidebars...).

what I'm doing now for my experiments is relatively heavyweight: 
I'm applying the readability algorithm to web pages before I index them 
into a lucene database.  probably not the best idea for nutch though.

With that said, if Nutch is doing more processing than a jsoup 
Document.text() operation, the question is why?  (some reasons might be 
obvious: metadata, getting outbound links)


Re: running main() in plugins?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Shaya,

Can you elaborate? The plugin has been around for a good while. If you
have suggestions for improving it, they are very welcome.

Thanks

On Sun, Aug 26, 2012 at 1:41 PM, Shaya Potter <sp...@gmail.com> wrote:
> ok, so it seems that Nutch isn't doing much different (at least from a
> smattering of tests I've done) than Jsoup's Document.text() ability (from
> what I can tell at least, perhaps only some issues with spacing between
> elements).
>
> On 08/26/2012 06:28 AM, Lewis John Mcgibbney wrote:
>>
>> You can easily run any plugin from the terminal using
>>
>> ./bin/nutch plugin
>>
>> in the case of the HtmlParser main() method you would want to do
>>
>> ./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile
>>
>> You have actually identified an improvement which we could do with
>> having in the main() method for this class e.g.
>>
>> 1) When the arguments are not correctly specified it should print a
>> usage message to stdout explaining the correct plugin usage, as with
>> more or less every other plugin. Currently we just get a nasty stack
>> trace like the following
>>
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
>> Caused by: java.io.FileNotFoundException: http:/www.trancearoundtheworld.com (No such file or directory)
>>         at java.io.FileInputStream.open(Native Method)
>>         at java.io.FileInputStream.<init>(FileInputStream.java:120)
>>         at org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
>>         ... 5 more
>>
>> 2) The plugin main method only enables you to parse local files. An
>> improvement would be to add functionality similar to the parserchecker,
>> as highlighted by Sourajit.
>>
>> If you wish to add these functions then please open a Jira issue, the
>> contribution would be great.
>>
>> Thanks
>>
>> Lewis
>>
>> On Sun, Aug 26, 2012 at 4:18 AM, Shaya Potter <sp...@gmail.com> wrote:
>>>
>>> I'm trying to run the main function in HtmlParser (just to see test how
>>> Nutch's parser works compared to others) and I can't see to figure out
>>> how
>>> to get it to run.
>>>
>>>
>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup
>>>
>>> when I run it naively, I get an error
>>>
>>> Exception in thread "main" java.lang.RuntimeException:
>>> org.apache.nutch.parse.HtmlParseFilter not found.
>>>      at
>>> org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)
>>>
>>> in looking at HtmlParseFilters, I see that it throws the runtime
>>> exception
>>> if it can't find any HtmlParseFilter classes, however, I can't seem to
>>> figure out how to make it able to find them (I see the jar's in the
>>> plugins
>>> dir, but do they have to be registered?  could the main() in HtmlParser
>>> ever
>>> work as is?
>>>
>>> any pointers would be appreciated.
>>>
>>> thanks.
>>>
>>
>>
>>
>



-- 
Lewis

Re: running main() in plugins?

Posted by Shaya Potter <sp...@gmail.com>.

OK, so from the smattering of tests I've done, Nutch doesn't seem to be
doing much differently from Jsoup's Document.text(). As far as I can
tell, the only differences are perhaps some issues with spacing between
elements.
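
To make "spacing between elements" concrete, here is a toy sketch of how two text extractors can disagree on the same markup. It is neither Nutch's nor Jsoup's algorithm; the class name TextSpacing and the regex-based stripping are assumptions made purely for illustration.

```java
// Hypothetical illustration only: a regex tag stripper, not a real HTML parser.
final class TextSpacing {

    /**
     * Removes every tag from the input, putting the given separator where
     * each tag used to be, then collapses runs of whitespace.
     */
    static String strip(String html, String separator) {
        return html.replaceAll("<[^>]*>", separator)
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String html = "<p>first</p><p>second</p>";
        // No separator: words from adjacent elements run together.
        System.out.println(strip(html, ""));
        // Space separator: adjacent elements stay apart.
        System.out.println(strip(html, " "));
    }
}
```

Two extractors that differ only in this choice would agree on the words but not on the whitespace, which would match what I saw.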

On 08/26/2012 06:28 AM, Lewis John Mcgibbney wrote:
> You can easily run any plugin from the terminal using
>
> ./bin/nutch plugin
>
> in the case of the HtmlParser main() method you would want to do
>
> ./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile
>
> You have actually identified an improvement which we could do with
> having in the main() method for this class e.g.
>
> 1) When the arguments are not correctly specified, it should print a
> usage message to stdout explaining the correct plugin usage, as more
> or less every other plugin does. Currently we just get a nasty stack
> trace like the following:
>
> Exception in thread "main" java.lang.reflect.InvocationTargetException
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
> Caused by: java.io.FileNotFoundException: http:/www.trancearoundtheworld.com (No such file or directory)
> 	at java.io.FileInputStream.open(Native Method)
> 	at java.io.FileInputStream.<init>(FileInputStream.java:120)
> 	at org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
> 	... 5 more
>
> 2) The plugin main method only enables you to parse local files. An
> improvement would be to add functionality similar to the parsechecker
> command, as highlighted by Sourajit.
>
> If you wish to add these functions then please open a Jira issue, the
> contribution would be great.
>
> Thanks
>
> Lewis

Re: running main() in plugins?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

You can run any plugin from the terminal using

./bin/nutch plugin

In the case of the HtmlParser main() method you would want to do

./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile
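
Judging from the stack trace in this message, the plugin command reaches main() through java.lang.reflect.Method.invoke, which is why a failure inside main() surfaces wrapped in an InvocationTargetException. The sketch below shows that general mechanism with standard Java reflection only. ReflectiveLauncher and FailingTool are hypothetical names, and this is not the actual PluginRepository code.

```java
import java.io.FileNotFoundException;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Hypothetical launcher: calls a class's static main(String[]) by name.
final class ReflectiveLauncher {

    /** Returns the exception thrown by the target's main(), or null on success. */
    static Throwable run(String className, String[] args)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        Method main = cls.getMethod("main", String[].class);
        try {
            // The cast stops the array from being spread as varargs.
            main.invoke(null, (Object) args);
            return null;
        } catch (InvocationTargetException e) {
            // Whatever main() threw is carried as the cause.
            return e.getCause();
        }
    }

    public static void main(String[] args) throws Exception {
        Throwable cause = run("FailingTool", new String[] {"http:/example.invalid"});
        System.out.println("unwrapped cause: " + cause);
    }
}

// Hypothetical stand-in for a plugin class whose main() fails on a bad path.
final class FailingTool {
    public static void main(String[] args) throws Exception {
        throw new FileNotFoundException(args[0] + " (No such file or directory)");
    }
}
```

A launcher that does not unwrap the cause prints the reflection frames first, which is the shape of the trace shown in this message.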

You have actually identified an improvement that the main() method of
this class could do with, e.g.

1) When the arguments are not correctly specified, it should print a
usage message to stdout explaining the correct plugin usage, as more
or less every other plugin does. Currently we just get a nasty stack
trace like the following:

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.io.FileNotFoundException: http:/www.trancearoundtheworld.com (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:120)
	at org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
	... 5 more

2) The plugin main method only enables you to parse local files. An
improvement would be to add functionality similar to the parsechecker
command, as highlighted by Sourajit.
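
A rough sketch of what the two points above could look like in a main() method: check the arguments first, then decide whether the single argument is a URL or a local path. The class name ParserCli, the usage text and the Kind enum are all hypothetical. Nothing here is taken from HtmlParser itself, and the actual fetching and parsing is left out.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical argument handling; not the real HtmlParser.main().
final class ParserCli {

    // Assumed usage text, for illustration only.
    static final String USAGE = "Usage: HtmlParser <localFile | url>";

    enum Kind { USAGE_ERROR, URL, LOCAL_FILE }

    /** Classifies the command line without touching the file system or network. */
    static Kind classify(String[] args) {
        if (args == null || args.length != 1 || args[0].isEmpty()) {
            return Kind.USAGE_ERROR;
        }
        try {
            String scheme = new URI(args[0]).getScheme();
            if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                return Kind.URL;
            }
        } catch (URISyntaxException e) {
            // Not a valid URI (e.g. a path with spaces): treat it as a local path.
        }
        return Kind.LOCAL_FILE;
    }

    public static void main(String[] args) {
        Kind kind = classify(args);
        if (kind == Kind.USAGE_ERROR) {
            // stderr is used here; System.out works too if stdout is preferred.
            System.err.println(USAGE);
            System.exit(1);
        }
        System.out.println("Would parse " + args[0] + " as " + kind);
    }
}
```

With a check like this, passing a URL would no longer end in a FileNotFoundException from FileInputStream.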

If you wish to add these functions, then please open a Jira issue; the
contribution would be great.

Thanks

Lewis

On Sun, Aug 26, 2012 at 4:18 AM, Shaya Potter <sp...@gmail.com> wrote:
> I'm trying to run the main function in HtmlParser (just to test how
> Nutch's parser works compared to others) and I can't seem to figure out how
> to get it to run.
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup
>
> when I run it naively, I get an error
>
> Exception in thread "main" java.lang.RuntimeException:
> org.apache.nutch.parse.HtmlParseFilter not found.
>     at
> org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)
>
> In looking at HtmlParseFilters, I see that it throws the runtime exception
> if it can't find any HtmlParseFilter classes. However, I can't seem to
> figure out how to make it able to find them. (I see the jars in the plugins
> dir, but do they have to be registered?) Could the main() in HtmlParser ever
> work as is?
>
> any pointers would be appreciated.
>
> thanks.
>



-- 
Lewis