Posted to user@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2010/08/02 14:10:38 UTC

Re: For HTML - is parse-html twice as fast as parse-tika

Hi Brad,

Could you run and measure the parser independently of the fetching? That
would remove any possible side effect due to caching, network issues etc...

All you need to do is remove the subdirectories parse_text, parse_data and
crawl_parse, then run: nutch parse
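
A minimal sketch of the clean-up step described above, assuming a segment
directory that contains the three subdirectories named in this message. The
helper name clear_parse_output and the segment path are hypothetical; the
sketch only deletes the old parse output, and the parse job itself would
still be started with the nutch parse command.

```python
import shutil
from pathlib import Path

# Subdirectories named in the message above; they hold the output of a previous parse.
PARSE_DIRS = ("parse_text", "parse_data", "crawl_parse")


def clear_parse_output(segment_dir):
    """Delete the parse output of one segment and return the names removed.

    `segment_dir` is a hypothetical path to a single segment directory.
    """
    removed = []
    for name in PARSE_DIRS:
        target = Path(segment_dir) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed
```

Any other subdirectory of the segment, such as the fetched content, is left
untouched, so the parse can be repeated with a different parser
configuration on exactly the same data.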

Thanks

Julien

PS: regarding parse-html being phased out : see Andrzej's JIRA from this
morning


On 31 July 2010 22:43, brad <br...@bcs-mail.net> wrote:

> > I have been experiencing some performance issues with Tika and general
> > parsing
> > (see Parsing Performance - related to Java concurrency issue)
> >
> > Ken pointed out that both Tika and Nutch HtmlParser show up in my
> > jstack list using the delivered configuration.
> >
> > Julien suggested checking parsing with only parse-tika (html) and then
> > with parse-html.
> >
> > So here is what I did.
> >
> > Option 1) parse-tika
> >           parse-(rss|text|js|tika)
> >           parse-plugin.xml as delivered
> >           tika-mimetypes.xml as delivered
> >
> > Option 2) parse-html
> >           parse-(rss|text|html|js|tika)
> >           parse-plugin.xml turned ON <plugin id="parse-html" />
> >           tika-mimetypes.xml commented out <mime-type type="text/html">
> >
> > Using the same generated crawl, ran fetch with parse for each of the
> > options for 2 hours.
> > All other configurations and settings are identical
> >
> > Results:
> > Parse-tika
> > INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
> >
> > Parse-html
> > INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
> >
> >
> > The results:
> > Parse-html processed 116% more pages than parse-tika for HTML over the
> > same period of time and the same URLs.
> >
> > The error rate was about the same: parse-html 3%, parse-tika 3.3%.
> > Most of the errors are read timeouts.
> >
> >
> > So is parse-html better?  It appears to be faster.  But, is the data as
> > good?
> > Other considerations?  Is parse-html really going to be phased out?
> >
> > Brad
> >
> >
> >
>



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

RE: For HTML - is parse-html twice as fast as parse-tika

Posted by brad <br...@bcs-mail.net>.

Thanks Torsten,
That may help.  It actually makes some sense, since the escaped period is
actually what we are looking for.  "\." tells the regex processor to match
only a period, whereas "." tells the regex processor to match any
single character.
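
To illustrate the difference, here is a small sketch that uses Python's re
module as a stand-in for java.util.regex; both treat an unescaped "." as
"any single character" and "\." as a literal period. The lookalike host name
is made up for the example. This only shows the matching semantics; it does
not reproduce or explain the hang itself.

```python
import re

unescaped = re.compile(r"http://www.mydomain.local/")
escaped = re.compile(r"http://www\.mydomain\.local/")

url = "http://www.mydomain.local/"
lookalike = "http://wwwXmydomainYlocal/"  # made-up URL with no periods in the host

# Both patterns accept the intended URL.
assert unescaped.match(url) and escaped.match(url)

# Only the unescaped pattern also accepts the lookalike,
# because an unescaped "." matches any single character.
assert unescaped.match(lookalike) is not None
assert escaped.match(lookalike) is None
```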

Thanks!
Brad

-----Original Message-----
From: Torsten Krah [mailto:tkrah@fachschaft.imn.htwk-leipzig.de] 
Sent: Tuesday, August 03, 2010 5:20 AM
To: user@nutch.apache.org
Cc: brad
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

On Monday, 2 August 2010, at 20:14:32, brad wrote:
>  I do have about 10
> entries in the regex-urlfilter.txt file, but they are mainly to
> exclude sites.  For Example:

I have this problem too with 1.1: Nutch often hangs at util.regexp...
forever.
It hangs if I just use (in the regexfilter property files) something like:

http://www.mydomain.local/

If I change this to:

http://www\.mydomain\.local/

it works. I have no clue why I have to escape the "." to be a period, as
"." should match the period too. However, for me it solved this annoying hang
in java.util pattern matching. Maybe you can give this a try - maybe it
helps, maybe not :-).

You can find out which regex Nutch hangs on if you override the extension
point or the plugin code and add a debugging line just before the match
call; then find another regex which matches and does not hang ;-).

Torsten


--
Please do not send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.de.html

"Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect."
	-- Linus Torvalds

Re: For HTML - is parse-html twice as fast as parse-tika

Posted by Julien Nioche <li...@gmail.com>.

The syntax is very similar indeed. urlfilter-automaton uses an FSA
(finite-state automaton) library.

See http://weblogs.java.net/blog/2006/03/27/faster-java-regex-package

On 3 August 2010 16:07, brad <br...@bcs-mail.net> wrote:

> Hi Julien,
> I don't mean to sound dumb on this, but what is the difference between
> automaton-urlfilter.txt and regex-urlfilter.txt?
>
> When I look at the files they seem like they have the same default content.
>
> A google search didn't turn up much...
>
> Is there some documentation I missed somewhere?
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> Sent: Tuesday, August 03, 2010 7:22 AM
> To: user@nutch.apache.org; tkrah@fachschaft.imn.htwk-leipzig.de
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Why not use urlfilter-automaton instead? It is much faster than the regex
> one.
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

RE: For HTML - is parse-html twice as fast as parse-tika

Posted by brad <br...@bcs-mail.net>.

Hi Julien,
I don't mean to sound dumb on this, but what is the difference between
automaton-urlfilter.txt and regex-urlfilter.txt?

When I look at the files they seem like they have the same default content.

A google search didn't turn up much...

Is there some documentation I missed somewhere?

Thanks
Brad
 

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com] 
Sent: Tuesday, August 03, 2010 7:22 AM
To: user@nutch.apache.org; tkrah@fachschaft.imn.htwk-leipzig.de
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

Why not use urlfilter-automaton instead? It is much faster than the regex
one.


Re: For HTML - is parse-html twice as fast as parse-tika

Posted by Julien Nioche <li...@gmail.com>.

Why not use urlfilter-automaton instead? It is much faster than the regex
one.




-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

RE: For HTML - is parse-html twice as fast as parse-tika

Posted by brad <br...@bcs-mail.net>.

Thanks, I'll check it out! 


Re: For HTML - is parse-html twice as fast as parse-tika

Posted by Torsten Krah <tk...@fachschaft.imn.htwk-leipzig.de>.

On Monday, 2 August 2010, at 20:14:32, brad wrote:
>  I do have about 10
> entries in the regex-urlfilter.txt file, but they are mainly to exclude
> sites.  For Example:

I have this problem too with 1.1: Nutch often hangs at util.regexp...
forever.
It hangs if I just use (in the regexfilter property files) something like:

http://www.mydomain.local/

If I change this to:

http://www\.mydomain\.local/

it works. I have no clue why I have to escape the "." to be a period, as
"." should match the period too. However, for me it solved this annoying hang
in java.util pattern matching. Maybe you can give this a try - maybe it
helps, maybe not :-).

You can find out which regex Nutch hangs on if you override the extension
point or the plugin code and add a debugging line just before the match
call; then find another regex which matches and does not hang ;-).

Torsten


-- 
Please do not send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.de.html

"Really, I'm not out to destroy Microsoft. That will just be a
completely unintentional side effect."
	-- Linus Torvalds

RE: For HTML - is parse-html twice as fast as parse-tika

Posted by brad <br...@bcs-mail.net>.

Thanks, I had tried that multiple times, and the majority of the time it is
stuck at:

"Thread-11" prio=10 tid=0x00002aabd8023000 nid=0x62ef runnable [0x00000000420d8000..0x00000000420d8c10]
   java.lang.Thread.State: RUNNABLE
	at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
	at java.util.regex.Pattern$Curly.match0(Pattern.java:3787)
	at java.util.regex.Pattern$Curly.match(Pattern.java:3761)
	at java.util.regex.Pattern$Start.match(Pattern.java:3072)
	at java.util.regex.Matcher.search(Matcher.java:1116)
	at java.util.regex.Matcher.find(Matcher.java:552)
	at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:90)
	at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:117)
	- locked <0x00002aaaf32f93d8> (a org.apache.nutch.urlfilter.regex.RegexURLFilter)
	at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88)
	at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:220)
	at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:115)
	at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96)
	at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70)
	at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
	at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:42)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)


I'm not sure how to improve this.  I have about 10 entries in the
regex-urlfilter.txt file, but they are mainly to exclude sites.  For example:
-^http://([a-z0-9\-A-Z]*\.)*twitter.com
-^http://([a-z0-9\-A-Z]*\.)*facebook.com
Or to exclude extensions:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-\.(js|JS|mp3|MP3|mp4|MP4|wav|WAV|mov|MOV|z|Z|tar|TAR|avi|AVI|rar|RAR|jar|JAR|ps|PS|eps|EPS|css|CSS|wmv|WMV|flv|FLV|dmg|DMG|img|IMG|swf|SWF|msi|MSI|wvx|WVX)$

I would have used prefix-urlfilter.txt and suffix-urlfilter.txt, but I
haven't found any documentation on how they work...
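
Since the extension rules above only test how a URL ends, the same check can
be expressed without a regular expression. The following is a sketch of that
idea only, written in Python for brevity. It is not how Nutch's
suffix-urlfilter.txt is configured or implemented (that is not documented
anywhere in this thread), and the function name and the shortened extension
list are assumptions made for illustration.

```python
# A shortened, hypothetical subset of the extensions listed in the rules above.
EXCLUDED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".zip", ".js", ".mp3")


def has_excluded_suffix(url):
    """Return True when the URL ends with one of the excluded extensions."""
    # Lower-casing covers both spellings used in the regex rules (gif|GIF).
    # It also accepts mixed case such as .Gif, which those rules would not.
    return url.lower().endswith(EXCLUDED_SUFFIXES)
```

A plain suffix comparison does a bounded amount of work per URL, which is
the appeal of a suffix filter over a general regex for this kind of rule.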

Brad

 

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com] 
Sent: Monday, August 02, 2010 10:39 AM
To: user@nutch.apache.org
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

Hi,

Try calling jstack on the pid of the task to have a better idea of what it
is doing. My bet is on the normalisation of some long URLs taking ages but
it could be a lot of other things

J.

On 2 August 2010 17:26, brad <br...@bcs-mail.net> wrote:

> Hi Julien,
> I'll see if I can give a try later this week.
>
> I'm having a problem in the mapred.LocalJobRunner - reduce > reduce 
> portion right after the actual URL fetch/parse portion is complete.  I 
> don't know how long it is supposed to take for this portion to 
> complete, but I have had fetches run for 12 hours and map-reduce 
> portion run for 36 hours and still not be complete.  I ended up 
> killing the job.
>
> Right now, I'm running a fetch on 1 million URLs.  The parse and fetch 
> portion took less than 7 hours, but the map-reduce has been running 
> for 11 hours now and I'm going to wait and see if it completes.
>
> It started once fetcher.Fetcher completed:
> 2010-08-01 22:06:43,479 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2010-08-01 22:06:44,368 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2010-08-01 22:06:44,369 INFO  fetcher.Fetcher - -activeThreads=0
> 2010-08-01 22:06:44,369 INFO  mapred.MapTask - Starting flush of map output
> 2010-08-01 22:06:45,129 INFO  mapred.LocalJobRunner - 0 threads, 853809 pages, 18772 errors, 35.4 pages/s, 16989 kb/s,
>
> The issue appears to start with
> 2010-08-01 23:22:22,174 INFO  mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 31012166567 bytes
>
> Now the process has been cycling on for 10 hours:
> INFO  mapred.LocalJobRunner - reduce > reduce
>
> I'm running Nutch on a single server.
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> Sent: Monday, August 02, 2010 5:11 AM
> To: user@nutch.apache.org
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching?
> That would remove any possible side effect due to caching, network issues
> etc...
>
> All you need to do is remove the subdirectories parse_text, parse_data 
> and crawl_parse then run : nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from 
> this morning
>
>
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering 
> http://www.digitalpebble.com
>
>


--
DigitalPebble Ltd

Open Source Solutions for Text Engineering http://www.digitalpebble.com

Re: For HTML - is parse-html twice as fast as parse-tika

Posted by Julien Nioche <li...@gmail.com>.

Hi,

Try calling jstack on the pid of the task to have a better idea of what it
is doing. My bet is on the normalisation of some long URLs taking ages but
it could be a lot of other things

J.

On 2 August 2010 17:26, brad <br...@bcs-mail.net> wrote:

> Hi Julien,
> I'll see if I can give a try later this week.
>
> I'm having a problem in the mapred.LocalJobRunner - reduce > reduce portion
> right after the actual URL fetch/parse portion is complete.  I don't know
> how long it is supposed to take for this portion to complete, but I have
> had
> fetches run for 12 hours and map-reduce portion run for 36 hours and still
> not be complete.  I ended up killing the job.
>
> Right now, I'm running a fetch on 1 million URLs.  The parse and fetch
> portion took less than 7 hours, but the map-reduce has been running for 11
> hours now and I'm going to wait and see if it completes.
>
> It started once fetcher.Fetcher completed:
> 2010-08-01 22:06:43,479 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2010-08-01 22:06:44,368 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2010-08-01 22:06:44,369 INFO  fetcher.Fetcher - -activeThreads=0
> 2010-08-01 22:06:44,369 INFO  mapred.MapTask - Starting flush of map output
> 2010-08-01 22:06:45,129 INFO  mapred.LocalJobRunner - 0 threads, 853809 pages, 18772 errors, 35.4 pages/s, 16989 kb/s,
>
> The issue appears to start with
> 2010-08-01 23:22:22,174 INFO  mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 31012166567 bytes
>
> Now the process has been cycling on for 10 hours:
> INFO  mapred.LocalJobRunner - reduce > reduce
>
> I'm running Nutch on a single server.
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> Sent: Monday, August 02, 2010 5:11 AM
> To: user@nutch.apache.org
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching? That
> would remove any possible side effect due to caching, network issues etc...
>
> All you need to do is remove the subdirectories parse_text, parse_data and
> crawl_parse then run : nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from this
> morning
>
>
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering http://www.digitalpebble.com
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

Re: For HTML - is parse-html twice as fast as parse-tika

Posted by Julien Nioche <li...@gmail.com>.

Hi Brad,

Thanks for sharing this. It would be interesting to profile the parsing and
have a better idea of what makes such a difference. Could it be the
detection of the encoding for instance?

Jul


On 18 August 2010 17:48, brad <br...@bcs-mail.net> wrote:

> I finally had a chance to test the Nutch HTML parsing without fetching,
> per Julien's suggestion.  The results were pretty much the same as my
> previous tests:
>
>                    parse-html    Tika-html
> Elapsed Time:      04:21:47      08:55:57
> Parse (Success):   150,634       150,615
> Parse (failed):    3,788         3,807
>
> So, based on this test, parse-html is a little more than twice as fast as
> Tika's HTML parsing.
>
> This was done on Linux CentOS 5.5, 8 GB RAM, Intel Xeon CPU X3220 @ 2.40GHz
> Only Nutch-related processes were running on the server
> Nutch 1.2 - which now has the nice timings feature!
>
> The data was retrieved using:
> bin/nutch fetch <segment> -noParsing -threads 200
>
> All data was parsed using:
> bin/nutch parse <segment> -threads 200
>
> Brad
>
>
> -----Original Message-----
> From: Ken Krugler [mailto:kkrugler_lists@transpac.com]
> Sent: Wednesday, August 11, 2010 2:20 PM
> To: user@nutch.apache.org
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> On Aug 2, 2010, at 9:26am, brad wrote:
>
> > Hi Julien,
> > I'll see if I can give a try later this week.
>
> [snip]
>
> Were you able to try the parse-only approach that Julien suggested below?
>
> I'm asking because (a) I do a fair amount of work with/on the Tika HTML
> parsing support, and (b) I've also run into surprisingly slow parse
> performance with Tika, though I didn't compare to Nutch's older parser (or
> using NekoHTML instead of TagSoup).
>
> Thanks,
>
> -- Ken
>
>
> > -----Original Message-----
> > From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> > Sent: Monday, August 02, 2010 5:11 AM
> > To: user@nutch.apache.org
> > Subject: Re: For HTML - is parse-html twice as fast as parse-tika
> >
> > Hi Brad,
> >
> > Could you run and measure the parser independently of the fetching?
> > That
> > would remove any possible side effect due to caching, network issues
> > etc...
> >
> > All you need to do is remove the subdirectories parse_text, parse_data
> > and crawl_parse then run : nutch parse
> >
> > Thanks
> >
> > Julien
> >
> > PS: regarding parse-html being phased out : see Andrzej's JIRA from
> > this morning
> >
> >
> >
> >
> >
> > --
> > DigitalPebble Ltd
> >
> > Open Source Solutions for Text Engineering http://
> > www.digitalpebble.com
> >
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

RE: For HTML - is parse-html twice as fast as parse-tika

Posted by brad <br...@bcs-mail.net>.

I finally had a chance to test the Nutch HTML parsing without fetching,
per Julien's suggestion.  The results were pretty much the same as my previous
tests:

                   parse-html    Tika-html
Elapsed Time:      04:21:47      08:55:57
Parse (Success):   150,634       150,615
Parse (failed):    3,788         3,807

So, based on this test, parse-html is a little more than twice as fast as
Tika's HTML parsing.
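
The "a little more than twice as fast" figure can be checked from the
elapsed times in the table, and compared with the page counts from the
earlier 2-hour fetch-with-parse runs reported in this thread (433738 pages
for parse-html, 200370 for parse-tika). A quick sketch of the arithmetic;
the helper name to_seconds is only for illustration.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


parse_html = to_seconds("04:21:47")  # 15707 seconds
tika_html = to_seconds("08:55:57")   # 32157 seconds

# Parse-only test: ratio of elapsed times over nearly the same number of documents.
speedup = tika_html / parse_html

# Earlier fetch-with-parse runs: ratio of pages processed in the same 2 hours.
earlier_ratio = 433738 / 200370

print(f"parse-only run:        {speedup:.2f}x")        # 2.05x
print(f"fetch-with-parse runs: {earlier_ratio:.2f}x")  # 2.16x, i.e. about 116% more pages
```

Both runs point to roughly the same factor of two, which suggests the
earlier result was not mainly an artifact of network or caching effects.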

This was done on Linux CentOS 5.5, 8 GB RAM, Intel Xeon CPU X3220 @ 2.40GHz
Only Nutch-related processes were running on the server
Nutch 1.2 - which now has the nice timings feature!

The data was retrieved using:
bin/nutch fetch <segment> -noParsing -threads 200

All data was parsed using:
bin/nutch parse <segment> -threads 200

Brad
 

-----Original Message-----
From: Ken Krugler [mailto:kkrugler_lists@transpac.com] 
Sent: Wednesday, August 11, 2010 2:20 PM
To: user@nutch.apache.org
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

Hi Brad,

On Aug 2, 2010, at 9:26am, brad wrote:

> Hi Julien,
> I'll see if I can give a try later this week.

[snip]

Were you able to try the parse-only approach that Julien suggested below?

I'm asking because (a) I do a fair amount of work with/on the Tika HTML
parsing support, and (b) I've also run into surprisingly slow parse
performance with Tika, though I didn't compare to Nutch's older parser (or
using NekoHTML instead of TagSoup).

Thanks,

-- Ken


> -----Original Message-----
> From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> Sent: Monday, August 02, 2010 5:11 AM
> To: user@nutch.apache.org
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching?  
> That
> would remove any possible side effect due to caching, network issues 
> etc...
>
> All you need to do is remove the subdirectories parse_text, parse_data 
> and crawl_parse then run : nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from 
> this morning
>
>
> On 31 July 2010 22:43, brad <br...@bcs-mail.net> wrote:
>
>>> I have been experiencing some performance issues with Tika and 
>>> general parsing (see Parsing Performance - related to Java 
>>> concurrency issue)
>>>
>>> Ken pointed out that both the Tika and Nutch HtmlParser show up in
>>> my jstack list using the delivered configuration.
>>>
>>> Julien suggested checking parsing with only parse-tika (html) and
>>> then with parse-html.
>>>
>>> So here is what I did.
>>>
>>> Option 1) parse-tika
>>>          parse-(rss|text|js|tika)
>>>          parse-plugin.xml as delivered
>>>          tika-mimetypes.xml as delivered
>>>
>>> Option 2) parse-html
>>>          parse-(rss|text|html|js|tika)
>>>          parse-plugin.xml turned ON <plugin id="parse-html" />
>>>          tika-mimetypes.xml commented out <mime-type type="text/html">
>>>
>>> Using the same generated crawl, ran fetch with parse for each of the
>>> options for 2 hours.
>>> All other configurations and settings are identical
>>>
>>> Results:
>>> Parse-tika
>>> INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
>>>
>>> Parse-html
>>> INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
>>>
>>>
>>> The results:
>>> Parse-html is 116% faster than parse-tika for html for the same 
>>> period of time and same URLs
>>>
>>> The error rate was about the same parse-html 3%, parse-tika 3.3% 
>>> Most of the errors are read timeouts
>>>
>>>
>>> So is parse-html better?  It appears to be faster.  But, is the data 
>>> as good?
>>> Other considerations?  Is parse-html really going to be phased out?
>>>
>>> Brad
>>>
>>>
>>>
>>
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering http://www.digitalpebble.com
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

RE: For HTML - is parse-html twice as fast as parse-tika

Posted by brad <br...@bcs-mail.net>.

Hi Ken,
I haven't had a chance yet.  I'm working on some compression issues.  I'll
put it on my calendar for next week.

Even though the results may not have been as accurate because the parse
timing included the fetch, I felt pretty comfortable with the numbers.  I
switched my configuration from the Tika HTML parser to the Nutch HTML parser
and all of the fetch/parse runs have been faster.  I have also replaced Tika's
commons-compress-1.0.jar with the pre-release commons-compress-1.1.jar, which
has helped.

Brad



Re: For HTML - is parse-html twice as fast as parse-tika

Posted by Ken Krugler <kk...@transpac.com>.

Hi Brad,

On Aug 2, 2010, at 9:26am, brad wrote:

> Hi Julien,
> I'll see if I can give a try later this week.

[snip]

Were you able to try the parse-only approach that Julien suggested  
below?

I'm asking because (a) I do a fair amount of work with/on the Tika  
HTML parsing support, and (b) I've also run into surprisingly slow  
parse performance with Tika, though I didn't compare to Nutch's older  
parser (or using NekoHTML instead of TagSoup).

Thanks,

-- Ken


> -----Original Message-----
> From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> Sent: Monday, August 02, 2010 5:11 AM
> To: user@nutch.apache.org
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching?
> That would remove any possible side effect due to caching, network
> issues etc...
>
> All you need to do is remove the subdirectories parse_text, parse_data
> and crawl_parse then run : nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from
> this morning
>
>
> On 31 July 2010 22:43, brad <br...@bcs-mail.net> wrote:
>
>>> I have been experiencing some performance issues with Tika and
>>> general parsing (see Parsing Performance - related to Java
>>> concurrency issue)
>>>
>>> Ken pointed out that both the Tika and Nutch HtmlParser show up in
>>> my jstack list using the delivered configuration.
>>>
>>> Julien suggested checking parsing with only parse-tika (html) and
>>> then with parse-html.
>>>
>>> So here is what I did.
>>>
>>> Option 1) parse-tika
>>>          parse-(rss|text|js|tika)
>>>          parse-plugin.xml as delivered
>>>          tika-mimetypes.xml as delivered
>>>
>>> Option 2) parse-html
>>>          parse-(rss|text|html|js|tika)
>>>          parse-plugin.xml turned ON <plugin id="parse-html" />
>>>          tika-mimetypes.xml commented out <mime-type type="text/html">
>>>
>>> Using the same generated crawl, ran fetch with parse for each of the
>>> options for 2 hours.
>>> All other configurations and settings are identical
>>>
>>> Results:
>>> Parse-tika
>>> INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
>>>
>>> Parse-html
>>> INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
>>>
>>>
>>> The results:
>>> Parse-html is 116% faster than parse-tika for html for the same
>>> period of time and same URLs
>>>
>>> The error rate was about the same parse-html 3%, parse-tika 3.3%
>>> Most of the errors are read timeouts
>>>
>>>
>>> So is parse-html better?  It appears to be faster.  But, is the data
>>> as good?
>>> Other considerations?  Is parse-html really going to be phased out?
>>>
>>> Brad
>>>
>>>
>>>
>>
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering http://www.digitalpebble.com
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

RE: For HTML - is parse-html twice as fast as parse-tika

Posted by brad <br...@bcs-mail.net>.

Hi Julien,
I'll see if I can give it a try later this week.

I'm having a problem in the mapred.LocalJobRunner - reduce > reduce portion
right after the actual URL fetch/parse portion is complete.  I don't know
how long this portion is supposed to take, but I have had fetches run for
12 hours and the map-reduce portion then run for 36 hours without
completing.  I ended up killing the job.

Right now, I'm running a fetch on 1 million URLs.  The parse and fetch
portion took less than 7 hours, but the map-reduce has been running for 11
hours now and I'm going to wait and see if it completes.

It started when fetcher.Fetcher completed:
2010-08-01 22:06:43,479 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-08-01 22:06:44,368 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-08-01 22:06:44,369 INFO  fetcher.Fetcher - -activeThreads=0
2010-08-01 22:06:44,369 INFO  mapred.MapTask - Starting flush of map output
2010-08-01 22:06:45,129 INFO  mapred.LocalJobRunner - 0 threads, 853809 pages, 18772 errors, 35.4 pages/s, 16989 kb/s,

The issue appears to start with:
2010-08-01 23:22:22,174 INFO  mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 31012166567 bytes

The process has now been repeating this line for 10 hours:
INFO  mapred.LocalJobRunner - reduce > reduce

I'm running Nutch on a single server.
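
In case it helps when comparing runs, here is a rough sketch that pulls the counters out of the LocalJobRunner status line shown above and derives an error rate as errors divided by pages. The regular expression and the parse_status name are my own assumptions based on the format of that one line, not anything provided by Nutch.

```python
import re

# Matches the counters in a fetcher status line; \s+ tolerates mail line wraps.
STATUS_RE = re.compile(
    r"(?P<threads>\d+)\s+threads,\s+"
    r"(?P<pages>\d+)\s+pages,\s+"
    r"(?P<errors>\d+)\s+errors,\s+"
    r"(?P<pages_per_s>[\d.]+)\s+pages/s,\s+"
    r"(?P<kb_per_s>[\d.]+)\s+kb/s"
)


def parse_status(line: str) -> dict:
    """Extract the counters from a status line and add an error percentage."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError("not a fetcher status line")
    stats = {
        "threads": int(match["threads"]),
        "pages": int(match["pages"]),
        "errors": int(match["errors"]),
        "pages_per_s": float(match["pages_per_s"]),
        "kb_per_s": float(match["kb_per_s"]),
    }
    pages = stats["pages"]
    stats["error_rate_pct"] = 100.0 * stats["errors"] / pages if pages else 0.0
    return stats


line = (
    "2010-08-01 22:06:45,129 INFO  mapred.LocalJobRunner - 0 threads, 853809\n"
    "pages, 18772 errors, 35.4 pages/s, 16989 kb/s, "
)
stats = parse_status(line)
print(f"{stats['pages']} pages, {stats['error_rate_pct']:.1f}% errors")
```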

Thanks
Brad


-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com] 
Sent: Monday, August 02, 2010 5:11 AM
To: user@nutch.apache.org
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

Hi Brad,

Could you run and measure the parser independently of the fetching? That
would remove any possible side effect due to caching, network issues etc...

All you need to do is remove the subdirectories parse_text, parse_data and
crawl_parse then run : nutch parse

Thanks

Julien

PS: regarding parse-html being phased out : see Andrzej's JIRA from this
morning


On 31 July 2010 22:43, brad <br...@bcs-mail.net> wrote:

> > I have been experiencing some performance issues with Tika and 
> > general parsing (see Parsing Performance - related to Java 
> > concurrency issue)
> >
> > Ken pointed out that both the Tika and Nutch HtmlParser show up in
> > my jstack list using the delivered configuration.
> >
> > Julien suggested checking parsing with only parse-tika (html) and
> > then with parse-html.
> >
> > So here is what I did.
> >
> > Option 1) parse-tika
> >           parse-(rss|text|js|tika)
> >           parse-plugin.xml as delivered
> >           tika-mimetypes.xml as delivered
> >
> > Option 2) parse-html
> >           parse-(rss|text|html|js|tika)
> >           parse-plugin.xml turned ON <plugin id="parse-html" />
> >           tika-mimetypes.xml commented out <mime-type type="text/html">
> >
> > Using the same generated crawl, ran fetch with parse for each of the
> > options for 2 hours.
> > All other configurations and settings are identical
> >
> > Results:
> > Parse-tika
> > INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
> >
> > Parse-html
> > INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
> >
> >
> > The results:
> > Parse-html is 116% faster than parse-tika for html for the same 
> > period of time and same URLs
> >
> > The error rate was about the same parse-html 3%, parse-tika 3.3% 
> > Most of the errors are read timeouts
> >
> >
> > So is parse-html better?  It appears to be faster.  But, is the data 
> > as good?
> > Other considerations?  Is parse-html really going to be phased out?
> >
> > Brad
> >
> >
> >
>



--
DigitalPebble Ltd

Open Source Solutions for Text Engineering http://www.digitalpebble.com