Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/06/08 23:13:47 UTC

[Nutch Wiki] Update of "LanguageIdentifierBenchs" by JeromeCharron

The following page has been changed by JeromeCharron:
http://wiki.apache.org/nutch/LanguageIdentifierBenchs

The comment on the change is:
New performance results + precision results

------------------------------------------------------------------------------
  == Introduction ==
  
- This page provides some performance benchmarks (not precision) of the LanguageIdentifierPlugin between the ''old'' (previous) version and the ''new'' (configurable) version (see NewLanguageIdentifier for more details).
+ This page provides some performance (code speed) and precision (identification accuracy) benchmarks of the LanguageIdentifierPlugin. These benchmarks were produced by analyzing results from the previous version (nutch-0.7-dev) and the patches [http://issues.apache.org/jira/secure/attachment/20236/NUTCH-60-050526.patch NUTCH-60-050526.patch] and NUTCH-60-050607.patch (see NewLanguageIdentifier for more details).
  
- These data can be usefull if you want to contribute in increasing the LanguageIdentifierPlugin performances, or if you want to tune precisely your ["Nutch"] configuration.
+ These data can be useful if you want to help improve the performance and/or precision of the LanguageIdentifierPlugin, or if you want to fine-tune your ["Nutch"] configuration.
  
- == Data set ==
+ == Performance ==
  
- These benchmarks were produced by testing the LanguageIdentifierPlugin on a set of 492 french files representing a total size of 171,3 Mo. These files were extracted from the ''[http://people.csail.mit.edu/koehn/publications/europarl/ European Parliament Proceedings Parallel Corpus 1996-2003 Release v2]''.
+ === Data set ===
  
- == Raw results ==
+ These ''performance'' benchmarks were produced by testing the LanguageIdentifierPlugin on a set of 492 French files with a total size of 171.3 MB. These files were extracted from the ''[http://people.csail.mit.edu/koehn/publications/europarl/ European Parliament Proceedings Parallel Corpus 1996-2003 Release v2]''.
  
- The following matrix shows the LanguageIdentifierPlugin processing time in ''ms'' for different configurations.
- The ''Data Size'' row is the size of data in bytes used in each file to perform the identification (please notice that each test case reported in this matrix returns a good language identification).
+ === Raw results ===
+ 
+ The following matrix shows the LanguageIdentifierPlugin processing time in ''ms'' for several versions. Each patched version is configured to be comparable with the nutch-0.7-dev version, i.e. it uses 1-grams, 2-grams, 3-grams and 4-grams for the analysis.
+ The ''Data Size'' column is the number of bytes read from each file to perform the identification.
  The other columns represent the following configurations:
-  * ''P.V.'': The Previous Version of the LanguageIdentifierPlugin.
-  * ''[x-y]'': The new LanguageIdentifierPlugin version using ngrams from size ''x'' to ''y'' to perform identification.
+  * ''Nutch-0.7'': The nutch-0.7-dev LanguageIdentifierPlugin version (without patch).
+  * ''NUTCH-60-050526'': The LanguageIdentifierPlugin code with NUTCH-60-050526.patch applied.
+  * ''NUTCH-60-050607'': The LanguageIdentifierPlugin code with NUTCH-60-050607.patch applied.
  
- ||'''Data Size'''||'''P.V.'''||'''[1-4]'''||'''[2-2]'''||'''[3-3]'''||'''[4-4]'''||'''[2-3]'''||'''[3-4]'''||'''[2-4]'''||
- ||'''128'''||8314||5124||1627||2245||1393||3073||2996||4243||
- ||'''256'''||7660||4950||1408||1604||1425||3033||2809||3983||
- ||'''512'''||8017||4917||1296||1525||1150||2990||2912||3959||
- ||'''1024'''||8265||7188||1672||1722||1200||2933||2876||4932||
- ||'''2048'''||11541||9252||2213||2909||2601||5438||5530||7307||
- ||'''4096'''||14989||12485||2938||4190||3856||7654||8543||10416||
- ||'''8192'''||21167||18289||4880||6621||5538||11259||12557||15302||
- ||'''16384'''||32295||29488||9028||11173||13130||17560||19809||23673||
- ||'''32768'''||52918||49417||16396||18446||20158||26879||30858||39311||
- ||'''65536'''||97527||91285||33242||33695||34490||50894||54398||71920||
- ||'''131072'''||167502||161258||56036||53706||53527||87603||90553||122413||
- ||'''262144'''||304609||289395||107108||108841||108674||180461||165561||222535||
- ||'''524288'''||463008||442028||151086||146601||156372||253797||245313||336378||
+ || ||'''Nutch-0.7'''||||'''NUTCH-60-050526'''||||'''NUTCH-60-050607'''||
+ ||'''Data Size'''||'''time'''||'''time'''||'''%'''||'''time'''||'''%'''||
+ ||128||2410||1485||38.38||716||70.29||
+ ||256||2842||1836||35.40||1048||63.12||
+ ||512||3759||2305||38.68||1649||56.13||
+ ||1024||5899||5130||13.04||2839||51.87||
+ ||2048||8581||7462||13.04||4534||47.16||
+ ||4096||12622||10513||16.71||8031||36.37||
+ ||8192||21360||18289||14.38||13803||35.38||
+ ||16384||32073||29488||8.06||23733||26.00||
+ ||32768||58535||49417||15.58||41994||28.26||
+ ||65536||99861||91285||8.59||81612||18.27||
+ ||131072||184083||161258||12.40||140501||23.68||
+ ||262144||309438||289395||6.48||244369||21.03||
+ ||524288||504145||442028||12.32||377693||25.08||
+ ||Total||1245608||1109891||10.90||942522||24.33||
+ ||Average||95816||85376.23||10.90||72501.69||24.33||
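
The ''%'' columns are not defined on this page. They appear to be the processing time saved relative to the Nutch-0.7 column. The Python sketch below is written under that assumption; the helper name `percent_gain` is hypothetical, and the figures are copied from the 128-byte and Total rows of the table.

```python
def percent_gain(baseline_ms, patched_ms):
    """Time saved relative to the baseline, as a percentage (assumed meaning of '%')."""
    return (baseline_ms - patched_ms) / baseline_ms * 100.0


# (Nutch-0.7, NUTCH-60-050526, NUTCH-60-050607) times in ms, copied from the table.
rows = {
    "128": (2410, 1485, 716),
    "Total": (1245608, 1109891, 942522),
}

for label, (baseline, patch_0526, patch_0607) in rows.items():
    print(
        label,
        round(percent_gain(baseline, patch_0526), 2),
        round(percent_gain(baseline, patch_0607), 2),
    )
```

Rounded to two decimals, these values agree with the ''%'' cells of the two rows, which supports the assumed definition without confirming it.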
  
- == Graphical Representation ==
+ === Graphical representation ===
  
- [http://frutch.free.fr/images/nutch/langid-benchs01.png]
+ [http://frutch.free.fr/images/nutch/langid-benchs03.jpg]
  
- == Graphical Representation (log axis) ==
+ === Graphical representation (log axis) ===
  
- [http://frutch.free.fr/images/nutch/langid-benchs02.png]
+ [http://frutch.free.fr/images/nutch/langid-benchs04.jpg]
  
- == Discussion ==
+ === Discussion ===
+ 
+ ''TODO''
+  
+ == Precision ==
+ 
+ === Data set ===
+ 
+ These ''precision'' benchmarks were produced by testing the LanguageIdentifierPlugin on the first ''Data Size'' bytes of each file in a set of:
+  * 492 French files,
+  * 487 English files,
+  * 488 German files.
+ (These files were extracted from the ''[http://people.csail.mit.edu/koehn/publications/europarl/ European Parliament Proceedings Parallel Corpus 1996-2003 Release v2]'').
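
The measurement described above can be sketched in Python as follows. This is only an illustration under stated assumptions: precision is taken to be the percentage of files whose first ''Data Size'' bytes are identified correctly, and `toy_identify` is a hypothetical stand-in that has nothing to do with the plugin's n-gram algorithm.

```python
from typing import Callable, List, Tuple


def precision(
    samples: List[Tuple[bytes, str]],
    identify: Callable[[bytes], str],
    data_size: int,
) -> float:
    """Percentage of samples whose first data_size bytes get the expected language."""
    hits = sum(
        1 for content, expected in samples if identify(content[:data_size]) == expected
    )
    return 100.0 * hits / len(samples)


def toy_identify(prefix: bytes) -> str:
    """Hypothetical identifier: guesses French when it sees a common French word."""
    return "fr" if b"le " in prefix or b" de " in prefix else "en"


# Tiny made-up corpus of (content, expected language) pairs, for illustration only.
samples = [
    (b"le parlement europeen", "fr"),
    (b"the european parliament", "en"),
    (b"reprise de la session", "fr"),
]

for size in (8, 64):
    print(size, round(precision(samples, toy_identify, size), 2))
```

With this toy corpus the score rises as more bytes are made available, which is the kind of trend the precision tables are meant to capture.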
+ 
+ === Raw results ===
+ 
+ || ||||||||'''Nutch-0.7'''||||||||'''NUTCH-60-050605'''||||||||'''NUTCH-60-050607'''||
+ ||'''Data Size'''||'''avg'''||'''fr'''||'''en'''||'''de'''||'''avg'''||'''fr'''||'''en'''||'''de'''||'''avg'''||'''fr'''||'''en'''||'''de'''||
+ ||8||38.84||36.99||10.47||69.06||14.00||2.64||2.67||36.68||51.11||48.37||19.30||85.66||
+ ||16||70.38||58.74||75.15||77.25||45.64||13.41||68.17||55.33||94.06||97.36||87.68||97.13||
+ ||32||66.51||55.08||86.86||57.58||56.43||41.26||73.92||54.10||98.56||99.59||96.30||99.80||
+ ||64||97.14||97.15||97.54||96.72||65.35||53.86||84.80||57.38||99.93||100||99.79||100||
+ ||128||97.90||94.51||99.79||99.39||77.81||70.53||89.32||73.57||100||100||100||100||
+ ||256||100||100||100||100||90.32||90.04||92.20||88.73||100||100||100||100||
+ ||512||100||100||100||100||96.93||98.17||97.54||95.08||100||100||100||100||
+ ||1024||100||100||100||100||99.59||99.80||99.79||99.18||100||100||100||100||
+ ||2048||100||100||100||100||100||100||100||100||100||100||100||100||
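
The ''avg'' column is not defined on this page. It appears to be the unweighted mean of the ''fr'', ''en'' and ''de'' columns. The Python sketch below checks that assumption against the Nutch-0.7 figures of the 16-byte and 32-byte rows; the helper name `mean_precision` is hypothetical.

```python
def mean_precision(per_language):
    """Unweighted mean of per-language precision values (assumed meaning of 'avg')."""
    return sum(per_language.values()) / len(per_language)


# Nutch-0.7 figures copied from the 16-byte and 32-byte rows of the table.
nutch_07 = {
    16: {"fr": 58.74, "en": 75.15, "de": 77.25},
    32: {"fr": 55.08, "en": 86.86, "de": 57.58},
}

for data_size, per_language in nutch_07.items():
    print(data_size, round(mean_precision(per_language), 2))
```

Rounded to two decimals, the results agree with the ''avg'' cells of those rows (70.38 and 66.51).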
+ 
+ === Graphical representation ===
+ 
+ [http://frutch.free.fr/images/nutch/langid-benchs05.jpg]
+ 
+ === Discussion ===
  
  ''TODO''