Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/04/14 18:09:29 UTC

topic change: common crawl slice on TIKA-1302 vm

Hi Julien,
  We're just beginning to scratch the surface.  There's much to learn from this set.  Apologies for my delay, and thank you!

The proportions below line up pretty closely with those in your blog post (http://digitalpebble.blogspot.com/2014/11/generating-test-corpus-for-apache-tika.html)

Total files: 2,135,515

Detected content types:
DETECTED_CONTENT_TYPE (by TIKA 1.8-rc2)      COUNT
image/jpeg                                 857,625
application/pdf                            320,443
text/plain; charset=ISO-8859-1             276,152
image/png                                  184,855
text/plain; charset=windows-1252           164,327
image/gif                                   51,809
text/plain; charset=UTF-8                   44,766
audio/x-wav                                 34,402
application/octet-stream                    28,586
message/rfc822                              18,231
text/html; charset=ISO-8859-1               17,528
application/xhtml+xml; charset=UTF-8        16,845
application/zip                             14,385
text/html; charset=UTF-8                     9,626
audio/mpeg                                   8,670
text/html; charset=windows-1252              7,818
application/msword                           7,782
application/x-archive                        5,970
application/x-bibtex-text-file               5,274
application/xml                              5,234
image/vnd.djvu                               5,063
application/rss+xml                          4,726
application/gzip                             4,443
application/xhtml+xml; charset=ISO-8859-1    4,228
application/epub+zip                         3,458
image/tiff                                   2,980
image/jp2                                    2,706
application/rtf                              1,622
	
 
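
Since the charset variants of text/plain and text/html are listed separately above, it may help to roll them up by base media type before comparing proportions. What follows is only a minimal sketch in Python, not how the counts above were produced: the helper names (base_type, collapse) are made up for illustration, and ROWS holds just a handful of rows copied from the table rather than the full list.

```python
from collections import Counter

TOTAL_FILES = 2_135_515  # "Total files" figure quoted in the message

# A few rows copied from the table (illustrative subset, not the full list).
ROWS = [
    ("image/jpeg", 857_625),
    ("application/pdf", 320_443),
    ("text/plain; charset=ISO-8859-1", 276_152),
    ("text/plain; charset=windows-1252", 164_327),
    ("text/plain; charset=UTF-8", 44_766),
    ("text/html; charset=ISO-8859-1", 17_528),
    ("text/html; charset=UTF-8", 9_626),
    ("text/html; charset=windows-1252", 7_818),
]


def base_type(content_type):
    """Drop parameters such as '; charset=UTF-8' and keep only type/subtype."""
    return content_type.split(";", 1)[0].strip().lower()


def collapse(rows):
    """Sum the counts of all rows that share the same base media type."""
    totals = Counter()
    for content_type, count in rows:
        totals[base_type(content_type)] += count
    return totals


if __name__ == "__main__":
    # Print each base type with its count and its share of all files.
    for media_type, count in collapse(ROWS).most_common():
        print(f"{media_type:<20} {count:>9,} {count / TOTAL_FILES:6.1%}")
```

With these rows, the three text/plain variants add up to 485,245 files and the three text/html variants to 34,972.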

________________________________________
From: Julien Nioche <li...@gmail.com>
Sent: Tuesday, April 14, 2015 9:24 AM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.8 Release Candidate #2

Hi Tim

Great to hear that you managed to use the dataset from CommonCrawl. Thanks!

Julien

On 14 April 2015 at 14:15, Allison, Timothy B. <ta...@mitre.org> wrote:

> +1
>
> Thank you, Tyler!
>
> Apologies to Hong-Thai and community for not recognizing the severity of
> TIKA-1600 when I voted in favor of rc1!
>
> Details...
>
> I reran against govdocs1, and there aren't any major surprises.
>
> On our Rackspace vm, I _finally_ unzipped the Common Crawl slice that
> Julien Nioche created for us, and I ran against that as well.  That turned
> up TIKA-1605 and another exceedingly rare NPE in the PDFParser.  I don't
> think either of these is a blocker, and they're now fixed in trunk.
>
> There are slightly fewer metadata values for some jpegs.  For the one file
> that I manually reviewed, 1.8-rc was missing these values (that were
> available in 1.7):
>
> JPEG quality
> IPTC-NAA record
> Plug-in 1 Data
>
> Comparison reports are available here (much more work remains to be done
> on tika-eval):
>
> https://github.com/tballison/share/tree/master/tika_comparisons
>
> ________________________________________
> From: Tyler Palsulich <tp...@apache.org>
> Sent: Monday, April 13, 2015 1:56 PM
> To: dev@tika.apache.org; user@tika.apache.org
> Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
>
> Hi Folks,
>
> A candidate for the Tika 1.8 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
>
> The SHA1 checksum of the archive is
>   5e22fee9079370398472e59082d171ae2d7fdd31.
>
> In addition, a staged maven repository is available here:
>   https://repository.apache.org/content/repositories/orgapachetika-1009
>
> Please vote on releasing this package as Apache Tika 1.8. The vote is open
> for the next 72 hours and passes if a majority of at least three +1 Tika
> PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.8
> [ ] ±0 I don't object to this release, but I haven't checked it
> [ ] -1 Do not release this package because...
>
> Thanks,
> Tyler
>
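
For anyone checking the release candidate, the SHA1 comparison mentioned in the vote mail above could be done roughly as follows. This is a sketch under assumptions, not the project's official procedure: the function names are made up for illustration, and the archive file name in the usage note is hypothetical because the vote mail does not give one.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's SHA1 with a published value.

    Surrounding whitespace, a trailing period (as in the vote mail) and
    letter case in the published value are ignored.
    """
    expected = published_hex.strip().rstrip(".").lower()
    return sha1_of_file(path) == expected
```

Usage would look something like matches_published("tika-1.8-src.zip", "5e22fee9079370398472e59082d171ae2d7fdd31."), where the file name is an assumed local path to the downloaded archive.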



--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: topic change: common crawl slice on TIKA-1302 vm

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Awesome!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





