Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/04/14 18:09:29 UTC
topic change: common crawl slice on TIKA-1302 vm
Hi Julien,
We're just beginning to scratch the surface. There's much to learn from this set. Apologies for my delay, and thank you!
These proportions line up pretty closely with your blog post (http://digitalpebble.blogspot.com/2014/11/generating-test-corpus-for-apache-tika.html)
Total files: 2,135,515
Detected content types:
DETECTED_CONTENT_TYPE (by TIKA 1.8-rc2)      COUNT
image/jpeg                                 857,625
application/pdf                            320,443
text/plain; charset=ISO-8859-1             276,152
image/png                                  184,855
text/plain; charset=windows-1252           164,327
image/gif                                   51,809
text/plain; charset=UTF-8                   44,766
audio/x-wav                                 34,402
application/octet-stream                    28,586
message/rfc822                              18,231
text/html; charset=ISO-8859-1               17,528
application/xhtml+xml; charset=UTF-8        16,845
application/zip                             14,385
text/html; charset=UTF-8                     9,626
audio/mpeg                                   8,670
text/html; charset=windows-1252              7,818
application/msword                           7,782
application/x-archive                        5,970
application/x-bibtex-text-file               5,274
application/xml                              5,234
image/vnd.djvu                               5,063
application/rss+xml                          4,726
application/gzip                             4,443
application/xhtml+xml; charset=ISO-8859-1    4,228
application/epub+zip                         3,458
image/tiff                                   2,980
image/jp2                                    2,706
application/rtf                              1,622
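
For anyone who wants to turn these counts into proportions, here is a minimal sketch. It is not part of tika-eval; the function names are made up for illustration, only the top five rows are copied in for brevity, and merging charset variants under their base type is my own assumption about how one might compare against other reports.

```python
# Counts copied from the table above (top five rows only, for brevity).
TOTAL_FILES = 2_135_515
COUNTS = {
    "image/jpeg": 857_625,
    "application/pdf": 320_443,
    "text/plain; charset=ISO-8859-1": 276_152,
    "image/png": 184_855,
    "text/plain; charset=windows-1252": 164_327,
}


def share(count, total=TOTAL_FILES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


def shares_by_base_type(counts, total=TOTAL_FILES):
    """Merge charset variants (e.g. the text/plain rows), then compute shares.

    Hypothetical helper: the base type is whatever precedes the first ';'.
    """
    merged = {}
    for content_type, n in counts.items():
        base = content_type.split(";")[0].strip()
        merged[base] = merged.get(base, 0) + n
    return {base: share(n, total) for base, n in merged.items()}


if __name__ == "__main__":
    ranked = sorted(shares_by_base_type(COUNTS).items(), key=lambda kv: -kv[1])
    for base, pct in ranked:
        print(f"{base:<20} {pct:6.2f}%")
```

Extending COUNTS with the remaining rows gives the full breakdown; the total is taken from the "Total files" line rather than summed, since the table lists only the most frequent types.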
________________________________________
From: Julien Nioche <li...@gmail.com>
Sent: Tuesday, April 14, 2015 9:24 AM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.8 Release Candidate #2
Hi Tim
Great to hear that you managed to use the dataset from CommonCrawl. Thanks!
Julien
On 14 April 2015 at 14:15, Allison, Timothy B. <ta...@mitre.org> wrote:
> +1
>
> Thank you, Tyler!
>
> Apologies to Hong-Thai and community for not recognizing the severity of
> TIKA-1600 when I voted in favor of rc1!
>
> Details...
>
> I reran against govdocs1, and there aren't any major surprises.
>
> On our Rackspace vm, I _finally_ unzipped the Common Crawl slice that
> Julien Nioche created for us, and I ran against that as well. That turned
> up TIKA-1605 and another exceedingly rare NPE in the PDFParser. I don't
> think either of these are blockers, and they're now fixed in trunk.
>
> There are slightly fewer metadata values for some jpegs. For the one file
> that I manually reviewed, 1.8-rc was missing these values (that were
> available in 1.7):
>
> JPEG quality
> IPTC-NAA record
> Plug-in 1 Data
>
> Comparison reports are available here (much more work remains to be done
> on tika-eval):
>
> https://github.com/tballison/share/tree/master/tika_comparisons
>
> ________________________________________
> From: Tyler Palsulich <tp...@apache.org>
> Sent: Monday, April 13, 2015 1:56 PM
> To: dev@tika.apache.org; user@tika.apache.org
> Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
>
> Hi Folks,
>
> A candidate for the Tika 1.8 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
>
> The SHA1 checksum of the archive is
> 5e22fee9079370398472e59082d171ae2d7fdd31.
>
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1009
>
> Please vote on releasing this package as Apache Tika 1.8. The vote is open
> for the next 72 hours and passes if a majority of at least three +1 Tika
> PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.8
> [ ] ±0 I don't object to this release, but I haven't checked it
> [ ] -1 Do not release this package because...
>
> Thanks,
> Tyler
>
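
As an aside on the checksum step in the vote email quoted above, this is a minimal sketch of how one might compare a downloaded archive against the published SHA1. The archive filename in the usage comment is hypothetical (the email does not name the file), and the helper names are invented for illustration.

```python
import hashlib

# SHA1 value as published in the 1.8-rc2 vote email.
EXPECTED_SHA1 = "5e22fee9079370398472e59082d171ae2d7fdd31"


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA1):
    """True if the file's SHA1 equals the expected hex digest (case-insensitive)."""
    return sha1_of_file(path) == expected.lower()


# Usage (hypothetical filename, adjust to whatever was downloaded):
#   matches_expected("tika-1.8-src.zip")
```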
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: topic change: common crawl slice on TIKA-1302 vm
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Awesome!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++