Posted to dev@poi.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/04/07 14:48:55 UTC

[COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

All,

  We just heard back from a very active member of Common Crawl.  I don’t want to clog up our dev lists with this discussion (more than I have!), but I do want to invite all to participate in the discussion, planning and potential patches.

  If you’d like to participate, please join us here: https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

  I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the Subject line.  Please invite others who might have an interest in this work.

         Best,

                     Tim

From: Allison, Timothy B.
Sent: Tuesday, April 07, 2015 8:39 AM
To: 'Stephen Merity'; common-crawl@googlegroups.com
Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?

Stephen,

  Thank you very much for responding so quickly and for all of your work on Common Crawl.  I don’t want to speak for all of us, but given the feedback I’ve gotten so far from some of the dev communities, I think we would very much appreciate the chance to be tested on a monthly basis as part of the regular Common Crawl process.

   I think we’ll still want to run more often in our own sandbox(es) on the slice of CommonCrawl we have, but the monthly testing against new data, from my perspective at least, would be a huge win for all of us.

  In addition to parsing binaries and extracting text, Tika (via PDFBox, POI and many others) can also offer metadata (e.g. EXIF from images), which users of CommonCrawl might find useful.
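For context, the kind of extraction being proposed is a few lines with Tika's detector-driven parser API. A minimal sketch (assumes tika-core and tika-parsers on the classpath; the sample file here is illustrative, whereas a real run would feed WARC record payloads):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSketch {
    // Returns {extracted text, detected content type} for any file Tika can handle.
    public static String[] extract(Path file) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata); // fills handler (text) and metadata
        }
        return new String[] { handler.toString(), metadata.get(Metadata.CONTENT_TYPE) };
    }

    public static void main(String[] args) throws Exception {
        // Illustrative stand-in for a crawled document.
        Path sample = Files.createTempFile("sample", ".txt");
        Files.write(sample, "hello from common crawl".getBytes("UTF-8"));
        String[] result = extract(sample);
        System.out.println("text: " + result[0].trim());
        System.out.println("type: " + result[1]);
    }
}
```

The same `Metadata` object would carry format-specific keys (EXIF for images, author/title for Office documents, and so on) after the parse.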

  I’ll forward this to some of the relevant dev lists to invite others to participate in the discussion on the common-crawl list.


  Thank you, again.  I very much look forward to collaborating.

             Best,

                         Tim

From: Stephen Merity [mailto:stephen@commoncrawl.org]
Sent: Tuesday, April 07, 2015 3:57 AM
To: common-crawl@googlegroups.com
Cc: mattmann@apache.org; tallison@apache.org; dmeikle@apache.org; tilman@apache.org; nick@apache.org
Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?

Hi Tika team!

We'd certainly be interested in working with Apache Tika on such an undertaking. At the very least, we're glad that Julien has provided you with content to battle test Tika with!

As you've noted, the text extraction performed to produce WET files is focused primarily on HTML files, leaving many other file types uncovered. The existing text extraction is quite efficient and part of the same process that generates the WAT file, meaning there's next to no overhead. Performing extraction with Tika at the scale of Common Crawl would be an interesting challenge. Running it as a one-off likely wouldn't be too difficult, and it would also give Tika the benefit of a wider variety of documents (both well formed and malformed) to test against. Running it on a frequent basis or as part of the crawl pipeline would be more challenging, but that's something we can certainly discuss, especially if there's strong community desire for it!

On Fri, Apr 3, 2015 at 5:23 AM, <ta...@gmail.com> wrote:
CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess is that this is text stripping from text-y formats.  Let me know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or in supplementing the current WET output by using Tika to extract content from binary formats too (PDF, MS Word, etc.)?

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace VM.  But I'm wondering now whether it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.

Cheers,

          Tim
--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl+unsubscribe@googlegroups.com.
To post to this group, send email to common-crawl@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.



--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

RE: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Yes, perfect.  With your code, I grabbed an extra 170k (doc|xls|ppt)x files for testing Tika 1.9-rc1.  Thank you!
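For anyone else pulling a similar test corpus: the "(doc|xls|ppt)x" selection above can be done with a simple regex over the crawled URLs. A hypothetical stdlib-only sketch (the class and URLs are illustrative, not part of Dominik's tool):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ExtensionFilter {
    // Matches URLs ending in the OOXML Office extensions: .docx, .xlsx, .pptx
    // (case-insensitive, mirroring the "(doc|xls|ppt)x" pattern mentioned above).
    private static final Pattern OOXML =
            Pattern.compile(".*\\.(doc|xls|ppt)x$", Pattern.CASE_INSENSITIVE);

    public static boolean isOoxml(String url) {
        return OOXML.matcher(url).matches();
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://example.com/report.docx",
                "http://example.com/old-report.doc",
                "http://example.com/sheet.XLSX",
                "http://example.com/slides.pptx");
        // Keep only the OOXML documents from a candidate list of crawled URLs.
        List<String> kept = urls.stream()
                .filter(ExtensionFilter::isOoxml)
                .collect(Collectors.toList());
        System.out.println(kept);
    }
}
```

In practice the candidate list would come from the Common Crawl URL index rather than being hard-coded, but the filtering step looks the same.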

-----Original Message-----
From: Dominik Stadler [mailto:dominik.stadler@gmx.at] 
Sent: Monday, June 01, 2015 3:16 PM
To: POI Developers List
Subject: Re: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Please try again with the latest version of the project; I have hopefully
fixed this now at https://github.com/centic9/CommonCrawlDocumentDownload.

Thanks... Dominik.

On Mon, Jun 1, 2015 at 6:32 PM, Dominik Stadler <do...@gmx.at> wrote:
> That's likely on my side, sorry, I'll take a look....
>
> Dominik
>
> Am 01.06.2015 16:51 schrieb "Allison, Timothy B." <ta...@mitre.org>:
>>
>> Dominik,
>>   Thank you for making this available!  I'm trying to build/run now, and
>> I'm getting this...is this user error?
>>
>>
>>
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:20:
>> error: package org.dstadler.commons.testing does not exist
>> import org.dstadler.commons.testing.MockRESTServer;
>>                                    ^
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:21:
>> error: package org.dstadler.commons.testing does not exist
>> import org.dstadler.commons.testing.TestHelpers;
>>                                    ^
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/ExtensionsTest.java:31:
>> error: package org.dstadler.commons.testing does not exist
>>
>> org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Extensions.class);
>>                                     ^
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158:
>> error: cannot find symbol
>>         try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK,
>> "text/plain", "Ok")) {
>>              ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158:
>> error: cannot find symbol
>>         try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK,
>> "text/plain", "Ok")) {
>>                                          ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171:
>> error: cannot find symbol
>>         try (MockRESTServer server = new
>> MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
>>              ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171:
>> error: cannot find symbol
>>         try (MockRESTServer server = new
>> MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
>>                                          ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:179:
>> error: cannot find symbol
>>                         TestHelpers.assertContains(e, "500", "localhost",
>> Integer.toString(server.getPort()));
>>                         ^
>>   symbol:   variable TestHelpers
>>   location: class UtilsTest
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:205:
>> error: package org.dstadler.commons.testing does not exist
>>
>> org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Utils.class);
>>                                     ^
>> 9 errors
>> :compileTestJava FAILED
>>
>> -----Original Message-----
>> From: Dominik Stadler [mailto:centic@apache.org]
>> Sent: Wednesday, April 22, 2015 4:07 PM
>> To: POI Developers List
>> Cc: dev@tika.apache.org; dev@pdfbox.apache.org; dev@commons.apache.org
>> Subject: Re: [COMPRESS and others] FW: Any interest in running Apache Tika
>> as part of CommonCrawl?
>>
>> Hi,
>>
>> I have now published a first version of a tool to download binary data
>> of certain file types from the Common Crawl URL Index. Currently it
>> only supports the previous index format, so the data is from around
>> 2012/2013, but this also provides tons of files for mass-testing of
>> our frameworks.
>>
>> I used a small part of the files to run some integration testing
>> locally and immediately found a few issues where specially formatted
>> files broke Apache POI.
>>
>> The project is currently available at
>> https://github.com/centic9/CommonCrawlDocumentDownload, it has options
>> for downloading files as well as first retrieving a list of all
>> interesting files and then downloading them later. But it should also
>> be easily possible to change it so it processes the files on-the-fly
>> (if you want to spare the estimated >300G of disk space it will need
>> for example to store files interesting for Apache POI testing).
>>
>> Naturally running this on Amazon EC2 machines can speed up the
>> downloading a lot as then the network access to Amazon S3 is much
>> faster.
>>
>> Please give it a try if you are interested and let me know what you think.
>>
>> Dominik.
>>
>> On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <ta...@mitre.org>
>> wrote:
>> > All,
>> >
>> >   We just heard back from a very active member of Common Crawl.  I don’t
>> > want to clog up our dev lists with this discussion (more than I have!), but
>> > I do want to invite all to participate in the discussion, planning and
>> > potential patches.
>> >
>> >   If you’d like to participate, please join us here:
>> > https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>> >
>> >   I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to
>> > the Subject line.  Please invite others who might have an interest in this
>> > work.
>> >
>> >          Best,
>> >
>> >                      Tim
>> >
>> > From: Allison, Timothy B.
>> > Sent: Tuesday, April 07, 2015 8:39 AM
>> > To: 'Stephen Merity'; common-crawl@googlegroups.com
>> > Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>> >
>> > Stephen,
>> >
>> >   Thank you very much for responding so quickly and for all of your work
>> > on Common Crawl.  I don’t want to speak for all of us, but given the
>> > feedback I’ve gotten so far from some of the dev communities, I think we
>> > would very much appreciate the chance to be tested on a monthly basis as
>> > part of the regular Common Crawl process.
>> >
>> >    I think we’ll still want to run more often in our own sandbox(es) on
>> > the slice of CommonCrawl we have, but the monthly testing against new data,
>> > from my perspective at least, would be a huge win for all of us.
>> >
>> >    In addition to parsing binaries and extracting text, Tika (via
>> > PDFBox, POI and many others) can also offer metadata (e.g. exif from
>> > images), which users of CommonCrawl might find of use.
>> >
>> >   I’ll forward this to some of the relevant dev lists to invite others
>> > to participate in the discussion on the common-crawl list.
>> >
>> >
>> >   Thank you, again.  I very much look forward to collaborating.
>> >
>> >              Best,
>> >
>> >                          Tim
>> >
>> > From: Stephen Merity [mailto:stephen@commoncrawl.org]
>> > Sent: Tuesday, April 07, 2015 3:57 AM
>> > To: common-crawl@googlegroups.com<ma...@googlegroups.com>
>> > Cc: mattmann@apache.org<ma...@apache.org>;
>> > tallison@apache.org<ma...@apache.org>;
>> > dmeikle@apache.org<ma...@apache.org>;
>> > tilman@apache.org<ma...@apache.org>;
>> > nick@apache.org<ma...@apache.org>
>> > Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>> >
>> > Hi Tika team!
>> >
>> > We'd certainly be interested in working with Apache Tika on such an
>> > undertaking. At the very least, we're glad that Julien has provided you with
>> > content to battle test Tika with!
>> >
>> > As you've noted, the text extraction performed to produce WET files are
>> > focused primarily on HTML files, leaving many other file types not covered.
>> > The existing text extraction is quite efficient and part of the same process
>> > that generates the WAT file, meaning there's next to no overhead. Performing
>> > extraction with Tika at the scale of Common Crawl would be an interesting
>> > challenge. Running it as a once off wouldn't likely be too much of a
>> > challenge and would also give Tika the benefit of a wider variety of
>> > documents (both well formed and malformed) to test against. Running it on a
>> > frequent basis or as part of the crawl pipeline would be more challenging
>> > but something we can certainly discuss, especially if there's strong
>> > community desire for it!
>> >
>> > On Fri, Apr 3, 2015 at 5:23 AM,
>> > <ta...@gmail.com>> wrote:
>> > CommonCrawl currently has the WET format that extracts plain text from
>> > web pages.  My guess is that this is text stripping from text-y formats.
>> > Let me know if I'm wrong!
>> >
>> > Would there be any interest in adding another format: WETT (WET-Tika) or
>> > supplementing the current WET by using Tika to extract contents from binary
>> > formats too: PDF, MSWord, etc.
>> >
>> > Julien Nioche kindly carved out 220 GB for us to experiment with on
>> > TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
>> > vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
>> > run Tika as part of its regular process and make the output available in one
>> > of your standard formats.
>> >
>> > CommonCrawl consumers would get Tika output, and the Tika dev community
>> > (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to
>> > help prioritize bug fixes.
>> >
>> > Cheers,
>> >
>> >           Tim
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "Common Crawl" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> > an email to
>> > common-crawl+unsubscribe@googlegroups.com<ma...@googlegroups.com>.
>> > To post to this group, send email to
>> > common-crawl@googlegroups.com<ma...@googlegroups.com>.
>> > Visit this group at http://groups.google.com/group/common-crawl.
>> > For more options, visit https://groups.google.com/d/optout.
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Stephen Merity
>> > Data Scientist @ Common Crawl
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: dev-help@pdfbox.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
>> For additional commands, e-mail: dev-help@poi.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org




Re: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Dominik Stadler <do...@gmx.at>.
Please try again with the latest version of the project; I have hopefully
fixed this now at https://github.com/centic9/CommonCrawlDocumentDownload.

Thanks... Dominik.

[quoted history trimmed; see the messages above]

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Dominik Stadler <do...@gmx.at>.
That's likely on my side, sorry, I'll take a look....

Dominik
[quoted history trimmed; see the messages above]
> to participate in the discussion on the common-crawl list.
> >
> >
> >   Thank you, again.  I very much look forward to collaborating.
> >
> >              Best,
> >
> >                          Tim
> >
> > From: Stephen Merity [mailto:stephen@commoncrawl.org]
> > Sent: Tuesday, April 07, 2015 3:57 AM
> > To: common-crawl@googlegroups.com<ma...@googlegroups.com>
> > Cc: mattmann@apache.org<ma...@apache.org>; tallison@apache.org
> <ma...@apache.org>; dmeikle@apache.org<ma...@apache.org>;
> tilman@apache.org<ma...@apache.org>; nick@apache.org<mailto:
> nick@apache.org>
> > Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
> >
> > Hi Tika team!
> >
> > We'd certainly be interested in working with Apache Tika on such an
> undertaking. At the very least, we're glad that Julien has provided you
> with content to battle test Tika with!
> >
> > As you've noted, the text extraction performed to produce WET files are
> focused primarily on HTML files, leaving many other file types not covered.
> The existing text extraction is quite efficient and part of the same
> process that generates the WAT file, meaning there's next to no overhead.
> Performing extraction with Tika at the scale of Common Crawl would be an
> interesting challenge. Running it as a once off wouldn't likely be too much
> of a challenge and would also give Tika the benefit of a wider variety of
> documents (both well formed and malformed) to test against. Running it on a
> frequent basis or as part of the crawl pipeline would be more challenging
> but something we can certainly discuss, especially if there's strong
> community desire for it!
> >
> > On Fri, Apr 3, 2015 at 5:23 AM, <tallison314159@gmail.com<mailto:
> tallison314159@gmail.com>> wrote:
> > CommonCrawl currently has the WET format that extracts plain text from
> web pages.  My guess is that this is text stripping from text-y formats.
> Let me know if I'm wrong!
> >
> > Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.
> >
> > Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
> vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
> >
> > CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
> >
> > Cheers,
> >
> >           Tim
> > --
> > You received this message because you are subscribed to the Google
> Groups "Common Crawl" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> an email to common-crawl+unsubscribe@googlegroups.com<mailto:
> common-crawl+unsubscribe@googlegroups.com>.
> > To post to this group, send email to common-crawl@googlegroups.com
> <ma...@googlegroups.com>.
> > Visit this group at http://groups.google.com/group/common-crawl.
> > For more options, visit https://groups.google.com/d/optout.
> >
> >
> >
> > --
> > Regards,
> > Stephen Merity
> > Data Scientist @ Common Crawl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
>
>

RE: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Dominik,
  Thank you for making this available!  I'm trying to build/run now, and I'm getting this...is this user error?



/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:20: error: package org.dstadler.commons.testing does not exist
import org.dstadler.commons.testing.MockRESTServer;
                                   ^
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:21: error: package org.dstadler.commons.testing does not exist
import org.dstadler.commons.testing.TestHelpers;
                                   ^
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/ExtensionsTest.java:31: error: package org.dstadler.commons.testing does not exist
        org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Extensions.class);
                                    ^
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
        try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
             ^
  symbol:   class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
        try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
                                         ^
  symbol:   class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
        try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
             ^
  symbol:   class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
        try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
                                         ^
  symbol:   class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:179: error: cannot find symbol
                        TestHelpers.assertContains(e, "500", "localhost", Integer.toString(server.getPort()));
                        ^
  symbol:   variable TestHelpers
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:205: error: package org.dstadler.commons.testing does not exist
        org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Utils.class);
                                    ^
9 errors
:compileTestJava FAILED
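All nine errors share one root cause: the test classes (MockRESTServer, TestHelpers, PrivateConstructorCoverage) live in a separate support library of Dominik's that is missing from the test compile classpath. One plausible fix is adding it as a test dependency in build.gradle; the coordinates below are an assumption for illustration, not confirmed ones, so check the project's README/build.gradle for the real artifact:

```groovy
// build.gradle -- hypothetical fix; group/artifact/version are assumed, not confirmed
repositories {
    mavenCentral()
    mavenLocal()   // in case the support library is only installed locally
}

dependencies {
    // would provide org.dstadler.commons.testing.{MockRESTServer, TestHelpers, PrivateConstructorCoverage}
    testCompile 'org.dstadler:commons-test:1.0.0'
}
```

If the library isn't published anywhere, the fallback would be cloning its repository and installing it into the local Maven repository before building.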

-----Original Message-----
From: Dominik Stadler [mailto:centic@apache.org] 
Sent: Wednesday, April 22, 2015 4:07 PM
To: POI Developers List
Cc: dev@tika.apache.org; dev@pdfbox.apache.org; dev@commons.apache.org
Subject: Re: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Hi,

I have now published a first version of a tool to download binary data
of certain file types from the Common Crawl URL Index. Currently it
only supports the previous index format, so the data is from around
2012/2013, but this also provides tons of files for mass-testing of
our frameworks.

I used a small part of the files to run some integration testing
locally and immediately found a few issues where specially formatted
files broke Apache POI.

The project is currently available at
https://github.com/centic9/CommonCrawlDocumentDownload, it has options
for downloading files as well as first retrieving a list of all
interesting files and then downloading them later. But it should also
be easily possible to change it so it processes the files on-the-fly
(if you want to spare the estimated >300G of disk space it will need
for example to store files interesting for Apache POI testing).

Naturally running this on Amazon EC2 machines can speed up the
downloading a lot as then the network access to Amazon S3 is much
faster.

Please give it a try if you are interested and let me know what you think.

Dominik.

On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
> All,
>
>   We just heard back from a very active member of Common Crawl.  I don’t want to clog up our dev lists with this discussion (more than I have!), but I do want to invite all to participate in the discussion, planning and potential patches.
>
>   If you’d like to participate, please join us here: https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>   I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the Subject line.  Please invite others who might have an interest in this work.
>
>          Best,
>
>                      Tim
>
> From: Allison, Timothy B.
> Sent: Tuesday, April 07, 2015 8:39 AM
> To: 'Stephen Merity'; common-crawl@googlegroups.com
> Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>
> Stephen,
>
>   Thank you very much for responding so quickly and for all of your work on Common Crawl.  I don’t want to speak for all of us, but given the feedback I’ve gotten so far from some of the dev communities, I think we would very much appreciate the chance to be tested on a monthly basis as part of the regular Common Crawl process.
>
>    I think we’ll still want to run more often in our own sandbox(es) on the slice of CommonCrawl we have, but the monthly testing against new data, from my perspective at least, would be a huge win for all of us.
>
>    In addition to parsing binaries and extracting text, Tika (via PDFBox, POI and many others) can also offer metadata (e.g. exif from images), which users of CommonCrawl might find of use.
>
>   I’ll forward this to some of the relevant dev lists to invite others to participate in the discussion on the common-crawl list.
>
>
>   Thank you, again.  I very much look forward to collaborating.
>
>              Best,
>
>                          Tim
>
> From: Stephen Merity [mailto:stephen@commoncrawl.org]
> Sent: Tuesday, April 07, 2015 3:57 AM
> To: common-crawl@googlegroups.com
> Cc: mattmann@apache.org; tallison@apache.org; dmeikle@apache.org; tilman@apache.org; nick@apache.org
> Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>
> Hi Tika team!
>
> We'd certainly be interested in working with Apache Tika on such an undertaking. At the very least, we're glad that Julien has provided you with content to battle test Tika with!
>
> As you've noted, the text extraction performed to produce WET files is focused primarily on HTML files, leaving many other file types uncovered. The existing text extraction is quite efficient and part of the same process that generates the WAT file, meaning there's next to no overhead. Performing extraction with Tika at the scale of Common Crawl would be an interesting challenge. Running it as a one-off likely wouldn't be too much of a challenge and would also give Tika the benefit of a wider variety of documents (both well-formed and malformed) to test against. Running it on a frequent basis or as part of the crawl pipeline would be more challenging but something we can certainly discuss, especially if there's strong community desire for it!
>
> On Fri, Apr 3, 2015 at 5:23 AM, <ta...@gmail.com> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess is that this is text stripping from text-y formats.  Let me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace vm.  But, I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.
>
> Cheers,
>
>           Tim
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl+unsubscribe@googlegroups.com<ma...@googlegroups.com>.
> To post to this group, send email to common-crawl@googlegroups.com<ma...@googlegroups.com>.
> Visit this group at http://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.
>
>
>
> --
> Regards,
> Stephen Merity
> Data Scientist @ Common Crawl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Re: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Dominik Stadler <ce...@apache.org>.
Hi,

I have now published a first version of a tool to download binary data
of certain file types from the Common Crawl URL Index. Currently it
only supports the previous index format, so the data is from around
2012/2013, but this also provides tons of files for mass-testing of
our frameworks.

I used a small part of the files to run some integration testing
locally and immediately found a few issues where specially formatted
files broke Apache POI.

The project is currently available at
https://github.com/centic9/CommonCrawlDocumentDownload, it has options
for downloading files as well as first retrieving a list of all
interesting files and then downloading them later. But it should also
be easily possible to change it so it processes the files on-the-fly
(if you want to spare the estimated >300G of disk space it will need
for example to store files interesting for Apache POI testing).

Naturally running this on Amazon EC2 machines can speed up the
downloading a lot as then the network access to Amazon S3 is much
faster.
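Entries in the old URL index point at a byte offset and length inside a much larger archive file on S3, so each document can be pulled with a single HTTP range request instead of downloading the whole archive. A minimal sketch of that idea (the helper below is illustrative, not code from the tool, and the archive URL is a placeholder):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RangeFetch {
    // Build an HTTP Range header value for a record at `offset` spanning `length` bytes.
    // Byte ranges are inclusive on both ends, hence offset + length - 1.
    static String rangeHeader(long offset, long length) {
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    // Fetch only the record's bytes from an archive URL (placeholder URL, illustrative only).
    static byte[] fetchRecord(String archiveUrl, long offset, long length) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(archiveUrl).openConnection();
        conn.setRequestProperty("Range", rangeHeader(offset, length));
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();  // server should answer with HTTP 206 Partial Content
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // No network needed to see the idea; just print the header that would be sent.
        System.out.println(rangeHeader(1024, 512));
    }
}
```

Fetching only the needed byte ranges is also what makes a list-first, download-later workflow cheap: the list step reads the index, and the download step issues one small range request per interesting file.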

Please give it a try if you are interested and let me know what you think.

Dominik.

On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
> All,
>
>   We just heard back from a very active member of Common Crawl.  I don’t want to clog up our dev lists with this discussion (more than I have!), but I do want to invite all to participate in the discussion, planning and potential patches.
>
>   If you’d like to participate, please join us here: https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>   I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the Subject line.  Please invite others who might have an interest in this work.
>
>          Best,
>
>                      Tim
>
> From: Allison, Timothy B.
> Sent: Tuesday, April 07, 2015 8:39 AM
> To: 'Stephen Merity'; common-crawl@googlegroups.com
> Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>
> Stephen,
>
>   Thank you very much for responding so quickly and for all of your work on Common Crawl.  I don’t want to speak for all of us, but given the feedback I’ve gotten so far from some of the dev communities, I think we would very much appreciate the chance to be tested on a monthly basis as part of the regular Common Crawl process.
>
>    I think we’ll still want to run more often in our own sandbox(es) on the slice of CommonCrawl we have, but the monthly testing against new data, from my perspective at least, would be a huge win for all of us.
>
>    In addition to parsing binaries and extracting text, Tika (via PDFBox, POI and many others) can also offer metadata (e.g. exif from images), which users of CommonCrawl might find of use.
>
>   I’ll forward this to some of the relevant dev lists to invite others to participate in the discussion on the common-crawl list.
>
>
>   Thank you, again.  I very much look forward to collaborating.
>
>              Best,
>
>                          Tim
>
> From: Stephen Merity [mailto:stephen@commoncrawl.org]
> Sent: Tuesday, April 07, 2015 3:57 AM
> To: common-crawl@googlegroups.com
> Cc: mattmann@apache.org; tallison@apache.org; dmeikle@apache.org; tilman@apache.org; nick@apache.org
> Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>
> Hi Tika team!
>
> We'd certainly be interested in working with Apache Tika on such an undertaking. At the very least, we're glad that Julien has provided you with content to battle test Tika with!
>
> As you've noted, the text extraction performed to produce WET files is focused primarily on HTML files, leaving many other file types uncovered. The existing text extraction is quite efficient and part of the same process that generates the WAT file, meaning there's next to no overhead. Performing extraction with Tika at the scale of Common Crawl would be an interesting challenge. Running it as a one-off likely wouldn't be too much of a challenge and would also give Tika the benefit of a wider variety of documents (both well-formed and malformed) to test against. Running it on a frequent basis or as part of the crawl pipeline would be more challenging but something we can certainly discuss, especially if there's strong community desire for it!
>
> On Fri, Apr 3, 2015 at 5:23 AM, <ta...@gmail.com> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess is that this is text stripping from text-y formats.  Let me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace vm.  But, I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.
>
> Cheers,
>
>           Tim
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl+unsubscribe@googlegroups.com.
> To post to this group, send email to common-crawl@googlegroups.com.
> Visit this group at http://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.
>
>
>
> --
> Regards,
> Stephen Merity
> Data Scientist @ Common Crawl

Re: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Dominik Stadler <ce...@apache.org>.
Hi,

I have now published a first version of a tool to download binary data
of certain file types from the Common Crawl URL Index. Currently it
only supports the previous index format, so the data is from around
2012/2013, but this also provides tons of files for mass-testing of
our frameworks.

I used a small part of the files to run some integration testing
locally and immediately found a few issues where specially formatted
files broke Apache POI.

The project is currently available at
https://github.com/centic9/CommonCrawlDocumentDownload, it has options
for downloading files as well as first retrieving a list of all
interesting files and then downloading them later. But it should also
be easily possible to change it so it processes the files on-the-fly
(if you want to spare the estimated >300G of disk space it will need
for example to store files interesting for Apache POI testing).

Naturally running this on Amazon EC2 machines can speed up the
downloading a lot as then the network access to Amazon S3 is much
faster.

Please give it a try if you are interested and let me know what you think.

Dominik.

On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
> All,
>
>   We just heard back from a very active member of Common Crawl.  I don’t want to clog up our dev lists with this discussion (more than I have!), but I do want to invite all to participate in the discussion, planning and potential patches.
>
>   If you’d like to participate, please join us here: https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>   I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the Subject line.  Please invite others who might have an interest in this work.
>
>          Best,
>
>                      Tim
>
> From: Allison, Timothy B.
> Sent: Tuesday, April 07, 2015 8:39 AM
> To: 'Stephen Merity'; common-crawl@googlegroups.com
> Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>
> Stephen,
>
>   Thank you very much for responding so quickly and for all of your work on Common Crawl.  I don’t want to speak for all of us, but given the feedback I’ve gotten so far from some of the dev communities, I think we would very much appreciate the chance to be tested on a monthly basis as part of the regular Common Crawl process.
>
>    I think we’ll still want to run more often in our own sandbox(es) on the slice of CommonCrawl we have, but the monthly testing against new data, from my perspective at least, would be a huge win for all of us.
>
>    In addition to parsing binaries and extracting text, Tika (via PDFBox, POI and many others) can also offer metadata (e.g. exif from images), which users of CommonCrawl might find of use.
>
>   I’ll forward this to some of the relevant dev lists to invite others to participate in the discussion on the common-crawl list.
>
>
>   Thank you, again.  I very much look forward to collaborating.
>
>              Best,
>
>                          Tim
>
> From: Stephen Merity [mailto:stephen@commoncrawl.org]
> Sent: Tuesday, April 07, 2015 3:57 AM
> To: common-crawl@googlegroups.com<ma...@googlegroups.com>
> Cc: mattmann@apache.org<ma...@apache.org>; tallison@apache.org<ma...@apache.org>; dmeikle@apache.org<ma...@apache.org>; tilman@apache.org<ma...@apache.org>; nick@apache.org<ma...@apache.org>
> Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>
> Hi Tika team!
>
> We'd certainly be interested in working with Apache Tika on such an undertaking. At the very least, we're glad that Julien has provided you with content to battle test Tika with!
>
> As you've noted, the text extraction performed to produce WET files are focused primarily on HTML files, leaving many other file types not covered. The existing text extraction is quite efficient and part of the same process that generates the WAT file, meaning there's next to no overhead. Performing extraction with Tika at the scale of Common Crawl would be an interesting challenge. Running it as a once off wouldn't likely be too much of a challenge and would also give Tika the benefit of a wider variety of documents (both well formed and malformed) to test against. Running it on a frequent basis or as part of the crawl pipeline would be more challenging but something we can certainly discuss, especially if there's strong community desire for it!
>
> On Fri, Apr 3, 2015 at 5:23 AM, <ta...@gmail.com> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess is that this is text stripping from text-y formats.  Let me know if I'm wrong!
>
> Would there be any interest in adding another format, WETT (WET-Tika), or supplementing the current WET by using Tika to extract content from binary formats too: PDF, MSWord, etc.?
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace VM.  But I'm wondering now whether it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.
>
> Cheers,
>
>           Tim
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl+unsubscribe@googlegroups.com.
> To post to this group, send email to common-crawl@googlegroups.com.
> Visit this group at http://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.
>
>
>
> --
> Regards,
> Stephen Merity
> Data Scientist @ Common Crawl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Re: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Dominik Stadler <ce...@apache.org>.
Hi,

I have now published a first version of a tool to download binary data
of certain file types from the Common Crawl URL Index. Currently it
only supports the previous index format, so the data is from around
2012/2013, but that still provides tons of files for mass-testing of
our frameworks.

I used a small part of the files to run some integration testing
locally and immediately found a few issues where specially formatted
files broke Apache POI.

The project is currently available at
https://github.com/centic9/CommonCrawlDocumentDownload. It has options
for downloading files directly, as well as for first retrieving a list
of all interesting files and then downloading them later. It should
also be easy to change it to process the files on the fly, if you want
to spare the estimated >300 GB of disk space it would need, for
example, to store the files of interest for Apache POI testing.
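
For anyone wanting to wire the downloads straight into a parser, the
core of such an on-the-fly run is just a loop that feeds each document
to a POI/Tika call and tallies failures by exception type. Here is a
minimal sketch of that loop; the Parser interface and the toy parser in
main are stand-ins for illustration, not the tool's actual API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MassTestSketch {
    // Stand-in for whatever POI/Tika entry point you actually call.
    interface Parser {
        void parse(byte[] document) throws Exception;
    }

    // Feed each document to the parser and count failures by exception
    // class name; documents that parse cleanly are not recorded.
    static Map<String, Integer> tallyFailures(List<byte[]> documents, Parser parser) {
        Map<String, Integer> failures = new LinkedHashMap<>();
        for (byte[] doc : documents) {
            try {
                parser.parse(doc);
            } catch (Exception e) {
                failures.merge(e.getClass().getSimpleName(), 1, Integer::sum);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Toy parser that rejects empty documents.
        Parser parser = doc -> {
            if (doc.length == 0) {
                throw new IllegalArgumentException("empty document");
            }
        };
        List<byte[]> docs = new ArrayList<>();
        docs.add(new byte[] {1, 2, 3});
        docs.add(new byte[0]);
        docs.add(new byte[0]);
        System.out.println(tallyFailures(docs, parser));
    }
}
```

Grouping by exception class (or by the top stack frame) makes it easy
to see which breakages are most common and prioritize fixes.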

Naturally, running this on Amazon EC2 machines can speed up the
downloading a lot, since network access to Amazon S3 is much faster
from there.

Please give it a try if you are interested and let me know what you think.

Dominik.

On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
> [full thread quoted above; snipped]

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org