Posted to dev@pdfbox.apache.org by ta...@gmail.com on 2015/04/03 14:35:00 UTC

Fwd: Any interest in running Apache Tika as part of CommonCrawl?

All,
  What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:

> CommonCrawl currently has the WET format that extracts plain text from web 
> pages.  My guess is that this is text stripping from text-y formats.  Let 
> me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or 
> supplementing the current WET by using Tika to extract contents from binary 
> formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on 
> TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on 
> a Rackspace vm.  But, I'm wondering now if it would make more sense to have 
> CommonCrawl run Tika as part of its regular process and make the output 
> available in one of your standard formats.  
>
> CommonCrawl consumers would get Tika output, and the Tika dev community 
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces 
> to help prioritize bug fixes.
>
> Cheers,
>
>           Tim 
>

Re: Any interest in running Apache Tika as part of CommonCrawl?

Posted by David Meikle <lo...@gmail.com>.

Hey Tim,

+1 from me, I think this would be great to do.

Cheers,
Dave


> On 3 Apr 2015, at 08:35, tallison314159@gmail.com wrote:
> 
> All,
>   What do we think?

Re: Any interest in running Apache Tika as part of CommonCrawl?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

+1 this makes immense sense to me. Thanks Juls and Tim.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: "tallison314159@gmail.com" <ta...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Friday, April 3, 2015 at 5:35 AM
To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>, "dev@tika.apache.org"
<de...@tika.apache.org>, "dev@poi.apache.org" <de...@poi.apache.org>
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

>All,
>  What do we think?
>
>On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
>
>CommonCrawl currently has the WET format that extracts plain text from
>web pages.  My guess is that this is text stripping from text-y formats.
>Let me know if I'm wrong!
>
>
>Would there be any interest in adding another format: WETT (WET-Tika) or
>supplementing the current WET by using Tika to extract contents from
>binary formats too: PDF, MSWord, etc.
>
>
>Julien Nioche kindly carved out 220 GB for us to experiment with on
>TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on a
>Rackspace vm.  But, I'm wondering now if it would make more sense to have
>CommonCrawl run Tika as part of its regular process and make the output
>available in one of your standard formats.
>
>
>
>CommonCrawl consumers would get Tika output, and the Tika dev community
>(including its dependencies, PDFBox, POI, etc.) could get the stacktraces
>to help prioritize bug fixes.
>
>
>Cheers,
>
>
>          Tim 
>
>
>
>

Re: Any interest in running Apache Tika as part of CommonCrawl?

Posted by John Hewson <jo...@jahewson.com>.

Yes, this would be great, if you need any PDFBox assistance then count me in.

-- John

> On 3 Apr 2015, at 05:35, tallison314159@gmail.com wrote:
> 
> All,
>   What do we think?
> 
>> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
>> CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess is that this is text stripping from text-y formats.  Let me know if I'm wrong!
>> 
>> Would there be any interest in adding another format: WETT (WET-Tika) or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.
>> 
>> Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302 on a Rackspace vm.  But, I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats. 
>> 
>> CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.
>> 
>> Cheers,
>> 
>>           Tim 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org

Any interest in running Apache Tika as part of CommonCrawl?

Posted by "Allison, Timothy B." <ta...@mitre.org>.

All,

What do you think?  (this is the third time I've tried to send to POI dev...many apologies if the other two emails spring up all of a sudden!).


https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0


On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess is that this is text stripping from text-y formats.  Let me know if I'm wrong!

Would there be any interest in adding another format: WETT (WET-Tika) or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace vm.  But, I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.

Cheers,

          Tim

Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Oleg Tikhonov <ol...@apache.org>.

Hi Tim,
Having looked at CC, a couple of ideas crossed my mind. I think it's cool.
+1.

BR,
Oleg
On 3 Apr 2015 17:29, "Allison, Timothy B." <ta...@mitre.org> wrote:

> All,
>
> What do you think?
>
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com<mailto:
> talliso...@gmail.com> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages.  My guess is that this is text stripping from text-y formats.  Let
> me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
> vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
>
> Cheers,
>
>           Tim
>

Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Konstantin Gribov <gr...@gmail.com>.

Tim,
This seems interesting, because it provides a big test dataset.
As far as I can see, they store PDFs/DOCs in WARC files, so the source data
for parsing is there.
-- 
Best regards,
Konstantin Gribov

On Fri, 3 Apr 2015 at 17:29, Allison, Timothy B. <ta...@mitre.org> wrote:

> All,
>
> What do you think?
>
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com<mailto:
> talliso...@gmail.com> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages.  My guess is that this is text stripping from text-y formats.  Let
> me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
> vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
>
> Cheers,
>
>           Tim
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
>
>

RE: FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by "Allison, Timothy B." <ta...@mitre.org>.

POI Colleagues,

  If you'd like a table (or better yet, the h2 database) of the results of runs against govdocs1 with stack traces, let me know where I should post it.  This came in quite handy for https://issues.apache.org/jira/browse/TIKA-1512, where the reporter couldn't share the document.  I'll be rerunning this process soon, once we have a release candidate for Tika 1.8.

    The downside to govdocs1 for POI is that there are very few docx/pptx/xlsx.  I'm hoping to unzip the slice of common crawl that Julien Nioche grabbed for us on TIKA-1302 soon, and I'll let you know what's in there.

             Best,

                       Tim 
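
The h2 table Tim offers is not reproduced in the thread. As a minimal sketch of how stack traces from such a run might be grouped to prioritize fixes, the Python below reduces each Java stack trace to its exception class plus top frame and counts occurrences. The (file, stack trace) input shape, the function names, and the org.example class names are assumptions for illustration, not the actual schema or tooling.

```python
from collections import Counter


def trace_key(stack_trace):
    """Reduce a Java stack trace to 'ExceptionClass @ top frame'."""
    lines = [ln.strip() for ln in stack_trace.strip().splitlines() if ln.strip()]
    if not lines:
        return "<empty>"
    # The first line looks like 'java.io.IOException: message'; keep the class only.
    exception = lines[0].split(":", 1)[0]
    # The first 'at ...' line is the frame where the exception was thrown.
    top_frame = next((ln[3:] for ln in lines[1:] if ln.startswith("at ")), "<no frame>")
    return f"{exception} @ {top_frame}"


def rank_failures(results):
    """results: iterable of (file_id, stack_trace or None).

    Returns a list of (key, count) pairs, most frequent failure first.
    Files that parsed without an exception (trace is None or empty) are skipped.
    """
    counts = Counter(trace_key(trace) for _, trace in results if trace)
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical traces; org.example is a placeholder package.
    npe = ("java.lang.NullPointerException\n"
           "\tat org.example.Parser.parse(Parser.java:10)\n"
           "\tat org.example.Main.main(Main.java:5)")
    for key, count in rank_failures([("a.pdf", npe), ("b.doc", None), ("c.pdf", npe)]):
        print(count, key)
```

Grouping on the top frame rather than the full trace is a deliberate simplification: it merges failures that reach the same throwing line by different call paths, which is usually what matters when deciding which bug to look at first.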

-----Original Message-----
From: Andreas Beeker [mailto:kiwiwings@apache.org] 
Sent: Friday, April 03, 2015 1:12 PM
To: POI Developers List
Cc: dev@tika.apache.org
Subject: Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

Hi,

Similar to Dominik's approach of checking the file base for parsing errors,
I'd like to scan for certain file constellations, e.g. for the typical "left over bytes" error
or other record combinations which I can't reproduce with my MS Office/LibreOffice versions.

I haven't thought about how it would actually be done, but I think logging the location in the
integration tests and later manually checking the corresponding files should be
sufficient.

Best wishes,
Andi

On 03.04.2015 17:51, Dominik Stadler wrote:
> Hi,
>
> I am very interested as I am following the Common Crawl activity for
> some time already. It sounds like a neat idea to do the check already
> when the crawl is done, are the binary documents already part of the
> crawl-data?
>
> ...
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
>> All,
>>
>> What do you think?
>>
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org

Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Andreas Beeker <ki...@apache.org>.

Hi,

Similar to Dominik's approach of checking the file base for parsing errors,
I'd like to scan for certain file constellations, e.g. for the typical "left over bytes" error
or other record combinations which I can't reproduce with my MS Office/LibreOffice versions.

I haven't thought about how it would actually be done, but I think logging the location in the
integration tests and later manually checking the corresponding files should be
sufficient.

Best wishes,
Andi

On 03.04.2015 17:51, Dominik Stadler wrote:
> Hi,
>
> I am very interested as I am following the Common Crawl activity for
> some time already. It sounds like a neat idea to do the check already
> when the crawl is done, are the binary documents already part of the
> crawl-data?
>
> ...
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
>> All,
>>
>> What do you think?
>>
>>
>>

Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Konstantin Gribov <gr...@gmail.com>.

Dominik,
I've downloaded one of the WARC files (from CC-MAIN-2015-06,
https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422115855094.38/warc/CC-MAIN-20150124161055-00000-ip-10-180-212-252.ec2.internal.warc.gz,
1.2 GB), and it contains at least PDFs and DOCs in the crawled data.

-- 
Best regards,
Konstantin Gribov
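
Konstantin does not say how he inspected the file. One way to check which document types a WARC file holds, sketched here under the assumption of well-formed WARC/1.0 records with CRLF line endings, is to tally the HTTP Content-Type of each response record. The function names are made up for this sketch, and it uses only the Python standard library rather than a dedicated WARC reader.

```python
import gzip
from collections import Counter


def iter_warc_records(stream):
    """Yield (warc_headers, block_bytes) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break  # a blank line ends the WARC header section
            name, _, value = raw.partition(b":")
            headers[name.strip().lower().decode("latin-1")] = value.strip().decode("latin-1")
        # Content-Length gives the size of the record block that follows.
        yield headers, stream.read(int(headers["content-length"]))


def http_content_type(block):
    """Return the lower-cased media type from an HTTP response block, or None."""
    head, _, _ = block.partition(b"\r\n\r\n")
    for raw in head.split(b"\r\n")[1:]:  # skip the HTTP status line
        name, _, value = raw.partition(b":")
        if name.strip().lower() == b"content-type":
            # Drop parameters such as '; charset=UTF-8'.
            return value.split(b";")[0].strip().lower().decode("latin-1")
    return None


def count_content_types(stream):
    """Count HTTP media types over all 'response' records in the stream."""
    counts = Counter()
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") == "response":
            counts[http_content_type(block) or "unknown"] += 1
    return counts


def count_content_types_in_file(path):
    """Convenience wrapper for a local .warc.gz file."""
    with gzip.open(path, "rb") as stream:
        return count_content_types(stream)
```

Calling count_content_types_in_file on a downloaded .warc.gz should then show whether entries such as application/pdf or application/msword are present. Note that the server-reported Content-Type can be wrong, so this is only a first approximation of what a detector like Tika would report.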

On Fri, 3 Apr 2015 at 18:52, Dominik Stadler <do...@gmx.at> wrote:

> Hi,
>
> I am very interested as I am following the Common Crawl activity for
> some time already. It sounds like a neat idea to do the check already
> when the crawl is done, are the binary documents already part of the
> crawl-data?
>
> Actually I am currently playing around with the Common Crawl URL Index
> (http://blog.commoncrawl.org/2013/01/common-crawl-url-index/) which is
> a much smaller sized download (230G) and only contains URLs without
> all the additional information.
>
> The index is a bit outdated and currently only covers half of the full
> common crawl, however there are people working on refreshing it for
> the latest crawls.
>
> I wrote a small app which extracts interesting URLs out of these (aka
> files that POI should be able to open), resulting in aprox. 6.6
> million links! Based on some tests for the full download there would
> be around 3.3 million documents requiring approximately 3TB of
> storage. Note that this is still an old crawl with only half of the
> data included, so a current crawl will be considerably bigger!
>
> Running them through the integration testing that we added in POI
> (which performs text and property extraction but also some other
> POI-related actions) already showed a few cases where slightly
> off-spec documents can cause bugs to appear, some initial related
> commits will follow shortly...
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <ta...@mitre.org>
> wrote:
> > All,
> >
> > What do you think?
> >
> >
> > https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
> >
> >
> > On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com
> <ma...@gmail.com> wrote:
> > CommonCrawl currently has the WET format that extracts plain text from
> web pages.  My guess is that this is text stripping from text-y formats.
> Let me know if I'm wrong!
> >
> > Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.
> >
> > Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
> vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
> >
> > CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
> >
> > Cheers,
> >
> >           Tim
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> > For additional commands, e-mail: dev-help@poi.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
>
>

Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Dominik Stadler <do...@gmx.at>.

Hi,

I am very interested, as I have been following the Common Crawl activity
for some time already. It sounds like a neat idea to do the check already
when the crawl is done. Are the binary documents already part of the
crawl data?

Actually I am currently playing around with the Common Crawl URL Index
(http://blog.commoncrawl.org/2013/01/common-crawl-url-index/) which is
a much smaller sized download (230G) and only contains URLs without
all the additional information.

The index is a bit outdated and currently only covers half of the full
common crawl, however there are people working on refreshing it for
the latest crawls.

I wrote a small app which extracts interesting URLs out of these (aka
files that POI should be able to open), resulting in approx. 6.6
million links! Based on some tests for the full download there would
be around 3.3 million documents requiring approximately 3TB of
storage. Note that this is still an old crawl with only half of the
data included, so a current crawl will be considerably bigger!

Running them through the integration testing that we added in POI
(which performs text and property extraction but also some other
POI-related actions) already showed a few cases where slightly
off-spec documents can cause bugs to appear. Some initial related
commits will follow shortly...

Dominik.
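
Dominik's URL-extraction app is not shown in the thread. As a rough illustration of the idea only, the sketch below keeps URLs whose path ends in an extension POI might open. The extension list, the function names, and the example.com URLs are assumptions made for this sketch, not the list or code the original app used.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of extensions for illustration; the real app's list is not given.
POI_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".msg", ".vsd", ".pub"}


def is_poi_candidate(url):
    """Return True if the URL path ends in one of POI_EXTENSIONS (case-insensitive)."""
    try:
        path = urlsplit(url).path  # ignores the query string and fragment
    except ValueError:
        return False  # malformed URL, e.g. a broken IPv6 host
    extension = posixpath.splitext(path)[1].lower()
    return extension in POI_EXTENSIONS


def filter_urls(urls):
    """Keep only the URLs that look like documents POI could open."""
    return [url for url in urls if is_poi_candidate(url)]


if __name__ == "__main__":
    sample = [
        "http://example.com/a/report.DOCX?dl=1",
        "http://example.com/index.html",
        "http://example.com/sheet.xls",
    ]
    for url in filter_urls(sample):
        print(url)
```

Filtering on the extension alone will miss documents served from extension-less URLs and will include links that no longer resolve, which may partly explain the gap between links and documents. For scale, the figures quoted above (about 3 TB for about 3.3 million documents) work out to a little under 1 MB per document on average.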

On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
> All,
>
> What do you think?
>
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com<ma...@gmail.com> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess is that this is text stripping from text-y formats.  Let me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace vm.  But, I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.
>
> Cheers,
>
>           Tim
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
>

FW: Any interest in running Apache Tika as part of CommonCrawl?

Posted by "Allison, Timothy B." <ta...@mitre.org>.

All,

What do you think?


https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0


On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess is that this is text stripping from text-y formats.  Let me know if I'm wrong!

Would there be any interest in adding another format: WETT (WET-Tika) or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace vm.  But, I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.

Cheers,

          Tim

RE: Any interest in running Apache Tika as part of CommonCrawl?

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Sorry, link wasn’t included:

https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

From: tallison314159@gmail.com [mailto:tallison314159@gmail.com]
Sent: Friday, April 03, 2015 8:35 AM
To: dev@pdfbox.apache.org; dev@tika.apache.org; dev@poi.apache.org
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

All,
  What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess is that this is text stripping from text-y formats.  Let me know if I'm wrong!

Would there be any interest in adding another format: WETT (WET-Tika) or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace vm.  But, I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.

Cheers,

          Tim

Re: Any interest in running Apache Tika as part of CommonCrawl?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

+1 this makes immense sense to me. Thanks Juls and Tim.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "tallison314159@gmail.com" <ta...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Friday, April 3, 2015 at 5:35 AM
To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>, "dev@tika.apache.org"
<de...@tika.apache.org>, "dev@poi.apache.org" <de...@poi.apache.org>
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

>All,
>  What do we think?
>
>On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
>
>CommonCrawl currently has the WET format that extracts plain text from
>web pages.  My guess is that this is text stripping from text-y formats.
>Let me know if I'm wrong!
>
>
>Would there be any interest in adding another format: WETT (WET-Tika) or
>supplementing the current WET by using Tika to extract contents from
>binary formats too: PDF, MSWord, etc.
>
>
>Julien Nioche kindly carved out 220 GB for us to experiment with on
>TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on a
>Rackspace vm.  But, I'm wondering now if it would make more sense to have
>CommonCrawl run Tika as part of its regular process and make the output
>available in one of your standard formats.
>
>
>
>CommonCrawl consumers would get Tika output, and the Tika dev community
>(including its dependencies, PDFBox, POI, etc.) could get the stacktraces
>to help prioritize bug fixes.
>
>
>Cheers,
>
>
>          Tim 
>
>
>
>


RE: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Makes sense.

My thought was to continue (well, actually develop and mature) that comparison work on the Rackspace vm.  

This could be an ancillary source of information.  It would come in monthly and wouldn't be as timely as being able to do our own runs, and it would only cover a single version, but I think it would still be quite valuable. 

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Friday, April 03, 2015 10:21 AM
To: dev@pdfbox.apache.org
Subject: Re: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

I'm mostly interested in differences between crawls with different 
PDFBox versions.

And I already have one change where I wonder if anything will happen: 
the text stripper code has this

wordSpacing == Float.NaN

however that is always false, and I wonder what differences will come up 
when using the correct code, which is

Float.isNaN(wordSpacing)

Tilman

On 03.04.2015 at 14:35, tallison314159@gmail.com wrote:
> All,
>   What do we think?
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
>
>     CommonCrawl currently has the WET format that extracts plain text
>     from web pages.  My guess is that this is text stripping from
>     text-y formats.  Let me know if I'm wrong!
>
>     Would there be any interest in adding another format: WETT
>     (WET-Tika) or supplementing the current WET by using Tika to
>     extract contents from binary formats too: PDF, MSWord, etc.
>
>     Julien Nioche kindly carved out 220 GB for us to experiment with
>     on TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on
>     a Rackspace vm.  But, I'm wondering now if it would make more
>     sense to have CommonCrawl run Tika as part of its regular
>     process and make the output available in one of your standard
>     formats.
>
>     CommonCrawl consumers would get Tika output, and the Tika dev
>     community (including its dependencies, PDFBox, POI, etc.) could
>     get the stacktraces to help prioritize bug fixes.
>
>     Cheers,
>
>               Tim
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

Posted by Tilman Hausherr <TH...@t-online.de>.

I'm mostly interested in differences between crawls with different 
PDFBox versions.

And I already have one change where I wonder if anything will happen: 
the text stripper code has this

wordSpacing == Float.NaN

however that is always false, and I wonder what differences will come up 
when using the correct code, which is

Float.isNaN(wordSpacing)
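
As a minimal sketch of why the first comparison can never be true: in Java, NaN compares unequal to every value, including itself, so an == test against Float.NaN is always false. The class and method names below are hypothetical and are not taken from the PDFBox text stripper; only the two expressions quoted above come from this thread.

```java
public class NaNCheck {

    // Mirrors the comparison quoted above. NaN is never equal to
    // anything (not even NaN), so this returns false for every input.
    static boolean brokenIsNaN(float wordSpacing) {
        return wordSpacing == Float.NaN;
    }

    // The corrected check quoted above, using the standard library method.
    static boolean correctIsNaN(float wordSpacing) {
        return Float.isNaN(wordSpacing);
    }

    public static void main(String[] args) {
        float wordSpacing = Float.NaN;
        System.out.println(brokenIsNaN(wordSpacing));  // prints "false"
        System.out.println(correctIsNaN(wordSpacing)); // prints "true"
        System.out.println(correctIsNaN(1.5f));        // prints "false"
    }
}
```

With the == form, any branch guarded by the check is dead code; switching to Float.isNaN makes that branch reachable, which is why extraction results could change between versions.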

Tilman

On 03.04.2015 at 14:35, tallison314159@gmail.com wrote:
> All,
>   What do we think?
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
>
>     CommonCrawl currently has the WET format that extracts plain text
>     from web pages.  My guess is that this is text stripping from
>     text-y formats.  Let me know if I'm wrong!
>
>     Would there be any interest in adding another format: WETT
>     (WET-Tika) or supplementing the current WET by using Tika to
>     extract contents from binary formats too: PDF, MSWord, etc.
>
>     Julien Nioche kindly carved out 220 GB for us to experiment with
>     on TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on
>     a Rackspace vm.  But, I'm wondering now if it would make more
>     sense to have CommonCrawl run Tika as part of its regular
>     process and make the output available in one of your standard
>     formats.
>
>     CommonCrawl consumers would get Tika output, and the Tika dev
>     community (including its dependencies, PDFBox, POI, etc.) could
>     get the stacktraces to help prioritize bug fixes.
>
>     Cheers,
>
>               Tim
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
