Posted to dev@tika.apache.org by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2012/01/28 03:01:26 UTC

% of different content types out there on the web

(sorry for the cross post)

Hey Guys,

I'm trying to find a good citation or estimate (if anyone has done one) of
the breakout (by % or some other metric) of non-HTML content types out there on the web
(from a whole web crawl or a meaningful representative dataset).

Anyone have any ideas about this?

Thanks!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: % of different content types out there on the web

Posted by Markus Jelsma <ma...@openindex.io>.


On Tuesday 31 January 2012 15:55:06 Mattmann, Chris A (388J) wrote:
> Hi Markus,
> 
> Thanks for the FYI. Any idea at specific %'s for those unwanted suffixes
> compared to the size of the entire corpus?

Unfortunately no, we don't keep a record of those; we just filter them away as
soon as we can.

> 
> Cheers,
> Chris
> 
> On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote:
> > We only crawl HTML and PDF files for a lot of cc-TLD's so we only have
> > data on those two. However, we also explicitly filter out all/most
> > unwanted suffixes. We do have a lot of suffixes that we encountered so
> > far.
> > 
> > On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
> >> (sorry for the cross post)
> >> 
> >> Hey Guys,
> >> 
> >> I'm trying to find a good citation or estimate (if anyone has done one)
> >> that estimates the breakout (by % or some other metric) of content types
> >> out there out the web (with a whole web crawl or a meaningful
> >> representative dataset) that are non HTML.
> >> 
> >> Anyone have any ideas about this?
> >> 
> >> Thanks!
> >> 
> >> Cheers,
> >> Chris
> >> 
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: chris.a.mattmann@nasa.gov
> >> WWW:   http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-- 
Markus Jelsma - CTO - Openindex

Re: % of different content types out there on the web

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hi Markus,

Thanks for the FYI. Any idea of the specific %'s for those unwanted suffixes compared
to the size of the entire corpus?

Cheers,
Chris

On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote:

> We only crawl HTML and PDF files for a lot of cc-TLD's so we only have data on 
> those two. However, we also explicitly filter out all/most unwanted suffixes. 
> We do have a lot of suffixes that we encountered so far.
> 
> On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
>> (sorry for the cross post)
>> 
>> Hey Guys,
>> 
>> I'm trying to find a good citation or estimate (if anyone has done one)
>> that estimates the breakout (by % or some other metric) of content types
>> out there out the web (with a whole web crawl or a meaningful
>> representative dataset) that are non HTML.
>> 
>> Anyone have any ideas about this?
>> 
>> Thanks!
>> 
>> Cheers,
>> Chris
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattmann@nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> -- 
> Markus Jelsma - CTO - Openindex


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: % of different content types out there on the web

Posted by Markus Jelsma <ma...@openindex.io>.

We only crawl HTML and PDF files for a lot of cc-TLD's so we only have data on 
those two. However, we also explicitly filter out all/most unwanted suffixes. 
We do have a lot of suffixes that we encountered so far.

On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
> (sorry for the cross post)
> 
> Hey Guys,
> 
> I'm trying to find a good citation or estimate (if anyone has done one)
> that estimates the breakout (by % or some other metric) of content types
> out there out the web (with a whole web crawl or a meaningful
> representative dataset) that are non HTML.
> 
> Anyone have any ideas about this?
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-- 
Markus Jelsma - CTO - Openindex
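
The suffix filtering described above could also record what it drops, which would give the per-suffix percentages asked about elsewhere in this thread. The following is only a minimal sketch in Python, not the Openindex crawler's actual filter: the suffix list and the function names are hypothetical, and it looks at the URL path only.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit

# Hypothetical list of unwanted suffixes; a real crawl would use its own.
UNWANTED_SUFFIXES = {".jpg", ".gif", ".zip"}


def filter_and_count(urls, unwanted=UNWANTED_SUFFIXES):
    """Drop URLs with an unwanted suffix, but tally each dropped suffix."""
    kept = []
    dropped = Counter()
    for url in urls:
        # Take the suffix from the path only, ignoring query and fragment.
        suffix = splitext(urlsplit(url).path)[1].lower()
        if suffix in unwanted:
            dropped[suffix] += 1
        else:
            kept.append(url)
    return kept, dropped


def dropped_percentages(kept, dropped):
    """Express each dropped suffix as a percentage of all URLs seen."""
    total = len(kept) + sum(dropped.values())
    if total == 0:
        return {}
    return {suffix: 100.0 * n / total for suffix, n in dropped.items()}
```

Keeping only a counter per suffix costs very little compared with storing the filtered URLs themselves, which is why the tally is separated from the list of kept URLs.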

Re: % of different content types out there on the web

Posted by Simão Fontes <si...@gmail.com>.

Hello Chris,

At the Portuguese Web Archive we did a study of web characteristics
for the Portuguese web. I don't know if this helps you, but here is
the paper.

João Miranda, Daniel Gomes, Trends in Web characteristics (best paper
award: 2nd place), 7th Latin American Web Congress, Merida, Mexico,
November 2009
Link to the paper:
http://sobre.arquivo.pt/sobre-o-arquivo/trends-in-web-characteristics/at_download/file
Presentation: http://sobre.arquivo.pt/about-the-archive/presentation-trends-in-web-characteristics
About other publications from our archive:
http://sobre.arquivo.pt/about-the-archive/publications?set_language=en

Hope this is of assistance.
Cheers,
Simão Fontes

On Sat, Jan 28, 2012 at 2:01 AM, Mattmann, Chris A (388J)
<ch...@jpl.nasa.gov> wrote:
> (sorry for the cross post)
>
> Hey Guys,
>
> I'm trying to find a good citation or estimate (if anyone has done one) that estimates
> the breakout (by % or some other metric) of content types out there out the web
> (with a whole web crawl or a meaningful representative dataset) that are non HTML.
>
> Anyone have any ideas about this?
>
> Thanks!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>

Re: % of different content types out there on the web

Posted by Julien Nioche <li...@gmail.com>.

That could be an interesting experiment to do with the commoncrawl dataset
and Tika on Behemoth. Assuming of course that the detection is done
correctly by Tika.  Does anyone have a spare cluster on EC2 ;-) ?

Julien

On 28 January 2012 02:01, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> (sorry for the cross post)
>
> Hey Guys,
>
> I'm trying to find a good citation or estimate (if anyone has done one)
> that estimates
> the breakout (by % or some other metric) of content types out there out
> the web
> (with a whole web crawl or a meaningful representative dataset) that are
> non HTML.
>
> Anyone have any ideas about this?
>
> Thanks!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
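
The aggregation step of the experiment suggested above is simple once each document has a detected type. The sketch below is in Python and is only an illustration under stated assumptions: it uses the standard library's `mimetypes.guess_type`, which guesses from the URL suffix alone, as a stand-in for Tika's content-based detection, and the function name `content_type_breakdown` is hypothetical. It does not touch the commoncrawl dataset or Behemoth.

```python
from collections import Counter
import mimetypes


def content_type_breakdown(urls):
    """Return {content_type: percent of URLs} for an iterable of URLs."""
    counts = Counter()
    for url in urls:
        # Stand-in for real detection: guess from the URL's suffix only.
        # A real run would replace this with a detector that reads the bytes.
        guessed, _encoding = mimetypes.guess_type(url)
        counts[guessed or "unknown"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from the most to the least frequent type.
    return {ctype: 100.0 * n / total for ctype, n in counts.most_common()}
```

On a cluster, the per-type counts would be summed across workers before the percentages are computed, since percentages from separate partitions cannot simply be averaged.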
