Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/11/03 14:43:37 UTC

http://bixolabs.com/datasets/public-terabyte-dataset-project/

Might be of interest to all you Mahouts out there...  http://bixolabs.com/datasets/public-terabyte-dataset-project/

Would be cool to get this converted over to our vector format so that  
we can cluster, etc. 
  

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Miles <mi...@gmail.com>.
A very simple way to remove boilerplate (and this is trivial using
MapReduce) is to just remove all duplicate sentences. This does assume
you can extract sentences, do sentence boundary detection, etc.
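
For illustration only (nothing like this exists in Mahout; the naive
sentence-splitting regex and the one-document-per-record layout are
assumptions), a first counting pass could look roughly like this, with a
second job then filtering the flagged sentences out of each document:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Pass 1: count how often each sentence occurs across the corpus.
  public class SentenceCount {
    public static class SplitMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      protected void map(LongWritable key, Text doc, Context ctx)
          throws IOException, InterruptedException {
        // Naive sentence boundary detection; a real pass would do better.
        for (String sentence : doc.toString().split("(?<=[.!?])\\s+")) {
          ctx.write(new Text(sentence.trim()), ONE);
        }
      }
    }

    public static class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      protected void reduce(Text sentence, Iterable<IntWritable> counts, Context ctx)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        // Emit only duplicated sentences; a second job drops these from each doc.
        if (sum > 1) ctx.write(sentence, new IntWritable(sum));
      }
    }
  }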

Miles
Sent from your iPod


On 13 Nov 2009, at 19:06, Ted Dunning <te...@gmail.com> wrote:

> This looks like a very nice approach for getting rid of the goo.  I  
> often
> advocate using words/phrases/ngrams that are highly predicted by the  
> domain
> name as an alternative for removing boilerplate.  That has the  
> advantage
> that it doesn't require training text.  In the case of wiki-pedia,  
> this is
> not so useful because everything is in the same domain.  The domain
> predictor trick will only work if the feature you are using for the  
> input is
> not very content based.  Thus, this can fail for small domain- 
> focused sites
> or if you use a content laden URL for the task.
>
>
>
> On Fri, Nov 13, 2009 at 10:36 AM, Ken Krugler
> <kk...@transpac.com>wrote:
>
>> Hi all,
>>
>> Another issue came up, about cleaning the text.
>>
>> One interested user suggested using nCleaner (see
>> http://www.lrec-conf.org/proceedings/lrec2008/pdf/885_paper.pdf) as  
>> a way
>> of tossing boilerplate text that skews text frequency data.
>>
>> Any thoughts on this?
>>
>> Thanks,
>>
>> -- Ken
>>
>>
>> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>>
>> Might be of interest to all you Mahouts out there...
>>> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>>
>>> Would be cool to get this converted over to our vector format so  
>>> that we
>>> can cluster, etc.
>>>
>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>
>>
>>
>>
>>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Ted Dunning <te...@gmail.com>.
This looks like a very nice approach for getting rid of the goo.  I often
advocate using words/phrases/ngrams that are highly predicted by the domain
name as an alternative for removing boilerplate.  That has the advantage
that it doesn't require training text.  In the case of Wikipedia, this is
not so useful because everything is in the same domain.  The domain
predictor trick will only work if the feature you are using for the input is
not very content-based.  Thus, this can fail for small domain-focused sites
or if you use a content-laden URL for the task.
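
As a rough sketch (not existing Mahout code; the count definitions and any
cutoff are assumptions), one way to score how strongly a domain predicts a
term is a log-likelihood ratio over a 2x2 contingency table; terms with very
high scores for a domain are boilerplate candidates on that domain's pages:

  // Score the term/domain association with a G^2 log-likelihood ratio.
  public class DomainTermScore {
    /**
     * k11: occurrences of the term on pages from this domain
     * k12: occurrences of the term everywhere else
     * k21: occurrences of all other terms on pages from this domain
     * k22: occurrences of all other terms everywhere else
     */
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
      long[] k = {k11, k12, k21, k22};
      double[] row = {k11 + k12, k11 + k12, k21 + k22, k21 + k22};
      double[] col = {k11 + k21, k12 + k22, k11 + k21, k12 + k22};
      double n = k11 + k12 + k21 + k22;
      double sum = 0.0;
      for (int i = 0; i < 4; i++) {
        if (k[i] > 0) {
          sum += k[i] * Math.log((k[i] * n) / (row[i] * col[i]));
        }
      }
      return 2.0 * sum;
    }
  }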



On Fri, Nov 13, 2009 at 10:36 AM, Ken Krugler
<kk...@transpac.com>wrote:

> Hi all,
>
> Another issue came up, about cleaning the text.
>
> One interested user suggested using nCleaner (see
> http://www.lrec-conf.org/proceedings/lrec2008/pdf/885_paper.pdf) as a way
> of tossing boilerplate text that skews text frequency data.
>
> Any thoughts on this?
>
> Thanks,
>
> -- Ken
>
>
> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>
>  Might be of interest to all you Mahouts out there...
>> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>
>> Would be cool to get this converted over to our vector format so that we
>> can cluster, etc.
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Ken Krugler <kk...@transpac.com>.
Hi all,

Another issue came up, about cleaning the text.

One interested user suggested using nCleaner (see http://www.lrec-conf.org/proceedings/lrec2008/pdf/885_paper.pdf) as a way of tossing boilerplate text that skews text frequency data.

Any thoughts on this?

Thanks,

-- Ken

On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:

> Might be of interest to all you Mahouts out there...  http://bixolabs.com/datasets/public-terabyte-dataset-project/
>
> Would be cool to get this converted over to our vector format so  
> that we can cluster, etc.

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Nov 13, 2009 at 10:35 AM, Ken Krugler
<kk...@transpac.com>wrote:

> Hi Ted,
>
> On Nov 3, 2009, at 6:37pm, Ted Dunning wrote:
>
>  I would opt for the most specific tokenization that is feasible (no
>> stemming, as much compounding as possible).
>>
>
> By "as much compounding as possible", do you mean you want the tokenizer to
> do as much splitting as possible, or as little?
>

My ultimate preference is to actually glue very common phrases into a single
term.  This can be reversed with a linear transformation.  This may not be
feasible for a first hack.


> E.g. "super-duper" should be left as-is, or turned into "super" and
> "duper"?
>

Left as-is.  And "New York" should be tokenized as a single term.  Likewise
with "Staff writer of the Wall Street Journal".


> Is there a particular configuration of Lucene tokenizers that you'd
> suggest?
>

I am not an expert, but I know of no tokenizers that will do this.  Lucene
typically retains positional information, which means that searching for
phrases is relatively cheap (10x in search time, but most collections take
approximately zero time to search).

Maybe the best answer is to produce two vectorized versions: one heavily
stemmed and split apart (the micro-token approach), and another, the
mega-token version, that we can progressively improve.

The final arbiter should be whoever does the work (you!).

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Ken Krugler <kk...@transpac.com>.
Hi Ted,

On Nov 3, 2009, at 6:37pm, Ted Dunning wrote:

> I would opt for the most specific tokenization that is feasible (no
> stemming, as much compounding as possible).

By "as much compounding as possible", do you mean you want the  
tokenizer to do as much splitting as possible, or as little?

E.g. "super-duper" should be left as-is, or turned into "super" and  
"duper"?

Is there a particular configuration of Lucene tokenizers that you'd  
suggest?

Thanks,

-- Ken


> The rationale for this is that
> stemming and uncompounding can be added by linear transformations of  
> the
> matrix at any time.
>
> The only serious issue with this is the problem of overlapping  
> compound
> words.
>
> On Tue, Nov 3, 2009 at 2:39 PM, Ken Krugler <kkrugler_lists@transpac.com 
> >wrote:
>
>> I assume there would also be an issue of which tokenizer to use to  
>> create
>> the terms from the text.
>>
>> And possibly issues around storing separate vectors for (at least)  
>> title
>> vs. content?
>>
>> Anybody have input on either of these?
>>
>>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Ted Dunning <te...@gmail.com>.
I would opt for the most specific tokenization that is feasible (no
stemming, as much compounding as possible).  The rationale for this is that
stemming and uncompounding can be added by linear transformations of the
matrix at any time.

The only serious issue with this is the problem of overlapping compound
words.
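
To make the linear-transformation point concrete, here is a small sketch (not
existing code) in which the "matrix" is just a rawIndex -> stemIndex lookup,
so folding a raw term-count vector onto stems is a single pass:

  import java.util.*;

  // Stemming as a linear transformation: a sparse fold-in matrix M with
  // M[stem][rawTerm] = 1 maps a raw vector to a stemmed one (v_stemmed = M * v_raw).
  public class StemFolder {
    public static Map<Integer, Double> fold(Map<Integer, Double> rawVector,
                                             int[] rawIndexToStemIndex) {
      Map<Integer, Double> stemmed = new HashMap<Integer, Double>();
      for (Map.Entry<Integer, Double> e : rawVector.entrySet()) {
        int stemIndex = rawIndexToStemIndex[e.getKey()];
        Double current = stemmed.get(stemIndex);
        stemmed.put(stemIndex, (current == null ? 0.0 : current) + e.getValue());
      }
      return stemmed;
    }
  }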

On Tue, Nov 3, 2009 at 2:39 PM, Ken Krugler <kk...@transpac.com>wrote:

> I assume there would also be an issue of which tokenizer to use to create
> the terms from the text.
>
> And possibly issues around storing separate vectors for (at least) title
> vs. content?
>
> Anybody have input on either of these?
>
>

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Ken Krugler <kk...@transpac.com>.
I assume there would also be an issue of which tokenizer to use to  
create the terms from the text.

And possibly issues around storing separate vectors for (at least)  
title vs. content?

Anybody have input on either of these?

Thanks,

-- Ken

On Nov 3, 2009, at 10:14am, Jake Mannix wrote:

> Well the minimum size, for the IntDoubleVector which isn't yet in  
> trunk
> (it's on Ted's patch which hasn't worked its way in yet) would  
> entail one
> int and one double per unique term in the document, so that's 12 bytes
> each.  Typical documents have lots of repeat terms, but most terms are
> smaller than 12 bytes as well... so the fraction is probably more  
> than 10%
> and less than 50% is my guess.  But I'm sure others around here have  
> more
> experience producing large vector sets out of the text in Mahout.
>
> -jake
>
> On Tue, Nov 3, 2009 at 7:49 AM, Ken Krugler <kkrugler_lists@transpac.com 
> >wrote:
>
>>
>> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>>
>> Might be of interest to all you Mahouts out there...
>>> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>>
>>> Would be cool to get this converted over to our vector format so  
>>> that we
>>> can cluster, etc.
>>>
>>
>>
>> How much additional space would be required for the vectors, in some
>> optimal compressed format? Say as a percentage of raw text size.
>>
>> I'm asking because I have some flexibility in the processing and  
>> associated
>> metadata I can store as part of the dataset.
>>
>> -- Ken
>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>
>>
>>
>>
>>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Ted Dunning <te...@gmail.com>.
Another alternative is to simply store a boolean matrix.  That would require
4 bytes per unique term (just the int index, with no stored value).

Both forms would compress pretty well.  In the boolean case, I would expect
that the average cost per term would be just under 2 bytes per term.  For
the vector with actual counts stored in doubles, the compression would
probably be nearly as good with about another byte or less for the value.

If we assume 2 bytes per term and 1 byte for the count on average after
compression, this should be about a quarter of what the original text was
(assuming each unique term occurs about twice on average).  Markup will be
stripped, which will allow a bit more savings.

These numbers are very much in-line with Jake's estimates.
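
A quick sanity check of the quarter-of-raw-text figure, using assumed
averages of roughly 6 bytes of raw text per token (including whitespace) and
two occurrences per unique term:

  public class CompressedSizeEstimate {
    public static void main(String[] args) {
      // Hypothetical averages, just to sanity-check the "about a quarter" figure.
      double rawBytesPerToken = 6.0;            // ~5 characters plus whitespace
      double occurrencesPerUniqueTerm = 2.0;    // each unique term appears ~twice
      double compressedBytesPerUniqueTerm = 2.0 + 1.0;  // ~2 bytes index + ~1 byte count
      double ratio = compressedBytesPerUniqueTerm
          / (rawBytesPerToken * occurrencesPerUniqueTerm);
      System.out.printf("compressed vectors = %.0f%% of raw text%n", ratio * 100);  // 25%
    }
  }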

On Tue, Nov 3, 2009 at 10:14 AM, Jake Mannix <ja...@gmail.com> wrote:

> Well the minimum size, for the IntDoubleVector which isn't yet in trunk
> (it's on Ted's patch which hasn't worked its way in yet) would entail one
> int and one double per unique term in the document, so that's 12 bytes
> each.  Typical documents have lots of repeat terms, but most terms are
> smaller than 12 bytes as well... so the fraction is probably more than 10%
> and less than 50% is my guess.  But I'm sure others around here have more
> experience producing large vector sets out of the text in Mahout.
>
>  -jake
>
> On Tue, Nov 3, 2009 at 7:49 AM, Ken Krugler <kkrugler_lists@transpac.com
> >wrote:
>
> >
> > On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
> >
> >  Might be of interest to all you Mahouts out there...
> >> http://bixolabs.com/datasets/public-terabyte-dataset-project/
> >>
> >> Would be cool to get this converted over to our vector format so that we
> >> can cluster, etc.
> >>
> >
> >
> > How much additional space would be required for the vectors, in some
> > optimal compressed format? Say as a percentage of raw text size.
> >
> > I'm asking because I have some flexibility in the processing and
> associated
> > metadata I can store as part of the dataset.
> >
> > -- Ken
> >
> > --------------------------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > e l a s t i c   w e b   m i n i n g
> >
> >
> >
> >
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Jake Mannix <ja...@gmail.com>.
Well, the minimum size for the IntDoubleVector, which isn't yet in trunk
(it's on Ted's patch which hasn't worked its way in yet), would entail one
int and one double per unique term in the document, so that's 12 bytes
each.  Typical documents have lots of repeated terms, but most terms are
smaller than 12 bytes as well... so my guess is the fraction is probably
more than 10% and less than 50%.  But I'm sure others around here have more
experience producing large vector sets out of the text in Mahout.
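
A back-of-envelope check, with made-up page and vocabulary sizes, of where in
that 10%-50% range a typical page might land:

  public class VectorSizeEstimate {
    public static void main(String[] args) {
      // The page and vocabulary sizes below are assumptions, not measurements.
      int rawTextBytes = 30 * 1024;  // ~30 KB of extracted text for a largish page
      int uniqueTerms = 1000;        // unique terms on that page
      int vectorBytes = uniqueTerms * (4 + 8);  // int index + double weight = 12 bytes/term
      System.out.printf("vector = %d bytes, ~%.0f%% of the raw text%n",
          vectorBytes, 100.0 * vectorBytes / rawTextBytes);  // 12000 bytes, ~39%
    }
  }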

  -jake

On Tue, Nov 3, 2009 at 7:49 AM, Ken Krugler <kk...@transpac.com>wrote:

>
> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>
>  Might be of interest to all you Mahouts out there...
>> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>
>> Would be cool to get this converted over to our vector format so that we
>> can cluster, etc.
>>
>
>
> How much additional space would be required for the vectors, in some
> optimal compressed format? Say as a percentage of raw text size.
>
> I'm asking because I have some flexibility in the processing and associated
> metadata I can store as part of the dataset.
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Ted Dunning <te...@gmail.com>.
In the intermediate representation, it is very good to keep string -> double
mappings in some form.  In memory, we probably need to separate this into
String -> index and index -> double representations so that we have
flexibility of representation.

I am not sure which you intended.
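
As a hypothetical sketch of that split (not anything in Mahout today): a
shared String -> index dictionary plus per-document index -> double vectors:

  import java.util.*;

  // Sketch of the in-memory separation: one dictionary shared across documents,
  // and a sparse index -> double vector per document.
  public class SplitRepresentation {
    private final Map<String, Integer> dictionary = new HashMap<String, Integer>();

    private int indexOf(String term) {
      Integer idx = dictionary.get(term);
      if (idx == null) {
        idx = dictionary.size();
        dictionary.put(term, idx);
      }
      return idx;
    }

    public Map<Integer, Double> vectorize(Map<String, Double> termWeights) {
      Map<Integer, Double> vector = new HashMap<Integer, Double>();
      for (Map.Entry<String, Double> e : termWeights.entrySet()) {
        vector.put(indexOf(e.getKey()), e.getValue());
      }
      return vector;
    }
  }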

On Wed, Nov 4, 2009 at 1:16 AM, Robin Anil <ro...@gmail.com> wrote:

> I had always thought we should be using Hadoop to number these features and
> create the vector the way Bayes Classifier does it. In Bayes classifier, I
> don't bother to number the feature. Instead use String=>double mapping. I
> will see If feature numbering could be done by a single map/reduce job. If
> thats the case, We can use the TfIdfDriver to generate the tfidf scores and
> then convert the docs into array(int=>double) vectors. That way it would be
> done in a distributed manner
>
>
> Robin
>
>
> On Wed, Nov 4, 2009 at 1:35 PM, Shashikant Kore <shashikant@gmail.com
> >wrote:
>
> > First, we need to create lucene index from this text. Typically, index
> > size is close to 30% of the raw text. (Though, I have seen cases,
> > where it could be as high as 45%). The vectors take 25% of index size
> > (Or, roughly 10% of original text)
> >
> > The space taken by index could be reclaimed after creating the vectors.
> >
> > --shashi
> >
> > On Tue, Nov 3, 2009 at 9:19 PM, Ken Krugler <kkrugler_lists@transpac.com
> >
> > wrote:
> > >
> > > On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
> > >
> > >> Might be of interest to all you Mahouts out there...
> > >>  http://bixolabs.com/datasets/public-terabyte-dataset-project/
> > >>
> > >> Would be cool to get this converted over to our vector format so that
> we
> > >> can cluster, etc.
> > >
> > >
> > > How much additional space would be required for the vectors, in some
> > optimal
> > > compressed format? Say as a percentage of raw text size.
> > >
> > > I'm asking because I have some flexibility in the processing and
> > associated
> > > metadata I can store as part of the dataset.
> > >
> > > -- Ken
> > >
> > > --------------------------------------------
> > > Ken Krugler
> > > +1 530-210-6378
> > > http://bixolabs.com
> > > e l a s t i c   w e b   m i n i n g
> > >
> > >
> > >
> > >
> > >
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 4, 2009, at 1:16 AM, Robin Anil wrote:

> I had always thought we should be using Hadoop to number these  
> features and
> create the vector the way Bayes Classifier does it. In Bayes  
> classifier, I
> don't bother to number the feature. Instead use String=>double  
> mapping. I
> will see If feature numbering could be done by a single map/reduce  
> job. If
> thats the case, We can use the TfIdfDriver to generate the tfidf  
> scores and
> then convert the docs into array(int=>double) vectors. That way it  
> would be
> done in a distributed manner

Ideally, I think we'd have a bunch of different conversion mechanisms.
We should probably move the TfIdfDriver out to the Utils module and
see if it can be made more generic.  We could also use Hadoop M/R jobs
for the Lucene extraction code, so that if you have a bunch of
indexes in a large-scale distributed environment, you can run M/R on
them to create the vectors.

-Grant

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Robin Anil <ro...@gmail.com>.
I had always thought we should be using Hadoop to number these features and
create the vectors the way the Bayes classifier does it. In the Bayes
classifier, I don't bother to number the features; instead I use a
String=>double mapping. I will see if feature numbering could be done by a
single map/reduce job. If that's the case, we can use the TfIdfDriver to
generate the tf-idf scores and then convert the docs into array(int=>double)
vectors. That way it would be done in a distributed manner.
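
Purely as a sketch of one possible single-job approach (nothing like this
exists yet, and it assumes the job is configured with a single reducer so the
ids stay consistent):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // One MapReduce job that assigns a sequential int id to every distinct term.
  public class FeatureNumbering {
    public static class TermMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      protected void map(LongWritable key, Text doc, Context ctx)
          throws IOException, InterruptedException {
        for (String term : doc.toString().split("\\s+")) {
          ctx.write(new Text(term), NullWritable.get());
        }
      }
    }

    public static class NumberingReducer
        extends Reducer<Text, NullWritable, Text, IntWritable> {
      private int nextId = 0;
      protected void reduce(Text term, Iterable<NullWritable> ignored, Context ctx)
          throws IOException, InterruptedException {
        ctx.write(term, new IntWritable(nextId++));  // term -> feature id
      }
    }
  }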


Robin


On Wed, Nov 4, 2009 at 1:35 PM, Shashikant Kore <sh...@gmail.com>wrote:

> First, we need to create lucene index from this text. Typically, index
> size is close to 30% of the raw text. (Though, I have seen cases,
> where it could be as high as 45%). The vectors take 25% of index size
> (Or, roughly 10% of original text)
>
> The space taken by index could be reclaimed after creating the vectors.
>
> --shashi
>
> On Tue, Nov 3, 2009 at 9:19 PM, Ken Krugler <kk...@transpac.com>
> wrote:
> >
> > On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
> >
> >> Might be of interest to all you Mahouts out there...
> >>  http://bixolabs.com/datasets/public-terabyte-dataset-project/
> >>
> >> Would be cool to get this converted over to our vector format so that we
> >> can cluster, etc.
> >
> >
> > How much additional space would be required for the vectors, in some
> optimal
> > compressed format? Say as a percentage of raw text size.
> >
> > I'm asking because I have some flexibility in the processing and
> associated
> > metadata I can store as part of the dataset.
> >
> > -- Ken
> >
> > --------------------------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > e l a s t i c   w e b   m i n i n g
> >
> >
> >
> >
> >
>

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Shashikant Kore <sh...@gmail.com>.
First, we need to create a Lucene index from this text. Typically, the index
size is close to 30% of the raw text (though I have seen cases where it
could be as high as 45%). The vectors take about 25% of the index size
(or roughly 10% of the original text).

The space taken by the index could be reclaimed after creating the vectors.

--shashi

On Tue, Nov 3, 2009 at 9:19 PM, Ken Krugler <kk...@transpac.com> wrote:
>
> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>
>> Might be of interest to all you Mahouts out there...
>>  http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>
>> Would be cool to get this converted over to our vector format so that we
>> can cluster, etc.
>
>
> How much additional space would be required for the vectors, in some optimal
> compressed format? Say as a percentage of raw text size.
>
> I'm asking because I have some flexibility in the processing and associated
> metadata I can store as part of the dataset.
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>

Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/

Posted by Ken Krugler <kk...@transpac.com>.
On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:

> Might be of interest to all you Mahouts out there...  http://bixolabs.com/datasets/public-terabyte-dataset-project/
>
> Would be cool to get this converted over to our vector format so  
> that we can cluster, etc.


How much additional space would be required for the vectors, in some  
optimal compressed format? Say as a percentage of raw text size.

I'm asking because I have some flexibility in the processing and  
associated metadata I can store as part of the dataset.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g