Posted to user@mahout.apache.org by Frank Scholten <fr...@frankscholten.nl> on 2011/05/06 22:02:02 UTC

Vectorizing arbitrary value types with seq2sparse

Hi everyone,

At the moment seq2sparse can generate vectors from sequence values of
type Text. More specifically, SequenceFileTokenizerMapper handles Text
values.
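
(For concreteness, here is a minimal sketch of the kind of input seq2sparse
consumes today: a SequenceFile with a Text document id as the key and the
Text body as the value. The path and the record contents are made up.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriteSeq2SparseInput {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One record per document: key = document id, value = the full text body.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("blog-seqfiles/part-00000"), Text.class, Text.class);
        try {
          writer.append(new Text("post-1"), new Text("title content tags all mashed together"));
        } finally {
          writer.close();
        }
      }
    }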

Would it be useful if seq2sparse could be configured to vectorize
value types such as a Blog article with several textual fields like
title, content, tags and so on?

Or is it easier to create a separate job for this or use Pig or
anything like that?

Frank

Re: Vectorizing arbitrary value types with seq2sparse

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
compound = composite, sorry. I've been mixing these words up since I was
in the first grade.

On Sat, May 7, 2011 at 3:13 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Potentially one might be able to use a compound key consisting,
> essentially, of the doc id and the value category, and then re-vectorize it
> (or bastardize seq2sparse) by adding the quantitative feature to the
> values. Yes, n-grams might get screwed a little, but who cares. The output
> still might be useful.

Re: Vectorizing arbitrary value types with seq2sparse

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Potentially one might be able to use a compound key consisting,
essentially, of the doc id and the value category, and then re-vectorize it
(or bastardize seq2sparse) by adding the quantitative feature to the
values. Yes, n-grams might get screwed a little, but who cares. The output
still might be useful.

N-grams on tags are of course not very useful (tags indeed have no
order), but who knows.
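
(A rough sketch of what that compound key could look like when preparing the
seq2sparse input. The "docId/fieldName" key convention and the BlogArticle
class are made up for illustration; only the SequenceFile.Writer usage is the
real Hadoop API.)

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Hypothetical value object with the fields Frank mentioned.
    class BlogArticle {
      String id, title, content, tags;
      BlogArticle(String id, String title, String content, String tags) {
        this.id = id; this.title = title; this.content = content; this.tags = tags;
      }
    }

    public class CompositeKeyWriter {
      // Emit one record per (document, field) so each field can be analyzed
      // and weighted on its own; a later pass could run seq2sparse with
      // --maxNGramSize 1 on the tag records to skip bigrams there.
      static void write(SequenceFile.Writer writer, BlogArticle a) throws Exception {
        writer.append(new Text(a.id + "/title"), new Text(a.title));
        writer.append(new Text(a.id + "/content"), new Text(a.content));
        writer.append(new Text(a.id + "/tags"), new Text(a.tags));
      }
    }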

I will have to consider a similar problem at some point, i.e. see how I can
bastardize the seq2sparse code to run incremental runs on pre-existing
dictionaries (if it is not yet capable of this).

If this gets implemented, then in the context of this problem one could
run multiple passes on different types of fields and would have an
option to disable bigrams in the passes where they wouldn't make much
sense.

-d

On Fri, May 6, 2011 at 3:53 PM, Ted Dunning <te...@gmail.com> wrote:
> Yeah.. that doesn't work at all.
>
> You need different analyzers at least and some fields are numeric, some
> textual.  The same words
> in different fields (usually) need to be considered separately.  N-grams
> raise all kinds of crazy issues.
>
> For instance, what does an n-gram of tags mean?  Are tags even ordered?
>
> Some fields contain dates, but different dates need to be considered as
> ages, or as points in time.
>
> It gets whacky fast.
>
> On Fri, May 6, 2011 at 1:52 PM, Frank Scholten <fr...@frankscholten.nl>wrote:
>
>> Hmm, seems more complex than I thought. I thought of a simple approach
>> where you could configure your own class that concatenated the desired
>> fields into one Text value and have the SequenceFileTokenizerMapper
>> process that value.
>>
>> But this can give unexpected results? I guess it may find incorrect
>> n-grams from tokens that were from different fields.
>>
>> On Fri, May 6, 2011 at 10:17 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > This is definitely desirable but is very different from the current tool.
>> >
>> > My guess is the big difficulty will be describing the vectorization to be
>> > done.  The hashed representations would make that easier, but still not
>> > trivial.  Dictionary based methods add multiple dictionary specifications
>> > and also require that we figure out how to combine vectors by
>> concatenation
>> > or overlay.
>> >
>> > On Fri, May 6, 2011 at 1:02 PM, Frank Scholten <frank@frankscholten.nl
>> >wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> At the moment seq2sparse can generate vectors from sequence values of
>> >> type Text. More specifically, SequenceFileTokenizerMapper handles Text
>> >> values.
>> >>
>> >> Would it be useful if seq2sparse could be configured to vectorize
>> >> value types such as a Blog article with several textual fields like
>> >> title, content, tags and so on?
>> >>
>> >> Or is it easier to create a separate job for this or use Pig or
>> >> anything like that?
>> >>
>> >> Frank
>> >>
>> >
>>
>

Re: Vectorizing arbitrary value types with seq2sparse

Posted by Ted Dunning <te...@gmail.com>.
Yeah.. that doesn't work at all.

You need different analyzers at least, and some fields are numeric, some
textual.  The same words in different fields (usually) need to be
considered separately.  N-grams raise all kinds of crazy issues.

For instance, what does an n-gram of tags mean?  Are tags even ordered?

Some fields contain dates, but different dates need to be considered as
ages, or as points in time.

It gets whacky fast.
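
(One way to read "different analyzers" in code, assuming Lucene 3.x classes on
the classpath; the field names and the particular analyzer choices are only for
illustration.)

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class FieldAnalyzers {
      // Each field gets the analysis that makes sense for it; tokens from
      // different fields never get mixed into the same n-grams.
      static Map<String, Analyzer> build() {
        Map<String, Analyzer> analyzers = new HashMap<String, Analyzer>();
        analyzers.put("title", new StandardAnalyzer(Version.LUCENE_31));
        analyzers.put("content", new StandardAnalyzer(Version.LUCENE_31));
        analyzers.put("tags", new WhitespaceAnalyzer(Version.LUCENE_31)); // unordered labels
        analyzers.put("author", new KeywordAnalyzer());                   // keep as one token
        return analyzers;
      }
    }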

On Fri, May 6, 2011 at 1:52 PM, Frank Scholten <fr...@frankscholten.nl>wrote:

> Hmm, seems more complex than I thought. I thought of a simple approach
> where you could configure your own class that concatenated the desired
> fields into one Text value and have the SequenceFileTokenizerMapper
> process that value.
>
> But this can give unexpected results? I guess it may find incorrect
> n-grams from tokens that were from different fields.
>
> On Fri, May 6, 2011 at 10:17 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > This is definitely desirable but is very different from the current tool.
> >
> > My guess is the big difficulty will be describing the vectorization to be
> > done.  The hashed representations would make that easier, but still not
> > trivial.  Dictionary based methods add multiple dictionary specifications
> > and also require that we figure out how to combine vectors by
> concatenation
> > or overlay.
> >
> > On Fri, May 6, 2011 at 1:02 PM, Frank Scholten <frank@frankscholten.nl
> >wrote:
> >
> >> Hi everyone,
> >>
> >> At the moment seq2sparse can generate vectors from sequence values of
> >> type Text. More specifically, SequenceFileTokenizerMapper handles Text
> >> values.
> >>
> >> Would it be useful if seq2sparse could be configured to vectorize
> >> value types such as a Blog article with several textual fields like
> >> title, content, tags and so on?
> >>
> >> Or is it easier to create a separate job for this or use Pig or
> >> anything like that?
> >>
> >> Frank
> >>
> >
>

Re: Vectorizing arbitrary value types with seq2sparse

Posted by Frank Scholten <fr...@frankscholten.nl>.
Hmm, seems more complex than I thought. I thought of a simple approach
where you could configure your own class that concatenates the desired
fields into one Text value and have the SequenceFileTokenizerMapper
process that value.

But this can give unexpected results? I guess it may find incorrect
n-grams from tokens that were from different fields.
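
(A tiny sketch of that concatenation idea, with made-up field values; as noted
above, the last token of one field and the first token of the next can form an
n-gram that never occurred inside any single field.)

    import org.apache.hadoop.io.Text;

    public class ConcatenatingConverter {
      // Joins the chosen fields into one Text value for SequenceFileTokenizerMapper.
      static Text toText(String title, String content, String tags) {
        return new Text(title + " " + content + " " + tags);
      }
    }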

On Fri, May 6, 2011 at 10:17 PM, Ted Dunning <te...@gmail.com> wrote:
> This is definitely desirable but is very different from the current tool.
>
> My guess is the big difficulty will be describing the vectorization to be
> done.  The hashed representations would make that easier, but still not
> trivial.  Dictionary based methods add multiple dictionary specifications
> and also require that we figure out how to combine vectors by concatenation
> or overlay.
>
> On Fri, May 6, 2011 at 1:02 PM, Frank Scholten <fr...@frankscholten.nl>wrote:
>
>> Hi everyone,
>>
>> At the moment seq2sparse can generate vectors from sequence values of
>> type Text. More specifically, SequenceFileTokenizerMapper handles Text
>> values.
>>
>> Would it be useful if seq2sparse could be configured to vectorize
>> value types such as a Blog article with several textual fields like
>> title, content, tags and so on?
>>
>> Or is it easier to create a separate job for this or use Pig or
>> anything like that?
>>
>> Frank
>>
>

Re: Vectorizing arbitrary value types with seq2sparse

Posted by Ted Dunning <te...@gmail.com>.
This is definitely desirable but is very different from the current tool.

My guess is that the big difficulty will be describing the vectorization to
be done.  The hashed representations would make that easier, but still not
trivial.  Dictionary-based methods add multiple dictionary specifications
and also require that we figure out how to combine vectors by concatenation
or overlay.
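
(For the hashed route, a rough sketch using the feature-hashing encoders that
ship with Mahout, assuming the org.apache.mahout.vectorizer.encoders package;
exact class and method names may differ by version, and the field names and
sample text are made up.)

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
    import org.apache.mahout.vectorizer.encoders.TextValueEncoder;

    public class HashedBlogEncoder {
      public static void main(String[] args) {
        // One encoder per field; the encoder name goes into the hash, so the
        // same word in "title" and in "content" lands in different positions.
        TextValueEncoder titleEncoder = new TextValueEncoder("title");
        TextValueEncoder contentEncoder = new TextValueEncoder("content");
        StaticWordValueEncoder tagEncoder = new StaticWordValueEncoder("tags");

        Vector v = new RandomAccessSparseVector(10000); // fixed size, no dictionary needed
        titleEncoder.addText("vectorizing arbitrary value types");
        titleEncoder.flush(1, v);
        contentEncoder.addText("at the moment seq2sparse can generate vectors");
        contentEncoder.flush(1, v);
        for (String tag : new String[] {"mahout", "vectorization"}) {
          tagEncoder.addToVector(tag, v);
        }
        System.out.println(v);
      }
    }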

On Fri, May 6, 2011 at 1:02 PM, Frank Scholten <fr...@frankscholten.nl>wrote:

> Hi everyone,
>
> At the moment seq2sparse can generate vectors from sequence values of
> type Text. More specifically, SequenceFileTokenizerMapper handles Text
> values.
>
> Would it be useful if seq2sparse could be configured to vectorize
> value types such as a Blog article with several textual fields like
> title, content, tags and so on?
>
> Or is it easier to create a separate job for this or use Pig or
> anything like that?
>
> Frank
>