Posted to dev@flink.apache.org by Hilmi Yildirim <Hi...@dfki.de> on 2016/01/05 11:36:58 UTC

LabeledVector with label vector

Hi,
in the ML-Pipeline of Flink we have the "LabeledVector" class. It 
consists of a vector and a label as a double value. Unfortunately, it is 
not applicable for sequence learning where the label is also a vector. 
For example, in NLP we have a vector of words and the label is a vector 
of the corresponding labels.

The optimize function of the "Solver" class takes a DataSet[LabeledVector] 
as input and, therefore, it is not applicable to sequence learning. I 
think the LabeledVector should be adapted so that the label is a vector 
instead of a single Double value. What do you think?

Best Regards,

-- 
==================================================================
Hilmi Yildirim, M.Sc.
Researcher

DFKI GmbH
Intelligente Analytik für Massendaten
DFKI Projektbüro Berlin
Alt-Moabit 91c
D-10559 Berlin
Phone: +49 30 23895 1814

E-Mail: Hilmi.Yildirim@dfki.de

-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: LabeledVector with label vector

Posted by Till Rohrmann <ti...@gmail.com>.
Hi,

yes, initially we thought about introducing a LabeledVector where the label
can be a vector. However, for the sake of simplicity we decided to first
implement a LabeledVector with a single double value as label.

A simple double value should take 8 bytes of memory. A
DenseVector(Array(Label_Value)) should take 4 bytes for the array
reference, 16 bytes for the array structure and 8 bytes for the value = 28
bytes. Thus, the space for the label would be roughly 3.5 times as large.
This might not be too expensive, assuming that most of the space will be
taken by the feature vector. However, I would also assume that access to the
label value wrapped in a DenseVector would be a little slower, since one
first has to retrieve the reference to the array and then access the first
element instead of directly accessing a double field.
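
The arithmetic behind this estimate can be written out as a small Scala sketch. The byte counts are the estimates quoted in this message, not measured JVM sizes, and the object and value names are hypothetical.

```scala
// Back-of-the-envelope label footprint, using only the byte counts quoted above.
// These are estimates from the message, not measurements of a real JVM.
object LabelFootprint {
  val plainDoubleBytes: Int = 8      // label stored directly as a double field
  val arrayReferenceBytes: Int = 4   // reference to the backing array
  val arrayStructureBytes: Int = 16  // estimate for the array structure itself
  val arrayElementBytes: Int = 8     // the single double element in the array

  // Total for a label wrapped as DenseVector(Array(labelValue))
  val wrappedLabelBytes: Int =
    arrayReferenceBytes + arrayStructureBytes + arrayElementBytes

  // How many times larger the wrapped label is than the plain double
  val growthFactor: Double = wrappedLabelBytes.toDouble / plainDoubleBytes

  def main(args: Array[String]): Unit =
    println(s"wrapped = $wrappedLabelBytes bytes, factor = $growthFactor")
}
```

Actual object sizes depend on the JVM and its settings, so a benchmark or a memory-layout tool would be needed to confirm these numbers.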

I would assume that this should not have such a big impact on the overall
performance. But without running some benchmarks, it’s hard to say.

Alternatively, you can also define your own custom type for NLP. I’m not
really familiar with sequence learning but can you use gradient descent for
that? If not, then you have to write your own solver which can also work on
a different type.
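
A custom type and a solver that works on it might look roughly like the following sketch. Everything here is hypothetical and not part of FlinkML: the names `LabeledSequence`, `SequenceSolver` and `IdentitySequenceSolver` are made up for illustration, plain arrays stand in for the library's vector types, and a `Seq` stands in for a distributed DataSet.

```scala
// Hypothetical custom example type for sequence labeling (not part of FlinkML).
// features(i) holds the features of token i; labels(i) is that token's label.
case class LabeledSequence(labels: Array[Double], features: Array[Array[Double]]) {
  require(labels.length == features.length, "expected one label per token")
  def length: Int = labels.length
}

// Hypothetical solver interface that works on the custom type instead of
// LabeledVector. A plain Seq stands in for a distributed DataSet here.
trait SequenceSolver {
  def optimize(data: Seq[LabeledSequence], initialWeights: Array[Double]): Array[Double]
}

// Trivial implementation that returns the weights unchanged. It only shows
// that the interface can be implemented, not how to train a sequence model.
object IdentitySequenceSolver extends SequenceSolver {
  def optimize(data: Seq[LabeledSequence], initialWeights: Array[Double]): Array[Double] =
    initialWeights
}
```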

Cheers,
Till

On Wed, Jan 6, 2016 at 2:12 AM, Chiwan Park <ch...@apache.org> wrote:

> Hi Theodore,
>
> Thanks for explaining the reason. :)
>
> So how about changing LabeledVector to contain two vectors? One of the
> vectors is for the label and the other one is for the value. I think this
> approach would be okay because a double value label could be represented as
> a DenseVector(Array(LABEL_VALUE)).
>
> The only problem with this approach is some overhead from processing the
> Vector type in the case of a single double label. If the overhead is
> significant, we should create two types of LabeledVector, such as
> DoubleLabeledVector and VectorLabeledVector.
>
> Which one is preferred?
>
> Regards,
> Chiwan Park
>
>
>

Re: LabeledVector with label vector

Posted by Chiwan Park <ch...@apache.org>.
Hi Theodore,

Thanks for explaining the reason. :)

So how about changing LabeledVector to contain two vectors? One of the vectors is for the label and the other one is for the value. I think this approach would be okay because a double value label could be represented as a DenseVector(Array(LABEL_VALUE)).

The only problem with this approach is some overhead from processing the Vector type in the case of a single double label. If the overhead is significant, we should create two types of LabeledVector, such as DoubleLabeledVector and VectorLabeledVector.
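
The two options could be sketched as follows. The type names DoubleLabeledVector and VectorLabeledVector come from this message; the field layout, the `DenseVec` stand-in and the `toVectorLabeled` helper are assumptions made so the sketch is self-contained.

```scala
// Minimal stand-in for a dense vector so the sketch compiles on its own;
// in FlinkML this role would be played by the library's own vector types.
case class DenseVec(values: Array[Double]) {
  def size: Int = values.length
}

// Option 1: a single type whose label is always a vector.
case class VectorLabeledVector(label: DenseVec, vector: DenseVec)

// Option 2: a separate type that keeps the plain double label.
case class DoubleLabeledVector(label: Double, vector: DenseVec) {
  // A double label can always be lifted to a one-element vector label
  def toVectorLabeled: VectorLabeledVector =
    VectorLabeledVector(DenseVec(Array(label)), vector)
}
```

With a conversion like this, code written against the vector-labeled type could still accept double-labeled data, at the cost of the wrapping overhead mentioned above.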

Which one is preferred? 

> On Jan 5, 2016, at 11:38 PM, Theodore Vasiloudis <th...@gmail.com> wrote:
> 
> Generalizing the type of the label for the label vector is an idea we
> played with when designing the current optimization framework.
> 
> We ended up deciding against it as the double type allows us to do
> regressions and (multiclass) classification which should be the majority of
> the use cases out there, while keeping the code simple.
> 
> Generalizing this to [T <: Serializable] is too broad, I think. [T <:
> Vector] is more reasonable, I think; I cannot think of many cases where the
> label in an optimization problem is something other than a vector/double.
> 
> Any change would of course require a number of changes in the optimization
> code, as optimizing for vector and double labels requires different handling
> of error calculation etc., but it should be doable.
> Note however that since LabeledVector is such a core part of the library,
> any changes would involve a number of adjustments downstream.
> 
> Perhaps having different optimizers etc. for Vectors and double labels
> makes sense, but I haven't put much thought into this.

Regards,
Chiwan Park



Re: LabeledVector with label vector

Posted by Theodore Vasiloudis <th...@gmail.com>.
Generalizing the type of the label for the label vector is an idea we
played with when designing the current optimization framework.

We ended up deciding against it as the double type allows us to do
regressions and (multiclass) classification which should be the majority of
the use cases out there, while keeping the code simple.

Generalizing this to [T <: Serializable] is too broad, I think. [T <:
Vector] is more reasonable, I think; I cannot think of many cases where the
label in an optimization problem is something other than a vector/double.

Any change would of course require a number of changes in the optimization
code, as optimizing for vector and double labels requires different handling
of error calculation etc., but it should be doable.
Note however that since LabeledVector is such a core part of the library,
any changes would involve a number of adjustments downstream.

Perhaps having different optimizers etc. for Vectors and double labels
makes sense, but I haven't put much thought into this.
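
A rough sketch of the [T <: Vector] variant, and of why the error calculation differs between double and vector labels, is given below. The types `Vec`, `Dense` and `GenericLabeledVector` are hypothetical stand-ins rather than FlinkML classes, and squared error is only an assumed example of a loss.

```scala
// Minimal stand-ins for vector types (hypothetical, for a self-contained sketch)
sealed trait Vec { def toArray: Array[Double] }
case class Dense(values: Array[Double]) extends Vec {
  def toArray: Array[Double] = values
}

// Label type restricted to vectors, in the spirit of the [T <: Vector] bound
case class GenericLabeledVector[T <: Vec](label: T, vector: Vec)

object SquaredError {
  // Error for a scalar label: squared difference
  def forDouble(prediction: Double, label: Double): Double = {
    val d = prediction - label
    d * d
  }

  // Error for a vector label: sum of squared component differences
  def forVector(prediction: Vec, label: Vec): Double = {
    val p = prediction.toArray
    val l = label.toArray
    require(p.length == l.length, "prediction and label must have equal size")
    p.zip(l).map { case (a, b) => (a - b) * (a - b) }.sum
  }
}
```

The two error functions have different signatures, which is the kind of separate handling an optimizer would need for each label type.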


On Tue, Jan 5, 2016 at 12:17 PM, Chiwan Park <ch...@apache.org> wrote:

> Hi Hilmi,
>
> Thanks for the suggestion about the type of the labeled vector. Basically, I
> agree that your suggestion is reasonable. But I would like to generalize
> `LabeledVector` as in the following example:
>
> ```
> case class LabeledVector[T <: Serializable](label: T, vector: Vector)
> extends Serializable {
>   // some implementations for LabeledVector
> }
> ```
>
> How about this implementation? If there are any other opinions, please
> send an email to the mailing list.
>
> Regards,
> Chiwan Park
>
>
>

Re: LabeledVector with label vector

Posted by Chiwan Park <ch...@apache.org>.
Hi Hilmi,

Thanks for the suggestion about the type of the labeled vector. Basically, I agree that your suggestion is reasonable. But I would like to generalize `LabeledVector` as in the following example:

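```
import org.apache.flink.ml.math.Vector

// Label type generalized from Double to any serializable type T
case class LabeledVector[T <: Serializable](label: T, vector: Vector) extends Serializable {
  // some implementations for LabeledVector
}
```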

How about this implementation? If there are any other opinions, please send an email to the mailing list.
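
A usage sketch of such a generic definition is shown below. It is self-contained, so `FeatureVector` is a hypothetical stand-in for the library's vector type and `GenericLabelUsage` is a made-up name. One point worth checking, as far as the Scala type system is concerned: a primitive scala.Double is an AnyVal and does not itself conform to a Serializable bound, so the sketch uses a boxed java.lang.Double for the scalar case.

```scala
// Stand-in for the library's vector type so the sketch compiles on its own
case class FeatureVector(values: Array[Double])

// The generic definition from above, rewritten against the stand-in vector type
case class LabeledVector[T <: java.io.Serializable](label: T, vector: FeatureVector)

object GenericLabelUsage {
  val features: FeatureVector = FeatureVector(Array(0.5, 1.5))

  // A boxed java.lang.Double satisfies the bound (it implements Serializable)
  val scalarLabeled: LabeledVector[java.lang.Double] =
    LabeledVector(java.lang.Double.valueOf(1.0), features)

  // A vector label also satisfies the bound, because case classes are serializable
  val vectorLabeled: LabeledVector[FeatureVector] =
    LabeledVector(FeatureVector(Array(0.0, 1.0, 0.0)), features)
}
```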

> On Jan 5, 2016, at 7:36 PM, Hilmi Yildirim <Hi...@dfki.de> wrote:
> 
> Hi,
> in the ML-Pipeline of Flink we have the "LabeledVector" class. It consists of a vector and a label as a double value. Unfortunately, it is not applicable for sequence learning where the label is also a vector. For example, in NLP we have a vector of words and the label is a vector of the corresponding labels.
> 
> The optimize function of the "Solver" class takes a DataSet[LabeledVector] as input and, therefore, it is not applicable to sequence learning. I think the LabeledVector should be adapted so that the label is a vector instead of a single Double value. What do you think?
> 
> Best Regards,

Regards,
Chiwan Park