Posted to user@spark.apache.org by Shahab Yunus <sh...@gmail.com> on 2018/04/10 13:01:20 UTC

StringIndexer with high cardinality huge data

Does the StringIndexer
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala>
keep all of the label-to-index mappings in the memory of the driver machine?
It seems to, unless I am missing something.

What if the data to be indexed is huge, the columns to be indexed have high
cardinality (lots of categories), and more than one such column needs to be
indexed? The mappings would then not fit in memory.

Thanks.

Regards,
Shahab

Re: StringIndexer with high cardinality huge data

Posted by Shahab Yunus <sh...@gmail.com>.

Thanks guys.

@Filipp Zhinkin
Yes, we might have a couple of string columns with 15 million+ unique
values that need to be mapped to indices.

@Nick Pentreath
We are on 2.0.2, though I will check it out. Is it better from a
hash-collision perspective, or can it handle large volumes of data as well?

Regards,
Shahab

Re: StringIndexer with high cardinality huge data

Posted by Nick Pentreath <ni...@gmail.com>.

Also check out FeatureHasher in Spark 2.3.0, which is designed to handle
this use case more naturally than HashingTF (and handles multiple columns
at once).
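
To illustrate what hashing several columns at once can look like, here is a
rough sketch in plain Python. It is not Spark's FeatureHasher and does not
use its hash function: the md5-based hashing, the helper names bucket and
hash_row, the sample column names, and the vector size are all assumptions
made for illustration. Each string column contributes one "column=value"
token, and every token is hashed into the same fixed-size sparse vector, so
no per-column dictionary of labels is ever built.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Mapping


def bucket(token: str, num_features: int) -> int:
    """Hash a token to a position in [0, num_features)."""
    # md5 is an assumption chosen for run-to-run stability, not Spark's hash.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def hash_row(row: Mapping[str, str], num_features: int) -> Dict[int, float]:
    """Encode all string columns of one row as a single sparse vector."""
    vector: Dict[int, float] = defaultdict(float)
    for column, value in row.items():
        # Prefixing with the column name keeps equal values in different
        # columns from always landing in the same position.
        vector[bucket(f"{column}={value}", num_features)] += 1.0
    return dict(vector)


# Hypothetical row with two high-cardinality string columns.
row = {"user_id": "user-000001", "device_id": "dev-42"}
print(hash_row(row, 1 << 18))
```

The output size is fixed by num_features regardless of how many distinct
values the columns hold. The cost is that unrelated tokens can collide in
the same position.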

Re: StringIndexer with high cardinality huge data

Posted by Filipp Zhinkin <fi...@gmail.com>.

Hi Shahab,

Do you actually need a few columns with such a huge number of categories,
where each category's index depends on the original value's frequency?

If not, you could use each value's hash code as its category, or combine
all columns into a single vector using HashingTF.
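
A minimal sketch of the hash-code-as-category idea, in plain Python rather
than Spark. The function name hash_index, the md5-based hash, the bucket
count, and the sample values are all assumptions made for illustration. The
point is that the index is computed from the value itself, so nothing has to
be counted, sorted, or kept in memory.

```python
import hashlib


def hash_index(value: str, num_buckets: int) -> int:
    """Map a string to a category index in [0, num_buckets) with no lookup table."""
    # md5 is used only because it gives the same result in every process;
    # Python's built-in hash() is salted per run for strings.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


NUM_BUCKETS = 1 << 24  # hypothetical size, roughly 16.7 million buckets

# The same value always maps to the same index, on any machine.
for user_id in ["user-000001", "user-000002", "user-000001"]:
    print(user_id, hash_index(user_id, NUM_BUCKETS))
```

Unlike a frequency-ordered index, two different values may share an index,
and the index cannot be mapped back to the original label.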

Regards,
Filipp.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org