You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/03/23 14:20:42 UTC

[jira] [Updated] (SPARK-20071) StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values

     [ https://issues.apache.org/jira/browse/SPARK-20071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-20071:
------------------------------
    Issue Type: Improvement  (was: Bug)

Not a bug, right?
You can effect some of this yourself with preprocessing. Yes it's manual but it means you have full flexibility. Putting it in StringIndexer may not actually simplify things.

> StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20071
>                 URL: https://issues.apache.org/jira/browse/SPARK-20071
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>            Priority: Minor
>
> I marked this as minor because there are workarounds.
> I have a 2 million row dataset with a string column that is mostly unique and contains many very long values.
> Most of the values are between 1,000 and 40,0000 characters long.
> I am using Kryoserializer and increased the  spark.kryoserializer.buffer.max to 256m. 
> If I try to run StringIndexer.fit on this column, I will get an OutOfMemory exception or more likely a Buffer overflow error like 
> {code}
> org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. 
> Available: 0, required: 23. 
> To avoid this, increase spark.kryoserializer.buffer.max 
> value.org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315) 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324) 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}
> This result is not that surprising given that we are trying to index a column like this, however, I can think of some suggestions that would help avoid the error and maybe help performance.
> These possible enhancements to StringIndexer might be hard, but I thought I would suggest them anyway, just in case they are not.
> 1) Add param for Top N values. I know that StringIndexer gives lower indices to the more commonly occurring values. It would be great if one could specify that I only want to index the top N values and long everything else into a special "Other" value.
> 2) Add param for label length limit. Only consider the first L characters of labels when doing the indexing.
> Either of these enhancements would work, but I suppose they can also be implemented with additional work as steps preceding the indexer in the pipeline. Perhaps topByKey could be used to replace the column with one that has the top values and "Other" as suggesed in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org