Posted to dev@spark.apache.org by Marc Le Bihan <ml...@gmail.com> on 2020/10/01 08:27:01 UTC

May I report a comparison of executions leading to issues on Spark JIRA?

Hello,

I currently run a Spark project dealing with cities, local authorities,
enterprises, local communities, etc.
Ten Datasets written in Java perform operations ranging from simple joins to
elaborate ones. The 20 integration tests over the whole data (20 GB)
take seven hours.

*Everything works perfectly under Spark 2.4.6 - Scala 2.12 - Java 11 or 8*.
I remember it worked well on Spark 2.4.5 too,
but I had many troubles in the past with Spark 2.4.3 (if I remember well,
often from LZ4 algorithms).

I attempted to run my integration tests on Spark 3.0.1. Many of them
failed, with strange messages:
something about a lambda, or about a Map no longer being taken into account
in a Java Dataset, object, or schema?

I then went back, but to Spark 2.4.7, to give it a try. And Spark 2.4.7 also
has troubles that 2.4.6 didn't have.

My question:


May I create an issue on JIRA based on the comparison of the executions of
my project with different versions of Spark, reporting the error messages
received, the call stacks, and the lines around the one that encountered
a problem when available,
even if I can't provide you a test case for each trouble?
Would this give you hints about what is going wrong?

I could then try a development version if needed (when asked to) to see if
my project returns to stability.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: May I report a comparison of executions leading to issues on Spark JIRA?

Posted by Wenchen Fan <cl...@gmail.com>.
It will speed up the process a lot if a simple code snippet to reproduce
the error is provided.
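For a thread like this, such a snippet can be tiny. The sketch below is only a template (pseudocode; the schema, rows, and failing expression are placeholders, not taken from the reporter's project):

```
// Pseudocode template for a minimal, self-contained reproduction:
// hard-code a few rows, keep the smallest schema that still fails,
// and isolate the single operation that raises the error.
SparkSession spark = SparkSession.builder()
    .master("local[2]").appName("repro").getOrCreate();

Dataset<Row> df = spark.createDataFrame(
    /* two or three hard-coded rows */,
    /* smallest schema that still triggers the problem */);

df.select(/* the single failing expression */).show();
// works on 2.4.6, fails on 3.0.1 with: <paste the exact exception here>
```

Attaching something of this shape to the JIRA issue lets a committer run it directly instead of reconstructing the 20 GB pipeline.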

On Sat, Oct 3, 2020 at 4:40 AM Marc Le Bihan <ml...@gmail.com> wrote:


Re: May I report a comparison of executions leading to issues on Spark JIRA?

Posted by Marc Le Bihan <ml...@gmail.com>.
Yes, as I explained at the beginning of the message.

For the missing com/fasterxml/jackson/module/scala/ScalaObjectMapper,
I will check myself why spark-core and spark-sql became unable to load this
dependency.

But I see nothing in the Spark migration guide from 2.4.6 to 3.0 explaining
the appearance of this message:
org.apache.spark.sql.AnalysisException: *Can't extract value from
lambdavariable(MapObject, StringType, true, 376)*: need struct type but got
string;

Can you give me a hint?





Re: May I report a comparison of executions leading to issues on Spark JIRA?

Posted by Sean Owen <sr...@gmail.com>.
I am not sure what tests you are referring to. Your own? They may indeed
have to be changed to work with Spark 3. All Spark tests pass in Spark 3,
though.

No, until you can clarify, I do not see anything to report in JIRA.

On Fri, Oct 2, 2020, 3:07 PM Marc Le Bihan <ml...@gmail.com> wrote:


Re: May I report a comparison of executions leading to issues on Spark JIRA?

Posted by Marc Le Bihan <ml...@gmail.com>.
A few tests (that work on 2.4.6 and 2.4.7) are failing in 3.0.1.

Some fail with this message: *java.lang.ClassNotFoundException:
com/fasterxml/jackson/module/scala/ScalaObjectMapper*

Coming from:
	at org.apache.spark.sql.catalyst.util.RebaseDateTime.lastSwitchJulianDay(RebaseDateTime.scala)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseDays(VectorizedColumnReader.java:182)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:336)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:239)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)

or
	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaDate(DateTimeUtils.scala:130)
	at org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(DateTimeUtils.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
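Both stacks end in the same ClassNotFoundException. In Spark 3.0, RebaseDateTime loads its Julian/Gregorian rebase tables from a JSON resource through jackson-module-scala, so a missing ScalaObjectMapper at this point usually means the application pulled in a Jackson version that Spark 3.0.1 was not built against. The fragment below is only a sketch of a possible fix, assuming a Maven build; the 2.10.x line is the one Spark 3.0.x targets, but verify the exact versions with `mvn dependency:tree` before adopting it:

```xml
<!-- Sketch only: pin Jackson to the version line Spark 3.0.x is built
     against, so that ScalaObjectMapper (still present in
     jackson-module-scala 2.10.x) stays on the classpath.
     Verify the exact versions for your own build. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.10.0</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.module</groupId>
      <artifactId>jackson-module-scala_2.12</artifactId>
      <version>2.10.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```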


The other ones fail with this one:
org.apache.spark.sql.AnalysisException: *Can't extract value from
lambdavariable(MapObject, StringType, true, 376)*: need struct type but got
string;

These might be hitting a dataset having this schema:

/**
 * Return the Dataset's schema.
 * @return Schema.
 */
public StructType schemaEntreprise() {
   StructType schema = new StructType()
      .add("siren", StringType, false)
      .add("statutDiffusionUniteLegale", StringType, true)
      .add("unitePurgeeUniteLegale", StringType, true)
      .add("dateCreationEntreprise", StringType, true)
      .add("sigle", StringType, true)

      .add("sexe", StringType, true)
      .add("prenom1", StringType, true)
      .add("prenom2", StringType, true)
      .add("prenom3", StringType, true)
      .add("prenom4", StringType, true)

      .add("prenomUsuel", StringType, true)
      .add("pseudonyme", StringType, true)
      .add("rna", StringType, true)
      .add("trancheEffectifsUniteLegale", StringType, true)
      .add("anneeEffectifsUniteLegale", StringType, true)

      .add("dateDernierTraitement", StringType, true)
      .add("nombrePeriodesUniteLegale", StringType, true)
      .add("categorieEntreprise", StringType, true)
      .add("anneeCategorieEntreprise", StringType, true)
      .add("dateDebutHistorisation", StringType, true)

      .add("etatAdministratifUniteLegale", StringType, true)
      .add("nomNaissance", StringType, true)
      .add("nomUsage", StringType, true)
      .add("denominationEntreprise", StringType, true)
      .add("denominationUsuelle1", StringType, true)

      .add("denominationUsuelle2", StringType, true)
      .add("denominationUsuelle3", StringType, true)
      .add("categorieJuridique", StringType, true)
      .add("activitePrincipale", StringType, true)
      .add("nomenclatureActivitePrincipale", StringType, true)

      .add("nicSiege", StringType, true)
      .add("economieSocialeSolidaireUniteLegale", StringType, true)
      .add("caractereEmployeurUniteLegale", StringType, true)

      // Fields created by withColumn.
      .add("purgee", BooleanType, true)
      .add("anneeValiditeEffectifSalarie", IntegerType, true)
      .add("active", BooleanType, true)
      .add("nombrePeriodes", IntegerType, true)
      .add("anneeCategorie", IntegerType, true)

      .add("economieSocialeSolidaire", BooleanType, true)
      .add("caractereEmployeur", BooleanType, true);

   // Link the enterprises Dataset to its establishments.
   MapType mapEtablissements = new MapType(StringType,
      this.datasetEtablissement.schemaEtablissement(), true);
   StructField etablissements = new StructField("etablissements",
      mapEtablissements, true, Metadata.empty());

   // StructType.add(...) returns a new StructType; reassign the result,
   // otherwise these three fields are silently dropped from the schema.
   schema = schema.add(etablissements);
   schema = schema.add("libelleCategorieJuridique", StringType, true);
   schema = schema.add("partition", StringType, true);

   return schema;
}

Are they worth mentioning in an issue (or adding to the description of an
existing issue)?
Do you need me to pursue some analysis, and if so, how?
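Separately from the missing Jackson class itself, the first stack above goes through Spark 3's new Proleptic Gregorian calendar code (RebaseDateTime). The Spark 3.0 migration guide covers this change; when Parquet files written by Spark 2.x contain old dates or timestamps, the legacy behavior can be restored with settings like the following (a sketch for spark-defaults.conf; check the migration guide for the exact keys applicable to 3.0.1):

```properties
# Hedged sketch: read/write Parquet dates/timestamps using the legacy
# hybrid Julian+Gregorian calendar, as Spark 2.4 did.
spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY
spark.sql.legacy.parquet.datetimeRebaseModeInWrite=LEGACY
```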





Re: May I report a comparison of executions leading to issues on Spark JIRA?

Posted by Sean Owen <sr...@gmail.com>.
Yes indeed; in fact, you seem to be describing Spark 2-to-3 changes that are
already documented in the Spark 3 migration guide.

On Thu, Oct 1, 2020 at 7:08 AM Russell Spitzer <ru...@gmail.com>
wrote:


Re: May I report a comparison of executions leading to issues on Spark JIRA?

Posted by Russell Spitzer <ru...@gmail.com>.
You are always welcome to create a jira or jiras, but you may find you get
a faster response by asking about your issues on the mailing list first.

That may help in identifying whether your issues are already logged or not,
or whether there is a solution that can be applied right away.


On Thu, Oct 1, 2020, 3:27 AM Marc Le Bihan <ml...@gmail.com> wrote:
