Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/06/07 09:33:18 UTC
[jira] [Updated] (SPARK-20491) Synonym handling replacement issue in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-20491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-20491:
------------------------------
Target Version/s: (was: 2.0.2)
> Synonym handling replacement issue in Apache Spark
> --------------------------------------------------
>
> Key: SPARK-20491
> URL: https://issues.apache.org/jira/browse/SPARK-20491
> Project: Spark
> Issue Type: Question
> Components: Examples, ML
> Affects Versions: 2.0.2
> Environment: Eclipse LUNA, Spring Boot
> Reporter: Nishanth J
> Labels: maven
>
> I am facing a major issue with replacing synonyms in my Dataset.
> I am trying to replace synonyms of brand names with their canonical names.
> I have tried two methods to solve this issue.
> Method 1 (regexp_replace)
> Here I am using the regexp_replace method.
> Hashtable<String, String> manufacturerNames = new Hashtable<>();
> Enumeration<String> names;
> String str;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool Group");
> /*....100 MORE....*/
> manufacturerNames.put("Stanco","Stanco Mfg");
> manufacturerNames.put("Stanco","Stanco Mfg");
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> // Apply one regexp_replace per synonym key in the hash table.
> names = manufacturerNames.keys();
> Dataset<Row> dataFileContent = sqlContext.load("com.databricks.spark.csv", options);
> while(names.hasMoreElements()) {
> str = (String) names.nextElement();
> dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> I learned that the amount of data is too large for this chained regexp_replace approach, and was advised to use a UDF instead:
> http://stackoverflow.com/questions/43413513/issue-in-regex-replace-in-apache-spark-java
> Method 2 (UDF)
> List<Row> data2 = Arrays.asList(
> RowFactory.create("Allen", "Apex Tool Group"),
> RowFactory.create("Armstrong","Apex Tool Group"),
> RowFactory.create("DeWALT","StanleyBlack")
> );
> StructType schema2 = new StructType(new StructField[] {
> new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
> new StructField("sentence2", DataTypes.StringType, false, Metadata.empty())
> });
> Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
> UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
> private static final long serialVersionUID = -5239951370238629896L;
> @Override
> public Boolean call(String t1, String t2) throws Exception {
> return t1.contains(t2);
> }
> };
> spark.udf().register("contains", contains, DataTypes.BooleanType);
> UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, String, String, String>() {
> private static final long serialVersionUID = -2882956931420910207L;
> @Override
> public String call(String t1, String t2, String t3) throws Exception {
> return t1.replaceAll(t2, t3);
> }
> };
> spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType);
> Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2, callUDF("contains", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2")))
> .withColumn("sentence_replaced", callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
> .select(col("sentence_replaced"));
> joined.show(false);
> }
> I got this output when multiple replacements are needed in a single row.
> Input-
> Allen Armstrong jeevi pramod Allen
> sandesh Armstrong jeevi
> harsha nischay DeWALT
> Output-
> Apex Tool Group Armstrong jeevi pramod Apex Tool Group
> Allen Apex Tool Group jeevi pramod Allen
> sandesh Apex Tool Group jeevi
> harsha nischay StanleyBlack
> Expected Output-
> Apex Tool Group Apex Tool Group jeevi pramod Apex Tool Group
> sandesh Apex Tool Group jeevi
> harsha nischay StanleyBlack
> Is there another method that should be followed to get the proper output? Or is this a limitation of UDFs?
> Kindly help us with this issue.
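
A note on the output above: the join in Method 2 emits one row per (sentence, matching label) pair, so a sentence containing both "Allen" and "Armstrong" yields two rows, each with only one synonym replaced. One possible way around this is to apply every replacement to a sentence in a single pass, with no join. The following is a minimal plain-Java sketch of that idea, not a tested Spark solution. The class name SynonymReplacer and its methods are hypothetical, and matching is by plain substring (no word boundaries), as in the original code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Spark): replaces every known synonym
// in a string in one pass over the input.
public class SynonymReplacer {
    private final Map<String, String> synonyms;
    private final Pattern pattern; // null when there are no synonyms

    public SynonymReplacer(Map<String, String> synonyms) {
        this.synonyms = new LinkedHashMap<>(synonyms);
        List<String> keys = new ArrayList<>(this.synonyms.keySet());
        // Longest keys first, so a longer key wins over a shorter key
        // that is a prefix of it.
        keys.sort(Comparator.comparingInt(String::length).reversed());
        StringBuilder alternation = new StringBuilder();
        for (String key : keys) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote each key so characters such as '.' in "H.K. Porter"
            // are matched literally.
            alternation.append(Pattern.quote(key));
        }
        this.pattern = keys.isEmpty() ? null : Pattern.compile(alternation.toString());
    }

    public String replaceAll(String input) {
        if (pattern == null || input == null) {
            return input;
        }
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Look up the canonical name for the synonym that matched.
            String canonical = synonyms.get(m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(canonical));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> names = new LinkedHashMap<>();
        names.put("Allen", "Apex Tool Group");
        names.put("Armstrong", "Apex Tool Group");
        names.put("DeWALT", "StanleyBlack");
        SynonymReplacer replacer = new SynonymReplacer(names);
        System.out.println(replacer.replaceAll("Allen Armstrong jeevi pramod Allen"));
        System.out.println(replacer.replaceAll("sandesh Armstrong jeevi"));
        System.out.println(replacer.replaceAll("harsha nischay DeWALT"));
    }
}
```

In Spark, the same method could presumably be called from a single UDF1<String, String> registered with spark.udf().register, so that each row is transformed exactly once. That wiring is an assumption and is not shown here; the helper would also need to be made Serializable to be captured by a UDF.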
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)