Posted to user@spark.apache.org by Philippe de Rochambeau <ph...@free.fr> on 2023/04/01 18:30:46 UTC

Looping through a series of telephone numbers

Hello,
I’m looking for an efficient way in Spark to search for a series of telephone numbers, contained in a CSV file, in a data set column.

In pseudo code,

for tel in [tel1, tel2, …. tel40,000] 
	search for tel in dataset using .like(« %tel% »)
end for 

I’m using the like function because the telephone numbers in the data set may contain prefixes, such as « + »; e.g., « +3312224444 ».

Any suggestions would be welcome.

Many thanks.

Philippe





---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Looping through a series of telephone numbers

Posted by Bjørn Jørgensen <bj...@gmail.com>.
dataset.csv
id,tel_in_dataset
1,+3311111111
2,+3312224444
3,+3313333333
4,+3312225555
5,+3312226666
6,+3314444444
7,+3312227777
8,+3315555555

telephone_numbers.csv
tel
+3312224444
+3312225555
+3312226666
+3312227777



Start Spark with all of your CPU cores and RAM:

import os
import multiprocessing
from pyspark import SparkConf, SparkContext
from pyspark import pandas as ps
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, concat_ws, expr, lit, trim, regexp_replace
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

number_cores = int(multiprocessing.cpu_count())

mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")  # e.g. 4015976448
memory_gb = int(mem_bytes / (1024.0**3))  # e.g. 3.74


def get_spark_session(app_name: str, conf: SparkConf):
    conf.setMaster("local[{}]".format(number_cores))
    conf.set("spark.driver.memory", "{}g".format(memory_gb)).set(
        "spark.sql.adaptive.enabled", "True"
    ).set(
        "spark.serializer", "org.apache.spark.serializer.KryoSerializer"
    ).set(
        "spark.sql.repl.eagerEval.maxNumRows", "10000"
    )
    # note: "sc.setLogLevel" is not a Spark conf key; the log level is set
    # on the SparkContext after the session is created (see below)
    return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()


spark = get_spark_session("My app", SparkConf())
spark.sparkContext.setLogLevel("ERROR")




# pandas API on Spark
tel_df = ps.read_csv("telephone_numbers.csv")

tel_df['tel'] = tel_df['tel'].astype(str)
tel_df['cleaned_tel'] = tel_df['tel'].str.replace('+', '', regex=False)

dataset_df = ps.read_csv("dataset.csv")
dataset_df['tel_in_dataset'] = dataset_df['tel_in_dataset'].astype(str)

dataset_df['cleaned_tel_in_dataset'] = dataset_df['tel_in_dataset'].str.replace('+', '', regex=False)

filtered_df = dataset_df[dataset_df['cleaned_tel_in_dataset'].isin(tel_df['cleaned_tel'].to_list())]

filtered_df.head()


   id tel_in_dataset cleaned_tel_in_dataset
1   2    +3312224444             3312224444
3   4    +3312225555             3312225555
4   5    +3312226666             3312226666
6   7    +3312227777             3312227777


# PySpark
tel_df = spark.read.csv("telephone_numbers.csv", header=True)
tel_df = tel_df.withColumn("cleaned_tel", regexp_replace(col("tel"), "\\+", ""))

dataset_df = spark.read.csv("dataset.csv", header=True)
dataset_df = dataset_df.withColumn("cleaned_tel_in_dataset", regexp_replace(col("tel_in_dataset"), "\\+", ""))

filtered_df = dataset_df.where(col("cleaned_tel_in_dataset").isin([row.cleaned_tel for row in tel_df.collect()]))

filtered_df.show()


+---+--------------+----------------------+
| id|tel_in_dataset|cleaned_tel_in_dataset|
+---+--------------+----------------------+
|  2|   +3312224444|            3312224444|
|  4|   +3312225555|            3312225555|
|  5|   +3312226666|            3312226666|
|  7|   +3312227777|            3312227777|
+---+--------------+----------------------+
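
If collecting 40,000 numbers to the driver for isin() ever becomes a
concern, the same filter can be expressed as a join so the keys stay
distributed. A minimal sketch, reusing the tel_df and dataset_df defined
above (the broadcast hint is my addition, not required):

from pyspark.sql.functions import broadcast

# join on the cleaned column instead of materialising a Python list;
# broadcast() ships the small phone list to every executor
tel_keys = tel_df.select("cleaned_tel")
filtered_df = dataset_df.join(
    broadcast(tel_keys),
    dataset_df["cleaned_tel_in_dataset"] == tel_keys["cleaned_tel"],
    "inner",
).drop("cleaned_tel")
filtered_df.show()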




On Sun, 2 Apr 2023 at 18:18, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> Hi Phillipe,
>
> These are my thoughts besides comments from Sean
>
> Just to clarify, you receive a CSV file periodically and you already have
> a file that contains valid patterns for phone numbers (reference)
>
> In a pseudo language you can probe your csv DF against the reference DF
>
> // load your reference dataframe
> val reference_DF = sqlContext.parquetFile("path")
> // mark this smaller dataframe to be stored in memory
> reference_DF.cache()
>
> //Create a temp table
>
> reference_DF.createOrReplaceTempView("reference")
>
> // Do the same on the CSV, change the line below
>
> val csvDF = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "false").load("path")
>
> csvDF.cache()  // This may or may not work if the CSV is large, but it is worth trying
>
> csvDF.createOrReplaceTempView("csv")
>
> sqlContext.sql("JOIN Query").show
>
> If you prefer to broadcast the reference data, you must first collect it on the driver before you broadcast it. This requires that your RDD fits in memory on your driver (and executors).
>
> You can then play around with that join.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 2 Apr 2023 at 09:17, Philippe de Rochambeau <ph...@free.fr>
> wrote:
>
>> Many thanks, Mich.
>> Is « foreach » the best construct to look up items in a dataset such as
>> the below «  telephonedirectory » data set?
>>
>> val telrdd = spark.sparkContext.parallelize(Seq(«  tel1 » , «  tel2 » , «  tel3 » …)) // the telephone sequence
>>
>> // was read from a CSV file
>>
>> val ds = spark.read.parquet(«  /path/to/telephonedirectory » )
>>
>>   rdd .foreach(tel => {
>>     longAcc.select(«  * » ).rlike(«  + »  + tel)
>>   })
>>
>>
>>
>>
>> Le 1 avr. 2023 à 22:36, Mich Talebzadeh <mi...@gmail.com> a
>> écrit :
>>
>> This may help
>>
>> Spark rlike() Working with Regex Matching Example
>> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <ph...@free.fr>
>> wrote:
>>
>>> Hello,
>>> I’m looking for an efficient way in Spark to search for a series of
>>> telephone numbers, contained in a CSV file, in a data set column.
>>>
>>> In pseudo code,
>>>
>>> for tel in [tel1, tel2, …. tel40,000]
>>>         search for tel in dataset using .like(« %tel% »)
>>> end for
>>>
>>> I’m using the like function because the telephone numbers in the data
>>> set may contain prefixes, such as « + »; e.g., « +3312224444 ».
>>>
>>> Any suggestions would be welcome.
>>>
>>> Many thanks.
>>>
>>> Philippe
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Looping through a series of telephone numbers

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Philippe,

Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks. They
can be used, for example, to give every node a copy of a large input
dataset in an efficient manner. Spark also attempts to distribute broadcast
variables using efficient broadcast algorithms to reduce communication cost.

If you have enough memory, the smaller table is cached in the driver and
distributed to every node of the cluster, reducing data shuffling across
the network; check this link

https://sparkbyexamples.com/spark/broadcast-join-in-spark/#:~:text=Broadcast%20join%20is%20an%20optimization,always%20collected%20at%20the%20driver
.
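
As a rough PySpark sketch of that broadcast join, reusing the sample file
names from Bjørn's reply (the column names are assumptions taken from
those files, not your actual schema):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col, regexp_replace

spark = SparkSession.builder.getOrCreate()

# small side: the reference list of phone numbers
ref_df = spark.read.csv("telephone_numbers.csv", header=True)
ref_df = ref_df.withColumn("cleaned_tel", regexp_replace(col("tel"), "\\+", ""))

# big side: the periodic CSV
csv_df = spark.read.csv("dataset.csv", header=True)
csv_df = csv_df.withColumn("cleaned_tel", regexp_replace(col("tel_in_dataset"), "\\+", ""))

# broadcast() marks the reference table to be copied to every node,
# so the join needs no shuffle of the large side
csv_df.join(broadcast(ref_df.select("cleaned_tel")), "cleaned_tel").show()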

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 2 Apr 2023 at 20:05, Philippe de Rochambeau <ph...@free.fr> wrote:

> Hi Mich,
> what exactly do you mean by « if you prefer to broadcast the reference
> data »?
> Philippe
>
> Le 2 avr. 2023 à 18:16, Mich Talebzadeh <mi...@gmail.com> a
> écrit :
>
> Hi Phillipe,
>
> These are my thoughts besides comments from Sean
>
> Just to clarify, you receive a CSV file periodically and you already have
> a file that contains valid patterns for phone numbers (reference)
>
> In a pseudo language you can probe your csv DF against the reference DF
>
> // load your reference dataframe
> val reference_DF = sqlContext.parquetFile("path")
> // mark this smaller dataframe to be stored in memory
> reference_DF.cache()
>
> //Create a temp table
>
> reference_DF.createOrReplaceTempView("reference")
>
> // Do the same on the CSV, change the line below
>
> val csvDF = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "false").load("path")
>
> csvDF.cache()  // This may or may not work if the CSV is large, but it is worth trying
>
> csvDF.createOrReplaceTempView("csv")
>
> sqlContext.sql("JOIN Query").show
>
> If you prefer to broadcast the reference data, you must first collect it on the driver before you broadcast it. This requires that your RDD fits in memory on your driver (and executors).
>
> You can then play around with that join.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 2 Apr 2023 at 09:17, Philippe de Rochambeau <ph...@free.fr>
> wrote:
>
>> Many thanks, Mich.
>> Is « foreach » the best construct to look up items in a dataset such as
>> the below «  telephonedirectory » data set?
>>
>> val telrdd = spark.sparkContext.parallelize(Seq(«  tel1 » , «  tel2 » , «  tel3 » …)) // the telephone sequence
>>
>> // was read from a CSV file
>>
>> val ds = spark.read.parquet(«  /path/to/telephonedirectory » )
>>
>>   rdd .foreach(tel => {
>>     longAcc.select(«  * » ).rlike(«  + »  + tel)
>>   })
>>
>>
>>
>>
>> Le 1 avr. 2023 à 22:36, Mich Talebzadeh <mi...@gmail.com> a
>> écrit :
>>
>> This may help
>>
>> Spark rlike() Working with Regex Matching Example
>> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <ph...@free.fr>
>> wrote:
>>
>>> Hello,
>>> I’m looking for an efficient way in Spark to search for a series of
>>> telephone numbers, contained in a CSV file, in a data set column.
>>>
>>> In pseudo code,
>>>
>>> for tel in [tel1, tel2, …. tel40,000]
>>>         search for tel in dataset using .like(« %tel% »)
>>> end for
>>>
>>> I’m using the like function because the telephone numbers in the data
>>> set may contain prefixes, such as « + »; e.g., « +3312224444 ».
>>>
>>> Any suggestions would be welcome.
>>>
>>> Many thanks.
>>>
>>> Philippe
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>>
>

Re: Looping through a series of telephone numbers

Posted by Philippe de Rochambeau <ph...@free.fr>.
Hi Mich,
what exactly do you mean by « if you prefer to broadcast the reference data »?
Philippe

> Le 2 avr. 2023 à 18:16, Mich Talebzadeh <mi...@gmail.com> a écrit :
> 
> Hi Phillipe,
> 
> These are my thoughts besides comments from Sean
> 
> Just to clarify, you receive a CSV file periodically and you already have a file that contains valid patterns for phone numbers (reference)
> 
> In a pseudo language you can probe your csv DF against the reference DF
> 
> // load your reference dataframe
> val reference_DF=sqlContext.parquetFile("path")
> 
> // mark this smaller dataframe to be stored in memory
> reference_DF.cache()
> //Create a temp table
> reference_DF.createOrReplaceTempView("reference")
> // Do the same on the CSV, change the line below
> val csvDF = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "false").load("path")
> csvDF.cache()  // This may or may not work if the CSV is large, but it is worth trying
> csvDF.createOrReplaceTempView("csv")
> sqlContext.sql("JOIN Query").show
> If you prefer to broadcast the reference data, you must first collect it on the driver before you broadcast it. This requires that your RDD fits in memory on your driver (and executors).
> 
> You can then play around with that join.
> HTH
> 
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> 
>    view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> 
> On Sun, 2 Apr 2023 at 09:17, Philippe de Rochambeau <phiroc@free.fr <ma...@free.fr>> wrote:
>> Many thanks, Mich.
>> Is « foreach » the best construct to look up items in a dataset such as the below « telephonedirectory » data set?
>> 
>> val telrdd = spark.sparkContext.parallelize(Seq(«  tel1 » , «  tel2 » , «  tel3 » …)) // the telephone sequence
>> // was read from a CSV file
>> val ds = spark.read.parquet(«  /path/to/telephonedirectory » )
>>   
>>   rdd .foreach(tel => {
>>     longAcc.select(«  * » ).rlike(«  + »  + tel)
>>   })
>> 
>> 
>> 
>>> Le 1 avr. 2023 à 22:36, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> a écrit :
>>> 
>>> This may help
>>> 
>>> Spark rlike() Working with Regex Matching Example <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> 
>>>    view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>> 
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>>  
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>  
>>> 
>>> 
>>> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <phiroc@free.fr <ma...@free.fr>> wrote:
>>>> Hello,
>>>> I’m looking for an efficient way in Spark to search for a series of telephone numbers, contained in a CSV file, in a data set column.
>>>> 
>>>> In pseudo code,
>>>> 
>>>> for tel in [tel1, tel2, …. tel40,000] 
>>>>         search for tel in dataset using .like(« %tel% »)
>>>> end for 
>>>> 
>>>> I’m using the like function because the telephone numbers in the data set may contain prefixes, such as « + »; e.g., « +3312224444 ».
>>>> 
>>>> Any suggestions would be welcome.
>>>> 
>>>> Many thanks.
>>>> 
>>>> Philippe
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>>> 
>> 


Re: Looping through a series of telephone numbers

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Phillipe,

These are my thoughts besides comments from Sean

Just to clarify, you receive a CSV file periodically and you already have a
file that contains valid patterns for phone numbers (reference)

In a pseudo language you can probe your csv DF against the reference DF

// load your reference dataframe
val reference_DF = sqlContext.parquetFile("path")
// mark this smaller dataframe to be stored in memory
reference_DF.cache()

//Create a temp table

reference_DF.createOrReplaceTempView("reference")

// Do the same on the CSV, change the line below

val csvDF = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "false").load("path")

csvDF.cache()  // This may or may not work if the CSV is large, but it is worth trying

csvDF.createOrReplaceTempView("csv")

sqlContext.sql("JOIN Query").show

If you prefer to broadcast the reference data, you must first collect
it on the driver before you broadcast it. This requires that your RDD
fits in memory on your driver (and executors).

You can then play around with that join.
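
As a concrete sketch of the placeholder "JOIN Query" (written in PySpark
for consistency with Bjørn's examples; "csv" and "reference" are the temp
views registered above, and the column names are borrowed from the sample
CSVs in this thread, not your actual schema):

spark.sql("""
    SELECT csv.*
    FROM csv
    JOIN reference
      ON regexp_replace(csv.tel_in_dataset, '[+]', '') =
         regexp_replace(reference.tel, '[+]', '')
""").show()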

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 2 Apr 2023 at 09:17, Philippe de Rochambeau <ph...@free.fr> wrote:

> Many thanks, Mich.
> Is « foreach » the best construct to look up items in a dataset such as
> the below «  telephonedirectory » data set?
>
> val telrdd = spark.sparkContext.parallelize(Seq(«  tel1 » , «  tel2 » , «  tel3 » …)) // the telephone sequence
>
> // was read from a CSV file
>
> val ds = spark.read.parquet(«  /path/to/telephonedirectory » )
>
>   rdd .foreach(tel => {
>     longAcc.select(«  * » ).rlike(«  + »  + tel)
>   })
>
>
>
>
> Le 1 avr. 2023 à 22:36, Mich Talebzadeh <mi...@gmail.com> a
> écrit :
>
> This may help
>
> Spark rlike() Working with Regex Matching Example
> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <ph...@free.fr>
> wrote:
>
>> Hello,
>> I’m looking for an efficient way in Spark to search for a series of
>> telephone numbers, contained in a CSV file, in a data set column.
>>
>> In pseudo code,
>>
>> for tel in [tel1, tel2, …. tel40,000]
>>         search for tel in dataset using .like(« %tel% »)
>> end for
>>
>> I’m using the like function because the telephone numbers in the data set
>> may contain prefixes, such as « + »; e.g., « +3312224444 ».
>>
>> Any suggestions would be welcome.
>>
>> Many thanks.
>>
>> Philippe
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>

Re: Looping through a series of telephone numbers

Posted by Gera Shegalov <ge...@gmail.com>.
+1 to using a UDF.  E.g., TransmogrifAI
<https://github.com/salesforce/TransmogrifAI/blob/ef6d3267cf4379a0805d6add400d7b0e328e4aa1/core/src/main/scala/com/salesforce/op/stages/impl/feature/PhoneNumberParser.scala#L274>
uses libphonenumber <https://github.com/google/libphonenumber>, which
normalizes phone numbers to a (country code, national number) tuple, so you
can find more sophisticated matches for phones written in different
notations.

If you simplify it for DataFrame/SQL-only use, you can create a Scala UDF:

$SPARK_HOME/bin/spark-shell --packages com.googlecode.libphonenumber:libphonenumber:8.13.9
scala> :paste
// Entering paste mode (ctrl-D to finish)

import com.google.i18n.phonenumbers._
import scala.collection.JavaConverters._
val extractPhonesUDF = udf((x: String) =>
  PhoneNumberUtil.getInstance()
    .findNumbers(x, "US").asScala.toSeq
    .map(x => (x.number.getCountryCode, x.number.getNationalNumber)))
spark.udf.register("EXTRACT_PHONES", extractPhonesUDF)
sql("""
SELECT
  EXTRACT_PHONES('+496811234567,+1(415)7654321') AS needles,
  EXTRACT_PHONES('Call our HQ in Germany at (+49) 0681/1234567, in Paris at: +33 01 12 34 56 78, or the SF office at 415-765-4321') AS haystack,
  ARRAY_INTERSECT(needles, haystack) AS needles_in_haystack
""").show(truncate=false)

// Exiting paste mode, now interpreting.

+-----------------------------------+----------------------------------------------------+-----------------------------------+
|needles                            |haystack                                            |needles_in_haystack                |
+-----------------------------------+----------------------------------------------------+-----------------------------------+
|[{49, 6811234567}, {1, 4157654321}]|[{49, 6811234567}, {33, 112345678}, {1, 4157654321}]|[{49, 6811234567}, {1, 4157654321}]|
+-----------------------------------+----------------------------------------------------+-----------------------------------+

On Sun, Apr 2, 2023 at 7:18 AM Sean Owen <sr...@gmail.com> wrote:

> That won't work, you can't use Spark within Spark like that.
> If it were exact matches, the best solution would be to load both datasets
> and join on telephone number.
> For this case, I think your best bet is a UDF that contains the telephone
> numbers as a list and decides whether a given number matches something in
> the set. Then use that to filter, then work with the data set.
> There are probably clever fast ways of efficiently determining if a string
> is a prefix of a group of strings in Python you could use too.
>
> On Sun, Apr 2, 2023 at 3:17 AM Philippe de Rochambeau <ph...@free.fr>
> wrote:
>
>> Many thanks, Mich.
>> Is « foreach » the best construct to look up items in a dataset such as
>> the below «  telephonedirectory » data set?
>>
>> val telrdd = spark.sparkContext.parallelize(Seq(«  tel1 » , «  tel2 » , «  tel3 » …)) // the telephone sequence
>>
>> // was read from a CSV file
>>
>> val ds = spark.read.parquet(«  /path/to/telephonedirectory » )
>>
>>   rdd .foreach(tel => {
>>     longAcc.select(«  * » ).rlike(«  + »  + tel)
>>   })
>>
>>
>>
>>
>> Le 1 avr. 2023 à 22:36, Mich Talebzadeh <mi...@gmail.com> a
>> écrit :
>>
>> This may help
>>
>> Spark rlike() Working with Regex Matching Example
>> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <ph...@free.fr>
>> wrote:
>>
>>> Hello,
>>> I’m looking for an efficient way in Spark to search for a series of
>>> telephone numbers, contained in a CSV file, in a data set column.
>>>
>>> In pseudo code,
>>>
>>> for tel in [tel1, tel2, …. tel40,000]
>>>         search for tel in dataset using .like(« %tel% »)
>>> end for
>>>
>>> I’m using the like function because the telephone numbers in the data
>>> set may contain prefixes, such as « + »; e.g., « +3312224444 ».
>>>
>>> Any suggestions would be welcome.
>>>
>>> Many thanks.
>>>
>>> Philippe
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>>

Re: Looping through a series of telephone numbers

Posted by Sean Owen <sr...@gmail.com>.
That won't work, you can't use Spark within Spark like that.
If it were exact matches, the best solution would be to load both datasets
and join on telephone number.
For this case, I think your best bet is a UDF that contains the telephone
numbers as a list and decides whether a given number matches something in
the set. Then use that to filter, then work with the data set.
There are probably clever fast ways of efficiently determining if a string
is a prefix of a group of strings in Python you could use too.
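
A minimal PySpark sketch of that idea (the file and column names are taken
from the samples earlier in the thread, so treat them as placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# load the ~40,000 reference numbers once and broadcast them as a set
tels = {row.tel.lstrip("+")
        for row in spark.read.csv("telephone_numbers.csv", header=True).collect()}
tels_bc = spark.sparkContext.broadcast(tels)

# normalise away a possible "+" prefix, then do an O(1) set lookup
# per row instead of 40,000 LIKE scans
@udf(returnType=BooleanType())
def matches_tel(value):
    return value is not None and value.lstrip("+") in tels_bc.value

dataset_df = spark.read.csv("dataset.csv", header=True)
dataset_df.filter(matches_tel("tel_in_dataset")).show()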

On Sun, Apr 2, 2023 at 3:17 AM Philippe de Rochambeau <ph...@free.fr>
wrote:

> Many thanks, Mich.
> Is « foreach » the best construct to look up items in a dataset such as
> the below «  telephonedirectory » data set?
>
> val telrdd = spark.sparkContext.parallelize(Seq(«  tel1 » , «  tel2 » , «  tel3 » …)) // the telephone sequence
>
> // was read from a CSV file
>
> val ds = spark.read.parquet(«  /path/to/telephonedirectory » )
>
>   rdd .foreach(tel => {
>     longAcc.select(«  * » ).rlike(«  + »  + tel)
>   })
>
>
>
>
> Le 1 avr. 2023 à 22:36, Mich Talebzadeh <mi...@gmail.com> a
> écrit :
>
> This may help
>
> Spark rlike() Working with Regex Matching Example
> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <ph...@free.fr>
> wrote:
>
>> Hello,
>> I’m looking for an efficient way in Spark to search for a series of
>> telephone numbers, contained in a CSV file, in a data set column.
>>
>> In pseudo code,
>>
>> for tel in [tel1, tel2, …. tel40,000]
>>         search for tel in dataset using .like(« %tel% »)
>> end for
>>
>> I’m using the like function because the telephone numbers in the data set
>> may contain prefixes, such as « + »; e.g., « +3312224444 ».
>>
>> Any suggestions would be welcome.
>>
>> Many thanks.
>>
>> Philippe
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>

Re: Looping through a series of telephone numbers

Posted by Philippe de Rochambeau <ph...@free.fr>.
Many thanks, Mich.
Is « foreach » the best construct to look up items in a dataset such as the below « telephonedirectory » data set?

val telrdd = spark.sparkContext.parallelize(Seq(«  tel1 » , «  tel2 » , «  tel3 » …)) // the telephone sequence
// was read from a CSV file
val ds = spark.read.parquet(«  /path/to/telephonedirectory » )
  
  rdd .foreach(tel => {
    longAcc.select(«  * » ).rlike(«  + »  + tel)
  })



> Le 1 avr. 2023 à 22:36, Mich Talebzadeh <mi...@gmail.com> a écrit :
> 
> This may help
> 
> Spark rlike() Working with Regex Matching Example <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> 
>    view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> 
> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <phiroc@free.fr <ma...@free.fr>> wrote:
>> Hello,
>> I’m looking for an efficient way in Spark to search for a series of telephone numbers, contained in a CSV file, in a data set column.
>> 
>> In pseudo code,
>> 
>> for tel in [tel1, tel2, …. tel40,000] 
>>         search for tel in dataset using .like(« %tel% »)
>> end for 
>> 
>> I’m using the like function because the telephone numbers in the data set may contain prefixes, such as « + »; e.g., « +3312224444 ».
>> 
>> Any suggestions would be welcome.
>> 
>> Many thanks.
>> 
>> Philippe
>> 
>> 
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>> 


Re: Looping through a series of telephone numbers

Posted by Mich Talebzadeh <mi...@gmail.com>.
This may help

Spark rlike() Working with Regex Matching Example
<https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <ph...@free.fr> wrote:

> Hello,
> I’m looking for an efficient way in Spark to search for a series of
> telephone numbers, contained in a CSV file, in a data set column.
>
> In pseudo code,
>
> for tel in [tel1, tel2, …. tel40,000]
>         search for tel in dataset using .like(« %tel% »)
> end for
>
> I’m using the like function because the telephone numbers in the data set
> may contain prefixes, such as « + »; e.g., « +3312224444 ».
>
> Any suggestions would be welcome.
>
> Many thanks.
>
> Philippe
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: Looping through a series of telephone numbers

Posted by Philippe de Rochambeau <ph...@free.fr>.
Wow, you guys, Anastasios, Bjørn and Mich, are stars!
Thank you very much for your suggestions. I’m going to print them and study them closely.


> Le 2 avr. 2023 à 20:05, Anastasios Zouzias <zo...@gmail.com> a écrit :
> 
> Hi Philippe,
> 
> I would like to draw your attention to this great library that saved my day in the past when parsing phone numbers in Spark: 
> 
> https://github.com/google/libphonenumber
> 
> If you combine it with Bjørn's suggestions you will have a good start on your linkage task.
> 
> Best regards,
> Anastasios Zouzias
> 
> 
> On Sat, Apr 1, 2023 at 8:31 PM Philippe de Rochambeau <phiroc@free.fr <ma...@free.fr>> wrote:
>> Hello,
>> I’m looking for an efficient way in Spark to search for a series of telephone numbers, contained in a CSV file, in a data set column.
>> 
>> In pseudo code,
>> 
>> for tel in [tel1, tel2, …. tel40,000] 
>>         search for tel in dataset using .like(« %tel% »)
>> end for 
>> 
>> I’m using the like function because the telephone numbers in the data set may contain prefixes, such as « + »; e.g., « +3312224444 ».
>> 
>> Any suggestions would be welcome.
>> 
>> Many thanks.
>> 
>> Philippe
>> 
>> 
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>> 
> 
> 
> -- 
> -- Anastasios Zouzias
>  <ma...@zurich.ibm.com>

Re: Looping through a series of telephone numbers

Posted by Anastasios Zouzias <zo...@gmail.com>.
Hi Philippe,

I would like to draw your attention to this great library that saved my day
in the past when parsing phone numbers in Spark:

https://github.com/google/libphonenumber

If you combine it with Bjørn's suggestions you will have a good start on
your linkage task.
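
For a Python pipeline there is also the phonenumbers port of the same
library (pip install phonenumbers). A small sketch; the "FR" region hint
is my assumption based on the +33 examples in this thread:

import phonenumbers

def normalise(tel, region="FR"):
    # parse to a (country code, national number) pair; returns None
    # for strings the library cannot interpret as a phone number
    try:
        p = phonenumbers.parse(tel, region)
        return (p.country_code, p.national_number)
    except phonenumbers.NumberParseException:
        return None

Two numbers written in different notations then normalise to the same
pair, which turns the fuzzy LIKE scan into an exact join or lookup.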

Best regards,
Anastasios Zouzias


On Sat, Apr 1, 2023 at 8:31 PM Philippe de Rochambeau <ph...@free.fr>
wrote:

> Hello,
> I’m looking for an efficient way in Spark to search for a series of
> telephone numbers, contained in a CSV file, in a data set column.
>
> In pseudo code,
>
> for tel in [tel1, tel2, …. tel40,000]
>         search for tel in dataset using .like(« %tel% »)
> end for
>
> I’m using the like function because the telephone numbers in the data set
> may contain prefixes, such as « + »; e.g., « +3312224444 ».
>
> Any suggestions would be welcome.
>
> Many thanks.
>
> Philippe
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

-- 
-- Anastasios Zouzias
<az...@zurich.ibm.com>