Posted to user@spark.apache.org by Oliver Ruebenacker <ol...@broadinstitute.org> on 2023/01/17 22:18:11 UTC
[PySPark] How to check if value of one column is in array of another column
Hello,
I have data originally stored as JSON. Column gene contains a string,
column nearest an array of strings. How can I check whether the value of
gene is an element of the array of nearest?
I tried: genes_joined.gene.isin(genes_joined.nearest)
But I get an error that says:
pyspark.sql.utils.AnalysisException: cannot resolve '(gene IN (nearest))'
due to data type mismatch: Arguments must be same type but were: string !=
array<string>;
How do I do this? Thanks!
Best, Oliver
--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
Re: [PySPark] How to check if value of one column is in array of another column
Posted by Oliver Ruebenacker <ol...@broadinstitute.org>.
Awesome, thanks, this was exactly what I needed!
On Tue, Jan 17, 2023 at 5:23 PM Sean Owen <sr...@gmail.com> wrote:
> I think you want array_contains:
>
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html
Re: [PySPark] How to check if value of one column is in array of another column
Posted by Sean Owen <sr...@gmail.com>.
I think you want array_contains:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html