Posted to user@spark.apache.org by ashish rawat <dc...@gmail.com> on 2017/12/02 02:45:10 UTC

Re: NLTK with Spark Streaming

Thanks Nicholas, but the problem for us is that we want to use the NLTK
Python library, since our data scientists are training with it. Rewriting
the inference logic using some other library would be time consuming and,
in some cases, may not even work because some functions are unavailable.
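One way to run NLTK inference from Spark without rewriting it is a per-partition pattern: the model object is built once per partition, not once per record, so the load cost is amortized. The sketch below is hypothetical (the names `make_analyzer` and `process_partition` and the stand-in analyzer are not from this thread); real code would construct an NLTK object inside `make_analyzer`.

```python
# Sketch: per-partition inference pattern for calling a Python NLP
# library such as NLTK from Spark. The analyzer is created once per
# partition rather than once per record.

def make_analyzer():
    # Real code would build an NLTK object here, e.g.:
    #   from nltk.sentiment.vader import SentimentIntensityAnalyzer
    #   return SentimentIntensityAnalyzer().polarity_scores
    # A trivial stand-in keeps this sketch runnable anywhere:
    return lambda text: text.lower().split()

def process_partition(rows):
    analyzer = make_analyzer()   # loaded once per partition
    for row in rows:
        yield analyzer(row)

# Simulating one partition locally:
print(list(process_partition(["Hello World", "NLTK with Spark"])))
# [['hello', 'world'], ['nltk', 'with', 'spark']]
```

With Spark this would be applied as `text_rdd.mapPartitions(process_partition)`; `mapPartitions` is a standard Spark operator, though whether plain Python UDF-style execution meets streaming latency needs depends on the workload.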

On Nov 29, 2017 3:16 AM, "Nicholas Hakobian" <
nicholas.hakobian@rallyhealth.com> wrote:

Depending on your needs, it's fairly easy to write a lightweight Python
wrapper around the Databricks spark-corenlp library:
https://github.com/databricks/spark-corenlp


Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com


On Sun, Nov 26, 2017 at 8:19 AM, ashish rawat <dc...@gmail.com> wrote:

> Thanks Holden and Chetan.
>
> Holden - Have you tried it out? Do you know the right way to do it?
> Chetan - yes, if we use a Java NLP library, there should not be any issue
> integrating it with Spark Streaming, but as I pointed out earlier, we want
> to give data scientists the flexibility to use the language and library of
> their choice, instead of restricting them to a library of our choice.
>
> On Sun, Nov 26, 2017 at 9:42 PM, Chetan Khatri <
> chetan.opensource@gmail.com> wrote:
>
>> But you can still use the Stanford NLP library and distribute it through
>> Spark, right?
>>
>> On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>>> So it's certainly doable (it's not super easy, mind you), but until the
>>> Arrow UDF release goes out it will be rather slow.
>>>
>>> On Sun, Nov 26, 2017 at 8:01 AM ashish rawat <dc...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has someone tried running NLTK (Python) with Spark Streaming (Scala)? I
>>>> was wondering if this is a good idea, and what the right Spark operators
>>>> for it are. The reason we want to try this combination is that we don't
>>>> want to run our transformations in Python (PySpark), but after the
>>>> transformations we need to run some natural language processing
>>>> operations, and we don't want to restrict the functions data scientists
>>>> can use to Spark's natural language library. So, Spark Streaming with
>>>> NLTK looks like the right option, from the perspective of fast data
>>>> processing and data science flexibility.
>>>>
>>>> Regards,
>>>> Ashish
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>
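Holden's point about Arrow UDFs can be sketched with Spark's vectorized pandas UDFs (available from Spark 2.3): the inner function operates on a whole pandas Series per Arrow batch, which is what makes them much faster than row-at-a-time Python UDFs. The function name and stand-in scoring below are hypothetical; real code would call an NLTK analyzer instead. The pandas function is runnable on its own, while the Spark registration, which needs a live SparkSession, is shown in comments.

```python
# Sketch of a vectorized (Arrow-backed) pandas UDF wrapping NLP logic.
import pandas as pd

def sentiment_scores(texts: pd.Series) -> pd.Series:
    # Stand-in scoring; real code would call NLTK here, e.g.
    # SentimentIntensityAnalyzer().polarity_scores(t)["compound"]
    # applied over the batch.
    return texts.str.len().astype(float)

# With a SparkSession available, registration looks like:
#   from pyspark.sql.functions import pandas_udf
#   from pyspark.sql.types import DoubleType
#   score_udf = pandas_udf(sentiment_scores, DoubleType())
#   df.withColumn("score", score_udf(df["text"]))

print(sentiment_scores(pd.Series(["hi", "hello"])).tolist())  # [2.0, 5.0]
```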