Posted to dev@spark.apache.org by 尹绪森 <yi...@gmail.com> on 2014/02/25 03:35:43 UTC

Preparing to provide a small text files input API in mllib

Hi community,

As I move forward with writing an LDA (Latent Dirichlet Allocation) implementation for Spark
mllib, I find that a small-files input API would be useful, so I am writing a
smallTextFiles() method to support it.

smallTextFiles() digests a directory of text files and returns an
RDD[(String, String)], where the first String is the file name and the second
is the contents of that text file.

smallTextFiles() can be used for local disk IO or HDFS IO, just like
textFile() in SparkContext. In the LDA scenario, there are two common
uses:

1. We use smallTextFiles() to preprocess local disk files, i.e. combine
those files into one large file, then transfer it onto HDFS for further
processing, such as LDA clustering.

2. We can also transfer the raw directory of small files onto HDFS (though
this is not recommended, because it consumes too many NameNode entries),
then cluster it directly with LDA.
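
For the first use case, the local combine step could look like the following sketch (the one-record-per-line format and all names here are assumptions for illustration, not part of the proposal):

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Sketch: merge a directory of small files into one combined file, one
// record per line, as "fileName<TAB>contents". Embedded newlines in the
// contents are escaped as the two characters backslash-n so each record
// stays on one line. A real pipeline might prefer a Hadoop SequenceFile.
object CombineSmallFiles {
  def combine(inputDir: String, outputFile: String): Unit = {
    val out = new PrintWriter(outputFile, "UTF-8")
    try {
      for (f <- new File(inputDir).listFiles.filter(_.isFile).sortBy(_.getName)) {
        val src = Source.fromFile(f, "UTF-8")
        try out.println(f.getName + "\t" + src.mkString.replace("\n", "\\n"))
        finally src.close()
      }
    } finally out.close()
  }

  def main(args: Array[String]): Unit = {
    if (args.length == 2) combine(args(0), args(1))
  }
}
```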

I have also seen on the Spark mailing list that some users need this
function.

I have already finished it, but I am now trying to remove an unnecessary
shuffle to improve performance. Here is my code; all test suites pass:
https://github.com/yinxusen/incubator-spark/commit/ef418ea73e3cdaea9e45f60ce28fef3474872ade

What do you think? I look forward to your advice. Thanks!

-- 
Best Regards
-----------------------------------
Xusen Yin    尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and
Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/

Re: Preparing to provide a small text files input API in mllib

Posted by Mridul Muralidharan <mr...@gmail.com>.
Hi,

  I have not looked into why this would be needed, but given that it is
needed, I added a couple of comments to the PR.
Overall, it looks promising.

Regards,
Mridul

