Posted to user@spark.apache.org by Soheil Pourbafrani <so...@gmail.com> on 2018/09/24 12:53:49 UTC

How to access line fileName in loading file using the textFile method

Hi, my data is stored as text files. In the processing logic, I need to
know which file each word comes from. Specifically, I need to tokenize the
words and create <fileName, word> pairs. The naive solution is to call
sc.textFile for each file, keep the fileName in a variable, and create the
pairs, but that is not efficient, and I got a StackOverflow error as the
dataset grew.

So my question is: supposing all the files are in one directory and I read
them using sc.textFile("path/*"), how can I tell which file each record
comes from?

Is it possible (and needed) to customize the textFile method?

Re: How to access line fileName in loading file using the textFile method

Posted by Maxim Gekk <ma...@databricks.com>.
> So my question is: supposing all the files are in one directory and I read
> them using sc.textFile("path/*"), how can I tell which file each record
> comes from?

Maybe the input_file_name() function can help you:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@input_file_name():org.apache.spark.sql.Column
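
A minimal PySpark sketch of how input_file_name() could be applied here. It is
not taken from the thread. It assumes a SparkSession is available and reads the
files through the DataFrame API (spark.read.text) rather than sc.textFile,
since input_file_name() is a Spark SQL function. The helper name
file_word_pairs, the column names, and the whitespace tokenization are
assumptions made for illustration.

```python
def file_word_pairs(spark, path):
    """Return a DataFrame with one (fileName, word) row per word under `path`.

    `spark` is assumed to be an existing SparkSession; `path` may be a glob
    such as "path/*". Both names are illustrative, not from the thread.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import functions as F

    # spark.read.text yields one row per line in a column named "value".
    lines = spark.read.text(path)

    return (
        lines
        # Tag every line with the file it was read from.
        .withColumn("fileName", F.input_file_name())
        # Split each line on whitespace and emit one row per token.
        .select("fileName", F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
        # Drop the empty tokens produced by blank lines or leading whitespace.
        .where(F.col("word") != "")
    )
```

For example, file_word_pairs(spark, "path/*").show(truncate=False) would list
one row per word. Note that input_file_name() returns the full path or URI of
the file, so the base name has to be extracted separately if only that is
wanted.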

-- 
Maxim Gekk
Technical Solutions Lead
Databricks Inc.
maxim.gekk@databricks.com
databricks.com

Re: How to access line fileName in loading file using the textFile method

Posted by Jörn Franke <jo...@gmail.com>.
You can create your own data source that does exactly this.

Why is the file name important if the file content is the same?

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: How to access line fileName in loading file using the textFile method

Posted by vermanurag <an...@fnmathlogic.com>.
Spark has sc.wholeTextFiles(), which returns an RDD of tuples. The first
element of each tuple is the file name and the second element is the file
content.
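
A minimal PySpark sketch of this approach, under the assumption that each file
is small enough to be held in memory as a single record, since
wholeTextFiles() loads every file whole. The helper names to_pairs and
file_word_pairs_rdd and the whitespace tokenization are illustrative and not
from the thread.

```python
def to_pairs(path_and_content):
    """Turn one (path, content) record into a list of (path, word) pairs."""
    path, content = path_and_content
    # str.split() with no argument splits on any run of whitespace,
    # including newlines, and never yields empty tokens.
    return [(path, word) for word in content.split()]


def file_word_pairs_rdd(sc, path):
    """Return an RDD of (path, word) pairs for all files matching `path`.

    `sc` is assumed to be an existing SparkContext.
    """
    # wholeTextFiles yields one (file path, file content) tuple per file.
    return sc.wholeTextFiles(path).flatMap(to_pairs)
```

For example, file_word_pairs_rdd(sc, "path/*").take(5) would return the first
five pairs. The key is the full path of the file, not just its base name.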


