Posted to user@spark.apache.org by Vikash Kumar <vi...@gmail.com> on 2016/05/31 17:32:10 UTC

how to get file name of record being read in Spark

I have a requirement in which I need to read the input files from a
directory and append the file name to each record in the output.

e.g. I have a directory /input/files/ which has the following files:
ABC_input_0528.txt
ABC_input_0531.txt

suppose input file ABC_input_0528.txt contains
111,abc,234
222,xyz,456

suppose input file ABC_input_0531.txt contains
100,abc,299
200,xyz,499

and I need to create one final output with the file name in each record,
using DataFrames.
My output file should look like this:
111,abc,234,ABC_input_0528.txt
222,xyz,456,ABC_input_0528.txt
100,abc,299,ABC_input_0531.txt
200,xyz,499,ABC_input_0531.txt

I am trying to use the inputFileName function, but it returns blank.
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html#inputFileName()
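
For reference, a minimal sketch of the pattern I am trying (the column
name file_name is just an example, and df stands for the DataFrame read
from the directory):

import org.apache.spark.sql.functions.inputFileName

// Sketch only: tag each row with the path of the file it came from.
// In Spark 1.6 this column can come back empty ("") for sources that
// do not read through SqlNewHadoopRDD, which seems to be my case.
val tagged = df.withColumn("file_name", inputFileName())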

Can anybody help me?

Re: how to get file name of record being read in Spark

Posted by Vikash Kumar <vi...@gmail.com>.
Can anybody suggest a different solution using inputFileName or
input_file_name?


Re: how to get file name of record being read in Spark

Posted by Vikash Kumar <vi...@gmail.com>.
Thanks Ajay, but I have the code below that generates the DataFrame, so I
wanted to achieve this by changing only the df. I thought inputFileName
would work, but it doesn't.

private def getPaths: String = {
  val regex = (conf.namingConvention + conf.extension)
    .replace("?", ".?")
    .replace("*", ".*?")
  val files = FileUtilities.getFiles(conf.filePath)
    .filter(x => x.getName.matches(regex))
  println(s"${files.length} files matched:\n${files.map(x => "-- " + x.getName).mkString("\n")}")
  files.map(_.getPath).mkString(",")
}

private def readTextFile(sqlContext: SQLContext): DataFrame = {
  println(s"Reading ${conf.filePath}")
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", conf.delimiter.getOrElse(defaultDelimiter))
    .option("header", if (conf.hasHeader.getOrElse(defaultHasHeader)) "true" else "false")
    .option("quote", if (conf.textQualifier.getOrElse(defaultTextQualifier)) "\"" else null)
    .schema(conf.schema.toStruct)
    .load(getPaths)
}

println("Intaking text file(s)...")
val df: DataFrame = readTextFile(sqlContext)
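
One workaround I am considering, in case inputFileName() simply cannot
work for this source (sketch only, reusing getPaths and conf from above;
lit and unionAll are standard Spark 1.6 API):

import org.apache.spark.sql.functions.lit

// Sketch: read each matched path separately, stamp its rows with the
// bare file name via lit(), then merge the per-file DataFrames.
// The same header/quote options as readTextFile would apply here too.
private def readTextFileWithName(sqlContext: SQLContext): DataFrame =
  getPaths.split(",").map { path =>
    val fileName = path.substring(path.lastIndexOf('/') + 1)
    sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", conf.delimiter.getOrElse(defaultDelimiter))
      .schema(conf.schema.toStruct)
      .load(path)
      .withColumn("file_name", lit(fileName))
  }.reduce(_.unionAll(_))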


Re: how to get file name of record being read in Spark

Posted by Ajay Chander <it...@gmail.com>.
Hi Vikash,

These are my thoughts: read the input directory using wholeTextFiles(),
which gives a paired RDD with the file name as key and the file content
as value. Then you can apply a map function to read each line and append
the key to the content.
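
Something along these lines (an untested sketch; the path is from your
example and sc is the SparkContext):

// wholeTextFiles keys each entry by the full file path, so strip the
// directory part and append the bare file name to every line.
// Note: each file is loaded whole into memory on one executor.
val output = sc.wholeTextFiles("/input/files/")
  .flatMap { case (path, content) =>
    val fileName = path.substring(path.lastIndexOf('/') + 1)
    content.split("\n").filter(_.trim.nonEmpty)
      .map(line => s"$line,$fileName")
  }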

Thank you,
Aj
