Posted to user@spark.apache.org by Yana Kadiyska <ya...@gmail.com> on 2015/08/25 18:51:50 UTC

[SQL/Hive] Trouble with refreshTable

I'm having trouble with refreshTable, and I suspect it's because I'm
using it incorrectly.

I am doing the following:

1. Create a DataFrame from a parquet path with wildcards, e.g.
/foo/bar/*.parquet
2. Use registerTempTable to register my DataFrame
3. A new file is dropped under /foo/bar/
4. Call hiveContext.refreshTable in the hope that the paths for the
DataFrame are re-evaluated
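
The steps above can be sketched roughly as follows. This is a minimal
PySpark-style sketch, not the exact code in use: it assumes Spark 1.4 or
later (where the context exposes `read`), and the function name, the
table name passed in, and the `drop_new_file` callback standing in for
step 3 are all hypothetical. The context is passed in as a parameter so
the function itself needs no imports.

```python
def count_before_and_after_refresh(hive_context, path_glob, table_name,
                                   drop_new_file):
    """Run steps 1-4 and return the row counts seen before and after.

    hive_context  -- assumed to be a pyspark.sql.HiveContext (Spark 1.4+)
    path_glob     -- wildcard path, e.g. "/foo/bar/*.parquet"
    table_name    -- hypothetical temp table name chosen by the caller
    drop_new_file -- hypothetical callback that performs step 3
    """
    # Step 1: create the DataFrame from the wildcard path.
    df = hive_context.read.parquet(path_glob)

    # Step 2: register it as a temporary table.
    df.registerTempTable(table_name)
    count_before = hive_context.table(table_name).count()

    # Step 3: a new file is dropped under the directory.
    drop_new_file()

    # Step 4: ask the context to refresh the table, then query again.
    hive_context.refreshTable(table_name)
    count_after = hive_context.table(table_name).count()

    return count_before, count_after
```

With the behaviour described in this post, the two counts returned would
be equal even though a second file is present after step 3.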

Step 4 does not work as I imagined: if there is 1 file under /foo/bar/
at step 1 and there are 2 files after step 3, I still get the same count
when I query the table after the refresh.

So I have 2 questions:

1) Is there a way to see the files that a DataFrame/RDD is backed by?
2) What is a reasonable way to refresh the table with "newcomer" data?
I suspect I have to start over from step 1 to force the DataFrame to
see the new files, but I am hoping there is a simpler way. (I know
DataFrames are immutable, but they are also lazy, so I'm thinking paths
with wildcards evaluated per call might be possible?)
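
For reference, the "start over from step 1" fallback would look
something like the sketch below. It is only an illustration of that
fallback, assuming a PySpark HiveContext from Spark 1.4 or later; the
helper name `reload_temp_table` is hypothetical. It relies on
registerTempTable replacing an existing temporary table of the same
name.

```python
def reload_temp_table(hive_context, path_glob, table_name):
    """Re-run steps 1 and 2 so the temp table points at a new DataFrame.

    hive_context -- assumed to be a pyspark.sql.HiveContext (Spark 1.4+)
    path_glob    -- wildcard path, e.g. "/foo/bar/*.parquet"
    table_name   -- name of the temp table to (re-)register
    """
    # Step 1 again: build a fresh DataFrame from the wildcard path.
    fresh_df = hive_context.read.parquet(path_glob)

    # Step 2 again: register under the same name, replacing the old
    # temp table registration.
    fresh_df.registerTempTable(table_name)
    return fresh_df
```

This would be called after each new file arrives, which is exactly the
repetition I would like to avoid if a refresh can do the job.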

Thanks for any insights.