Posted to user@hive.apache.org by Sujeet Pardeshi <Su...@sas.com> on 2018/08/07 08:33:08 UTC

Hive output file 000000_0

Hi All,
I am doing an INSERT OVERWRITE operation through a Hive external table onto AWS S3. Hive creates an output file named 000000_0 on S3. However, at times I notice that it creates files with other names, like 000003_0. I always need to overwrite the existing file, but with inconsistent file names I am unable to do so. How do I force Hive to always create a consistent file name like 000000_0? Below is an example of what my code looks like, where tab_content is a Hive external table.

INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
SELECT * FROM source;

Regards,
Sujeet Singh Pardeshi
Software Specialist
SAS Research and Development (India) Pvt. Ltd.
Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune, Maharashtra, 411 013
off: +91-20-30418810  

 "When the solution is simple, God is answering…" 

RE: Hive output file 000000_0

Posted by Ryan Harris <Ry...@zionsbancorp.com>.
Also, I believe the output format matters. If your output is TEXTFILE, I think all of the reducers can append to the same file concurrently. However, for block-based output formats, that isn't possible.
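
For reference, the output format here is whatever the table's DDL declares; a minimal sketch, with hypothetical table names:

CREATE TABLE t_text (c STRING) STORED AS TEXTFILE;  -- row-oriented plain text
CREATE TABLE t_orc  (c STRING) STORED AS ORC;       -- columnar, block-based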


Re: Hive output file 000000_0

Posted by Furcy Pin <pi...@gmail.com>.
Indeed, if your table is big, then you DO want to parallelise, and setting the number of reducers to 1 is clearly not a good idea.

From what you explained, I still don't understand what your exact problem is. Can't you just leave your query as is and be okay with many part_0000X files?
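
If the goal is one file per partition while keeping parallelism across partitions, a common trick is to distribute by the partition column, so each partition is written by exactly one reducer. A sketch reusing the names from the original query (note the file name will still reflect the reducer id, so it is not guaranteed to be 000000_0):

INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
SELECT * FROM source
DISTRIBUTE BY datekey;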

RE: Hive output file 000000_0

Posted by Sujeet Pardeshi <Su...@sas.com>.
Thanks, Furcy. Won't the setting below hurt performance? Just one reducer? I am processing huge volumes of data and cannot compromise on performance.

SET mapred.reduce.tasks = 1



Regards,

Sujeet Singh Pardeshi

Software Specialist

SAS Research and Development (India) Pvt. Ltd.
Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune, Maharashtra, 411 013
off: +91-20-30418810
 "When the solution is simple, God is answering…"


Re: Hive output file 000000_0

Posted by Furcy Pin <pi...@gmail.com>.
It might sound silly, but isn't that what Hive is supposed to do, being a distributed computation framework and all?

Hive will write one file per reducer, called 000000_0, 000001_0, etc., where the number corresponds to the number of your reducer. Sometimes the _0 will be a _1 or _2 or more, depending on retries or speculative execution.

This is how MapReduce works, and this is how Tez and Spark work too: you can't have several executors writing simultaneously to the same file on HDFS (maybe you can on S3, but not through the HDFS API), so each executor writes to its own file. If you want to read the data back with MapReduce, Tez or Spark, you simply specify the folder path and all the files in it will be read. That's what Hive does too.

Now, if you really want only one file (for instance, to export it as a CSV file), I guess you can try adding this before your query (this is for MapReduce; there are probably similar configurations for Hive on Tez or Hive on Spark):

SET mapred.reduce.tasks = 1;
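
Alternatively, Hive's small-file merge settings can coalesce a job's output into larger files without forcing the whole query through a single reducer. A sketch, assuming Hive on MapReduce or Tez (the property names exist in Hive; the sizes are hypothetical and worth tuning):

-- merge small output files after the job finishes
SET hive.merge.mapfiles = true;                 -- merge outputs of map-only jobs
SET hive.merge.mapredfiles = true;              -- merge outputs of map-reduce jobs
SET hive.merge.tezfiles = true;                 -- same, for Hive on Tez
SET hive.merge.smallfiles.avgsize = 128000000;  -- merge if avg file size falls below ~128 MB
SET hive.merge.size.per.task = 256000000;       -- target size (~256 MB) for merged files

INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
SELECT * FROM source;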

And if you really suspect that some files are not correctly overwritten (as happened to me once with Spark on Azure blob storage), I suggest that you check the file timestamps to determine which execution of your query wrote them.
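
One quick way to check the timestamps from the Hive CLI itself is the dfs command, which lists each file with its modification time (the bucket and partition below are hypothetical):

dfs -ls s3://my-bucket/daily_summary/datelocal=2018-08-01/;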


Hope this helps,

Furcy


RE: Hive output file 000000_0

Posted by Sujeet Pardeshi <Su...@sas.com>.
Hi Deepak,
Thanks for your response. The table is not bucketed or clustered, as you can see below.

DROP TABLE IF EXISTS ${SCHEMA_NM}.daily_summary;
CREATE EXTERNAL TABLE ${SCHEMA_NM}.daily_summary
(
  bouncer VARCHAR(12),
  device_type VARCHAR(52),
  visitor_type VARCHAR(10),
  visit_origination_type VARCHAR(65),
  visit_origination_name VARCHAR(260),
  pg_domain_name VARCHAR(215),
  class1_id VARCHAR(650),
  class2_id VARCHAR(650),
  bouncers INT,
  rv_revenue DECIMAL(17,2),
  visits INT,
  active_page_view_time INT,
  total_page_view_time BIGINT,
  average_visit_duration INT,
  co_conversions INT,
  page_views INT,
  landing_page_url VARCHAR(1332),
  dt DATE
)
PARTITIONED BY (datelocal DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LOCATION '${OUTPUT_PATH}/daily_summary/'
TBLPROPERTIES ('serialization.null.format'='');

MSCK REPAIR TABLE ${SCHEMA_NM}.daily_summary;

Regards,
Sujeet Singh Pardeshi
Software Specialist
SAS Research and Development (India) Pvt. Ltd.
Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune, Maharashtra, 411 013
off: +91-20-30418810  

 "When the solution is simple, God is answering…" 


Re: Hive output file 000000_0

Posted by Deepak Jaiswal <dj...@hortonworks.com>.
Hi Sujeet,

I am assuming that the table is bucketed? If so, the name represents which bucket the file belongs to, as Hive creates one file per bucket for each operation.

In this case, the file 000003_0 belongs to bucket 3. To always have files named 000000_0, the table must be unbucketed.
I hope this helps.
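
For reference, a bucketed table is one declared with a CLUSTERED BY clause; a minimal sketch with hypothetical names (an unbucketed table simply omits that clause):

CREATE TABLE clicks_bucketed (
  user_id BIGINT,
  url STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS  -- produces files 000000_0 .. 000003_0, one per bucket
STORED AS ORC;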

Regards,
Deepak
