Posted to mapreduce-user@hadoop.apache.org by Vijaya Narayana Reddy Bhoomi Reddy <vi...@whishworks.com> on 2015/03/25 10:55:57 UTC

Identifying new files on HDFS

Hi,

We have a requirement to process only new files in HDFS on a daily basis. I
am sure this is a common requirement in many ETL-style processing
scenarios. Is there a way to identify new files that have been added to a
path in HDFS? For example, assume some files have already been present for
some time, and I have added new files today. I want to process only those
new files. What is the best way to achieve this?

Thanks & Regards
Vijay



RE: Identifying new files on HDFS

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Good points. I will have "done", "empty" and "failed" directories.

HTH

Mich Talebzadeh

 

From: Harsh J [mailto:harsh@cloudera.com]
Sent: 25 March 2015 21:24
To: user@hadoop.apache.org; mich@peridale.co.uk
Subject: Re: Identifying new files on HDFS

Look at the timestamps of the files? HDFS maintains both mtimes and atimes (the latter is not exposed in -ls, though).

In an ETL context, a simple workflow system also resolves this. You have an incoming directory, a done directory, a destination directory, etc., and you move files between them before and after processing for every job, to manage new content and avoid repeated processing (as one simple example).

 

Re: Identifying new files on HDFS

Posted by Harsh J <ha...@cloudera.com>.
Look at the timestamps of the files? HDFS maintains both mtimes and atimes
(the latter is not exposed in -ls, though).
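
[Editorial sketch] One way to act on the mtime suggestion is to filter a directory listing by modification time. The Python below is a minimal sketch, not something from the thread: it assumes the listing text comes from `hdfs dfs -ls` in its usual eight-column layout (permissions, replication, owner, group, size, date, time, path), and the function names, the sample paths and the `last_run` cutoff are all hypothetical. It only parses text, so it runs without a cluster.

```python
from datetime import datetime


def parse_ls_line(line):
    """Return (mtime, path) for one file line of `hdfs dfs -ls` output, else None."""
    # Assumed columns: permissions, replication, owner, group, size, date, time, path
    parts = line.split(None, 7)
    if len(parts) != 8 or parts[0].startswith("d"):
        return None  # "Found N items" header, blank line, or a directory entry
    try:
        mtime = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
    except ValueError:
        return None
    return mtime, parts[7]


def files_modified_since(ls_output, cutoff):
    """Return paths of files whose listed mtime is strictly later than cutoff."""
    found = []
    for line in ls_output.splitlines():
        parsed = parse_ls_line(line)
        if parsed and parsed[0] > cutoff:
            found.append(parsed[1])
    return found


# Hypothetical listing text; on a cluster it would be captured from `hdfs dfs -ls <dir>`.
sample = """Found 3 items
-rw-r--r--   3 etl hadoop       1024 2015-03-24 18:00 /data/incoming/old.csv
-rw-r--r--   3 etl hadoop       2048 2015-03-25 09:30 /data/incoming/new.csv
drwxr-xr-x   - etl hadoop          0 2015-03-25 09:31 /data/incoming/tmp
"""
last_run = datetime(2015, 3, 25, 0, 0)  # hypothetical time of the previous job run
new_files = files_modified_since(sample, last_run)
print(new_files)
```

The listing shows times to the minute only, so a job that relies on this would probably need to store its cutoff carefully and tolerate a file being picked up twice around the boundary.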

In an ETL context, a simple workflow system also resolves this. You have an
incoming directory, a done directory, a destination directory, etc., and
you move files between them before and after processing for every job, to
manage new content and avoid repeated processing (as one simple example).
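
[Editorial sketch] The directory-based workflow can be illustrated roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: it uses the local filesystem as a stand-in for HDFS so that it runs anywhere (on a cluster the move would be an HDFS rename, for example via `hdfs dfs -mv`), the "failed" directory is borrowed from the follow-up reply in this thread, and `process_incoming` and `demo_handler` are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


def process_incoming(incoming, done, failed, handler):
    """Run handler on every file in incoming, then move the file to done or failed."""
    done.mkdir(parents=True, exist_ok=True)
    failed.mkdir(parents=True, exist_ok=True)
    results = {"done": [], "failed": []}
    for path in sorted(p for p in incoming.iterdir() if p.is_file()):
        try:
            handler(path)
            target, key = done, "done"
        except Exception:
            target, key = failed, "failed"
        # Moving the file out of incoming is what prevents repeated processing.
        shutil.move(str(path), str(target / path.name))
        results[key].append(path.name)
    return results


def demo_handler(path):
    """Hypothetical processing step: reject empty files."""
    if not path.read_text().strip():
        raise ValueError("empty file")


# Demo on a temporary local directory tree (stand-in for HDFS paths).
root = Path(tempfile.mkdtemp())
incoming = root / "incoming"
incoming.mkdir()
(incoming / "a.csv").write_text("1,2,3\n")
(incoming / "b.csv").write_text("")
results = process_incoming(incoming, root / "done", root / "failed", demo_handler)
print(results)
```

With this layout, whatever sits in the incoming directory at the start of a run is by definition new, so no timestamp bookkeeping is needed.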

On Wed, Mar 25, 2015 at 11:11 PM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:

> Hi,
>
> Have you considered taking a snapshot of the files at close of business,
> comparing it with a new snapshot, and processing only the new ones? A simple
> shell script will do.
>
> HTH

-- 
Harsh J

Re: Identifying new files on HDFS

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi,

Have you considered taking a snapshot of the files at close of business, comparing it with a new snapshot, and processing only the new ones? A simple shell script will do.
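
[Editorial sketch] The snapshot comparison could look something like the Python below. It assumes that each "snapshot" is simply a saved text listing with one file path per line, which is an interpretation and not something the message specifies; `new_paths` and the sample paths are hypothetical.

```python
def new_paths(previous_listing, current_listing):
    """Return paths present in the current listing but absent from the previous one."""
    previous = {line.strip() for line in previous_listing.splitlines() if line.strip()}
    current = [line.strip() for line in current_listing.splitlines() if line.strip()]
    return [path for path in current if path not in previous]


# Hypothetical snapshots: one path per line, saved at close of business each day.
yesterday = """/data/incoming/a.csv
/data/incoming/b.csv
"""
today = """/data/incoming/a.csv
/data/incoming/b.csv
/data/incoming/c.csv
"""
added = new_paths(yesterday, today)
print(added)
```

A comparison by path alone would not notice a file that was replaced under the same name; including the size or modification time in each snapshot line would be one way to cover that case.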

HTH

