Posted to users@apex.apache.org by "Mukkamula, Suryavamshivardhan (CWM-NR)" <su...@rbc.com> on 2016/08/09 15:29:51 UTC

handling French Characters using AbstractFileInputOperator

Hi,

I have files on HDFS containing French characters that I need to write to another file on HDFS. I am using AbstractFileInputOperator.java, which has the following method for streaming the input file. Can you please suggest how I should handle the French characters? (I suppose I should pass the UTF-8 character encoding when creating the input stream, but I am not sure how to do that.)

###############method from AbstractFileInputOperator.java####################

protected InputStream openFile(Path path) throws IOException
  {
    currentFile = path.toString();
    offset = 0;
    retryCount = 0;
    skipCount = 0;
    LOG.info("opening file {}", path);
    InputStream input = fs.open(path);
    return input;
  }

Regards,
Surya Vamshi

_______________________________________________________________________
If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the above-noted email address; please retain a copy of this confirmation for future reference.  

Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de cette confirmation pour les fins de reference future.

Re: handling French Characters using AbstractFileInputOperator

Posted by Yogi Devendra <yo...@apache.org>.
One more webinar link on this: https://www.youtube.com/watch?v=J8omclpAfps

~ Yogi

On 10 August 2016 at 07:53, Yogi Devendra <yo...@apache.org> wrote:

> If your use case is just copying files on HDFS and there is no need
> to look inside the file (parsing records, processing), then you need not use
> AbstractFileInputOperator.
>
> Instead, you can use FSInputModule and HDFSFileCopyModule, as done in this
> application:
> https://github.com/apache/apex-malhar/tree/master/apps/filecopy
>
> Here, files are read as raw binary data, so character encoding should
> not matter.
>
> https://www.brighttalk.com/webcast/13685/194937/hadoop-ingestion-made-easy
> gives some explanation of this.
>
> Let me know if this filecopy application suits your use case.
>
> ~ Yogi

Re: handling French Characters using AbstractFileInputOperator

Posted by Yogi Devendra <yo...@apache.org>.
If your use case is just copying files on HDFS and there is no need
to look inside the file (parsing records, processing), then you need not use
AbstractFileInputOperator.

Instead, you can use FSInputModule and HDFSFileCopyModule, as done in this
application:
https://github.com/apache/apex-malhar/tree/master/apps/filecopy

Here, files are read as raw binary data, so character encoding should
not matter.

https://www.brighttalk.com/webcast/13685/194937/hadoop-ingestion-made-easy
gives some explanation of this.

Let me know if this filecopy application suits your use case.

~ Yogi

Re: handling French Characters using AbstractFileInputOperator

Posted by Munagala Ramanath <ra...@datatorrent.com>.
If you are dealing with file data purely as byte arrays and copying them
from one place to another, you need not worry about the language or charset
since the bytes are preserved.
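
To illustrate the point about bytes being preserved, here is a minimal sketch that uses in-memory streams in place of HDFS streams. The class name ByteCopyDemo and the helper copy() are made up for this example and are not part of Apex or Hadoop. The sample text is deliberately encoded as ISO-8859-1 to show that a pure byte copy does not care which encoding the file uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteCopyDemo {
  // Hypothetical helper: copies all bytes from in to out without ever
  // decoding them into characters.
  static void copy(InputStream in, OutputStream out) throws IOException {
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
  }

  public static void main(String[] args) throws IOException {
    // French text with accented characters, encoded as ISO-8859-1.
    byte[] original =
        "Montr\u00e9al, fran\u00e7ais".getBytes(StandardCharsets.ISO_8859_1);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    copy(new ByteArrayInputStream(original), out);
    // The copy is byte-for-byte identical to the input.
    System.out.println(Arrays.equals(original, out.toByteArray()));
  }
}
```

Since no String is ever created, no charset is involved at any step.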

If you are converting them to Strings explicitly, or using classes that might
do so implicitly, you need to specify an appropriate *Charset* for the
conversion.

The *InputStreamReader* class has a constructor that takes a *Charset* and
another that takes a charset name as a string.

Standard Charset objects are available as static fields of the
*StandardCharsets* class:
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
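
As a rough sketch of why the charset matters, the example below decodes the same UTF-8 bytes twice, once with the right charset and once with the wrong one. The class name CharsetDecodeDemo and the helper firstLine() are made up for illustration, and an in-memory stream stands in for the HDFS input stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecodeDemo {
  // Hypothetical helper: reads the first line of the given bytes,
  // decoding them with the given charset.
  static String firstLine(byte[] data, Charset cs) throws IOException {
    try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(data), cs))) {
      return br.readLine();
    }
  }

  public static void main(String[] args) throws IOException {
    // "français" followed by a newline, encoded as UTF-8.
    byte[] utf8 = "fran\u00e7ais\n".getBytes(StandardCharsets.UTF_8);

    // Decoding with the matching charset recovers the original text.
    String right = firstLine(utf8, StandardCharsets.UTF_8);
    // Decoding with a mismatched charset turns the two bytes of the
    // cedilla character into two separate wrong characters.
    String wrong = firstLine(utf8, StandardCharsets.ISO_8859_1);

    System.out.println(right.equals("fran\u00e7ais"));
    System.out.println(right.length() + " vs " + wrong.length());
  }
}
```

The mismatched decode does not throw; it silently produces a longer, garbled string, which is why the charset should be stated explicitly instead of relying on the platform default.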

So, *if you know for sure that your input file is encoded in UTF-8*, you can
create a *BufferedReader* that wraps an *InputStreamReader* as shown below
(very similar to the code in *LineByLineFileInputOperator* in Apex Malhar):

------------------------------------------------------------------------------
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.Path;
...

  // Reader that decodes the raw byte stream as UTF-8 text
  protected transient BufferedReader br;

  @Override
  protected InputStream openFile(Path path) throws IOException
  {
    InputStream is = super.openFile(path);
    br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
    return is;
  }

  @Override
  protected void closeFile(InputStream is) throws IOException
  {
    super.closeFile(is);
    br.close();
    br = null;
  }

  // Returns the next decoded line, or null at end of file
  @Override
  protected String readEntity() throws IOException
  {
    return br.readLine();
  }
--------------------------------------------------------------------
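
Once lines have been decoded into Strings like this, they have to be encoded back into bytes when they are written out, and the same care applies in that direction. The sketch below shows only that encoding step; the class name EncodeLineDemo and the helper toUtf8Line() are made up for illustration and are not part of Apex Malhar. It assumes the output file should also be UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class EncodeLineDemo {
  // Hypothetical helper: encodes one line of text as UTF-8 bytes,
  // terminated by a newline.
  static byte[] toUtf8Line(String line) {
    return (line + "\n").getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // "été": each accented character takes two bytes in UTF-8.
    byte[] bytes = toUtf8Line("\u00e9t\u00e9");

    // Decoding with the same charset gives back the original text.
    String decoded = new String(bytes, StandardCharsets.UTF_8);
    System.out.println(decoded.equals("\u00e9t\u00e9\n"));
    System.out.println(bytes.length);
  }
}
```

Calling getBytes() with no argument would use the platform default charset instead, which may differ between machines in a cluster.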

Ram