Posted to dev@flume.apache.org by "syntony liu (JIRA)" <ji...@apache.org> on 2013/09/05 16:02:51 UTC

[jira] [Created] (FLUME-2182) Spooling Directory Source can't ingest data completely when a file contains wide characters, such as Chinese characters.

syntony liu created FLUME-2182:
----------------------------------

             Summary: Spooling Directory Source can't ingest data completely when a file contains wide characters, such as Chinese characters.
                 Key: FLUME-2182
                 URL: https://issues.apache.org/jira/browse/FLUME-2182
             Project: Flume
          Issue Type: Bug
          Components: Sinks+Sources
    Affects Versions: v1.4.0
            Reporter: syntony liu
            Priority: Critical


The bug is in ResettableFileInputStream.java, in int readChar().
If the last byte of buf holds only part of a wide character, readChar() shouldn't return -1 (ResettableFileInputStream.java:186). Returning -1 there
loses the remaining data in the file.
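
As an illustration of the condition described above, here is a minimal standalone sketch. It is not Flume code; the class and method names are hypothetical. It feeds java.nio's CharsetDecoder a UTF-8 buffer that ends in the middle of a three-byte character, with endOfInput set to false. The decoder produces no char and leaves the partial bytes unconsumed, so a caller that treats "no char produced" as end of file would stop early.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Flume.
class PartialCharDemo {

  // Decodes the bytes as UTF-8 with endOfInput=false and returns
  // { chars produced, input bytes left unconsumed }.
  static int[] decode(byte[] bytes) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    ByteBuffer in = ByteBuffer.wrap(bytes);
    CharBuffer out = CharBuffer.allocate(1);
    CoderResult res = decoder.decode(in, out, false);
    if (res.isError()) {
      throw new IllegalStateException(res.toString());
    }
    out.flip();
    return new int[] { out.remaining(), in.remaining() };
  }

  public static void main(String[] args) {
    // U+4E2D is a Chinese character that takes three bytes in UTF-8.
    byte[] full = "\u4e2d".getBytes(StandardCharsets.UTF_8);
    byte[] partial = java.util.Arrays.copyOf(full, 2); // cut mid-character

    int[] a = decode(full);
    int[] b = decode(partial);
    System.out.println("full:    chars=" + a[0] + " bytesLeft=" + b.length * 0 + a[1]);
    System.out.println("partial: chars=" + b[0] + " bytesLeft=" + b[1]);
  }
}
```

In the partial case the unconsumed bytes are still in the input buffer, so the buffer has to be topped up with more file data before decoding again, rather than signalling end of input.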

I fixed it as follows:
  public synchronized int readChar() throws IOException {
    // if (!buf.hasRemaining()) {
    // refill early so a multi-byte character is not split at the end of buf
    if (buf.limit() - buf.position() < 10) {
      refillBuf();
    }

    int start = buf.position();
    charBuf.clear();

    boolean isEndOfInput = false;
    if (position >= fileSize) {
      isEndOfInput = true;
    }

    CoderResult res = decoder.decode(buf, charBuf, isEndOfInput);
    if (res.isMalformed() || res.isUnmappable()) {
      res.throwException();
    }

    int delta = buf.position() - start;

    charBuf.flip();
    if (charBuf.hasRemaining()) {
      char c = charBuf.get();
      // don't increment the persisted location if we are in between a
      // surrogate pair, otherwise we may never recover if we seek() to this
      // location!
      incrPosition(delta, !Character.isHighSurrogate(c));
      return c;

    // there may be a partial character in the decoder buffer
    } else {
      incrPosition(delta, false);
      return -1;
    }

  }

This avoids the partial character, but introduces a new issue: sometimes a line of a log file ends up with a repeated character.
E.g.
   original file: 123456
   sink file:     1233456
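
One way a repeated character like this can arise is sketched below. This is a toy model under stated assumptions, not a diagnosis: refillBuf() is not shown in this message, so the sketch assumes a refill step that keeps the unconsumed bytes with compact() and then reads more data starting at the offset of the last consumed byte. All names, the buffer size of 3 and the threshold of 2 are hypothetical. Under that assumption, the bytes still sitting in the buffer are read from the file a second time.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical toy model, not Flume code.
class RefillDuplicationDemo {

  // Copies `file` through a small buffer that is refilled early (when fewer
  // than 2 bytes remain). If rereadFromConsumed is true, each refill reads
  // from the count of bytes already handed out (assumed buggy behaviour);
  // otherwise it reads from the count of bytes already loaded.
  static String copy(byte[] file, boolean rereadFromConsumed) {
    ByteBuffer buf = ByteBuffer.allocate(3);
    buf.flip(); // start with an empty, readable buffer
    int consumed = 0; // bytes handed to the caller
    int loaded = 0;   // bytes copied from the file into buf
    StringBuilder out = new StringBuilder();
    while (true) {
      if (buf.remaining() < 2) {
        buf.compact(); // keep the unconsumed bytes at the front
        int from = rereadFromConsumed ? consumed : loaded;
        int n = Math.min(buf.remaining(), file.length - from);
        if (n > 0) {
          buf.put(file, from, n);
          loaded = from + n;
        }
        buf.flip();
      }
      if (!buf.hasRemaining()) {
        break; // nothing left in the buffer or the file
      }
      out.append((char) buf.get()); // ASCII only in this toy example
      consumed++;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    byte[] file = "123456".getBytes(StandardCharsets.US_ASCII);
    System.out.println("reread from consumed offset: " + copy(file, true));
    System.out.println("read from loaded offset:     " + copy(file, false));
  }
}
```

If the real refill logic behaves like the first variant, an early refill would need either to discard the leftover bytes before re-reading or to read from the offset just past the bytes already in the buffer.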
