You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/05/09 04:47:05 UTC
[jira] [Comment Edited] (DRILL-5490)
RepeatedVarCharOutput.recordStart becomes corrupted on realloc
[ https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002016#comment-16002016 ]
Paul Rogers edited comment on DRILL-5490 at 5/9/17 4:46 AM:
------------------------------------------------------------
The other place that {{recordStart}} is used is to get the first row of data for DRILL-951, to extract the header. Consider {{CompliantTextReader.extractHeader()}}:
{code}
HeaderOutputMutator hOutputMutator = new HeaderOutputMutator();
TextOutput hOutput = new RepeatedVarCharOutput(hOutputMutator, getColumns(), true);
this.allocate(hOutputMutator.fieldVectorMap);
...
// extract first row only
reader.parseNext();
// grab the field names from output
String [] fieldNames = ((RepeatedVarCharOutput)hOutput).getTextOutput();
{code}
The only way the above can works is that we never call {RepeatedVarCharOutput.startBatch()}}, so {{recordStart}} defaults to 0, and so the zero-record check accidentally works:
{code}
if (this.recordStart != characterData) {
throw new ExecutionSetupException("record text was requested before finishing record");
}
{code}
It is likely, however, that an empty header line would not be caught and so the above exception would never actually fire.
was (Author: paul-rogers):
The other place that {{recordStart}} is used is to get the first row of data for DRILL-951, to extract the header. Here again, we are fortunate that one bug masks another. The {{RepeatedVarCharOutput}} is never used with headers. Consider {{CompliantTextReader.setup()}}:
{code}
if (settings.isHeaderExtractionEnabled()){
//extract header and use that to setup a set of VarCharVectors
String [] fieldNames = extractHeader();
output = new FieldVarCharOutput(outputMutator, fieldNames, getColumns(), isStarQuery());
} else {
//simply use RepeatedVarCharVector
output = new RepeatedVarCharOutput(outputMutator, getColumns(), isStarQuery());
}
{code}
> RepeatedVarCharOutput.recordStart becomes corrupted on realloc
> --------------------------------------------------------------
>
> Key: DRILL-5490
> URL: https://issues.apache.org/jira/browse/DRILL-5490
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.10.0
> Reporter: Paul Rogers
> Priority: Minor
>
> The class {{RepeatedVarCharOutput}}, used to read text data into the {{columns}} array, has a member called {{recordStart}} which is supposed to track the memory offset of the start of the record data in the {{columns}} VarChar array. However, the logic for the member is very wrong. First, it is initialized based on an uninitialized variable:
> {{code}}
> public void startBatch() {
> this.recordStart = characterDataOriginal;
> ...
> loadVarCharDataAddress();
> }
> private void loadVarCharDataAddress(){
> ...
> this.characterDataOriginal = buf.memoryAddress();
> ...
> }
> {{code}}
> Second, the class keeps track of actual memory addresses for the VarChar data. When data would exceed the buffer, memory is reallocated using the following method. But, this method *does not* adjust the {{recordStart}} member, which now points to a bogus memory address:
> {code}
> private void expandVarCharData(){
> vector.getDataVector().reAlloc();
> long diff = characterData - characterDataOriginal;
> loadVarCharDataAddress();
> characterData += diff;
> }
> {code}
> Fortunately, like so many bugs in this class, these bugs are offset by the fact that the variable is never used in a way that cause obvious problems (else these issues would have been found sooner.) The only real use is:
> {code}
> @Override
> public boolean rowHasData() {
> return recordStart < characterData;
> }
> {code}
> Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}:
> {code}
> } catch(StreamFinishedPseudoException e) {
> // if we've written part of a field or all of a field, we should send this row.
> if (fieldsWritten == 0 && !output.rowHasData()) {
> throw e;
> }
> }
> {code}
> Thus, the bug will cause a failure only if:
> * A batch is resized, and
> * The new address is lower than the original address, and
> * It is the last batch in the file.
> Because of the low probability of hitting the bug, the priority is set to Minor.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)