Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/26 21:10:19 UTC

[GitHub] [druid] sascha-coenen opened a new issue #9411: S3 InputSource issue when using prefix mode if a directory contains _SUCCESS files

sascha-coenen opened a new issue #9411: S3 InputSource issue when using prefix mode if a directory contains  _SUCCESS files
URL: https://github.com/apache/druid/issues/9411
 
 
   ### Affected Version
   v 0.17.0
   
   ### Description
   We set up Druid Indexer nodes to test the new native parallel ingestion. 
   Then we used the following InputSource section within an index_parallel spec to point to a "directory" in S3 that would contain a _SUCCESS file along with a bunch of data files.
   
   ```
         "inputSource": {
           "type": "s3",
           "prefixes": ["s3://smt-druid-ingestion-stage/SI-835/year=2020/month=01/day=20/hour=00/1580297687716/auction"]
         }
   ```
   
   The index_parallel task failed, and we observed in the logs that the above section had been rewritten to the following:
   
   ```
         "inputSource": {
           "type": "s3",
           "uris": null,
           "prefixes": null,
           "objects": [
             {
               "bucket": "smt-druid-ingestion-stage",
               "path": "SI-835/year=2020/month=01/day=20/hour=00/1580297687716/auction/_SUCCESS"
             }
           ]
         }
   ```
   
   It looks to me as though an attempt was made to filter _SUCCESS files out of the file list, but the filter condition inadvertently does the opposite.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vikramsinghchandel commented on issue #9411: S3 InputSource issue when using prefix mode if a directory contains _SUCCESS files

Posted by GitBox <gi...@apache.org>.
vikramsinghchandel commented on issue #9411: S3 InputSource issue when using prefix mode if a directory contains  _SUCCESS files
URL: https://github.com/apache/druid/issues/9411#issuecomment-602703280
 
 
   Hi @gianm, any update on this?
   
   The _SUCCESS file is actually zero bytes (0 B) in size and is created by a Spark ETL job running on EMR.
   
   ![image](https://user-images.githubusercontent.com/10221155/77337970-f5042e00-6d29-11ea-87a5-5a2f74094d62.png)
   



[GitHub] [druid] sascha-coenen edited a comment on issue #9411: S3 InputSource issue when using prefix mode if a directory contains _SUCCESS files

Posted by GitBox <gi...@apache.org>.
sascha-coenen edited a comment on issue #9411: S3 InputSource issue when using prefix mode if a directory contains  _SUCCESS files
URL: https://github.com/apache/druid/issues/9411#issuecomment-591695099
 
 
   > Are you saying that the data files did not get ingested?
   
   Yes. The ingestion fails as long as the _SUCCESS file is in the folder. After I deleted the _SUCCESS file and resubmitted the same ingestion spec, it worked. At least the first phase ran successfully.
   
   With Druid 0.16.0 we had some successful attempts at using index_parallel. With 0.17.0 the second phase fails immediately. We currently have no idea why, but as far as the scope of this report is concerned, it is sufficient that the first phase completes successfully. I can see in the logs that the files were picked up and processed, and that as many slots were used as there are files in the S3 location.
   
   When the _SUCCESS file is present, only one slot is used, so it never gets to the point where the other files would also be scheduled. Perhaps I'm wrong. I can double-check tomorrow to be sure. But now I know that there is no dedicated code to deal with shadowed files, so I know to remove them for now.
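
One possible workaround that avoids deleting the marker file is to bypass prefix expansion and list the data files explicitly, using the same `objects` form that appears in the rewritten spec in the issue description. The following is a minimal sketch, not Druid code: the helper name `build_input_source`, the bucket `example-bucket`, and the keys and sizes in `listing` are all hypothetical, and the listing is hard-coded where a real script would obtain it from S3. Skipping zero-byte objects is an extra assumption made here, not something the thread prescribes.

```python
import json


def build_input_source(bucket, keys_with_sizes, skip_names=("_SUCCESS",)):
    """Return an S3 inputSource dict that lists data objects explicitly.

    Objects whose file name is in skip_names, or whose size is zero
    (an assumption of this sketch), are left out.
    """
    objects = []
    for key, size in keys_with_sizes:
        name = key.rsplit("/", 1)[-1]  # last path segment
        if name in skip_names or size == 0:
            continue
        objects.append({"bucket": bucket, "path": key})
    return {"type": "s3", "objects": objects}


# Hypothetical listing of (key, size in bytes) pairs under one prefix.
listing = [
    ("data/auction/_SUCCESS", 0),
    ("data/auction/part-00000.json", 1024),
    ("data/auction/part-00001.json", 2048),
]

spec = build_input_source("example-bucket", listing)
print(json.dumps(spec, indent=2))
```

The printed fragment could then be pasted in place of the `prefixes`-based `inputSource` section of the ingestion spec.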



[GitHub] [druid] gianm commented on issue #9411: S3 InputSource issue when using prefix mode if a directory contains _SUCCESS files

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9411: S3 InputSource issue when using prefix mode if a directory contains  _SUCCESS files
URL: https://github.com/apache/druid/issues/9411#issuecomment-591658445
 
 
   Are you saying that the data files did not get ingested?
   
   At first glance I don't see code to do anything one way or the other with `_SUCCESS` files. They should be treated the same as any other file. There's some code here to filter out directory placeholders, but it doesn't seem related: https://github.com/apache/druid/blob/druid-0.17.0/extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/ObjectSummaryIterator.java#L154
   
   I could be missing something, but could you double-check that there are actually objects under this prefix and that they have nonzero size?



[GitHub] [druid] jihoonson commented on issue #9411: S3 InputSource issue when using prefix mode if a directory contains _SUCCESS files

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #9411: S3 InputSource issue when using prefix mode if a directory contains  _SUCCESS files
URL: https://github.com/apache/druid/issues/9411#issuecomment-591657701
 
 
   Hi @sascha-coenen, I think the second `inputSource` with the `_SUCCESS` file is created for a subtask. In 0.17, the parallel task lists all objects starting with the given prefixes and creates subtasks to process each object. Unfortunately, there is currently no way to filter out unwanted objects for the S3 input source.
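
The behaviour described here can be modelled in a few lines. This is a simplified illustration, not Druid's actual implementation: `list_under_prefix` and `split_by_object` are hypothetical names, and the key listing is hard-coded instead of coming from S3. It shows why a `_SUCCESS` marker under the prefix ends up with its own sub-task input source, matching the rewritten spec in the issue description.

```python
def list_under_prefix(all_keys, prefix):
    """Return every key that starts with the prefix; nothing is filtered out."""
    return [key for key in all_keys if key.startswith(prefix)]


def split_by_object(bucket, keys):
    """Build one sub-task input source per listed object."""
    return [
        {"type": "s3", "objects": [{"bucket": bucket, "path": key}]}
        for key in keys
    ]


# Hypothetical bucket contents.
all_keys = [
    "data/auction/_SUCCESS",
    "data/auction/part-00000.json",
    "data/auction/part-00001.json",
    "data/other/part-00000.json",
]

subtasks = split_by_object(
    "example-bucket", list_under_prefix(all_keys, "data/auction")
)
for subtask in subtasks:
    print(subtask["objects"][0]["path"])
```

In this model the marker file is handed to a sub-task like any data file, because the listing step has no exclusion rule.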





[GitHub] [druid] jihoonson edited a comment on issue #9411: S3 InputSource issue when using prefix mode if a directory contains _SUCCESS files

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on issue #9411: S3 InputSource issue when using prefix mode if a directory contains  _SUCCESS files
URL: https://github.com/apache/druid/issues/9411#issuecomment-602719403
 
 
   Hi @vikramsinghchandel, https://github.com/apache/druid/pull/9450 can work around this issue.



[GitHub] [druid] vikramsinghchandel edited a comment on issue #9411: S3 InputSource issue when using prefix mode if a directory contains _SUCCESS files

Posted by GitBox <gi...@apache.org>.
vikramsinghchandel edited a comment on issue #9411: S3 InputSource issue when using prefix mode if a directory contains  _SUCCESS files
URL: https://github.com/apache/druid/issues/9411#issuecomment-602703280
 
 
   Hi @gianm, any update on this?
   
   The _SUCCESS file is actually zero bytes (0 B) in size and is created by a Spark ETL job running on EMR. This is the error that pops up while it tries to sample the dataset:
   
   ![image](https://user-images.githubusercontent.com/10221155/77340108-1d415c00-6d2d-11ea-9a99-871f1dfe4bf1.png)
   
   






