
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/04/19 05:26:45 UTC

[GitHub] [druid] mounikanakkala opened a new issue, #12458: Druid does not intermittently drop segments past retention time

mounikanakkala opened a new issue, #12458:
URL: https://github.com/apache/druid/issues/12458

Druid intermittently fails to drop segments that are past their retention period. This led to org.apache.druid.segment.SegmentMissingException on our systems.

### Affected Version

0.22.1

### Description

**What happened**
We have a datasource with the retention rules loadByPeriod(P24M+future) followed by dropForever. The datasource's segment granularity is Hour.

We encountered an erroneous case where a segment older than 24 months was not dropped properly.
- The Segments page shows the segment as available; after some refreshes it disappears, but after some more refreshes it reappears.
- We ran a query on the sys.server_segments table:
```
select *
from sys.server_segments
where segment_id = '<segment_id>'
```
It returned two Historicals holding that segment. Since our Druid cluster runs on Kubernetes, we deleted the two Historical pods; after that, the segment was no longer available in Druid and the issue was resolved.
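
A minimal Python sketch of the same check, for more than one segment at a time. It assumes the rows have already been fetched from sys.server_segments (for example through the Broker's SQL endpoint) as dicts with `server` and `segment_id` keys. The function name, the sample rows, and the list of segment IDs expected to be dropped are all hypothetical.

```python
from collections import defaultdict


def servers_still_serving(rows, expired_segment_ids):
    """Map each expired segment ID to the sorted list of servers that still report it."""
    expired = set(expired_segment_ids)
    holders = defaultdict(set)
    for row in rows:
        if row["segment_id"] in expired:
            holders[row["segment_id"]].add(row["server"])
    return {seg: sorted(servers) for seg, servers in holders.items()}


# Hypothetical rows shaped like: SELECT server, segment_id FROM sys.server_segments
rows = [
    {"server": "historical-0:8083", "segment_id": "seg_old"},
    {"server": "historical-1:8083", "segment_id": "seg_old"},
    {"server": "historical-0:8083", "segment_id": "seg_new"},
]

print(servers_still_serving(rows, ["seg_old"]))
```

Any server listed for an expired segment is a candidate for the stale state described above.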

**How often is this issue occurring**
It does not happen with all segments, but it affects 1-2 segments every few days.

**More details on Druid cluster setup**
- Druid processes (Coordinator, MiddleManagers, Historicals, Broker, Router) run on Kubernetes.
- Historicals use AWS EBS as persistent volumes. The data is stored on EBS, so when a Historical pod is removed, another pod is created within minutes and the EBS volume is attached to the new pod.
- Deleting the pods as described above resolved the issue. Since EBS is not affected by pod deletion, I suppose the Historical was holding some stale in-memory state that it should not have had.

**How did we come across this issue**
The time was 2022-04-19T03. The segment that was not dropped was 2020-04-19T00, even though it was more than 24 months old.
We ran a timeBoundary query:
```
{
"dataSource": "our_datasource",
"queryType": "timeBoundary",
"bound": "minTime"
}
```
We got the following exception:
```
org.apache.druid.server.QueryResource - Exception handling request: {class=org.apache.druid.server.QueryResource, exceptionType=class
org.apache.druid.segment.SegmentMissingException,
exceptionMessage=No results found for segments[[SegmentDescriptor{interval=2020-04-19T00:00:00.000Z/2020-04-19T01:00:00.000Z, version='2022-04-11T17:18:50.095Z', partitionNumber=0}]],
query={
"queryType": "timeBoundary",
"dataSource": {
"type": "table",
"name": "our_datasource"
},
"intervals": {
"type": "intervals",
"intervals": [
"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
]
},
"bound": "minTime",
"filter": null,
"descending": false,
"granularity": {
"type": "all"
}
}, peer=xx.xx.xx.xx}

(org.apache.druid.segment.SegmentMissingException: No results found for segments[[SegmentDescriptor{interval=2020-04-19T00:00:00.000Z/2020-04-19T01:00:00.000Z, version='2022-04-11T17:18:50.095Z', partitionNumber=0}]])
```

Because this segment is past its retention period, we started checking the Segments page and the sys.server_segments table as described above.
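
The retention arithmetic behind this can be sketched in Python as follows. This is a sketch under an assumption: that a loadByPeriod rule with future data included matches a segment when the segment's end time is later than (now minus the period). That is our understanding of the rule, not something confirmed in this thread. The helper names are hypothetical.

```python
import calendar
from datetime import datetime, timezone


def minus_months(ts, months):
    """Shift ts back by whole calendar months, clamping the day to the month's length."""
    total = ts.year * 12 + (ts.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(ts.day, calendar.monthrange(year, month)[1])
    return ts.replace(year=year, month=month, day=day)


def matches_load_by_period(seg_end, now, months=24):
    """Assumed rule semantics: the load rule applies if the segment ends after the cutoff."""
    return seg_end > minus_months(now, months)


now = datetime(2022, 4, 19, 3, tzinfo=timezone.utc)
cutoff = minus_months(now, 24)

# Segment interval from the exception: 2020-04-19T00/2020-04-19T01
old_segment_end = datetime(2020, 4, 19, 1, tzinfo=timezone.utc)
# A later hour that is still inside the 24-month window
recent_segment_end = datetime(2020, 4, 19, 4, tzinfo=timezone.utc)

print(cutoff.isoformat())
print(matches_load_by_period(old_segment_end, now))
print(matches_load_by_period(recent_segment_end, now))
```

Under that assumption, the segment from the exception no longer matches the load rule at 2022-04-19T03, so it should have fallen through to dropForever.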

Kindly help us resolve this issue. Please let us know if you need further details.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] tanisdlj commented on issue #12458: Druid does not intermittently drop segments past retention time

Posted by GitBox <gi...@apache.org>.

tanisdlj commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1134857279

   Happening to us too



[GitHub] [druid] tanisdlj commented on issue #12458: Druid does not intermittently drop segments past retention time

Posted by GitBox <gi...@apache.org>.

tanisdlj commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1143402880

   @mounikanakkala 0.22.1, running on hosts, not containers



[GitHub] [druid] mounikanakkala commented on issue #12458: Druid does not intermittently drop segments past retention time

Posted by GitBox <gi...@apache.org>.

mounikanakkala commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1185041416

   Hi Team,
   
   May I know if there is any update on this one?



[GitHub] [druid] mounikanakkala commented on issue #12458: Druid does not intermittently drop segments past retention time

Posted by GitBox <gi...@apache.org>.

mounikanakkala commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1134895717

   > Happening to us too
   
   @tanisdlj 
   Thank you for sharing. Can you please share the Druid version that you are running? Just want to confirm if this started happening in the new version. Also, may I know if you are running your Druid cluster on Kubernetes?



Re: [I] Druid does not intermittently drop segments past retention time (druid)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1854890120

   This issue has been marked as stale due to 280 days of inactivity.
   It will be closed in 4 weeks if no further activity occurs. If this issue is still
   relevant, please simply write any comment. Even if closed, you can still revive the
   issue at any time or discuss it on the dev@druid.apache.org list.
   Thank you for your contributions.



Re: [I] Druid does not intermittently drop segments past retention time (druid)

Posted by "winsmith (via GitHub)" <gi...@apache.org>.

winsmith commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1879171584

   This is happening on our cluster as well. We run on Kubernetes; deleting and recreating one of our four historicals fixes it temporarily, but the problem always seems to return until I completely drop the relevant segments and re-import the data, which is annoying and takes a while. Any advice on how to at least fix this, if not prevent it?



[GitHub] [druid] mounikanakkala commented on issue #12458: Druid does not intermittently drop segments past retention time

Posted by GitBox <gi...@apache.org>.

mounikanakkala commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1108546631

   Facing the same issue again.
   
   **Observations**
   On the Druid console Segments page, every _**refresh within seconds shows a different list of segments**_.
   <img width="1549" alt="Screen Shot 2022-04-25 at 5 58 17 AM" src="https://user-images.githubusercontent.com/15020965/165093940-55010b0d-8262-48ca-a230-c42c4e9e8ffc.png">
   
   
   <img width="1549" alt="Screen Shot 2022-04-25 at 5 59 16 AM" src="https://user-images.githubusercontent.com/15020965/165094226-9ee4c3c7-5fa9-4bcb-ad11-8d90741497bc.png">
   
   <img width="1549" alt="Screen Shot 2022-04-25 at 5 59 47 AM" src="https://user-images.githubusercontent.com/15020965/165094240-5a06e4cb-9d13-4d64-9c34-e44b1b7a784f.png">
   
   My understanding is that the Segments page shows the results of sys.segments. Can you please clarify which process creates or refreshes the sys.segments information, and how often?
   
   Result of the query below:
   ```
   select * from sys.segments
   where segment_id like 'our_datasource_2022-04-09T05:00:00.000Z_2022-04-09T06:00:00.000Z%'
   order by partition_num
   ```
   
   <img width="886" alt="Screen Shot 2022-04-25 at 6 03 56 AM" src="https://user-images.githubusercontent.com/15020965/165094765-87411bd6-37d5-428a-835b-f84c5300e982.png">
   
   **But http://<coordinator IP address>:8081/unified-console.html#segments UI page does not show any segment for 2022-04-09.**
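
To pin down which segments come and go between refreshes, the segment ID lists from several consecutive refreshes can be compared. A minimal Python sketch, assuming each snapshot is simply a list of segment IDs captured from one refresh; the function name and the sample IDs are hypothetical:

```python
def flapping_segments(snapshots):
    """Return segment IDs that appear in at least one snapshot but not in all of them."""
    sets = [set(snapshot) for snapshot in snapshots]
    if not sets:
        return set()
    seen_anywhere = set.union(*sets)
    seen_everywhere = set.intersection(*sets)
    return seen_anywhere - seen_everywhere


# Hypothetical segment ID lists from three consecutive refreshes
snapshots = [
    ["seg_a", "seg_b", "seg_c"],
    ["seg_a", "seg_c"],
    ["seg_a", "seg_b", "seg_c"],
]

print(sorted(flapping_segments(snapshots)))
```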



[GitHub] [druid] OurNewestMember commented on issue #12458: Druid does not intermittently drop segments past retention time

Posted by GitBox <gi...@apache.org>.

OurNewestMember commented on issue #12458:
URL: https://github.com/apache/druid/issues/12458#issuecomment-1320806594

   Coordinator is worth focusing on.  Why?
   - segments may not be dropped: (coordinator duty to mark used; ...although historical to execute it...although coordinator can affect health of historical based on, eg, load/drop workload including segment balancing...and back and forth and...)
   - inconsistent query results on broker (obviously impacted by broker performance itself, but its metadata can be intensive and has reliance on coordinator)
   - overall historical load: can be heavily dependent on coordinator (eg, loading/dropping segments, even coordinator -> poor ingestion -> suboptimal segments -> more query workload, etc, etc)...could prevent proper segment unloading
       - same as "segments may not be dropped" above...but this is from "point of view" of historical
   - it touches the metadata datastore which can be an effective way for something like ingest (eg, heavy ingests, compaction, etc) to stall the coordinator (eg, you could heavily fragment an RDBMS with heavy ingest)
   
   So "problem on historical" also appears to be a very good candidate here.  However, the "inconsistencies" (across time, and in space: on different historicals and pods, with different segments, different queries, and different browser refreshes -> possibly calls to different brokers, etc) call for an explanation based on something the failures have in common rather than a "persistent set of coincidences" (of course, "persistent coincidences" are not impossible).  So I'd look at the coordinator as a relevant commonality.  (And of course the coordinator can be affected by other cluster activity, like heavy ingest destabilizing the overlord running on the same hardware, or the metadata store, which is shared with the coordinator...all things are connected.)
   
   The point of mentioning all of this is that an upgrade may not fix a problem like this one.  (Actually it could make it worse -- that sometimes happens, potentially around 2021-12 [a side effect of some new feature causing much higher resource requirements for recently enhanced in-memory column info, IIRC] and maybe also around 2022-10 [massive increase in heap requirements for streaming and batch indexing]...upgrade problems are pretty understandable with a large, complex system.)  I'm not saying you shouldn't upgrade -- just that, regardless of the version, the system could be running too close to some limits for your needs.  If so, the info above is about finding where the gap between desired and actual performance lives.
   
   Some questions worth mulling over...
   
   How many segments are in the cluster? (Best to break this down by used/unused, because that affects coordinator workload, plus the possibly very relevant workload of the overlord if it shares resources for computation/network/state/etc.)
   
   How smooth is ingest workload (demand) and performance (actual)?  (Also consider compaction, even kill tasks, etc)
   
   Any general observations related to stability and performance? (eg, dying processes, failed ingest tasks, slow publish times, indexing error messages about retries/errors in HTTP calls, ongoing logs/alerts on throttling segment balancing, etc)
   

