Task #7583

open

Handle etcd leader change or temporary unavailability gracefully in uncloud

Added by Ahmed Bilal almost 2 years ago. Updated 11 months ago.

Status:
Waiting
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
01/07/2020
Due date:
% Done:

0%

Estimated time:
PM Check date:

Description

Here is the traceback from a leader change:

uncloud@server12:~$ uncloud scheduler

Traceback (most recent call last):
  File "/home/uncloud/.local/lib/python3.5/site-packages/etcd3/client.py", line 46, in handler
    return f(*args, **kwargs)
  File "/home/uncloud/.local/lib/python3.5/site-packages/etcd3/client.py", line 308, in get_prefix_response
    metadata=self.metadata
  File "/home/uncloud/.local/lib/python3.5/site-packages/grpc/_channel.py", line 824, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/uncloud/.local/lib/python3.5/site-packages/grpc/_channel.py", line 726, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "etcdserver: leader changed" 
        debug_error_string = "{"created":"@1578303042.125921981","description":"Error received from peer ipv6:[2a0a:e5c0:2:12:0:f0ff:fea9:c43a]:2379","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"etcdserver: leader changed","grpc_status":14}" 

And here is the traceback from a temporary unavailability:

Cannot connect to etcd: is etcd running as configured in uncloud.conf?
Exception in thread etcd3_watch_7fc862374c18:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/uncloud/.local/lib/python3.5/site-packages/etcd3/watch.py", line 126, in _run
    metadata=self._metadata)
  File "/home/uncloud/.local/lib/python3.5/site-packages/grpc/_channel.py", line 1084, in __call__
    event_handler, self._context)
  File "/home/uncloud/.local/lib/python3.5/site-packages/grpc/_channel.py", line 1171, in create
    operationses_and_tags, context)
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 478, in grpc._cython.cygrpc.Channel.integrated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 296, in grpc._cython.cygrpc._integrated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 222, in grpc._cython.cygrpc._call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 262, in grpc._cython.cygrpc._call
ValueError: Cannot invoke RPC: Channel closed!

Process Process-1:
Traceback (most recent call last):
  File "/home/uncloud/.local/lib/python3.5/site-packages/etcd3/client.py", line 46, in handler
    return f(*args, **kwargs)
  File "/home/uncloud/.local/lib/python3.5/site-packages/etcd3/client.py", line 426, in put
    metadata=self.metadata
  File "/home/uncloud/.local/lib/python3.5/site-packages/grpc/_channel.py", line 824, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/uncloud/.local/lib/python3.5/site-packages/grpc/_channel.py", line 726, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "etcdserver: request timed out" 
        debug_error_string = "{"created":"@1578304999.737401188","description":"Error received from peer ipv6:[2a0a:e5c0:2:12:0:f0ff:fea9:c43a]:2379","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"etcdserver: request timed out","grpc_status":14}"

Related issues

Related to Open Infrastructure - Task #7590: Expect everything to fail (uncloud), New, 01/09/2020

Actions #1

Updated by Ahmed Bilal almost 2 years ago

The latter unavailability is due to the leader election.

Actions #2

Updated by Ahmed Bilal almost 2 years ago

We have to re-check every use of etcd in uncloud to make sure we handle these events gracefully.

I have modified a few things in get_prefix/watch_prefix that should help handle these etcd events gracefully for requests. For example, `uncloud host` and `uncloud scheduler` used to crash in their blocked state (i.e. when waiting for requests) whenever there was a leadership change or etcd was temporarily unavailable.
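
For the remaining etcd call sites, one possible approach is a small retry helper with exponential backoff. This is only a sketch, not what uncloud currently does: `call_with_retry` and `TransientEtcdError` are hypothetical names, and the real code would need to pass whichever exception types the etcd client actually raises on "leader changed" / "request timed out" via the `transient` argument.

```python
import time


class TransientEtcdError(Exception):
    """Hypothetical stand-in for a transient etcd failure (e.g. leader change)."""


def call_with_retry(func, *args, retries=5, base_delay=0.5, max_delay=8.0,
                    transient=(TransientEtcdError,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with backoff on transient errors."""
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except transient:
            if attempt == retries:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay)
            # Double the wait each time, but never exceed max_delay.
            delay = min(delay * 2, max_delay)
```

A call site would then look something like `call_with_retry(client.put, key, value, transient=(SomeEtcdError,))`, where `client` and `SomeEtcdError` are placeholders. A leader election usually finishes within a few seconds, so a handful of retries with a short backoff should be enough to ride it out without crashing the process.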

Actions #3

Updated by Ahmed Bilal almost 2 years ago

  • Related to Task #7590: Expect everything to fail (uncloud) added
Actions #4

Updated by Ahmed Bilal over 1 year ago

  • Status changed from New to Waiting
Actions #5

Updated by Ahmed Bilal 11 months ago

  • Assignee deleted (Ahmed Bilal)