I'm in the process of learning the ins and outs of Airflow to end all our cron woes. When trying to mimic the failure of (`CeleryExecutor`) workers, I got stuck with `Sensor`s. I'm using `ExternalTaskSensor`s to wire up top-level DAGs, as described here.
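For context, the wiring looks roughly like this (a minimal sketch, not my actual code; the DAG ids, task ids, and schedule are placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor  # import path as of Airflow 1.9

default_args = {"owner": "airflow", "start_date": datetime(2018, 1, 1)}

# The downstream top-level DAG waits for a task in the upstream top-level DAG.
# Both DAGs share the same schedule so their execution dates line up.
with DAG("downstream_dag", default_args=default_args,
         schedule_interval=timedelta(hours=1)) as dag:

    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag",   # hypothetical upstream DAG id
        external_task_id="final_task",    # hypothetical task to wait for
        poke_interval=60,                 # seconds between pokes
        timeout=60 * 60,                  # fail the sensor after an hour
    )

    do_work = DummyOperator(task_id="do_work")

    wait_for_upstream >> do_work
```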
My current understanding is that since a `Sensor` is just a type of `Operator`, it must inherit basic traits from `BaseOperator`. If I kill a worker (the docker container), all ordinary (non-Sensor) tasks running on it get rescheduled on other workers.
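That assumption can be checked directly from the class hierarchy (import paths as in Airflow 1.9; this is just a sanity check on my understanding, not something from the docs):

```python
from airflow.models import BaseOperator
from airflow.operators.sensors import BaseSensorOperator, ExternalTaskSensor

# ExternalTaskSensor -> BaseSensorOperator -> BaseOperator, so a sensor task
# should carry the same retry/queueing attributes as any other operator.
print(ExternalTaskSensor.__mro__)
print(issubclass(ExternalTaskSensor, BaseOperator))  # expected: True
```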
However, upon killing a worker, an `ExternalTaskSensor` running on it does not get rescheduled on a different worker; rather, it gets stuck.

Then either of the following things happens:

- I just keep waiting for several minutes, and then sometimes the `ExternalTaskSensor` is marked as failed but the workflow resumes (this has happened a few times, but I don't have a screenshot).
- I stop all docker containers (including those running `scheduler` / `celery` etc.) and then restart them all; the stuck `ExternalTaskSensor` then gets rescheduled and the workflow resumes. Sometimes it takes several stop-start cycles of the docker containers to get the stuck `ExternalTaskSensor` to resume.
*(Screenshot: Sensor still stuck after a single docker container stop-start cycle)*

*(Screenshot: Sensor resumes after several docker container stop-start cycles)*
My questions are:

- Does `docker` have a role in this weird behaviour?
- Is there a difference between `Sensor`s (particularly `ExternalTaskSensor`) and other `Operator`s in terms of scheduling / retry behaviour?
- How can I ensure that a `Sensor` is also rescheduled when the `worker` it is running on gets killed?
I'm using puckel/docker-airflow with:

- Airflow 1.9.0-4
- Python 3.6-slim
- `CeleryExecutor` with `redis:3.2.7`
This is the link to my code.


