braindump ... cause thread-dumps are not enough ;)

Apache Airflow tricks

I will update this post from time to time with more learnings.

Problem: I fixed a problem in my pipeline, but Airflow doesn’t see the fix.

Possible things you can do:

  • check if you actually did fix it :)
  • try to refresh the DAG through the UI
  • remove *.pyc files from the dags directory
  • check that you have installed the dependencies in the right Python environment (e.g. in the right version of system-wide Python or in the right virtualenv)
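
The *.pyc cleanup from the list above can be scripted. This is only a minimal sketch: the function name remove_pyc_files is made up for illustration, and it assumes you pass the path to your own dags directory (it does not read it from the Airflow configuration).

```python
import pathlib
import sys


def remove_pyc_files(dags_dir):
    """Delete compiled *.pyc files under dags_dir (recursively) and return their paths."""
    # Collect the matches first so that nothing is deleted while the tree is being walked.
    pyc_files = sorted(pathlib.Path(dags_dir).rglob("*.pyc"))
    for pyc in pyc_files:
        pyc.unlink()
    return pyc_files


if __name__ == "__main__":
    # Hypothetical usage: python remove_pyc.py /path/to/your/dags
    for path in remove_pyc_files(sys.argv[1]):
        print("removed", path)
```

The search is recursive, so files inside __pycache__ subdirectories are removed as well; the .py sources are left alone.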

Problem: This DAG isn’t available in the web server’s DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database.

  • Refresh the DAG code from the UI
  • Restart the webserver: this did the trick in my case. Some people report that there might be a stalled gunicorn process. If a restart doesn’t help, try to find rogue processes and kill them manually (source, source 2)
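
To hunt for leftover gunicorn processes, something like the sketch below may help. It assumes a POSIX-style ps that accepts -eo pid,args; the helper name gunicorn_pids is made up for illustration. It only lists PIDs, so you can inspect them before killing anything.

```python
import subprocess


def gunicorn_pids(ps_output):
    """Return the PIDs from `ps -eo pid,args` output whose command line mentions gunicorn."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header row and blank lines; keep rows whose command mentions gunicorn.
        if len(parts) == 2 and parts[0].isdigit() and "gunicorn" in parts[1]:
            pids.append(int(parts[0]))
    return pids


if __name__ == "__main__":
    # Assumption: `ps -eo pid,args` is available on this machine.
    output = subprocess.check_output(["ps", "-eo", "pid,args"]).decode()
    print(gunicorn_pids(output))
```

Any PID that is still listed after the webserver has been stopped is a candidate for a manual kill.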

Problem: I want to delete a DAG

  • Airflow 1.10 has a command for this: airflow delete_dag ...
  • Prior to 1.10, you can use the following script:
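import sys
from airflow.hooks.postgres_hook import PostgresHook

# id of the DAG to delete, passed as the first command-line argument
dag_input = sys.argv[1]
# "airflow_db" has to be a connection that points at the Airflow metadata database
hook = PostgresHook(postgres_conn_id="airflow_db")

# delete the DAG's rows from every metadata table that references it
for t in ["xcom", "task_instance", "sla_miss", "log", "job", "dag_run", "dag"]:
    sql = "delete from {} where dag_id='{}'".format(t, dag_input)
    hook.run(sql, autocommit=True)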

(tested: works like a charm :))