samwellwang


Installation experience of Airflow

Introduction to Airflow#

Apache Airflow™ is an open-source platform for developing, scheduling, and monitoring batch workflows. Airflow's scalable Python framework allows you to build workflows that connect with almost any technology. The web interface helps manage the state of workflows. Airflow can be deployed in various ways, from a single process on a laptop to a distributed setup that supports large workflows.

Implementation Goals#

For our in-house email ETL project, we split out the workflow-scheduling work and handed it over to Airflow. The deployment approach is to install Airflow inside the business project's own virtual environment and treat it as part of the project, which is relatively simple. Ideally the two should be decoupled; but given the simplicity of this phase, the small number of workflows, and the fact that Airflow only serves this one project, decoupling will be considered if other projects need Airflow in the future.

Installation Steps#

export AIRFLOW_HOME=/project_directory/airflow

Set the environment variable AIRFLOW_HOME to an airflow folder inside the current project directory. This is where the Airflow configuration file, log folder, DAG folder, plugin folder, and SQLite database file will live. Otherwise, Airflow will create the folder at the default location, ~/airflow in your home directory.
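This fallback behavior can be sketched in Python (a simplified sketch, not Airflow's actual code; `resolve_airflow_home` is an illustrative helper):

```python
import os

def resolve_airflow_home(env: dict) -> str:
    """Mimic AIRFLOW_HOME resolution: use the env var if set,
    otherwise fall back to ~/airflow (simplified sketch)."""
    return env.get("AIRFLOW_HOME",
                   os.path.join(os.path.expanduser("~"), "airflow"))

print(resolve_airflow_home({"AIRFLOW_HOME": "/project_directory/airflow"}))
print(resolve_airflow_home({}))  # falls back to <home>/airflow
```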

AIRFLOW_VERSION=2.7.3

# Extract the major.minor version of the Python you have installed. If you're currently
# using a Python version that is not supported by Airflow, set this manually.
# See the Airflow documentation for the supported versions.
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"

CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example this would install 2.7.3 with python 3.8: https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt

pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

After activating the virtual environment (source ./env/bin/activate), run the script above to install Airflow 2.7.3. The virtual environment must be based on Python 3.8 or newer. On a first install there may be all sorts of dependency conflicts, especially with the business project's own dependencies; work through them one by one. If downloads are slow, you can switch to a domestic mirror: -i https://pypi.mirrors.ustc.edu.cn/simple/
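The version logic in the shell pipeline above can also be expressed in Python, which makes it easier to see what the constraint URL will look like (a sketch; variable names are illustrative):

```python
import sys

AIRFLOW_VERSION = "2.7.3"

# Same idea as `python --version | cut ...`: take "major.minor" of the running Python.
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"

constraint_url = (
    "https://raw.githubusercontent.com/apache/airflow/"
    f"constraints-{AIRFLOW_VERSION}/constraints-{python_version}.txt"
)
print(constraint_url)
```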

airflow standalone

Start the standalone service. If this step succeeds, you are basically halfway there, though it rarely works on the first try. The first problem I hit was that the SQLite version was too low, which is mainly a CentOS issue: the bundled SQLite3 is 3.7, while Airflow requires at least 3.15. Ubuntu should not have this problem. The fix is not complicated: install a separate copy of SQLite 3.15 or newer, then repoint the sqlite3 symlink at it. Why not upgrade the original in place? Because system components depend on it, and I did not want to risk breaking the OS. Here is a tricky point: the official docs do not recommend SQLite for a production environment, but to change the database backend you can only edit the configuration file ./airflow/airflow.cfg, and that file (indeed the whole folder) is only created after starting the Airflow service once! Originally I did not want to upgrade SQLite at all; I wanted to switch the database connection first and then start. I assumed the environment variable had not taken effect, since the directory did not exist either, until I found this behavior called out as a warning in the official documentation. So SQLite is unavoidable at first, even if you do not intend to use it as the database backend later.

sqlite3 --version                      # check the current version (3.7.x on stock CentOS 7)
mkdir sqlite3_upgrade
cd sqlite3_upgrade
wget --no-check-certificate https://www.sqlite.org/2023/sqlite-autoconf-3440200.tar.gz
tar -xzvf sqlite-autoconf-3440200.tar.gz
cd sqlite-autoconf-3440200/
./configure
make
make install                           # installs to /usr/local (run as root or with sudo)
/usr/local/bin/sqlite3 --version       # verify the freshly built binary
cp /usr/bin/sqlite3 /usr/bin/sqlite3_old   # keep the old binary as a backup
rm /usr/bin/sqlite3
ln -s /usr/local/bin/sqlite3 /usr/bin/sqlite3
sqlite3 --version                      # should now report 3.44.2

The above steps are for reference when upgrading SQLite on CentOS 7.
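One caveat: Airflow checks the SQLite version through Python's sqlite3 module, which reports the shared library it is linked against, not the /usr/bin/sqlite3 binary, so it is worth verifying from inside the virtual environment. A small check against Airflow's documented 3.15 minimum (`parse_version` is an illustrative helper):

```python
import sqlite3

MIN_VERSION = (3, 15, 0)  # Airflow's documented minimum

def parse_version(v: str) -> tuple:
    """Turn '3.44.2' into (3, 44, 2) for tuple comparison."""
    return tuple(int(part) for part in v.split("."))

# Version of the SQLite library Python is actually linked against.
linked = parse_version(sqlite3.sqlite_version)
print(sqlite3.sqlite_version,
      "OK" if linked >= MIN_VERSION else "too old for Airflow")
```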

Once the output shows 3.44, Airflow can be started again.

airflow standalone

If all goes well it now starts, and you can check the airflow folder in the project directory for the log folder, airflow.cfg, and airflow.db.
Open http://<machine-ip>:8080 in a browser to reach the web UI. The default username is admin; the generated password is printed in the startup logs and also written to standalone_admin_password.txt in the airflow folder.
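Retrieving the generated password can also be scripted; a sketch assuming the standalone_admin_password.txt location described above (`read_admin_password` is a hypothetical helper):

```python
from pathlib import Path

def read_admin_password(airflow_home: str) -> str:
    """Read the auto-generated admin password written by `airflow standalone`."""
    path = Path(airflow_home) / "standalone_admin_password.txt"
    return path.read_text().strip()
```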

Next is to replace SQLite with MySQL (PostgreSQL would be easier, since Airflow ships with support for it out of the box), changing the parameter in the configuration file to:

sql_alchemy_conn = mysql+pymysql://username:password@host:port/db_name
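One gotcha with this URL format: if the password contains special characters such as @, : or /, it must be percent-encoded, or SQLAlchemy will mis-parse the URL. A sketch (`mysql_conn_string` is an illustrative helper):

```python
from urllib.parse import quote_plus

def mysql_conn_string(user, password, host, port, db):
    """Build a sql_alchemy_conn value for PyMySQL, percent-encoding the password."""
    return f"mysql+pymysql://{user}:{quote_plus(password)}@{host}:{port}/{db}"

# '@' and '/' in the password are escaped so the URL stays parseable.
print(mysql_conn_string("airflow", "p@ss/word", "127.0.0.1", 3306, "airflow"))
```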

Note the recommendation in the official documentation:

mysql+mysqldb://<user>:<password>@<host>[:<port>]/<dbname>

On CentOS there may be problems: the mysqlclient driver may fail to install, or its version may not match. I did eventually get it installed, but it never took effect, so I switched to the pure-Python PyMySQL driver instead, even though mysqlclient is the driver verified by Airflow's official CI. Ubuntu and other systems should be fine.
PS: The officially recommended driver also works if you first install a mariadb-devel package matching the system version (CentOS 7).

After saving the changes, execute:

airflow db migrate

Airflow will then create and migrate the required tables in the MySQL database you specified. Unexpectedly, another problem appeared while the tables were being created:

airflow Invalid default value for 'updated_at'

This comes from MySQL's handling of default datetime values: in strict mode it rejects the zero-date defaults in Airflow's table definitions. A session-level workaround is to relax the SQL mode:

SET SQL_MODE='ALLOW_INVALID_DATES';

But this is only a temporary fix. The permanent solution is to edit the MySQL configuration file (my.cnf), remove NO_ZERO_DATE from the sql-mode option, and restart. However, our MySQL runs in Docker and vim is not available inside the container (one workaround would be to copy the file out and back with docker cp, or edit it with sed)... This problem is still pending.
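Until the config file can be edited, the relaxed mode string can at least be computed and applied with SET GLOBAL (which still does not survive a server restart). A sketch of the string surgery, assuming the goal is just to drop NO_ZERO_DATE:

```python
def drop_mode(sql_mode: str, flag: str = "NO_ZERO_DATE") -> str:
    """Remove one flag from a comma-separated MySQL sql_mode string."""
    return ",".join(m for m in sql_mode.split(",") if m.strip() != flag)

current = "ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE"
print(drop_mode(current))
# Apply the result with: SET GLOBAL sql_mode = '<result>';
```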

Finally:

airflow db migrate
airflow standalone

You can see a lot of new tables in the MySQL database, and the system can be accessed normally through the web. Success!

Unsolved Problems#

  • Permanently persisting the MySQL configuration change
  • Whether problems will arise later, given that the modification is not permanent
  • The production deployment mode will necessarily differ from the current setup