Blog | 老司机的文档集

CentOS7系统安装ansible awx记录

2017年12月20日 · 阅读需 2 分钟

记录awx的安装测试过程以及需要注意的点

关于awx

awx(https://github.com/ansible/awx)是ansible tower的开源版本，作为tower的upstream对外开源，项目从2013年开始维护，2017年由redhat对外开源，目前维护得比较活跃。由于官方的install guide写得有点杂不是很直观，导致想要安装个简单的测试环境体验一下功能都要折腾半天，这里提供一个简单版本的安装流程方便快速体验。

安装过程

安装软件包

yum -y install epel-release
systemctl disable firewalld
systemctl stop firewalld
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
setenforce 0
yum -y install git gettext ansible docker nodejs npm gcc-c++ bzip2 python-docker-py

启动服务

systemctl start docker
systemctl enable docker

clone awx代码

git clone https://github.com/ansible/awx.git
cd awx/installer/
# 注意修改一下postgres_data_dir到其他目录比如/data/pgdocker
vi inventory
ansible-playbook -i inventory install.yml

检查日志

docker logs -f awx_task

以上是安装过程，因为本地环境访问外网要经过代理，这里记录一下配置docker通过代理访问外网的方式，否则pull image会有问题。

mkdir /etc/systemd/system/docker.service.d/
cat > /etc/systemd/system/docker.service.d/http-proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=proxy.test.dev:8080" "HTTPS_PROXY=proxy.test.dev:8080" "NO_PROXY=localhost,127.0.0.1,172.1.0.2"
EOF

systemctl daemon-reload
systemctl restart docker
systemctl show --property=Environment docker

参考文档：

[1] http://khmel.org/?p=1245

[2] https://docs.docker.com/engine/admin/systemd/#httphttps-proxy

很诡异的服务日志无法切割问题分析

2017年10月19日 · 阅读需 4 分钟

遇到某服务进程产生的日志始终无法切割，原来原因在这里

问题现象

某业务机器因磁盘容量超过阈值收到告警，分析发现是由于该机器上某个服务进程产生的日志文件量太大，且日志文件未按周期切割，进而导致历史日志信息积累到单个日志文件中。未避免故障发生采取临时措施手工切割该日志文件，由于该服务进程并未提供内置的日志切割，因此手工模拟类似logrotate的copytruncate模式对日志进行切割。但在将日志truncate之后，奇怪的一幕发生了，通过ls查看文件大小发现并未减少，直觉判断这可能是文件句柄一直处于打开状态且偏移量未发生改变导致。在进一步检查了该进程的启动方式之后，发现该进程通过nohup启动，并将标准输出重定向到持续增大的日志文件中。

模拟

我们通过下面几行脚本来模拟此现象：

#!/bin/bash
while true; do
    sleep 1
    head -5000 /dev/urandom
done

脚本启动后会有一个常驻进程每个1秒钟输出一堆字符串以此来模拟日志文件增涨，我们按照以下方式启动：

nohup ./daemon.sh >out.log 2>&1 < /dev/null &

等待一会之后我们观察到日志已经写入了

[root@localhost t]# ll -h out.log ;du -h out.log 
-rw-r--r-- 1 root root 64M Oct 19 17:41 out.log
64M   out.log

接着将日志文件清空，再观察文件大小变化

[root@localhost t]# 
[root@localhost t]# truncate -s0 out.log              
[root@localhost t]# ll -h out.log ;du -h out.log 
-rw-r--r-- 1 root root 93M Oct 19 17:41 out.log
4.0M  out.log

这时可以看到，虽然文件被清空了，但是ls看到的大小依然没有发生变化，也就是说文件中产生了大量空洞。

解决方法

将nohup启动进程后的输出重定向 > 替换为 >>，即改为append模式来写入日志，这时再truncate就不会出现上面的问题了。

 nohup ./daemon.sh >>out.log 2>&1  </dev/null &
 
 
[root@localhost t]# ll -h out.log ;du -h out.log 
-rw-r--r-- 1 root root 48M Oct 19 19:43 out.log
64M   out.log
[root@localhost t]# ll -h out.log ;du -h out.log 
-rw-r--r-- 1 root root 77M Oct 19 19:43 out.log
128M  out.log
[root@localhost t]# truncate -s0 out.log              
[root@localhost t]# ll -h out.log ;du -h out.log 
-rw-r--r-- 1 root root 1.3M Oct 19 19:43 out.log
2.0M  out.log

这里留一个问题：为什么使用append模式就不会出现这个问题？

参考文档：

[1] https://www.gnu.org/software/bash/manual/bash.html#Redirections

[2] https://www.gnu.org/software/coreutils/manual/html_node/nohup-invocation.html

HADOOP3.0.0纠删码测试

2017年8月20日 · 阅读需 6 分钟

记录hadoop3.0.0版本测试安装过程，对hadoop3.0.0中纠删码进行了简单测试。

基础环境

HADOOP3.0.0版本中增加了纠删码技术，在提高可用性的同时还能减低存储成本，目前处于实验阶段，以下将测试环境中的搭建步骤及简单测试过程进行记录。本次仅对hdfs进行测试，因此不会部署其他服务。

系统环境如下：

kvm虚拟机，1个namenode节点，6个datanode节点，4core ，8G mem ， 50G disk

hadoop版本：3.0.0-alpha4

java版本： 1.8.0_144

测试集群安装

基础包安装

从apache镜像下载hadoop-3.0.0-alpha4，当前最新版本，hadoop home目录/opt/hadoop，在namenode节点将配置等修改好之后拷贝到所有datanode节点。

tar xf hadoop-3.0.0-alpha4.tar.gz
mv hadoop-3.0.0-alpha4 /opt/hadoop
yum -y install jdk --disablerepo=* --enablerepo=local-custom

配置文件修改

需要修改的配置文件有hadoop-env.sh 、core-site.xml、hdfs-site.xml

修改/opt/hadoop/etc/hadoop/hadoop-env.sh中以下参数：

export JAVA_HOME=/usr/java/jdk1.8.0_144
export HADOOP_HOME=/opt/hadoop
# 注意这里配置heapsize和2.7版本的差别，2.7为HADOOP_HEAPSIZE
export HADOOP_HEAPSIZE_MAX=1024
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true”
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS”
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_PID_DIR=/tmp

修改/opt/hadoop/etc/hadoop/hdfs-site.xml内容如下：

<configuration>
   <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data/hdfs/data</value>
      <final>true</final>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/data/hdfs/namenode</value>
      <final>true</final>
    </property>
   <property>
      <name>dfs.namenode.rpc-address</name>
      <value>192.168.199.26:8020</value>
    </property>
</configuration>

将环境变量增加到当前用户bashrc：

# ~/.bashrc
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:/opt/hadoop/bin

配置修改完成后将/opt/hadoop整个目录同步到所有datanode节点。

启动HDFS

格式化namenode

hdfs namenode -format ns1

启动namenode和datanode

# 注意3.0.0版本的差别，在2.7中启动脚本如下
hadoop-daemon.sh --config $HADOOP_HOME/etc/hadoop --script hdfs start namenode
hadoop-daemon.sh --config $HADOOP_HOME/etc/hadoop --script hdfs start datanode
# 新版本中已经重写了管理脚本，统一到hdfs命令中，启动方式如下：
hdfs --daemon start namenode
hdfs --daemon start datanode

启动成功之后即可通过namenode web ui观察到集群基本情况，2.7中默认web ui端口为50070，而3.0.0中修改为9870，

测试hdfs

启动完成之后可以通过hdfs命令测试服务是否可用：

hdfs dfs -mkdir hdfs://192.168.199.26:8020/t1/
dd if=/dev/urandom of=f1 bs=1M count=5000
hdfs dfs -put f1 hdfs://192.168.199.26:8020/t1/
hdfs dfs -rm -skipTrash hdfs://192.168.199.26:8020/t1/f1

这里使用的时候需要写完整的hdfs协议和namenode:port，我们可以修改一下配置文件，将defaultfs修改为hdfs协议，方便测试。同时此版本中默认并未启用纠删码，需要手工配置。默认内置支持的policy有 RS-3-2-64k, RS-6-3-64k, RS-10-4-64k, RS-LEGACY-6-3-64k, XOR-2-1-64k，我这里只准备了少量节点，因此只对其中两种进行简单测试。

修改hdfs-site.xml 增加以下配置：

    <property>
      <name>dfs.namenode.ec.policies.enabled</name>
      <value>XOR-2-1-64k,RS-3-2-64k</value>
    </property>
    <property>
      <name>dfs.nameservices</name>
      <value>ns1</value>
    </property>
    <property>
        <name>dfs.ha.namenodes.ns1</name>
        <value>nn1</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns1.nn1</name>
      <value>192.168.199.26:8020</value>
    </property>
    <property>
        <name>dfs.client.failover.proxy.provider.ns1</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>

修改core-site.xml增加以下配置：

<configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://ns1</value>
      <final>true</final>
    </property>
</configuration>

重启namenode节点后，就可以使用纠删码了，纠删码可以针对目录进行设置，不同的目录设置不同的策略。

hdfs ec -setPolicy -policy XOR-2-1-64k -path /t1
hdfs ec -setPolicy -policy RS-3-2-64k -path /t2

我们准备了3个目录，t1和t2分别设置了不同的policy，t3不设置，往3个目录上传相同的5G大小的文件，每次上传前均清空hdfs中数据，并清空虚拟机和物理机缓存，统计的耗时和空间占用情况大致如下：

HDFS目录	纠删码策略	put耗时	磁盘占用kb
t1	XOR-2-1-64k	1m15.559s	7740364
t2	RS-3-2-64k	1m13.920s	8600436
t3	无(三副本)	2m7.705s	15480600

通过简单的测试对比，我们可以大致了解到在理想状态下纠删码技术比传统的三副本在写入速度上有提升，因其降低了对磁盘IO和带宽的消耗，同时占用的磁盘空间小于三副本方式。其中磁盘占用和不同的纠删码策略理论值基本吻合，磁盘空间消耗倍数为 (校验块+数据块)/数据块。

注意这里的测试仅限于对HADOOP3.0.0中纠删码有一个感性认识，测试方法有很多不严谨的地方，比如put数据的耗时并未考虑各种环境因素，仅仅是在一个相对理想的环境下进行简单测试。实际生产环境中情况非常复杂，需要权衡CPU带宽磁盘，甚至机架供电等各方面因素，如果需要获得一份可靠的性能对比数据则必须保障稳定运行足够长的时间，通过长期观察才能得出对生产有实际指导意义的信息。

参考文档：

[1] http://hadoop.apache.org/docs/r3.0.0-alpha4/hadoop-project-dist/hadoop-common/ClusterSetup.html

[2] http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html

k8s测试环境部署记录

2017年8月18日 · 阅读需 5 分钟

本文记录了在本地部署k8s测试环境的过程，部署脚本参考github上其他同学分享的脚本，在其基础上做了些小改动。

k8s 中的节点类型

master 负责管理其他节点的调度中心，master可以有备机replica做冗余

minion 由master管理，运行容器服务，1个集群中有N个minion节点

部署前准备

部署此测试环境参考了github上其他同学的分享，我fork的repo地址：https://github.com/5xops/k8s-deploy

首先按照github repo中的readme部分将k8s rpm包下载到本地，以准备离线部署。其次我在本地测试环境准备了6以上的虚拟机节点，其中3个节点用于部署etcd集群，2个节点用于部署k8s master，其余节点用于部署k8s minion。所有虚拟机均运行在一台物理服务器上，管理虚拟机用了这个脚本（https://github.com/itxx00/vmm）。

首先创建好需要使用到的虚拟机节点：

vmm create etcd1
vmm create etcd2
vmm create etcd3
vmm create kubem1
vmm create kubem2
vmm create node1
vmm create node2

因部署k8s集群时要求所有节点都配置好主机名，因为默认创建出来的虚拟机没有修改hostname，需要使用另外一个脚本来配置hostname并配置好/etc/hosts，首先准备好初始化执行的脚本：

#!/bin/bash
# content of ~/.vmm/init.sh
cd /tmp
hostnm=$(cat hostname)
[[ -n $hostnm ]] || {
    echo "err"
    exit 1
}
echo $hostnm >/etc/hostname
hostname $hostnm
cat /tmp/hosts.tmp >/etc/hosts

接着执行初始化操作：

vminit etcd1
vminit etcd2
vminit etcd3
vminit kubem1
vminit kubem2
vminit node1
vminit node2

每个节点的hostname将被设置成虚拟机的名字，同时vminit脚本会把本地的ssh公钥和私钥都拷贝到虚拟机节点，这样初始化之后的节点可以通过相同的密钥互相免密登录，注意这样的操作仅适用于快速搭建测试环境，生产环境千万不要这么处理。打通免密ssh是因为后面部署k8s时会用到。

搭建etcd集群

在部署k8s集群之前，我们需要先部署一个独立的etcd集群，k8s会用到这个集群。这里我采用ansible playbook来部署，playbook已经分享到github上面，（https://github.com/itxx00/ansible-etcd），按照repo中的readme来准备好集群。

准备镜像仓库

将下载下来的k8s rpm包和docker镜像放到/data/k8s-deploy目录，其中rpms目录存放了需要用到的rpm包，images目录存放需要用到的docker镜像，因原脚本中不包含k8s dashboard（一套web ui管理界面），为了能够部署dashboard，对原脚本作了修改增加了部署dashboard的选项，部署dashboard需要一些额外的docker镜像和配置文件，这里先补充好docker镜像：

yum -y install docker
systemctl start docker
docker pull googlecontainer/kubernetes-dashboard-amd64:v1.6.1
docker pull googlecontainer/heapster-influxdb-amd64:v1.1.1
docker pull googlecontainer/heapster-grafana-amd64:v4.0.2
docker pull googlecontainer/heapster-amd64:v1.3.0
cd /data/k8s-deploly/images
docker save googlecontainer/kubernetes-dashboard-amd64 -o kubernetes-dashboard-amd64_v1.6.1.tar
docker save googlecontainer/heapster-influxdb-amd64 -o heapster-influxdb-amd64_v1.1.1.tar
docker save googlecontainer/heapster-grafana-amd64 -o heapster-grafana-amd64_v4.0.2.tar
docker save googlecontainer/heapster-amd64 -o heapster-amd64_v1.3.0.tar

原脚本中安装rpm包是在各节点下载好后本地安装的，我改进了一下采用yum方式安装，准备yum仓库：

createrepo /data/k8s-deploy/rpms
yum -y install nginx
cat >/etc/nginx/conf.d/k8srepo.conf <<EOF
server {
    listen       8000;
    server_name  _;
    root         /data/k8s-deploy;
    autoindex on;

    location / {
        autoindex on;
    }
}
EOF
systemctl restart nginx

至此yum repo和docker镜像准备完成。

部署k8s集群

部署过程参考https://github.com/5xops/k8s-deploy/blob/master/README.md ，需要修改repo中的k8slocal.repo文件中ip地址为yum仓库对应的ip地址，部署master、minion和replica可参考master.sh，minon.sh, replica.sh的内容。需要注意的是为了部署dashboard服务我们额外增加了dashboard相关的配置文件，具体增加的内容请参考这个commit： https://github.com/5xops/k8s-deploy/commit/1766a675d76edb32f310acd98d5c6ed50a356e5b

至此，k8s测试环境及搭建完成，后续我将使用这个k8s测试环境来部署其他服务，后面会慢慢分享。

nftables：nft man文档阅读笔记

2017年6月13日 · 阅读需 48 分钟

利用空闲时间学习了nftables的基础知识，其中官方的man page中包含了大量信息，在阅读过程中整理了一份带中文注释的笔记，以辅助加深记忆。

工具名称

nft -- 包过滤规则管理工具

基本用法

nft [ -h | --help ] [ -v | --version ]

工具描述

nftables 作为新一代的防火墙策略框架，旨在替代之前的各种防火墙工具诸如iptables/ebtables等，而且提供了类似tc的带宽限速能力。而nft则提供了nftables的命令行入口，是用户空间的管理工具。

选项说明

执行nft --help查看完整帮助信息

-h, --help

查看帮助信息.

-v, --version

查看版本号.

-n, --numeric

以数值方式展示数据，可重复使用，一个-n表示不解析域名，第二次不解析端口号，第三次不解析协议和uid/gid。

-s, --stateless

省略规则和有状态对象的状态信息

-N

将ip地址解析成域名，依赖dns解析。

-a, --handle

输出内容中展示规则handle信息

-I, --includepath directory

添加include文件搜索目录

-f, --file filename

从文件获取输入

-i, --interactive

从交互式cli获取输入

文件格式

语法规定

单行过长可用\换行连接；多个命令写到同一行可用分号; 分隔；注释使用井号#打头；标识符用大小写字母打头，后面跟数字字母下划线正斜杠反斜杠以及点号; 用双引号引起来表示纯字符串。

文件引用

include "filename"

可由外部文件通过include导入到当前文件，用-I/--includepath指定导入文件所在目录，如果include后面接的是目录而非文件，则整个目录的文件将以字母顺序依次导入。

符号变量

define variable = expr

$variable

Symbolic variables can be defined using the define statement. Variable references are expressions and can be used initialize other variables. The scope of a definition is the current block and all blocks contained within. 变量使用define定义，变量引用属于表达式，可以用于初始化其他变量，变量的生效范围在当前block以及被包含的所有block内。

示例 1. 使用符号变量

define int_if1 = eth0
define int_if2 = eth1
define int_ifs = { $int_if1, $int_if2 }

filter input iif $int_ifs accept

地址族

根据处理的包的种类不同可以将其分为不同的地址族。不同的地址族在内核中包含有特定阶段的处理路径和hook点，当对应hook的规则存在时则会被nftables处理。具体类型如下：

ip

IPv4 地址族

ip6

IPv6 地址族

inet

Internet (IPv4/IPv6) 地址族

arp

ARP 地址族

bridge

Bridge 地址族

netdev

Netdev 地址族

所有nftables对象存在于特定的地址族namespace中，换言之所有identifier都含有一个特定的地址族，如果未指定则默认使用ip地址族

IPv4/IPv6/Inet address families

IPv4/IPv6/Inet 地址族用于处理 IPv4和IPv6包，其在network stack中在不同的包处理阶段一共包含了5个hook.

Table 1. IPv4/IPv6/Inet 地址类hook列表

Hook名称	描述
prerouting	所有进入到系统的包都会被prerouting hook进行处理. 它在routing流程之前就被发起，用于靠前阶段的包过滤或者更改影响routing的包属性.
input	发往本地系统的包将被input hook处理.
forward	被转发到其他主机的包会经由forward hook处理.
output	由本地进程发送出去的包将被output hook处理.
postrouting	所有离开系统的包都将被postrouting hook处理.

ARP address family

ARP地址族用于处理经由系统接收和发送的ARP包。一般在集群环境中对ARP包进行mangle处理以支持clustering。

Table 2. ARP address family hooks

Hook	描述
input	分发到本机的包会经过input hook.
output	由本机发出的包会经过output hook.

Bridge address family

bridge地址族处理通过桥接设备的ethernet包。

Netdev address family

Netdev地址族处理从ingress过来的包。

Table 3. Netdev address family hooks

Hook	Description
ingress	所有进入系统的包都将被ingress hook处理。它在进入layer 3之前的阶段就开始处理。

Tables

{add | delete | list | flush} table [family] {table}

table是chain/set/stateful object的容器，table由其地址族和名字做标识。地址族必须属于ip, ip6, arp, bridge, netdev中的一种，inet地址族是一个虚拟地址族，同来创建同时包含IPv4和IPv6的table，如果没有指定地址族则默认使用ip地址族。

add

添加指定地址族，指定名称的table

delete

删除指定的table

list

列出指定table中的所有chain和rule

flush

清除指定table中的所有chain和rule

Chains

{add} chain [family] {table} {chain} {hook} {priority} {policy} {device}

{add | create | delete | list | flush} chain [family] {table} {chain}

{rename} chain [family] {table} {chain} {newname}

chain是rule的容器，他们存在于两种类型，基础链（base chain）和常规链（regular chain）。base chain是网络栈中数据包的入口点，regular chain则可用于jump的目标并对规则进行更好地组织。

add

在指定table中添加新的链，当hook和权重值被指定时，添加的chain为base chain，将在网络栈中hook相关联。

create

与add命令类似，不同之处在于当创建的chain存在时会返回错误。

delete

删除指定的chain，被删除的chain不能有规则且不能是跳转目标chain。

rename

重命名chain

list

列出指定chain中的所有rule

flush

清除指定chain中所有rule

Rules

[add | insert] rule [family] {table} {chain} [position position] {statement...}

{delete} rule [family] {table} {chain} {handle handle}

Rules are constructed from two kinds of components according to a set of grammatical rules: expressions and statements.

add

Add a new rule described by the list of statements. The rule is appended to the given chain unless a position is specified, in which case the rule is appended to the rule given by the position.

insert

Similar to the add command, but the rule is prepended to the beginning of the chain or before the rule at the given position.

delete

Delete the specified rule.

Sets

{add} set family] {table} {set}{ {type} [flags] [timeout] [gc-interval] [elements] [size] [policy]}

{delete | list | flush} set [family] {table} {set}

{add | delete} element [family] {table} {set}{ {elements}}

Sets are elements containers of an user-defined data type, they are uniquely identified by an user-defined name and attached to tables. sets 是用户定义的数据类型的容器，具有用户定义的唯一标识，被应用到table上。

add

在指定的table中添加一个新的set

delete

删除指定set

list

查看set内的元素

flush

清空整个set

add element

往set中添加元素，多个使用逗号分隔

delete element

从set中删除元素，多个使用逗号分隔

Table 4. Set 参数

关键字	描述	类型
type	元素的数据类型	string: ipv4_addr, ipv6_addr, ether_addr, inet_proto, inet_service, mark
flags	set flags	string: constant, interval, timeout
timeout	元素在set中的存活时间	string, 带单位的小数. 单位: d, h, m, s
gc-interval	垃圾回收间隔, 仅当timeout或flag timeout设置时生效	string, decimal followed by unit. Units are: d, h, m, s
elements	set中包含的元素	set data type
size	set可存放的最大元素个数	unsigned integer (64 bit)
policy	set policy	string: performance [default], memory

Maps

{add} map [family] {table} {map}{ {type} [flags] [elements] [size] [policy]}

{delete | list | flush} map [family] {table} {map}

{add | delete} element [family] {table} {map}{ {elements}}

Maps store data based on some specific key used as input, they are uniquely identified by an user-defined name and attached to tables.

add

Add a new map in the specified table.

delete

Delete the specified map.

list

Display the elements in the specified map.

flush

Remove all elements from the specified map.

add element

Comma-separated list of elements to add into the specified map.

delete element

Comma-separated list of element keys to delete from the specified map.

Table 5. Map specifications

Keyword	Description	Type
type	data type of map elements	string ':' string: ipv4_addr, ipv6_addr, ether_addr, inet_proto, inet_service, mark, counter, quota. Counter and quota can't be used as keys
flags	map flags	string: constant, interval
elements	elements contained by the map	map data type
size	maximun number of elements in the map	unsigned integer (64 bit)
policy	map policy	string: performance [default], memory

Stateful objects

{add | delete | list | reset} type [family] {table} {object}

Stateful objects are attached to tables and are identified by an unique name. They group stateful information from rules, to reference them in rules the keywords "type name" are used e.g. "counter name".

add

Add a new stateful object in the specified table.

delete

Delete the specified object.

list

Display stateful information the object holds.

reset

List-and-reset stateful object.

Ct

ct {helper} {type} {type} {protocol} {protocol} [l3proto] [family]

Ct helper is used to define connection tracking helpers that can then be used in combination with the "ct helper set" statement. type and protocol are mandatory, l3proto is derived from the table family by default, i.e. in the inet table the kernel will try to load both the ipv4 and ipv6 helper backends, if they are supported by the kernel.

Table 6. conntrack helper specifications

Keyword	Description	Type
type	name of helper type	quoted string (e.g. "ftp")
protocol	layer 4 protocol of the helper	string (e.g. tcp)
l3proto	layer 3 protocol of the helper	address family (e.g. ip)

示例 2. defining and assigning ftp helper

Unlike iptables, helper assignment needs to be performed after the conntrack lookup has completed, for example with the default 0 hook priority.

table inet myhelpers {
  ct helper ftp-standard {
     type "ftp" protocol tcp
  }
  chain prerouting {
      type filter hook prerouting priority 0;
      tcp dport 21 ct helper set "ftp-standard"
  }
}

Counter

counter [packets bytes]

Table 7. Counter specifications

Keyword	Description	Type
packets	initial count of packets	unsigned integer (64 bit)
bytes	initial count of bytes	unsigned integer (64 bit)

Quota

quota [over | until] [used]

Table 8. Quota specifications

Keyword	Description	Type
quota	quota limit, used as the quota name	Two arguments, unsigned interger (64 bit) and string: bytes, kbytes, mbytes. "over" and "until" go before these arguments
used	initial value of used quota	Two arguments, unsigned interger (64 bit) and string: bytes, kbytes, mbytes

Expressions

Expressions represent values, either constants like network addresses, port numbers etc. or data gathered from the packet during ruleset evaluation. Expressions can be combined using binary, logical, relational and other types of expressions to form complex or relational (match) expressions. They are also used as arguments to certain types of operations, like NAT, packet marking etc.

Each expression has a data type, which determines the size, parsing and representation of symbolic values and type compatibility with other expressions.

describe command

describe {expression}

The describe command shows information about the type of an expression and its data type.

示例 3. The describe command

$ nft describe tcp flags
payload expression, datatype tcp_flag (TCP flag) (basetype bitmask, integer), 8 bits

pre-defined symbolic constants:
fin                               0x01
syn                               0x02
rst                               0x04
psh                               0x08
ack                               0x10
urg                               0x20
ecn                               0x40
cwr                               0x80

Data types

Data types determine the size, parsing and representation of symbolic values and type compatibility of expressions. A number of global data types exist, in addition some expression types define further data types specific to the expression type. Most data types have a fixed size, some however may have a dynamic size, f.i. the string type.

Types may be derived from lower order types, f.i. the IPv4 address type is derived from the integer type, meaning an IPv4 address can also be specified as an integer value.

In certain contexts (set and map definitions) it is necessary to explicitly specify a data type. Each type has a name which is used for this.

Integer type

Table 9.

Name	Keyword	Size	Base type
Integer	integer	variable	-

The integer type is used for numeric values. It may be specified as decimal, hexadecimal or octal number. The integer type doesn't have a fixed size, its size is determined by the expression for which it is used.

Bitmask type

Table 10.

Name	Keyword	Size	Base type
Bitmask	bitmask	variable	integer

The bitmask type (bitmask) is used for bitmasks.

String type

Table 11.

Name	Keyword	Size	Base type
String	string	variable	-

The string type is used to for character strings. A string begins with an alphabetic character (a-zA-Z) followed by zero or more alphanumeric characters or the characters /, -, _ and .. In addition anything enclosed in double quotes (") is recognized as a string.

示例 4. String specification


# Interface name
filter input iifname eth0

# Weird interface name
filter input iifname "(eth0)"

Link layer address type

Table 12.

Name	Keyword	Size	Base type
Link layer address	lladdr	variable	integer

The link layer address type is used for link layer addresses. Link layer addresses are specified as a variable amount of groups of two hexadecimal digits separated using colons (:).

示例 5. Link layer address specification


# Ethernet destination MAC address
filter input ether daddr 20:c9:d0:43:12:d9

IPv4 address type

Table 13.

Name	Keyword	Size	Base type
IPv4 address	ipv4_addr	32 bit	integer

The IPv4 address type is used for IPv4 addresses. Addresses are specified in either dotted decimal, dotted hexadecimal, dotted octal, decimal, hexadecimal, octal notation or as a host name. A host name will be resolved using the standard system resolver.

示例 6. IPv4 address specification

# dotted decimal notation
filter output ip daddr 127.0.0.1

# host name
filter output ip daddr localhost

IPv6 address type

Table 14.

Name	Keyword	Size	Base type
IPv6 address	ipv6_addr	128 bit	integer

The IPv6 address type is used for IPv6 addresses. FIXME

示例 7. IPv6 address specification

# abbreviated loopback address
filter output ip6 daddr ::1

Boolean type

Table 15.

Name	Keyword	Size	Base type
Boolean	boolean	1 bit	integer

The boolean type is a syntactical helper type in user space. It's use is in the right-hand side of a (typically implicit) relational expression to change the expression on the left-hand side into a boolean check (usually for existence).

The following keywords will automatically resolve into a boolean type with given value:

Table 16.

Keyword	Value
exists	1
missing	0

示例 8. Boolean specification

The following expressions support a boolean comparison:

Table 17.

Expression	Behaviour
fib	Check route existence.
exthdr	Check IPv6 extension header existence.
tcp option	Check TCP option header existence.

# match if route exists
filter input fib daddr . iif oif exists

# match only non-fragmented packets in IPv6 traffic
filter input exthdr frag missing

# match if TCP timestamp option is present
filter input tcp option timestamp exists

ICMP Type type

Table 18.

Name	Keyword	Size	Base type
ICMP Type	icmp_type	8 bit	integer

The ICMP Type type is used to conveniently specify the ICMP header's type field.

The following keywords may be used when specifying the ICMP type:

Table 19.

Keyword	Value
echo-reply	0
destination-unreachable	3
source-quench	4
redirect	5
echo-request	8
router-advertisement	9
router-solicitation	10
time-exceeded	11
parameter-problem	12
timestamp-request	13
timestamp-reply	14
info-request	15
info-reply	16
address-mask-request	17
address-mask-reply	18

示例 9. ICMP Type specification


# match ping packets
filter output icmp type { echo-request, echo-reply }

ICMPv6 Type type

Table 20.

Name	Keyword	Size	Base type
ICMPv6 Type	icmpv6_type	8 bit	integer

The ICMPv6 Type type is used to conveniently specify the ICMPv6 header's type field.

The following keywords may be used when specifying the ICMPv6 type:

Table 21.

Keyword	Value
destination-unreachable	1
packet-too-big	2
time-exceeded	3
parameter-problem	4
echo-request	128
echo-reply	129
mld-listener-query	130
mld-listener-report	131
mld-listener-done	132
mld-listener-reduction	132
nd-router-solicit	133
nd-router-advert	134
nd-neighbor-solicit	135
nd-neighbor-advert	136
nd-redirect	137
router-renumbering	138
ind-neighbor-solicit	141
ind-neighbor-advert	142
mld2-listener-report	143

示例 10. ICMPv6 Type specification


# match ICMPv6 ping packets
filter output icmpv6 type { echo-request, echo-reply }

Primary expressions

The lowest order expression is a primary expression, representing either a constant or a single datum from a packet's payload, meta data or a stateful module.

Meta expressions

meta {length | nfproto | l4proto | protocol | priority}

[meta] {mark | iif | iifname | iiftype | oif | oifname | oiftype} [meta] {skuid | skgid | nftrace | rtclassid | ibriport | obriport | pkttype | cpu | iifgroup | oifgroup | cgroup | random}

A meta expression refers to meta data associated with a packet.

There are two types of meta expressions: unqualified and qualified meta expressions. Qualified meta expressions require the meta keyword before the meta key, unqualified meta expressions can be specified by using the meta key directly or as qualified meta expressions.

Table 22. Meta expression types

Keyword	Description	Type
length	Length of the packet in bytes	integer (32 bit)
protocol	Ethertype protocol value	ether_type
priority	TC packet priority	tc_handle
mark	Packet mark	mark
iif	Input interface index	iface_index
iifname	Input interface name	string
iiftype	Input interface type	iface_type
oif	Output interface index	iface_index
oifname	Output interface name	string
oiftype	Output interface hardware type	iface_type
skuid	UID associated with originating socket	uid
skgid	GID associated with originating socket	gid
rtclassid	Routing realm	realm
ibriport	Input bridge interface name	string
obriport	Output bridge interface name	string
pkttype	packet type	pkt_type
cpu	cpu number processing the packet	integer (32 bits)
iifgroup	incoming device group	devgroup
oifgroup	outgoing device group	devgroup
cgroup	control group id	integer (32 bits)
random	pseudo-random number	integer (32 bits)

Table 23. Meta expression specific types

Type	Description
iface_index	Interface index (32 bit number). Can be specified numerically or as name of an existing interface.
ifname	Interface name (16 byte string). Does not have to exist.
iface_type	Interface type (16 bit number).
uid	User ID (32 bit number). Can be specified numerically or as user name.
gid	Group ID (32 bit number). Can be specified numerically or as group name.
realm	Routing Realm (32 bit number). Can be specified numerically or as symbolic name defined in /etc/iproute2/rt_realms.
devgroup_type	Device group (32 bit number). Can be specified numerically or as symbolic name defined in /etc/iproute2/group.
pkt_type	Packet type: Unicast (addressed to local host), Broadcast (to all), Multicast (to group).

示例 11. Using meta expressions


# qualified meta expression
filter output meta oif eth0

# unqualified meta expression
filter output oif eth0

fib expressions

fib {saddr | daddr [mark | iif | oif]]} {oif | oifname | type}

A fib expression queries the fib (forwarding information base) to obtain information such as the output interface index a particular address would use. The input is a tuple of elements that is used as input to the fib lookup functions.

Table 24. fib expression specific types

Keyword	Description	Type
oif	Output interface index	integer (32 bit)
oifname	Output interface name	string
type	Address type	fib_addrtype

示例 12. Using fib expressions


# drop packets without a reverse path
filter prerouting fib saddr . iif oif missing drop

# drop packets to address not configured on ininterface
filter prerouting fib daddr . iif type != { local, broadcast, multicast } drop

# perform lookup in a specific 'blackhole' table (0xdead, needs ip appropriate ip rule)
filter prerouting meta mark set 0xdead fib daddr . mark type vmap { blackhole : drop, prohibit : jump prohibited, unreachable : drop }

Routing expressions

rt {classid | nexthop}

A routing expression refers to routing data associated with a packet.

Table 25. Routing expression types

Keyword	Description	Type
classid	Routing realm	realm
nexthop	Routing nexthop	ipv4_addr/ipv6_addr

Table 26. Routing expression specific types

Type	Description
realm	Routing Realm (32 bit number). Can be specified numerically or as symbolic name defined in /etc/iproute2/rt_realms.

示例 13. Using routing expressions


# IP family independent rt expression
filter output rt classid 10

# IP family dependent rt expressions
ip filter output rt nexthop 192.168.0.1
ip6 filter output rt nexthop fd00::1
inet filter meta nfproto ipv4 output rt nexthop 192.168.0.1
inet filter meta nfproto ipv6 output rt nexthop fd00::1

Payload expressions

Payload expressions refer to data from the packet's payload.

Ethernet header expression

ether [ethernet header field]

Table 27. Ethernet header expression types

Keyword	Description	Type
daddr	Destination MAC address	ether_addr
saddr	Source MAC address	ether_addr
type	EtherType	ether_type

VLAN header expression

vlan [VLAN header field]

Table 28. VLAN header expression

Keyword	Description	Type
id	VLAN ID (VID)	integer (12 bit)
cfi	Canonical Format Indicator	integer (1 bit)
pcp	Priority code point	integer (3 bit)
type	EtherType	ether_type

ARP header expression

arp [ARP header field]

Table 29. ARP header expression

Keyword	Description	Type
htype	ARP hardware type	integer (16 bit)
ptype	EtherType	ether_type
hlen	Hardware address len	integer (8 bit)
plen	Protocol address len	integer (8 bit)
operation	Operation	arp_op

IPv4 header expression

ip [IPv4 header field]

Table 30. IPv4 header expression

Keyword	Description	Type
version	IP header version (4)	integer (4 bit)
hdrlength	IP header length including options	integer (4 bit) FIXME scaling
dscp	Differentiated Services Code Point	dscp
ecn	Explicit Congestion Notification	ecn
length	Total packet length	integer (16 bit)
id	IP ID	integer (16 bit)
frag-off	Fragment offset	integer (16 bit)
ttl	Time to live	integer (8 bit)
protocol	Upper layer protocol	inet_proto
checksum	IP header checksum	integer (16 bit)
saddr	Source address	ipv4_addr
daddr	Destination address	ipv4_addr

ICMP header expression

icmp [ICMP header field]

Table 31. ICMP header expression

Keyword	Description	Type
type	ICMP type field	icmp_type
code	ICMP code field	integer (8 bit)
checksum	ICMP checksum field	integer (16 bit)
id	ID of echo request/response	integer (16 bit)
sequence	sequence number of echo request/response	integer (16 bit)
gateway	gateway of redirects	integer (32 bit)
mtu	MTU of path MTU discovery	integer (16 bit)

IPv6 header expression

ip6 [IPv6 header field]

Table 32. IPv6 header expression

Keyword	Description	Type
version	IP header version (6)	integer (4 bit)
dscp	Differentiated Services Code Point	dscp
ecn	Explicit Congestion Notification	ecn
flowlabel	Flow label	integer (20 bit)
length	Payload length	integer (16 bit)
nexthdr	Nexthdr protocol	inet_proto
hoplimit	Hop limit	integer (8 bit)
saddr	Source address	ipv6_addr
daddr	Destination address	ipv6_addr

ICMPv6 header expression

icmpv6 [ICMPv6 header field]

Table 33. ICMPv6 header expression

Keyword	Description	Type
type	ICMPv6 type field	icmpv6_type
code	ICMPv6 code field	integer (8 bit)
checksum	ICMPv6 checksum field	integer (16 bit)
parameter-problem	pointer to problem	integer (32 bit)
packet-too-big	oversized MTU	integer (32 bit)
id	ID of echo request/response	integer (16 bit)
sequence	sequence number of echo request/response	integer (16 bit)
max-delay	maximum response delay of MLD queries	integer (16 bit)

TCP header expression

tcp [TCP header field]

Table 34. TCP header expression

Keyword	Description	Type
sport	Source port	inet_service
dport	Destination port	inet_service
sequence	Sequence number	integer (32 bit)
ackseq	Acknowledgement number	integer (32 bit)
doff	Data offset	integer (4 bit) FIXME scaling
reserved	Reserved area	integer (4 bit)
flags	TCP flags	tcp_flag
window	Window	integer (16 bit)
checksum	Checksum	integer (16 bit)
urgptr	Urgent pointer	integer (16 bit)

UDP header expression

udp [UDP header field]

Table 35. UDP header expression

Keyword	Description	Type
sport	Source port	inet_service
dport	Destination port	inet_service
length	Total packet length	integer (16 bit)
checksum	Checksum	integer (16 bit)

UDP-Lite header expression

udplite [UDP-Lite header field]

Table 36. UDP-Lite header expression

Keyword	Description	Type
sport	Source port	inet_service
dport	Destination port	inet_service
checksum	Checksum	integer (16 bit)

SCTP header expression

sctp [SCTP header field]

Table 37. SCTP header expression

Keyword	Description	Type
sport	Source port	inet_service
dport	Destination port	inet_service
vtag	Verfication Tag	integer (32 bit)
checksum	Checksum	integer (32 bit)

DCCP header expression

dccp [DCCP header field]

Table 38. DCCP header expression

Keyword	Description	Type
sport	Source port	inet_service
dport	Destination port	inet_service

Authentication header expression

ah [AH header field]

Table 39. AH header expression

Keyword	Description	Type
nexthdr	Next header protocol	inet_proto
hdrlength	AH Header length	integer (8 bit)
reserved	Reserved area	integer (16 bit)
spi	Security Parameter Index	integer (32 bit)
sequence	Sequence number	integer (32 bit)

Encrypted security payload header expression

esp [ESP header field]

Table 40. ESP header expression

Keyword	Description	Type
spi	Security Parameter Index	integer (32 bit)
sequence	Sequence number	integer (32 bit)

IPcomp header expression

comp [IPComp header field]

Table 41. IPComp header expression

Keyword	Description	Type
nexthdr	Next header protocol	inet_proto
flags	Flags	bitmask
cpi	Compression Parameter Index	integer (16 bit)

Extension header expressions

Extension header expressions refer to data from variable-sized protocol headers, such as IPv6 extension headers and TCPs options.

nftables currently supports matching (finding) a given ipv6 extension header or TCP option.

hbh {nexthdr | hdrlength}

frag {nexthdr | frag-off | more-fragments | id}

rt {nexthdr | hdrlength | type | seg-left}

dst {nexthdr | hdrlength}

mh {nexthdr | hdrlength | checksum | type}

tcp option {eol | noop | maxseg | window | sack-permitted | sack | sack0 | sack1 | sack2 | sack3 | timestamp} [_tcp_option_field_]

The following syntaxes are valid only in a relational expression with boolean type on right-hand side for checking header existence only:

exthdr {hbh | frag | rt | dst | mh}

tcp option {eol | noop | maxseg | window | sack-permitted | sack | sack0 | sack1 | sack2 | sack3 | timestamp}

Table 42. IPv6 extension headers

Keyword	Description
hbh	Hop by Hop
rt	Routing Header
frag	Fragmentation header
dst	dst options
mh	Mobility Header

Table 43. TCP Options

Keyword	Description	TCP option fields
eol	End of option list	kind
noop	1 Byte TCP No-op options	kind
maxseg	TCP Maximum Segment Size	kind, length, size
window	TCP Window Scaling	kind, length, count
sack-permitted	TCP SACK permitted	kind, length
sack	TCP Selective Acknowledgement (alias of block 0)	kind, length, left, right
sack0	TCP Selective Acknowledgement (block 0)	kind, length, left, right
sack1	TCP Selective Acknowledgement (block 1)	kind, length, left, right
sack2	TCP Selective Acknowledgement (block 2)	kind, length, left, right
sack3	TCP Selective Acknowledgement (block 3)	kind, length, left, right
timestamp	TCP Timestamps	kind, length, tsval, tsecr

示例 14. finding TCP options

filter input tcp option sack-permitted kind 1 counter

示例 15. matching IPv6 exthdr

ip6 filter input frag more-fragments 1 counter

Conntrack expressions

Conntrack expressions refer to meta data of the connection tracking entry associated with a packet.

There are three types of conntrack expressions. Some conntrack expressions require the flow direction before the conntrack key, others must be used directly because they are direction agnostic. The packets, bytes and avgpkt keywords can be used with or without a direction. If the direction is omitted, the sum of the original and the reply direction is returned. The same is true for the zone, if a direction is given, the zone is only matched if the zone id is tied to the given direction.

ct {state | direction | status | mark | expiration | helper | label | l3proto | protocol | bytes | packets | avgpkt | zone}

ct {original | reply} {l3proto | protocol | saddr | daddr | proto-src | proto-dst | bytes | packets | avgpkt | zone}

Table 44. Conntrack expressions

Keyword	Description	Type
state	State of the connection	ct_state
direction	Direction of the packet relative to the connection	ct_dir
status	Status of the connection	ct_status
mark	Connection mark	mark
expiration	Connection expiration time	time
helper	Helper associated with the connection	string
label	Connection tracking label bit or symbolic name defined in connlabel.conf in the nftables include path	ct_label
l3proto	Layer 3 protocol of the connection	nf_proto
saddr	Source address of the connection for the given direction	ipv4_addr/ipv6_addr
daddr	Destination address of the connection for the given direction	ipv4_addr/ipv6_addr
protocol	Layer 4 protocol of the connection for the given direction	inet_proto
proto-src	Layer 4 protocol source for the given direction	integer (16 bit)
proto-dst	Layer 4 protocol destination for the given direction	integer (16 bit)
packets	packet count seen in the given direction or sum of original and reply	integer (64 bit)
bytes	bytecount seen, see description for packets keyword	integer (64 bit)
avgpkt	average bytes per packet, see description for packets keyword	integer (64 bit)
zone	conntrack zone	integer (16 bit)

Statements

Statements represent actions to be performed. They can alter control flow (return, jump to a different chain, accept or drop the packet) or can perform actions, such as logging, rejecting a packet, etc.

Statements exist in two kinds. Terminal statements unconditionally terminate evaluation of the current rule, non-terminal statements either only conditionally or never terminate evaluation of the current rule, in other words, they are passive from the ruleset evaluation perspective. There can be an arbitrary amount of non-terminal statements in a rule, but only a single terminal statement as the final statement.

Verdict statement

The verdict statement alters control flow in the ruleset and issues policy decisions for packets.

{accept | drop | queue | continue | return}

{jump | goto} {chain}

accept

Terminate ruleset evaluation and accept the packet.

drop

Terminate ruleset evaluation and drop the packet.

queue

Terminate ruleset evaluation and queue the packet to userspace.

continue

Continue ruleset evaluation with the next rule. FIXME

return

Return from the current chain and continue evaluation at the next rule in the last chain. If issued in a base chain, it is equivalent to accept.

jump chain

Continue evaluation at the first rule in chain. The current position in the ruleset is pushed to a call stack and evaluation will continue there when the new chain is entirely evaluated of a return verdict is issued.

goto chain

Similar to jump, but the current position is not pushed to the call stack, meaning that after the new chain evaluation will continue at the last chain instead of the one containing the goto statement.

示例 16. Verdict statements

# process packets from eth0 and the internal network in from_lan
# chain, drop all packets from eth0 with different source addresses.

filter input iif eth0 ip saddr 192.168.0.0/24 jump from_lan
filter input iif eth0 drop

Payload statement

The payload statement alters packet content. It can be used for example to set ip DSCP (differv) header field or ipv6 flow labels.

示例 17. route some packets instead of bridging

# redirect tcp:http from 192.160.0.0/16 to local machine for routing instead of bridging
# assumes 00:11:22:33:44:55 is local MAC address.
bridge input meta iif eth0 ip saddr 192.168.0.0/16 tcp dport 80 meta pkttype set unicast ether daddr set 00:11:22:33:44:55

示例 18. Set IPv4 DSCP header field

ip forward ip dscp set 42

Log statement

log [prefix _quoted_string_] [level _syslog-level_] [flags log-flags]

log [group _nflog_group_] [prefix _quoted_string_] [queue-threshold value] [snaplen size]

The log statement enables logging of matching packets. When this statement is used from a rule, the Linux kernel will print some information on all matching packets, such as header fields, via the kernel log (where it can be read with dmesg(1) or read in the syslog). If the group number is specified, the Linux kernel will pass the packet to nfnetlink_log which will multicast the packet through a netlink socket to the specified multicast group. One or more userspace processes may subscribe to the group to receive the packets, see libnetfilter_queue documentation for details. This is a non-terminating statement, so the rule evaluation continues after the packet is logged.

Table 45. log statement options

Keyword	Description	Type
prefix	Log message prefix	quoted string
syslog-level	Syslog level of logging	string: emerg, alert, crit, err, warn [default], notice, info, debug
group	NFLOG group to send messages to	unsigned integer (16 bit)
snaplen	Length of packet payload to include in netlink message	unsigned integer (32 bit)
queue-threshold	Number of packets to queue inside the kernel before sending them to userspace	unsigned integer (32 bit)

Table 46. log-flags

Flag	Description
tcp sequence	Log TCP sequence numbers.
tcp options	Log options from the TCP packet header.
ip options	Log options from the IP/IPv6 packet header.
skuid	Log the userid of the process which generated the packet.
ether	Decode MAC addresses and protocol.
all	Enable all log flags listed above.

示例 19. Using log statement

# log the UID which generated the packet and ip options
ip filter output log flags skuid flags ip options

# log the tcp sequence numbers and tcp options from the TCP packet
ip filter output log flags tcp sequence,options

# enable all supported log flags
ip6 filter output log flags all

Reject statement

reject [with] {icmp | icmp6 | icmpx} [type] {icmp_type | icmp6_type | icmpx_type}

reject [with] {tcp} {reset}

A reject statement is used to send back an error packet in response to the matched packet otherwise it is equivalent to drop so it is a terminating statement, ending rule traversal. This statement is only valid in the input, forward and output chains, and user-defined chains which are only called from those chains.

Table 47. reject statement type (ip)

Value	Description	Type
icmp_type	ICMP type response to be sent to the host	net-unreachable, host-unreachable, prot-unreachable, port-unreachable [default], net-prohibited, host-prohibited, admin-prohibited

Table 48. reject statement type (ip6)

Value	Description	Type
icmp6_type	ICMPv6 type response to be sent to the host	no-route, admin-prohibited, addr-unreachable, port-unreachable [default], policy-fail, reject-route

Table 49. reject statement type (inet)

Value	Description	Type
icmpx_type	ICMPvXtype abstraction response to be sent to the host, this is a set of types that overlap in IPv4 and IPv6 to be used from the inet family.	port-unreachable [default], admin-prohibited, no-route, host-unreachable

Counter statement

A counter statement sets the hit count of packets along with the number of bytes.

counter {packets _number_ } {bytes _number_ }

Conntrack statement

The conntrack statement can be used to set the conntrack mark and conntrack labels.

ct {mark | eventmask | label | zone} [set] value

The ct statement sets meta data associated with a connection. The zone id has to be assigned before a conntrack lookup takes place, i.e. this has to be done in prerouting and possibly output (if locally generated packets need to be placed in a distinct zone), with a hook priority of -300.

Table 50. Conntrack statement types

Keyword	Description	Value
eventmask	conntrack event bits	bitmask, integer (32 bit)
helper	name of ct helper object to assign to the connection	quoted string
mark	Connection tracking mark	mark
label	Connection tracking label	label
zone	conntrack zone	integer (16 bit)

示例 20. save packet nfmark in conntrack

ct mark set meta mark

示例 21. set zone mapped via interface

table inet raw {
  chain prerouting {
      type filter hook prerouting priority -300;
      ct zone set iif map { "eth1" : 1, "veth1" : 2 }
  }
  chain output {
      type filter hook output priority -300;
      ct zone set oif map { "eth1" : 1, "veth1" : 2 }
  }
}

示例 22. restrict events reported by ctnetlink

ct eventmask set new or related or destroy

Meta statement

A meta statement sets the value of a meta expression. The existing meta fields are: priority, mark, pkttype, nftrace.

meta {mark | priority | pkttype | nftrace} [set] value

A meta statement sets meta data associated with a packet.

Table 51. Meta statement types

Keyword	Description	Value
priority	TC packet priority	tc_handle
mark	Packet mark	mark
pkttype	packet type	pkt_type
nftrace	ruleset packet tracing on/off. Use monitor trace command to watch traces	0, 1

Limit statement

limit [rate] [over]_packet_number_ [/] {second | minute | hour | day} [burst _packet_number_ packets]

limit [rate] [over]_byte_number_ {bytes | kbytes | mbytes} [/] {second | minute | hour | day | week} [burst _byte_number_ bytes]

A limit statement matches at a limited rate using a token bucket filter. A rule using this statement will match until this limit is reached. It can be used in combination with the log statement to give limited logging. The over keyword, that is optional, makes it match over the specified rate.

Table 52. limit statement values

Value	Description	Type
packet_number	Number of packets	unsigned integer (32 bit)
byte_number	Number of bytes	unsigned integer (32 bit)

NAT statements

snat [to _address_ [:port]] [persistent, random, fully-random]

snat [to _address_ - _address_ [:_port_ - _port_]] [persistent, random, fully-random]

dnat [to _address_ [:_port_]] [persistent, random, fully-random]

dnat [to _address_ [:_port_ - _port_]] [persistent, random, fully-random]

masquerade [to [:_port_]] [persistent, random, fully-random]

masquerade [to [:_port_ - _port_]] [persistent, random, fully-random]

redirect [to [:_port_]] [persistent, random, fully-random]

redirect [to [:_port_ - _port_]] [persistent, random, fully-random]

The nat statements are only valid from nat chain types.

The snat and masquerade statements specify that the source address of the packet should be modified. While snat is only valid in the postrouting and input chains, masquerade makes sense only in postrouting. The dnat and redirect statements are only valid in the prerouting and output chains, they specify that the destination address of the packet should be modified. You can use non-base chains which are called from base chains of nat chain type too. All future packets in this connection will also be mangled, and rules should cease being examined.

The masquerade statement is a special form of snat which always uses the outgoing interface's IP address to translate to. It is particularly useful on gateways with dynamic (public) IP addresses.

The redirect statement is a special form of dnat which always translates the destination address to the local host's one. It comes in handy if one only wants to alter the destination port of incoming traffic on different interfaces.

Note that all nat statements require both prerouting and postrouting base chains to be present since otherwise packets on the return path won't be seen by netfilter and therefore no reverse translation will take place.

Table 53. NAT statement values

Expression	Description	Type
address	Specifies that the source/destination address of the packet should be modified. You may specify a mapping to relate a list of tuples composed of arbitrary expression key with address value.	ipv4_addr, ipv6_addr, eg. abcd::1234, or you can use a mapping, eg. meta mark map { 10 : 192.168.1.2, 20 : 192.168.1.3 }
port	Specifies that the source/destination address of the packet should be modified.	port number (16 bits)

Table 54. NAT statement flags

Flag	Description
persistent	Gives a client the same source-/destination-address for each connection.
random	If used then port mapping will be randomized using a random seeded MD5 hash mix using source and destination address and destination port.
fully-random	If used then port mapping is generated based on a 32-bit pseudo-random algorithm.

示例 23. Using NAT statements

# create a suitable table/chain setup for all further examples
add table nat
add chain nat prerouting { type nat hook prerouting priority 0; }
add chain nat postrouting { type nat hook postrouting priority 100; }

# translate source addresses of all packets leaving via eth0 to address 1.2.3.4
add rule nat postrouting oif eth0 snat to 1.2.3.4

# redirect all traffic entering via eth0 to destination address 192.168.1.120
add rule nat prerouting iif eth0 dnat to 192.168.1.120

# translate source addresses of all packets leaving via eth0 to whatever
# locally generated packets would use as source to reach the same destination
add rule nat postrouting oif eth0 masquerade

# redirect incoming TCP traffic for port 22 to port 2222
add rule nat prerouting tcp dport 22 redirect to :2222

Queue statement

This statement passes the packet to userspace using the nfnetlink_queue handler. The packet is put into the queue identified by its 16-bit queue number. Userspace can inspect and modify the packet if desired. Userspace must then drop or reinject the packet into the kernel. See libnetfilter_queue documentation for details.

queue [num _queue_number_] [bypass]

queue [num _queue_number_from_ - _queue_number_to_] [bypass,fanout]

Table 55. queue statement values

Value	Description	Type
queue_number	Sets queue number, default is 0.	unsigned integer (16 bit)
queue_number_from	Sets initial queue in the range, if fanout is used.	unsigned integer (16 bit)
queue_number_to	Sets closing queue in the range, if fanout is used.	unsigned integer (16 bit)

Table 56. queue statement flags

Flag	Description
bypass	Let packets go through if userspace application cannot back off. Before using this flag, read libnetfilter_queue documentation for performance tuning recomendations.
fanout	Distribute packets between several queues.

Additional commands

These are some additional commands included in nft.

export

Export your current ruleset in XML or JSON format to stdout.

Examples:

[...]
% nft export json
[...]

monitor

The monitor command allows you to listen to Netlink events produced by the nf_tables subsystem, related to creation and deletion of objects. When they ocurr, nft will print to stdout the monitored events in either XML, JSON or native nft format.

To filter events related to a concrete object, use one of the keywords 'tables', 'chains', 'sets', 'rules', 'elements'.

To filter events related to a concrete action, use keyword 'new' or 'destroy'.

Hit ^C to finish the monitor operation.

示例 24. Listen to all events, report in native nft format

% nft monitor

示例 25. Listen to added tables, report in XML format

% nft monitor new tables xml

示例 26. Listen to deleted rules, report in JSON format

% nft monitor destroy rules json

示例 27. Listen to both new and destroyed chains, in native nft format

% nft monitor chains

Error reporting

When an error is detected, nft shows the line(s) containing the error, the position of the erroneous parts in the input stream and marks up the erroneous parts using carrets (^). If the error results from the combination of two expressions or statements, the part imposing the constraints which are violated is marked using tildes (~).

For errors returned by the kernel, nft can't detect which parts of the input caused the error and the entire command is marked.

示例 28. Error caused by single incorrect expression

<cmdline>:1:19-22: Error: Interface does not exist
filter output oif eth0
                  ^^^^

示例 29. Error caused by invalid combination of two expressions

<cmdline>:1:28-36: Error: Right hand side of relational expression (==) must be constant
filter output tcp dport == tcp dport
                        ~~ ^^^^^^^^^

示例 30. Error returned by the kernel

<cmdline>:0:0-23: Error: Could not process rule: Operation not permitted
filter output oif wlan0
^^^^^^^^^^^^^^^^^^^^^^^

退出状态码

On success, nft exits with a status of 0. Unspecified errors cause it to exit with a status of 1, memory allocation errors with a status of 2, unable to open Netlink socket with 3.

Authors

nftables was written by Patrick McHardy and Pablo Neira Ayuso, among many other contributors from the Netfilter community.

Copyright

nftables is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.

This documentation is licenced under the terms of the Creative Commons Attribution-ShareAlike 4.0 license, CC BY-SA 4.0.

开篇

2017年5月31日 · 阅读需 1 分钟

重新开始写博客，这篇作为开篇，继忙碌地工作了快两年之后，发现抽时间将一些想法和经验沉淀下来也是如此重要。而整理文章的过程本身也是对思绪的整理，可以帮助我们更好的理解所做的事情。接下来会把旧博客的备份文章重新放上来。

CENTOS7管理之动态防火墙FIREWALLD

2015年1月3日 · 阅读需 12 分钟

firewalld 的出现带来了许多新的亮点，它使得防火墙规则管理更加方便和统一，但 firewalld 项目本身还有诸多不完善的地方，还需要一些时间的沉淀才能变得更加稳定和得到更多软件社区的支持。在我们使用过程中也发现了其存在的一些问题，我们也正在努力参与到 firewalld 的改进和测试当中，也希望有更多的人能够参与进来。

firewalld的出现

自 fedora18 开始，firewalld 已经成为默认的防火墙管理组件被集成到系统中，用以取代 iptables service 服务，而基于 fedora18 开发的RHEL7，以及其一脉相承的CentOS7 也都使用 firewalld 作为默认的防火墙管理组件，无论是对于日常使用还是企业应用来讲这样的变更都或多或少的为使用者带来了影响。在浏览完本文前我们先不急着下结论到底这样的变化是好是坏，让我们“剥离它天生的骄傲，排除这些外界的干扰“，来看看 firewalld 到底是什么样的。以下内容以 CentOS7 系统为例进行讲解，我们假设读者对于 iptables 防火墙相关知识有一定的掌握。

什么是动态防火墙？

我们首先需要弄明白的第一个问题是到底什么是动态防火墙。为了解答这个问题，我们先来回忆一下 iptables service 管理防火墙规则的模式：用户将新的防火墙规则添加进 /etc/sysconfig/iptables 配置文件当中，再执行命令 service iptables reload 使变更的规则生效。在这整个过程的背后，iptables service 首先对旧的防火墙规则进行了清空，然后重新完整地加载所有新的防火墙规则，而如果配置了需要 reload 内核模块的话，过程背后还会包含卸载和重新加载内核模块的动作，而不幸的是，这个动作很可能对运行中的系统产生额外的不良影响，特别是在网络非常繁忙的系统中。

如果我们把这种哪怕只修改一条规则也要进行所有规则的重新载入的模式称为静态防火墙的话，那么 firewalld 所提供的模式就可以叫做动态防火墙，它的出现就是为了解决这一问题，任何规则的变更都不需要对整个防火墙规则列表进行重新加载，只需要将变更部分保存并更新到运行中的 iptables 即可。

这里有必要说明一下 firewalld 和 iptables 之间的关系， firewalld 提供了一个 daemon 和 service，还有命令行和图形界面配置工具，它仅仅是替代了 iptables service 部分，其底层还是使用 iptables 作为防火墙规则管理入口。firewalld 使用 python 语言开发，在新版本中已经计划使用 c++ 重写 daemon 部分。

firewalld 具有哪些特性？

那么 firewalld 除了是动态防火墙以外，它还具有哪些优势或者特性呢？第一个是配置文件。firewalld 的配置文件被放置在不同的 xml 文件当中，这使得对规则的维护变得更加容易和可读，有条理。相比于 iptables 的规则配置文件而言，这显然可以算作是一个进步。第二个是区域模型。firewalld 通过对 iptables 自定义链的使用，抽象出一个区域模型的概念，将原本十分灵活的自定义链统一成一套默认的标准使用规范和流程，使得防火墙在易用性和通用性上得到提升。令一个重要特性是对 ebtables 的支持，通过统一的接口来实现 ipt/ebt 的统一管理。还有一个重要特性是富语言。富语言风格的配置让规则管理变得更加人性化，学习门槛相比原生的 iptables 命令有所降低，让初学者可以在很短时间内掌握其基本用法，规则管理变得更快捷。

firewalld 基本术语

本文不打算重复阐述现有文档中的基本知识，仅仅提供一些对现有文档中知识的补充和理解，详细的文档请参阅man page或本文结尾处的延伸阅读¹部分。

zone：

firewalld将网卡对应到不同的区域（zone），zone 默认共有9个，block dmz drop external home internal public trusted work ，不同的区域之间的差异是其对待数据包的默认行为不同，根据区域名字我们可以很直观的知道该区域的特征，在CentOS7系统中，默认区域被设置为public，而在最新版本的fedora（fedora21）当中随着 server 版和 workstation 版的分化则添加了两个不同的自定义 zone FedoraServer 和 FedoraWorkstation 分别对应两个版本。使用下面的命令分别列出所有支持的 zone 和查看当前的默认 zone：

firewall-cmd --get-zones
firewall-cmd --get-default-zone

所有可用 zone 的 xml 配置文件被保存在 /usr/lib/firewalld/zones/ 目录，该目录中的配置为默认配置，不允许管理员手工修改，自定义 zone 配置需保存到 /etc/firewalld/zones/ 目录。防火墙规则即是通过 zone 配置文件进行组织管理，因此 zone 的配置文件功能类似于 /etc/sysconfig/iptables 文件，只不过根据不同的场景默认定义了不同的版本供选择使用，这就是 zone 的方便之处。

service：

在 /usr/lib/firewalld/services/ 目录中，还保存了另外一类配置文件，每个文件对应一项具体的网络服务，如 ssh 服务等，与之对应的配置文件中记录了各项服务所使用的 tcp/udp 端口，在最新版本的 firewalld 中默认已经定义了 70+ 种服务供我们使用，当默认提供的服务不够用或者需要自定义某项服务的端口时，我们需要将 service 配置文件放置在 /etc/firewalld/services/ 目录中。service 配置的好处显而易见，第一，通过服务名字来管理规则更加人性化，第二，通过服务来组织端口分组的模式更加高效，如果一个服务使用了若干个网络端口，则服务的配置文件就相当于提供了到这些端口的规则管理的批量操作快捷方式。每加载一项 service 配置就意味着开放了对应的端口访问，使用下面的命令分别列出所有支持的 service 和查看当前 zone 种加载的 service：

firewall-cmd --get-services
firewall-cmd --list-services

使用示例

在 firewalld 官方文档中提供了若干使用示例，这些示例对学习防火墙管理是很好的基本参考资料。接下来我们将通过一些真实的使用示例来展示如何使用 firewalld 对防火墙规则进行管理。

场景一：自定义 ssh 端口号

出于安全因素我们往往需要对一些关键的网络服务默认端口号进行变更，如 ssh，ssh 的默认端口号是 22，通过查看防火墙规则可以发现默认是开放了 22 端口的访问的：

[root@localhost ~]# iptables -S
... ...
-A IN_public_allow -p tcp -m tcp --dport 22 -m conntrack --ctstate NEW -j ACCEPT

假设自定义的 ssh 端口号为 22022，使用下面的命令来添加新端口的防火墙规则：

firewall-cmd --add-port=22022/tcp

如果需要使规则保存到 zone 配置文件，则需要加参数 --permanent。我们还可以使用自定义 service 的方式来实现同样的效果：在 /etc/firewalld/services/ 目录中添加自定义配置文件 custom-ssh.xml ，内容如下：

<?xml version="1.0" encoding="utf-8"?>
<service>
  <short>customized SSH</short>
  <description>Secure Shell (SSH) is a protocol for logging into and executing commands on remote machines. It provides secure encrypted communications. If you plan on accessing your machine remotely via SSH over a firewalled interface, enable this option. You need the openssh-server package installed for this option to be useful.</description>
  <port protocol="tcp" port="22022"/>
</service>

执行命令重载配置文件，并添加防火墙规则:

systemctl reload firewalld
firewall-cmd --add-service=custom-ssh

一旦新的规则生效，旧的 ssh 端口规则旧可以被禁用掉：

firewall-cmd --remove-service=ssh

场景二：允许指定的IP访问SNMP服务

某些特殊的服务我们并不想开放给所有人访问，只需要开放给特定的IP地址即可，例如 SNMP 服务，我们将使用 firewalld 的富语言风格配置指令：

firewall-cmd --add-rich-rule="rule family='ipv4' source address='10.0.0.2' port port='161' protocol='udp' accept"

查看防火墙规则状态，证明结果正是我们想要的：

[root@localhost ~]# iptables -S
... ...
-A IN_public_allow -s 10.0.0.2/32 -p udp -m udp --dport 161 -m conntrack --ctstate NEW -j ACCEPT

参考链接：²

NGINX性能调优简要

2014年10月19日 · 阅读需 13 分钟

以下是数天前NGINX官方博客发表的一篇有关nginx性能优化的文章，内容简明扼要，值得一读。译文在原文基础上略有变动，如有不足，欢迎指正。

前言

NGINX作为一款高性能负载均衡器，缓存和web服务器被大家所熟知，为全世界40%的网站提供着服务。NGINX和Linux系统中的大多数默认配置对普通应用来说已经很合适，然而要想达到更高的性能以应对高负载场景那么一些性能优化是必须的。这篇文章将会探讨一些在性能调优时涉及到的NGINX和Linux系统参数。当然可供调节的参数众多，我们这里只会涉及部分对大多数用户来说被使用最多的。其他没有被涉及到的参数多是需要对系统有深入了解之后才需要接触到的。

简介

我们假定此文读者对NGINX的基本架构和配置有一定了解，此文不会重复NGINX文档中的内容，如有涉及我们仅会给出链接。

在做性能调优时一个最佳实践是一次只动一个参数，如果调整后未达到预期效果，则将其修改回初始值。我们将从Linux系统的一些参数开始讲起，这些参数将对后续的NGINX配置调整有着至关重要的作用。

Linux 配置

现代Linux内核（2.6+）中的很多参数设置都是十分到位的，但是仍然有一些是需要我们进行调整的。如果一些默认的值设置得太小，那么在内核日志中将记录着错误信息，这预示着我们需要对其中一些参数进行调整了。在众多选项当中我们只会涉及到对大多数负载场景都实用的，请参考Linux系统文档以了解更多有关这些选项的细节。

Backlog Queue

下面的设置与网络连接和连接队列有直接关联。如果你有大量的接入请求，且偶尔出现一些失效请求的话，那么下面的设置将起到优化效果。

net.core.somaxconn

此参数控制NGINX等待连接的队列大小。因NGINX处理连接的速度快，所以一般这个参数不建议被设置太大，但是默认值有点偏低，所以针对大流量网站来说调整这个参数是必须的，如果此参数被设置多低那么有可能在内核日志中看到报错，这时需要调整此参数直到报错消失为止。注意，如果此参数设置大于512的话，则需要对NGINX配置中的listen指令中的backlog参数进行同步调整。

net.core.netdev_max_backlog

此参数控制数据包在进入CPU处理前，在网卡中被缓存的量，需要处理大带宽的机器需要增加此参数的值。设置此参数需要参考具体的网卡的文档或者根据系统错误日志进行调整。

File Descriptors

文件描述符是系统在处理如网络连接和打开文件时的系统资源。NGINX中每个连接的建立可能需要占用两个文件描述符，例如代理模式下，一个用来处理客户端连接，一个用来处理到后端的连接，而如果HTTP keepalives被启用的话对文件描述符的消耗会轻松一些。需要处理高并发的机器建议调整下面的参数：

sys.fs.file_max

这个参数影响系统全局的文件描述符打开数量限制。

nofile

此参数影响单个用户的文件描述符数量，在/etc/security/limits.conf中进行设置。

Ephemeral ports

当NGINX被作为代理服务器时，每个到后端服务器的连接将占用一个临时的，或短期的端口。

net.ipv4.ip_local_port_range

此参数控制可被用作临时端口的起始范围，一个通用设置是1024-65000

net.ipv4.tcp_fin_timeout

此参数控制一个连接使用完毕后端口被回收再利用的超时时间，一般默认设置为60秒，但是一般设置降低到30或者15秒都是没有问题的。

NGINX 配置

下面介绍NGINX中对性能有影响的参数，如下面提到的一样，我们只讲解一些适用于大多数用户的参数进行调整，其他未提及的可能是不建议调整的。

Worker Processes

NGINX可以同时启动多个worker进程，每个进程处理大量的连接，通过调整下面的参数可以控制启动的进程数量和每个进程所处理的连接数量。

worker_processes

此参数控制NGINX启动进程的数量，在多数情况下每个cpu核心分配一个进程是最佳的，将参数值设置为auto即可达到。很多场景下都需要增加此参数的值，如在需要处理大量磁盘I/O的场景。默认的值是1。

worker_connections

此参数控制在同一时刻单个进程可以处理的最大连接数。默认值512，但大多数系统都可以处理更大的量。其最佳值与实际场景和系统有关，需要经过反复测试才能得出。

Keepalives

Keepalive连接降低连接的建立和关闭对CPU和网络的消耗，对性能有着十分重要的影响。NGINX会关闭所有客户端的连接，且与后端服务器的连接都是独立的。NGINX支持到客户端和后端的keepalive，下面的参数对客户端keepalive进行控制：

keepalive_requests

此参数控制通过单个keepalive连接可以处理的客户端请求数量，默认值是100，可以调整为更大的值，特别是在通过单个客户端进行压力测试的时候。

keepalive_timeout

此参数控制单个keepalive连接在空闲状态保持连接的时间。

下面的参数对后端keepalive进行控制：

keepalive

此参数控制单个worker进程保持到upstream server的keepalive连接数量，且默认没有设置。如需启动到后端的keepalive连接则需要进行如下设置：

proxy_http_version 1.1; proxy_set_header Connection "";

Access Logging

请求产生的日志记录对CPU和I/O都有消耗，一个可以降低资源消耗的办法是启用日志缓存。启用日志缓存后，NGINX将缓存部分请求日志，然后一次性写入文件。要启用日志缓存只需要在access_log中增加“buffer=size”的设置即可，size值控制缓存的大小，同时也可以设置“flush=time”以控制缓存的时间，设置了两个参数后，NGINX会在缓存被充满或者日志条目数达到flush值时回写日志。在worker进程重启或者关闭时也会回写日志。而永久禁用日志记录也是可以实现的。

Sendfile

Sendfile 是一项操作系统特性，可以被NGINX启用。它通过对数据在文件描述符之间进行in-kernel copying来提供更快的tcp数据传输处理，一般通过zero-copy技术实现。NGINX使用它来完成缓存数据或者磁盘数据到socket的写操作，不产生任何的用户空间上下文切换开销，可降低CPU负载和提高处理速度。当启用sendfile特性之后，由于数据不经过用户空间，使得对数据内容进行处理的filter将不起作用，例如gzip filter将默认被禁用。

Limits

NGINX也为用户提供设置连接限制的能力，用来对客户请求的资源进行控制，对系统性能，用户体验和安全性也产生极大的影响。下面是部分用于请求限制的指令：

limit_conn/limit_conn_zone

这两个指令用来控制NGINX可接收的连接数，例如来自单个客户端IP的连接请求。这有助于限制客户端建立过多的连接并消耗过多资源。

limit_rate

这个指令控制单个连接允许的客户端最大带宽。这可以避免系统被部分客户端耗尽资源，保证了为每个客户端请求提供服务的质量。

limit_req/limit_req_zone

这两个指令可以控制NGINX的请求回复水平，以不至于被部分客户端拖垮。也被用来加强安全性，特别是对登陆页面等进行有效的保护。

max_conns

这个指令用来控制到后端服务器的最大连接数，保护后端服务器不被拖垮，默认值是zero，没有任何限制。

queue

如果max_conns被设置，那么此参数对超过最大连接时的状态产生影响，可以设置队列中请求的个数和缓冲时间，如果没有设置此参数，则队列不存在。

其他设置

NGINX还有一些特性能对某些特定场景下的应用起到性能优化作用，我们将探讨其中的两个特性。

缓存

当把NGINX作为负载均衡器来使用的场景下，启用cache可以显著改善到客户端的响应时间，且显著降低后端服务器的压力。如需了解更多NGINX的caching设置，可以参考此链接：: NGINX Admin Guide – Caching.

压缩

对回应内容进行压缩将有效降低回应内容的大小，降低带宽消耗，然而压缩将增加CPU的开销，所以带宽成本较高时启用才是明智之举。需要明确注意的是不要对已经压缩的内容启用压缩，例如jpeg格式的图片，如需了解更多有关压缩的设置可以参考此文档： NGINX Admin Guide – Compression and Decompression

原文链接:

http://nginx.com/blog/tuning-nginx/

shell脚本coding style总结

2014年5月23日 · 阅读需 5 分钟

这里总结了个人比较推崇的shell脚本coding style，编写出方便阅读和维护的脚本是运维人员的基本操守。

1.关于缩进：一个tab定义为4个空格；关于这个缩进距离貌似有太多的的说法了，有的会用8个空格，有的用2个空格，还有的用4个空格；不过最终取决于团队风格。但是记住，正确的做法是项目、文件、成员间全部统一，千万不要出现一个项目内各种缩进，tab和空格混用，甚至一个文件中的缩进都不统一的情况。在vim中设置如下：

set ts=4
set sw=4
set expandtab

2.尽量缩短单行的长度，最好不超过72字符（注意：这个限制因历史原因导致，为了兼容那些老的终端设备而考虑，实际上现在已经不适用了，单行代码可以更长，只要不要超过大多数屏幕的输出宽度就好）

bad:

thisisaverylongline || thisisanotherlongline

good:

thisisaverylongline ||
    thisisanotherlongline

3.注释尽量保持清晰的层次,#号与注释文本间保持一个空格以示和代码的区分

bad:

#this is a comment
#this is a code line

good:

# this is a comment
#this is code line

4.定义函数可以不用写function关键字，函数名字尽量短小无歧义，尽量传递返回值

bad:

function  cmd1

good:

cmd_start() {
   dosomethinghere
   return 0
}

5.全局变量用大写，局部变量用小写，函数内尽量不要使用全局变量，以免混淆导致变量覆盖,注意尽量使用小写表示变量

bad:

foo() {
    a=2
    echo $a
}
a=1
echo $a

good:

foo(){
    local a
    a=2
    echo $a
}
A=1
echo $A

6.使用内建的[] 、[[]] 进行条件测试，避免使用test命令

bad:

test -f filename

good:

[ -f filename ]
[[ -n $string ]]

7.使用$(())进行普通运算，尽量避免使用expr或其他外部命令 $[]也可用于计算

bad:

num=$(expr 1 + 1)

good:

num=$((1+1))
num=$[1+2]

8.管道符左右都应加空格，重定向符空格加左不加右

bad:

find /data -name tmp*|xargs rm -fr
cat a>b

good:

find /data -name tmp* | xargs rm -fr
cat a >b

9.当source命令属于外部命令的时候，我们应尽量使用. ，当视力不好怕写错看错的时候，应尽量不用.而用source，在实际使用中我们发现.号极难辨认，这里推荐使用source命令，更方便维护。

bad:

. func_file

good:

source func_file

10.if 和 then 之间使用分号+空格分隔，不要用换行，书写上和c style类似：

bad:

grep string logfile
if [ $? -ne 0 ]
then
  dosomething
fi

while 1
do
  do something here
done

good:

if grep string logfile; then
    dosomething
fi

while 1; do
   do something here
done

注意;和then间有一个空格的距离

11.如果grep能直接处理文件输入那就不要和cat连用; 如果wc能直接从文件统计就不要和cat连用; 如果grep能统计行数就不要和wc连用

bad:

cat file | grep tring
cat file.list | wc -l
grep avi av.list | wc -l

good:

grep tring file
wc -l file.list
grep -c avi av.list

12.如果awk的时候需要搜索恰好awk又能搜索，那么就不要再和grep连用

bad:

dosomething | grep tring | awk '{print $1}'

good:

dosomething | awk '/tring/' '{print $1}'

13、尽量编写兼容旧版本shell风格且含义清晰的代码，不推荐不兼容写法或者不方便他人维护的代码

bad:

do_some_thing |& do_another_thing
do_some_thing &>> some_where

good:

do_some_thing 2>&1 | do_another_thing
do_some_thing >some_where 2>&1

注意 : 以上仅代表个人习惯和理解，并不能适用于所有人，适合的才是最好的。

拓展阅读文档：

Systemd基本使用介绍

2014年4月8日 · 阅读需 15 分钟

但是由于计算机软硬件的不断发展，人们逐渐发现sysvinit所提供的功能已经无法满足当前的需求，服务多年的sysvinit终将过时，于是一些替代的方案开始出现，这其中包括upstart ，systemd等。

关于PID 1

在linux的世界里，第一个启动到用户空间的进程名叫init，其pid为1，init进程启动完毕后，它相当于其他进程的根，负责将其他的服务进程启动起来，最终启动成为一个完整可用的系统。而提供这个init程序的软件名为sysvinit，它在整个linux系统中所担任的角色重要程度无可厚非。但是由于计算机软硬件的不断发展，人们逐渐发现sysvinit所提供的功能已经无法满足当前的需求，服务多年的sysvinit终将过时，于是一些替代的方案开始出现，这其中包括upstart ，systemd等。

init的责任

为什么说sysvinit会过时呢？我们从用户需求的角度来看不难发现，其实我们对init的最原始最核心需求是，将系统从内核引导至用户空间，而由于现在硬件水平的发展，使得我们在单纯的原始需求之上产生了新的或者更多的需求，那就是不仅要引导系统，而且要快速的引导系统。而早在sysvinit诞生的那个时候启动速度的重要程度似乎不是很高，所以它在这方面显得有些老是很自然的。现在让我们继续思考一下如何才能实现快速引导。试想，当一个系统启动的时候需要启动长长的一列服务，这样的系统启动速度应该不会快到哪里去，如大多数人使用桌面系统的情况一样，一个加快系统启动速度的最直接方法就是减少启动项，把不必要的启动项禁止掉。而另外一个方法则实现起来不像第一个这么简单，那就是并行启动。并行启动可以带来的速度提升显而易见，我们只需要尽可能保证让更多的服务可以并行启动即可。而从其他方面来看sysvinit由于对脚本的依赖导致启动完毕所有服务过程中需要大量的执行外部命令，这一点也将导致引导速度的变慢。

后起之秀

当一项标准已经无法满足当下发展的需求时，最好的办法是打破陈规让自己变成新的标准。systemd就是这么做的，因为现实的需要它已经不能顾及POSIX标准了，以不至于阻碍其更好的发展，而systemd在设计之初也和苹果的launchd在某些地方不谋而合。尽管在最初的时候这种做法没有得到所有人的认可甚至是惹恼了一些家伙，但是现在看来systemd已经成为了事实上的标准了，现在已经采用systemd的发行版有fedora，opensuse，debian，ubuntu，rhel7，archlinux等，尽管在systemd的推广过程中发生了很多曲折，但最终大家的选择是一致的---朝着更好的方向发展而不是墨守陈规。在功能上，systemd不仅仅是解决了现有程序的种种问题，而且包含了大量新的特性，这意味着系统中原本需要的很多组件现在也都可以下岗了，下面来看看systemd的一些特性：

服务管理 - 提供了统一的启动/停止/重启/重载而不再需要编写一大坨脚本来干这种原本就应该十分简单基本且重要的事情，取而代之的是更为简洁的配置文件，这也表示今后的系统中各种service启动控制的统一性，使得daemon服务开发者免于编写各种看上去参差不齐的启动脚本，增加了通用性。
socket管理 - 支持监听socket，这使得socket的监听和服务本身可以相互独立，一方面可以以此提高服务启动速度，另外一方面还可以节约系统开销。
设备管理 - 可以配合udev规则对设备进行管理，还可以配合/etc/fstab文件实现磁盘的挂载管理，甚至实现更高级的自动挂载；
target分组 - 把不同的unit组合起来，组合到同一个target里面，完全取代sysvinit的运行等级的概念；
状态快照 - 可以对当前系统中的unit状态进行快照，这个概念有些类似于系统的休眠与恢复，你可以让系统从一个状态切换进入另外一个状态；

systemd管理入门

下面我们将通过一些例子来演示一些基本的管理命令，这些命令对于系统管理员来说至关重要。

1 如何检查启动项

任何启动项，只要是在系统启动时有被执行到，不论启动成功还是失败，systemd都能够记录下他们的状态，直接执行不带参数的systemctl命令即可观察到，如下：

# systemctl
UNIT                                             LOAD   ACTIVE SUB       DESCRIPTION
proc-sys-fs-binfmt_misc.automount                loaded active waiting   Arbitrary Executable File Formats File System Aut
sys-devices-pci...0-backlight-acpi_video1.device loaded active plugged   /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0
sys-devices-pci...0-backlight-acpi_video0.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/backlight/ac
sys-devices-pci...00-0000:00:19.0-net-em1.device loaded active plugged   82579LM Gigabit Network Connection
sys-devices-pci...d1.4:1.0-bluetooth-hci0.device loaded active plugged   /sys/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1
sys-devices-pci...000:00:1b.0-sound-card0.device loaded active plugged   6 Series/C200 Series Chipset Family High Definiti
sys-devices-pci...0000:03:00.0-net-wlp3s0.device loaded active plugged   RTL8188CE 802.11b/g/n WiFi Adapter
sys-devices-pci...-0:0:0:0-block-sda-sda1.device loaded active plugged   ST9500420AS
sys-devices-pci...-0:0:0:0-block-sda-sda2.device loaded active plugged   ST9500420AS
sys-devices-pci...-0:0:0:0-block-sda-sda3.device loaded active plugged   ST9500420AS
sys-devices-pci...-0:0:0:0-block-sda-sda5.device loaded active plugged   ST9500420AS
sys-devices-pci...-0:0:0:0-block-sda-sda6.device loaded active plugged   ST9500420AS
sys-devices-pci...-0:0:0:0-block-sda-sda7.device loaded active plugged   LVM PV EKHM59-PY9G-AoRX-Nr9k-nnxN-XxxO-DFcj4N on
sys-devices-pci...0:0:0-0:0:0:0-block-sda.device loaded active plugged   ST9500420AS
sys-devices-pci...-1:0:0:0-block-sdb-sdb1.device loaded active plugged   KINGSTON_SVP200S360G
sys-devices-pci...-1:0:0:0-block-sdb-sdb2.device loaded active plugged   LVM PV rlRqSb-IlQn-DJQi-i7fg-sUV5-3bjI-g2npg7 on
sys-devices-pci...1:0:0-1:0:0:0-block-sdb.device loaded active plugged   KINGSTON_SVP200S360G

要检查具体的服务，则使用status选项加上服务名即可，如下：

# systemctl status libvirtd
libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled)
   Active: active (running) since 一 2014-04-07 19:10:30 CST; 9min ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 1673 (libvirtd)
   CGroup: /system.slice/libvirtd.service
           ├─1673 /usr/sbin/libvirtd
           └─1804 /sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf


4月 07 19:10:33 localhost.localdomain dnsmasq[1804]: using nameserver 218.6.200.139#53
4月 07 19:10:33 localhost.localdomain dnsmasq[1804]: using nameserver 61.139.2.69#53
4月 07 19:10:33 localhost.localdomain dnsmasq[1804]: using local addresses only for unqualified names
4月 07 19:10:33 localhost.localdomain dnsmasq[1804]: read /etc/hosts - 3 addresses
4月 07 19:10:33 localhost.localdomain dnsmasq[1804]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
4月 07 19:10:33 localhost.localdomain dnsmasq-dhcp[1804]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Hint: Some lines were ellipsized, use -l to show in full.

以上这些信息向我们展示了服务的运行时间，当前状态，相关进程pid，以及最近的有关该服务的日志信息，是不是很方便？

2 找出每项服务所对应的进程id

对于系统管理来说这是最常见的工作，但是在sysvinit时代我们只能借助例如ps命令等来完成，而systemd已经考虑到了系统管理员的需求，于是下面的命令诞生了：

# systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
├─user.slice
│ ├─user-1000.slice
│ │ ├─session-1.scope
│ │ │ ├─1806 gdm-session-worker [pam/gdm-password]
│ │ │ ├─1820 /usr/bin/gnome-keyring-daemon --daemonize --login
│ │ │ ├─1822 gnome-session
│ │ │ ├─1830 dbus-launch --sh-syntax --exit-with-session
│ │ │ ├─1831 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
│ │ │ ├─1858 /usr/libexec/at-spi-bus-launcher
│ │ │ ├─1862 /bin/dbus-daemon --config-file=/etc/at-spi2/accessibility.conf --nofork --print-address 3
│ │ │ ├─1865 /usr/libexec/at-spi2-registryd --use-gnome-session
│ │ │ ├─1872 /usr/libexec/gvfsd
... ...

现在我们可以很清晰的看到哪项服务启动了哪些进程，对于系统管理来说十分有用；

3 如何正确的杀死服务进程

在sysvinit的时代，如果需要结束一个服务及其启动的所有进程，可能会遇到一些糟糕的进程无法正确结束掉，即便是我们使用kill，killall等命令来结束它们，到了systemd的时代一切都变得不一样，systemd号称是第一个能正确的终结一项服务的程序，下面来看看具体如何操作的：

systemctl kill crond.service

或者指定一个信号发送出去

systemctl kill -s SIGKILL crond.service

例如可以像这样执行一次reload操作

systemctl kill -s HUP --kill-who=main crond.service

4 如何停止和禁用一项服务

下面我们回顾一下在sysvinit时代执行下面的命令所实现的功能

# service ntpd stop
# chkconfig ntpd off

很简单，首先停止服务，其次禁用服务那么在systemd中应该如何操作呢？

# systemctl stop ntpd.service
# systemdctl disable ntpd.service

很显然systemctl命令已经取代了service 和chkconfig两个命令的位置，不光如此，我们还能将服务设置为连人工也无法启动：

# ln -s /dev/null  /etc/systemd/system/ntpd.service
# systemctl daemon-reload

5 检查系统启动消耗时间

在过去我们可以借助第三方工具来实现对系统启动过程的耗时跟踪，现在这个功能已经集成到systemd当中，可以十分方便的了解到系统启动时在各个阶段所花费的时间：

# systemd-analyze
Startup finished in 1.669s (kernel) + 1.514s (initrd) + 7.106s (userspace) = 10.290s
以上信息简要的列出了从内核到用户空间整个引导过程大致花费的时间，如果要查看具体每项服务所花费的时间则使用如下命令：
# systemd-analyze blame
          6.468s dnf-makecache.service
          5.556s network.service
          1.022s plymouth-start.service
           812ms plymouth-quit-wait.service
           542ms lvm2-pvscan@8:7.service
           451ms systemd-udev-settle.service
           306ms firewalld.service
           246ms dmraid-activation.service
           194ms lvm2-pvscan@8:18.service
           171ms lvm2-monitor.service
           145ms bluetooth.service
           135ms accounts-daemon.service
           113ms rtkit-daemon.service
           111ms ModemManager.service
           104ms avahi-daemon.service
           102ms systemd-logind.service
            79ms systemd-vconsole-setup.service
            77ms acpid.service

输出结果类似上面这些内容，至此我们可以看到系统里每一项服务的启动时间，据此可以对系统引导过程了如指掌，如果你觉得这还不够直观，那么可以将结果导出到图像文件里面：

systemd-analyze plot >systemd.svg

该命令会将系统的启动过程输出到一张svg图像上面，更直观的展现出各个服务启动所花费的时间，在我们需要分析和优化服务启动项的时候很有帮助。

6 查看各项服务的资源使用情况

和top命令不一样，top命令更侧重于展示以进程为单位的资源状态，而systemd提供了一个命令来方便的查看每项服务的实时资源消耗状态：

# systemd-cgtop
Path                                              Tasks   %CPU   Memory  Input/s Output/s
/                                                   199   12.3     1.9G        -        -
/system.slice/ModemManager.service                    1      -        -        -        -
/system.slice/abrt-oops.service                       1      -        -        -        -
/system.slice/abrt-xorg.service                       1      -        -        -        -
/system.slice/abrtd.service                           1      -        -        -        -
/system.slice/accounts-daemon.service                 1      -        -        -        -
/system.slice/acpid.service                           1      -        -        -        -
/system.slice/alsa-state.service                      1      -        -        -        -
/system.slice/atd.service                             1      -        -        -        -
/system.slice/auditd.service                          3      -        -        -        -
/system.slice/avahi-daemon.service                    2      -        -        -        -
/system.slice/bluetooth.service                       1      -        -        -        -
/system.slice/chronyd.service                         1      -        -        -        -
/system.slice/colord.service                          1      -        -        -        -
/system.slice/crond.service                           1      -        -        -        -

7 我们必须要注意到的配置文件变更

systemd考虑到各种发行版本的用户使用习惯，尽量提供更为通用的配置文件以方便各家统一，方便用户使用，下面列出一些基本的统一的配置文件：

/etc/hostname，debian和redhat在这个配置文件上的差异导致了系统管理或多或少的不便，此文件的统一意义重大
/etc/vconsole.conf，此文件用来统一管理console和键盘映射
/etc/locale.conf 配置系统环境语系
/etc/modules-load.d/*.conf 内核模块加载配置文件
/etc/sysctl.d/*.conf 内核参数配置文件，对/etc/sysctl.conf的扩充
/etc/tmpfiles.d/*.conf 运行态临时文件配置
/etc/os-release /etc/machine-id /etc/machine-info 这三个文件的统一对系统管理员来说也是意义深远，它让我们有了统一的检测发行版本等信息的入口

以上内容仅仅让各位对systemd形成基本的认识，我们将在后期的文章中更加深入地讨论systemd的特性。最后，再次感谢作者Lennart的贡献。

参考链接：

http://cgit.freedesktop.org/systemd/

http://0pointer.de/blog/projects/

https://linuxtoy.org/archives/interview-creater-of-systemd-and-pulseaudio-lennart.html