A Rare Case of Index Invalidation

Preface

Heartache. Today I ran into yet another truly bizarre phenomenon, one I would call extremely rare, so I'm sharing it here.

The Phenomenon

The most common optimization for a slow SQL statement is to build an index. For a DBA, since PostgreSQL has nothing like MySQL's virtual indexes (though you can use hypopg), the usual approach is to create the index inside a transaction and then check whether it takes effect, i.e. whether the plan uses it. But today I hit a strange case: inside the transaction the planner flatly refused to use the index, yet right after commit the index was used normally!
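
For reference, the usual test pattern looks like this (a minimal sketch; the table and column names are made up, and rolling back discards the index so nothing persists):

begin;
create index idx_tmp on some_table (some_col);        -- hypothetical names
explain select * from some_table where some_col = 1;  -- does the plan use it?
rollback;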

My first suspicion was a pitfall I had run into before: a user-level parameter, e.g. enable_indexscan set to off for that user. It wasn't that. Then, after a lot of thrashing, I created a type-casting functional index (create index myidx on test (cast(info as varchar))), because the SQL involved a type conversion. Of course, none of this helped either.
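
Per-user and per-database overrides of that kind live in the pg_db_role_setting catalog; a quick hedged check (the catalogs are standard, the join is just my sketch):

select coalesce(d.datname, '(all databases)') as database,
       coalesce(r.rolname, '(all roles)') as role,
       s.setconfig                         -- e.g. {enable_indexscan=off}
from pg_db_role_setting s
     left join pg_database d on d.oid = s.setdatabase
     left join pg_roles r on r.oid = s.setrole;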

Just as I was starting to question everything, I took a look at the pg_index system catalog, where one column, indcheckxmin, looked suspicious.

If true, queries must not use the index until the xmin of this pg_index row is below their TransactionXmin event horizon, because the table may contain broken HOT chains with incompatible rows that they can see

Let's see the actual effect:

postgres=# begin;
BEGIN
postgres=# create index myidx on t1(id);
CREATE INDEX
postgres=# explain select * from t1 where id = 100;
-- normally, the index is used
                             QUERY PLAN
---------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=4.67..52.52 rows=50 width=4)
   Recheck Cond: (id = 100)
   ->  Bitmap Index Scan on myidx  (cost=0.00..4.66 rows=50 width=0)
         Index Cond: (id = 100)
(4 rows)

postgres=# update pg_index set indcheckxmin = 'true' where indexrelid = 'myidx'::regclass;
-- simulate the flag being set
UPDATE 1
postgres=# explain select * from t1 where id = 100;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on t1  (cost=0.00..170.00 rows=50 width=4)
   Filter: (id = 100)
(2 rows)

postgres=# set enable_seqscan to off;
-- even with the disable cost applied, a sequential scan is still chosen
SET
postgres=# explain select * from t1 where id = 100;
                               QUERY PLAN
-----------------------------------------------------------------------
 Seq Scan on t1  (cost=10000000000.00..10000000170.00 rows=50 width=4)
   Filter: (id = 100)
(2 rows)

postgres=# commit ;
COMMIT
postgres=# explain select * from t1 where id = 100;
-- after the transaction commits
                             QUERY PLAN
---------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=4.67..61.54 rows=50 width=36)
   Recheck Cond: (id = 100)
   ->  Bitmap Index Scan on myidx  (cost=0.00..4.66 rows=50 width=0)
         Index Cond: (id = 100)
(4 rows)

postgres=# select indcheckxmin from pg_index where indexrelid = 'myidx'::regclass;
indcheckxmin
--------------
t
(1 row)

As you can see, once indcheckxmin in pg_index becomes true, the index is unusable; after the transaction commits, it becomes usable again.
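
If you suspect you are hitting this, a quick way to list affected indexes (xmin here is the hidden system column of the pg_index row itself):

select indexrelid::regclass as index, xmin, indcheckxmin
from pg_index
where indcheckxmin;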

So the key question becomes: when does indcheckxmin get set to true? First look at the note in the official documentation; unfortunately, the explanation there is rather puzzling:

If true, queries must not use the index until the xmin of this pg_index row is below their TransactionXmin event horizon, because the table may contain broken HOT chains with incompatible rows that they can see

It does, however, mention broken HOT chains. Here is the definition:

Broken HOT Chain

A HOT chain in which the key value for an index has changed.

This is not allowed to occur normally but if a new index is created
it can happen. In that case various strategies are used to ensure
that no transaction for which the older tuples are visible can
use the index.

So under what circumstances do HOT chains break? I won't rehash normal HOT behavior here; a "broken chain" means HOT can no longer be applied, i.e. one of HOT's restrictions has been hit:

1. The updated tuple and the old tuple do not land on the same page: an old/new tuple chain cannot span pages.
2. An indexed key value is updated: the key in the existing index entry can no longer locate the correct tuple, so a new index entry has to be inserted into the index page.

Taking the wording at face value: the table may contain broken HOT chains, and therefore tuples that certain snapshots can see but that should not be reachable through the new index; in that case indcheckxmin is set to true and the index becomes unusable for them. The source code carries the same logic (note early_pruning_enabled):

if (index_bad ||
    (indexForm->indcheckxmin && !indexInfo->ii_BrokenHotChain) ||
    early_pruning_enabled)
{
    if (!indexInfo->ii_BrokenHotChain && !early_pruning_enabled)
        indexForm->indcheckxmin = false;
    else if (index_bad || early_pruning_enabled)
        indexForm->indcheckxmin = true;
    indexForm->indisvalid = true;
    indexForm->indisready = true;
    indexForm->indislive = true;
    CatalogTupleUpdate(pg_index, &indexTuple->t_self, indexTuple);
}
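
Putting the pieces together, here is a hedged repro sketch (table and index names are made up): HOT-update a column that is not yet indexed, so the chain keeps old versions with a different key; hold a second session's snapshot open so those versions stay RECENTLY_DEAD rather than DEAD; then build the index. Under these assumptions, CREATE INDEX itself should set indcheckxmin.

-- session 1: leave free space on each page so the updates can stay HOT
create table t3 (id int, info text) with (fillfactor = 50);
insert into t3 select i, 'x' from generate_series(1, 1000) i;

-- session 2: take a snapshot before the update and keep it open
begin isolation level repeatable read;
select count(*) from t3;

-- session 1: HOT updates (no index on id yet), then the index build
update t3 set id = id + 1;
create index myidx3 on t3 (id);
select indcheckxmin from pg_index
where indexrelid = 'myidx3'::regclass;    -- expect t while session 2 is open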

As for when HOT-chain pruning actually happens, interdb has a line about that too; in a word: complicated!

The pruning processing will be executed, if possible, when a SQL command is executed such as SELECT, UPDATE, INSERT and DELETE. The exact execution timing is not described in this chapter because it is very complicated. The details are described in the README.HOT file.
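
Whether HOT updates are happening at all can at least be observed from the statistics views; for the hypothetical t3 above:

select relname, n_tup_upd, n_tup_hot_upd  -- n_tup_hot_upd counts HOT updates
from pg_stat_user_tables
where relname = 't3';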

Under normal circumstances:

postgres=# begin;
BEGIN
postgres=# create index myidx on test(id);
CREATE INDEX
postgres=# select txid_current();
txid_current
--------------
628
(1 row)

postgres=# select txid_current_snapshot();
txid_current_snapshot
-----------------------
628:628:
(1 row)

postgres=# select xmin,xmax,indcheckxmin from pg_index where indexrelid='myidx'::regclass;
xmin | xmax | indcheckxmin
------+------+--------------
628 | 0 | f
(1 row)

In the abnormal case, i.e. what happened in production, indcheckxmin is true, and you can see that the xmin of the inserted pg_index row is, surprisingly, the current transaction ID plus one. By the snapshot's visibility rules, anything at or above its xmax (16415) is not visible yet, so this index cannot be used.

postgres=# begin;
BEGIN
postgres=# create index myidx2 on t2(id);
CREATE INDEX
postgres=# select txid_current();
txid_current
--------------
16415
(1 row)

postgres=# select txid_current_snapshot();
txid_current_snapshot
-----------------------
16415:16415:
(1 row)

postgres=# select xmin,xmax,indcheckxmin from pg_index where indexrelid='myidx'::regclass;
xmin | xmax | indcheckxmin
-------+------+--------------
16416 | 0 | t
(1 row)

src/backend/access/heap/README.HOT documents this behavior. Note this sentence in particular: the index is unusable precisely because the xmin of its pg_index row is still greater than the transaction's own txid.

Transactions are allowed to use such an index only after pg_index.xmin is below their TransactionXmin horizon
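
At the SQL level you can compare the two numbers yourself; a hedged check, using the demo index name from above:

select xmin as pg_index_xmin,
       txid_snapshot_xmin(txid_current_snapshot()) as snapshot_xmin
from pg_index
where indexrelid = 'myidx'::regclass;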

Reading the comment carefully:

CREATE INDEX
------------

CREATE INDEX presents a problem for HOT updates. While the existing HOT
chains all have the same index values for existing indexes, the columns
in the new index might change within a pre-existing HOT chain, creating
a "broken" chain that can't be indexed properly.

To address this issue, regular (non-concurrent) CREATE INDEX makes the
new index usable only by new transactions and transactions that don't
have snapshots older than the CREATE INDEX command. This prevents
queries that can see the inconsistent HOT chains from trying to use the
new index and getting incorrect results. Queries that can see the index
can only see the rows that were visible after the index was created,
hence the HOT chains are consistent for them.

Entries in the new index point to root tuples (tuples with current index
pointers) so that our index uses the same index pointers as all other
indexes on the table. However the row we want to index is actually at
the *end* of the chain, ie, the most recent live tuple on the HOT chain.
That is the one we compute the index entry values for, but the TID
we put into the index is that of the root tuple. Since queries that
will be allowed to use the new index cannot see any of the older tuple
versions in the chain, the fact that they might not match the index entry
isn't a problem. (Such queries will check the tuple visibility
information of the older versions and ignore them, without ever looking at
their contents, so the content inconsistency is OK.) Subsequent updates
to the live tuple will be allowed to extend the HOT chain only if they are
HOT-safe for all the indexes.

Because we have ShareLock on the table, any DELETE_IN_PROGRESS or
INSERT_IN_PROGRESS tuples should have come from our own transaction.
Therefore we can consider them committed since if the CREATE INDEX
commits, they will be committed, and if it aborts the index is discarded.
An exception to this is that early lock release is customary for system
catalog updates, and so we might find such tuples when reindexing a system
catalog. In that case we deal with it by waiting for the source
transaction to commit or roll back. (We could do that for user tables
too, but since the case is unexpected we prefer to throw an error.)

Practically, we prevent certain transactions from using the new index by
setting pg_index.indcheckxmin to TRUE. Transactions are allowed to use
such an index only after pg_index.xmin is below their TransactionXmin
horizon, thereby ensuring that any incompatible rows in HOT chains are
dead to them. (pg_index.xmin will be the XID of the CREATE INDEX
transaction. The reason for using xmin rather than a normal column is
that the regular vacuum freezing mechanism will take care of converting
xmin to FrozenTransactionId before it can wrap around.)

This means in particular that the transaction creating the index will be
unable to use the index if the transaction has old snapshots. We
alleviate that problem somewhat by not setting indcheckxmin unless the
table actually contains HOT chains with RECENTLY_DEAD members.

Another unpleasant consequence is that it is now risky to use SnapshotAny
in an index scan: if the index was created more recently than the last
vacuum, it's possible that some of the visited tuples do not match the
index entry they are linked to. This does not seem to be a fatal
objection, since there are few users of SnapshotAny and most use seqscans.
The only exception at this writing is CLUSTER, which is okay because it
does not require perfect ordering of the indexscan readout (and especially
so because CLUSTER tends to write recently-dead tuples out of order anyway).

CREATE INDEX CONCURRENTLY
-------------------------

In the concurrent case we must take a different approach. We create the
pg_index entry immediately, before we scan the table. The pg_index entry
is marked as "not ready for inserts". Then we commit and wait for any
transactions which have the table open to finish. This ensures that no
new HOT updates will change the key value for our new index, because all
transactions will see the existence of the index and will respect its
constraint on which updates can be HOT. Other transactions must include
such an index when determining HOT-safety of updates, even though they
must ignore it for both insertion and searching purposes.

We must do this to avoid making incorrect index entries. For example,
suppose we are building an index on column X and we make an index entry for
a non-HOT tuple with X=1. Then some other backend, unaware that X is an
indexed column, HOT-updates the row to have X=2, and commits. We now have
an index entry for X=1 pointing at a HOT chain whose live row has X=2.
We could make an index entry with X=2 during the validation pass, but
there is no nice way to get rid of the wrong entry with X=1. So we must
have the HOT-safety property enforced before we start to build the new
index.

After waiting for transactions which had the table open, we build the index
for all rows that are valid in a fresh snapshot. Any tuples visible in the
snapshot will have only valid forward-growing HOT chains. (They might have
older HOT updates behind them which are broken, but this is OK for the same
reason it's OK in a regular index build.) As above, we point the index
entry at the root of the HOT-update chain but we use the key value from the
live tuple.

We mark the index open for inserts (but still not ready for reads) then
we again wait for transactions which have the table open. Then we take
a second reference snapshot and validate the index. This searches for
tuples missing from the index, and inserts any missing ones. Again,
the index entries have to have TIDs equal to HOT-chain root TIDs, but
the value to be inserted is the one from the live tuple.

Then we wait until every transaction that could have a snapshot older than
the second reference snapshot is finished. This ensures that nobody is
alive any longer who could need to see any tuples that might be missing
from the index, as well as ensuring that no one can see any inconsistent
rows in a broken HOT chain (the first condition is stronger than the
second). Finally, we can mark the index valid for searches.

Note that we do not need to set pg_index.indcheckxmin in this code path,
because we have outwaited any transactions that would need to avoid using
the index. (indcheckxmin is only needed because non-concurrent CREATE
INDEX doesn't want to wait; its stronger lock would create too much risk of
deadlock if it did.)

All right, I give up; even after a careful read, I didn't fully digest it.

Summary

For developers this problem is barely noticeable, since the index is only unusable within the transaction. It is, however, exactly the kind of thing that caused this morning's predicament: I kept assuming it was selectivity, type incompatibility, or a user-level parameter. So the next time something like this happens, also take a look at the indcheckxmin column in pg_index and check whether it is true. Another takeaway: this phenomenon occurs when broken HOT chains exist.

Incidentally, buffer-pin conflicts in streaming replication are also related to HOT, i.e. the confl_bufferpin column of pg_stat_database_conflicts:

One way to reduce the need for VACUUM is to use HOT updates. Then any query on the primary that accesses a page with dead heap-only tuples and can get an exclusive lock on it will prune the HOT chains. PostgreSQL always holds such page locks for a short time, so there is no conflict with processing on the primary. There are other causes for page locks, but this is perhaps the most frequent one.

When the standby server should replay such an exclusive page lock and a query is using the page ("has the page pinned" in PostgreSQL jargon), you get a buffer pin replication conflict. Pages can be pinned for a while, for example during a sequential scan of a table on the outer side of a nested loop join.

HOT chain pruning can of course also lead to snapshot replication conflicts.
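
On a standby, these conflicts can be inspected per database:

select datname, confl_snapshot, confl_bufferpin
from pg_stat_database_conflicts;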

I'll come back to this problem later, once my skills have deepened a bit.

References

https://github.com/postgres/postgres/blob/master/src/backend/access/heap/README.HOT

https://www.postgresql.org/message-id/27473.1189896544@sss.pgh.pa.us

https://www.interdb.jp/pg/pgsql07.html

http://mysql.taobao.org/monthly/2020/09/05/

