Post

TOTP Series Part 3: The Definitive Guide to KEK Rotation, NIST Compliance, and Memory Management | TOTP 系列之三:KEK 輪換、NIST 合規與記憶體管理的終極指南

TOTP Series Part 3: The Definitive Guide to KEK Rotation, NIST Compliance, and Memory Management | TOTP 系列之三:KEK 輪換、NIST 合規與記憶體管理的終極指南

The Absolute Necessity of Key Rotation in Enterprise Systems

In Part 2 of this series, we meticulously established the Envelope Encryption pattern, ensuring that every user’s Data Encryption Key (DEK) is securely wrapped by a global Key Encryption Key (KEK). While this architecture provides robust at-rest protection, it introduces a profound operational requirement: What is our protocol if the KEK is compromised?

In enterprise software development, we do not rely on the hope that a compromise will never happen; we architect systems under the assumption that a compromise is an inevitability. Security standards such as the NIST Special Publication 800-57 (Recommendation for Key Management) explicitly mandate the regular rotation (cryptoperiod enforcement) of master cryptographic keys.

Whether a company policy requires master keys to be rotated every 90 days, or an incident occurs where a DevOps engineer with KMS access leaves the company under acrimonious circumstances, the system must possess the capability to swap the old KEK for a new one.

Rotating a master key in a live, distributed production environment is traditionally fraught with extreme engineering challenges. Questions abound:

  • Do we have to take the entire authentication system offline?
  • How do we handle race conditions where a user attempts to log in at the exact millisecond their DEK is being re-encrypted?
  • How do we propagate the new active key across horizontally scaled microservices?
  • Furthermore, when we load these raw cryptographic keys into application RAM to perform the rotation, how do we prevent them from lingering as vulnerable artifacts susceptible to memory-dumping attacks (like Heartbleed)?

In this exhaustive deep dive, we will explore the architectural design of a seamless, zero-downtime KEK Rotation mechanism in NestJS. Crucially, we will examine the V8 engine’s memory model and learn how to properly sanitize memory buffers using Node.js to ensure old keys are truly eradicated.


Deep Dive: The NIST Cryptographic Key Lifecycle

Before we write the rotation logic, we must understand the states a key passes through. According to NIST guidelines, a key is not simply “active” or “inactive”. It has a distinct lifecycle:

  1. Pre-activation: The key has been generated but is not yet authorized for use.
  2. Active: The key may be used to encrypt new data and decrypt existing data.
  3. Suspended: The key’s use is temporarily halted.
  4. Deactivated: The key may NO LONGER be used to encrypt new data, but it MUST be kept available to decrypt legacy data.
  5. Compromised: The key has been breached. Data encrypted with it must be immediately re-encrypted with a new key.
  6. Destroyed: The key material has been permanently eradicated.

During our KEK rotation, the Old KEK moves from Active to Deactivated (it cannot encrypt new DEKs, but must decrypt existing ones during the migration). The New KEK moves from Pre-activation to Active. Once the migration of all DEKs is complete, the Old KEK can transition to Destroyed.


Architecture: The Zero-Downtime Rotation Flow

Because we utilized Envelope Encryption, rotating the KEK is surprisingly efficient. We do not need to touch the millions of totp_encrypted_secret rows. We only need to decrypt the DEKs using the Old KEK and re-encrypt them using the New KEK.

The Two-Phase Migration Strategy

To avoid locking the entire users table and causing login timeouts, we must use a batch-processing cursor.

  1. Phase 1: Key Distribution (The Preparation)
    • A new KEK is generated in the KMS.
    • The application servers are notified (via Redis Pub/Sub or a configuration polling mechanism) to load BOTH the Old KEK and the New KEK into memory.
    • For all new user registrations occurring from this moment forward, the New KEK is used immediately.
    • For existing users logging in, the system attempts to decrypt their DEK with the New KEK. If it fails (because they haven’t been migrated yet), it falls back to the Old KEK. This ensures zero downtime.
  2. Phase 2: The Batch Migration (The Execution)
    • A background worker (e.g., a Cron Job or a BullMQ worker) begins iterating through the database in batches of 500 users.
    • It reads the totp_encrypted_dek.
    • It decrypts it using the Old KEK to reveal the plaintext DEK buffer.
    • It encrypts the plaintext DEK buffer using the New KEK.
    • It saves the new totp_encrypted_dek back to the database.
  3. Phase 3: Cleanup (The Eradication)
    • Once the background worker verifies that 100% of the DEKs have been migrated, the Old KEK is officially marked as Destroyed in the KMS.
    • The application servers drop the Old KEK from memory and perform a secure buffer sanitization.

Algorithm Breakdown: Re-encryption Logic with TypeORM Transactions

Let’s look at the NestJS implementation of the batch migration script. We will wrap our batch updates in TypeORM transactions to ensure that if a server crashes mid-batch, we don’t end up with partially updated, corrupted DEKs.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
// src/modules/auth/totp/kek-rotation.service.ts
import { Injectable, Logger } from '@nestjs/common';
import { DataSource } from 'typeorm';

@Injectable()
export class KekRotationService {
  private readonly logger = new Logger(KekRotationService.name);

  constructor(
    private readonly dataSource: DataSource,
    private readonly encryptionService: EncryptionService,
    private readonly kekService: KekService,
  ) {}

  /**
   * Executes the Phase 2 Batch Migration.
   * This function should ideally be triggered by an Admin API or a scheduled BullMQ Job.
   */
  async executeBatchRotation() {
    this.logger.log('Starting KEK Rotation batch process...');
    
    // 1. Fetch both keys from the KMS integration
    const oldKekBuffer = await this.kekService.getDeactivatedKek();
    const newKekBuffer = await this.kekService.getActiveKek();
    
    if (!oldKekBuffer || !newKekBuffer) {
      throw new Error('Cannot perform rotation: Missing required KEK states.');
    }

    const batchSize = 500;
    let lastProcessedId = '00000000-0000-0000-0000-000000000000'; // Cursor-based pagination
    let hasMore = true;
    let totalMigrated = 0;

    while (hasMore) {
      // Create a new QueryRunner for the transaction
      const queryRunner = this.dataSource.createQueryRunner();
      await queryRunner.connect();
      await queryRunner.startTransaction();

      try {
        // 2. Fetch users using Keysert Pagination (much faster than OFFSET)
        const users = await queryRunner.manager.createQueryBuilder(User, 'user')
          .where('user.isTotpEnabled = :isEnabled', { isEnabled: true })
          .andWhere('user.id > :lastId', { lastId: lastProcessedId })
          .orderBy('user.id', 'ASC')
          .take(batchSize)
          .getMany();

        if (users.length === 0) {
          hasMore = false;
          await queryRunner.commitTransaction();
          break;
        }

        // 3. Process the batch
        for (const user of users) {
          // Decrypt DEK with old KEK
          const rawDekHex = this.encryptionService.decrypt(
            user.totpEncryptedDek, 
            oldKekBuffer
          );
          
          // Encrypt DEK with new KEK
          const newEncryptedDek = this.encryptionService.encrypt(
            rawDekHex, 
            newKekBuffer
          );
          
          // Queue the update in the transaction
          await queryRunner.manager.update(User, user.id, {
            totpEncryptedDek: newEncryptedDek
          });
          
          lastProcessedId = user.id;
        }

        // 4. Commit the batch transaction
        await queryRunner.commitTransaction();
        totalMigrated += users.length;
        this.logger.log(`Successfully migrated ${totalMigrated} DEKs so far...`);

      } catch (error) {
        // If anything fails, rollback the ENTIRE batch of 500 to maintain consistency
        this.logger.error(`Batch migration failed at cursor ${lastProcessedId}. Rolling back.`, error);
        await queryRunner.rollbackTransaction();
        
        // Stop the process so an admin can investigate the corruption
        throw error; 
      } finally {
        await queryRunner.release();
      }
    }

    // 5. Critical: Memory Cleanup for the background worker
    this.sanitizeBuffer(oldKekBuffer);
    
    this.logger.log(`KEK Rotation completed successfully. Total migrated: ${totalMigrated}`);
  }

  /**
   * Securely wipes the cryptographic buffer from RAM.
   */
  private sanitizeBuffer(buffer: Buffer) {
    if (Buffer.isBuffer(buffer)) {
      buffer.fill(0);
    }
  }
}

Clean Code: Mastering Memory Management in Node.js

Understanding why buffer.fill(0) is necessary requires a deep dive into how the V8 engine (which powers Node.js) handles memory.

Node.js uses a Garbage Collector (GC) governed by the V8 engine. When a variable goes out of scope, the GC marks that memory space as available. Eventually, during a “Mark-and-Sweep” cycle, the memory is reclaimed. However, “eventually” is an unacceptable timeframe for cryptographic keys.

If a sophisticated attacker manages to trigger a core dump (crashing the server to analyze its memory footprint) or exploit a memory-leak vulnerability before the GC runs, any keys sitting in that “freed but not yet overwritten” memory block can be extracted.

The Immutable String Problem

Because JavaScript strings are immutable, if you store a cryptographic key as a standard string (e.g., const myKey = "super_secret_string"), you mathematically cannot wipe it. If you try to reassign it (myKey = ""), V8 simply creates a new empty string in a different memory location and points the variable there. The original "super_secret_string" remains orphaned in the heap until the GC decides to clean it up. Furthermore, the V8 engine will often create multiple hidden copies of strings and move them around during memory compaction.

The Buffer Solution

This is why enterprise security strictly dictates the use of Buffer objects for cryptographic keys.

Buffers in Node.js point to raw, unmanaged memory allocations outside the standard V8 heap (in the C++ layer). They are mutable. By explicitly calling buffer.fill(0), we instantly and deterministically overwrite the exact physical memory addresses holding our key with zeros. This completely destroys the key material before the GC is even aware.

This is a crucial security pattern that elevates a standard implementation to a true enterprise-grade architecture.


企業系統中 KEK 輪換的絕對必要性

喺呢個系列嘅第 2 篇文入面,我哋極之仔細咁建立咗信封加密 (Envelope Encryption) 架構,確保每一位用戶嘅資料加密金鑰 (DEK) 都受到一條全局嘅金鑰加密金鑰 (KEK) 嚴密保護。雖然呢個架構提供咗極度強大嘅靜態資料保護 (At-rest protection),但佢同時帶嚟一個深層次嘅營運考量:如果條 KEK 俾人 Hack 咗,我哋嘅應變方案係咩?

喺企業級軟件開發嘅世界,我哋絕對唔可以抱住「希望永遠都唔會出事」嘅僥倖心態;我哋設計系統嘅大前題,係假設系統「遲早一定會被入侵」。安全標準,例如 NIST Special Publication 800-57 (密碼學金鑰管理建議),明確強制規定主密碼學金鑰必須定期進行輪換 (Cryptoperiod enforcement)。

無論係因為公司 Policy 規定 Master keys 必須每 90 日強制更換一次,定係發生咗突發事件——例如一個擁有 KMS 權限嘅 DevOps 工程師同公司反面離職,系統都必須要具備無縫替換新舊 KEK 嘅能力。

喺一個 Live 嘅、分散式 (Distributed) 嘅 Production 環境裡面換 Master key,傳統上係一項極度艱巨嘅工程挑戰。隨之而來嘅問題一大堆:

  • 我哋係咪要將成個 Authentication 系統停機 (Downtime) 來做?
  • 點樣處理 Race condition (競爭危害)?例如有個用戶啱啱好喺佢條 DEK 被重新加密嗰一毫秒走去登入,點算?
  • 點樣將新生效嘅鎖匙,同步廣播去水平擴展 (Horizontally scaled) 咗嘅 Microservices 度?
  • 仲有,當我哋將呢啲原始嘅密碼學金鑰 Load 入 Application 嘅 RAM 裡面進行輪換運算嗰陣,我哋點樣防止佢哋變成殘留嘅高危物件,避免俾 Memory-dumping (記憶體傾印) 攻擊 (好似 Heartbleed 咁) 偷走?

喺呢篇極度詳盡嘅深度探討入面,我哋會研究一個喺 NestJS 入面做到「零停機時間 (Zero-downtime)」嘅 KEK 輪換機制嘅架構設計。最關鍵嘅係,我哋會剖析 V8 Engine 嘅記憶體模型,並學習點樣喺 Node.js 裡面正確地清理記憶體 (Memory sanitization),確保舊鎖匙被真正、徹底咁消滅。


深度探討:NIST 密碼學金鑰生命週期

喺我哋落手寫 Rotation logic 之前,我哋必須要了解一條鎖匙會經歷嘅各種狀態 (States)。根據 NIST 嘅指引,一條鎖匙絕對唔係只有「生效 (Active)」同「失效 (Inactive)」咁簡單。佢有一個完整嘅生命週期:

  1. 預先啟動 (Pre-activation): 鎖匙已經 Generate 咗出嚟,但仲未獲授權使用。
  2. 生效 (Active): 鎖匙可以用嚟加密新 Data,亦都可以用嚟解密舊 Data。
  3. 暫停 (Suspended): 鎖匙嘅使用被短暫中止。
  4. 停用 (Deactivated): 鎖匙 絕對唔可以 再用嚟加密新 Data,但係佢 必須 繼續保留,用嚟解密以前留落嚟嘅舊 Data。
  5. 受損/外洩 (Compromised): 鎖匙已經被黑客攻破。任何用佢加密過嘅 Data 必須即時用新鎖匙重新加密。
  6. 銷毀 (Destroyed): 鎖匙嘅物理材料已經被永久、徹底咁抹除。

喺我哋嘅 KEK 輪換過程入面,舊 KEK 會由 生效 轉移去 停用 (佢唔可以再加密新 DEK,但喺 Migration 期間必須要用佢嚟解密)。新 KEK 會由 預先啟動 轉移去 生效。當所有 DEK 嘅 Migration 百分百完成之後,舊 KEK 就可以功成身退,正式過渡去 銷毀 狀態。


架構設計:零停機時間輪換流程 (Zero-Downtime Rotation Flow)

正正因為我哋採用咗 Envelope Encryption,Rotate KEK 嘅效率先可以高得咁驚人。我哋完全唔需要掂 Database 入面幾百萬行嘅 totp_encrypted_secret。我哋只需要用舊 KEK 將啲 DEK 解密,然後用新 KEK 重新加密佢哋就搞掂。

兩階段遷移策略 (Two-Phase Migration Strategy)

為咗避免鎖死 (Lock) 成張 users table 而搞到其他用戶登入 Timeout,我哋必須要用 Batch-processing (分批處理) 嘅 Cursor 模式。

  1. 第一階段:金鑰派發 (準備期)
    • 喺 KMS (金鑰管理系統) 入面 Generate 一條新 KEK。
    • 通知所有 Application servers (可以透過 Redis Pub/Sub 或者 Config 輪詢機制),叫佢哋將舊 KEK 同新 KEK 同時 Load 入 Memory。
    • 由呢一刻開始,所有 新註冊 嘅用戶,會即刻用新 KEK 嚟做加密。
    • 對於現有用戶登入,系統會先嘗試用新 KEK 嚟解密佢哋條 DEK。如果失敗 (因為佢哋個 Record 仲未被 Migrate),系統會自動 Fallback (退回) 用舊 KEK 嚟解密。呢個絕妙嘅設計確保咗系統 零停機時間
  2. 第二階段:分批遷移 (執行期)
    • 一個 Background worker (例如 Cron Job 或者 BullMQ worker) 開始以每批 500 個用戶嘅速度,逐批讀取 Database。
    • 讀取 totp_encrypted_dek
    • 用舊 KEK 解密,還原做 Plaintext DEK buffer。
    • 用新 KEK 將 Plaintext DEK buffer 重新加密。
    • 將新嘅 totp_encrypted_dek Save 返落 Database。
  3. 第三階段:大掃除 (銷毀期)
    • 當 Background worker 確認 100% 嘅 DEK 都已經 Migrate 完畢,舊 KEK 就會喺 KMS 入面被正式標記為「銷毀 (Destroyed)」。
    • 所有 Application servers 會將舊 KEK 由 Memory 中丟棄,並執行嚴格嘅安全 Buffer 抹除程序。

演算法解碼:結合 TypeORM Transaction 嘅重新加密邏輯

我哋嚟睇吓 NestJS 裡面負責執行 Batch migration script 嘅具體實作。我哋會將每一批嘅 Update 包裝喺 TypeORM 嘅 Transaction (交易) 入面,確保如果 Server 喺處理到一半嗰陣突然 Crash 咗,我哋都唔會留低一堆改咗一半、Corrupted (損毀) 嘅 DEK。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
// src/modules/auth/totp/kek-rotation.service.ts
import { Injectable, Logger } from '@nestjs/common';
import { DataSource } from 'typeorm';

@Injectable()
export class KekRotationService {
  private readonly logger = new Logger(KekRotationService.name);

  constructor(
    private readonly dataSource: DataSource,
    private readonly encryptionService: EncryptionService,
    private readonly kekService: KekService,
  ) {}

  /**
   * 執行第二階段嘅 Batch Migration。
   * 喺完美嘅架構下,呢個 Function 應該由 Admin API 或者 Schedule 咗嘅 BullMQ Job 嚟觸發。
   */
  async executeBatchRotation() {
    this.logger.log('開始執行 KEK Rotation 分批處理...');
    
    // 1. 由 KMS 整合模組提取齊兩條鎖匙
    const oldKekBuffer = await this.kekService.getDeactivatedKek();
    const newKekBuffer = await this.kekService.getActiveKek();
    
    if (!oldKekBuffer || !newKekBuffer) {
      throw new Error('無法進行輪換:缺少必要嘅 KEK 狀態。');
    }

    const batchSize = 500;
    let lastProcessedId = '00000000-0000-0000-0000-000000000000'; // Cursor-based 分頁法
    let hasMore = true;
    let totalMigrated = 0;

    while (hasMore) {
      // 為呢個 Transaction 建立一個新嘅 QueryRunner
      const queryRunner = this.dataSource.createQueryRunner();
      await queryRunner.connect();
      await queryRunner.startTransaction();

      try {
        // 2. 用 Keysert Pagination 提取用戶 (效能比普通嘅 OFFSET 快極多)
        const users = await queryRunner.manager.createQueryBuilder(User, 'user')
          .where('user.isTotpEnabled = :isEnabled', { isEnabled: true })
          .andWhere('user.id > :lastId', { lastId: lastProcessedId })
          .orderBy('user.id', 'ASC')
          .take(batchSize)
          .getMany();

        if (users.length === 0) {
          hasMore = false;
          await queryRunner.commitTransaction();
          break;
        }

        // 3. 處理呢一批 User
        for (const user of users) {
          // 用舊 KEK 解密 DEK
          const rawDekHex = this.encryptionService.decrypt(
            user.totpEncryptedDek, 
            oldKekBuffer
          );
          
          // 用新 KEK 重新加密 DEK
          const newEncryptedDek = this.encryptionService.encrypt(
            rawDekHex, 
            newKekBuffer
          );
          
          // 將 Update 動作放入 Transaction Queue 入面
          await queryRunner.manager.update(User, user.id, {
            totpEncryptedDek: newEncryptedDek
          });
          
          lastProcessedId = user.id;
        }

        // 4. 正式 Commit 呢批 Transaction
        await queryRunner.commitTransaction();
        totalMigrated += users.length;
        this.logger.log(`進度良好,目前已成功 Migrate 咗 ${totalMigrated} 條 DEK...`);

      } catch (error) {
        // 如果中間有任何差池,即刻 Rollback 晒成批 500 個 Record,確保資料一致性
        this.logger.error(`Batch migration 去到 Cursor ${lastProcessedId} 發生嚴重錯誤。正在 Rollback。`, error);
        await queryRunner.rollbackTransaction();
        
        // 中止成個 Process,等 Admin 嚟調查有咩 Data 損毀咗
        throw error; 
      } finally {
        await queryRunner.release();
      }
    }

    // 5. 極度重要:為 Background worker 進行 Memory 清理
    this.sanitizeBuffer(oldKekBuffer);
    
    this.logger.log(`KEK Rotation 完美完成。總共 Migrate 數量:${totalMigrated}`);
  }

  /**
   * 非常安全咁將密碼學 Buffer 裡面嘅 Data 抹走。
   */
  private sanitizeBuffer(buffer: Buffer) {
    if (Buffer.isBuffer(buffer)) {
      buffer.fill(0);
    }
  }
}

Clean Code 原則:精通 Node.js 嘅記憶體管理

要明白點解 buffer.fill(0) 係生死攸關嘅操作,我哋必須要深入拆解 V8 Engine (Node.js 嘅核心) 係點樣處理 Memory 嘅。

Node.js 用嘅係由 V8 Engine 掌管嘅 Garbage Collector (GC - 垃圾回收機制)。當一個 Variable 離開咗 Scope 之後,GC 會將嗰段 Memory 空間標記為「可用」。最終,喺一個叫 “Mark-and-Sweep” 嘅週期入面,嗰段 Memory 先會真正被回收。但係,對於高度敏感嘅密碼學金鑰來講,「最終會回收」呢個時間框架係絕對不能接受嘅。

如果一個高階黑客成功觸發咗 Core dump (強行搞 Crash 個 Server 嚟擷取並分析佢嘅 Memory footprint),或者喺 GC 運行之前利用咗好似 Heartbleed 呢類 memory-leak 漏洞,任何放喺「Free 咗但仲未被 Overwrite」區域嘅鎖匙,都可以被輕易抽出來。

不可變字串嘅死穴 (The Immutable String Problem)

因為 JavaScript 嘅 String 係 不可變嘅 (Immutable),如果你將一條密碼學金鑰當做普通 String 咁存 (例如 const myKey = "super_secret_string"),你喺數學上同物理上係冇辦法人手抹走佢嘅。如果你嘗試去改佢 (myKey = ""),V8 只會喺另一忽 Memory 度 Create 一個新嘅空字串,然後將個 Variable 指向嗰度。原本嗰個 "super_secret_string" 會變成 Heap 入面嘅孤兒,一直殘留喺度,直到 GC 心情好決定去清走佢為止。更慘嘅係,V8 Engine 為咗效能,成日會喺做 Memory compaction 嗰陣,偷偷地 Create 好多個隱藏嘅 String copies 搬來搬去。

Buffer 嘅終極解法 (The Buffer Solution)

就係因為咁,企業級安全守則嚴格規定:處理密碼學鎖匙絕對只可以用 Buffer objects!

Node.js 嘅 Buffer 指向嘅係標準 V8 Heap 以外、由 C++ 底層管理嘅 Raw memory allocation。佢哋係 可變嘅 (Mutable)。透過明確咁 Call buffer.fill(0),我哋可以瞬間、而且百分百確定咁,將裝住鎖匙嘅嗰忽實體物理 Memory 地址全部填滿 0。呢招會喺 GC 察覺之前,將鎖匙材料徹底、永久地摧毀。

呢個係一個極之關鍵嘅 Security pattern,掌握佢,就係將普通嘅 Implementation 昇華到真正企業級神壇嘅必經之路!

This post is licensed under CC BY 4.0 by the author.