| author | Peter Dillinger <peterd@meta.com> | 2023-12-14 22:13:32 -0800 |
|---|---|---|
| committer | Facebook GitHub Bot <facebook-github-bot@users.noreply.github.com> | 2023-12-14 22:13:32 -0800 |
| commit | 88bc91f3cc2b492b8a45ba2c49650f527df97ad8 (patch) | |
| tree | fe23f5a35a716feb95801e313d7d986634b153f6 /cache | |
| parent | cd577f605948894b51fbaab39d1df03a04dfd70f (diff) | |
Cap eviction effort (CPU under stress) in HyperClockCache (#12141)
Summary:
HyperClockCache is intended to mitigate performance problems under stress conditions (as well as to optimize average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are serialized through a single mutex, which is held while each access updates the LRU list. All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem.
However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable because they are pinned. Unlike LRU, which can find an evictable entry immediately (at the cost of more coordination/synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches, not GB-scale), performance can fall off a cliff: the CPU cost of each cache miss balloons and bogs down the whole system.
To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API.
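For illustration, here is a minimal sketch of setting the new parameter when constructing the cache. The helper name and the capacity are placeholders; `eviction_effort_cap` is the field added by this change, and the rest is existing `HyperClockCacheOptions` API:

```
#include <memory>

#include "rocksdb/cache.h"

using ROCKSDB_NAMESPACE::Cache;
using ROCKSDB_NAMESPACE::HyperClockCacheOptions;

// Hypothetical helper: builds an auto-variant HyperClockCache with the
// eviction effort cap set explicitly (30 is also the new default).
std::shared_ptr<Cache> MakeCappedHyperClockCache() {
  HyperClockCacheOptions opts(/*_capacity=*/size_t{1} << 30,
                              /*_estimated_entry_charge=*/0);  // 0 => AutoHCC
  // Cap CPU spent searching for evictable entries when (almost) all entries
  // in a shard are pinned. See the public API comments for details.
  opts.eviction_effort_cap = 30;
  return opts.MakeSharedCache();
}
```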
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141
Test Plan:
unit test included
## Performance test
We can use cache_bench to validate that there is no regression (CPU and memory) in normal operation, and to measure the change in behavior when the cache is almost entirely pinned. (TODO: I'm not sure why I had to push the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to remove most jemalloc variance from the results, so that the max RSS reported by /usr/bin/time is close to ideal (assuming the allocator keeps fragmentation and other memory overheads low). Base command reproducing the bad behavior:
```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```
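The exact measurement wrapper is not shown; an assumed form, combining the MALLOC_CONF setting and /usr/bin/time mentioned above, would be:

```
MALLOC_CONF="narenas:1" /usr/bin/time \
  ./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```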
```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident
Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```
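The "After" numbers below were presumably gathered by appending the new `-eviction_effort_cap` flag (added to cache_bench in this change) to the same base command, e.g.:

```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7 \
  -eviction_effort_cap=30
```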
Now let us sample a range of values in the solution space:
```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident
After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident
After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident
After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident
After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```
It looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so it is chosen as the default.
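To make the meaning of that value concrete: the check added in clock_cache.cc (see the diff below) is essentially a ratio test, giving up on the current eviction attempt once the number of pinned entries scanned reaches roughly `eviction_effort_cap` times the number of entries actually freed, with a small start-up allowance:

```
bool IsEvictionEffortExceeded(const BaseClockTable::EvictionData& data,
                              uint32_t eviction_effort_cap) {
  // Basically checks whether the ratio of useful effort to wasted effort is
  // too low, with a start-up allowance for wasted effort before any useful
  // effort.
  return (data.freed_count + 1U) * uint64_t{eviction_effort_cap} <=
         data.seen_pinned_count;
}
```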
```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident
After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```
The CPU difference above is slight (about 1% lower ops/sec than uncapped AutoHCC, still ahead of the LRU baseline), consistent with the cap adding only minor overhead, and there is no measurable memory overhead under moderate stress.
```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident
After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```
No measurable difference under normal circumstances.
Some of these tests were repeated with FixedHCC, with similar results.
Reviewed By: anand1976
Differential Revision: D52174755
Pulled By: pdillinger
fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8
Diffstat (limited to 'cache')
| -rw-r--r-- | cache/cache_bench_tool.cc | 5 |
|---|---|---|
| -rw-r--r-- | cache/clock_cache.cc | 135 |
| -rw-r--r-- | cache/clock_cache.h | 68 |
| -rw-r--r-- | cache/compressed_secondary_cache_test.cc | 2 |
| -rw-r--r-- | cache/lru_cache_test.cc | 60 |
5 files changed, 208 insertions, 62 deletions
diff --git a/cache/cache_bench_tool.cc b/cache/cache_bench_tool.cc index 89945abf7..07cd1a1f6 100644 --- a/cache/cache_bench_tool.cc +++ b/cache/cache_bench_tool.cc @@ -48,6 +48,10 @@ DEFINE_uint64(cache_size, 1 * GiB, "Number of bytes to use as a cache of uncompressed data."); DEFINE_int32(num_shard_bits, -1, "ShardedCacheOptions::shard_bits. Default = auto"); +DEFINE_int32( + eviction_effort_cap, + ROCKSDB_NAMESPACE::HyperClockCacheOptions(1, 1).eviction_effort_cap, + "HyperClockCacheOptions::eviction_effort_cap"); DEFINE_double(resident_ratio, 0.25, "Ratio of keys fitting in cache to keyspace."); @@ -391,6 +395,7 @@ class CacheBench { FLAGS_cache_size, /*estimated_entry_charge=*/0, FLAGS_num_shard_bits); opts.hash_seed = BitwiseAnd(FLAGS_seed, INT32_MAX); opts.memory_allocator = allocator; + opts.eviction_effort_cap = FLAGS_eviction_effort_cap; if (FLAGS_cache_type == "fixed_hyper_clock_cache" || FLAGS_cache_type == "hyper_clock_cache") { opts.estimated_entry_charge = FLAGS_value_bytes_estimate > 0 diff --git a/cache/clock_cache.cc b/cache/clock_cache.cc index e37c03fb5..de2e56186 100644 --- a/cache/clock_cache.cc +++ b/cache/clock_cache.cc @@ -15,6 +15,7 @@ #include <cassert> #include <cinttypes> #include <cstddef> +#include <cstdint> #include <exception> #include <functional> #include <numeric> @@ -94,7 +95,8 @@ inline void Unref(const ClockHandle& h, uint64_t count = 1) { (void)old_meta; } -inline bool ClockUpdate(ClockHandle& h, bool* purgeable = nullptr) { +inline bool ClockUpdate(ClockHandle& h, BaseClockTable::EvictionData* data, + bool* purgeable = nullptr) { uint64_t meta; if (purgeable) { assert(*purgeable == false); @@ -126,6 +128,7 @@ inline bool ClockUpdate(ClockHandle& h, bool* purgeable = nullptr) { (meta >> ClockHandle::kReleaseCounterShift) & ClockHandle::kCounterMask; if (acquire_count != release_count) { // Only clock update entries with no outstanding refs + data->seen_pinned_count++; return false; } if ((meta >> ClockHandle::kStateShift == ClockHandle::kStateVisible) && @@ -149,6 +152,8 @@ inline bool ClockUpdate(ClockHandle& h, bool* purgeable = nullptr) { << ClockHandle::kStateShift) | (meta & ClockHandle::kHitBitMask))) { // Took ownership. + data->freed_charge += h.GetTotalCharge(); + data->freed_count += 1; return true; } else { // Compare-exchange failing probably @@ -356,6 +361,18 @@ void ConstApplyToEntriesRange(const Func& func, const HandleImpl* begin, } } +constexpr uint32_t kStrictCapacityLimitBit = 1u << 31; + +uint32_t SanitizeEncodeEecAndScl(int eviction_effort_cap, + bool strict_capacit_limit) { + eviction_effort_cap = std::max(int{1}, eviction_effort_cap); + eviction_effort_cap = + std::min(static_cast<int>(~kStrictCapacityLimitBit), eviction_effort_cap); + uint32_t eec_and_scl = static_cast<uint32_t>(eviction_effort_cap); + eec_and_scl |= strict_capacit_limit ? 
kStrictCapacityLimitBit : 0; + return eec_and_scl; +} + } // namespace void ClockHandleBasicData::FreeData(MemoryAllocator* allocator) const { @@ -385,17 +402,20 @@ HandleImpl* BaseClockTable::StandaloneInsert( template <class Table> typename Table::HandleImpl* BaseClockTable::CreateStandalone( - ClockHandleBasicData& proto, size_t capacity, bool strict_capacity_limit, + ClockHandleBasicData& proto, size_t capacity, uint32_t eec_and_scl, bool allow_uncharged) { Table& derived = static_cast<Table&>(*this); typename Table::InsertState state; derived.StartInsert(state); const size_t total_charge = proto.GetTotalCharge(); - if (strict_capacity_limit) { + // NOTE: we can use eec_and_scl as eviction_effort_cap below because + // strict_capacity_limit=true is supposed to disable the limit on eviction + // effort, and a large value effectively does that. + if (eec_and_scl & kStrictCapacityLimitBit) { Status s = ChargeUsageMaybeEvictStrict<Table>( total_charge, capacity, - /*need_evict_for_occupancy=*/false, state); + /*need_evict_for_occupancy=*/false, eec_and_scl, state); if (!s.ok()) { if (allow_uncharged) { proto.total_charge = 0; @@ -407,7 +427,7 @@ typename Table::HandleImpl* BaseClockTable::CreateStandalone( // Case strict_capacity_limit == false bool success = ChargeUsageMaybeEvictNonStrict<Table>( total_charge, capacity, - /*need_evict_for_occupancy=*/false, state); + /*need_evict_for_occupancy=*/false, eec_and_scl, state); if (!success) { // Force the issue usage_.FetchAddRelaxed(total_charge); @@ -420,7 +440,7 @@ typename Table::HandleImpl* BaseClockTable::CreateStandalone( template <class Table> Status BaseClockTable::ChargeUsageMaybeEvictStrict( size_t total_charge, size_t capacity, bool need_evict_for_occupancy, - typename Table::InsertState& state) { + uint32_t eviction_effort_cap, typename Table::InsertState& state) { if (total_charge > capacity) { return Status::MemoryLimit( "Cache entry too large for a single cache shard: " + @@ -445,7 +465,8 @@ Status BaseClockTable::ChargeUsageMaybeEvictStrict( } if (request_evict_charge > 0) { EvictionData data; - static_cast<Table*>(this)->Evict(request_evict_charge, state, &data); + static_cast<Table*>(this)->Evict(request_evict_charge, state, &data, + eviction_effort_cap); occupancy_.FetchSub(data.freed_count); if (LIKELY(data.freed_charge > need_evict_charge)) { assert(data.freed_count > 0); @@ -475,7 +496,7 @@ Status BaseClockTable::ChargeUsageMaybeEvictStrict( template <class Table> inline bool BaseClockTable::ChargeUsageMaybeEvictNonStrict( size_t total_charge, size_t capacity, bool need_evict_for_occupancy, - typename Table::InsertState& state) { + uint32_t eviction_effort_cap, typename Table::InsertState& state) { // For simplicity, we consider that either the cache can accept the insert // with no evictions, or we must evict enough to make (at least) enough // space. 
It could lead to unnecessary failures or excessive evictions in @@ -511,7 +532,8 @@ inline bool BaseClockTable::ChargeUsageMaybeEvictNonStrict( } EvictionData data; if (need_evict_charge > 0) { - static_cast<Table*>(this)->Evict(need_evict_charge, state, &data); + static_cast<Table*>(this)->Evict(need_evict_charge, state, &data, + eviction_effort_cap); // Deal with potential occupancy deficit if (UNLIKELY(need_evict_for_occupancy) && data.freed_count == 0) { assert(data.freed_charge == 0); @@ -530,11 +552,7 @@ inline bool BaseClockTable::ChargeUsageMaybeEvictNonStrict( return true; } -void BaseClockTable::TrackAndReleaseEvictedEntry( - ClockHandle* h, BaseClockTable::EvictionData* data) { - data->freed_charge += h->GetTotalCharge(); - data->freed_count += 1; - +void BaseClockTable::TrackAndReleaseEvictedEntry(ClockHandle* h) { bool took_value_ownership = false; if (eviction_callback_) { // For key reconstructed from hash @@ -551,11 +569,20 @@ void BaseClockTable::TrackAndReleaseEvictedEntry( MarkEmpty(*h); } +bool IsEvictionEffortExceeded(const BaseClockTable::EvictionData& data, + uint32_t eviction_effort_cap) { + // Basically checks whether the ratio of useful effort to wasted effort is + // too low, with a start-up allowance for wasted effort before any useful + // effort. + return (data.freed_count + 1U) * uint64_t{eviction_effort_cap} <= + data.seen_pinned_count; +} + template <class Table> Status BaseClockTable::Insert(const ClockHandleBasicData& proto, typename Table::HandleImpl** handle, Cache::Priority priority, size_t capacity, - bool strict_capacity_limit) { + uint32_t eec_and_scl) { using HandleImpl = typename Table::HandleImpl; Table& derived = static_cast<Table&>(*this); @@ -573,9 +600,12 @@ Status BaseClockTable::Insert(const ClockHandleBasicData& proto, // strict_capacity_limit, but mostly pessimistic. bool use_standalone_insert = false; const size_t total_charge = proto.GetTotalCharge(); - if (strict_capacity_limit) { + // NOTE: we can use eec_and_scl as eviction_effort_cap below because + // strict_capacity_limit=true is supposed to disable the limit on eviction + // effort, and a large value effectively does that. 
+ if (eec_and_scl & kStrictCapacityLimitBit) { Status s = ChargeUsageMaybeEvictStrict<Table>( - total_charge, capacity, need_evict_for_occupancy, state); + total_charge, capacity, need_evict_for_occupancy, eec_and_scl, state); if (!s.ok()) { // Revert occupancy occupancy_.FetchSubRelaxed(1); @@ -584,7 +614,7 @@ Status BaseClockTable::Insert(const ClockHandleBasicData& proto, } else { // Case strict_capacity_limit == false bool success = ChargeUsageMaybeEvictNonStrict<Table>( - total_charge, capacity, need_evict_for_occupancy, state); + total_charge, capacity, need_evict_for_occupancy, eec_and_scl, state); if (!success) { // Revert occupancy occupancy_.FetchSubRelaxed(1); @@ -688,8 +718,7 @@ void BaseClockTable::TEST_ReleaseNMinus1(ClockHandle* h, size_t n) { #endif FixedHyperClockTable::FixedHyperClockTable( - size_t capacity, bool /*strict_capacity_limit*/, - CacheMetadataChargePolicy metadata_charge_policy, + size_t capacity, CacheMetadataChargePolicy metadata_charge_policy, MemoryAllocator* allocator, const Cache::EvictionCallback* eviction_callback, const uint32_t* hash_seed, const Opts& opts) @@ -1084,7 +1113,8 @@ inline void FixedHyperClockTable::ReclaimEntryUsage(size_t total_charge) { } inline void FixedHyperClockTable::Evict(size_t requested_charge, InsertState&, - EvictionData* data) { + EvictionData* data, + uint32_t eviction_effort_cap) { // precondition assert(requested_charge > 0); @@ -1105,10 +1135,10 @@ inline void FixedHyperClockTable::Evict(size_t requested_charge, InsertState&, for (;;) { for (size_t i = 0; i < step_size; i++) { HandleImpl& h = array_[ModTableSize(Lower32of64(old_clock_pointer + i))]; - bool evicting = ClockUpdate(h); + bool evicting = ClockUpdate(h, data); if (evicting) { Rollback(h.hashed_key, &h); - TrackAndReleaseEvictedEntry(&h, data); + TrackAndReleaseEvictedEntry(&h); } } @@ -1119,6 +1149,10 @@ inline void FixedHyperClockTable::Evict(size_t requested_charge, InsertState&, if (old_clock_pointer >= max_clock_pointer) { return; } + if (IsEvictionEffortExceeded(*data, eviction_effort_cap)) { + eviction_effort_exceeded_count_.FetchAddRelaxed(1); + return; + } // Advance clock pointer (concurrently) old_clock_pointer = clock_pointer_.FetchAddRelaxed(step_size); @@ -1133,10 +1167,11 @@ ClockCacheShard<Table>::ClockCacheShard( const Cache::EvictionCallback* eviction_callback, const uint32_t* hash_seed, const typename Table::Opts& opts) : CacheShardBase(metadata_charge_policy), - table_(capacity, strict_capacity_limit, metadata_charge_policy, allocator, - eviction_callback, hash_seed, opts), + table_(capacity, metadata_charge_policy, allocator, eviction_callback, + hash_seed, opts), capacity_(capacity), - strict_capacity_limit_(strict_capacity_limit) { + eec_and_scl_(SanitizeEncodeEecAndScl(opts.eviction_effort_cap, + strict_capacity_limit)) { // Initial charge metadata should not exceed capacity assert(table_.GetUsage() <= capacity_.LoadRelaxed() || capacity_.LoadRelaxed() < sizeof(HandleImpl)); @@ -1212,7 +1247,11 @@ void ClockCacheShard<Table>::SetCapacity(size_t capacity) { template <class Table> void ClockCacheShard<Table>::SetStrictCapacityLimit( bool strict_capacity_limit) { - strict_capacity_limit_.StoreRelaxed(strict_capacity_limit); + if (strict_capacity_limit) { + eec_and_scl_.FetchOrRelaxed(kStrictCapacityLimitBit); + } else { + eec_and_scl_.FetchAndRelaxed(~kStrictCapacityLimitBit); + } // next Insert will take care of any necessary evictions } @@ -1234,7 +1273,7 @@ Status ClockCacheShard<Table>::Insert(const Slice& key, proto.total_charge = 
charge; return table_.template Insert<Table>(proto, handle, priority, capacity_.LoadRelaxed(), - strict_capacity_limit_.LoadRelaxed()); + eec_and_scl_.LoadRelaxed()); } template <class Table> @@ -1249,9 +1288,9 @@ typename Table::HandleImpl* ClockCacheShard<Table>::CreateStandalone( proto.value = obj; proto.helper = helper; proto.total_charge = charge; - return table_.template CreateStandalone<Table>( - proto, capacity_.LoadRelaxed(), strict_capacity_limit_.LoadRelaxed(), - allow_uncharged); + return table_.template CreateStandalone<Table>(proto, capacity_.LoadRelaxed(), + eec_and_scl_.LoadRelaxed(), + allow_uncharged); } template <class Table> @@ -1503,14 +1542,20 @@ void BaseHyperClockCache<Table>::ReportProblems( const std::shared_ptr<Logger>& info_log) const { if (info_log->GetInfoLogLevel() <= InfoLogLevel::DEBUG_LEVEL) { LoadVarianceStats slot_stats; + uint64_t eviction_effort_exceeded_count = 0; this->ForEachShard([&](const BaseHyperClockCache<Table>::Shard* shard) { size_t count = shard->GetTableAddressCount(); for (size_t i = 0; i < count; ++i) { slot_stats.Add(IsSlotOccupied(*shard->GetTable().HandlePtr(i))); } + eviction_effort_exceeded_count += + shard->GetTable().GetEvictionEffortExceededCount(); }); ROCKS_LOG_AT_LEVEL(info_log, InfoLogLevel::DEBUG_LEVEL, "Slot occupancy stats: %s", slot_stats.Report().c_str()); + ROCKS_LOG_AT_LEVEL(info_log, InfoLogLevel::DEBUG_LEVEL, + "Eviction effort exceeded: %" PRIu64, + eviction_effort_exceeded_count); } } @@ -1908,8 +1953,7 @@ class AutoHyperClockTable::ChainRewriteLock { }; AutoHyperClockTable::AutoHyperClockTable( - size_t capacity, bool /*strict_capacity_limit*/, - CacheMetadataChargePolicy metadata_charge_policy, + size_t capacity, CacheMetadataChargePolicy metadata_charge_policy, MemoryAllocator* allocator, const Cache::EvictionCallback* eviction_callback, const uint32_t* hash_seed, const Opts& opts) @@ -2590,7 +2634,8 @@ using ClockUpdateChainLockedOpData = template <class OpData> void AutoHyperClockTable::PurgeImplLocked(OpData* op_data, ChainRewriteLock& rewrite_lock, - size_t home) { + size_t home, + BaseClockTable::EvictionData* data) { constexpr bool kIsPurge = std::is_same_v<OpData, PurgeLockedOpData>; constexpr bool kIsClockUpdateChain = std::is_same_v<OpData, ClockUpdateChainLockedOpData>; @@ -2632,7 +2677,7 @@ void AutoHyperClockTable::PurgeImplLocked(OpData* op_data, assert(home == BottomNBits(h->hashed_key[1], home_shift)); if constexpr (kIsClockUpdateChain) { // Clock update and/or check for purgeable (under (de)construction) - if (ClockUpdate(*h, &purgeable)) { + if (ClockUpdate(*h, data, &purgeable)) { // Remember for finishing eviction op_data->push_back(h); // Entries for eviction become purgeable @@ -2642,6 +2687,7 @@ void AutoHyperClockTable::PurgeImplLocked(OpData* op_data, } } else { (void)op_data; + (void)data; purgeable = ((h->meta.Load() >> ClockHandle::kStateShift) & ClockHandle::kStateShareableBit) == 0; } @@ -2719,7 +2765,8 @@ using PurgeOpData = const UniqueId64x2; using ClockUpdateChainOpData = ClockUpdateChainLockedOpData; template <class OpData> -void AutoHyperClockTable::PurgeImpl(OpData* op_data, size_t home) { +void AutoHyperClockTable::PurgeImpl(OpData* op_data, size_t home, + BaseClockTable::EvictionData* data) { // Early efforts to make AutoHCC fully wait-free ran into too many problems // that needed obscure and potentially inefficient work-arounds to have a // chance at working. 
@@ -2800,9 +2847,9 @@ void AutoHyperClockTable::PurgeImpl(OpData* op_data, size_t home) { if (!rewrite_lock.IsEnd()) { if constexpr (kIsPurge) { PurgeLockedOpData* locked_op_data{}; - PurgeImplLocked(locked_op_data, rewrite_lock, home); + PurgeImplLocked(locked_op_data, rewrite_lock, home, data); } else { - PurgeImplLocked(op_data, rewrite_lock, home); + PurgeImplLocked(op_data, rewrite_lock, home, data); } } } @@ -3405,7 +3452,8 @@ void AutoHyperClockTable::EraseUnRefEntries() { } void AutoHyperClockTable::Evict(size_t requested_charge, InsertState& state, - EvictionData* data) { + EvictionData* data, + uint32_t eviction_effort_cap) { // precondition assert(requested_charge > 0); @@ -3463,12 +3511,12 @@ void AutoHyperClockTable::Evict(size_t requested_charge, InsertState& state, if (home >= used_length) { break; } - PurgeImpl(&to_finish_eviction, home); + PurgeImpl(&to_finish_eviction, home, data); } } for (HandleImpl* h : to_finish_eviction) { - TrackAndReleaseEvictedEntry(h, data); + TrackAndReleaseEvictedEntry(h); // NOTE: setting likely_empty_slot here can cause us to reduce the // portion of "at home" entries, probably because an evicted entry // is more likely to come back than a random new entry and would be @@ -3496,6 +3544,11 @@ void AutoHyperClockTable::Evict(size_t requested_charge, InsertState& state, if (old_clock_pointer + step_size >= max_clock_pointer) { return; } + + if (IsEvictionEffortExceeded(*data, eviction_effort_cap)) { + eviction_effort_exceeded_count_.FetchAddRelaxed(1); + return; + } } } diff --git a/cache/clock_cache.h b/cache/clock_cache.h index 3086e7e97..54d656021 100644 --- a/cache/clock_cache.h +++ b/cache/clock_cache.h @@ -11,6 +11,7 @@ #include <array> #include <atomic> +#include <climits> #include <cstddef> #include <cstdint> #include <memory> @@ -374,6 +375,14 @@ struct ClockHandle : public ClockHandleBasicData { class BaseClockTable { public: + struct BaseOpts { + explicit BaseOpts(int _eviction_effort_cap) + : eviction_effort_cap(_eviction_effort_cap) {} + explicit BaseOpts(const HyperClockCacheOptions& opts) + : BaseOpts(opts.eviction_effort_cap) {} + int eviction_effort_cap; + }; + BaseClockTable(CacheMetadataChargePolicy metadata_charge_policy, MemoryAllocator* allocator, const Cache::EvictionCallback* eviction_callback, @@ -386,13 +395,13 @@ class BaseClockTable { template <class Table> typename Table::HandleImpl* CreateStandalone(ClockHandleBasicData& proto, size_t capacity, - bool strict_capacity_limit, + uint32_t eec_and_scl, bool allow_uncharged); template <class Table> Status Insert(const ClockHandleBasicData& proto, typename Table::HandleImpl** handle, Cache::Priority priority, - size_t capacity, bool strict_capacity_limit); + size_t capacity, uint32_t eec_and_scl); void Ref(ClockHandle& handle); @@ -406,12 +415,17 @@ class BaseClockTable { uint64_t GetYieldCount() const { return yield_count_.LoadRelaxed(); } + uint64_t GetEvictionEffortExceededCount() const { + return eviction_effort_exceeded_count_.LoadRelaxed(); + } + struct EvictionData { size_t freed_charge = 0; size_t freed_count = 0; + size_t seen_pinned_count = 0; }; - void TrackAndReleaseEvictedEntry(ClockHandle* h, EvictionData* data); + void TrackAndReleaseEvictedEntry(ClockHandle* h); #ifndef NDEBUG // Acquire N references @@ -436,6 +450,7 @@ class BaseClockTable { template <class Table> Status ChargeUsageMaybeEvictStrict(size_t total_charge, size_t capacity, bool need_evict_for_occupancy, + uint32_t eviction_effort_cap, typename Table::InsertState& state); // Helper for updating 
`usage_` for new entry with given `total_charge` @@ -449,6 +464,7 @@ class BaseClockTable { template <class Table> bool ChargeUsageMaybeEvictNonStrict(size_t total_charge, size_t capacity, bool need_evict_for_occupancy, + uint32_t eviction_effort_cap, typename Table::InsertState& state); protected: // data @@ -461,9 +477,15 @@ class BaseClockTable { RelaxedAtomic<uint64_t> clock_pointer_{}; // Counter for number of times we yield to wait on another thread. + // It is normal for this to occur rarely in normal operation. // (Relaxed: a simple stat counter.) RelaxedAtomic<uint64_t> yield_count_{}; + // Counter for number of times eviction effort cap is exceeded. + // It is normal for this to occur rarely in normal operation. + // (Relaxed: a simple stat counter.) + RelaxedAtomic<uint64_t> eviction_effort_exceeded_count_{}; + // TODO: is this separation needed if we don't do background evictions? ALIGN_AS(CACHE_LINE_SIZE) // Number of elements in the table. @@ -517,17 +539,19 @@ class FixedHyperClockTable : public BaseClockTable { inline void SetStandalone() { standalone = true; } }; // struct HandleImpl - struct Opts { - explicit Opts(size_t _estimated_value_size) - : estimated_value_size(_estimated_value_size) {} - explicit Opts(const HyperClockCacheOptions& opts) { + struct Opts : public BaseOpts { + explicit Opts(size_t _estimated_value_size, int _eviction_effort_cap) + : BaseOpts(_eviction_effort_cap), + estimated_value_size(_estimated_value_size) {} + explicit Opts(const HyperClockCacheOptions& opts) + : BaseOpts(opts.eviction_effort_cap) { assert(opts.estimated_entry_charge > 0); estimated_value_size = opts.estimated_entry_charge; } size_t estimated_value_size; }; - FixedHyperClockTable(size_t capacity, bool strict_capacity_limit, + FixedHyperClockTable(size_t capacity, CacheMetadataChargePolicy metadata_charge_policy, MemoryAllocator* allocator, const Cache::EvictionCallback* eviction_callback, @@ -549,7 +573,8 @@ class FixedHyperClockTable : public BaseClockTable { // Runs the clock eviction algorithm trying to reclaim at least // requested_charge. Returns how much is evicted, which could be less // if it appears impossible to evict the requested amount without blocking. - void Evict(size_t requested_charge, InsertState& state, EvictionData* data); + void Evict(size_t requested_charge, InsertState& state, EvictionData* data, + uint32_t eviction_effort_cap); HandleImpl* Lookup(const UniqueId64x2& hashed_key); @@ -803,18 +828,20 @@ class AutoHyperClockTable : public BaseClockTable { } }; // struct HandleImpl - struct Opts { - explicit Opts(size_t _min_avg_value_size) - : min_avg_value_size(_min_avg_value_size) {} + struct Opts : public BaseOpts { + explicit Opts(size_t _min_avg_value_size, int _eviction_effort_cap) + : BaseOpts(_eviction_effort_cap), + min_avg_value_size(_min_avg_value_size) {} - explicit Opts(const HyperClockCacheOptions& opts) { + explicit Opts(const HyperClockCacheOptions& opts) + : BaseOpts(opts.eviction_effort_cap) { assert(opts.estimated_entry_charge == 0); min_avg_value_size = opts.min_avg_entry_charge; } size_t min_avg_value_size; }; - AutoHyperClockTable(size_t capacity, bool strict_capacity_limit, + AutoHyperClockTable(size_t capacity, CacheMetadataChargePolicy metadata_charge_policy, MemoryAllocator* allocator, const Cache::EvictionCallback* eviction_callback, @@ -841,7 +868,8 @@ class AutoHyperClockTable : public BaseClockTable { // Runs the clock eviction algorithm trying to reclaim at least // requested_charge. 
Returns how much is evicted, which could be less // if it appears impossible to evict the requested amount without blocking. - void Evict(size_t requested_charge, InsertState& state, EvictionData* data); + void Evict(size_t requested_charge, InsertState& state, EvictionData* data, + uint32_t eviction_effort_cap); HandleImpl* Lookup(const UniqueId64x2& hashed_key); @@ -906,7 +934,8 @@ class AutoHyperClockTable : public BaseClockTable { // with proper handling to ensure all existing data is seen even in the // presence of concurrent insertions, etc. (See implementation.) template <class OpData> - void PurgeImpl(OpData* op_data, size_t home = SIZE_MAX); + void PurgeImpl(OpData* op_data, size_t home = SIZE_MAX, + EvictionData* data = nullptr); // An RAII wrapper for locking a chain of entries for removals. See // implementation. @@ -916,7 +945,7 @@ class AutoHyperClockTable : public BaseClockTable { // implementation. template <class OpData> void PurgeImplLocked(OpData* op_data, ChainRewriteLock& rewrite_lock, - size_t home); + size_t home, EvictionData* data); // Update length_info_ as much as possible without waiting, given a known // usable (ready for inserts and lookups) grow_home. (Previous grow_homes @@ -1078,9 +1107,10 @@ class ALIGN_AS(CACHE_LINE_SIZE) ClockCacheShard final : public CacheShardBase { // (Relaxed: eventual consistency/update is OK) RelaxedAtomic<size_t> capacity_; - // Whether to reject insertion if cache reaches its full capacity. + // Encodes eviction_effort_cap (bottom 31 bits) and strict_capacity_limit + // (top bit). See HyperClockCacheOptions::eviction_effort_cap etc. // (Relaxed: eventual consistency/update is OK) - RelaxedAtomic<bool> strict_capacity_limit_; + RelaxedAtomic<uint32_t> eec_and_scl_; }; // class ClockCacheShard template <class Table> diff --git a/cache/compressed_secondary_cache_test.cc b/cache/compressed_secondary_cache_test.cc index d72680b84..79f40868a 100644 --- a/cache/compressed_secondary_cache_test.cc +++ b/cache/compressed_secondary_cache_test.cc @@ -992,6 +992,8 @@ class CompressedSecCacheTestWithTiered /*_capacity=*/0, /*_estimated_entry_charge=*/256 << 10, /*_num_shard_bits=*/0); + // eviction_effort_cap setting simply to avoid churn in existing test + hcc_opts.eviction_effort_cap = 100; TieredCacheOptions opts; lru_opts.capacity = 0; lru_opts.num_shard_bits = 0; diff --git a/cache/lru_cache_test.cc b/cache/lru_cache_test.cc index 27fd5cc85..91b1d02c1 100644 --- a/cache/lru_cache_test.cc +++ b/cache/lru_cache_test.cc @@ -389,12 +389,13 @@ class ClockCacheTest : public testing::Test { } } - void NewShard(size_t capacity, bool strict_capacity_limit = true) { + void NewShard(size_t capacity, bool strict_capacity_limit = true, + int eviction_effort_cap = 30) { DeleteShard(); shard_ = reinterpret_cast<Shard*>(port::cacheline_aligned_alloc(sizeof(Shard))); - TableOpts opts{1 /*value_size*/}; + TableOpts opts{1 /*value_size*/, eviction_effort_cap}; new (shard_) Shard(capacity, strict_capacity_limit, kDontChargeCacheMetadata, /*allocator*/ nullptr, &eviction_callback_, &hash_seed_, opts); @@ -445,12 +446,20 @@ class ClockCacheTest : public testing::Test { return Slice(reinterpret_cast<const char*>(&hashed_key), 16U); } + // A bad hash function for testing / stressing collision handling static inline UniqueId64x2 TestHashedKey(char key) { // For testing hash near-collision behavior, put the variance in // hashed_key in bits that are unlikely to be used as hash bits. 
return {(static_cast<uint64_t>(key) << 56) + 1234U, 5678U}; } + // A reasonable hash function, for testing "typical behavior" etc. + template <typename T> + static inline UniqueId64x2 CheapHash(T i) { + return {static_cast<uint64_t>(i) * uint64_t{0x85EBCA77C2B2AE63}, + static_cast<uint64_t>(i) * uint64_t{0xC2B2AE3D27D4EB4F}}; + } + Shard* shard_ = nullptr; private: @@ -683,6 +692,53 @@ TYPED_TEST(ClockCacheTest, ClockEvictionTest) { } } +TYPED_TEST(ClockCacheTest, ClockEvictionEffortCapTest) { + using HandleImpl = typename ClockCacheTest<TypeParam>::Shard::HandleImpl; + for (bool strict_capacity_limit : {true, false}) { + SCOPED_TRACE("strict_capacity_limit = " + + std::to_string(strict_capacity_limit)); + for (int eec : {-42, 0, 1, 10, 100, 1000}) { + SCOPED_TRACE("eviction_effort_cap = " + std::to_string(eec)); + constexpr size_t kCapacity = 1000; + // Start with much larger capacity to ensure that we can go way over + // capacity without reaching table occupancy limit. + this->NewShard(3 * kCapacity, strict_capacity_limit, eec); + auto& shard = *this->shard_; + shard.SetCapacity(kCapacity); + + // Nearly fill the cache with pinned entries, then add a bunch of + // non-pinned entries. eviction_effort_cap should affect how many + // evictable entries are present beyond the cache capacity, despite + // being evictable. + constexpr size_t kCount = kCapacity - 1; + std::unique_ptr<HandleImpl* []> ha { new HandleImpl* [kCount] {} }; + for (size_t i = 0; i < 2 * kCount; ++i) { + UniqueId64x2 hkey = this->CheapHash(i); + ASSERT_OK(shard.Insert( + this->TestKey(hkey), hkey, nullptr /*value*/, &kNoopCacheItemHelper, + 1 /*charge*/, i < kCount ? &ha[i] : nullptr, Cache::Priority::LOW)); + } + + if (strict_capacity_limit) { + // If strict_capacity_limit is enabled, the cache will never exceed its + // capacity + EXPECT_EQ(shard.GetOccupancyCount(), kCapacity); + } else { + // Rough inverse relationship between cap and possible memory + // explosion, which shows up as increased table occupancy count. + int effective_eec = std::max(int{1}, eec) + 1; + EXPECT_NEAR(shard.GetOccupancyCount() * 1.0, + kCount * (1 + 1.4 / effective_eec), + kCount * (0.6 / effective_eec) + 1.0); + } + + for (size_t i = 0; i < kCount; ++i) { + shard.Release(ha[i]); + } + } + } +} + namespace { struct DeleteCounter { int deleted = 0; |