Cloud databases should support dynamic allocation of hot and cold data to fast and slow storage devices, so that clients can get the best performance within their budget.
There is some locality in SSTable accesses. (This paper uses the QuizUp workload for the analysis.)
Figure 3 indicates that younger data receives more accesses while older data receives fewer.
Figure 4 illustrates that the hottest data can receive 4 orders of magnitude more accesses than the coldest data.
Figure 5 shows that as more records are inserted over time, the number of infrequently accessed SSTables ("cold" SSTables) keeps growing, while the number of frequently accessed SSTables ("hot" SSTables) stays roughly constant.
The system's components include a record index.
With these access patterns in mind, the paper formulates this specific question:
Find a subset of SSTables to store in fast storage such that the total number of accesses served from fast storage is maximized (optimization goal), while the total volume of SSTables in fast storage stays within a bound (constraint).
They use the following model for the optimization:
- Pf, Ps: unit prices of the fast and slow storage devices
- Sf, Ss: sum of all SSTable sizes in fast and slow storage
- Ai: number of accesses to SSTable i
- Si: size of SSTable i
- Xi: whether SSTable i is in fast storage (0/1)
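Using these symbols, the problem statement above can be written out as follows (my reconstruction from the variable definitions, not copied from the paper):

```latex
\max_{X} \sum_i A_i X_i
\quad \text{s.t.} \quad \sum_i S_i X_i \le S_f, \qquad X_i \in \{0, 1\}
```

where the capacity bound Sf is presumably chosen so that the storage cost, roughly Pf·Sf + Ps·Ss, stays within the client's budget.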
This becomes a 0/1 knapsack problem, which they solve with a greedy algorithm.
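A minimal sketch of such a greedy placement (my own illustration, not the paper's code): rank SSTables by access density Ai/Si and fill fast storage until the capacity bound is hit.

```python
def place_sstables(sstables, fast_capacity):
    """Greedy 0/1-knapsack heuristic: pick SSTables with the highest
    accesses-per-byte until fast storage is full.

    sstables: list of (accesses, size) tuples, indexed by SSTable id.
    Returns the set of SSTable ids placed in fast storage.
    """
    # Sort SSTable ids by access density, hottest-per-byte first.
    order = sorted(range(len(sstables)),
                   key=lambda i: sstables[i][0] / sstables[i][1],
                   reverse=True)
    in_fast, used = set(), 0
    for i in order:
        size = sstables[i][1]
        if used + size <= fast_capacity:
            in_fast.add(i)
            used += size
    return in_fast

# Example: three SSTables as (accesses, size), 100 units of fast storage.
tables = [(1000, 50), (10, 60), (500, 40)]
print(place_sstables(tables, 100))  # SSTables 0 and 2 fit; 1 is too cold
```

Like any greedy knapsack heuristic, this is not guaranteed optimal, but it is cheap enough to rerun whenever access statistics change.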
Why not just sort by access frequency and put as many SSTables as possible onto the fast device directly?
They use a counter to record access frequency, but a raw counter is affected severely by spikes and dips. So they apply a filter based on an exponential average to smooth the frequency counter.
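A minimal sketch of that kind of exponential smoothing (the update rule is the standard exponentially weighted moving average; the paper's actual decay parameter is an assumption here):

```python
def smooth(counts, alpha=0.3):
    """Exponentially weighted moving average over per-interval access counts.

    alpha near 1 reacts quickly to change; alpha near 0 suppresses
    short-lived spikes and dips.
    """
    ema = float(counts[0])
    out = [ema]
    for c in counts[1:]:
        ema = alpha * c + (1 - alpha) * ema
        out.append(ema)
    return out

# A one-interval spike of 100 is damped to ~37 in the smoothed series.
print(smooth([10, 10, 100, 10, 10]))
```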
This works like a digital signal filter: the exponential average is a first-order low-pass filter, and can be characterized by its transfer function.
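Concretely, the update y_t = alpha*x_t + (1 - alpha)*y_{t-1} is a first-order IIR low-pass filter whose z-domain transfer function is (a standard DSP identity, not taken from the paper):

```latex
H(z) = \frac{Y(z)}{X(z)} = \frac{\alpha}{1 - (1 - \alpha) z^{-1}}
```

The pole at z = 1 - alpha explains the smoothing: the closer alpha is to 0, the more high-frequency variation (spikes and dips) is attenuated.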
What I don’t understand is why in-memory caching is not enough for this hot data. One possible explanation is that even the hot data set is too large to fit in memory.