

Review of recent advances in dealing with data size challenges in Deep Learning

The energy and excitement in the machine learning and deep learning communities are infectious these days. So many groundbreaking advances are happening in this area, but I have often found myself wondering why the one thing that makes it all shine - yes, I am talking about the dark horse of deep learning, the data - is so underappreciated. The last few years of DL research have given me much joy and excitement, and I now carry hope that going forward we will see some exciting progress in this space that explores advances in deep learning in conjunction with data! In this article, I summarise some of the recent developments in the deep learning space that I have been blown away by.

Table of contents of this article:

- The dark horse of deep learning: data
- Labelled data: the types of labels
- Commonly used DL techniques centered around data
  - Data Augmentation
  - Transfer Learning
  - Dimensionality reduction
  - Active learning
- Challenges in scaling dataset for deep-learning
- Recent advances in data-related techniques
  - 1. Regularization
    - 1.1 Mixup
    - 1.2 Label Smoothing
  - 2. Compression
    - 2.1. X-shot learning: How many are enough?
    - 2.2. Pruning
      - 2.2.1 Coresets
      - 2.2.2 Example forgetting
      - 2.2.3 Using Gradient norms
    - 2.3. Distillation
  - 3. So what if you have noisy data
- Conclusion
- References

The dark horse of deep learning: data

Deep learning (DL) algorithms learn to perform a task by building a (domain) knowledge representation by looking at the training data. An early study of image models (classification and segmentation, year 2017) noted that the performance of the model increases logarithmically as the training dataset increases 1. The belief that increasing training dataset size will continue to increase model performance has been long-held. This has also been supported by another empirical study that validated this belief across machine translation, language modeling, image classification, and speech recognition 2 (see figure 1).

*Figure 1: Shows relationship between generalization error and dataset size (log scale) 2 *

So, a bigger dataset is better, right? Almost! A theoretical foundation has been laid out in the form of a power law, i.e. $ \begin{equation} \label{power_law} ε(m) \approx αm^{β_g} \end{equation} $ wherein ε is the generalization error, m is the number of samples in the training set, α is a constant property of the problem/DL task, and β_g is the scaling exponent that defines the steepness of the learning curve, i.e. how well a model can learn from adding more data to the training set 2 (see figure 2). Empirically, β_g was found to be between −0.07 and −0.35, despite theory suggesting β_g to be −0.5 or −1. Nonetheless, the logarithmic relationship holds. As shown in figure 2, the gain eventually tapers off into an irreducible error region.

*Figure 2: Power-law curve showing a model trained with a small dataset being only as good as random guessing, rapidly getting better as the dataset size increases, and eventually settling into the irreducible error region, explained by a combination of factors including imperfect data that cause imperfect generalization 2*
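To make the power law concrete, here is a tiny numeric sketch of what the empirical range of β_g implies; the value of α is arbitrary because it cancels out in the ratio.

```python
import numpy as np

def generalization_error(m, alpha=1.0, beta_g=-0.35):
    """Power-law learning curve: eps(m) ~= alpha * m ** beta_g."""
    return alpha * m ** beta_g

# How much does doubling the dataset help, across the reported range of beta_g?
for beta_g in (-0.07, -0.35, -0.5):
    ratio = generalization_error(2e6, beta_g=beta_g) / generalization_error(1e6, beta_g=beta_g)
    print(f"beta_g={beta_g:+.2f}: doubling the data multiplies the error by ~{ratio:.2f}")
# -0.07 -> ~0.95 (a 5% relative improvement), -0.35 -> ~0.78, -0.50 -> ~0.71
```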

This can be attributed to several factors, including imperfection in the data. The importance of data quality and of continually iterating over it is touched on in some of the previous talks 1, 2, 3. Data quality matters, and so does the data distribution. The better the distribution of the training dataset, the better the model can generalize!


Data is certainly the new oil! 3


So, can we scale the data size without many grievances? Keep in mind, 61% of AI-practicing organizations already find data and data-related challenges to be their top challenge 4. If the challenges around procurement, storage, data quality, and the distribution/demographics of the dataset have not consumed you yet, this post focuses on yet another series of questions. How can we train efficiently when data volume grows and the computation cost and turnaround time increase linearly with data growth? Then we begin asking how much of the data is superfluous, which examples are more impactful, and how do we find them? These are very important questions to ask, given that a recent survey 4 noted that about 40% of the organizations practicing AI already spend at least $1M per annum on GPUs and AI-related compute infrastructure. This should concern us all. Not every organization beyond FAANG (and also the ones assumed to be in FAANG but missing from the acronym!) and the ones with big fat balance sheets will be able to leverage the gains of simply scaling the dataset. Besides, this should concern us all for environmental reasons and carbon emission implications more details.

The carbon footprint of training a single AI is as much as 284 tonnes of carbon dioxide equivalent — five times the lifetime emissions of an average car source.

The utopian state of simply scaling training datasets and counting your blessings simply does not exist. The question then is, what are we doing about it? Unfortunately not a whole lot, especially if you look at the excitement in the ML research community in utilizing gazillions of GPU-years to gain a minuscule increase in model performance attributed to algorithmic or architectural changes. But the good news is that this area is gaining much more traction now. A few pieces of research since 2020 are very promising, albeit in their infancy. I have been following the literature around the use of data in AI (a.k.a. data-centric AI) very closely, as this is one of my active areas of interest. I am excited about some of the recent developments in this area. In this post, I will cover some of my learnings and excitement around this topic.

Before covering them in detail, let's review the foundational understanding and priors first:

Labelled data: the types of labels

This post focuses heavily on supervised learning scenarios, mainly in computer vision. In this space, there are two types of labels:

- Hard labels
- Soft labels

Traditional labels are hard labels where the value of ground truth is a discrete value e.g. 0 and 1, 0 for no, and 1 for yes. These discrete values can be anything depending on how the dataset was curated. It's important to note that these values are absolute and unambiguously indicate their meaning.

There is an emerging form of labels known as soft labels, where the ground truth represents a likelihood. By nature, these labels are continuous. For example, a pixel is 40% cat and 60% dog. This will make a whole lot more sense in the following sections.

Commonly used DL techniques centered around data

Data augmentation and transfer learning are two commonly used techniques in deep learning these days that focus on applying data efficiently. Both these techniques are heavily democratized now and commonly applied unless explicitly omitted.

Data Augmentation

Data augmentation encompasses a variety of techniques to transform a datapoint such that it adds variety to the dataset. The technique aims to keep the data distribution about the same but add richness to the dataset by adding variety. Predominantly, the transformations applied via this technique have been intra-sample. Affine transformations, contrast adjustment, jittering, or color balancing are some examples of data augmentation techniques. Imgaug and kornia are very good libraries for such operations, even though all ML frameworks offer a limited set of data augmentation routines.

Data augmentation was initially proposed to increase robustness and achieve better generalization in the model, but it is also used as a technique to synthetically increase the data size. This is especially true in cases where data procurement is really challenging. These days, data augmentation techniques have become a lot more complex and richer, including scenarios wherein model-driven augmentations may also be applied; one example of this is GAN-based techniques to augment and synthesize samples. In fact, data augmentation is also one of the techniques for building robustness against adversarial attacks.

* Example of augmentation src *
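As a quick sketch of an intra-sample augmentation pipeline using torchvision (imgaug and kornia expose similar transforms); the file name and parameter values below are placeholders, not recommendations.

```python
from PIL import Image
from torchvision import transforms

# Affine warp, colour jitter, and flips: each call produces a different "new" sample.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

img = Image.open("cat.jpg")                          # hypothetical input image
augmented_views = [augment(img) for _ in range(4)]   # four augmented variants of one sample
```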

Transfer Learning

Transfer learning is a very well democratized technique as well; it stems from reusing learned representations in a new task when the problem domains of the two tasks are related. Transfer learning relaxes the assumption that the training data must be independent and identically distributed (i.i.d.) with the test data 5, allowing one to address the problem of insufficient training data by bootstrapping model weights from another model trained on another dataset.

*Figure 3: Training with and without transfer learning 6 *

With transfer learning, faster convergence can be achieved if there is an overlap between the tasks of the source and target model.
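A minimal sketch of the usual recipe with torchvision; the 10-class head and the decision to freeze the backbone are illustrative choices, not requirements.

```python
import torch.nn as nn
from torchvision import models

# Start from weights learned on ImageNet (the "source" task); torchvision >= 0.13 API.
model = models.resnet18(weights="DEFAULT")

# Optionally freeze the backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the "target" task (say, 10 classes).
model.fc = nn.Linear(model.fc.in_features, 10)

# Train as usual from here; convergence is typically much faster than training from scratch.
```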

Dimensionality reduction

Dimensionality reduction techniques are also applied to large datasets.

These techniques fall into two categories:

  1. Those that seek to preserve the pairwise distances amongst all the samples in the dataset. Principal component analysis (PCA) is a good example of this.
  2. Those that preserve local distances over global distances. Techniques like uniform manifold approximation and projection (UMAP) 23 and t-distributed stochastic neighbor embedding (t-SNE) 24 fit in this category. UMAP arguably preserves more of the global structure and is algorithmically faster than t-SNE. Both t-SNE and UMAP use gradient descent to arrive at the optimal embeddings.

In the DL space, however, these techniques are mostly used to understand the data and for visualization purposes. UMAP and t-SNE do better at preserving global structure than other embedding algorithms but are still limited. This blog covers the topic in more detail.
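A quick sketch with scikit-learn and umap-learn; the random matrix stands in for real embeddings, and perplexity, n_neighbors, and min_dist are illustrative settings.

```python
import numpy as np
from sklearn.manifold import TSNE
import umap  # pip install umap-learn

X = np.random.rand(1000, 512)  # stand-in for, e.g., penultimate-layer embeddings

# t-SNE: preserves local neighbourhoods; perplexity is the key knob.
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)

# UMAP: also local-distance driven, usually faster, and arguably keeps more global structure.
X_umap = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit_transform(X)

print(X_tsne.shape, X_umap.shape)  # (1000, 2) (1000, 2)
```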

Active learning

Active learning is a methodology wherein the training process proactively asks for labels on specific data. It is used more commonly with classical ML techniques, but it has been less successful in deep learning, partly because each round of querying requires retraining the network via backpropagation, which is expensive. Offline or pool-based active learning has been investigated heavily for use in deep learning but without much groundbreaking success. The use of active learning is not very straightforward either, due to the negative impact of outliers on training 25. Pool-based active learning will be covered in more detail in a following section (pruning).
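For reference, a minimal sketch of one pool-based query step, assuming an sklearn-style classifier with predict_proba; least-confidence is just one of several query criteria.

```python
import numpy as np

def least_confidence_query(model, pool_X, k=100):
    """Pick the k pool samples the current model is least confident about."""
    probs = model.predict_proba(pool_X)    # assumes an sklearn-style classifier
    confidence = probs.max(axis=1)         # confidence of the top prediction per sample
    return np.argsort(confidence)[:k]      # indices to send to the annotators

# Typical loop (pseudocode): train on the labelled set, query, label, repeat.
# for _ in range(n_rounds):
#     model.fit(X_labelled, y_labelled)
#     query_idx = least_confidence_query(model, X_pool)
#     ...move the newly labelled samples from the pool into the training set...
```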

Challenges in scaling dataset for deep-learning

Besides the techniques discussed in the previous section, not a lot of investment has gone into areas focussing on data-centric AI. The momentum around data-centric AI has been forming recently, with Andrew Ng driving data-centric AI efforts through his new startup Landing.AI.

In my view, the following are some of the broad categories of questions that fall under the purview of data-centric AI:

  1. How to train efficiently with the rapid increase in dataset size? Yann LeCun called out in his interview with Soumith Chintala during PyTorch developer day 2021 that a training time of more than 1 week should not be acceptable. This is a very good baseline for practical reasons, but if one does not have an enormous GPU fleet at their disposal then this goalpost is very hard to achieve given current DL practices. So, what else can be done to train efficiently with increased dataset size?
  2. Are all samples equally important? How important is a given sample in the dataset? Can we leverage this "importance factor" for good?
  3. What role does a sample play towards better generalization? Some samples carry redundant features, so how do we deduplicate the dataset when features, as in DL, are not explicit?
  4. Data size matters, but can we be strategic about what goes into the dataset?
    1. Doing this cleverly comes down to efficient sampling and data mining techniques. These are easily solved problems if and only if we know what our targets are. The challenge in DL, as I see it, is knowing what to look for when mining for the best samples. This is not well understood.
  5. How can we leverage more innate DL techniques like objective functions, backpropagation, and gradient descent to build a slick and effective dataset that provides the highest return on investment?
  6. Noise in datasets is seen as evil. But is it always evil? How much noise can one live with? How do we quantify this?
  7. How much of a crime is it if data bleeds across traditional train/validate/calibrate/test splits?
    1. What are the recommendations on the data split for cascade training scenarios?
  8. How fancy can one get with data augmentation before returns start to diminish?
  9. How to optimize the use of data when continual learning is practiced?

If we look at humans as learning machines, they have infinite data at their disposal to learn from. Our system has evolved efficient strategies to parse through infinite data streams and select the samples we are interested in. How our vision system performs foveal fixation, utilizing saccadic eye movements to conduct efficient subsampling of interesting and useful datapoints, should be a good motivation. Sure, we fail sometimes; we fail to see the pen on the table even though it's right in front of us, due to various reasons, but we hit it right most of the time. Some concepts of Gestalt theory, a principle used to explain how people perceive visual components (as organized patterns, instead of many disparate parts), are already applicable for better selection of data even if machine models are stochastic parrots. According to this theory, eight main factors, listed below, determine how the visual system automatically groups elements into patterns.

  1. Proximity: Tendency to perceive objects or shapes that are close to one another as forming a group.
  2. Similarity: Tendency to group objects if physical resemblance e.g. shape, pattern, color, etc. is present.
  3. Closure: Tendency to see complete figures/forms even if what is present in the image is incomplete.
  4. Symmetry: Tendency to ‘see’ objects as symmetrical and forming around a center point.
  5. Common fate: Tendency to associate similar movement as part of a common motion.
  6. Continuity: Tendency to perceive each object as a single uninterrupted, i.e. continuous, object.
  7. Good Gestalt: Tendency to group together if a regular, simple, and orderly pattern can be formed
  8. Past experience: Tendency to categorize objects according to past experience.

Of these, I argue, proximity, similarity, common fate, and past experience are relevant. I would even argue for the possibility of applying closure. A recent work by FAIR 22 shows that machine models can fill in the gaps and infer missing pieces correctly by applying minor changes to the commonly used autoencoder technique. The reason I bring this up with more excitement than GAN-based hallucination techniques is how much easier it is to build and train compared to a GAN.

Masked-Autoencoders showing model inferring missing patches 22

It's been interesting to note that the recent advances towards dealing with the challenges of scaled data are largely inspired by already known deep-learning techniques, except they are now applied through the lens of data: examples include pruning, compression, sampling strategies, and leveraging phenomena such as catastrophic forgetting and knowledge distillation.

| Technique | How it's presently utilized in model building | Proposed data-centric view |
| --- | --- | --- |
| Pruning | A specialized class of model compression techniques where low-magnitude weights are eliminated to reduce size and computational overhead. | Samples that don't contribute much to generalization are omitted from the training regime. |
| Compression | A broad range of model compression techniques to reduce size and computational overhead, including techniques like quantization wherein some amount of information loss is expected. | A broad range of data filtering and compression techniques to reduce size without compromising much on generalization. |
| Distillation | Extracting the learned representation of a more complex model into a potentially smaller model. | Extracting the knowledge present in a larger dataset into a smaller synthesized set. |
| Loss function | Also termed the objective function, one of the core concepts of DL that defines the problem statement. | As shown in 22, and more broadly, it can be leveraged to fill in missing information in the data. |
| Regularization | One of the theoretical principles of DL, applied through various techniques like BatchNorm and Dropout to avoid overfitting. | A variety of techniques to avoid overfitting applied with data in mind, e.g. Label Smoothing 7,10 |

*Table 1: Summary of techniques cross-bred from core DL techniques into data-centric DL*

Let's dive into the details of how these classes of techniques are applied through the lens of data:

1. Regularization

1.1 Mixup

Mixup is a special form of data augmentation technique that looks beyond intra-sample modification and explores inter-sample modification. The idea with mixup is to linearly combine (through a convex combination) a pair of samples to produce one.

$ \begin{equation} x' = λx_i + (1−λ)x_j, \quad \text{where } λ ∈ [0,1] \text{ is drawn from a Beta distribution and } x_i, x_j \text{ are input/source vectors} \end{equation} $

$ \begin{equation} y' = λy_i + (1−λ)y_j, \quad \text{where } y_i, y_j \text{ are one-hot label encodings} \end{equation} $

*Figure 4: Sample produced by applying mixup 7 on Oxford Pets dataset *

Mixup 7 in fact seeks to regularize the neural network to favor simple linear behavior in-between training examples. As shown in fig 5, mixup results in better models with fewer prediction errors. It's been shown that mixup increases generalization, reduces the memorization of corrupt labels, and increases robustness towards adversarial examples 7,8.

Figure 5: Shows that using mixup 7, lower prediction error and smaller gradient norms are observed.

I see mixup not only as an augmentation and regularisation technique but also as a data compression technique. Depending on how frequently (say α) mixup is applied, the dataset compression ratio (Cr) will be:

$ \begin{equation} C_r = 1 - α/2 \end{equation} $

If you have not noticed already, applying mixup converts labels to soft labels. The linear combination of discrete values results in a continuous label value, which explains the example previously discussed wherein a pixel is 40% cat and 60% dog (see fig 4).
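A rough sketch of the two equations above; α = 0.4 and the toy shapes are illustrative values, not the paper's settings.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4):
    """Mix two samples and their one-hot labels; lambda is drawn from Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    x_mix = lam * x1 + (1 - lam) * x2
    y_mix = lam * y1 + (1 - lam) * y2   # this is where hard labels become soft labels
    return x_mix, y_mix

# e.g. mixing a "dog" and a "cat" image with lambda = 0.6 yields the soft label [0.6, 0.4]
x_mix, y_mix = mixup(np.random.rand(224, 224, 3), np.array([1.0, 0.0]),
                     np.random.rand(224, 224, 3), np.array([0.0, 1.0]))
print(y_mix)
```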

1.2 Label Smoothing

Label smoothing 10 is a form of regularisation technique that smoothes out the ground truth by a very small value epsilon ɛ. One motivation for this is of course better generalization and avoiding overfitting. The other motivation is to discourage the model from becoming overconfident. Both 8,10 have shown that label smoothing leads to better models.

$ \begin{equation} Q_i = \begin{cases} 1 - ɛ & \text{if } i = k \\ ɛ/K & \text{otherwise, where } K \text{ is the number of classes} \end{cases} \end{equation} $

Label smoothing, as indicated by the equation above, does not lead to any visible differences in the label data as ɛ is really small. However, applying mixup visibly changes both the source (x) and the label (y).

Applying label-smoothing has no noticeable difference
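A minimal sketch of the smoothing equation above; ɛ = 0.1 and K = 4 are arbitrary illustrative values. Note that a common implementation variant instead mixes the one-hot vector with a uniform distribution, (1 − ɛ)·y + ɛ/K, so that the entries still sum to exactly 1.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing as in the equation above: 1 - eps on the true class, eps/K elsewhere."""
    K = one_hot.shape[-1]
    return np.where(one_hot == 1.0, 1.0 - eps, eps / K)

y = np.array([0.0, 0.0, 1.0, 0.0])    # hard label, K = 4
print(smooth_labels(y, eps=0.1))      # [0.025 0.025 0.9   0.025]
```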

2. Compression

Compression refers to a broad range of data filtering and compression techniques to reduce size without compromising much on generalization. The following are some of the recent exciting developments on this front:

2.1. X-shot learning: How many are enough?

The troubles of high computational cost and long training times due to an increase in the dataset have led to the development of few-shot training strategies. The intuition behind this approach is to take a model and guide it to learn to perform a new task by looking at only a few samples 11. The concept of transfer learning is implicitly applied in this approach. This line of investigation started with training new tasks using only a handful of samples and explored the extreme case of one-shot training, i.e. learning new tasks from only one sample 12,13.

Recently an interesting mega-extreme approach of shot-based learning has emerged: ‘Less Than One’-Shot Learning, a.k.a. LO-Shot learning 11. This approach utilizes soft-label concepts and seeks to merge hard-label N-class samples into M samples where M < N, and thus the name less than one! LO-Shot-based techniques are a form of data compression and may feel very similar to the mixup technique discussed earlier. However, LO-Shot, contrary to a convex combination of samples as in mixup, exploits a distance-weighted k-Nearest Neighbours technique to infer the soft labels. Their algorithm, termed distance-weighted soft-label prototype k-Nearest Neighbours (SLaPkNN), essentially takes the sum of the label vectors of the k nearest prototypes to a target point x, with each prototype weighted inversely proportional to its distance from x. The following figure shows a 4-class dataset merged into 2 samples using SLaPkNN.

Figure LO-Shot: LO Shot splitting 4 class space into 2 points 11.

In my understanding that is the main theoretical difference between the two techniques, with the other difference being that mixup only merges two samples into one, using a probability drawn from a beta distribution and combined using λ and 1−λ, whereas LO-Shot is more versatile and can compress much more aggressively. I am not saying mixup can't be extended to be more multivariate, but the empirical analysis of such an approach is unknown; whereas with 11 it's been shown that SLaPkNN can compress 3M − 2 classes into M samples at least.

The technical explanation for this along with code is available here.
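For intuition only, here is a minimal sketch of the distance-weighted soft-label prototype idea described above; the prototype coordinates, soft labels, and k are made-up illustrative values, not taken from the paper.

```python
import numpy as np

def slapknn_predict(x, prototypes, soft_labels, k=2, eps=1e-12):
    """Sum the soft-label vectors of the k nearest prototypes, weighted by inverse distance."""
    dists = np.linalg.norm(prototypes - x, axis=1)   # distance from x to each prototype
    nearest = np.argsort(dists)[:k]                  # k nearest prototypes
    weights = 1.0 / (dists[nearest] + eps)           # inverse-distance weights
    scores = (weights[:, None] * soft_labels[nearest]).sum(axis=0)
    return scores / scores.sum()                     # predicted class distribution

# Two prototypes carrying soft labels can encode three classes (M = 2 < N = 3):
prototypes = np.array([[0.0, 0.0], [1.0, 0.0]])
soft_labels = np.array([[0.6, 0.4, 0.0],
                        [0.0, 0.4, 0.6]])
print(slapknn_predict(np.array([0.5, 0.0]), prototypes, soft_labels))  # ~[0.3, 0.4, 0.3]
```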

2.2. Pruning

Pruning is a subclass of compression techniques wherein samples that are not really helpful or effective are dropped, whereas the selected samples are kept as is without any loss in content. The following are some of the known techniques for dataset pruning:

2.2.1 Coresets

The coreset selection technique pertains to subsampling from a large dataset to a smaller set that closely approximates the given large set. This is not a new technique and has been heavily explored using hand-engineered features and simpler models like Markov models to approximate the smaller set. This is not a DL-specific technique either and has its place in classical ML as well. An example could be using naïve Bayes to select coresets for more computationally expensive counterparts like decision trees.

In deep learning, using a similar concept, a lighter-weight DL model can be used as a proxy to select the approximate dataset 15. This is more easily achieved when continual learning is practiced; otherwise it can be a very expensive technique in itself, given that the proxy model needs to be trained with the full dataset first. This becomes especially tricky when the proxy and target models are different and also when the information in the dataset is not concentrated in a few samples but uniformly distributed over all of them. These are some of the reasons why this approach has not been very successful.

2.2.2 Example forgetting

An investigation 14 reported that some samples, once learned, are never forgotten and exhibit the same behavior across various training parameters and hyperparameters, while other classes of samples are forgotten. A forgetting event was defined as the model prediction regressing in a subsequent epoch. Both qualitative and quantitative (see fig 6 and 7) analysis of such forgotten samples indicated that noisy labels and images with “uncommon”, visually complicated features were the main reasons for example forgetting.

Figure 6: Algorithm to track forgotten samples 14.

Figure 7: Indicating how increasing fraction of noisy samples led to increased forgetting events 14.

An interesting observation from this study was that removing a large fraction of unforgotten samples still results in extremely competitive performance on the test set. The hypothesis formed was that unforgotten samples are not very informative, whereas forgotten samples are more informative and useful for training. In their case, the forgetting score stabilized after 75 epochs (using ResNet & CIFAR, but the value will vary per model and dataset).
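A minimal sketch of the bookkeeping behind forgetting events, as I understand it (my paraphrase, not the authors' code); it assumes per-sample ids and per-batch predictions are available each epoch.

```python
from collections import defaultdict

prev_correct = {}                      # last known correctness per sample id
forgetting_events = defaultdict(int)   # count of forgetting events per sample id

def update_forgetting_stats(sample_ids, predictions, targets):
    """A forgetting event: correct at the previous check, incorrect now."""
    for sid, pred, target in zip(sample_ids, predictions, targets):
        correct = (pred == target)
        if prev_correct.get(sid, False) and not correct:
            forgetting_events[sid] += 1
        prev_correct[sid] = correct

# Call update_forgetting_stats(...) for every batch of every epoch; samples that never
# appear in forgetting_events are the "unforgettable" ones the study found prunable.
```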

Perhaps a few samples are enough to tell that a cat has 4 legs, a small face, and pointy ears, and it's more about how different varieties of cats look especially if they look different from the norm e.g. Sphynx cats.

2.2.3 Using Gradient norms

Loss functions are an excellent measure for finding interesting samples in your dataset, whether they are poorly labeled or genuine outliers. This was highlighted by Andrej Karpathy as well:

When you sort your dataset descending by loss you are guaranteed to find something unexpected, strange, and helpful.

Personally, I have found loss a very good measure for finding poorly labeled samples. So, the natural question would be "should we explore how we can use the loss as a measure to prune the dataset?". It was not until NeurIPS 2021 21 that this was properly investigated. This Stanford study looked into the initial loss gradient norm of individual training examples, averaged over several weight initializations, to prune the dataset for better generalization. This work is closely related to example forgetting, except that instead of a performance measure the focus is more on using local information early in training to prune the dataset.

This work proposes the GraNd score of a training sample (x, y) at time t, given by the L2 norm of the gradient of the loss computed on that sample, and also the expected loss L2 norm, termed EL2N (equations below). The reasoning here is that samples with a small GraNd score have a bounded influence on learning how to classify the rest of the training data at a given training time. Empirically, this paper found that averaging the norms over multiple weight initializations resulted in scores that correlate with forgetting scores 14 and leads to pruning a significant fraction of samples early in training. They were able to prune 50% of the samples from CIFAR-10 without affecting accuracy, while on the more challenging CIFAR-100 dataset, they pruned 25% of examples with only a 1% drop in accuracy.
$ \begin{equation} χ_t(x, y) = \mathbb{E}_{w_t} \| g_t(x, y) \|_2 \tag*{GraNd} \end{equation} $

$ \begin{equation} χ_t(x, y) = \mathbb{E} \| p(w_t, x) - y \|_2 \tag*{EL2N} \end{equation} $
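As a rough sketch of how one might compute EL2N-style scores in PyTorch for a single set of weights (the paper averages the scores over several weight initializations; model, x, and y are assumed to exist):

```python
import torch
import torch.nn.functional as F

def el2n_scores(model, x, y, num_classes):
    """L2 norm of the error vector p(w_t, x) - y, one score per sample (cf. EL2N above)."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
        err = probs - F.one_hot(y, num_classes).float()
    return err.norm(dim=1)

# GraNd instead uses the per-sample gradient norm ||g_t(x, y)||_2; in both cases the
# lowest-scoring samples, early in training, are the candidates for pruning.
```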

This is a particularly interesting approach and a big departure from other pruning strategies to date, which treated samples in the dataset independently. Dropping samples based on independent statistics provides a weaker theoretical guarantee of success as DL is a non-convex problem 21. I am very curious to find out how mixup impacts the GraNd scores, given it was shown (see figure 5b) that using mixup leads to smaller gradient norms (L1, albeit).

Results of pruning with GraNd and EL2N 21.

The results from this study are shown in the figure above. Noticeably, heavy pruning is not fruitful even with this approach, despite how well it does on the CIFAR-10 and CIFAR-100 datasets. Are we retaining the data distribution when we drop a large fraction of samples? Mostly not, and that is the only reasoning that makes sense. And we circle back to: how much pruning is enough? Is that network dependent or more a property of the data and its distribution? This study 21 claims that GraNd and EL2N scores, when averaged over multiple initializations or training trajectories, remove the dependence on specific weights/networks, presenting a more compressed dataset. If this assertion holds in reality, in my view, this is a very promising finding, easing the data-related challenges of DL.

What's more fascinating about this work is that it sheds light on how the underlying data distribution shapes the training dynamics. This has been amiss until now. Of particular interest is identifying subspaces of the model’s data representation that are relatively stable over the training.

2.3. Distillation

Distillation refers to methodologies for distilling the knowledge of a complex or larger set into a smaller set. Knowledge or model distillation is a popular technique that compresses the learned representation of a larger model into a much smaller model without any significant drop in performance. The student-teacher training regime has been explored extensively, even in the case of transformer networks, which are even more data-hungry than more conventional networks, say convolutional networks (DeiT). Despite being called data-efficient, the DeiT paper employs a teacher-student strategy for transformer networks, and data itself is merely treated as a commodity.

Recently, this concept has been investigated for use in deep learning for dataset distillation, with the aim of synthesizing an optimal smaller dataset from a large dataset 17,16. The distilled datasets are learned and synthesized but, in theory, they approximate the larger dataset. Note that the synthesized data may not follow the same data distribution.

Some dataset distillation techniques refer to their approach as compression as well. I disagree with this in principle, as compression, albeit lossy in this context, refers to compressing the dataset, whereas with distillation the data representation is deduced/synthesized, potentially leading to entirely different samples altogether. Perhaps it's the compressibility factor, a.k.a. the compression ratio, that applies to both techniques. For example, figure 13 shows the extent to which distilled images can change.

The dataset distillation paper 17 states:

We present a new optimization algorithm for synthesizing a small number of synthetic data samples not only capturing much of the original training data but also tailored explicitly for fast model training in only a few gradient steps 17.

Their problem formulation was very interesting! They derive the network weights as a differentiable function of the synthetic training data and set the training objective to optimize the pixel values of the distilled images. The results from this study showed that one can go as low as one synthetic image per category while not regressing too much on performance. More concretely, distilling the 60K training images of the MNIST digit dataset into only 10 synthetic images (one per class) yields a test-time MNIST recognition performance of 94%, compared to 99% for the original dataset.
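A heavily condensed sketch of this bi-level idea, not the authors' implementation: a tiny stand-in network, a single differentiable inner SGD step, and illustrative hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # stand-in network
syn_x = torch.randn(10, 1, 28, 28, requires_grad=True)        # one learnable synthetic image per class
syn_y = torch.arange(10)                                       # fixed labels 0..9
outer_opt = torch.optim.Adam([syn_x], lr=0.01)
inner_lr = 0.1

def distill_step(real_x, real_y):
    names, params = zip(*model.named_parameters())
    # Inner step: one gradient step on the synthetic data only, kept differentiable.
    loss_syn = F.cross_entropy(functional_call(model, dict(zip(names, params)), (syn_x,)), syn_y)
    grads = torch.autograd.grad(loss_syn, params, create_graph=True)
    updated = {n: p - inner_lr * g for n, p, g in zip(names, params, grads)}
    # Outer step: evaluate the updated weights on real data and push the loss into the pixels.
    loss_real = F.cross_entropy(functional_call(model, updated, (real_x,)), real_y)
    outer_opt.zero_grad()
    loss_real.backward()
    outer_opt.step()

distill_step(torch.randn(64, 1, 28, 28), torch.randint(0, 10, (64,)))  # one outer update
```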

Figure 8: Dataset distillation results from FAIR study 17.

Here are some of the distilled samples for the classes labeled at the top (fig 9). It's amazing how well a model trained on these distilled MNIST sets does, but the CIFAR one misses the mark, reaching only about 54% compared to 80% on the full dataset (fig 8 & 9).

Figure 9: Dataset distillation results from FAIR study 17.

Following this work, another distillation technique was proposed utilizing kernel methods, more specifically kernel ridge regression, to obtain an ε-approximation of the original dataset 18. This technique is termed Kernel Inducing Points (KIP) and follows the same principle: keep the objective function to maximize the quality of the approximation, and backpropagate the gradients to learn the synthesized, distilled data. The difference between 18 and 17 is that 17 uses the same DL network while 18 uses kernel techniques. With KIP, another advantage is that not just the source samples but optionally the labels can be synthesized too. In 17, the objective was purely to learn pixel values and thus the source (X). Paper 18 also proposes another technique, Label Solve (LS), in which X is kept constant and only the label (Y) is learned.
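A toy sketch of the KIP loop, with a plain RBF kernel standing in for the infinite-width kernels used in the paper; the shapes, learning rate, and regularizer lam are illustrative only.

```python
import torch

def rbf(a, b, gamma=0.1):
    """Simple RBF kernel; the paper uses neural (tangent) kernels instead."""
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

x_real, y_real = torch.randn(512, 784), torch.randn(512, 10)   # stand-in target set
x_s = torch.randn(20, 784, requires_grad=True)                  # learnable support images
y_s = torch.randn(20, 10, requires_grad=True)                   # learnable support labels (optional)
opt = torch.optim.Adam([x_s, y_s], lr=0.01)
lam = 1e-3

for _ in range(100):
    K_ss = rbf(x_s, x_s) + lam * torch.eye(x_s.shape[0])
    K_ts = rbf(x_real, x_s)
    pred = K_ts @ torch.linalg.solve(K_ss, y_s)   # kernel ridge regression prediction
    loss = ((pred - y_real) ** 2).mean()          # how well the support set explains real data
    opt.zero_grad()
    loss.backward()                               # gradients flow into the support images/labels
    opt.step()
```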

Figure 10: Examples of distilled samples a) with KIP and b) With LS 18.

The CIFAR-10 result from 17 (fig 9) was about 36.79% for 10 samples; with KIP there is a slight gain in performance, given the extreme compression. This raises the question of what a good compression ratio is that can guarantee good information retention. For complex tasks like CIFAR (compared to MNIST), 10 samples (one per class) may not be enough given how complex this dataset is comparatively.

Figure 11: CIFAR10 result from KIP and LS 18.

Actually, the LO-Shot technique 11, discussed previously, is also a specialized form of X-shot technique that does dataset distillation. Aside from that, gradient-based techniques for dataset distillation have been actively investigated in the last few years (ref 16,17,18,19,20). Another approach explored a siamese augmentation method, termed Differentiable Siamese Augmentation (DSA), that uses a matching loss and synthesizes the dataset through backpropagation 16.

Figure 12: Differentiable Siamese Augmentation 16.

Bayesian and gradient-descent-trained neural networks converge to Gaussian processes (GPs) as the number of hidden units in intermediary layers approaches infinity 20 (fig 13). This is true for convolutional networks as well: they converge to a particular Gaussian process as the channel counts are stretched to infinity. These networks can thus be described by kernels, known as the Neural Tangent Kernel (NTK). The Neural Tangents library, based on JAX (an auto-differentiation toolkit), has been used to apply these kernels in some of the recent distillation methods; references 18,20,21 are such examples.

Figure 13: Infinite-width convolutional networks converging to Gaussian processes 20.

The authors of KIP and LS techniques 18 explore how to scale and accelerate the distillation process to apply these techniques to infinitely wide convolutional networks 20. Their results are very promising (see fig 14).

Figure 14: KIP ConvNet results 20.

A visual inspection of the distilled dataset from the infinite-width CNN-based KIP technique is shown in fig 15. The distillation results are curious, to say the least. In the example, the distilled apple image seems to represent a pile of apples, whereas the distilled bottle results in visibly different bottles while still showing artifacts of the original two bottles. Other classes show higher-order features (with some noise).

Figure 15: KIP ConvNet example of distilled CIFAR set 20.

Figure 16 shows the MNIST results; they are not only very interesting but also look very much like mixup (where both x and y are mixed).

Figure 16: KIP ConvNet example of distilled MNIST set 20.

3. So what if you have noisy data

Noise in a dataset is considered a nuisance. Because models hold a compressed form of the knowledge represented by the dataset, dataset curation techniques carefully look to avoid noise in datasets.

Noise can also be an incredibly powerful tool to fill in missing information in the source/images. For instance, if only part of an image is known, then instead of padding the image with a default value (0 or 1 pixel value), filling in with random noise can be an effective technique, avoiding confusion with actual real values that relate to black or white regions. This has held true in my own experience. It was very amusing to note that the forgetting-event study 14 in fact looked into the effect of adding label noise on the distribution of forgetting events. They also added noise in pixel values and observed that adding an increasing amount of noise decreases the number of unforgettable samples (see also figure 7 for their results when noise was used).
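A minimal sketch of that padding trick; the shapes are illustrative.

```python
import numpy as np

def pad_with_noise(img, target_h, target_w):
    """Pad a partial image with random noise instead of a constant 0/1 value, so the model
    is less tempted to read meaning into an artificial black or white region."""
    h, w, c = img.shape
    canvas = np.random.rand(target_h, target_w, c).astype(img.dtype)  # noise background
    canvas[:h, :w] = img                                              # known region on top
    return canvas

partial = np.random.rand(100, 160, 3).astype(np.float32)   # stand-in for the known part
print(pad_with_noise(partial, 224, 224).shape)             # (224, 224, 3)
```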

Noise coming from randomness is handled very well by DL networks too. I find the result from 22, shown in figure 17, quite fascinating actually. Looking at how well the model does when the missing patches are random and how poorly it does when the missing patches are systematic is indicative of how powerful and how lame (both at the same time) machines are!

Figure 17: Noisy patches in Masked-AutoEncoder 22.

The GraNd study 21 looked into the effect of noise on the source itself and performed a series of experiments to conclude that when there is enough data, keeping the high-score examples, which are often noisy or difficult, does not hurt performance and can only help.

Conclusion

In summary, the last four years have been incredibly exciting for data in the DL space, and the year 2021 even more so! There is a lot of mileage we can get out of simpler techniques like mixup, but the more exciting developments are dissecting the training dynamics and exploring the importance of samples in solving a particular task using DL techniques. Distillation methods are still in the early stages, where they work well for simpler datasets, but honestly, how many real-world problems have simple datasets? Nevertheless, there are some really exciting developments in this space. These techniques can be groundbreaking if the compression methods hold across a wide range of architectures, as indicated by 21.

References

  1. [1707.02968] Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. Accessed January 3, 2022. https://arxiv.org/abs/1707.02968.
  2. Hestness, Joel, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. “Deep Learning Scaling Is Predictable, Empirically,” December 2017. https://arxiv.org/abs/1712.00409.
  3. https://www.wired.com/story/no-data-is-not-the-new-oil/
  4. https://pages.run.ai/hubfs/PDFs/2021-State-of-AI-Infrastructure-Survey.pdf
  5. [1808.01974] A Survey on Deep Transfer Learning. Accessed January 5, 2022. https://arxiv.org/abs/1808.01974.
  6. Transfer Learning. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.1515&rep=rep1&type=pdf
  7. Zhang, Hongyi, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. “Mixup: Beyond Empirical Risk Minimization,” October 2017. https://arxiv.org/abs/1710.09412.
  8. [1812.01187] Bag of Tricks for Image Classification with Convolutional Neural Networks. Accessed December 30, 2021. https://arxiv.org/abs/1812.01187.
  9. [2009.08449] ’Less Than One’-Shot Learning: Learning N Classes From M < N Samples. Accessed January 5, 2022. https://arxiv.org/abs/2009.08449.
  10. [1512.00567] Rethinking the Inception Architecture for Computer Vision. Accessed January 5, 2022. https://arxiv.org/abs/1512.00567.
  11. [1904.05046] Generalizing from a Few Examples: A Survey on Few-Shot Learning. Accessed January 5, 2022. https://arxiv.org/abs/1904.05046.
  12. Li Fei-Fei, R. Fergus and P. Perona, “One-shot learning of object categories,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594–611, April 2006, doi: 10.1109/TPAMI.2006.79. https://ieeexplore.ieee.org/document/1597116
  13. [1606.04080] Matching Networks for One Shot Learning. Accessed January 5, 2022. https://arxiv.org/abs/1606.04080.
  14. [1812.05159] An Empirical Study of Example Forgetting during Deep Neural Network Learning. Accessed December 29, 2021. https://arxiv.org/abs/1812.05159.
  15. [1906.11829] Selection via Proxy: Efficient Data Selection for Deep Learning. Accessed December 29, 2021. https://arxiv.org/abs/1906.11829.
  16. [2102.08259] Dataset Condensation with Differentiable Siamese Augmentation. Accessed January 5, 2022. https://arxiv.org/abs/2102.08259.
  17. [1811.10959] Dataset Distillation. Accessed January 5, 2022. https://arxiv.org/abs/1811.10959.
  18. [2011.00050] Dataset Meta-Learning from Kernel Ridge-Regression. Accessed January 5, 2022. https://arxiv.org/abs/2011.00050.
  19. [2006.05929] Dataset Condensation with Gradient Matching. Accessed January 5, 2022. https://arxiv.org/abs/2006.05929.
  20. [2107.13034] Dataset Distillation with Infinitely Wide Convolutional Networks. Accessed January 5, 2022. https://arxiv.org/abs/2107.13034.
  21. [2107.07075] Deep Learning on a Data Diet: Finding Important Examples Early in Training. Accessed December 10, 2021. https://arxiv.org/abs/2107.07075.
  22. [2111.06377] Masked Autoencoders Are Scalable Vision Learners. Accessed January 5, 2022. https://arxiv.org/abs/2111.06377.
  23. [1802.03426] UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Accessed January 5, 2022. https://arxiv.org/abs/1802.03426.
  24. Maaten, Laurens van der and Geoffrey E. Hinton. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9 (2008): 2579–2605. https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
  25. [2107.02331] Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering. Accessed January 8, 2022. https://arxiv.org/abs/2107.02331

WTH! Who killed my pod - Whodunit?

A few days ago, I deployed a brand new application onto a self-managed Kubernetes cluster (hereafter referred to as Kube). Suffice to say, all hell broke loose. The pods were getting OOMKilled with error code 137 left and right!

Now, I know a thing or two about Kubernetes1,2. I am not a total Kube noob!
But, I could not figure out what the fudge was going on actually! Besides, this app has been thoroughly tested and profiled and ran fine on bare metal and virtual environments.

So this was me, a few days ago!

This sparked a massive hunt for the culprit, and some interesting insights were discovered. Worth noting, a similar investigation has also been done by Line Corp in their excellent blog; however, I have a different story to tell!

In this writeup, I am going to talk about this particular incident and the insights I have uncovered about both Kube and Linux kernels.

Context of the app

The app runs some intensive numpy and Tensorflow computations to produce some artifacts and associated metadata. The workloads are memory-intensive as they operate on rich multimedia content. Other gory details besides the resource requirements of the app are irrelevant to this discussion.

The average resource requirement for this app fluctuates quite a bit yet is predictable (within a given range). At least, so we thought, looking at our metrics:

Figure 2: Average resource requirements of the app when run on VMs or bare metal

I hear you, the resource utilization is not following a zero-gradient line (fig 2)! It would be awesome to have constant, non-flapping resource requirements, so clearly some work needs to happen on the app here. Having said that, it's an absolutely acceptable and supported workload.

Ok, so the app was deployed and now, we will look at the line of investigation:

App's on Kube: day 1

The provisioned app pods started to get killed as frequently as every 20 mins or more with error code 137 and reason OOMKilled.

Figure 3: The killer is on the loose! - Whodunit?

Let me explain a few things about the failure first:

  1. Error code 137 indicates that the container process received SIGKILL and thus was killed by the OS kernel. SIGKILL on Kube can only be produced using one of the following means:

1.1. Manually (human): Triggering CTRL+C or using other means of manually sending SIGKILL, or even manually killing the process.

1.2. Container Runtime/Interface: `Kubelet`, the process running on the host machine that manages running Kube workloads, is `the power that be` for containers. 
It communicates with the container runtime to manage the container lifecycle. It can kill, and almost always kills, badly behaving pods!

![](../../resources/oom/CRI.png)
*Figure 4: Container runtime interface. Image Credit: [Ian Lewis]! Borrowed from his 4 part container runtime series [container runtime] that I highly recommend reading*

1.3. OS kernel: The OS kernel is responsible for the life cycle of processes running on the host. 
It is `the mighty power that be` for all the processes on the host including the container process and its children.
It can also kill and almost always kills badly behaving processes!
  2. OOMKilled represents a kill event (SIGKILL) sent to a process because someone in charge suspected the process to be the culprit of a memory surge that may lead to an out-of-memory event. This is a safeguard mechanism to avoid system-level failure and to nip mischief in the bud.

Takeaway 1: Either Container Runtime/Interface or OS Kernel killed my process because supposedly it was misbehaving and causing the out-of-memory issue! Essentially, I am ruling out the manual kill because that was simply not the case!

Deep-dive into factors at play here

  1. Container runtime (in fig 4) is responsible for two things:

    a) Running containers: This comes from the Open Container Initiative (OCI) (about 2013), open sourced by Docker and called "runc". It provides the ability to run containers.

    b) Image management: How images are packed, unpacked, pushed, pulled, etc. comes under this umbrella. A good example of this is "containerd".

    Figure 5: Docker stack! Image credit: internet

    There are several other runtime implementations than runc+containerd, like rkt, but for me, it's runc+containerd in play.

  2. Control groups are a Linux kernel feature that allows processes to be organized into hierarchical groups whose usage of various types of resources (memory, CPU, and so on) can then be limited and monitored. The cgroups interface is provided through a pseudo-filesystem called cgroupfs. You may have heard about /sys/fs/cgroup/!

    Liz Rice did an excellent demonstration of what it means to run a container and how they work, which I highly recommend going through. Don't forget to play with the demo code. It gives a foundational understanding of cgroups' role in all things containers.

    Figure 6: CGroup in picture! Image credit: zines by Julia Evans

  3. Kubelet (see fig 4) not only interfaces with the container runtime but also has cAdvisor (Container Advisor) integrated within. Note that kubelet is a service running on the host and it operates at the host level, not the pod level. With cAdvisor, it captures resource utilization and statistics about the control groups of all container processes on the host.

  4. Kubernetes manages resources for containers using cgroups, which guarantee resource isolation and restrictions. Kube can allocate X amount of resources to a container and allow the resources to grow until a pre-existing limit is reached or no more is left on the host to use. Kube provides the requests and limits semantics on containers, which are used to enforce the said limit on the process hierarchy of each container via cgroups. Now, the limit is not always a hard cut-off. As documented in Google's blog of best practices for resource requests and limits, there are two types of resources:

    1. Compressible resources: When the resource limit is reached, Kube will throttle the container, i.e. start to restrict the usage, but won't actually terminate the container. CPU is considered a compressible resource.
    2. Incompressible resources: When a limit for this type of resource is reached, the highest-usage process within the cgroup hierarchy will be killed. Memory is an incompressible resource.

    Takeaway 2: It's not the CPU limit, but the memory limit that we need to focus on.

  5. Kubernetes classifies pods into three categories based on the quality of service (QoS) they provide:

    5.1 Guaranteed pods are those whose resource request and limit are the same. These are the best kind of workload from Kube's viewpoint as they are easier to allocate and plan for resource-wise. These pods are guaranteed not to be killed until they exceed their limits.

    Figure 7: Guaranteed QoS pod example

    5.2 Best-Effort pods are those where no resource requirements are specified. These are the lowest-priority pods and the first to get killed if the system runs out of memory.

    Figure 8: Best-Effort QoS pod example

    5.3 Burstable pods are those whose resource request and limit are defined as a range (fig 9), with the limit treated as the max if undefined. These are the kind of workloads that are more likely to be killed when the host system is under load, they exceed their requests, and no Best-Effort pods exist.

    Figure 9: Burstable QoS pod example

    So can Kube over-commit? 
    If yes, would it always be on the compressible resources? 
    

    Yes, Kube can overcommit. The pod limits are allowed to be higher than requests. It's possible that the sum of all limits exceeds the total node capacity. It's possible to overcommit both compressible and incompressible resources. This is pictorially explained here. In fact, with Kube, it's also possible to not only vertically overcommit but also horizontally (at cluster level) overcommit. Horizontal overcommits are nicer as they can trigger auto-scaling events to scale out.

So why are the pods getting killed?

The app was initially deployed with Burstable QoS with Memory requirements set at request: 4Gi, limit: 7Gi, and CPU set at 2 for both requests, limits (see fig 2). The nodes were AWS r5.2xlarge type with 8 CPU, 64GB RAM, running Debian/Linux. Other than Kube system components and the app, nothing else was deployed on these nodes.

So, Kube could have only deployed 3 app pods per r5.2xlarge node (due to the CPU request). This means 43GB (=64-7*3) of RAM was lying around singing hakuna matata! What a waste! Sure, but let's not digress! So why the OOMKill? ¯\_(ツ)_/¯

Noteworthy observations:

- Node monitoring tells us that the node is running healthy and has plenty of resources at its disposal.
- The pod is still OOMKilled, but not all app pods on the node are; just one is killed.

I am still clueless. So, caving in, I decided to use up this extra memory floating around and beef up the nodes a bit more and buy more time to do a proper investigation. Now, the apps are redeployed again with RAM request 4Gi, limit: 31Gi (leaving 4GB for other misc system components).

Did that ameliorate the problem? No! Of course, I am being silly about this; I should be making it Guaranteed to have a better chance of avoiding an OOMKill.

App's on Kube: day 2

So, my apps are running with Guaranteed QoS with 31GB of RAM as request/limit. The node still seems healthy and shows no sign of duress.

How's the app doing with the new revised configuration? Still getting OOMKilled with error code 137 left and right!

Meanwhile, we uncovered random memory surges in some pods (see figure 10). These surges occurred very rarely and did not match the timing of the out-of-memory kill events. In fact, the frequency of OOM kills was much higher than that of these memory surges.

Figure 10: The notorious spike of memory use on pod

While these surges are worth investigating, they are still within the request/limit range (a 28.x Gi surge on a 31Gi request). So they still don't justify the OOM events.

What are the logs telling us?

Based on Takeaways 1 & 2, we look at who is firing the kill signal. #Whodunit

Kube events for pod and other higher-level abstractions

Investigating the Kube events, there is no record of any OOMKill or any event signaling anything malicious.

kubectl describe pod <my pod>
kubectl describe deploy <my pod>
In fact, according to my Kube event stream (`kubectl get events`), Kube is all healthy and there is nothing to see, nothing to worry about there! It shows that containers are clearly being restarted, but it does not capture any adverse event; it just brings them back up to keep to the desired declared state of the attached ReplicaSet.
26m         Normal   Created   pod/myapp   Created container planck
26m         Normal   Started   pod/myapp   Started container planck
26m         Normal   Pulled    pod/myapp   Container image "app" already present on machine

What are the CRI and kubelet doing?

Looking at the system journal, there is nothing noteworthy recorded for OOM:

  1. Nothing is logged for "Out of memory" (command reference: journalctl -u kubelet | grep -i "Out of memory").
  2. The only log I see for the shorter term "oom" (command reference: journalctl -u kubelet | grep -i "oom") is an info-level log of the kubelet startup record.

kubelet[2130]: I0309 04:52:13.990735    2130 flags.go:33] FLAG: --oom-score-adj="-999"
kubelet[2130]: I0309 04:52:15.416807    2130 docker_service.go:258] Docker Info: &{ID:XF74:2JFW:UOE4:QI7X:TXQU:RJLG:E7FC:K4K3:IUTM:MGFW:W2GM:Z6AC Containers:0 ContainersRunning:0 ContainersPaused:0 ContainersStopped:0 Images:0 Driver:overlay2 DriverStatus:[[Backing Filesystem extfs] [Supports d_type true] [Native Overlay Diff true]] SystemStatus:[] Plugins:{Volume:[local] Network:[bridge host macvlan null overlay] Authorization:[] Log:[awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog]} MemoryLimit:true SwapLimit:false KernelMemory:true KernelMemoryTCP:false CPUCfsPeriod:true CPUCfsQuota:true CPUShares:true CPUSet:true PidsLimit:false IPv4Forwarding:true BridgeNfIptables:true BridgeNfIP6tables:true Debug:false NFd:23 OomKillDisable:true NGoroutines:44 SystemTime:2021-03-09T04:52:15.411198727Z LoggingDriver:json-file CgroupDriver:cgroupfs NEventsListener:0 KernelVersion:4.9.0-14-amd64 OperatingSystem:Debian GNU/Linux 9 (stretch) OSType:linux Architecture:x86_64 IndexServerAddress:https://index.docker.io/v1/ RegistryConfig:0xc00062c0e0 NCPU:16 MemTotal:133666107392 Generic../resources:[] DockerRootDir:/var/lib/docker HTTPProxy: HTTPSProxy: NoProxy: Name:ip-172-30-36-152 Labels:[] ExperimentalBuild:false ServerVersion:18.06.3-ce ClusterStore: ClusterAdvertise: Runtimes:map[runc:{Path:docker-runc Args:[]}] DefaultRuntime:runc Swarm:{NodeID: NodeAddr: LocalNodeState:inactive ControlAvailable:false Error: RemoteManagers:[] Nodes:0 Managers:0 Cluster:<nil> Warnings:[]} LiveRestoreEnabled:false Isolation: InitBinary:docker-init ContainerdCommit:{ID:468a545b9edcd5932818eb9de8e72413e616e86e Expected:468a545b9edcd5932818eb9de8e72413e616e86e} RuncCommit:{ID:a592beb5bc4c4092b1b1bac971afed27687340c5 Expected:a592beb5bc4c4092b1b1bac971afed27687340c5} InitCommit:{ID:fec3683 Expected:fec3683} SecurityOptions:[name=seccomp,profile=default] ProductLicense: Warnings:[]}
kubelet[2130]: I0309 04:52:15.437879    2130 manager.go:1159] Started watching for new ooms in manager

Normally, in the event of OOM triggered by Kube, we should see kubelet recording some signal for oom e.g. An OOM event was triggered

Takeaway 3: As far as Kube is concerned, the pod is well behaved and it's all hakuna matata!

So, #Whodunit? Enter day 3 - new day new investigation

App's on Kube: day 3

Based on the previous 3 takeaways, the only potential suspect we have is OS kernel. The pods are still crashing and metrics, events, and Kube level logs do not justify the observation.

Reading kernel logs

  1. System level log scan grep -i -r 'out of memory' /var/log/ takes us somewhere.

    /var/log/kern.log:Mar  9 13:17:05 ip-172-xxx-xx-xxx kernel: [30320.358563] Memory cgroup out of memory: Kill process 11190 (app) score 9 or sacrifice child
    
    Takeaway 4: We do in fact have the kernel thinking the memory cgroup is in danger and starting to kill!

  2. Kernel logs (/var/log/kern.log) seem to have much more insightful info than the above one-liner out of memory: Kill process.

But before we look into this, let's do a bit of a deep dive into related concepts:

Deep-dive into OS Kernel

  1. Swap space and Kube

Docker supports setting swappiness however it's discouraged as it's slow and less performant. Also, providing a limit on the swap is unsupported at the docker level which can lead to resource management and overcommitment chaos. These are some of the reasons why kops and in general Kube prefer no swap on hosts.

  2. OOMKill disable on Kubernetes

The OS kernel allows disabling OOM kill at the cgroup level (/sys/fs/cgroup/memory/memory.oom_control), and even Docker supports it via the --oom-kill-disable flag. This is highly discouraged given the nature of the problem the band-aid-fixer OOM killer solves. It also does not sit well with Kube's declarative approach to orchestration or with the cattle workload philosophy. It's also why OOM kill is enabled by default on Kubernetes.

It's possible, however, to disable OOMKill by starting the kubelet service with the --cgroup-driver=cgroupfs argument and then setting oom_kill_disable under /sys/fs/cgroup/memory/memory.oom_control to 1.

Takeaway 5: It's not something I want to enable either, but for the completeness of the discussion, it's worth mentioning :).

  3. Kernel memory management

    The kernel uses virtual addressing (via paging and segmentation) to provide isolation amongst the various processes running on the host. Virtual addressing also allows the use of more memory than is currently available in physical memory (RAM) by making use of other media like disk (a.k.a. swap). Virtual addressing is divided into user and kernel space. Userspace is the part of the virtual address space reserved for user/application programs, whereas kernel space is reserved for kernel-related operations.

    Now, the OS kernel is designed to be greedy - greedy to be able to run as many processes as possible. This is also the reason why we need mechanisms like the out-of-memory (OOM) killer.

  4. System vs memory controller (memcg) OOM

cgroups comprise two components: core and controllers. The core corresponds to managing the hierarchy and core capabilities, whereas the controllers focus on the type of resource the cgroup is controlling, e.g. the cpu, io, or memory controller ('memcg').

Now, user-space out-of-memory handling can address OOM conditions both for cgroups using the memory controller ('memcg') and for the system as a whole.

Takeaway 6: We know, based on our takeaways so far, that our OOM is not stemming from the system as a whole being drained of memory. The log `Memory cgroup out of memory` also indicates that it is `memcg` triggering the OOM kill. Here, the memory usage of the app's process hierarchy is aggregated into its memcg, so usage can be accounted for at the group level. What our first log line is telling us is that `memcg usage reached its limit and memory cannot be reclaimed, i.e. the memcg is out of memory`<sup>[1][lwn]</sup>.
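If you have node access, the memcg in question can be inspected directly; a minimal sketch assuming the cgroup v1 kubepods hierarchy (the pod/container IDs below are placeholders):

    CG=/sys/fs/cgroup/memory/kubepods/burstable/podXXXX/containerYYYY
    cat $CG/memory.limit_in_bytes    # the limit the kernel enforces for the group
    cat $CG/memory.usage_in_bytes    # current aggregated usage for the group
    head $CG/memory.stat             # per-metric breakdown (rss, rss_huge, cache, ...)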
  5. OOM kill score

    How the kernel decides which process to kill is based on a score. The score has two parts: the main score (oom_score) and an adjustment factor (oom_score_adj). These scores are stored against the process id in the proc filesystem and can be found at:

    /proc/<pid>/oom_score
    /proc/<pid>/oom_score_adj
    

    The oom_score is assigned by the kernel and is proportional to the amount of memory used by the process, i.e. 10 x the percentage of memory used by the process. This means the maximum oom_score is 10 x 100% = 1000! The higher the oom_score, the higher the chance of the process being killed. However, the user can provide an adjustment factor, oom_score_adj (a.k.a. oom_adj in older kernel versions). If provided, it is used to adjust the final score. Valid values for oom_score_adj are in the range [-1000, +1000], where a negative score decreases and a positive one increases the chances of an OOMKill, as sketched below. More details on this can be found in this very interesting article by Jonathan Corbet, another OOMKill rewrite, with a precursory article found here.
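    A minimal sketch of poking at these scores on a live node ('my_app' is a stand-in process name):

        pid=$(pgrep -f my_app | head -n 1)
        cat /proc/${pid}/oom_score        # 0-1000; higher means first in line to be killed
        cat /proc/${pid}/oom_score_adj    # user adjustment, -1000 to +1000
        # Make the process less attractive to the OOM killer:
        echo -500 | sudo tee /proc/${pid}/oom_score_adj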

  6. OOM trigger workflow

kmsg is the kernel message interface that directs kernel messages to /proc/kmsg & /dev/kmsg. Now, /dev/kmsg is more useful for us mere mortals as it's designed to be persistent, whereas /proc/kmsg is designed to be read once and treated more as an event queue, if you will. Messages from here also trickle through to the kernel logs at /var/log/kern.log.
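To see this firsthand on a node, one can tail the kernel messages directly (either of these should work on most distros):

    sudo dmesg --follow | grep -iE 'oom|out of memory'
    sudo cat /dev/kmsg | grep -i oom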

_On Kube_

Kubelet watches `kmsg` and handles messages that translate to an OOMEvent/OOMKillEvent in the Kube event stream, which is then handled appropriately to trigger the OOMKill. More interesting details of how this happens can be found [here][line-eng-qos] (also shown in borrowed fig 11).

![](../../resources/oom/workflow-4-1024x816.png)
*Figure 11: OOM handling workflow on Kubernetes. Image credit: [Line Corp][line-eng-qos]*

As mentioned in takeaways 3 & 4, this workflow was not triggered in our case: we did not record any Kube-related OOM events, nor did kubelet receive any related messages.

_At Kernel Level_
When a system- or memory-controller-related OOM is suspected, the `oom-killer` is invoked, based on `oom_score` (with adjustment `oom_score_adj`), on the highest-scoring process and its children.

So why are the pods getting killed?

In my case, the memory cgroup ran out of memory, and my stack trace confirms this (see fig 12). It tells me that the application container was killed because it was consuming memory just 1.5MB shy of the limit (31457280 KB).

*Figure 12: Kernel log part 1*

OK! This explains the OOMKill, but then why:

a. My monitoring only shows 29GB as max memory surge!

b. I never noticed beyond 9GB usage in local/testing/profiling and all the jazz!

This simply does not add up! Let's hold on to this thought for a bit and look at the rest of the logs and what it says:

Before we go into part 2 of the log, I should explain a few things:

  1. The pause container is the parent container of each pod, responsible for creating and managing the environment for the group of containers provisioned within the pod. For more info, I will direct you to an excellent article by Ian Lewis, The Almighty Pause Container. I mention this because it will show up in the following log.

  2. Definitions of the memory cgroup stats metrics, as per kernel.org, are listed below.

Note that anonymous memory (often abbreviated as anon) is a memory mapping with no file or device backing it; programs use anon memory to allocate their stacks and heaps. Also, the standard page size on the Linux kernel is 4KB, which can be quite inefficient for storing mappings of large blocks of virtual memory. Hugepages are designed to solve this inefficiency and can hold a much bigger chunk than 4KB. More details on this are available here.
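To check the page size and hugepage usage on a node (standard Linux tooling, nothing specific to this setup):

    getconf PAGESIZE              # typically 4096 bytes on x86_64
    grep -i huge /proc/meminfo    # Hugepagesize, HugePages_* and AnonHugePages counters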

| Metrics of memory cgroups stats |                                                                                                        Definition                                                                                                        |
| ------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| rss                             | rss stands for resident set size. It is the portion of memory occupied by a process that is held in RAM. This metric represents the number of bytes of anonymous and swap cache memory (includes transparent hugepages). |
| rss_huge                        |                                                                                   number of bytes of anonymous transparent hugepages.                                                                                    |
| cache                           |                                                                                          number of bytes of page cache memory.                                                                                           |
| mapped_file                     |                                                                                number of bytes of the mapped file (includes tmpfs/shmem)                                                                                 |
| swap                            |                                                                                              number of bytes of swap usage                                                                                               |
| dirty                           |                                                                            number of bytes that are waiting to get written back to the disk.                                                                             |
| writeback                       |                                                                         number of bytes of file/anon cache that are queued for syncing to disk.                                                                          |
| inactive_anon                   |                                                                         number of bytes of anonymous and swap cache memory on inactive LRU list.                                                                         |
| active_anon                     |                                                                          number of bytes of anonymous and swap cache memory on active LRU list.                                                                          |
| inactive_file                   |                                                                               number of bytes of file-backed memory on inactive LRU list.                                                                                |
| active_file                     |                                                                                number of bytes of file-backed memory on active LRU list.                                                                                 |
| unevictable                     |                                                                            number of bytes of memory that cannot be reclaimed (mlocked etc).                                                                             |

Now, as discussed previously, swap is not being used on this system. See the second part of the logs in fig 13. You will note that memory stats are captured for two containers - a) the pause container and b) the app container. We can ignore the pause container; it's tiny and looks very healthy. But look at the stats for the app pod in fig 13 (below)! At the time my app was killed, it held about 29GB in hugepages and only an extra 1.3GB in RSS. That's huge, and remember, monitoring is not picking it up for some reason! It captured 29GB but not 31GB! Perhaps it's picking up only rss_huge and presenting it as rss erroneously! ¯\_(ツ)_/¯ Yes, we have a problem, but this monitoring issue is for another day!

*Figure 13: Kernel log part 2*

Notice the blue arrow in fig 13; it points to the page info captured for both the pause container process and the app container process. These are page counts, not bytes, and need to be multiplied by the 4KB page size to get the actual memory stats. They are translated two lines below the blue line!
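For example, with a made-up page count (the real numbers are in fig 13), the conversion looks like this:

    pages=7864320                 # hypothetical page count, purely for illustration
    echo $(( pages * 4096 ))      # 32212254720 bytes, i.e. roughly 30 GiB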

My app has freaking 62GB in total virtual memory! What's going on!

Ok, so "total-vm" is the part of virtual memory the process uses. A part of this "total-vm" that's mapped to RAM is rss. Part of rss that's allocated on to real memory, blocks is your anon-rss (anonymous memory), and the other part of rss is mapped to devices and files and termed file-rss. If my app goes crazy and allocates a large chunk of space (say using malloc()) but never really use it then total-vm can be high but it won't all be used in real memory. This is made possible due to overcommit. A good sign of this happening, given swap off, is when total-vm is high but rss is actually low! This is exactly what's happening here! We have about 30GB difference between total-vm and rss.

Takeaway 7: We have two problems here: a) support for over-commitment, and b) allocation of memory we suspect is not needed!

Let's look at solving the over-commit first and see how much of a fix it provides:

Controlling over-commits

So far, we have concluded over-commitment is a problem. Well, as discussed previously, it's a feature (of both kernel & kube) apparently!

Kernel uses the "extendability" of virtual addressing to over-commit. The kernel settings vm.overcommit_memory and vm.overcommit_ratio is specially designed to controlling this capability. For more info, see here.

1.1 vm.overcommit_memory = 0: Make best guess and overcommit where possible. This is the default.

1.2 vm.overcommit_memory = 1: Always overcommit

1.3 vm.overcommit_memory = 2: Never overcommit, and only allocate as much memory as defined in overcommit_ratio.

vm.overcommit_ratio is only used when overcommit_memory=2. It defines the percentage of physical RAM that (together with swap space) can be committed. It defaults to 50; we want it to be 100.
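To check where a node currently stands, these files are the standard procfs view of the same settings:

    cat /proc/sys/vm/overcommit_memory
    cat /proc/sys/vm/overcommit_ratio
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo   # derived limit vs memory committed so far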

But using sysctl to set these (as below) is not enough, as the config won't persist across horizontal scaling (new nodes spinning up due to spot instances, or, less importantly, a restart):

sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=100
The effect of these configs is immediate; no restart is needed. Speaking of restarts, sysctl CLI updates do not persist: the settings need to be added to /etc/sysctl.conf to survive a restart, as shown below.
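A minimal sketch of persisting the same values (the standard sysctl config path, nothing kops-specific):

    echo 'vm.overcommit_memory = 2' | sudo tee -a /etc/sysctl.conf
    echo 'vm.overcommit_ratio = 100' | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p    # reload /etc/sysctl.conf without a reboot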

On Kube (kops-provisioned clusters), these settings need to be supplied through the sysctlparameters config, but that is only supported from Kube 1.17 and higher! Safe sysctl parameters can be set at the pod level, however our setting is (obviously) not supported at the pod level. One can't use additionaluserdata for this either, as those settings are overridden when kops provisions the node as a Kube node!

And, to make it a helluva fun, this cluster is currently at 1.12! Heya, Mr. Murphy!

So, I say my prayers, and turn to bash:

# For each node in the autoscaling group 'myasg', look up its private IP
# and push the sysctl settings over ssh.
for memip in $(aws ec2 describe-instances --region us-east-1 --instance-ids \
    $(aws autoscaling describe-auto-scaling-instances --region us-east-1 --output text \
    --query "AutoScalingInstances[?AutoScalingGroupName=='myasg'].InstanceId") \
    --output text --query "Reservations[].Instances[].PrivateIpAddress")
do
    ssh -o StrictHostKeyChecking=no ${memip} 'bash -s' < set_mem.sh
done
where set_mem.sh is:
#!/usr/bin/env bash
# Disable overcommit and allow commits up to 100% of physical RAM (swap is off here).
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=100

I see a massive improvement in OOMKills. Pods that were being killed every 20 minutes or so are now chugging along through 24 hours of processing with no crash yet.

*Figure 14: Getting somewhere! OOMKills sort of under control!*

So, perhaps we can upgrade Kube and make this configuration systematic!

But, I am not done yet! No no no no no no no .....

Remember, part b of our problem in takeaway 7 i.e. b) Allocation of what we suspect un-needed memory!.

Why was it happening in the first place, and why is it reduced with overcommit disabled? I won't lie, it still happens, but far less frequently!

It's not fixed yet!

Oh! the fun never ends! All the places we go! I will cover this later, ahem ahem, when I know the answer! Pretty sure it's some nasty behavior of Tensorflow 2, and the investigation is underway!

Thanks for reading. Hopefully, it was a fun, insightful read!