<?xml version="1.0" encoding="utf-8"?>
<journal>
<title>Biannual Journal Monadi for Cyberspace Security (AFTA)</title>
<title_fa>Security of the Information Production and Exchange Space (Monadi)</title_fa>
<short_title>Monadi</short_title>
<subject>Engineering &amp; Technology</subject>
<web_url>http://monadi.isc.org.ir</web_url>
<journal_hbi_system_id>1</journal_hbi_system_id>
<journal_hbi_system_user>admin</journal_hbi_system_user>
<journal_id_issn>2476-3047</journal_id_issn>
<journal_id_issn_online>2476-3047</journal_id_issn_online>
<journal_id_pii>8</journal_id_pii>
<journal_id_doi>7</journal_id_doi>
<journal_id_iranmedex></journal_id_iranmedex>
<journal_id_magiran></journal_id_magiran>
<journal_id_sid>14</journal_id_sid>
<journal_id_nlai>8888</journal_id_nlai>
<journal_id_science>13</journal_id_science>
<language>fa</language>
<pubdate>
	<type>jalali</type>
	<year>1404</year>
	<month>12</month>
	<day>1</day>
</pubdate>
<pubdate>
	<type>gregorian</type>
	<year>2026</year>
	<month>3</month>
	<day>1</day>
</pubdate>
<volume>14</volume>
<number>2</number>
<publish_type>online</publish_type>
<publish_edition>1</publish_edition>
<article_type>fulltext</article_type>
<articleset>
	<article>


	<language>fa</language>
	<article_id_doi></article_id_doi>
	<title_fa>A novel method for training energy-based models for more efficient purification of adversarial images</title_fa>
	<title>A novel method for training energy-based models for efficient purification of adversarial images</title>
	<subject_fa>Cryptography and Information Security</subject_fa>
	<subject>Cryptology and Information Security</subject>
	<content_type_fa>Research</content_type_fa>
	<content_type>Research Article</content_type>
	<abstract_fa>Deep learning applications are expanding rapidly. In sensitive domains such as security and health, however, adversarial attacks, in which small perturbations of the input cause a severe drop in model performance, remain a serious obstacle to the adoption of these applications. Generative models, which can learn the data distribution, are promising candidates for recovering the original image from adversarial samples in a process called purification. In this paper we propose a new purification pipeline in which an energy-based model M_p is applied before the classifier M. Unlike previous approaches, and for the first time, M_p is trained to assign high energy (low probability) to adversarial examples of M, which serve as the negative samples during training. According to the results, our method achieves higher robust accuracy against adversarial examples than the standard purification model M_p, with improvements of 12.3%, 22.87% and 12.30% on the MNIST, FashionMNIST and CIFAR10 datasets under AutoAttack (L∞). Moreover, despite the simplicity of the training procedure, on CIFAR10 our method improves robust accuracy by 3% over a state-of-the-art energy-based model. In addition, we design an adaptive attack that targets our defense and show that M_p is still able to neutralize it.</abstract_fa>
	<abstract>Deep learning applications are expanding rapidly, and neural networks have demonstrated remarkable performance across a wide range of tasks, including computer vision, natural language processing and autonomous driving. Deploying these networks in safety- and security-critical applications, however, still faces serious challenges, which fall broadly within the domain of trustworthy machine learning. Trustworthiness challenges can be grouped into several classes; in this paper we focus on one of the most critical, adversarial attacks, and propose a defensive method to mitigate them.

Adversarial attacks add carefully crafted, extremely small perturbations, imperceptible to humans, that cause a classifier to misclassify its input at test time. They are particularly dangerous because the adversary needs no access to the training data, and they can severely degrade the performance of state-of-the-art classifiers, in some cases reducing accuracy to near zero.

To counter these attacks, various defense strategies have been proposed, which can generally be divided into certified and empirical defenses. Certified defenses provide mathematical guarantees that the predicted label remains unchanged under bounded perturbations. Empirical defenses offer no formal guarantee and instead rely on heuristic or practical techniques to mitigate attacks. Despite the lack of provable guarantees, empirical defenses are more numerous and often demonstrate stronger robustness in practice. This category includes input transformation, architectural modification, modification of the training procedure, and the use of auxiliary models.

In input-transformation defenses, an adversarial image is first denoised by a defense mechanism and the purified image is then passed to the classifier for prediction. Generative models, owing to their ability to capture the underlying data distribution, are well suited to image purification: they can map adversarial examples back toward the true data manifold and thereby remove the adversarial noise. Most generative-model-based defenses therefore fall under input-transformation defenses.

A class of generative models known as energy-based models (EBMs) can serve as effective tools for purifying adversarial images. Although these models do not allow the exact probability of an individual input to be computed (the normalizing constant is intractable), they do allow the relative likelihoods of different inputs to be compared, and the gradient of the energy with respect to the input gives a trajectory along which the input can be iteratively modified so as to increase its likelihood under the model.

When the input to an energy-based model is pure noise, maximizing its likelihood amounts to generating a realistic image. Ideally, when the input is an adversarial example, increasing its likelihood amounts to removing the adversarial perturbation and reconstructing a clean image from the adversarial one, and the purified image is then expected to be classified correctly.

Effective denoising requires the energy-based model to be trained properly. Several training methods exist; one of them is contrastive divergence. Training starts from a randomly initialized model, and at each iteration a batch of real data samples and a batch of fake samples generated from the current model are drawn. The model is trained to increase the probability (decrease the energy) assigned to real data and to decrease the probability (increase the energy) assigned to fake data; because the total probability mass of the distribution is fixed, these increases and decreases must balance. After enough effective training iterations, the model learns an accurate approximation of the true data distribution and becomes capable of both generating samples from it and removing adversarial perturbations from adversarial images.

In practice, contrastive divergence in its standard form is not a suitable training strategy for an energy-based model intended for adversarial denoising. Each purification step must follow the energy gradient from the adversarial image toward its clean counterpart, but adversarial examples are extremely close to clean images, so a standardly trained model assigns the two nearly the same energy and the resulting gradient is very small. Purification therefore becomes slow and ineffective, and fails to denoise sufficiently within a reasonable number of steps or amount of time.

In the method proposed in this paper, adversarial examples generated against the classifier with AutoAttack (one of the strongest and most widely applicable adversarial attacks) are added to the set of fake samples during training. The energy-based model thus learns to treat these adversarial examples as low-probability samples and becomes familiar with the structure of adversarial perturbations, which lets it remove adversarial noise more effectively during purification. Under this formulation the difference in probability between clean and adversarial images, and hence the gradient that drives purification, becomes larger, so purification proceeds more efficiently and effectively.

Several factors influence the quality of purification, including the proportion of adversarial samples in the fake batch and the number of purification steps. Their effect must be evaluated on clean accuracy as well as on robust accuracy, since prior studies have shown that the two metrics often trade off: improving one may degrade the other.

Empirically, increasing the proportion of adversarial examples in the fake batch and increasing the number of purification steps both improve robust accuracy. As expected, a larger number of purification steps requires injecting more noise into the image during purification, which reduces clean accuracy. The experiments show, however, that this reduction is smaller than for a standard energy-based model trained without adversarial examples, so the proposed training strategy not only improves robust accuracy but also preserves more clean accuracy than conventional energy-based model training.

Experimental results show that our approach achieves significantly higher accuracy against adversarial examples than a standardly trained purification model. Under the AutoAttack (L∞) benchmark, our method improves robust accuracy by 12.31%, 22.87% and 12.30% on MNIST, FashionMNIST and CIFAR-10 respectively. Moreover, despite its simpler training procedure, our approach surpasses a state-of-the-art energy-based model on CIFAR-10 by 3% in robust accuracy.

In addition to the proposed defense, this paper introduces an adaptive attack that targets energy-based purification. Its core idea is that both the purification process of the energy-based model and the attack on the classifier operate by adding perturbations; if the two perturbations are aligned, the attack succeeds. The applied perturbation then simultaneously increases the likelihood of the image under the energy-based model and strengthens the attack on the classifier, so the attacked image is assigned a high probability by the energy-based model rather than being identified as a low-probability sample, and the model no longer attempts to purify it.

To achieve this, the perturbation applied by the energy-based model during purification is explicitly incorporated into the attack. One attack step is performed first, followed by one purification step with the energy-based model; the purification perturbation is scaled by a predefined coefficient and added to the perturbation computed against the classifier for the next attack step, which is then carried out. Injecting the purification perturbation into the attack gives the resulting image a high likelihood under the energy-based model, which effectively disables further purification, and repeating the process over multiple iterations yields a strong adaptive attack.

According to prior studies, adversarial examples generated against robust architectures, i.e. models equipped with defense mechanisms, exhibit a property known as perceptually aligned gradients: the adversarial images tend to visually resemble the target class chosen by the attacker. This property was examined and confirmed for the adversarial examples produced by the proposed adaptive attack.

In the presented attack no explicit constraint is imposed on the magnitude of the added perturbation, which lets the resulting adversarial examples reduce the classifier's robust accuracy to nearly zero. However, since each dataset defines a permissible perturbation budget, constraining the perturbation to the standard bound allows the defense to fully neutralize the attack.</abstract>
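The contrastive-divergence training with adversarial negatives, the purification step and the adaptive attack described above, as one runnable toy. This is an added example written for this page, not the authors' implementation, and everything in it is assumed rather than taken from the paper: the inputs are 2-D points instead of images; the classifier M is a fixed linear model whose weight vector leans on the direction along which the constructed training data barely varies, the off-manifold sensitivity that adversarial perturbations exploit; the energy-based model M_p is a log-linear EBM over fixed radial-basis features, so the gradients of its energy with respect to the input and to the parameters are closed-form; and L∞ PGD on a margin loss stands in for AutoAttack.

```python
"""Toy of the abstract's pipeline: an energy-based purifier M_p in front of a classifier M, trained by
contrastive divergence with the classifier's adversarial examples mixed into the negative batch, plus the
adaptive attack that adds a scaled purification step to each attack step. Added example, not the authors'
code; every model, constant and data point below is assumed."""
import math
import random

W, B = (0.2, 1.0), 0.0      # assumed classifier M: label = sign(W.x + B); it leans on x2, which the data barely uses
EPS, ALPHA = 0.5, 0.2       # assumed L_inf budget and PGD step size
CENTERS = [(i, j) for i in (-3.0, -1.5, 0.0, 1.5, 3.0) for j in (-1.5, 0.0, 1.5)]  # fixed RBF centres z_j
# constructed training set: two classes spread along x1, almost no spread in x2 (the "data manifold")
DATA = [((s * m, j), s) for s in (1, -1) for m in (0.5, 1.0, 1.5, 2.0) for j in (-0.05, 0.05)]

def feats(x):
    # phi_j(x) = exp(-|x - z_j|^2 / 2); the EBM is E(x) = -sum_j theta_j * phi_j(x)
    return [math.exp(-((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2) / 2.0) for c in CENTERS]

def energy(theta, x):
    return -sum(t * f for t, f in zip(theta, feats(x)))

def grad_x_energy(theta, x):
    # dE/dx = sum_j theta_j * phi_j(x) * (x - z_j)
    g = [0.0, 0.0]
    for t, f, c in zip(theta, feats(x), CENTERS):
        g[0] += t * f * (x[0] - c[0])
        g[1] += t * f * (x[1] - c[1])
    return g

def predict(x):
    return 1 if W[0] * x[0] + W[1] * x[1] + B > 0 else -1

def pgd(x, y, steps=5, eps=EPS, alpha=ALPHA):
    # L_inf PGD on the margin loss -y*(W.x): the signed gradient is -y*sign(W); project onto the eps-ball
    adv = list(x)
    for _ in range(steps):
        for i in range(2):
            adv[i] -= y * alpha * (1 if W[i] > 0 else -1)
            adv[i] = min(max(adv[i], x[i] - eps), x[i] + eps)
    return tuple(adv)

def purify(theta, x, steps=20, eta=0.05, noise=0.0, rng=None):
    # gradient descent on the energy, i.e. ascent on log-likelihood; noise > 0 makes it Langevin dynamics
    x = list(x)
    for _ in range(steps):
        g = grad_x_energy(theta, x)
        for i in range(2):
            x[i] += -eta * g[i] + (noise * rng.gauss(0.0, 1.0) if noise else 0.0)
    return tuple(x)

def train_cd(p_adv, steps=200, lr=0.05, theta=None, seed=0):
    """Contrastive divergence: lower the energy of real data, raise it on the negative batch. A fraction p_adv
    of the negatives are PGD examples of M (the proposed training); the rest are short-run Langevin samples."""
    rng = random.Random(seed)
    theta = list(theta) if theta else [0.0] * len(CENTERS)
    n_adv = round(p_adv * len(DATA))
    for _ in range(steps):
        real = [feats(x) for x, _ in DATA]
        negs = [feats(pgd(x, y)) for x, y in rng.sample(DATA, n_adv)]
        negs += [feats(purify(theta, (rng.uniform(-4, 4), rng.uniform(-3, 3)), 10, 0.1, math.sqrt(0.2), rng))
                 for _ in range(len(DATA) - n_adv)]
        for j in range(len(theta)):
            # dE/dtheta_j = -phi_j(x), so CD moves theta_j by lr * (mean_real phi_j - mean_neg phi_j)
            theta[j] += lr * (sum(r[j] for r in real) / len(real) - sum(n[j] for n in negs) / len(negs))
    return theta

def energy_gap(theta):
    # mean energy of PGD examples minus mean energy of clean data: the larger, the stronger the purification pull
    clean = sum(energy(theta, x) for x, _ in DATA) / len(DATA)
    adv = sum(energy(theta, pgd(x, y)) for x, y in DATA) / len(DATA)
    return adv - clean

def accuracy(theta=None, attack=None, **purify_kw):
    # accuracy of M, optionally after an attack and optionally with M_p purifying the input first
    hits = 0
    for x, y in DATA:
        z = attack(x, y) if attack else x
        if theta is not None:
            z = purify(theta, z, **purify_kw)
        hits += predict(z) == y
    return hits / len(DATA)

def adaptive_attack(theta, x, y, lam=1.0, steps=20, alpha=ALPHA, eps=None, eta=0.05):
    """Adaptive attack of the abstract: each iteration takes one PGD step against M and adds lam times the
    perturbation one purification step of M_p would apply, so the iterate keeps a high likelihood under M_p
    while moving against M. eps=None is the unconstrained variant; a finite eps projects onto the L_inf ball."""
    adv = list(x)
    for _ in range(steps):
        d_pur = [-eta * g for g in grad_x_energy(theta, adv)]
        for i in range(2):
            adv[i] += -y * alpha * (1 if W[i] > 0 else -1) + lam * d_pur[i]
            if eps is not None:
                adv[i] = min(max(adv[i], x[i] - eps), x[i] + eps)
    return tuple(adv)

if __name__ == "__main__":
    print("no purifier: clean %.2f, PGD %.2f" % (accuracy(), accuracy(attack=pgd)))
    for name, th in (("standard CD", train_cd(0.0)), ("CD + adversarial negatives", train_cd(0.5))):
        adp = [accuracy(th, attack=lambda x, y, e=e: adaptive_attack(th, x, y, eps=e)) for e in (None, EPS)]
        print("%s: energy gap %.3f, clean %.2f, PGD %.2f, adaptive %.2f, adaptive within eps %.2f"
              % ((name, energy_gap(th), accuracy(th), accuracy(th, attack=pgd)) + tuple(adp)))
```

Running it prints, for the classifier alone and for each purifier, the clean accuracy, the accuracy under PGD, and the accuracy under the adaptive attack with and without the perturbation budget. The energy gap is the toy form of the mechanism the abstract describes: with `p_adv=1.0` and a fixed adversarial set, the contrastive-divergence update is linear in the feature means, so every training step raises the mean energy of the adversarial examples relative to the clean data by exactly `lr` times the squared distance between the two feature means, which is the enlarged purification gradient the method relies on. The value of `lam`, the step sizes and the number of training steps are arbitrary choices for the toy, not values from the paper.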
	<keyword_fa>Trustworthiness, Adversarial Attacks, Energy-Based Models, Adversarial Image Purification, Perceptually Aligned Gradient</keyword_fa>
	<keyword>Trustworthiness, Adversarial Attacks, Energy Based Models, Adversarial Purification, Perceptually Aligned Gradient</keyword>
	<start_page>82</start_page>
	<end_page>97</end_page>
	<web_url>http://monadi.isc.org.ir/browse.php?a_code=A-10-407-14&amp;slc_lang=fa&amp;sid=1</web_url>


<author_list>
	<author>
	<first_name>Reza</first_name>
	<middle_name></middle_name>
	<last_name>Hajimohammadi Tabriz</last_name>
	<suffix></suffix>
	<first_name_fa>رضا</first_name_fa>
	<middle_name_fa></middle_name_fa>
	<last_name_fa>حاجی محمدی تبریز</last_name_fa>
	<suffix_fa></suffix_fa>
	<email>reza_hajimohammadi@ee.sharif.edu</email>
	<code>10031947532846002115</code>
	<orcid>10031947532846002115</orcid>
	<coreauthor>No</coreauthor>
	<affiliation>Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran</affiliation>
	<affiliation_fa>Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran</affiliation_fa>
	 </author>


	<author>
	<first_name>Sajjad</first_name>
	<middle_name></middle_name>
	<last_name>Amini</last_name>
	<suffix></suffix>
	<first_name_fa>سجاد</first_name_fa>
	<middle_name_fa></middle_name_fa>
	<last_name_fa>امینی</last_name_fa>
	<suffix_fa></suffix_fa>
	<email>s_amini@sharif.edu</email>
	<code>10031947532846002116</code>
	<orcid>10031947532846002116</orcid>
	<coreauthor>Yes</coreauthor>
	<affiliation>Electronics Research Institute, Sharif University of Technology, Tehran, Iran</affiliation>
	<affiliation_fa>Electronics Research Institute, Sharif University of Technology, Tehran, Iran</affiliation_fa>
	 </author>


	<author>
	<first_name>Reza</first_name>
	<middle_name></middle_name>
	<last_name>Kazemi</last_name>
	<suffix></suffix>
	<first_name_fa>رضا</first_name_fa>
	<middle_name_fa></middle_name_fa>
	<last_name_fa>کاظمی</last_name_fa>
	<suffix_fa></suffix_fa>
	<email>reza.kazemi@sharif.edu</email>
	<code>10031947532846002117</code>
	<orcid>10031947532846002117</orcid>
	<coreauthor>No</coreauthor>
	<affiliation>Electronics Research Institute, Sharif University of Technology, Tehran, Iran</affiliation>
	<affiliation_fa>Electronics Research Institute, Sharif University of Technology, Tehran, Iran</affiliation_fa>
	 </author>


</author_list>


	</article>
</articleset>
</journal>
