WWW2026
Frequency-Corrupt Based Graph Self-Supervised Learning
Haojie Li, Mengjiao Zhang, Guanfeng Liu, Qiang Hu, Yan Wang, Junwei Du
Abstract
Graph self-supervised learning (GSSL) alleviates the graph data labeling bottleneck by learning representations without manual labels, enabling wide application in domains such as recommendation systems and social network analysis. High-frequency signals are valuable in GSSL because they capture local structural preferences, thereby enriching graph representations and improving model performance. In practice, however, two problems hinder the efficient and robust use of these signals. First, the locality of high-frequency signals prevents the model from fully utilizing them. Second, over-reliance on specific high-frequency signals harms the model's generalization. To address these problems, we propose the Frequency-Corrupt Based Graph Self-Supervised Learning (FC-GSSL) algorithm. Specifically, we generate corrupted graphs biased toward high-frequency signals by corrupting nodes and edges according to their low-frequency contributions. These corrupted graphs are fed to an autoencoder, with low-frequency and general features serving as the supervision signal. This compels the model to fuse high- and low-frequency signals, so that it integrates and uses more of the valuable high-frequency information. Additionally, we design multiple sampling strategies and form diverse corrupted graphs from the intersections and unions of the results of these strategies. By aligning the node representations from these views, the model can identify valuable frequency combinations, which reduces the negative impact of specific high-frequency components and improves generalization. FC-GSSL optimizes the design of GSSL for web applications, significantly improving model performance on complex web-related graphs such as social networks and citation networks. This work contributes directly to the "Graph Algorithms and Modeling for the Web" research track.
Experimental results on 14 datasets across multiple tasks demonstrate the superiority of the proposed approach.
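To make the corruption step more concrete, the following is a minimal sketch of one way edges could be dropped according to a low-frequency score. It is not the FC-GSSL implementation: the abstract does not specify how low-frequency contributions are computed, so the scoring rule (smoothness of the lowest-frequency eigenvectors of the normalized Laplacian across each edge), the function names `low_freq_edge_scores` and `corrupt_edges`, and the parameters `drop_ratio` and `num_low` are all assumptions made for illustration.

```python
import numpy as np


def low_freq_edge_scores(adj, num_low):
    """Score each undirected edge by an ASSUMED low-frequency contribution.

    Assumption (not from the paper): an edge contributes more to the
    low-frequency part of the spectrum when the lowest-frequency
    eigenvectors of the normalized Laplacian vary little across it.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)      # eigenvalues in ascending order
    low = eigvecs[:, :num_low]            # lowest-frequency components
    rows, cols = np.triu(adj, k=1).nonzero()  # each undirected edge once
    diff = ((low[rows] - low[cols]) ** 2).sum(axis=1)
    scores = 1.0 / (1.0 + diff)           # smoother edge -> higher score
    return rows, cols, scores


def corrupt_edges(adj, drop_ratio=0.3, num_low=2, seed=0):
    """Drop edges with probability proportional to their low-frequency score,
    so the remaining graph is biased toward high-frequency structure."""
    rng = np.random.default_rng(seed)
    rows, cols, scores = low_freq_edge_scores(adj, num_low)
    n_drop = int(round(drop_ratio * len(rows)))
    probs = scores / scores.sum()
    drop = rng.choice(len(rows), size=n_drop, replace=False, p=probs)
    corrupted = adj.copy()
    corrupted[rows[drop], cols[drop]] = 0.0
    corrupted[cols[drop], rows[drop]] = 0.0  # keep the graph undirected
    return corrupted


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

corrupted = corrupt_edges(adj, drop_ratio=0.3, num_low=2, seed=0)
```

In a pipeline like the one the abstract describes, a corrupted adjacency matrix of this kind would be the autoencoder input, while the reconstruction target would come from the low-frequency and general features of the original graph. Node corruption and the combination of several sampling strategies through intersections and unions are omitted from this sketch.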