ACL2024

The PGNSC Benchmark: How Do We Predict Where Information Spreads?

Alexander Taylor, Wei Wang

Abstract

Social networks have become ideal vehicles for news dissemination because posted content can easily reach users beyond a news outlet's direct audience. Understanding how information is transmitted among communities of users is a critical step towards understanding the impact social networks have on real-world events. Two significant barriers in this vein of work are identifying user clusters and meaningfully characterizing these communities. Thus, we propose the PGNSC benchmark, which builds information pathways based on the audiences of influential news sources and uses their content to characterize the communities. We present methods for aggregating these news-source-centric communities and for constructing the community feature representations, which are used sequentially to construct information pathway prediction pipelines. Lastly, we perform extensive experiments to demonstrate the performance of baseline pipeline constructions and to highlight the possibilities for future work.

1 Introduction

Social media platforms have become a crucial part of the information dissemination ecosystem. By allowing users to choose whom they share their so-called "content feeds" with, these platforms have created an environment in which the reach of information is amplified.

Pew Research (Forman-Katz, 2022) reported that roughly half of American adults regularly consume news through social media and that 13% prefer to get their news through social media, a share that rises to 33% for adults under 30. Further surveying suggests that adults under 30 place as much trust in news gathered from social media as in news from traditional news outlets (Liedke, 2022). Because of this, traditionally "offline" news sources now have dedicated social media accounts that seek to propa-

tions between news organization communities.
We also provide a general sequential framework for building pipelines to perform information pathway prediction and define baseline methods for each pipeline stage.

For the community aggregation stage, we include several methods of aggregating communities based on prior work (Taylor et al., 2023; Komorowski et al., 2018; Romero et al., 2010) and show their impact on information pathway prediction performance. To construct community feature representations, we leverage recent advances in large language models (LLMs) and their use in enhancing graph representations by using LLMs to summarize and encode each organization's content (He et al., 2023; Chen et al., 2023b). Because news content often includes images, we also incorporate a jointly trained image-text encoder into the set of community node feature generation pipelines (Radford et al., 2021).

The appeal of PGNSC goes beyond providing data for analyzing patterns of information propagation and for graph representation tasks. We believe PGNSC is well positioned to serve as a vehicle for exploring how LLMs and graph data can be used in concert to make predictions.

2 Dynamic Graph Benchmarks

Graph representations have recently undergone a renaissance with the development and application of graph neural networks that benefit from data richness and complexity (Huang et al., 2023; noa; Gravina and Bacciu, 2023). This evolution has underscored the critical need for robust benchmarks, built on both real-world and synthetic datasets, in the static and dynamic graph domains. Our focus herein is on dynamic graphs, which are pivotal for modeling time-evolving relationships in numerous applications.
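To make the notion of a dynamic graph concrete, the following is a minimal sketch, not the benchmark's actual data format. It assumes that nodes are news-source communities and that each timestamped edge records information passing from one community to another. The names `TemporalEdge`, `snapshot`, and `chronological_split`, as well as the `outlet_*` identifiers, are hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical record (an assumption, not the PGNSC schema):
# information moved from community `src` to community `dst` at time `t`.
TemporalEdge = namedtuple("TemporalEdge", ["src", "dst", "t"])


def snapshot(edges, t_max):
    """Return the static edge set observed up to and including time t_max."""
    return {(e.src, e.dst) for e in edges if e.t <= t_max}


def chronological_split(edges, train_fraction=0.7):
    """Split edges by time so that no training edge is later than any test edge."""
    ordered = sorted(edges, key=lambda e: e.t)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy example with made-up community identifiers.
edges = [
    TemporalEdge("outlet_a", "outlet_b", 1),
    TemporalEdge("outlet_b", "outlet_c", 2),
    TemporalEdge("outlet_a", "outlet_c", 3),
    TemporalEdge("outlet_c", "outlet_d", 4),
]
train, test = chronological_split(edges, train_fraction=0.5)
```

Splitting by timestamp rather than at random keeps future edges out of the training set, which is the usual reason dynamic graph evaluation is set up this way.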
2.1 Real-world Dynamic Graph Benchmarks

High-quality, real-world datasets are considered

We define communities as sets of users aggregated around the news sources defined in PGNSC (Taylor et al., 2023). Community aggregations were performed using the engagement metrics and user interactions drawn from the user-level information pathways, according to methods described in detail in a later section. Each user-level information pathway was mapped to a community-

Limitations

This work presents a novel benchmark, 5 community aggregation heuristics, and 7 feature initialization pipelines. While we mentioned scalability issues of existing methodologies, we acknowledge the simplicity of the community aggregation heuristics included in this work. We also acknowledge the limitations of our data: as mentioned in the main paper, many response tweets are missing engagement metrics, meaning that the Engagement and Interactions scores are computed over incomplete data. Lastly, we have included the sources used to determine