TY - JOUR

T1 - Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage

AU - Wang, Pinghui

AU - Qi, Yiyan

AU - Sun, Yu

AU - Zhang, Xiangliang

AU - Tao, Jing

AU - Guan, Xiaohong

N1 - KAUST Repository Item: Exported on 2021-04-19
Acknowledgements: The research presented in this paper is supported in part by National Natural Science Foundation of China (U1301254, 61603290, 61602371), 111 International Collaboration Program of China, Ministry of Education&China Mobile Research Fund (MCM20160311), Natural Science Foundation of Jiangsu Province (SBK2014021758), Prospective Joint Research of Industry-Academia-Research Joint Innovation Funding of Jiangsu Province (BY2014074), Shenzhen Basic Research Grant (JCYJ20160229195940462), China Postdoctoral Science Foundation (2015M582663), Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034).

PY - 2017/10

Y1 - 2017/10

N2 - Counting triangles in a large graph is important for detecting network anomalies such as spam web pages and suspicious accounts (e.g., fraudsters and advertisers) on online social networks. However, it is challenging to compute the number of triangles in a large graph represented as a stream of edges with a low computational cost when given a limited memory. Recently, several effective sampling-based approximation methods have been developed to solve this problem. However, they assume the graph stream of interest contains no duplicate edges, which does not hold in many real-world graph streams (e.g., phone calling networks). In this paper, we observe that these methods exhibit a large estimation error or computational cost even when modified to deal with duplicate edges using deduplication techniques such as Bloom filter and hash-based sampling. To solve this challenge, we design a one-pass streaming algorithm for uniformly sampling distinct edges at a high speed. Compared to state-of-the-art algorithms, our algorithm reduces the sampling cost per edge from O(log k) (k is the maximum number of sampled edges determined by the available memory space) to O(1) without using any additional memory space. Based on sampled edges, we develop a simple yet accurate method to infer the number of triangles in the original graph stream. We conduct extensive experiments on a variety of real-world large graphs, and the results demonstrate that our method is several times more accurate and faster than state-of-the-art methods with the same memory usage.

AB - Counting triangles in a large graph is important for detecting network anomalies such as spam web pages and suspicious accounts (e.g., fraudsters and advertisers) on online social networks. However, it is challenging to compute the number of triangles in a large graph represented as a stream of edges with a low computational cost when given a limited memory. Recently, several effective sampling-based approximation methods have been developed to solve this problem. However, they assume the graph stream of interest contains no duplicate edges, which does not hold in many real-world graph streams (e.g., phone calling networks). In this paper, we observe that these methods exhibit a large estimation error or computational cost even when modified to deal with duplicate edges using deduplication techniques such as Bloom filter and hash-based sampling. To solve this challenge, we design a one-pass streaming algorithm for uniformly sampling distinct edges at a high speed. Compared to state-of-the-art algorithms, our algorithm reduces the sampling cost per edge from O(log k) (k is the maximum number of sampled edges determined by the available memory space) to O(1) without using any additional memory space. Based on sampled edges, we develop a simple yet accurate method to infer the number of triangles in the original graph stream. We conduct extensive experiments on a variety of real-world large graphs, and the results demonstrate that our method is several times more accurate and faster than state-of-the-art methods with the same memory usage.

UR - http://hdl.handle.net/10754/668645

UR - https://dl.acm.org/doi/10.14778/3149193.3149197

UR - http://www.scopus.com/inward/record.url?scp=85055709648&partnerID=8YFLogxK

U2 - 10.14778/3149193.3149197

DO - 10.14778/3149193.3149197

M3 - Article

SN - 2150-8097

VL - 11

SP - 162

EP - 175

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 2

ER -