Cracker algorithm for mining connected components in large graphs accepted at ISCC 2015!

Emanuele Carlini, HPC Lab., ISTI-CNR, Pisa, Italy
Patrizio Dazzi, HPC Lab., ISTI-CNR, Pisa, Italy
Alessandro Lulli, University of Pisa and HPC Lab., ISTI-CNR, Pisa, Italy
Claudio Lucchese, HPC Lab., ISTI-CNR, Pisa, Italy
Laura Ricci, University of Pisa

Apr. 16 2015

Accepted at the 20th IEEE Symposium on Computers and Communications, Larnaca, Cyprus, 6–9 July 2015 [1].

Abstract. The problem of finding connected components in a graph is common to several applications dealing with graph analytics, such as social network analysis, web graph mining and image processing. The exponentially growing size of graphs requires the definition of appropriated computational models and algorithms for their processing on high throughput distributed architectures. In this paper we present cracker, an efficient iterative algorithm to detect connected components in large graphs. The strategy of cracker is to iteratively grow a spanning tree for each connected component of the graph. Nodes added to such trees are discarded from the computation in the subsequent iterations. We provide an extensive experimental evaluation considering a wide variety of synthetic and real-world graphs. The experimental results show that cracker consistently outperforms state-of-the-art approaches both in terms of total computation time and volume of messages exchanged.

1 The cracker Algorithm

The basic idea of cracker is to iteratively reduce the graph size by progressively pruning vertices until only one vertex for each connected components (CC) is left, i.e., its seed. For instance, an isolated vertex can be immediately excluded from further processing, since it does not provide useful information for the discovery of the other ccs. Similarly, as soon as a cc is identified, it can be excluded from computation since it does not impact on the other ccs in the graph. When a vertex is removed, it contributes to iteratively build a spanning tree that, at the end of the algorithm, will exist a spanning tree for each connected component.


PIC
Figure 1: cracker: example of seed identification. Gray nodes are excluded from the computation

The seed identification is an iterative algorithm made of two steps. The first step, named Min Selection Step, aims at understanding, for each node, if it is a potential seed by at least one of its neighbours. This decision is taken locally at each node, given that the only knowledge available locally to each node is its neighbourhood. A node can be considered as a potential seed if it is a local minimum in the neighbourhood of some node. If a node is not a local minimum in any neighbourhood then it cannot be a seed. To identify such nodes, the information about the local minimum in its neighbourhood is spread by each node to its neighbourhood.

Given the undirected input graph Gt, an intermediate directed graph Ht is generated linking nodes to potential seeds. Subsequently, the goal of the Pruning Step is to remove from Ht, and thereby deactivate, all the nodes that cannot be considered as seeds. A new undirected graph Gt+1, smaller that Gt, is generated and used to feed the next iteration (see Figure 1). The nodes that are moved during the Pruning Step grow a forest of covering trees T of the cc in the graph. Eventually, the seeds remain the only active nodes in the computation. Upon their deactivation, the Seed Identification terminates and the Seed Propagation is triggered.

2 Experiments


Table 1: Results on real-world datasets. The results marked with star are relative to not terminated executions due to errors.













Italy
California
PPI-All
Algorithm TimeStepsMsg.s Vol.TimeStepsMsg.s Vol.TimeStepsMsg.s Vol.













cracker 461 33 7261724 37 24 54 132 1389 12 8852134













ccf 807 18 12144744 60 14 100 408 2182 6 2393733













hash-to-min 1681 18 17746456 87 14 147 539 1777 6 11034360













ccmr >3957>15 88 14 197 333 2235 6 30995174


























Flickr
YouTube
Skitter
Algorithm TimeStepsMsg.s Vol.TimeStepsMsg.s Vol.TimeStepsMsg.s Vol.













cracker 45 12 60 147 43 15 68 164 35 14 50 126













ccf 54 7 40 239 49 7 65 267 38 7 39 193













hash-to-min 82 7 90 338 113 7 109 394 77 7 82 305













ccmr 107 6 170 287 211 7 140 239 127 7 151 255













pegasus 121 14 4591354 140 19 4171191 136 21 5011469














Experimental results on real-word data are shown in Table 1. cracker resulted to be the most efficient algorithm. The wide spectrum of datasets analysed reveals that the pruning mechanism is effective and lets cracker to outperform the existing state-of-the-art solutions. cracker revealed to be a very flexible solution, resulting the most efficient solution both when the largest cc is near the dimension of the entire graph and when it is of smaller size. In terms of time, cracker outperforms its competitors in every dataset used, with the best competitor being from 9% to 75% slower. Finally, cracker obtained the least message volume among all the competitors.

References

[1]   Emanuele Carlini, Patrizio Dazzi, Alessandro Lulli, Claudio Lucchese, and Laura Ricci. Cracker: Crumbling large graphs into connected components. In ISCC ’15: Proceedings of the 20th IEEE Symposium on Computers and Communications, 2015.

Share on