PIC

Claudio Lucchese


I’m Associate Professor with the Ca’ Foscari University of Venice. Formerly, I was researcher with the I.S.T.I. “A. Faedo” in Pisa, working with the High Performance Computing Lab. Here you may find contact info, résumé, and other details. Below some news and notes on my research activities in the areas of data mining and information retrieval. You can follow me on Twitter.

Efficient and Effective Document Ranking for Web Search.

Web search engines use very complex models in order to estimate the relevance of a document w.r.t. a given user query. These models are made of thousands regression trees, and their evaluation is computationally expensive. In [5] we proposed a new algorithm, named QuickScorer, best paper a ACM SIGIR 2015 and later extended in ACM TOIS[2], being up to 6 times faster than state of the art. This is achieved thanks to a novel data layout that provides better cache locality, and that transforms the forest traversal in to fast bit-wise operations. Additional improvements were achieved by vectorizing the algorithm: by exploiting SIMD instructions it is possible to evaluate multiple documents simultaneously [6]. A patent application was filled for the algorithm.

We also developed novel algorithm aimed at improving the quality of ranking models. The proposed CLEaVER algorithm [4] is able to prune by a factor of 50% to 80% the trees of a given model, without hindering its accuracy.

These results are the outcome of a quality vs. efficiency trade-off investigation we conducted [1], partially in collaboration with Tiscali Italia SpA.

Most of the work on Learning to Rank is described here: http://learningtorank.isti.cnr.it/.

Software

PIC

RankEval

RankEval [3] is an open-source tool for the analysis and evaluation of Learning-to-Rank models based on ensembles of regression trees. The success of ensembles of regression trees fostered the development of several open-source libraries targeting efficiency of the learning phase and effectiveness of the resulting models. However, these libraries offer only very limited help for the tuning and evaluation of the trained models. RankEval aims at providing a common ground for several Learning to Rank libraries by providing useful and interoperable tools for a comprehensive comparison and in-depth analysis of ranking models.

RankEval is available on GitHub: https://github.com/hpclab/rankeval.

PIC

QuickRank

QuickRank [1] is an efficient Learning to Rank toolkit providing multithreaded C++ implementation of several algorithms. QuickRank was designed and developed with efficiency in mind.

QuickRank is available on GitHub: https://github.com/hpclab/quickrank.

Projects

References

[1]   Gabriele Capannini, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, and Nicola Tonellotto. Quality versus efficiency in document scoring with learning-to-rank models. Information Processing & Management, 2016.

[2]   Domenico Dato, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. Fast ranking with additive ensembles of oblivious and non-oblivious regression trees. ACM Transactions on Information Systems, 35(2):15:1–15:31, 2016.

[3]   Claudio Lucchese, Cristina Ioana Muntean, Franco Maria Nardini, Raffaele Perego, and Salvatore Trani. Rankeval: An evaluation and analysis framework for learning-to-rank solutions. In SIGIR ’17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.

[4]   Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, and Salvatore Trani. Post-learning optimization of tree ensembles for efficient ranking. In SIGIR ’16: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016.

[5]   Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. Quickscorer: a fast algorithm to rank documents with additive ensembles of regression trees. In SIGIR ’15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015. (best paper) (ACM Notable Article).

[6]   Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. Exploiting cpu simd extensions to speed-up document scoring with tree ensembles. In SIGIR ’16: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016.

Frequent Pattern Mining.

Frequent Pattern mining is one of the most typical data mining tasks. We attacked the problem both by an efficiency and efficacy side. Frequent pattern mining is computationally expensive. One of the algorithm we developed, named DCI_CLOSED, in IEEE TKDE [8], is still one of the efficiency benchmarks adopted.

A very interesting result was presented at ACM SIGKDD 2011 [1]: we showed that it is possible to extract relevant pattern with a single scan of the dataset by exploiting smart sampling strategies, thus allowing the mining of very large datasets.

We also developed out-of-core [9], multi-core[10], and distributed pattern mining algorithms[7].

On the efficacy side, we investigated a new methodology for extracting a minimal and most informative set of patters by exploiting the Minimum Description Length principle. The proposed algorithm, published in IEEE TKDE [11], improved over the state-of-the-art, also comparing with matrix decomposition techniques such as SVD and PCA.

On the same research line, we developed algorithms that allow the analyst to specify a set of properties (other than frequency) the desired patterns should satisfy [63254].

Software

PaNDa+

This is the result of our work on mining patterns, in particular dense “rectangles”, in noisy binary datasets. The implementation of the TKDE’14 paper [11] is available here.

Direct Local Pattern Sampling

Published at KDD’11 [1], this is the result of a collaboration with M. Boley, Sandy Moens, Daniel Paurat and Thomas Gärtner from the University of Bonn. website.

DCI-Closed

This is the result of our work on Pattern Mining algorithm for the discovery of closed frequent itemsets, in particular it implements three of our papers at ICDM’07 [10], TKDE’06 [8] and SDM’06[9], in one comprehensive software specialized for the dense datasets. In fact, it supports both out-of-core and multi-core mining. download.

Find-Rules

This software generates association rules with given minimum confidence from a collection of frequent itemsets. Differently from others, it does not need a downward closed collection. It extracts all the possible association rules, without assuming that given a frequent itemset the supports of its subsets is known. The input format is the usual ascii format: 1 2 3 (95). download.

Logistic PCA

This is a porting to python of Andrew I. Schein’s matlab code from the paper “A Generalized Linear Model for Principal Component Analysis of Binary Data”. download.

References

[1]   Mario Boley, Claudio Lucchese, Daniel Paurat, and Thomas Gärtner. Direct local pattern sampling by efficient two-step random procedures. In KDD ’11: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 582–590, August 21-24 2011.

[2]   Francesco Bonchi, Fosca Giannotti, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Roberto Trasarti. Conquest: a constraint-based querying system for exploratory pattern discovery. In ICDE ’06: Proceedings of the 2006 IEEE International Conference on Data Engineering (Demo), pages 159–160, 2006.

[3]   Francesco Bonchi, Fosca Giannotti, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Roberto Trasarti. A constraint-based querying system for exploratory pattern discovery. Information Systems, 34(1):3–27, 2009.

[4]   Francesco Bonchi and Claudio Lucchese. On closed constrained frequent pattern mining. In ICDM ’04: Proceedings of the Fourth IEEE International Conference on Data Mining, pages 35–42, November 2004.

[5]   Francesco Bonchi and Claudio Lucchese. On condensed representations of constrained frequent patterns. Knowledge and Information Systems, 9(2):180–201, 2006.

[6]   Francesco Bonchi and Claudio Lucchese. Extending the state-of-the-art of constraint-based pattern discovery. Data and Knowledge Engineering, 60(2):377–399, 2007.

[7]   Claudio Lucchese, Carlo Mastroianni, Salvatore Orlando, and Domenico Talia. Mining@home: Towards a public resource computing framework for distributed data mining. Concurrency and Computation: Practice and Experience, 22(5):658–682, 2010.

[8]   Claudio Lucchese, Salvatore Orlando, and Raffaele Perego. Fast and memory efficient mining of frequent closed itemsets. IEEE Transactions On Knowledge and Data Engineering, 18(1):21–36, 2006.

[9]   Claudio Lucchese, Salvatore Orlando, and Raffaele Perego. Mining frequent closed itemsets out of core. In SDM ’06: Proceedings of the third SIAM International Conference on Data Mining, April 2006.

[10]   Claudio Lucchese, Salvatore Orlando, and Raffaele Perego. Parallel mining of frequent closed patterns: Harnessing modern computer architectures. In ICDM ’07: Proceedings of the Seventh IEEE International Conference on Data Mining, pages 242–251, November 2007.

[11]   Claudio Lucchese, Salvatore Orlando, and Raffaele Perego. A unifying framework for mining approximate top-k binary patterns. IEEE Transactions On Knowledge and Data Engineering, 26(12):2900–2913, 2014.

Entity Linking for Document Understanding.

Entity Linking is the task of identifying named entities (places, people, concepts, etc.) mentioned in a given piece of text. We developed a framework for the annotation and evaluation of Entity Linking Algorithms [32] and a novel method for the estimation of entity correlation [1]. Indeed, as entity correlation is the building block of most approaches, we showed how to improve the accuracy of several state-of-the-art algorithms.

Finally, we developed a new Entity Linking method[4], named SEL, aiming at estimating the relevance of the mentioned entities and to improve the accuracy in the detection of the most relevant ones, with a 2x F-measure improvement w.r.t. to the state of the art. This work was awarded the Best Student Paper at DocEnc 2016.

References

[1]   Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. Learning relatedness measures for entity linking. In CIKM ’13: Proceedings of the 22st ACM international conference on Information and knowledge management, 2013.

[2]   Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. Dexter 2.0 – an open source tool for semantically enriching data. In ISWC ’14: Proceedings of the 13th Intl. Semantic Web Conference, pages 417–420, 2014.

[3]   Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. Manual annotation of semi-structured documents for entity-linking. In CIKM ’14: Proceedings of the 23rd ACM Intl. Conference on Information and Knowledge Management, pages 2075–2077, 2014.

[4]   Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. SEL: a unified algorithm for entity linking and saliency detection. In DocEng ’16: Proceedings of the 2015 ACM Symposium on Document Engineering, 2016.

Large-scale Data Mining.

We discuss two major results. In [2] we developed a new algorithm, named Cracker, for the detection of connected components in very large graphs. The proposed algorithm, implemented over Spark, is able to process billions nodes and hundreds billions arcs, with a speed-up factor in the rage 1.6x–15x over the state of the art.

In [1] we tackled the self-join similarity in large document collections. The proposed MapReduce algorithm exhibits a speed-up w.r.t. the state of the art in the range 2x–5x.

In both works, we adopted a common approach based on the pruning non informative data in order to reduce the computational cost of the algorithms.

References

[1]   Ranieri Baraglia, Claudio Lucchese, and Gianmarco De Francisci Morales. Document similarity self-join with mapreduce. In ICDM ’10: Proceedings of the tenth IEEE International Conference on Data Mining, December 2010.

[2]   Alessandro Lulli, Emanuele Carlini, Patrizio Dazzi, Claudio Lucchese, and Laura Ricci. Fast connected components computation in large graphs by vertex pruning. IEEE Transactions on Parallel and Distributed Systems, 28(3):760–773, 2017.

Web Mining.

One relevant achievement, awarded ACM Notable Article, focuses on Web search engines’ query log mining. We developed a novel method that allows to cluster users’ queries into tasks and missions (e.g., planning a holiday trip) thus allowing a better user intent understanding [21].

Finally, we developed a novel personalized news recommendation algorithm that exploits social network data by analyzing the tweets of a user and his social circles to recommend the most items published by news agencies being most relevant for the given user [3].

References

[1]   Claudio Lucchese, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, and Gabriele Tolomei. Identifying task-based sessions in search engine query logs. In WSDM ’11: ACM International Conference on Web Search and Data Mining, February 2011. (best 6).

[2]   Claudio Lucchese, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, and Gabriele Tolomei. Discovering tasks from search engine query logs. ACM Trans. Inf. Syst., 31(3):14, 2013. (ACM Notable Article).

[3]   Gianmarco De Francisci Morales, Aris Gionis, and Claudio Lucchese. From chatter to headlines: Harnessing the real-time web for personalized news recommendation. In WSDM ’12: ACM International Conference on Web Search and Data Mining, 2012.

Share on