An Empirical Evaluation of Set Similarity Join Techniques

Date/Time
Date(s) - 18/03/2016
12:00 pm - 1:00 pm

Categories


Abstract:

Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous works. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm adopting this techniques is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.

Speaker: Willi Mann.

Willi Mann is a PhD student and member of the Doctoral College GIScience at the University of Salzburg.
Willi obtained his Master degree in Computer Science at the Paris Lodron University of Salzburg in 2010. He wrote his thesis in Martin Held’s Computational Geometry Lab. He worked with libraries for “Exact Geometric Computation (EGC)”, which is a solution to address the errors introduced by floating-point arithmetic in real hardware.
Afterwards, he worked in two interdisciplinary FP7 projects at TU Munich as a project assistant. The goal of these projects was to integrate various data sources in the life science domain.
In 2013, he joined the Database Research Group at the University of Salzburg, which is headed by Nikolaus Augsten. His work focuses on similarity joins where he is developing techniques to address scability and performance bottlenecks.
Willi is particularly interested in applications of multi-dimensional similarity joins where the domain of the attributes are strings, trees, space, and time. He is also a passionate programmer and likes to tune existing code to make it faster and scalable.