Effective and Efficient Similarity Search in Databases

Lange, Dustin

Please use this identifier to cite or link to this item: https://biblioteca.unisced.edu.mz/handle/123456789/2672

Title:	Effective and Efficient Similarity Search in Databases
Authors:	Lange, Dustin
Keywords:	Database; Database System; indexing; algorithmics
Issue Date:	8-Nov-2013
Publisher:	Universität Potsdam
Citation:	117 pages
Abstract:	With ever-growing amounts of data and the ability and desire to integrate and query more and more databases, there is a need for efficient processing of this data. Traditional relational database systems are built for fast retrieval of data from a large corpus. With SQL and efficient index structures, such as the B+-tree, retrieval of records with exact matches in their attribute values from even very large databases can be implemented with little effort. However, a query may also be inaccurate, as it may contain typing errors or missing values, and a database record may likewise contain incorrect or incomplete information. In this case, an index that only finds exact matches cannot be used. A traditional database system neither offers the possibility to define what a similar record is, nor does it perform a fast retrieval of those records. The field of research that addresses this problem is called similarity search: Given a set of records in a database and a query record, similarity search aims to find all records in the database that are sufficiently similar to the query record. This thesis is structured as follows. We begin with an overview of our similarity search system in Chapter 2 before describing the components of the system in detail in the following chapters. Chapter 3 introduces the similarity model used throughout the thesis. We also propose a novel similarity measure for comparing database records that exploits frequencies of values. Chapter 4 contains an introduction to similarity indexes for fast retrieval of similar values given specific similarity measures. We present an index structure for string similarity search, the State Set Index (SSI), and compare the method with previous index structures. For subsequent chapters, we assume that we have created one similarity index for each attribute, and that we have an overall similarity measure composed of attribute-specific measures.
In Chapter 5, we then introduce query plans as a means of describing how to access the similarity indexes and how to combine the results. We describe static and query-specific algorithms for selecting query plans based on two criteria: result completeness and execution cost. Chapter 6 adds the BSA method for answering top-k queries with similarity indexes by retrieving bulks of IDs of relevant records and combining results into a priority queue. For Chapters 3 to 6, related work is described at the end of each chapter. We conclude the thesis and give an overview of open research questions for future work in Chapter 7.
URI:	http://biblioteca.unisced.edu.mz/handle/123456789/2672
Appears in Collections:	Bancos de dados
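
The problem definition in the abstract (find all records that are sufficiently similar to a query record) can be illustrated with a minimal sketch. This is not the thesis's State Set Index or BSA method. It is a naive linear scan over single string values, assuming a similarity measure derived from Levenshtein edit distance and a fixed threshold. The function names, the sample values, and the threshold of 0.8 are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance into [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def similarity_search(values: List[str], query: str,
                      threshold: float) -> List[Tuple[str, float]]:
    """Linear scan (no index): keep values whose score reaches the threshold."""
    scored = [(v, similarity(v, query)) for v in values]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical attribute values and a query containing a typing error.
names = ["Dustin Lange", "Dustin Lang", "Justin Lane", "Felix Naumann"]
print(similarity_search(names, "Dustin Lnage", threshold=0.8))
```

A scan like this compares the query against every value, so its cost grows linearly with the size of the database. That cost is the motivation for the similarity indexes that the abstract describes.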

Files in This Item:

File	Description	Size	Format
Effective-and-Efficient-Similarity-Search-in-Databases.pdf		8.76 MB	Adobe PDF
