Lecturer: Batya Kenig
Host: Gil Einziger
Title: Soft Constraints for Data Management
Abstract:
Integrity constraints such as functional dependencies (FDs) and multivalued
dependencies (MVDs) are fundamental to database schema design, query
optimization, and the enforcement of data integrity. Current data-intensive
applications, such as ML algorithms, process observational data that is often
unnormalized, inconsistent, erroneous, and noisy. In these applications the
constraints often need to be inferred from the data and are not required to
hold exactly; it suffices that they hold to a certain degree.
In this work, we use information theory to quantify the degree to which a
constraint is satisfied. This gives rise to two major challenges that I will
cover in this talk: the implication problem for soft constraints, and the
discovery of soft constraints in data. The implication problem
for soft constraints asks whether a set of constraints (antecedents) that hold
in the data to a large degree imply a high degree of satisfaction of another
constraint (consequent). The implication problem has been investigated in both
the database and AI literature, but only under the assumption that all
constraints hold exactly; our work extends this to the case of soft
constraints.
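
To make "degree of satisfaction" concrete, here is a minimal sketch, not the authors' implementation. It assumes the standard information-theoretic view of an FD X -> Y: the conditional entropy H(Y | X) of the empirical distribution over the tuples is zero exactly when the FD holds, and grows as the FD is violated more. The function names, the toy relations, and the zip/city attributes are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(rows, attrs):
    """Empirical Shannon entropy (in bits) of the projection of rows onto attrs."""
    n = len(rows)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def fd_violation(rows, lhs, rhs):
    """Conditional entropy H(rhs | lhs) = H(lhs, rhs) - H(lhs).

    The value is 0 when the FD lhs -> rhs holds exactly in rows,
    and positive when some lhs value maps to more than one rhs value.
    """
    return entropy(rows, list(lhs) + list(rhs)) - entropy(rows, list(lhs))


# Hypothetical toy relations: zip -> city holds in `clean` but not in `noisy`.
clean = [
    {"zip": "98101", "city": "Seattle"},
    {"zip": "98101", "city": "Seattle"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
]
noisy = [
    {"zip": "98101", "city": "Seattle"},
    {"zip": "98101", "city": "Seattle"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},  # inconsistent spelling breaks the exact FD
]

print(fd_violation(clean, ["zip"], ["city"]))  # exact FD: 0 bits
print(fd_violation(noisy, ["zip"], ["city"]))  # soft FD: 0.5 bits
```

Under this view, a soft constraint is one whose measure stays below a chosen threshold instead of being required to equal zero.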
Next, we address the problem of mining soft constraints from data, and present
an algorithm for discovering complete schemas from data. The algorithm employs
pruning techniques that exploit the properties of the information-theoretic
measures associated with the constraints, which allows it to scale to datasets
with up to 1M tuples and up to 30 attributes.
Based on joint work with Dan Suciu, Pranay Mundra, Guna Prasad, and Babak
Salimi (to be presented at ICDT 2020 and SIGMOD 2020).
The talk is based on two papers:
https://arxiv.org/abs/1812.09987
https://arxiv.org/abs/1911.12933
Homepage: https://sites.google.com/view/batyakenig/