May 12, 2020

Online (Link to Recorded Lecture Below)

Date: 12-May-20 (Tue)
Time: 12:00
Zoom link:
Lecturer: Tomer Sidi
Host: Chen Keasar

*** For the recorded lecture, please click here ***


The Protein Data Bank (PDB), the ultimate source for data in
structural biology, is inherently imbalanced. To alleviate biases,
virtually all structural biology studies use non-redundant subsets of
the PDB, which include only a fraction of the available data. An
alternative approach, dubbed redundancy-weighting, down-weights
redundant entries rather than discarding them. This approach may be
particularly helpful for Machine Learning (ML) methods that use the
PDB as their source for data.
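
As a rough illustration of the idea, the sketch below down-weights
redundant entries instead of discarding them. It assumes the entries
have already been grouped into clusters of similar sequences and that
each entry is weighted by one over its cluster size. This is only one
simple choice; the weighting scheme presented in the talk may differ.
The function names and the toy data are hypothetical.

```python
from collections import Counter


def redundancy_weights(cluster_ids):
    """Assumed scheme: weight each entry by 1 / (size of its cluster)."""
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]


def weighted_accuracy(correct, weights):
    """Fraction of the total weight that falls on correct predictions."""
    total = sum(weights)
    return sum(w for ok, w in zip(correct, weights) if ok) / total


# Toy data: six entries in three clusters (hypothetical labels).
cluster_ids = ["A", "A", "A", "B", "C", "C"]
weights = redundancy_weights(cluster_ids)

# Hypothetical per-entry outcomes of some predictor.
correct = [True, True, True, False, True, False]

print(weights)
print(weighted_accuracy(correct, weights))
```

With these weights every cluster contributes a total weight of one, so
the over-represented cluster "A" no longer dominates the score, yet all
six entries remain available for training and evaluation.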

Current state-of-the-art methods for protein Secondary Structure
Prediction (SSP) use non-redundant datasets to predict either 3-letter
or 8-letter secondary structure annotations. The current study
challenges both choices: the dataset and the alphabet size. On the one
hand, non-redundant datasets are presumably unbiased, but are also
inherently small, which limits machine learning performance. On
the other hand, the utility of both 3- and 8-letter alphabets is
limited by the aggregation of parallel, anti-parallel, and mixed
beta-sheets in a single class. Each of these subclasses imposes
different structural constraints, which makes the distinction between
them desirable. In this study we show that training on a
redundancy-weighted dataset improves prediction accuracy. Further, we
show that extending the alphabet to distinguish beta subclasses
increases the information content while hardly affecting SSP accuracy.
Finally, we show the utility of 13-class SSP for Estimation of protein
Model Accuracy (EMA).


T. Sidi and C. Keasar, 2020, "Redundancy-Weighting the PDB for
Detailed Secondary Structure Prediction Using Deep-Learning Models",

A. Elofsson et al., 2017, "Methods for estimation of model accuracy in

T. Sidi and C. Keasar, 2019, "Loss-functions matter, on optimizing
score functions for the estimation of protein models accuracy",