Goley's Blog

Science is the canvas; writing is the brush

    Borzoi – predicting RNA-seq coverage via Transformer-variant Model

    Paper URL: https://www.nature.com/articles/s41588-024-02053-6

    Data preprocessing

    Transformation: Squashed scale

    Owing to the relatively large dynamic range of RNA-seq, we normalized each coverage track by raising its bin values to the power 3/4. If a bin value still exceeded 384 after this transformation, we applied an additional square-root transform to the residual above 384. These operations effectively limit the contribution that very highly expressed genes can make to the model training loss. The formula below summarizes the transform applied to the jth bin for tissue t of target tensor y:

    $$ \boldsymbol{y}_{j,t}^{(\text{squashed})} = \begin{cases} \boldsymbol{y}_{j,t}^{3/4} & \text{if } \boldsymbol{y}_{j,t}^{3/4} \leq 384 \\ 384 + \sqrt{\boldsymbol{y}_{j,t}^{3/4} - 384} & \text{otherwise} \end{cases} $$

    We refer to this set of transformations as ‘squashed scale’ in the main text.
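
    A minimal numpy sketch of this transform (the function name and `clip` argument are mine; the paper's threshold is 384). For intuition, a bin value of 10,000 first maps to 10,000^(3/4) = 1,000, which then squashes to 384 + √616 ≈ 409:

    ```python
    import numpy as np

    def squashed_scale(coverage, clip=384.0):
        """Apply the 'squashed scale' transform: raise to the power 3/4,
        then square-root the residual above `clip`."""
        y = np.asarray(coverage, dtype=np.float64) ** 0.75
        over = y > clip
        y[over] = clip + np.sqrt(y[over] - clip)  # soft-clip very large bins
        return y

    print(squashed_scale([10.0, 10_000.0]))  # -> [5.62..., 408.81...]
    ```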

    Input sequence attribution methods

    Gradient Saliency

    pros:

    • Computationally efficient: requires only a single forward and backward pass.
    • Provides quick insights into which input positions drive a prediction.

    cons:

    • Noisy output: tends to produce noisier results because gradients are evaluated off the one-hot encoding simplex, which can lead to less reliable interpretations.
    • Limited to local changes: may not capture the global impact of feature changes effectively.

    Implementation:

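    A minimal PyTorch sketch of the idea, assuming a hypothetical `model` that maps a one-hot sequence of shape (L, 4) to coverage predictions of shape (bins, tracks); `gradient_saliency` and `track_idx` are illustrative names, not Borzoi's actual API:

    ```python
    import torch

    def gradient_saliency(model, one_hot_seq, track_idx):
        """One backward pass: gradient of a scalar track summary w.r.t. the input."""
        x = one_hot_seq.clone().detach().requires_grad_(True)
        pred = model(x.unsqueeze(0))             # assumed shape: (1, bins, tracks)
        pred[0, :, track_idx].sum().backward()   # scalar summary of one track
        # Input x gradient: keep only the gradient at the observed base
        return (x.grad * x).sum(dim=-1)          # (L,) per-position saliency
    ```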

    In-silico Mutagenesis

    pros:

    • Better-Calibrated Attributions: Often provides more accurate and reliable feature importance by directly assessing the impact of feature changes on model outputs.
    • Multi-Output Capability: Can evaluate multiple outputs simultaneously, making it suitable for complex tasks.

    cons: Computationally expensive; each candidate mutation requires its own forward pass, so processing time is much longer.

    Implementation:

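    In the same assumed setting as the saliency sketch above, a minimal ISM sketch. Every one of the L × 3 possible substitutions needs its own forward pass, which is where the computational cost comes from; batching the mutants would be the practical fix:

    ```python
    import torch

    @torch.no_grad()
    def in_silico_mutagenesis(model, one_hot_seq, track_idx):
        """Score every single-base substitution by its effect on one track."""
        L, A = one_hot_seq.shape                  # sequence length x alphabet (4)
        ref = model(one_hot_seq.unsqueeze(0))[0, :, track_idx].sum()
        scores = torch.zeros(L, A)
        for pos in range(L):
            for base in range(A):
                mutant = one_hot_seq.clone()
                mutant[pos] = 0.0                 # clear the reference base
                mutant[pos, base] = 1.0           # install the alternative base
                alt = model(mutant.unsqueeze(0))[0, :, track_idx].sum()
                scores[pos, base] = alt - ref     # prediction delta vs. reference
        return scores
    ```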

    Overview of deep learning interpretability techniques (generated by AI)

    Gradient-based methods

    • GradCAM/GradCAM++: combine feature maps with gradient information to generate class-activation heatmaps; particularly suited to CNNs.
    • DeepLIFT: compares each neuron's activation to a reference activation and assigns contribution scores based on the difference.
    • Input×Gradient: multiplies the input by its gradient to highlight important features.

    Perturbation and intervention methods

    • LIME (Local Interpretable Model-agnostic Explanations): perturbs the input and fits a local interpretable model to approximate the complex model.
    • SHAP (SHapley Additive exPlanations): a game-theoretic method that computes each feature's contribution to the prediction.
    • Occlusion/Perturbation: occludes or modifies different parts of the input and observes how the model output changes (see the sketch below).
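
    As a concrete illustration of the occlusion idea, a toy sketch assuming a hypothetical `model` that scores a 1-D input; the window size and the zero baseline are arbitrary choices:

    ```python
    import torch

    @torch.no_grad()
    def occlusion_importance(model, x, window=8):
        """Mask one window at a time; a large score drop marks an important region."""
        base = model(x.unsqueeze(0)).item()
        importance = torch.zeros_like(x)
        for start in range(0, x.numel(), window):
            occluded = x.clone()
            occluded[start:start + window] = 0.0
            importance[start:start + window] = base - model(occluded.unsqueeze(0)).item()
        return importance
    ```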

    Activation-based methods

    • CAM (Class Activation Mapping): uses the feature maps before the global-average-pooling layer to identify discriminative regions.
    • Feature Visualization: optimizes an input to maximize the activation of a specific neuron, visualizing what that neuron prefers.
    • Activation Maximization: generates synthetic inputs that maximize the score of a specific class.

    Attention-based methods

    • Attention Maps: in models with attention mechanisms, visualize the attention weights to indicate which regions the model attends to.
    • Transformer Explainability: visualizes the self-attention matrices of Transformer models to explain relationships between tokens.

    Concept- and semantic-level explanations

    • TCAV (Testing with Concept Activation Vectors): tests how important high-level human concepts are to the model's decisions.
    • Network Dissection: maps individual neurons to interpretable visual concepts.
    • Concept Bottleneck Models: force the model to make predictions through human-understandable concepts.

    Case-based methods

    • Influence Functions: identify the training samples with the greatest influence on a particular prediction.
    • Nearest Neighbors: explain a prediction by retrieving similar samples in activation space.

    Post-hoc explanation methods

    • Surrogate models: train simple interpretable models (decision trees, linear models) to mimic the behavior of the complex model.
    • Rule Extraction: extracts sets of rules from a neural network.

    Intrinsically interpretable models

    • Self-explaining models: models designed with interpretability in mind, such as attention mechanisms and prototype networks.
    • Sparse models: force most weights to zero to improve interpretability.

    Evaluation and comparison

    • Explanation stability: assesses whether similar inputs receive consistent explanations.
    • Explanation faithfulness: measures how well an explanation matches the model's actual behavior.
    • Human evaluation: assesses how helpful explanations are to people through user studies.

    Each method has its strengths and limitations; which interpretability technique is appropriate depends on the application scenario, the model type, and the explanation needs.

    genoRetriever – predicting TSS…

    Two regularization methods might help

    To improve the visual clarity of the learned feature curves, we apply L2 regularization to the weights of the second convolution layer in each consensus network with a scale factor of 2 × 10⁻³. To prevent attention bias in downstream prediction, we apply L1 regularization to all trainable convolution weights: a scale factor of 4 × 10⁻⁵ for the first encoding layer and 5 × 10⁻⁵ for the second convolution layer.
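
    A hedged PyTorch sketch of these penalties, where `first_conv` and `second_conv` are hypothetical handles for the first encoding layer and the second convolution layer; only the scale factors come from the text above:

    ```python
    import torch

    def regularization_loss(first_conv, second_conv):
        """Mixed penalties with the scale factors quoted above."""
        l2 = 2e-3 * (second_conv.weight ** 2).sum()     # L2 on the second conv layer
        l1 = (4e-5 * first_conv.weight.abs().sum()      # L1 on the first encoding layer
              + 5e-5 * second_conv.weight.abs().sum())  # L1 on the second conv layer
        return l1 + l2

    # total_loss = task_loss + regularization_loss(net.first_conv, net.second_conv)
    ```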

    My intuition about L1 and L2 regularization

    L1 Regularization (Lasso):

    • Definition: Adds the absolute value of weights to the loss function.
    • Formula: $$ J(\theta) = \text{Loss} + \lambda \sum_{i=1}^{n} \left| \theta_i \right| $$
    • Characteristics:
      • Promotes sparsity; some weights may become zero, aiding feature selection.
      • Optimization can be more complex due to non-smoothness.

    L2 Regularization (Ridge):

    • Definition: Adds the square of weights to the loss function.
    • Formula: $$ J(\theta) = \text{Loss} + \lambda \sum_{i=1}^{n} \theta_i^2 $$
    • Characteristics:
      • Retains all features but shrinks their weights.
      • Easier optimization due to smoothness.

    Key Differences:

    1. Sparsity: L1 leads to sparse solutions; L2 retains all weights.
    2. Optimization: L1 is more complex; L2 is simpler.
    3. Use Cases: L1 is good for feature selection; L2 is better for multicollinearity.
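
    A one-variable worked example of difference 1: with data loss (w − a)², the L1 minimizer is the soft-threshold sign(a)·max(|a| − λ/2, 0), which is exactly zero whenever |a| ≤ λ/2, whereas the L2 minimizer a/(1 + λ) only shrinks toward zero:

    ```python
    import numpy as np

    lam, a = 1.0, 0.3
    w_l1 = np.sign(a) * max(abs(a) - lam / 2, 0.0)  # closed-form L1 minimizer
    w_l2 = a / (1 + lam)                            # closed-form L2 minimizer
    print(w_l1, w_l2)                               # -> 0.0 0.15
    ```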

    Scores derived from the discrete Fourier transform (DFT) could reflect the combined positional and abundance effects of motif removal on transcription initiation.

    $$ F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i \omega t}\, dt $$

    where f(t) is the time-domain signal (the effect curve), F(ω) is its frequency-domain representation, ω is the angular frequency, and i is the imaginary unit.
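
    A short numpy sketch of this scoring, assuming `effect_curve` holds the per-position effect of motif removal; the DFT magnitude spectrum captures both how strong the effect is and how periodically it is positioned:

    ```python
    import numpy as np

    def dft_scores(effect_curve):
        """Magnitude of each frequency component of the effect curve."""
        F = np.fft.fft(effect_curve)  # frequency-domain representation F(w)
        return np.abs(F)

    # A purely periodic effect curve concentrates its power in one bin:
    t = np.arange(128)
    scores = dft_scores(np.sin(2 * np.pi * 10 * t / 128))
    print(np.argmax(scores[1:64]) + 1)  # -> 10
    ```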