Why CNN for image
Filter: 3×3
stride (step size)
Feature Map
N filters ==> N output images (feature maps)
Max Pooling
Deep dream: exaggerate what the CNN already detects in the image
Deep style: render one image's content in the style of another
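The notes above (3×3 filter, stride, feature map, max pooling) can be sketched end to end; a minimal NumPy illustration, where the 6×6 image and the all-ones filter are made-up examples:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide a filter over the image; each position gives one value of the feature map."""
    h, w = image.shape
    k = kernel.shape[0]
    out = (h - k) // stride + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            fmap[i, j] = np.sum(patch * kernel)
    return fmap

def max_pool(fmap, size=2):
    """Keep only the max of each size x size block (downsampling)."""
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.arange(36.0).reshape(6, 6)   # made-up 6x6 "image"
kernel = np.ones((3, 3))                # one 3x3 filter -> one feature map
fmap = conv2d(image, kernel, stride=1)  # shape (4, 4)
pooled = max_pool(fmap)                 # shape (2, 2)
```

One filter produces one feature map; with N filters you would stack N such maps.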
Fat + Short vs. Thin + Tall
Deep ==> Modularization
Why Deep? Works when training data is insufficient (modularization needs less data)
GMM
Universality Theorem
Analogy
End-to-end Learning
ReLU:
Maxout:
ReLU is a special case of Maxout.
Learnable activation function
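The claim that ReLU is a special case of Maxout can be checked directly; a small sketch (the sample inputs are arbitrary): with the two linear pieces x and 0, the learnable max reduces to max(x, 0).

```python
import numpy as np

def maxout(x, weights, biases):
    """Maxout activation: the max over several learned linear pieces of x."""
    return np.max([w * x + b for w, b in zip(weights, biases)], axis=0)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
# With the two pieces {1*x + 0, 0*x + 0}, maxout is exactly ReLU.
out = maxout(x, [1.0, 0.0], [0.0, 0.0])
```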
RMSProp:
Momentum:
RMSProp + Momentum ==> Adam
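A minimal sketch of the combination RMSProp + Momentum ==> Adam, using the standard default hyperparameters; the quadratic test function f(w) = w^2 is a made-up example:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: Momentum keeps a moving average of gradients (m),
    RMSProp keeps a moving average of squared gradients (v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w), starting from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
```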
Regularization:
Dropout
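A sketch of dropout, assuming the common "inverted" formulation where surviving activations are rescaled by 1/(1-p) during training so the expected activation is unchanged; at test time it is the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5, training=True):
    """Zero each unit with probability p; scale survivors by 1/(1-p).
    At test (training=False) nothing is dropped."""
    if not training:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

a = np.ones(10000)
out = dropout(a, p=0.5)   # roughly half zeros, survivors scaled to 2.0
```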
Backpropagation
to compute gradients efficiently
Chain Rule:
dz/dx = dz/dy × dy/dx
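The chain rule above can be checked numerically; a small sketch with the made-up composition z = y^2, y = 3x:

```python
# Chain rule dz/dx = dz/dy * dy/dx, for z = y**2, y = 3*x.
def y(x):
    return 3 * x

def z(x):
    return y(x) ** 2

x0 = 2.0
analytic = 2 * y(x0) * 3                      # dz/dy = 2y, dy/dx = 3
h = 1e-6
numeric = (z(x0 + h) - z(x0 - h)) / (2 * h)   # central difference
```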
Fully Connected Feedforward Network
Output Layer = Multi-class Classifier
Example
Step 1: Function Set
Step 2: Goodness of a Function
Cross Entropy
Step 3: Find the best Function (Gradient Descent)
use cross entropy, not square error
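Why cross entropy rather than square error: far from the target, cross entropy stays large and steep, so gradient descent gets a useful signal. A minimal sketch (the example distributions are made up):

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Cross entropy between a one-hot target y and a predicted distribution y_hat."""
    return -np.sum(y * np.log(y_hat))

y = np.array([1.0, 0.0, 0.0])        # true class: 0
good = np.array([0.9, 0.05, 0.05])   # confident and correct -> small loss
bad = np.array([0.1, 0.45, 0.45])    # mostly wrong -> large loss
loss_good = cross_entropy(good, y)
loss_bad = cross_entropy(bad, y)
```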
Discriminative models sometimes outperform generative ones (probabilistic model: Naive Bayes)
Multi-class Classification
Softmax ==> 0 < y < 1, outputs sum to 1
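A minimal softmax sketch; subtracting the max before exponentiating is a standard numerical-stability trick, and the input logits are arbitrary:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize: outputs lie in (0, 1) and sum to 1,
    so they can be read as class probabilities."""
    e = np.exp(z - np.max(z))   # shift by max for numerical stability
    return e / e.sum()

p = softmax(np.array([3.0, 1.0, -2.0]))
```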
Limitation of Logistic Regression
Classification as Regression
Generative Model:
P(x) = P(x|C1)P(C1) + P(x|C2)P(C2)
Gaussian Distribution
Find Maximum Likelihood (mean*, covariance*)
All dimensions are independent ==> Naive Bayes Classifier
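A sketch of maximum-likelihood estimation for a Gaussian on synthetic data; the true mean (2, 2) and identity covariance are made-up parameters. Note the MLE covariance divides by N, not N-1:

```python
import numpy as np

# MLE for a Gaussian: the sample mean and the (biased, 1/N) sample
# covariance maximize the likelihood of the observed data.
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=(5000, 2))   # synthetic data

mean_star = x.mean(axis=0)
centered = x - mean_star
cov_star = centered.T @ centered / len(x)   # MLE uses 1/N, not 1/(N-1)
```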
σ(z) = 1/(1 + exp(-z))
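The sigmoid formula as code, a trivial sketch:

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

p = sigmoid(0.0)   # midpoint: sigma(0) = 0.5
```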
On-line vs Off-line:
Momentum
Adagrad
RMSProp
Adam
Real Application
Adagrad
root mean square
g (gradient): partial derivative
best step: |First derivative| / Second derivative
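The "best step" |first derivative| / second derivative is one Newton step; on a quadratic it lands exactly on the minimum. A sketch with made-up coefficients a = 3, c = 5:

```python
# For f(x) = a*(x - c)**2, a step of size |f'(x)| / f''(x) taken
# downhill jumps straight to the minimum at x = c (Newton's method).
a, c = 3.0, 5.0
x = 0.0
first = 2 * a * (x - c)      # f'(x)
second = 2 * a               # f''(x)
x_new = x - first / second   # one Newton step
```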
Stochastic Gradient Descent
Feature Scaling
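A feature-scaling (standardization) sketch; the two columns with very different ranges are a made-up example. Each feature ends up with mean 0 and std 1, so gradient descent sees a rounder loss surface:

```python
import numpy as np

x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])   # made-up features with very different scales

# Standardize each column: subtract its mean, divide by its std.
scaled = (x - x.mean(axis=0)) / x.std(axis=0)
```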
Taylor Series
Sources of error: bias and variance
mean: μ
variance: σ^2
s^2 is an estimator of σ^2
E[f*] = f̄: the expected value of f*
Simpler models have smaller variance: they are less affected by fluctuations in the data
More complex models have smaller bias
Regularization ==> makes the curve smoother
Cross Validation
x_i: features
input: x^n
output: ŷ^n
function: f_n
Loss function L (a function of functions):
Step3: Best Function
f* = arg min L(f)
w*, b* = arg min L(w, b)
Gradient Descent:
convex
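A minimal gradient-descent sketch for w*, b* = arg min L(w, b) on synthetic data generated with w = 2, b = 1 (made-up values); since the loss is convex, gradient descent reaches the global minimum:

```python
import numpy as np

# L(w, b) = sum_n (y_n - (b + w * x_n))**2, minimized by gradient descent.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                        # synthetic data: true w = 2, b = 1

w, b, lr = 0.0, 0.0, 0.01
for _ in range(20000):
    err = y - (b + w * x)
    w -= lr * (-2.0 * (err * x).sum())   # dL/dw
    b -= lr * (-2.0 * err.sum())         # dL/db
```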
Introduce a more complex function:
x_cp^2
Overfitting
Back to Step 1: Redesign
Back to Step 2: Regularization (adjustment)
b is not considered (the bias term is not regularized)
select λ
Geometric probability: proportional to the length, area, or volume occupied by the event;
Characteristics of geometric probability: infinitely many elementary events (abstract), all equally likely;
Characteristics of classical probability: finitely many elementary events (concrete), all equally likely;
Maximum entropy model:
Definition of step size
good: key content
rank (秩): groundwork
Summary of key points
Supremum: M = sup E
Infimum: m = inf E