Likelihood
Sequence Generation
Conditional Sequence Generation
Maximizing Expected Reward
Policy Gradient
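For reference, the gradient of expected reward in the usual policy-gradient form (standard notation, sketched here rather than quoted from the lecture):

    ∇R̄_θ = E_{τ~p_θ}[ R(τ) ∇log p_θ(τ) ]
         ≈ (1/N) Σ_n R(τ^n) Σ_t ∇log p_θ(a_t^n | s_t^n)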
Conditional GAN
Abstractive Summarization
GAN + Autoencoder
Feature Extraction:
InfoGAN
VAE-GAN
BiGAN
Triple GAN
Feature Disentanglement (disentangle: v. to untangle, separate)
J-S divergence problem
Wasserstein GAN:
Earth Mover's Distance
Lipschitz Function
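Compact statement of the Wasserstein GAN objective (standard form, for reference): restricting the critic D to 1-Lipschitz functions makes the maximum an estimate of the Earth Mover's Distance:

    W(P_data, P_G) = max_{D 1-Lipschitz} { E_{x~P_data}[D(x)] − E_{x~P_G}[D(x)] }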
intractable adj. hard to handle <==> difficult
f-divergence
exponential adj. involving an exponent
Theory behind GAN:
Divergence
KL Divergence
sample v. to draw samples
J-S divergence
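Definitions for reference (discrete form):

    KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) )
    JS(P || Q) = (1/2) KL(P || M) + (1/2) KL(Q || M),  where M = (P + Q) / 2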
Unsupervised Conditional Generation
CycleGAN:
Cycle consistency
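A minimal sketch of the cycle-consistency term, assuming two generators G_XY (X→Y) and G_YX (Y→X) are given as callables; the names and the L1 distance are illustrative choices:

    import numpy as np

    def cycle_consistency_loss(x, y, G_XY, G_YX):
        # Translate and translate back: a consistent pair of generators
        # should reconstruct the original input in both directions.
        x_rec = G_YX(G_XY(x))
        y_rec = G_XY(G_YX(y))
        return np.mean(np.abs(x_rec - x)) + np.mean(np.abs(y_rec - y))

    # e.g. with identity "generators" the loss is exactly 0:
    # cycle_consistency_loss(np.ones(3), np.zeros(3), lambda v: v, lambda v: v)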
GAN: Generative Adversarial Network
"since sliced bread" (from the idiom "the best thing since sliced bread")
Discriminator
Step 1: Fix G, update D
Step 2: Fix D, update G
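The value function the two steps alternate over (standard GAN objective, for reference); D is updated to increase V, G to decrease it:

    min_G max_D V(G, D) = E_{x~P_data}[log D(x)] + E_{z~P_z}[log(1 − D(G(z)))]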
Can Generator learn by itself?
Auto-encoder
Decoder = Generator
Can Discriminator generate?
Why CNN for image
Filter: 3×3
stride: step size
Feature Map
The number of filters = the number of output feature maps (images)
Max Pooling
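A minimal numpy sketch of one 3×3 filter slid over a single-channel image with a stride, followed by 2×2 max pooling (the input size and the all-ones kernel are just illustrative):

    import numpy as np

    def conv2d(image, kernel, stride=1):
        kh, kw = kernel.shape
        oh = (image.shape[0] - kh) // stride + 1
        ow = (image.shape[1] - kw) // stride + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[i, j] = np.sum(patch * kernel)   # one entry of the feature map
        return out

    def max_pool(fmap, size=2):
        h, w = fmap.shape[0] // size, fmap.shape[1] // size
        out = np.zeros((h, w))
        for i in range(h):
            for j in range(w):
                out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
        return out

    image = np.random.rand(6, 6)
    fmap = conv2d(image, np.ones((3, 3)), stride=1)   # 4×4 feature map
    pooled = max_pool(fmap)                           # 2×2 after max pooling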
Deep dream: Exaggerate
Deep style:
Fat + Short vs. Thin + Tall
Deep ==> Modularization
Why Deep? Works better when training data is not enough
GMM
Universality Theorem
Analogy
End-to-end Learning
ReLU:
Maxout:
ReLU is a special case of Maxout.
Learnable activation function
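Sketch of a single maxout unit: the output is the max over k learned linear pieces, so the activation shape itself is learned; if one piece is fixed to the zero function it reduces to ReLU (weights below are illustrative):

    import numpy as np

    def maxout(x, W, b):
        # W: (k, d), b: (k,)  -- max over k linear functions of x
        return np.max(W @ x + b)

    x = np.array([1.0, -2.0])
    W = np.array([[0.5, 1.0],    # first piece: a learned linear function
                  [0.0, 0.0]])   # second piece fixed to 0  ==> behaves like ReLU
    b = np.array([0.0, 0.0])
    maxout(x, W, b)              # = max(0.5*1 + 1.0*(-2), 0) = 0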
RMSProp:
Momentum:
RMSProp + Momentum ==> Adam
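Adam update combining the two ideas (standard form, for reference):

    m_t = β1·m_{t-1} + (1-β1)·g_t               (Momentum: moving average of gradients)
    v_t = β2·v_{t-1} + (1-β2)·g_t²              (RMSProp: moving average of squared gradients)
    m̂_t = m_t / (1-β1^t),   v̂_t = v_t / (1-β2^t)    (bias correction)
    θ_t = θ_{t-1} − η·m̂_t / (sqrt(v̂_t) + ε)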
Regularization:
Dropout
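A sketch of (inverted) dropout at training time: each activation is zeroed with probability p and the survivors are rescaled so the expected value is unchanged, so nothing needs to change at test time. (The lecture's variant instead multiplies the weights by 1−p at test time, which is equivalent.)

    import numpy as np

    def dropout(a, p=0.5, training=True):
        if not training:
            return a                                       # test time: keep all activations
        mask = (np.random.rand(*a.shape) > p).astype(a.dtype)
        return a * mask / (1.0 - p)                        # rescale so the expectation matches a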
Backpropagation
to compute gradients efficiently
Chain Rule:
dz/dx = dz/dy × dy/dx
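Applied to a weight w that feeds a neuron's pre-activation z (so z = … + w·x + …):

    ∂C/∂w = (∂z/∂w) × (∂C/∂z) = x × (∂C/∂z)

∂z/∂w = x is already known from the forward pass; ∂C/∂z is propagated back from the output layer (backward pass), which is what makes the computation efficient.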
Fully Connected Feedforward Network
Output Layer = Multi-class Classifier
Example
Step 1: Function Set
Step 2: Goodness of a Function
Cross Entropy
Step 3: Find the best function (Gradient Descent)
No square error (cross entropy is used instead; see the loss below)
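The cross-entropy loss used in Step 2 (two-class case, with label ŷ^n ∈ {0, 1} and f(x^n) the predicted probability of class 1):

    L(f) = Σ_n −[ ŷ^n · ln f(x^n) + (1 − ŷ^n) · ln(1 − f(x^n)) ]

Its gradient with respect to w is −Σ_n (ŷ^n − f(x^n))·x^n, which stays large when the prediction is far from the target; square error's gradient can vanish there, which is why it is not used.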
Discriminative models sometimes outperform generative ones (probabilistic model: Naive Bayes)
Multi-class Classification
Softmax ==> 0<y<1
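Minimal numerically stable softmax sketch: outputs lie in (0, 1) and sum to 1:

    import numpy as np

    def softmax(z):
        z = z - np.max(z)        # shift for numerical stability
        e = np.exp(z)
        return e / np.sum(e)

    softmax(np.array([3.0, 1.0, -3.0]))   # ≈ [0.88, 0.12, 0.00]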
Limitation of Logistic Regression
Classification as Regression
Generative Model:
P(x) = P(x|C1)P(C1) + P(x|C2)P(C2)
Gaussian Distribution
Find Maximum Likelihood (mean*, covariance*)
All dimensions are independent ==> Naive Bayes Classifier
σ(z) = 1 / (1 + exp(-z))
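How the sigmoid shows up in the two-class generative model: the posterior can be rewritten as σ(z):

    P(C1|x) = P(x|C1)P(C1) / ( P(x|C1)P(C1) + P(x|C2)P(C2) )
            = 1 / (1 + exp(−z)) = σ(z),
    where z = ln[ P(x|C1)P(C1) / ( P(x|C2)P(C2) ) ]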
On-line vs Off-line:
Momentum
Adagrad
RMSProp
Adam
Real Application
Adagrad
root mean square
g (gradient): the partial derivative
best step: |First derivative| / Second derivative
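Adagrad update for a single parameter w (g^t is the gradient at step t); the root mean square of past gradients in the denominator plays the role of the second derivative in |first derivative| / second derivative:

    w^{t+1} = w^t − ( η / sqrt( Σ_{i=0}^{t} (g^i)² ) ) · g^t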
Stochastic Gradient Descent
Feature Scaling
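Sketch: standardize each input dimension to zero mean and unit variance so that gradient descent behaves similarly along every direction (the function name is illustrative):

    import numpy as np

    def standardize(X):
        # X: (num_examples, num_features); scale each feature independently
        return (X - X.mean(axis=0)) / X.std(axis=0)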
Taylor Series
Sources of error: bias and variance
mean: μ
variance: σ^2
s^2 is an estimate of σ^2
E[f*] = f̄: the expected value of f*
Simpler models have smaller variance: they are less affected by fluctuations in the sampled data
More complex models have smaller bias
Regularization ==> makes the curve smoother
Cross Validation
x_i: features
input: x^n
output: ŷ^n
function: f_n
Loss function L (a function of functions):
Step 3: Best Function
f* = arg min L(f)
w*, b* = arg min L(w, b)
Gradient Descent:
convex adj. curving outward
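A minimal runnable sketch of gradient descent for y = b + w·x with square-error loss (the data points and learning rate are made up for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y_hat = np.array([2.1, 3.9, 6.2, 8.1])      # target values
    w, b, lr = 0.0, 0.0, 0.01

    for _ in range(1000):
        y = b + w * x
        grad_w = np.sum(2 * (y - y_hat) * x)    # ∂L/∂w
        grad_b = np.sum(2 * (y - y_hat))        # ∂L/∂b
        w -= lr * grad_w
        b -= lr * grad_b
    # w, b now approximate the least-squares line through the points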
Introduce a more complex function:
x_cp^2
Overfitting
Back to Step 1: Redesign
Back to Step 2: Regularization (adjustment)
The bias b is not regularized
select λ
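The regularized loss (the bias b is left out of the penalty because it only shifts the function and has no effect on smoothness):

    L = Σ_n ( ŷ^n − (b + Σ_i w_i·x_i^n) )² + λ·Σ_i (w_i)²

A larger λ prefers smaller w_i, i.e. a smoother function; λ is selected on validation data.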