CLIP Sharing

CLIP

image-20250516090029870
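
CLIP matches images and text by comparing their feature vectors with cosine similarity: the caption whose vector points in the same direction as the image vector wins. Below is a rough pure-Python sketch of that matching step only. The function names and the toy 2-D vectors are made up for illustration; they are not taken from the CLIP code, and real features would come from the image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(image_features, text_features):
    """Row i, column j holds the similarity of image i and text j."""
    return [[cosine_similarity(img, txt) for txt in text_features]
            for img in image_features]

def best_text_for_each_image(image_features, text_features):
    """Index of the highest-scoring text for every image."""
    sims = similarity_matrix(image_features, text_features)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]

# Toy 2-D features (hypothetical values, not real encoder outputs).
image_features = [[1.0, 0.0], [0.0, 1.0]]
text_features = [[2.0, 0.0], [0.0, 3.0]]
print(best_text_for_each_image(image_features, text_features))  # prints [0, 1]
```

The length of a vector does not matter here, only its direction, which is why the text vectors of length 2 and 3 still match the unit-length image vectors.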

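Datasets

  • MS-COCO: about 100,000 photos, high quality
  • Visual Genome: about 100,000 photos, high quality
  • YFCC100M: about 100 million photos, reduced to about 15 million after filtering
  • WIT: built by the CLIP authors from image-text pairs crawled from the web, about 400 million pairs

CLIP Colab example code [facebook GitHub]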

Image feature extraction

ResNet

image-20250516090029870

Original image, residual, target image

image-20250516090433665

ResNet block

image-20250516090714851

import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # If the input and output channel counts differ, or the stride is not 1,
        # the input must be projected to the output shape before the addition.
        self.shortcut = nn.Sequential()
        if in_channels != out_channels or stride != 1:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        residual = x  # keep the original input
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)

        # Pass the original input through the shortcut and add it to the output.
        shortcut = self.shortcut(residual)
        out += shortcut
        out = F.relu(out)

        return out
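
The addition in forward only works if the main path and the shortcut produce tensors of the same shape. As a quick sanity check, here is a small pure-Python sketch of the spatial-size arithmetic, using the standard convolution output formula with dilation 1. The helper names conv_output_size and block_output_sizes are hypothetical and only serve this example; the kernel sizes, strides, and paddings are the ones used in the block above.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

def block_output_sizes(size, stride):
    """Sizes produced by the main path and by the 1x1 shortcut convolution."""
    # Main path: conv1 (3x3, given stride, padding 1), then conv2 (3x3, stride 1, padding 1).
    main = conv_output_size(size, kernel_size=3, stride=stride, padding=1)
    main = conv_output_size(main, kernel_size=3, stride=1, padding=1)
    # Shortcut: 1x1 convolution with the same stride and no padding.
    shortcut = conv_output_size(size, kernel_size=1, stride=stride, padding=0)
    return main, shortcut

for size, stride in [(32, 1), (32, 2), (33, 2)]:
    print(size, stride, block_output_sizes(size, stride))
```

Both paths give the same size for every input because a 3x3 kernel with padding 1 and a 1x1 kernel with padding 0 shrink the input by the same amount before the stride is applied.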

ResNet Colab example code

ViT

Easily Understand ViT (Vision Transformer): Principles and Source Code
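
The first step of ViT is to cut the image into fixed-size, non-overlapping patches and flatten each one, so the image becomes a sequence of tokens for the Transformer. Here is a minimal sketch of just that splitting step, assuming a single-channel image stored as a nested list. The function name split_into_patches is hypothetical, and the learned linear projection and position embeddings that follow in a real ViT are left out.

```python
def split_into_patches(image, patch_size):
    """Split an H x W grid into flattened, non-overlapping patches (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size window into a single list.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches), patches[0])  # prints 4 [0, 1, 4, 5]
```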

Text features

ViT example code

Multimodal, ResNet, ViT, cosine similarity
Algorithm notes
© 2020 Gina