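The thumbnail for this article has nothing to do with its content, does it?

Today I'll run quantization in PyTorch.

I have previously read through some third-party explanatory articles, and this post assumes that background.

Looking into explanatory articles on quantization

https://blueqat.com/yuichiro_minato2/b1889005-dbf3-4619-998e-f92af042bbe1

Quantization is a technique that compresses a model's parameters by lowering their numerical precision.

I found two relevant pages in the official PyTorch documentation.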

QUANTIZATION

https://pytorch.org/docs/stable/quantization.html

QUANTIZATION RECIPE

https://pytorch.org/tutorials/recipes/quantization.html

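As for third-party articles, there was this one:

PyTorchの量子化をかるく動かしてみる【Quantization】 (Trying out PyTorch quantization)

https://zuoli.tech/?p=27

This time I'd like to look at the beginning of the official documentation.

"PyTorch supports INT8 quantization, which compared to typical FP32 models allows a 4x reduction in model size and memory bandwidth requirements. Hardware support for INT8 computation is typically 2 to 4 times faster than FP32 computation. Quantization is primarily a technique to speed up inference (rest omitted)"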
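The 4x figure follows from element sizes: an FP32 value takes 4 bytes and an INT8 value takes 1. Here is a minimal sketch to check this. The count of 1,000 parameters is an arbitrary choice, and the small overhead of storing scale and zero point is ignored.

```python
import torch

# An arbitrary number of parameters, chosen only for illustration.
num_params = 1000

fp32_weights = torch.zeros(num_params, dtype=torch.float32)
int8_weights = torch.zeros(num_params, dtype=torch.int8)

# element_size() returns the number of bytes used by one element.
fp32_bytes = fp32_weights.element_size() * fp32_weights.numel()
int8_bytes = int8_weights.element_size() * int8_weights.numel()

ratio = fp32_bytes / int8_bytes
print(fp32_bytes, int8_bytes, ratio)
```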

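"PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32 and then converted to INT8. In addition, PyTorch also supports quantization aware training, which models quantization errors in both the forward and backward passes using fake-quantization modules. Note that the entire computation is carried out in floating point. At the end of quantization aware training, PyTorch provides conversion functions to convert the trained model into lower precision."

Quantization aware training, which I studied a little, is also supported, so I'd like to try it.

"PyTorch provides two different modes of quantization: Eager Mode Quantization and FX Graph Mode Quantization."

"New users of quantization are encouraged to try out FX Graph Mode Quantization first. If it does not work, they may follow the guidelines for FX Graph Mode Quantization or fall back to Eager Mode Quantization."

So there is FX Graph Mode, which handles things automatically, and Eager Mode, and beginners are advised to try the automatic mode first.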

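"PyTorch supports the following three types of quantization:

1. **Dynamic Quantization**:

- Weights are quantized, while activations are read and written in floating point and quantized for compute.

2. **Static Quantization**:

- Both weights and activations are quantized, and calibration is required after training.

3. **Static Quantization Aware Training**:

- Both weights and activations are quantized, and the quantization numerics are modeled during training.

For a more comprehensive overview of the tradeoffs between these quantization types, see our blog post "Introduction to Quantization on PyTorch".

Operator coverage differs between dynamic and static quantization and is captured in the table below. Note that for FX quantization, the corresponding functionals are also supported."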

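I had no idea a blog post like this existed.

Introduction to Quantization on PyTorch

https://pytorch.org/blog/introduction-to-quantization-on-pytorch/

I don't have room to cover it this time, so I'll save it for another occasion. I touched on dynamic quantization, static quantization, and static quantization aware training in an earlier post, so this is a recap, but the descriptions here are quite simple and easy to follow.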

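"Post Training Dynamic Quantization

This is the simplest form of quantization to apply: the weights are quantized ahead of time, while the activations are quantized dynamically during inference. It is used when model execution time is dominated by loading weights from memory rather than by computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch sizes."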
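What "quantized dynamically during inference" means can be sketched with `torch.quantize_per_tensor_dynamic`, which derives the scale and zero point from the value range of each input at call time. This is only an illustration of the idea, not the internal code path of `quantize_dynamic`, and the input values are made up.

```python
import torch

# Two made-up activation tensors with very different value ranges.
small_range = torch.tensor([-1.0, -0.5, 0.0, 0.5, 1.0])
large_range = torch.tensor([-10.0, -5.0, 0.0, 5.0, 10.0])

# The quantization parameters are computed from each tensor at call time,
# so no calibration pass is needed beforehand.
q_small = torch.quantize_per_tensor_dynamic(small_range, torch.quint8, False)
q_large = torch.quantize_per_tensor_dynamic(large_range, torch.quint8, False)

print(q_small.q_scale(), q_small.q_zero_point())
print(q_large.q_scale(), q_large.q_zero_point())
```

The wider input gets a larger scale, which is why activations do not need fixed, pre-computed parameters in this mode.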

For how to extract the weights before and after quantization, this blog was very helpful.

PyTorchの量子化をかるく動かしてみる【Quantization】

https://zuoli.tech/?p=27

You can run it quickly on Google Colab.

Eager Mode Quantization

Post Training Dynamic Quantization

Here is the diagram. The weight of the linear layer is expected to become int8.

    # original model
    # all tensors and computations are in floating point
    previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                     /
    linear_weight_fp32

    # dynamically quantized model
    # linear and LSTM weights are in int8
    previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                         /
       linear_weight_int8

    import torch

    # define a floating point model
    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(4, 4)

        def forward(self, x):
            x = self.fc(x)
            return x

    # create a model instance
    model_fp32 = M()

    # create a quantized model instance
    model_int8 = torch.ao.quantization.quantize_dynamic(
        model_fp32,  # the original model
        {torch.nn.Linear},  # a set of layers to dynamically quantize
        dtype=torch.qint8)  # the target dtype for quantized weights

    # run the model
    input_fp32 = torch.randn(4, 4, 4, 4)
    res = model_int8(input_fp32)

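Let's check that the conversion worked.

    print(model_fp32.fc.weight.dtype)
    print(model_int8.fc.weight().dtype)

The result is:

    torch.float32
    torch.qint8

The weight was converted correctly. Let's also check the dtype of the output.

    print(res.dtype)

    torch.float32

The weight is qint8, but the output is fp32, because in dynamic quantization activations are read and written in floating point. In post-training dynamic quantization, only the weights are stored as int8 and everything else stays in fp32.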
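To see what the int8 weight looks like, the quantized tensor can be unpacked into its raw integers and mapped back to floating point. The following is a minimal sketch that rebuilds the same toy model; the variable names are my own, and the size of the error depends on the random initialization.

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.fc(x)

model_fp32 = M()
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8)

w_fp32 = model_fp32.fc.weight.detach()
w_q = model_int8.fc.weight()  # quantized tensor (torch.qint8)

# Raw 8-bit integers stored for the weight.
w_int = w_q.int_repr()
# Map the integers back to floating point to measure the rounding error.
w_deq = w_q.dequantize()
max_err = (w_fp32 - w_deq).abs().max().item()

print(w_int.dtype, w_deq.dtype)
print(max_err)
```

The error is small but nonzero, which is the precision given up in exchange for the smaller weights.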

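There seems to be a dedicated page for dynamic quantization.

DYNAMIC QUANTIZATION

https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html

I'll study this on another occasion as well.

Post Training Static Quantization (PTQ static) quantizes both the weights and the activations of the model. It fuses activations into preceding layers where possible. It requires calibration with a representative dataset to determine optimal quantization parameters for the activations. Post Training Static Quantization is used when both memory bandwidth and compute efficiency matter, with CNNs (convolutional neural networks) being a typical use case.

You may need to modify the model before applying post-training static quantization. For details, see Model Preparation for Eager Mode Static Quantization.

Here is the diagram.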

    # original model
    # all tensors and computations are in floating point
    previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                        /
        linear_weight_fp32

    # statically quantized model
    # weights and activations are in int8
    previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                        /
        linear_weight_int8

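Unlike before, everything is int8. Following the blog mentioned above, I modified the function slightly so that it returns both the tensor after ReLU but before dequantization and the tensor after dequantization, and ran it: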

    import torch

    # define a floating point model where some layers could be statically quantized
    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # QuantStub converts tensors from floating point to quantized
            self.quant = torch.ao.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)
            self.relu = torch.nn.ReLU()
            # DeQuantStub converts tensors from quantized to floating point
            self.dequant = torch.ao.quantization.DeQuantStub()

        def forward(self, x):
            # manually specify where tensors will be converted from floating
            # point to quantized in the quantized model
            x = self.quant(x)
            x = self.conv(x)
            x = self.relu(x)
            # manually specify where tensors will be converted from quantized
            # to floating point in the quantized model

            # modified here: return the tensor both after and before dequantization
            dequant_x = self.dequant(x)
            return dequant_x, x

    # create a model instance
    model_fp32 = M()

    # model must be set to eval mode for static quantization logic to work
    model_fp32.eval()

    # attach a global qconfig, which contains information about what kind
    # of observers to attach. Use 'x86' for server inference and 'qnnpack'
    # for mobile inference. Other quantization configurations such as selecting
    # symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
    # can be specified here.
    # Note: the old 'fbgemm' is still available but 'x86' is the recommended default
    # for server inference.
    # model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
    model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

    # Fuse the activations to preceding layers, where applicable.
    # This needs to be done manually depending on the model architecture.
    # Common fusions include `conv + relu` and `conv + batchnorm + relu`
    model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

    # Prepare the model for static quantization. This inserts observers in
    # the model that will observe activation tensors during calibration.
    model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

    # calibrate the prepared model to determine quantization parameters for activations
    # in a real world setting, the calibration would be done with a representative dataset
    input_fp32 = torch.randn(4, 1, 4, 4)
    model_fp32_prepared(input_fp32)

    # Convert the observed model to a quantized model. This does several things:
    # quantizes the weights, computes and stores the scale and bias value to be
    # used with each activation tensor, and replaces key operators with quantized
    # implementations.
    model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

    # run the model, relevant calculations will happen in int8
    # modified here as well: unpack both outputs
    res_dequant, res = model_int8(input_fp32)

    print(res_dequant.dtype)
    print(res.dtype)

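    torch.float32
    torch.quint8

Before dequant the tensor is still quint8, so it stays quantized even after passing through the activation function, and you can see it being converted back to fp32 at the end.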
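The example above calibrates with a single random batch. With a real dataset, calibration is a loop of forward passes over representative inputs. The following is a rough sketch: `calibrate` is a hypothetical helper of my own, and the random batches only stand in for a representative dataset.

```python
import torch
import torch.ao.quantization as tq

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

def calibrate(prepared_model, batches):
    """Run calibration batches through an observed model (hypothetical helper)."""
    prepared_model.eval()
    count = 0
    with torch.no_grad():  # observers only need forward passes
        for batch in batches:
            prepared_model(batch)
            count += 1
    return count

model = M().eval()
model.qconfig = tq.get_default_qconfig('x86')
fused = tq.fuse_modules(model, [['conv', 'relu']])
prepared = tq.prepare(fused)

# Random stand-in for a representative dataset.
calibration_batches = [torch.randn(4, 1, 4, 4) for _ in range(8)]
num_batches = calibrate(prepared, calibration_batches)
print(num_batches)
```

After this loop the observers hold statistics from all batches, and `convert` can be called on `prepared` as in the example above.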

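There also seems to be a page on static quantization.

(BETA) STATIC QUANTIZATION WITH EAGER MODE IN PYTORCH

https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html

Next is QAT.

Quantization Aware Training for Static Quantization

"Quantization Aware Training (QAT) models the effects of quantization during training, allowing for higher accuracy compared to other quantization methods. QAT can be done for static, dynamic, or weight-only quantization. During training, all calculations are done in floating point, with fake_quant modules modeling the effects of quantization by clamping and rounding to simulate the effects of INT8. After model conversion, weights and activations are quantized, and activations are fused into the preceding layer where possible. QAT is commonly used with CNNs and yields higher accuracy compared to static quantization."

Let's look at the diagram first.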

    # original model
    # all tensors and computations are in floating point
    previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                          /
        linear_weight_fp32

    # model with fake_quants for modeling quantization numerics during training
    previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                               /
       linear_weight_fp32 -- fq

    # quantized model
    # weights and activations are in int8
    previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                         /
       linear_weight_int8

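During training everything is computed in fp32, while the fake_quant modules simulate int8 at the positions marked fq. I couldn't find more detail here, so I'll look for it another time. After model conversion, everything becomes int8.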
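As a rough picture of what a fake_quant step does, `torch.fake_quantize_per_tensor_affine` rounds and clamps values onto the int8 grid while keeping the tensor in floating point. The sketch below compares it with the same arithmetic written by hand. The scale, zero point, and input values are made up, and this is not the exact module that `prepare_qat` inserts.

```python
import torch

x = torch.tensor([-20.0, -0.26, 0.0, 0.26, 0.52, 20.0])
scale, zero_point = 0.1, 0        # made-up quantization parameters
quant_min, quant_max = -128, 127  # signed 8-bit range

# Built-in fake quantization: the output stays float32 but only takes
# values that are representable on the int8 grid.
fq = torch.fake_quantize_per_tensor_affine(
    x, scale, zero_point, quant_min, quant_max)

# The same operation written out by hand: quantize, clamp, dequantize.
q = torch.clamp(torch.round(x / scale) + zero_point, quant_min, quant_max)
manual = (q - zero_point) * scale

print(fq)
print(manual)
```

Values outside the representable range, such as 20.0 here, are clamped to the edge of the grid, so training sees both the rounding and the clamping effects.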

Let's run it.

    import torch

    # define a floating point model where some layers could benefit from QAT
    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # QuantStub converts tensors from floating point to quantized
            self.quant = torch.ao.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)
            self.bn = torch.nn.BatchNorm2d(1)
            self.relu = torch.nn.ReLU()
            # DeQuantStub converts tensors from quantized to floating point
            self.dequant = torch.ao.quantization.DeQuantStub()

        def forward(self, x):
            x = self.quant(x)
            x = self.conv(x)
            x = self.bn(x)
            x = self.relu(x)
            x = self.dequant(x)
            return x

    # create a model instance
    model_fp32 = M()

    # model must be set to eval for fusion to work
    model_fp32.eval()

    # attach a global qconfig, which contains information about what kind
    # of observers to attach. Use 'x86' for server inference and 'qnnpack'
    # for mobile inference. Other quantization configurations such as selecting
    # symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
    # can be specified here.
    # Note: the old 'fbgemm' is still available but 'x86' is the recommended default
    # for server inference.
    # model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
    model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

    # fuse the activations to preceding layers, where applicable
    # this needs to be done manually depending on the model architecture
    model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
        [['conv', 'bn', 'relu']])

    # Prepare the model for QAT. This inserts observers and fake_quants in
    # the model that will observe weight and activation tensors during calibration.
    # the model needs to be set to train for QAT logic to work
    model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

    # run the training loop (not shown)
    # training_loop(model_fp32_prepared)

    # Convert the observed model to a quantized model. This does several things:
    # quantizes the weights, computes and stores the scale and bias value to be
    # used with each activation tensor, fuses modules where appropriate,
    # and replaces key operators with quantized implementations.
    model_fp32_prepared.eval()
    model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

    # run the model, relevant calculations will happen in int8
    # (input defined here so that this snippet runs on its own)
    input_fp32 = torch.randn(4, 1, 4, 4)
    res = model_int8(input_fp32)

I commented out training_loop along the way. Training apparently follows this procedure. This time I only checked the steps.
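For reference, here is one way the omitted training step could look. This is only a sketch under my own assumptions: `training_loop` is a hypothetical stand-in that trains on random data with a made-up regression target, and the model drops BatchNorm and fusion to keep the example short.

```python
import torch
import torch.ao.quantization as tq

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

def training_loop(model, steps=5):
    """Hypothetical stand-in for a real training loop, using random data."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    losses = []
    for _ in range(steps):
        inputs = torch.randn(4, 1, 4, 4)
        targets = torch.relu(inputs)  # made-up regression target
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # gradients flow through the fake_quant modules
        optimizer.step()
        losses.append(loss.item())
    return losses

model = M()
model.qconfig = tq.get_default_qat_qconfig('x86')
prepared = tq.prepare_qat(model.train())

losses = training_loop(prepared)
print(losses)
```

The loop itself is ordinary floating point training. The only difference from a normal model is that the prepared modules apply fake quantization to weights and activations in the forward pass.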

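Model Preparation for Eager Mode Static Quantization

Currently, the model definition needs some modification before Eager Mode quantization, because quantization currently works on a module-by-module basis. Specifically, for all quantization techniques, the user needs to:

- Convert any operations that require output requantization (and thus have additional parameters) from functional form to module form (for example, use `torch.nn.ReLU` instead of `torch.nn.functional.relu`).

- Specify which parts of the model need to be quantized, either by assigning `.qconfig` attributes on submodules or by specifying `qconfig_mapping`. For example, setting `model.conv1.qconfig = None` means that the `model.conv1` layer will not be quantized, and setting `model.linear1.qconfig = custom_qconfig` means that the quantization settings for `model.linear1` use `custom_qconfig` instead of the global `qconfig`.

For static quantization techniques that quantize activations, the user additionally needs to:

- Specify where activations are quantized and de-quantized. This is done using the `QuantStub` and `DeQuantStub` modules.

- Use `FloatFunctional` to wrap tensor operations that require special handling for quantization into modules. Examples are operations like `add` and `cat`, which need special handling to determine the output quantization parameters.

- Fuse modules: combine operations/modules into a single module to obtain higher accuracy and performance. This is done with the `fuse_modules()` API, which takes lists of modules to be fused. Currently the following fusions are supported: [Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu].

I'll simply follow these instructions.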
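As an illustration of the `FloatFunctional` point, a skip connection written as `x + y` can be replaced by a module call. The block below is a made-up example of my own (`ResidualBlock` and `skip_add` are hypothetical names), shown before any quantization is applied, where `FloatFunctional` simply behaves like the plain operation.

```python
import torch
import torch.ao.quantization as tq
from torch.ao.nn.quantized import FloatFunctional

class ResidualBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()        # module form, not functional relu
        self.skip_add = FloatFunctional()  # wraps the `+` of the skip connection
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        y = self.relu(self.conv(x))
        out = self.skip_add.add(x, y)      # instead of `x + y`
        return self.dequant(out)

model = ResidualBlock().eval()
x = torch.randn(2, 1, 4, 4)
out = model(x)

# In the float model the wrapped add gives the same result as `x + y`.
expected = x + torch.relu(model.conv(x))
print(torch.allclose(out, expected))
```

Because the addition is now a module, `prepare` has a place to attach an observer for its output, which is what a bare `+` cannot provide in Eager Mode.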

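Next is (Prototype) FX Graph Mode Quantization

Post-training quantization has several quantization types (weight only, dynamic, static), and the configuration is done through `qconfig_mapping` (an argument of the `prepare_fx` function).

FXPTQ API Example:

The official example had no model or input data, so I added them in the code below. With those in place it should be usable, although I did not specify the finer details.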

    import torch
    from torch.ao.quantization import (
        get_default_qconfig_mapping,
        get_default_qat_qconfig_mapping,
        QConfigMapping,
    )

    import torch.ao.quantization.quantize_fx as quantize_fx
    import copy

    # define a floating point model
    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(4, 4)

        def forward(self, x):
            x = self.fc(x)
            return x

    model_fp = M()

    # input data used as the example input for tracing
    input_fp32 = torch.randn(4, 4, 4, 4)

    #
    # post training dynamic/weight_only quantization
    #

    # we need to deepcopy if we still want to keep model_fp unchanged after
    # quantization since quantization apis change the input model
    model_to_quantize = copy.deepcopy(model_fp)
    model_to_quantize.eval()
    qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_dynamic_qconfig)
    # a tuple of one or more example inputs are needed to trace the model
    # (the trailing comma makes this a tuple)
    example_inputs = (input_fp32,)
    # prepare
    model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
    # no calibration needed when we only have dynamic/weight_only quantization
    # quantize
    model_quantized = quantize_fx.convert_fx(model_prepared)

    #
    # post training static quantization
    #

    model_to_quantize = copy.deepcopy(model_fp)
    qconfig_mapping = get_default_qconfig_mapping("qnnpack")
    model_to_quantize.eval()
    # prepare
    model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
    # calibrate (not shown)
    # quantize
    model_quantized = quantize_fx.convert_fx(model_prepared)

    #
    # quantization aware training for static quantization
    #

    model_to_quantize = copy.deepcopy(model_fp)
    qconfig_mapping = get_default_qat_qconfig_mapping("qnnpack")
    model_to_quantize.train()
    # prepare
    model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_mapping, example_inputs)
    # training loop (not shown)
    # quantize
    model_quantized = quantize_fx.convert_fx(model_prepared)

    #
    # fusion
    #
    model_to_quantize = copy.deepcopy(model_fp)
    model_fused = quantize_fx.fuse_fx(model_to_quantize)

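Several links were provided for more information on FX Graph Mode Quantization.

The rest of the page was explanation rather than examples.

Quantization Stack

Quantization is the process of converting a floating point model to a quantized model. The quantization stack can therefore be split broadly into two parts: 1) the building blocks and abstractions for a quantized model, and 2) the building blocks and abstractions for the quantization flow that converts a floating point model to a quantized model.

Quantized Model

Quantized Tensor

To do quantization in PyTorch, we need to be able to represent quantized data in tensors. A quantized tensor allows quantized data (represented as int8/uint8/int32) to be stored together with quantization parameters such as scale and zero point. Quantized tensors not only make quantized arithmetic easier, they also allow data to be serialized in a quantized format.

PyTorch supports both per-tensor and per-channel symmetric and asymmetric quantization. Per tensor means that all values within the tensor are quantized the same way, with the same quantization parameters. Per channel means that for each slice along one dimension (typically the channel dimension of the tensor), the values are quantized with different quantization parameters. This reduces the error when converting tensors to quantized values, because an outlier only affects the channel it is in rather than the entire tensor.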
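The effect of an outlier can be checked numerically. In this sketch the weight values are made up, and the scales are derived by hand as max |w| / 127, which is a simple symmetric choice and not necessarily what PyTorch's observers compute.

```python
import torch

# Made-up weight: row 0 contains an outlier, row 1 holds small values.
w = torch.tensor([[100.0, -50.0, 25.0],
                  [0.1, -0.2, 0.3]])

# Per-tensor symmetric: one scale derived from the global max |w|.
scale_pt = (w.abs().max() / 127).item()
q_pt = torch.quantize_per_tensor(w, scale_pt, 0, torch.qint8)

# Per-channel symmetric: one scale per row (axis=0).
scales_pc = w.abs().amax(dim=1) / 127
zero_points_pc = torch.zeros(2, dtype=torch.int64)
q_pc = torch.quantize_per_channel(w, scales_pc, zero_points_pc, 0, torch.qint8)

# Reconstruction error on the row with small values.
err_pt = (w[1] - q_pt.dequantize()[1]).abs().max().item()
err_pc = (w[1] - q_pc.dequantize()[1]).abs().max().item()
print(err_pt, err_pc)
```

With a single shared scale the small row is rounded to zero, while the per-channel version keeps it almost intact.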

The mapping is performed by converting the floating point tensor as follows:

    Q = round(x/scale + zero_point)
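A small worked example of this formula with `torch.quantize_per_tensor`. The input values, scale, and zero point are made up for illustration.

```python
import torch

x = torch.tensor([-1.0, 0.0, 0.6, 1.0])
scale, zero_point = 0.25, 10  # made-up quantization parameters

q = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)

# Stored integers: round(x / scale + zero_point) gives 6, 10, 12, 14
q_int = q.int_repr()
print(q_int)

# Reverse mapping: (Q - zero_point) * scale gives -1.0, 0.0, 0.5, 1.0
x_restored = q.dequantize()
print(x_restored)
```

Note that 0.6 comes back as 0.5: 0.6 / 0.25 = 2.4 is rounded to 2, and that rounding is the quantization error.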

As for tensor attributes, the quantization variants covered so far are supported.

First come the symmetric and asymmetric schemes, each per tensor or per channel:

- `torch.per_tensor_affine`
- `torch.per_tensor_symmetric`
- `torch.per_channel_affine`
- `torch.per_channel_symmetric`

Data types of quantized tensors:

- `torch.quint8`
- `torch.qint8`
- `torch.qint32`
- `torch.float16`

Quantization parameters for per-tensor quantization:

- `scale` (float)
- `zero_point` (int)

Quantization parameters for per-channel quantization:

- `per_channel_scales` (list of float)
- `per_channel_zero_points` (list of int)
- `axis` (int)

Conversion

Quantize (float -> quantized)

- `torch.quantize_per_tensor(x, scale, zero_point, dtype)`
- `torch.quantize_per_channel(x, scales, zero_points, axis, dtype)`
- `torch.quantize_per_tensor_dynamic(x, dtype, reduce_range)`
- `to(torch.float16)`

Dequantize (quantized -> float)

- `quantized_tensor.dequantize()`: calling dequantize on a `torch.float16` Tensor will convert the Tensor back to `torch.float`
- `torch.dequantize(x)`

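Quantization Engine

When a quantized model is executed, the qengine (`torch.backends.quantized.engine`) specifies which backend is used for execution. It is important to make sure that the qengine is compatible with the quantized model in terms of the value range of the quantized activations and weights.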
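The engine can be inspected and switched at runtime. Which engines are available depends on how PyTorch was built, so this sketch only switches when the preferred one (here `'x86'`, my own choice to match the qconfig used earlier) is actually supported.

```python
import torch

# Backends compiled into this PyTorch build (the list depends on the platform).
supported = torch.backends.quantized.supported_engines
print(supported)

# Engine currently used to run quantized operators.
print(torch.backends.quantized.engine)

# Select an engine only if this build supports it.
preferred = 'x86'
if preferred in supported:
    torch.backends.quantized.engine = preferred

current_engine = torch.backends.quantized.engine
print(current_engine)
```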

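Quantization Flow

Observer and FakeQuantize

Observer is a PyTorch module used to:

- Collect tensor statistics, such as the min and max values of the tensors passing through the Observer.

- Calculate quantization parameters based on the collected tensor statistics.

FakeQuantize is a PyTorch module used to:

- Simulate quantization (performing quantize/dequantize) for a tensor in the network.

- Calculate quantization parameters based on the statistics collected from the Observer, or alternatively learn the quantization parameters.

Some fairly important building blocks show up right away here.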
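An Observer can also be used on its own, which makes its two roles easy to see. A minimal sketch with `MinMaxObserver`, where the input values are made up:

```python
import torch
from torch.ao.quantization import MinMaxObserver

observer = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)

# Each forward call updates the running min/max; the input is returned unchanged.
observer(torch.tensor([-1.0, 0.5]))
observer(torch.tensor([0.0, 3.0]))

# Quantization parameters derived from the observed range [-1, 3].
scale, zero_point = observer.calculate_qparams()
print(observer.min_val, observer.max_val)
print(scale, zero_point)
```

The scale comes out close to (3 - (-1)) / 255, i.e. the observed range spread over the 256 levels of `quint8`.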

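QConfig

QConfig is a namedtuple of Observer or FakeQuantize module classes that can be configured with qscheme, dtype, and so on. It is used to configure how an operator should be observed.

Quantization configuration for an operator/module

- Different types of Observer/FakeQuantize
- dtype
- qscheme
- quant_min/quant_max: can be used to simulate lower-precision tensors
- Currently, configuration for activation and weight is supported

Input/weight/output observers are inserted based on the qconfig set for a given operator or module.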
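A custom QConfig can be assembled from observer classes with `with_args`. The particular combination below (per-tensor affine activations, per-channel symmetric weights) is just one plausible choice for illustration, not a recommendation from the documentation.

```python
import torch
from torch.ao.quantization import (
    QConfig, MinMaxObserver, PerChannelMinMaxObserver)

custom_qconfig = QConfig(
    activation=MinMaxObserver.with_args(
        dtype=torch.quint8, qscheme=torch.per_tensor_affine),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric),
)

# QConfig stores observer factories; calling one creates a fresh observer.
activation_observer = custom_qconfig.activation()
weight_observer = custom_qconfig.weight()
print(type(activation_observer).__name__, type(weight_observer).__name__)
```

Such an object can then be assigned to `model.qconfig` or to the `.qconfig` of a single submodule.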

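General Quantization Flow

In general, the quantization flow looks like this.

Prepare

- Insert Observer/FakeQuantize modules based on the user-specified qconfig.

Calibrate/train (depending on post-training quantization or quantization aware training)

- Allow the Observers to collect statistics, or the FakeQuantize modules to learn the quantization parameters.

Convert

- Convert the calibrated/trained model to a quantized model.

Quantization modes can be classified in two ways:

1. By where the quantization flow is applied:

- Post-training quantization (quantization is applied after training, and the quantization parameters are calculated from sample calibration data)

- Quantization aware training (quantization is simulated during training, and the quantization parameters are learned together with the model using the training data)

2. By how the operators are quantized:

- Weight-only quantization (only the weights are statically quantized)

- Dynamic quantization (the weights are statically quantized and the activations are dynamically quantized)

- Static quantization (both the weights and the activations are statically quantized)

Different ways of quantizing operators can be mixed within the same quantization flow. For example, post-training quantization can contain both statically and dynamically quantized operators.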

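So that covers the general flow. I also looked at concrete code for both Eager Mode and FX Graph Mode in PyTorch. Personally, both seem easy to use. Going forward, it should be possible to run benchmarks with concrete models and data. That's all.

Running Quantization in PyTorch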

Yuichiro Minato
