
raise ValueError("AMP and Manual Mixed Precision Training are both activated! Error"). –. 删除使用flags参数"amp"的判断,用于适配昇腾910 AI ...
Atlas Data Center Solution V100R020C00
Issue 01 (2020-11-28)
© 2020
Contents

1 Overview
2 Restrictions and Limitations
3 Environment Setup
4 Model Porting
  4.1 Estimator Porting
  4.2 sess.run Porting
  4.3 Keras Porting
    4.3.1 Keras Porting Overview
    4.3.2 Running Keras Scripts Directly
    4.3.3 Converting a Keras Model to an NPUEstimator
5 Distributed Training
  5.1 Overview
  5.2 Networking
    5.2.1 Single-Server Scenario
    5.2.2 Multi-Server Scenario
    5.2.3 Atlas 300T 9000 Scenario
  5.3 ranktable Configuration
  5.4 Porting a Distributed Training Script
    5.4.1 Data Sharding
    5.4.2 Distributed Training with NPUDistributedOptimizer
    5.4.3 Horovod Porting
  5.5 Collective Communication
    5.5.1 API Overview
    5.5.2 Initialization and Shutdown
    5.5.3 Creating a group
    5.5.4 Querying group Information
    5.5.5 Gradient Split Strategies
    5.5.6 Collective Communication on Tensors
6 Training Features
  6.1 Mixed Precision
  6.2 Loss Scaling
  6.3 Mixed Computing
  6.4 Profiling
  6.5 Dump
  6.6 Setting iterations_per_loop
  6.7 Log/Summary
  6.8 ...
  6.9 ...
  6.10 Converting a ckpt to a pb
7 ...
8 API Reference
  8.1 Supported TensorFlow APIs
  8.2 TF Adapter APIs
    8.2.1 Overview
    8.2.2 NPURunConfig
    8.2.3 ProfilingConfig
    8.2.4 DumpConfig
    8.2.5 NPUEstimator
    8.2.6 NPUEstimatorSpec
    8.2.7 NPUCheckpointSaverHook
    8.2.8 NPUOutputTensorHook
    8.2.9 NPUDistributedOptimizer
    8.2.10 NPULossScaleOptimizer
    8.2.11 NPUOptimizer
    8.2.12 FixedLossScaleManager
    8.2.13 ExponentialUpdateLossScaleManager
    8.2.14 dropout
    8.2.15 LARSV2
    8.2.16 initialize_system
    8.2.17 shutdown_system
    8.2.18 without_npu_compile_scope
    8.2.19 set_iteration_per_loop
    8.2.20 create_iteration_per_loop_var
    8.2.21 load_iteration_per_loop_var
    8.2.22 model_to_npu_estimator
    8.2.23 sess.run session configuration
  8.3 HCCL APIs
    8.3.1 Overview
    8.3.2 group Management APIs
      8.3.2.1 create_group
      8.3.2.2 destroy_group
      8.3.2.3 get_rank_size
      8.3.2.4 get_local_rank_size
      8.3.2.5 get_rank_id
      8.3.2.6 get_local_rank_id
      8.3.2.7 get_world_rank_from_group_rank
      8.3.2.8 get_group_rank_from_world_rank
    8.3.3 Gradient Split APIs
      8.3.3.1 set_split_strategy_by_idx
      8.3.3.2 set_split_strategy_by_size
    8.3.4 Collective Communication Operators
      8.3.4.1 allreduce
      8.3.4.2 allgather
      8.3.4.3 broadcast
      8.3.4.4 reduce_scatter
      8.3.4.5 send
      8.3.4.6 receive
9 Examples
  9.1 ResNet50 Training on the imagenet Dataset
  9.2 BERT Training on the bookscorpus Dataset with Estimator
10 Appendix
  10.1 Installing gcc 7.3.0
  10.2 ...
1 Overview

This document describes how to port a TensorFlow network model developed with the Python API so that it can be trained on the Ascend AI Processor.
2 Restrictions and Limitations

1. Only TensorFlow 1.15 is supported. Scripts based on other TensorFlow versions must first be migrated to TensorFlow 1.15.
2. Operator shapes must be statically inferable (infershape); unknown shapes are not supported.
3. Format restrictions apply: NCHW and NHWC are supported, and conversions to internal formats are performed automatically.
4. cast operators may be inserted to convert between float32 and float16.
5. Usage restrictions apply to tf.cond and tf.while_loop.
6. The save_checkpoints_secs parameter of NPURunConfig is not supported; use save_checkpoints_steps instead.
7. Usage restrictions apply to Summary operators.
8. When iterations_per_loop is greater than 1, save_checkpoints_steps must be a multiple of iterations_per_loop; the same applies to save_summary_steps and log_step_count_steps. For details, see Log/Summary.
9. Replace gelu and dropout with their Ascend-optimized implementations.
10. Restrictions apply to the String type on the summary/log/data channels.
11. Computations that produce inf/nan values may behave differently under reduced precision.
12. Distributed training:
    a. Every device must run the same graph.
    b. A server supports 1/2/4/8-device (1/2/4/8P) training.
    c. Collective communication supports the int8, int32, float16, and float32 data types.
13. Data preprocessing:
    a. Use tf.data.make_initializable_iterator so that the getnext operator can be sunk to the device.
    b. Set drop_remainder=True on BatchDataset so that every batch has the same batch size.
3 Environment Setup

Before porting, install the Ascend software stack and TensorFlow 1.15.0, and configure the device IP addresses used for collective communication.
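As a quick sanity check (a sketch, not part of the original procedure), you can verify the installed TensorFlow version and inspect the device NIC IPs, which are configured in /etc/hccn.conf on the server (see also 5.3):

import tensorflow as tf

# This guide requires TensorFlow 1.15.0.
print(tf.__version__)

# Device NIC IP addresses used for collective communication.
with open("/etc/hccn.conf") as f:
    print(f.read())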
4 Model Porting

4.1 Estimator Porting
4.2 sess.run Porting
4.3 Keras Porting
4.1 Estimator Porting

Estimator Overview

The Estimator API is a high-level TensorFlow API that was introduced in 2018 with TensorFlow 1.10. It encapsulates training, evaluation, prediction, and export.

Training a model with the Estimator API typically involves four steps:

1. Define the data input function input_fn.
2. Define the model function model_fn.
3. Instantiate a RunConfig and create the Estimator.
4. Start training by calling Estimator.train().

The following describes how to adapt each step so that the model can be trained on the Ascend AI Processor.

Data Preprocessing

Define the input function, for example by wrapping numpy data:

def train_input_fn(train_data, train_labels):
    # Build an input function from numpy arrays.
    return tf.estimator.inputs.numpy_input_fn(
        x={"x": train_data},
        y=train_labels,
        batch_size=FLAGS.batch_size,
        num_epochs=None,  # loop over the data indefinitely
        shuffle=True)
In general, every input shape must be fixed. Note the following when batching a dataset with dataset.batch(batch_size).
On the Ascend AI Processor, every batch must have the same shape, so set drop_remainder to True to discard the final incomplete batch:
dataset = dataset.batch(batch_size, drop_remainder=True)
When the dataset size is not a multiple of batch_size, the remainder samples are dropped. During prediction this means some samples will not be processed, so assertions that compare the number of written results against the number of prediction examples may fail, for example:
assert num_written_lines == num_actual_predict_examples
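One common workaround, shown below as an illustrative sketch (predict_files and the padding helper are not from the original document), is to pad the prediction set up to a multiple of the batch size and drop the padded outputs afterwards:

import tensorflow as tf

def pad_to_batch_multiple(examples, batch_size):
    # Pad the example list so its length is a multiple of batch_size;
    # the padded tail duplicates the last example and is discarded later.
    pad_count = (-len(examples)) % batch_size
    return examples + [examples[-1]] * pad_count, pad_count

# predict_files is an illustrative list of input file names.
padded_files, pad_count = pad_to_batch_multiple(predict_files, 32)
dataset = tf.data.Dataset.from_tensor_slices(padded_files)
dataset = dataset.batch(32, drop_remainder=True)
# After prediction, drop the last pad_count outputs before writing results.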
Replacing dropout

To improve performance, replace tf.nn.dropout with the Ascend-optimized implementation.

Original TensorFlow code:

layers = tf.nn.dropout()

Code after porting:

from npu_bridge.estimator import npu_ops
layers = npu_ops.dropout()
Replacing gelu

Networks such as BERT use the gelu activation. Replace the hand-written implementation with the Ascend-optimized operator.

Original TensorFlow code:

def gelu(x):
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf

layers = gelu(x)

Code after porting:

from npu_bridge.estimator.npu_unary_ops import npu_unary_ops
layers = npu_unary_ops.gelu(x)
Migrating the RunConfig

In native TensorFlow, distribution is configured through the train_distribute parameter of RunConfig. This parameter is not supported on Ascend: distributed training instead wraps the optimizer with NPUDistributedOptimizer to aggregate gradients across NPU devices, and the NPUEstimator automatically adds NPUBroadcastGlobalVariablesHook to broadcast the initial variable values (see 5.4.2).

Replace tf.estimator.RunConfig with NPURunConfig.

Original TensorFlow code:

config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))

Code after porting:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator import npu_ops

npu_config = NPURunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))
Migrating the Estimator

Replace tf.estimator.Estimator with NPUEstimator.

Original TensorFlow code:

mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    config=config,
    model_dir="/tmp/mnist_convnet_model")

Code after porting:

from npu_bridge.estimator.npu.npu_estimator import NPUEstimator

mnist_classifier = NPUEstimator(
    model_fn=cnn_model_fn,
    config=npu_config,
    model_dir="/tmp/mnist_convnet_model")

Starting Training

Training is started in the same way as in native TensorFlow:

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook])
4.2 sess.run Porting

sess.run Overview

The sess.run API is the low-level TensorFlow API. Compared with the Estimator API it is more flexible, but the training loop must be written by hand.

Training with the sess.run API typically involves four steps:

1. Preprocess the data.
2. Build the model: forward computation, loss, and gradient update.
3. Create and configure the session.
4. Run the training loop.

The following describes how to adapt each step so that the model can be trained on the Ascend AI Processor.

Data Preprocessing

As with the Estimator approach, every input shape must be fixed. When batching with dataset.batch(batch_size), every batch on the Ascend AI Processor must have the same shape, so set drop_remainder to True:

dataset = dataset.batch(batch_size, drop_remainder=True)
When the dataset size is not a multiple of batch_size, the remainder samples are dropped. During prediction some samples will then not be processed, so assertions such as the following may fail:
assert num_written_lines == num_actual_predict_examples
Model Building

The forward computation, loss, and gradient update code generally needs no change, apart from the dropout and gelu replacements below.

Replacing dropout: replace tf.nn.dropout with the Ascend-optimized implementation.

Original TensorFlow code:

layers = tf.nn.dropout()

Code after porting:

from npu_bridge.estimator import npu_ops
layers = npu_ops.dropout()

Replacing gelu: networks such as BERT use the gelu activation; replace the hand-written implementation with the Ascend-optimized operator.

Original TensorFlow code:

def gelu(x):
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf

layers = gelu(x)

Code after porting:

from npu_bridge.estimator.npu_unary_ops import npu_unary_ops
layers = npu_unary_ops.gelu(x)
For distributed training with sess.run, broadcast the initial variable values from rank 0 before training:

from npu_bridge.hccl import hccl_ops

rank_size = os.environ.get('RANK_SIZE', '').strip()
if int(rank_size) > 1:
    input = tf.trainable_variables()
    bcast_global_variables_op = hccl_ops.broadcast(input, 0)

Gradients must be aggregated across devices. If the optimizer can be wrapped, use NPUDistributedOptimizer (see 5.4.2); if the script computes gradients explicitly, call allreduce on each gradient:

rank_size = os.environ.get('RANK_SIZE', '').strip()
if int(rank_size) > 1:
    grads = [hccl_ops.allreduce(grad, "sum") for grad in grads]
Session Configuration

To run sess.run training on the Ascend AI Processor, create the session with a custom configuration. The following TensorFlow graph optimizations must be disabled:

rewrite_options.disable_model_pruning
rewrite_options.function_optimization
rewrite_options.constant_folding
rewrite_options.shape_optimization
rewrite_options.arithmetic_optimization
rewrite_options.loop_optimization
rewrite_options.dependency_optimization
rewrite_options.layout_optimizer
rewrite_options.memory_optimization
rewrite_options.remapping

For distributed training, add the GradFusionOptimizer:

rewrite_options.optimizers.extend(["GradFusionOptimizer"])

And enable execution on the Ascend AI Processor:

custom_op.parameter_map["use_off_line"].b = True
Original TensorFlow code:

# Create a feedable iterator.
iterator = Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)

# Get the next batch.
next_batch = iterator.get_next()

# Initialize the iterator with the training dataset.
training_init_op = iterator.make_initializer(train_dataset)

# Initialize variables and create the session.
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

# Get the number of training/validation steps per epoch.
train_batches_per_epoch = int(np.floor(train_size / batch_size))
Code after porting:

from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

# Create a feedable iterator.
iterator = Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)

# Get the next batch.
next_batch = iterator.get_next()

# Initialize the iterator with the training dataset.
training_init_op = iterator.make_initializer(train_dataset)

# Initialize variables.
init = tf.global_variables_initializer()

# Configure the session for the Ascend AI Processor.
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run on the Ascend AI Processor
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

sess = tf.Session(config=config)
sess.run(init)

# Get the number of training/validation steps per epoch.
train_batches_per_epoch = int(np.floor(train_size / batch_size))
Running the Training Loop

Once the session is configured, the training loop itself runs with sess.run exactly as in native TensorFlow:
# Training loop.
for epoch in range(num_epochs):
    # Initialize the iterator with the training dataset.
    sess.run(training_init_op)
    for step in range(train_batches_per_epoch):
        # Get the next batch of data.
        img_batch, label_batch = sess.run(next_batch)
        # Run the training op.
        _, train_loss = sess.run([train_op, loss],
                                 feed_dict={x: img_batch, y_: label_batch, is_training: True})
4.3 Keras Porting

4.3.1 Keras Porting Overview

Keras, like Estimator, is a high-level TensorFlow API. Training with the Keras API typically involves four steps:

1. Preprocess the data.
2. Build the model.
3. Compile the model.
4. Train the model.

A native Keras script cannot run on Ascend unchanged; there are two ways to port it.

Method 1: adapt the script minimally and run it through a configured session (see 4.3.2). In this mode the number of iterations executed on the device per loop is fixed to 1.

Method 2: convert the Keras model to an NPUEstimator with model_to_npu_estimator and train through the Estimator path (see 4.3.3). In this mode the iterations_per_loop parameter of NPURunConfig can sink multiple iterations to the Ascend AI Processor per session.run call, reducing Host-Device interaction and improving performance.
4.3.2 Running Keras Scripts Directly

With minimal adaptation, a Keras script can run on the Ascend AI Processor through a configured session; the number of iterations per loop on the device is fixed to 1. Adapt the script as follows.

1. Set use_off_line to run on the Ascend AI Processor, and bind the configured session to Keras:

import tensorflow as tf
import tensorflow.python.keras as keras
from tensorflow.python.keras import backend as K
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
from npu_bridge.estimator import npu_ops

sess_config = tf.ConfigProto()
custom_op = sess_config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
sess_config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
sess_config.graph_options.rewrite_options.optimizers.extend(["GradFusionOptimizer"])  # required for distributed training
sess = tf.Session(config=sess_config)
K.set_session(sess)

# Preprocess the data...
# Build the model...
# Compile the model...
# Train the model...

sess.close()
2. For distributed training on the Ascend AI Processor, the optimizer passed to Keras must be a native TensorFlow optimizer (not a Keras optimizer), wrapped with NPUDistributedOptimizer:

from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer

opt = tf.compat.v1.train.AdamOptimizer(learning_rate=0.1)
opt = NPUDistributedOptimizer(opt)
keras_model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

Keras callbacks have usage restrictions in this mode.
4.3.3 Converting a Keras Model to an NPUEstimator

A Keras model can be converted to an NPUEstimator and trained through the Estimator path, which supports setting iterations_per_loop.

Rewriting Data Preprocessing as input_fn

After conversion to an NPUEstimator, training data is supplied through the Estimator input_fn, so Keras-style data pipelines such as ImageDataGenerator must be rewritten. Operations such as resize, rescaling, and random flipping move into the dataset parsing function, and file names and labels are supplied as lists of the same length.
Original TensorFlow code:

# Keras data preprocessing.
train_datagen = ImageDataGenerator(rescale=1./255,
                                   horizontal_flip=True)
train_generator = train_datagen.flow_from_directory('data/',
                                                    target_size=(224, 224, 3),
                                                    batch_size=32,
                                                    class_mode='sparse')

Code after porting:

# Parse one example from its file name.
def _parse_function(filename, label):
    image = tf.read_file(filename)
    image = tf.image.decode_image(image)
    image = image / 255.0
    image = tf.image.resize_images(image, [224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    return image, label

def input_fn():
    # File name list.
    filenames = tf.constant(["/data/image1.jpg", "/data/image2.jpg", ...])
    # label[i] is the label of filenames[i]; labels form a list of the same length.
    labels = tf.constant([0, 5, ...])
    # Build a dataset of (filename, label) pairs.
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels)).repeat(10)
    # Map to a dataset of (image_resized, label) pairs.
    dataset = dataset.map(_parse_function)
    # Shuffle and batch into (image_resized_batch, label_batch); keep batches full-sized.
    dataset = dataset.shuffle(buffer_size=1000).batch(32, drop_remainder=True)
    return dataset
Converting the Model with model_to_npu_estimator

Replace Keras training with an NPUEstimator obtained from model_to_npu_estimator.

Original TensorFlow code:

from keras.layers import Input, Dense
from keras.models import Model

# This returns a tensor.
inputs = Input(shape=(224, 224, 3))

# This creates a model that includes
# the Input layer and three Dense layers.
keras_model = ResNet50(input_tensor=inputs, weights=None, include_top=True)
keras_model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

keras_model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=10)

Code after porting:

session_config = tf.ConfigProto()
run_config = NPURunConfig(enable_data_pre_proc=True,
                          session_config=session_config,
                          save_checkpoints_steps=2,
                          model_dir=model_path,
                          iterations_per_loop=10)

# Convert the Keras model to an NPUEstimator.
est_resnet = model_to_npu_estimator(keras_model=keras_model, config=run_config)

# Start training.
est_resnet.train(input_fn=lambda: input_fn(), max_steps=1000)
For distributed training on the Ascend AI Processor, the optimizer passed to Keras must be a native TensorFlow optimizer wrapped with NPUDistributedOptimizer:

from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer

opt = tf.train.AdamOptimizer(0.01)
opt = NPUDistributedOptimizer(opt)
keras_model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

Keras callbacks have usage restrictions after the model is converted to an NPUEstimator.
5 Distributed Training

5.1 Overview
5.2 Networking
5.3 ranktable Configuration
5.4 Porting a Distributed Training Script
5.5 Collective Communication
5.1 Overview

In data-parallel distributed training, the same model runs on multiple Ascend AI Processor devices, each training on its own shard of the data; gradients are aggregated across devices after each step.
Two mainstream architectures exist for gradient aggregation: PS-worker and AllReduce.

In the PS-worker architecture (Figure 5-2), parameter servers aggregate the gradients pushed by workers. In the AllReduce architecture (Figure 5-3), devices exchange gradients with each other directly. Ascend uses the AllReduce architecture.
Figure 5-2 PS-worker architecture (figure not reproduced)

Figure 5-3 AllReduce architecture (figure not reproduced)
In a distributed training run, the initial variable values are first broadcast from one device to all others; after each device computes its gradients, the gradients are aggregated with allreduce.
Two mechanisms are provided:

- NPUDistributedOptimizer: an optimizer wrapper that performs allreduce on the gradients automatically (see 5.4.2).
- HCCL collective communication operators: allreduce, broadcast, allgather, reduce_scatter, send, and receive, which can be called directly from the training script (see the API reference in Chapter 8).
5.2 Networking

5.2.1 Single-Server Scenario

In the single-server scenario, training runs on one server containing eight Ascend AI Processor devices and can use 1, 2, 4, or 8 devices. The eight devices form two clusters, devices 0-3 and devices 4-7; 2-device and 4-device training must follow the device selection rules described in 5.2.2.
Figure 5-4 Devices in a single server (figure not reproduced)
5.2.2 Multi-Server Scenario

A cluster consists of multiple servers, up to 128, each containing eight Ascend AI Processor devices. When every server uses all 8 devices (the 8*n scenario), the number of servers n must be a power of 2.

When each server uses fewer than 8 devices, the supported configurations are 1*n, 2*n, and 4*n, where n is the number of servers. In these scenarios, create a group with create_group before collective communication (see 5.5.3).
Figure 5-5 and Figure 5-6 Multi-server cluster networking (figures not reproduced)
On each server, one TensorFlow training process is launched per Ascend AI Processor device used.

When every server uses all 8 devices, no special device selection is needed. When each server uses 1, 2, or 4 devices, every server must use the same device configuration, and the selected devices must follow these rules:

- 1 device: any device may be used.
- 2 devices: the allowed pairs are [0, 5], [1, 4], [2, 7], and [3, 6].
- 4 devices: the allowed sets are [0, 2, 5, 7] and [1, 3, 4, 6].

The broadcast, allreduce, reduce_scatter, and allgather operators are supported in the 1/2/4-device scenarios.

In summary, the supported cluster configurations are 8*n (n a power of 2) and 1*n/2*n/4*n, where n is the number of servers.
5.2.3 Atlas 300T 9000 Scenario

In this scenario, training runs on Atlas 300T 9000 training cards installed in standard servers. Servers are interconnected through a 100G network, and gradient aggregation uses the Ring and Halving-doubling algorithms.

Figure 5-7 (figure not reproduced)

Note the following:

1. Every server in the cluster must use the same number of training cards.
2. The device NIC IP addresses must be configured and mutually reachable.
3. The supported collective operators are allreduce, broadcast, allgather, and reduce_scatter.
4. Use the HCCL_INTRA_PCIE_ENABLE and HCCL_INTRA_ROCE_ENABLE environment variables to select PCIe or RoCE for intra-server communication.
5. Configure the ranktable file for this scenario (see 5.3).
5.3 ranktable Configuration

Collective communication requires a ranktable file that describes the devices participating in training; the training script locates it through the RANK_TABLE_FILE environment variable.

The ranktable is a JSON file. The examples below configure two devices (2P) in a file named rank_table_2p.json. Note:

- In the single-server scenario, the ranktable describes the 1/2/4/8 devices used on that server.
- In the multi-server 8*n scenario, the ranktable describes all n servers.
- The Atlas 300T 9000 scenario uses its own ranktable layout.

Template 1:
{
    "server_count":"1",                    // number of servers
    "server_list":
    [
        {
            "device":[                     // devices used on this server
                {
                    "device_id":"0",               // device ID (HDC channel)
                    "device_ip":"192.168.0.2",     // device NIC IP address
                    "rank_id":"0"                  // rank ID, numbered from 0
                },
                {
                    "device_id":"1",
                    "device_ip":"192.168.1.2",
                    "rank_id":"1"
                }
            ],
            "server_id":"10.0.0.10"        // server IP address
        }
    ],
    "status":"completed",                  // ranktable status; must be "completed" before training
    "version":"1.0"                        // ranktable template version, "1.0"
}
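Before launching training, it can help to sanity-check the ranktable and export the environment variables the processes read. The following is an illustrative sketch (only RANK_TABLE_FILE and RANK_SIZE appear in this guide; the file path and validation logic are assumptions):

import json
import os

rank_table_path = "/home/test/rank_table_2p.json"  # illustrative path

with open(rank_table_path) as f:
    table = json.load(f)

# The ranktable must be complete before training starts.
assert table["status"] == "completed"

# Collect the devices across all servers and check the rank numbering.
devices = [d for server in table["server_list"] for d in server["device"]]
rank_ids = sorted(int(d["rank_id"]) for d in devices)
assert rank_ids == list(range(len(rank_ids))), "rank_id values must be 0..N-1"

# Point the training processes at the ranktable.
os.environ["RANK_TABLE_FILE"] = rank_table_path
os.environ["RANK_SIZE"] = str(len(devices))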
Table 5-1 ranktable fields (template 1)

server_count: Number of servers.

status: Ranktable status. "completed" means the ranktable is ready for use; "initializing" means it is not yet ready.

version: Ranktable template version; set to "1.0".

server_list: List of servers.

server_id: IP address of the server, for example 10.0.0.10.

device_id: Device ID within the server (the HDC channel ID). Value range: [0, 7].

device_ip: IP address of the device NIC, for example 192.168.1.2. On the server, query the device NIC IPs with: cat /etc/hccn.conf.
In the 8-device ranktable, each entry under "devices" must specify both device_id and device_ip, for example: "devices": [{"device_id": "0", "device_ip": "192.168.100.101"}].
In the 1/2/4-device ranktable, device_ip may be left empty, for example: "devices": [{"device_id": "0", "device_ip": ""}].

rank_id: Rank ID, numbered from 0. Value range: [0, number of devices - 1].
Template 2:

{
    "status":"completed",                      // ranktable status; must be "completed" before training
    "group_count":"1",                         // number of groups; set to 1
    "group_list":                              // list of groups
    [
        {
            "group_name":"hccl_world_group",   // group name; use hccl_world_group
            "instance_count":"2",              // number of instances
            "device_count":"2",                // number of devices in the group
            "instance_list":[
                {
                    "pod_name":"tf-bae41",         // instance name
                    "server_id":"10.0.0.10",       // server IP address
                    "devices":[                    // devices used by this instance
                        {
                            "device_id":"0",               // device ID (HDC channel)
                            "device_ip":"192.168.0.2"      // device NIC IP address
                        }
                    ]
                },
                {
                    "pod_name":"tf-tbdf1",
                    "server_id":"10.0.0.10",
                    "devices":[
                        {
                            "device_id":"1",
                            "device_ip":"192.168.1.2"
                        }
                    ]
                }
            ]
        }
    ]
}
Table 5-2 ranktable fields (template 2)

status: Ranktable status. "completed" means the ranktable is ready for use; "initializing" means it is not yet ready.

group_count: Number of groups; set to 1.

group_list: List of groups.

group_name: Group name. When group_count is 1, set it to hccl_world_group, the default global group containing all devices. User-defined groups are created at runtime with create_group from the ranks of hccl_world_group.

instance_count: Number of instances (pod_name entries) in instance_list.

device_count: Number of devices in the group.

instance_list: List of training instances.

pod_name: Name of an instance in instance_list.

server_id: IP address of the server, for example 10.0.0.10.

devices: Devices used by the instance.

device_id: Device ID within the server (the HDC channel ID). Value range: [0, 7].

device_ip: IP address of the device NIC, for example 192.168.1.2. On the server, query the device NIC IPs with: cat /etc/hccn.conf.
In the 8-device ranktable, each entry under "devices" must specify both device_id and device_ip, for example: "devices": [{"device_id": "0", "device_ip": "192.168.100.101"}].
In the 1/2/4-device ranktable, device_ip may be left empty, for example: "devices": [{"device_id": "0", "device_ip": ""}].
5.4 Porting a Distributed Training Script

5.4.1 Data Sharding

In data-parallel training, each device trains on a different shard of the dataset. Shard the dataset by rank using get_rank_size and get_rank_id:

dataset = dataset.shard(get_rank_size(), get_rank_id())

Shard the dataset before repeating it:

dataset = dataset.repeat( )
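Putting the pieces together, a minimal sharded input pipeline might look like the following sketch (the file list and batch size are illustrative; the hccl.manage.api imports are the ones used throughout this chapter):

import tensorflow as tf
from hccl.manage.api import get_rank_size
from hccl.manage.api import get_rank_id

def input_fn():
    # Illustrative file list; in practice this comes from your dataset.
    filenames = tf.constant(["/data/image1.jpg", "/data/image2.jpg"])
    labels = tf.constant([0, 5])
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    # Give each rank its own shard, then repeat and batch.
    dataset = dataset.shard(get_rank_size(), get_rank_id())
    dataset = dataset.repeat()
    # Batches must be full-sized on the Ascend AI Processor.
    dataset = dataset.batch(32, drop_remainder=True)
    return dataset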
5.4.2 Distributed Training with NPUDistributedOptimizer

In native TensorFlow, distribution is configured through the train_distribute parameter of RunConfig. This parameter is not supported on Ascend; instead, wrap the optimizer with NPUDistributedOptimizer, which performs allreduce on the gradients across NPU devices.

Original TensorFlow code:

def cnn_model_fn(features, labels, mode):
    # Build the network.
    xxx
    # Compute the loss.
    xxx

    # Configure the training op.
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)  # SGD optimizer
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())  # minimize the loss
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
Code after porting:

from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer

def cnn_model_fn(features, labels, mode):
    # Build the network.
    xxx
    # Compute the loss.
    xxx

    # Configure the training op (for TRAIN mode).
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)  # SGD optimizer
        distributedOptimizer = NPUDistributedOptimizer(optimizer)  # NPU distributed optimizer
        train_op = distributedOptimizer.minimize(loss=loss, global_step=tf.train.get_global_step())  # minimize the loss
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
If the original script computes gradients explicitly, for example with grads = tf.gradients(loss, tvars), it cannot simply wrap minimize; rewrite it to call NPUDistributedOptimizer's compute_gradients and apply_gradients so that the allreduce is inserted.
With the Estimator approach, NPUDistributedOptimizer performs the gradient allreduce, and NPUEstimator automatically adds NPUBroadcastGlobalVariablesHook to broadcast the initial variable values. With the sess.run approach, NPUDistributedOptimizer still performs the allreduce, but the initial broadcast must be added manually (see 4.2).
5.4.3 Horovod Porting

Horovod is an open-source distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. Compared with native TensorFlow PS-worker training, Horovod uses AllReduce and requires far fewer code changes.

A Horovod script is ported to the Ascend AI Processor by replacing the Horovod APIs with their NPU counterparts:

Table 5-3 Horovod API mapping

hvd.DistributedOptimizer -> NPUDistributedOptimizer
hvd.init -> see the note on initialize_system below
hvd.local_rank -> get_local_rank_id
hvd.size -> get_rank_size
hvd.rank -> get_rank_id
hvd.BroadcastGlobalVariablesHook -> not needed: when NPUDistributedOptimizer is used, GE automatically inserts the Broadcast of the initial variable values
Before calling get_local_rank_id / get_rank_size / get_rank_id in a sess.run or estimator.train script, HCCL must be initialized by running initialize_system in a session; after training, run shutdown_system and close the session (see 5.5.2).
Original Horovod code:
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
Code after porting:

import tensorflow as tf
from hccl.manage.api import get_local_rank_id
from hccl.manage.api import get_rank_size
from hccl.manage.api import get_rank_id
from npu_bridge.estimator import npu_ops
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

# Build the HCCL initialize/shutdown ops; initialize_system must run
# before any other HCCL API is called.
npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping
init_sess = tf.Session(config=config)
init_sess.run(npu_init)

# Pin GPU to be used to process local rank (one GPU per process)
config.gpu_options.visible_device_list = str(get_local_rank_id())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * get_rank_size())

# Add NPU Distributed Optimizer
opt = NPUDistributedOptimizer(opt)

# The Horovod broadcast hook is not needed: GE inserts the Broadcast
# of the initial variable values automatically.
# hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if get_rank_id() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)

# Shut down HCCL after training.
init_sess.run(npu_shutdown)
init_sess.close()
5.5 Collective Communication

5.5.1 API Overview

Besides the automatic gradient allreduce performed by NPUDistributedOptimizer, HCCL exposes collective communication APIs for rank and group management, gradient split strategies, and collective operations on tensors.

The management and split APIs are provided by the hccl Python package, and the collective communication operators by the npu_bridge package, for example:

from npu_bridge.estimator import npu_ops
from hccl.manage.api import get_rank_size
Table 5-4 HCCL APIs

Rank/group management (module: {install_path_fwkacllib}/fwkacllib/python/site-packages/hccl/hccl/manage/api.py):

create_group: creates a group from the specified ranks (devices).
destroy_group: destroys a group.
get_rank_size: returns the number of devices (ranks) in a group.
get_local_rank_size: returns the number of the group's ranks on the local server.
get_rank_id: returns the rank ID of the current device in a group.
get_local_rank_id: returns the local rank ID of the current device in a group.
get_world_rank_from_group_rank: converts a rank ID within a group to the world rank ID.
get_group_rank_from_world_rank: converts a world rank ID to the rank ID within a group.

Gradient split strategies (module: {install_path_fwkacllib}/fwkacllib/python/site-packages/hccl/hccl/split/api.py):

set_split_strategy_by_idx: sets the gradient split strategy of a group by gradient index.
set_split_strategy_by_size: sets the gradient split strategy of a group by data size percentage.

Collective communication operators (module: {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/hccl/hccl_ops.py):

allreduce: reduces tensors across all ranks of a group.
allgather: gathers tensors from all ranks of a group into one tensor.
broadcast: broadcasts a tensor from the root rank to all ranks of a group.
reduce_scatter: reduces tensors across a group and scatters the result among the ranks.
send: sends a tensor to a peer rank in a group.
receive: receives a tensor from a peer rank in a group.
Table 5-5 Terminology

ranktable: configuration file describing the servers and devices participating in training (see 5.3).

rank: a device participating in collective communication, that is, one training process.

group: a collection of ranks that communicate with each other. The default group hccl_world_group contains all ranks defined in the ranktable. User-defined groups other than hccl_world_group are created with create_group from the ranks in the ranktable.

rank size: the number of ranks in a group, at most 4096. local rank size is the number of the group's ranks on one server: 1, 2, 4, or 8.

rank id: the ID of a rank within a group, numbered from 0 to rank size - 1. In hccl_world_group, the rank id equals the world rank id.

world rank id: the ID of a rank within hccl_world_group, from 0 to rank size - 1.

local rank id: the ID of a rank among the group's ranks on the same server, from 0 to local rank size - 1.

Note: GE fuses gradient allreduce operations; the fusion granularity can be tuned with the gradient split strategies (see 5.5.5).
5.5.2 Initialization and Shutdown

Before calling get_local_rank_id / get_rank_size / get_rank_id in a sess.run or estimator.train script, HCCL must be initialized by running initialize_system; after training, run shutdown_system and close the session.

Method 1: use a dedicated initialization session.

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

init_sess = tf.Session(config=config)
init_sess.run(npu_init)

# Call the HCCL management APIs...
# Train the model...

init_sess.run(npu_shutdown)
init_sess.close()
Method 2: run initialization and shutdown inside the training session.

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    sess.run(npu_init)
    # Call the HCCL management APIs...
    # Train the model...
    sess.run(npu_shutdown)
5.5.3 Creating a group

Use create_group to create a user-defined group from ranks in the ranktable; the group must not be named hccl_world_group.

The following example creates a group named myGroup containing 2 NPU devices, ranks 0 and 1:

from hccl.manage.api import create_group
create_group("myGroup", 2, [0, 1])
5.5.4 Querying group Information

After a group is created, its properties can be queried.

get_rank_size returns the number of NPU devices in the group:

from hccl.manage.api import get_rank_size
rankSize = get_rank_size("myGroup")

get_local_rank_size returns the number of the group's NPU devices on the local server:

from hccl.manage.api import get_local_rank_size
localRankSize = get_local_rank_size("myGroup")

get_rank_id returns the rank ID of the current NPU device in the group:

from hccl.manage.api import get_rank_id
rankId = get_rank_id("myGroup")

get_local_rank_id returns the local rank ID of the current NPU device on the server:

from hccl.manage.api import get_local_rank_id
localRankId = get_local_rank_id("myGroup")
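The rank conversion APIs listed in Table 5-4 can be used in the same way. The sketch below assumes their signatures mirror the query APIs above (a group name plus a rank ID); check them against the API reference in Chapter 8:

from hccl.manage.api import get_world_rank_from_group_rank
from hccl.manage.api import get_group_rank_from_world_rank

# Map rank 1 of myGroup to its rank ID in hccl_world_group...
worldRankId = get_world_rank_from_group_rank("myGroup", 1)
# ...and map a world rank ID back to its rank ID within myGroup.
groupRankId = get_group_rank_from_world_rank(worldRankId, "myGroup")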
5.5.5 Gradient Split Strategies

The fusion of gradient allreduce operations can be tuned with a split strategy.

set_split_strategy_by_idx sets the split strategy of a group by gradient index. For example, the following splits the gradients into three segments ending at indices 20, 100, and 159:

from hccl.split.api import set_split_strategy_by_idx
set_split_strategy_by_idx([20, 100, 159])
set_split_strategy_by_size sets the split strategy of a group by data size percentage. For example, the following splits the gradients into segments of 60%, 20%, and 20%:

from hccl.split.api import set_split_strategy_by_size
set_split_strategy_by_size([60, 20, 20])
5.5.6 Collective Communication on Tensors

The collective communication operators operate on tensors across the ranks of a group.

allreduce

allreduce reduces tensors across all ranks of a group; the reduce operation is given by the reduction argument, for example "sum":
# --------------------- allreduce test (2 npu) ---------------------------------
from npu_bridge.hccl import hccl_ops

tensor = tf.random_uniform((1, 3), minval=1, maxval=10, dtype=tf.float32)
allreduce_test = hccl_ops.allreduce(tensor, "sum")
allgather

allgather gathers the tensors from all ranks of a group into one tensor:

# --------------------- allgather test (2 npu) ---------------------------------
from npu_bridge.hccl import hccl_ops

cCon = tf.constant([1.0, 2.0, 3.0])
allgather_test = hccl_ops.allgather(cCon, 2)
# ---------- rank 0/1: allgather_test = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0] ----------
broadcast

broadcast broadcasts a tensor from the root rank to all ranks of a group:

# --------------------- broadcast test (2 npu) ---------------------------------
from npu_bridge.hccl import hccl_ops

cCon = tf.Variable([1.0, 2.0, 3.0])
input = [cCon]
broadcast_test = hccl_ops.broadcast(input, 0)
# ---------------- rank 0/1: broadcast_test = [1.0, 2.0, 3.0] --------------------
reduce_scatter

reduce_scatter reduces tensors across a group and scatters the result among the ranks; the reduce operation is given by the reduction argument:

# --------------------- reducescatter test (2 npu) -----------------------------
from npu_bridge.hccl import hccl_ops

cCon = tf.constant([1.0, 2.0, 3.0, 4.0])
reducescatter_test = hccl_ops.reduce_scatter(cCon, "sum", 2)
# ----------------- rank 0: reducescatter_test = [2.0, 4.0] ---------------------
# ----------------- rank 1: reducescatter_test = [6.0, 8.0] ---------------------
send and receive

send sends a tensor to a peer rank in a group; a matching receive with the same sr_tag must be posted on the destination rank:

# --------------------------------- send test ------------------------------------
from npu_bridge.hccl import hccl_ops

sr_tag = 0      # tag matching this send with the peer's receive
dest_rank = 1   # destination rank
hccl_ops.send(tensor, sr_tag, dest_rank)

receive receives a tensor from a peer rank in a group:

# --------------------- receive test (2 npu) -----------------------------------
from npu_bridge.hccl import hccl_ops

sr_tag = 0      # tag matching the peer's send
src_rank = 0    # source rank
tensor = hccl_ops.receive(tensor.shape, tensor.dtype, sr_tag, src_rank)
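Because every send must be paired with a receive on the peer rank, a script typically branches on its own rank ID. A minimal sketch (2 ranks assumed; the tensor contents are illustrative):

import tensorflow as tf
from hccl.manage.api import get_rank_id
from npu_bridge.hccl import hccl_ops

sr_tag = 0
tensor = tf.constant([1.0, 2.0, 3.0])

# Rank 0 sends; rank 1 posts the matching receive with the same tag.
if get_rank_id() == 0:
    result = hccl_ops.send(tensor, sr_tag, 1)
else:
    result = hccl_ops.receive(tensor.shape, tensor.dtype, sr_tag, 0)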
6 Training Features

6.1 Mixed Precision
6.2 Loss Scaling
6.3 Mixed Computing
6.4 Profiling
6.5 Dump
6.6 Setting iterations_per_loop
6.7 Log/Summary
6.8 ...
6.9 ...
6.10 Converting a ckpt to a pb
6.1 Mixed Precision

Mixed precision training uses both float16 and float32 in one network, which speeds up training and reduces memory use while keeping accuracy close to float32 training on the Ascend AI Processor.

The precision mode is controlled by the precision_mode parameter:

- allow_fp32_to_fp16: operators run in float32 where possible; operators that do not support float32 (such as Conv2D and DepthwiseConv2D) fall back to float16.
- force_fp16: operators that support both float16 and float32 are forced to run in float16.
- must_keep_origin_dtype: operators keep their original precision; note that operators such as Conv2D that support only float16 cannot then run with float32 inputs.
- allow_mix_precision: starting from float32, operators that gain performance from reduced precision are automatically converted to float16.
With allow_mix_precision, the Ascend AI Processor rewrites float32 operators to float16 where this improves performance, inserting the necessary cast operators. Because reduced precision can affect convergence, combine mixed precision with Loss Scaling (see 6.2).
Estimator Approach

With the Estimator approach, set the precision_mode parameter of NPURunConfig:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator import npu_ops

npu_config = NPURunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False),
    precision_mode="allow_mix_precision")

When allow_mix_precision is enabled, the conversion of individual operators can be further controlled through the operator information library:
1. Go to the operator information library directory, for example /home/HwHiAiUser/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe/config/ (assuming the installation path is "/home/HwHiAiUser/Ascend/nnae/latest/opp").

2. Make the operator information library writable:

chmod u+w aic-ascend910-ops-info.json

3. In aic-ascend910-ops-info.json, set the precision_reduce flag of the target operator:

"precision_reduce": {
    "flag": "true"
}

true: allow converting this operator from float32 to float16.
false: keep this operator in float32; do not convert it to float16.
sess.run Approach

With the sess.run approach, set precision_mode in the session configuration:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    print(sess.run(cost))
The operator-level precision_reduce configuration is the same as in the Estimator approach.
6.2 Loss Scaling

Using Loss Scaling

In float16 training, small gradient values can underflow. Loss Scaling multiplies the loss by a scale factor S before backpropagation and divides the gradients by S afterwards, preserving small gradients.

To enable Loss Scaling on Ascend, replace tf.contrib.mixed_precision.LossScaleOptimizer with NPULossScaleOptimizer (NPUOptimizer integrates the same functionality). NPULossScaleOptimizer supports two loss scale managers:

- FixedLossScaleManager: uses a fixed loss scale throughout training.
- ExponentialUpdateLossScaleManager: adjusts the loss scale dynamically based on the overflow status.

For distributed training, pass is_distributed=True to NPULossScaleOptimizer so that the overflow status is synchronized across devices.

Original TensorFlow code:
if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.bert_loss_scale == 0:
        loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32, incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2, decr_ratio=0.5)
    elif FLAGS.bert_loss_scale >= 1:
        loss_scale_manager = tf.contrib.mixed_precision.FixedLossScaleManager(
            loss_scale=FLAGS.bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)

    opt = tf.contrib.mixed_precision.LossScaleOptimizer(opt_tmp, loss_scale_manager)
Code after porting:

from npu_bridge.estimator.npu.npu_loss_scale_optimizer import NPULossScaleOptimizer
from npu_bridge.estimator.npu.npu_loss_scale_manager import FixedLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_manager import ExponentialUpdateLossScaleManager

if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.bert_loss_scale == 0:
        loss_scale_manager = ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32, incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2, decr_ratio=0.5)
    elif FLAGS.bert_loss_scale >= 1:
        loss_scale_manager = FixedLossScaleManager(loss_scale=FLAGS.bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)

    # With more than 1 device, wrap with NPUDistributedOptimizer first
    # and set is_distributed=True.
    if ops_adapter.size() > 1:
        opt_tmp = NPUDistributedOptimizer(opt_tmp)
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager, is_distributed=True)
    else:
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager)
Handling the global step

With Loss Scaling enabled, the global step must advance only on steps where the gradients are actually applied; when an overflow is detected and the update is skipped, the global step must not advance.

Networks such as resnet50HC that use tf.train.MomentumOptimizer update the global step inside apply_gradients, which already satisfies this requirement and needs no modification.

Networks such as BERT update the global step manually in create_optimizer; this update must be moved into the optimizer's apply_gradients.

Original TensorFlow code, where create_optimizer updates the global step:

def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
                     manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
                     optimizer_type="adam", allreduce_post_accumulation=False):
    ...
    if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
        new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
    else:
        new_global_step = global_step + 1
    new_global_step = tf.identity(new_global_step, name='step_update')
    train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op
Code after porting for Ascend:

1. Remove the global step update from create_optimizer:

def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
                     manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
                     optimizer_type="adam", allreduce_post_accumulation=False):
    ...
    #if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
    #    new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
    #else:
    #    new_global_step = global_step + 1
    #new_global_step = tf.identity(new_global_step, name='step_update')
    #train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op

2. Update the global step in the apply_gradients of AdamWeightDecayOptimizer and LAMBOptimizer, before the return, so that Loss Scaling can skip the update on overflow:

def apply_gradients(self, grads_and_vars, global_step=None, name=None, manual_fp16=False):
    assignments = []
    for (grad, param) in grads_and_vars:
        ...
    new_global_step = global_step + 1
    new_global_step = tf.identity(new_global_step, name='step_update')
    assignments.extend([global_step.assign(new_global_step)])
    return tf.group(*assignments, name=name)
6.3 Mixed Computing

By default, the whole computational graph is executed on the Ascend AI Processor.
With mixed computing enabled, operators that cannot run on the Ascend AI Processor are executed by the host TensorFlow runtime instead, which eases porting at some performance cost. Note:

- With mixed computing, iterations_per_loop must be 1.
- Specific operators can be kept on the host with without_npu_compile_scope.
Estimator Approach

With the Estimator approach, enable mixed computing through the mix_compile_mode parameter of NPURunConfig:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator import npu_ops

session_config = tf.ConfigProto()
config = NPURunConfig(session_config=session_config, mix_compile_mode=True, iterations_per_loop=1)
sess.run Approach

With the sess.run approach, set mix_compile_mode in the session configuration, and use without_npu_compile_scope to keep specific operators on the host:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from npu_bridge.estimator.npu import npu_scope
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

X = tf.random_normal([2,])
Y = tf.random_normal([2,])

with npu_scope.without_npu_compile_scope():
    pred = tf.add(tf.multiply(X, 1.), 0.)

cost = tf.reduce_sum(tf.abs(pred - Y))

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["mix_compile_mode"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    print(sess.run(cost))  # the ops inside without_npu_compile_scope run on the Host
6.4 Profiling

Profiling collects performance data during training. The following profiling options are supported:

- training_trace: traces the training process on the Ascend AI Processor, including forward and backward computation.
- task_trace: traces task execution on the Ascend AI Processor HWTS/AICore.
- op_trace: traces single-operator performance; built on training_trace and task_trace.
Enabling Profiling

Estimator approach: configure profiling through the profiling_config parameter of NPURunConfig:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_config import ProfilingConfig

profiling_options = ['task_trace', 'training_trace']
profiling_config = ProfilingConfig(enable_profiling=True, enable_options=profiling_options)
session_config = tf.ConfigProto()
config = NPURunConfig(profiling_config=profiling_config, session_config=session_config)

When training_trace is enabled, also set the forward-propagation start and back-propagation end operators through environment variables:

export FP_POINT=resnet_v1_50_1/conv1/Conv2D
export BP_POINT=add_1
sess.run approach: set profiling_mode and profiling_options in the session configuration:

custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["profiling_mode"].b = True
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes("task_trace:training_trace")

with tf.Session(config=config) as sess:
    print(sess.run(cost))

When training_trace is enabled, also set:

export FP_POINT=resnet_v1_50_1/conv1/Conv2D
export BP_POINT=add_1
Alternatively, profiling can be enabled through environment variables:

export PROFILING_MODE=true
export PROFILING_OPTIONS=training_trace:task_trace
export FP_POINT=resnet_v1_50_1/conv1/Conv2D
export BP_POINT=add_1

Viewing Profiling Results

Profiling results are written to /var/log/npu/profiling/JOBxxxxAAA, including:

- first_runtime_task_trace_data...: task trace data with kernel names
- hwts.log.data.45.dev.profiler_default_tag: AICORE task data
- ts_track.data.44.dev.profiler_default_tag: AICPU task data
- training_trace.46.dev.profiler_default_tag: training trace data
Parsing the collected data requires the Profiling tools; see the Profiling user guide.
6.5 Dump

The Data Dump feature dumps operator input/output data during training, which helps locate accuracy issues. The dump mode can be:

- input: dump operator inputs
- output: dump operator outputs
- all: dump both inputs and outputs

Dumping large amounts of data slows training and consumes disk space (dump data can reach GB scale), so enable dump only when needed.
Enabling Dump: Estimator Approach

With the Estimator approach, configure dump through the dump_config parameter of NPURunConfig, passing a DumpConfig instance:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_config import DumpConfig

# enable_dump: whether to enable dump
# dump_path: dump output path; data is written to /var/log/npu/ide_daemon/dump/{dump_path}
# dump_step: steps to dump; None dumps all steps; separate steps with "|", e.g. 0|5|10,
#            or give ranges with "-", e.g. 0|3-5|10
# dump_mode: dump mode: input, output, or all
dump_config = DumpConfig(enable_dump=True, dump_path="/tmp", dump_step="0|5|10", dump_mode="all")

session_config = tf.ConfigProto()
config = NPURunConfig(dump_config=dump_config, session_config=session_config)
Enabling Dump: sess.run Approach

With the sess.run approach, set enable_dump, dump_path, dump_step, and dump_mode in the session configuration:

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer" custom_op.parameter_map["use_off_line"].b = True # dump custom_op.parameter_map["enable_dump"].b = True # dumpdump/var/log/npu/ide_daemon/dump/ {dump_path} custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/tmp") # dumpNonedump"|" 0|5|10"-"0|3-5|10 custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10") # dumpdumpdumpinput/output/all custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")
with tf.Session(config=config) as sess: print(sess.run(cost))
Viewing Dump Data

Dump data is written to /var/log/npu/ide_daemon/dump/{dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index}, together with the GE graph build file ge_proto_xxxxx_Bulid.txt.

The base directory /var/log/npu/ide_daemon/dump is determined by the DUMP_PATH entry of the ada configuration file ide_daemon/ide_daemon.cfg. If DUMP_PATH is not configured, dump data is written on the Host to {WORK_PATH}/ide_daemon/dump/{dump_path}, where {WORK_PATH} is the working path configured in ide_daemon.cfg ("~" refers to the home directory of the user running ada).

The path components are:

- {dump_path}: the configured dump path.
- {time}: the dump timestamp, for example 20200317020343.
- {deviceid}: the device ID.
- {model_name}: the model name; multiple model_name directories may be produced per dump.
- {model_id}: the model ID.
- {data_index}: the dump step index. If dump_step is configured, data_index matches the configured steps; otherwise data_index starts from 0 and increases by 1 for each dumped step.

Dump files are named {op_type}.{op_name}.{taskid}.{timestamp}. In model_name, op_type, and op_name, characters such as ".", "/", and "\" are replaced.

In multi-device (multi-P) training, each device produces its own dump directory.
6.6 Setting iterations_per_loop

iterations_per_loop is the number of training iterations executed on the Device per session.run call. Training data is sunk to the Device, and results are returned to the Host only once per loop, which reduces Host-Device interaction and improves performance. Note:

- When iterations_per_loop is greater than 1, save_checkpoints_steps must be a multiple of iterations_per_loop; the same applies to save_summary_steps and log_step_count_steps (see Log/Summary).
- When mix_compile_mode is set to True, iterations_per_loop must be 1.
- When enable_data_pre_proc is enabled and the dataset uses tf.data.make_initializable_iterator(), the getnext operator is sunk to the Device and iterations_per_loop can be greater than 1.
- When enable_data_pre_proc is disabled, or the dataset uses tf.data.make_one_shot_iterator(), getnext is not sunk to the Device and iterations_per_loop must be 1.
- With the Estimator approach, the dataset returned by input_fn is processed by the Estimator framework with tf.data.make_initializable_iterator().

When the conditions for a value greater than 1 are not met, keep iterations_per_loop at its default value 1.
Estimator Approach

With the Estimator approach, set iterations_per_loop in NPURunConfig:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator import npu_ops

session_config = tf.ConfigProto()
config = NPURunConfig(session_config=session_config, iterations_per_loop=10)
sess.run Approach

With the sess.run approach, call set_iteration_per_loop; each subsequent session.run call on the returned op then executes iterations_per_loop iterations on the Device:
from __future__ import print_function
import input_data
from npu_bridge.estimator.npu import util
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

mnist = input_data.read_data_sets("/test/", one_hot=True)

import tensorflow as tf
# Hyperparameters
learning_rate = 0.01
training_epochs = 10
batch_size = 100
display_step = 1

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# Model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Model
pred = tf.nn.softmax(tf.matmul(x, W) + b)

# Loss
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=1))

# Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Variable initializer
init = tf.global_variables_initializer()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run on the Ascend AI Processor
custom_op.parameter_map["mix_compile_mode"].b = False  # disable mixed computing
custom_op.parameter_map["iterations_per_loop"].i = 10  # used together with set_iteration_per_loop
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

# Training
with tf.Session(config=config) as sess:
    sess.run(init)
    # Each sess.run call executes 10 iterations on the Device.
    train_op = util.set_iteration_per_loop(sess, optimizer, 10)

    for epoch in range(training_epochs):
        avg_cost = 0
        total_batch = int(mnist.train.num_examples / batch_size)
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            _, c = sess.run([train_op, cost], feed_dict={x: batch_xs, y: batch_ys})
            avg_cost += c / total_batch
If the session is managed by tf.train.Supervisor, the graph cannot be modified after the Supervisor is created, so set_iteration_per_loop cannot be used directly; instead use create_iteration_per_loop_var and load_iteration_per_loop_var:
from __future__ import print_function
import input_data
from npu_bridge.estimator.npu import util
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
import tensorflow as tf

mnist = input_data.read_data_sets("/test/", one_hot=True)

# Hyperparameters
learning_rate = 0.01
training_epochs = 10
batch_size = 100
display_step = 1

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# Model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

pred = tf.nn.softmax(tf.matmul(x, W) + b)
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["mix_compile_mode"].b = False
custom_op.parameter_map["iterations_per_loop"].i = 10
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    sess.run(init)
    # Each sess.run executes 10 iterations on the Device
    iteration = util.IterationPerLoop()
    train_op = iteration.create_iteration_per_loop_var(optimizer)  # rewrite train_op
    tf.train.Supervisor(logdir="/home/xxxx", init_op=init)         # Supervisor-managed session
    iteration.load_iteration_per_loop_var(sess, 10)                # set the loop count
    for epoch in range(training_epochs):
        avg_cost = 0
        total_batch = int(mnist.train.num_examples / batch_size)
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            _, c = sess.run([train_op, cost], feed_dict={x: batch_xs, y: batch_ys})
            avg_cost += c / total_batch
Verifying that iterations_per_loop takes effect

After training starts, check the Host log: if it contains "Insert op success", iterations_per_loop has taken effect (Figure 6-1, log screenshot omitted). When iterations_per_loop is 1, this log entry is not printed (Figure 6-2, log screenshot omitted).
6.7 Obtaining Log/Summary Data

When the training graph is offloaded, Log and Summary operators execute on the Device together with the rest of the graph, so their per-step Log/Summary data must be sent back to the Host through a channel.
Obtaining Log data

In Estimator mode, the framework dequeues the Device-side Log data on the Host automatically. Make the print op a control dependency of train_op so that it executes every step:

print_op = tf.print(loss)
with tf.control_dependencies([print_op]):
    train_op = xxx  # ops that depend on print_op; the print runs before them
In sess.run mode, you must start a dequeue thread on the Host yourself to read the Log data from the channel:

from threading import Thread
import sys

def dequeue():
    global config
    tf.reset_default_graph()
    outfeed_log_tensors = npu_ops.outfeed_dequeue_op(
        channel_name="_npu_log",
        output_types=[tf.string],
        output_shapes=[()])
    dequeue_ops = tf.print(outfeed_log_tensors, sys.stderr)
    with tf.Session() as sess:
        i = 0
        while i < get_next_times:
            sess.run(dequeue_ops)
            i = i + 1

t1 = Thread(target=dequeue)
t1.start()
Assert and Print operators produce Log data in the same way; again make them control dependencies of train_op:

print_op = tf.print(loss)
with tf.control_dependencies([print_op]):
    train_op = xxx  # ops that depend on print_op
Obtaining Summary data

In Estimator mode, Summary data is collected through the host_call mechanism. Define a host_call function that writes the Summary data:

def _host_call_fn(gs, loss):
    with summary.create_file_writer("./model", max_queue=1000).as_default():
        with summary.always_record_summaries():
            summary.scalar("host_call_loss", loss, step=gs)
            return summary.all_summary_ops()

host_call is a parameter of NPUEstimatorSpec: the referenced tensors are enqueued on the Device each step and dequeued on the Host, where the Summary is written. host_call is a tuple of (function, list or dict of tensors) and takes effect only in train() and evaluate():

from npu_bridge.estimator.npu.npu_estimator import NPUEstimatorSpec

host_call = (_host_call_fn, [global_step, loss])
return NPUEstimatorSpec(mode=tf.estimator.ModeKeys.TRAIN, loss=loss,
                        train_op=train_op, host_call=host_call)
Complete example:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator
from npu_bridge.estimator.npu.npu_estimator import NPUEstimatorSpec
from tensorflow.contrib import summary

# host_call function: writes the Summary data on the Host
def _host_call_fn(gs, loss):
    with summary.create_file_writer("./model", max_queue=1000).as_default():
        with summary.always_record_summaries():
            summary.scalar("host_call_loss", loss, step=gs)
            return summary.all_summary_ops()

def input_fn():
    "returns the dataset"

# Pass host_call to NPUEstimatorSpec in model_fn
def model_fn():
    "builds the model"
    model = ***
    loss = ***
    optimizer = tf.train.MomentumOptimizer(learning_rate=c, momentum=0.9)
    global_step = tf.train.get_or_create_global_step()
    grad_vars = optimizer.compute_gradients(loss)
    minimize_op = optimizer.apply_gradients(grad_vars, global_step)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    train_op = tf.group(minimize_op, update_ops)
    host_call = (_host_call_fn, [global_step, loss])
    return NPUEstimatorSpec(mode=tf.estimator.ModeKeys.TRAIN, loss=loss,
                            train_op=train_op, host_call=host_call)

run_config = NPURunConfig()
classifier = NPUEstimator(model_fn=model_fn, config=run_config, params={})
classifier.train(input_fn=lambda: input_fn(), max_steps=get_next_times)
6.8 Improving Data Preprocessing Performance

Data preprocessing runs on the Host CPU, while the training computation runs on the Ascend AI Processor. If preprocessing cannot keep up, the AI Processor idles while waiting for data. Two common optimizations:

- When the AI Processor is data-bound, fuse map and batch with map_and_batch to reduce the Host CPU overhead.
- Use TensorFlow's prefetch so that preprocessing on the Host CPU overlaps with computation on the AI Processor.

In the original pipeline, the TFRecordDataset is shuffled, mapped and batched, and the AI Processor waits for each batch:

train_dataset = tf.contrib.data.TFRecordDataset("./train_new.tfrecords")
train_dataset = train_dataset.shuffle(1000)
train_dataset = train_dataset.map(parse_tf)
train_dataset = train_dataset.batch(batch_size)

Adding prefetch after map and batch lets the Host CPU prepare the next batches while the AI Processor computes:

train_dataset = tf.contrib.data.TFRecordDataset("./train_new.tfrecords")
train_dataset = train_dataset.shuffle(1000)
train_dataset = train_dataset.map(parse_tf)
train_dataset = train_dataset.batch(batch_size)
train_dataset = train_dataset.prefetch(buffer_size=buffer_size)
Binding training processes to CPU cores

In multi-device (multi-P) training, the training processes share the Host CPU; binding each process to its own cores avoids contention. For a server with 8 NPUs:

1. Query the total number of Host CPU cores, for example Total CPU = 96.
2. Compute the number of cores per process: n = Total CPU / 8 = 12.
3. Start each training process with "taskset -c 0-n-1", binding it to its own core range.

For Device 0:

taskset -c 0-11 python3.7 /home/test/xxx.py /
For Device 7:

taskset -c 84-95 python3.7 /home/test/xxx.py /
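The per-device commands above can also be generated programmatically. A minimal sketch, assuming 8 devices and that the training script /home/test/xxx.py (the example path used above) reads DEVICE_ID from its environment:

# Sketch: bind one training process per device to a dedicated block of
# Host CPU cores via taskset. Assumes 8 devices on the server.
import os
import subprocess

total_cpus = os.cpu_count()          # e.g. 96
devices = 8
per_proc = total_cpus // devices     # e.g. 12 cores per process

procs = []
for dev in range(devices):
    start = dev * per_proc
    end = start + per_proc - 1
    env = dict(os.environ, DEVICE_ID=str(dev))
    cmd = ["taskset", "-c", "%d-%d" % (start, end),
           "python3.7", "/home/test/xxx.py"]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()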
6.9 Setting the Gradient Split Strategy

In distributed training, gradients are aggregated across Devices through allreduce. How the gradients are split into allreduce segments determines how well communication overlaps with computation. For ResNet50, for example, a two-segment split in which the first segment carries 96.54% of the gradient data and the second 3.46% gives good overlap.

Use the Profiling Training Trace data to evaluate a strategy: collect Profiling data before and after changing the split and compare the iteration timelines. The iteration trace of the AI Processor records timestamps such as fp_start, bp_end, allreduce1_start, allreduce1_end, allreduce2_start, allreduce2_end and Iteration_end; ideally the first allreduce segment (AR1) overlaps backpropagation (BP)/forward propagation (FP), and the second segment (AR2) finishes close to the iteration end.

Tuning guidance (the exact ratios depend on the network):
1. If AR2 runs long after AR1 completes, move more gradient data into the first segment, for example from a 50%/50% split to 80%/20%.
2. If AR1 is too large to overlap BP/FP, shrink the first segment, for example from 90%/10% to 80%/20%.
3. If BP/FP leaves no room to overlap AR1 at all, adjust the segment boundary so that the tail segment carries the residual communication after BP/FP completes.

Two APIs set the allreduce split strategy:

set_split_strategy_by_idx sets the strategy by gradient index within the group:

from hccl.split.api import set_split_strategy_by_idx
set_split_strategy_by_idx([20, 100, 159])

set_split_strategy_by_size sets the strategy by percentage of the gradient data volume within the group:

from hccl.split.api import set_split_strategy_by_size
set_split_strategy_by_size([60, 20, 20])
Complete example enabling Profiling and setting an allreduce split strategy:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from hccl.split.api import set_split_strategy_by_size
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# Enable Profiling and collect the training trace
custom_op.parameter_map["profiling_mode"].b = True
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes("training_trace")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    sess.run(npu_init)
    set_split_strategy_by_size([80, 20])  # allreduce split strategy
    # ... training ...
    sess.run(npu_shutdown)
6.10 Converting a Checkpoint (ckpt) Model to pb

6.10.1 Procedure

During training, saver = tf.train.Saver() followed by saver.save() saves the model as checkpoint files. Besides the checkpoint index file, a checkpoint consists of:

model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta

TensorFlow's freeze_graph tool converts a checkpoint into a frozen pb model for inference:

1. Train the network and save the checkpoint files.
2. Replace the dataset iterator (IteratorV2) with a placeholder as the graph input.
3. Define the inference output: drop training-only nodes such as the loss computation and take the network output, for example an Argmax on the logits produced by the final BiasAdd.
4. Switch BatchNorm and dropout to inference behavior. BatchNorm must use the moving statistics instead of the batch statistics, and dropout must keep all activations (keep rate 1), for example:

if is_training:
    x = npu_ops.dropout(x, 0.65)
else:
    x = npu_ops.dropout(x, 1.0)

Both are usually controlled by passing is_training=False when building the graph:

# alexnet.inference builds the inference graph
logits = alexnet.inference(inputs, version="he_uniform",
                           num_classes=1000, is_training=False)

5. Save the inference graph with tf.train.write_graph as the pb input of freeze_graph.
6. Run freeze_graph to merge the pb graph saved by tf.train.write_graph with the checkpoint weights into a frozen pb model.
6.10.2 Example

import tensorflow as tf
from tensorflow.python.tools import freeze_graph
from npu_bridge.estimator import npu_ops

# Import the network definition
import alexnet

# Checkpoint path
ckpt_path = "/opt/npu/model_ckpt/alexnet/model_8p/model.ckpt-0"

def main():
    tf.reset_default_graph()
    # Input placeholder of the inference graph
    inputs = tf.placeholder(tf.float32, shape=[None, 224, 224, 3], name="input")
    # Build the inference graph
    logits = alexnet.inference(inputs, version="he_uniform",
                               num_classes=1000, is_training=False)
    # Output node
    predict_class = tf.argmax(logits, axis=1, output_type=tf.int32, name="output")
    with tf.Session() as sess:
        # Save the graph as ./pb_model/model.pb;
        # model.pb is the input_graph argument of freeze_graph
        tf.train.write_graph(sess.graph_def, './pb_model', 'model.pb')
        freeze_graph.freeze_graph(
            input_graph='./pb_model/model.pb',    # graph saved by write_graph
            input_saver='',
            input_binary=False,
            input_checkpoint=ckpt_path,           # checkpoint to merge
            output_node_names='output',           # output node defined above
            restore_op_name='save/restore_all',
            filename_tensor_name='save/Const:0',
            output_graph='./pb_model/alexnet.pb', # frozen pb output
            clear_devices=False,
            initializer_nodes='')
    print("done")

if __name__ == '__main__':
    main()
Key freeze_graph parameters:
- input_graph: the pb graph file saved by write_graph.
- input_binary: whether input_graph is binary; True for binary, False for text. Here input_graph is a text pb, so False.
- input_checkpoint: path of the checkpoint file.
- output_node_names: name(s) of the output node(s).
- output_graph: path of the generated frozen pb file.

After the script completes, ./pb_model/alexnet.pb is the frozen pb model ready for inference.
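To sanity-check the conversion, the frozen graph can be loaded and run once on dummy data. A minimal sketch, assuming the alexnet.pb produced above and its "input"/"output" node names:

# Sketch: load the frozen pb and run a single dummy inference.
import numpy as np
import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile("./pb_model/alexnet.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    x = graph.get_tensor_by_name("input:0")
    y = graph.get_tensor_by_name("output:0")
    with tf.Session(graph=graph) as sess:
        pred = sess.run(y, feed_dict={x: np.zeros([1, 224, 224, 3], np.float32)})
        print("predicted class:", pred)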
7 Running Training
Training is launched through a shell script, for example:

bash run_npu.sh

The script configures the environment variables and starts the training process. The example below assumes that fwkacllib and opp are installed under /home/HwHiAiUser/Ascend/nnae/latest, the driver under /usr/local/Ascend, and tfplugin under /home/HwHiAiUser/Ascend/tfplugin/latest.
# Installation paths
export install_path=/home/HwHiAiUser/Ascend
export LD_LIBRARY_PATH=/usr/local/lib/:/usr/lib/:$install_path/nnae/latest/fwkacllib/lib64/:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/
export PYTHONPATH=$install_path/nnae/latest/fwkacllib/python/site-packages/te:$install_path/nnae/latest/fwkacllib/python/site-packages/topi:$install_path/nnae/latest/fwkacllib/python/site-packages/hccl:$install_path/tfplugin/latest/tfplugin/python/site-packages:$install_path/nnae/latest/opp/op_impl/built-in/ai_core/tbe
export PATH=$install_path/nnae/latest/fwkacllib/ccec_compiler/bin:$PATH
export ASCEND_OPP_PATH=$install_path/nnae/latest/opp
export SOC_VERSION=Ascend910
export JOB_ID=10087
export DEVICE_ID=0
export RANK_TABLE_FILE=/home/test/rank_table_2p.json
# Rank configuration
export RANK_ID=0
export RANK_SIZE=8
# Launch training
python3.7 /home/test/xxx.py /

In multi-device (multi-P) training, each process must set its own DEVICE_ID and RANK_ID, for example for the second device:

export DEVICE_ID=1
export RANK_ID=1
The environment variables are described in Table 7-1.

Table 7-1 Environment variables

- LD_LIBRARY_PATH: dynamic library search path. If the gcc shipped with the OS (for example CentOS, Debian or BClinux) does not meet the requirements, install gcc (for example 10.1-4) separately and add its lib64 directory, ".../xxx/xxx/xxx/lib64" (where ".../xxx/xxx/xxx/" is the gcc installation path), to LD_LIBRARY_PATH.
- PYTHONPATH: Python module search path; must include the fwkacllib and tfplugin site-packages directories and the built-in TBE operator implementations.
- PATH: executable search path; must include the ccec compiler.
- ASCEND_OPP_PATH: root directory of the operator package (opp).
- SOC_VERSION: chip version, here Ascend910.
- JOB_ID: training job ID.
- DEVICE_ID: Device ID of this process, in the range [0,7].
- RANK_TABLE_FILE: path of the ranktable file describing the devices used for training.
- RANK_ID: rank ID of this process; must be consistent with the rank_id/pod_name configuration in the ranktable file.
- RANK_SIZE: number of Devices used for training.
- GE_USE_STATIC_MEMORY: whether to use static memory allocation. For very large networks (for example BERT24, whose feature-map memory can exceed 25G in multi-P training), set this variable to 1 to use static memory. In static mode, graph_memory_max_size plus variable_memory_max_size must not exceed 31G; in Estimator mode configure these two sizes through NPURunConfig, in sess.run mode through the session config.
- TE_PARALLEL_COMPILER: number of parallel operator compilation processes (default 8; 0 disables parallel compilation). A reasonable value is Host CPU cores * 80% / number of training processes (P).
- PROFILING_MODE: whether to enable Profiling; true enables it (PROFILING_OPTIONS must also be set), false disables it.
- PROFILING_OPTIONS: Profiling collection options:
  - training_trace: iteration trace of the training task on the AI software stack.
  - task_trace: task trace, including HWTS/AICore hardware data.
  - op_trace: single-operator trace, built on training_trace and task_trace.
- FP_POINT: required when Profiling training_trace is enabled; specifies the graph node at which forward propagation starts. Save the graph with tf.io.write_graph to graph.pbtxt and use the name of the first forward operator.
- BP_POINT: required when Profiling training_trace is enabled; specifies the graph node at which backpropagation ends. Find the name of the last backward operator in the graph.pbtxt saved with tf.io.write_graph.
- HCCL_INTRA_PCIE_ENABLE / HCCL_INTRA_ROCE_ENABLE: together select the intra-server communication link on servers built from Atlas 300T 9000 cards:
  - HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=0: communication within the Server uses PCIe.
  - HCCL_INTRA_PCIE_ENABLE=0 and HCCL_INTRA_ROCE_ENABLE=1: communication within the Server uses RoCE.
  - When both are 0, the default intra-server link (PCIe) is used; setting both to 1 is not a valid combination.
  Communication across Servers on Atlas 300T 9000-based servers uses RoCE.
- SKT_ENABLE: whether to enable superkernel task fusion: 1 fuses tasks into a super task, 0 (default) leaves tasks unfused.
- OP_NO_REUSE_MEM: disables memory reuse for the specified operators; the value is a comma-separated (",") list of operator names and/or operator types, for example:

export OP_NO_REUSE_MEM=gradients/logits/semantic/kernel/Regularizer/l2_regularizer_grad/Mul_1,resnet_v1_50/conv1_1/BatchNorm/AssignMovingAvg2
export OP_NO_REUSE_MEM=FusedMulAddN,BatchNorm
export OP_NO_REUSE_MEM=FusedMulAddN,resnet_v1_50/conv1_1/BatchNorm/AssignMovingAvg

- DUMP_GE_GRAPH: controls the content of the dumped GE build graphs:
  - 1: full dump.
  - 2: basic dump, without weight data.
  - 3: simplified dump showing only node relationships.
- DUMP_GRAPH_LEVEL: controls which GE graphs are dumped:
  - 1: dump all graphs.
  - 2: dump all graphs except subgraphs (default).
  - 3: dump only the last generated graph.
  Used together with DUMP_GE_GRAPH; the default level is 2.
8 API Support

8.1 TensorFlow API Support
8.2 TF Adapter APIs
8.3 Collective Communication APIs

8.1 TensorFlow API Support

The Ascend AI Processor is adapted to TensorFlow 1.15. This section lists the support status of the TensorFlow Python APIs.

Supported Python APIs

Table 8-1 lists the TensorFlow Python APIs supported in this version.
Table 8-1 Supported Python APIs

- tf: assert_same_float_dtype, assert_scalar, assert_type, dimension_at_index, dimension_value, get_logger, get_static_value, grad_pass_through, GradientTape, is_tensor, local_variables_initializer, make_ndarray, make_template, make_tensor_proto, min_max_variable_partitioner, no_gradient, NoGradient, RaggedTensorSpec, recompute_grad, resource_variables_enabled, TensorSpec, TypeSpec, UnconnectedGradients
- tf.config.threading: get_inter_op_parallelism_threads, get_intra_op_parallelism_threads, set_inter_op_parallelism_threads, set_intra_op_parallelism_threads
- tf.estimator: add_metrics, BestExporter, classifier_parse_example_spec, EstimatorSpec, EvalSpec, Exporter, FinalExporter, Head, LatestExporter, ModeKeys, regressor_parse_example_spec, RunConfig, train_and_evaluate, TrainSpec, WarmStartSettings
- tf.estimator.export: ClassificationOutput, ExportOutput, PredictOutput, RegressionOutput, ServingInputReceiver, TensorServingInputReceiver
- tf.keras: Sequential
- tf.keras.datasets: boston_housing.load_data, cifar10.load_data, cifar100.load_data, fashion_mnist, fashion_mnist.load_data, imdb, mnist, mnist.load_data, reuters, reuters.get_word_index, reuters.load_data
- tf.keras.datasets.imdb: get_word_index, load_data
- tf.keras.layers: AbstractRNNCell, DenseFeatures, deserialize, serialize
- tf.keras.metrics: Metric
- tf.keras.optimizers: schedules
- tf.keras.optimizers.schedules: deserialize, LearningRateSchedule, serialize
- tf.nest: assert_same_structure, flatten, is_nested, map_structure, pack_sequence_as
- tf.saved_model: Builder, contains_saved_model, load_v2, save
- tf.saved_model.loader: load
- tf.saved_model.signature_def_utils: build_signature_def
- tf.saved_model.utils: build_tensor_info
- tf.summary: Audio
- tf.sysconfig: get_compile_flags, get_include, get_lib, get_link_flags
- tf.train: assert_global_step, basic_train_loop, CheckpointManager, checkpoints_iterator, CheckpointSaverHook, CheckpointSaverListener, FeedFnHook, FinalOpsHook, LooperThread, remove_checkpoint, SessionRunHook, SessionRunValues, summary_iterator, VocabInfo, FeatureLists.FeatureListEntry, Features.FeatureEntry
Python APIs supported with constraints

Table 8-2 lists the TensorFlow Python APIs that are supported with constraints.

Table 8-2 Python APIs supported with constraints

- tf.keras: Model — direct Keras training supports iterations_per_loop=1 only; to offload more iterations per loop, convert the model to an NPUEstimator with model_to_npu_estimator and train through the Estimator.
- tf.keras.applications.densenet: decode_predictions
- tf.keras.applications.imagenet_utils: decode_predictions
- tf.keras.applications.inception_resnet_v2: decode_predictions
- tf.keras.applications.inception_v3: decode_predictions
- tf.keras.applications.mobilenet: decode_predictions
- tf.keras.applications.mobilenet_v2: decode_predictions
- tf.keras.applications.nasnet: decode_predictions
- tf.keras.applications.resnet: decode_predictions
- tf.keras.applications.resnet_v2: decode_predictions
- tf.keras.applications.resnet50: decode_predictions
- tf.keras.applications.vgg16: decode_predictions
- tf.keras.applications.vgg19: decode_predictions
- tf.keras.applications.xception: decode_predictions
- tf.keras.backend: learning_phase_scope
- tf.keras.layers: Layer
- tf.keras.optimizers: Optimizer
Unsupported Python APIs

Table 8-3 lists the TensorFlow Python APIs that are not supported in this version.

Table 8-3 Unsupported Python APIs

- tf: enable_eager_execution, autograph, distribute, disable_v2_tensorshape, enable_control_flow_v2, enable_tensor_equality, enable_v2_behavior, enable_v2_tensorshape, CriticalSection, IndexedSlicesSpec, Module, OptionalSpec, RaggedTensor, function, disable_control_flow_v2, disable_eager_execution, disable_tensor_equality, disable_v2_behavior
- tf.config: get_soft_device_placement, set_soft_device_placement
- tf.config.optimizer: get_experimental_options, get_jit, set_experimental_options, set_jit
- tf.estimator.tpu: TPUConfig, RunConfig, TPUEstimatorSpec, InputPipelineConfig
- tf.keras.preprocessing.image: DirectoryIterator, ImageDataGenerator, Iterator, NumpyArrayIterator, apply_affine_transform, apply_brightness_shift, apply_channel_shift, array_to_img, img_to_array, load_img, random_brightness, random_zoom, save_img, random_rotation, random_shift, random_shear, random_channel_shift
- tf.keras.preprocessing.sequence: pad_sequences, skipgrams, TimeseriesGenerator, make_sampling_table
- tf.keras.preprocessing.text: hashing_trick, one_hot, text_to_word_sequence, Tokenizer
- tf.keras.utils: model_to_dot
- tf.profiler: AdviceProto, AdviceProto.Checker, AdviceProto.CheckersEntry, GraphNodeProto, GraphNodeProto.InputShapesEntry, MultiGraphNodeProto, OpLogProto, OpLogProto.IdToStringEntry, advise, write_op_log
- tf.summary: all_v2_summary_ops
- tf.train: ClusterDef, JobDef, JobDef.TasksEntry
Deprecated Python APIs

Table 8-4 lists Python APIs that TensorFlow has deprecated; they are likewise not recommended on the Ascend AI Processor.

Table 8-4 Deprecated Python APIs

- tf: disable_resource_variables
- tf.train: input_producer, do_quantize_training_on_graphdef, limit_epochs, maybe_batch, maybe_batch_join, maybe_shuffle_batch, maybe_shuffle_batch_join, shuffle_batch_join
- tf.train.queue_runner: QueueRunner, add_queue_runner

These interfaces are deprecated by TensorFlow itself.
8.2 TF Adapter APIs

8.2.1 Overview

TF Adapter sits between TensorFlow and the Ascend software stack and extends TensorFlow with NPU-specific classes and functions (Figure 8-1, architecture diagram omitted). Table 8-5 summarizes the interfaces.
Table 8-5 TF Adapter interfaces (all file paths are relative to {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/)

- NPURunConfig: run configuration for NPU training; inherits RunConfig. File: estimator/npu/npu_config.py
- ProfilingConfig: Profiling configuration object passed to NPURunConfig. File: estimator/npu/npu_config.py
- DumpConfig: dump configuration object passed to NPURunConfig. File: estimator/npu/npu_config.py
- NPUEstimator: NPU training Estimator; inherits Estimator. File: estimator/npu/npu_estimator.py
- NPUEstimatorSpec: model specification returned by model_fn; inherits EstimatorSpec. File: estimator/npu/npu_estimator.py
- NPUCheckpointSaverHook: checkpoint-saving hook; inherits CheckpointSaverHook. File: estimator/npu/npu_hook.py
- NPUOutputTensorHook: hook passed to NPUEstimator train/evaluate/predict that hands the captured tensors to output_fn every N steps. File: estimator/npu/npu_hook.py
- NPUDistributedOptimizer: optimizer wrapper for NPU distributed training across Devices. File: estimator/npu/npu_optimizer.py
- NPULossScaleOptimizer: Loss Scaling optimizer wrapper that prevents float16 underflow in mixed precision training; inherits LossScaleOptimizer. File: estimator/npu/npu_loss_scale_optimizer.py
- NPUOptimizer: combines the functionality of NPUDistributedOptimizer and NPULossScaleOptimizer. File: estimator/npu/npu_optimizer.py
- FixedLossScaleManager: fixed LossScale manager. File: estimator/npu/npu_loss_scale_manager.py
- ExponentialUpdateLossScaleManager: dynamically updated LossScale manager. File: estimator/npu/npu_loss_scale_manager.py
- dropout: NPU counterpart of tf.nn.dropout; keeps an element with probability keep_prob and scales it by 1/keep_prob, otherwise outputs 0; the output Tensor has the same shape as the input Tensor. File: estimator/npu_ops.py
- LARSV2: LARS optimizer operator for large batch size training. File: estimator/npu_ops.py
- initialize_system: initializes GE and collective communication. File: estimator/npu_ops.py
- shutdown_system: releases the Device resources initialized by initialize_system. File: estimator/npu_ops.py
- without_npu_compile_scope: scope whose ops execute on the Host instead of being compiled for the NPU. File: estimator/npu/npu_scope.py
- set_iteration_per_loop: sets the number of iterations each sess.run() executes on the Device. File: estimator/npu/util.py
- create_iteration_per_loop_var: with load_iteration_per_loop_var, the two-step variant of set_iteration_per_loop for Supervisor-managed sessions. File: estimator/npu/util.py
- load_iteration_per_loop_var: loads the iteration count created by create_iteration_per_loop_var. File: estimator/npu/util.py
- model_to_npu_estimator: converts a Keras model to an NPUEstimator. File: estimator/npu/keras_to_npu.py
8.2.2 NPURunConfig

Definition:

def __init__(self,
             iterations_per_loop=1,
             profiling_config=None,
             model_dir=None,
             tf_random_seed=None,
             save_summary_steps=0,
             save_checkpoints_steps=None,
             save_checkpoints_secs=None,
             session_config=None,
             keep_checkpoint_max=5,
             keep_checkpoint_every_n_hours=10000,
             log_step_count_steps=100,
             enable_data_pre_proc=True,
             precision_mode=None,
             enable_reduce_precision=False,
             variable_format_optimize=True,
             mix_compile_mode=False,
             hcom_parallel=False,
             graph_memory_max_size=None,
             variable_memory_max_size=None,
             auto_tune_mode=None,
             dump_config=None,
             stream_max_parallel_num=None,
             is_tailing_optimization=False,
             horovod_mode=False,
             graph_run_mode=1)

Description: NPURunConfig inherits RunConfig and adds NPU-specific run parameters.
Note: when training iterations are offloaded to the Device, save_checkpoints_secs is not supported; use save_checkpoints_steps instead.

Parameters inherited from RunConfig:
- model_dir: directory for saving the model; default None.
- tf_random_seed: graph-level random seed; default None.
- save_summary_steps: interval in steps for saving Summary data; default 0. Takes effect per step only when iterations_per_loop=1; when iterations_per_loop>1, obtain Log/Summary data as described in section 6.7.
- save_checkpoints_steps: save a checkpoint every this many steps; default None. Cannot be set together with save_checkpoints_secs; if both are None, a checkpoint is saved every 100 steps. When iterations_per_loop>1, set save_checkpoints_steps to an integral multiple of iterations_per_loop; otherwise checkpoints are saved at the nearest loop boundary.
- save_checkpoints_secs: save a checkpoint every this many seconds; default None; mutually exclusive with save_checkpoints_steps.
- session_config: ConfigProto for the session; default None.
- keep_checkpoint_max: maximum number of recent checkpoints to keep; default 5.
- keep_checkpoint_every_n_hours: additionally keep one checkpoint every N hours; default 10000, which effectively disables it and leaves retention to keep_checkpoint_max.
- log_step_count_steps: interval in steps for logging the global step and loss; default 100. Takes effect per step only when iterations_per_loop=1; when iterations_per_loop>1, obtain Log/Summary data as described in section 6.7.
The following native RunConfig parameters are not supported by NPURunConfig: train_distribute, device_fn, protocol, eval_distribute, experimental_distribute. In particular:
- experimental_distribute: use the NPUDistributedOptimizer provided by TF Adapter for NPU distributed training instead.
- device_fn: mapping Operations to Devices through a function is not supported.
- protocol: the GRPC protocol between Servers is not supported.

Parameters added by NPURunConfig:
- iterations_per_loop: number of iterations per session.run executed on the AI Processor; default 1. Raising it reduces Host-Device interaction. It must remain 1 when mix_compile_mode is True, and must remain 1 when LossScale is used.
- profiling_config: ProfilingConfig object enabling Profiling (see ProfilingConfig); default None.
- dump_config: DumpConfig object enabling data dump (see DumpConfig); default None.
- enable_data_pre_proc: whether to offload data preprocessing (the getnext node) to the Device; True (default) enables it, False disables it.
- precision_mode: string selecting the operator precision mode:
  - allow_fp32_to_fp16 (the behavior when precision_mode is None): operators run in float32 where possible and fall back to float16 when float32 is not supported.
  - force_fp16: operators that support both float32 and float16 run in float16.
  - must_keep_origin_dtype: keep the original precision.
  - allow_mix_precision: automatic mixed precision, converting eligible float32 operators to float16; combine with Loss Scaling (see NPULossScaleOptimizer) to preserve accuracy.
- enable_reduce_precision: whether an operator that does not support the current precision on the AI Processor may run at reduced precision; True enables, False (default) disables.
- variable_format_optimize: whether to enable variable format optimization, converting variables to NPU-friendly formats to accelerate training; True (default) enables, False disables. The converted format exists only on the Device; checkpoints remain standard TensorFlow format.
- mix_compile_mode: whether to enable mixed computing; when True, operators that cannot run on the AI Processor stay on the Host and are executed by TensorFlow; False (default) offloads the whole graph.
- hcom_parallel: whether Allreduce communication runs in parallel with computation; True enables parallel execution, False (default) keeps them serial.
- graph_memory_max_size: maximum graph (feature map) memory in bytes; range [0, 256*1024*1024*1024], i.e. [0, 274877906944]. graph_memory_max_size plus variable_memory_max_size must not exceed 31G; the default is 26GB.
- variable_memory_max_size: maximum variable memory in bytes; same range; graph_memory_max_size plus variable_memory_max_size must not exceed 31G; the default is 5GB.
- auto_tune_mode: enables the TBE Auto Tune tool, which automatically tunes operator schedules for better AI Core performance, for example:
  auto_tune_mode = "RL,GA"
  RL and GA are the available Auto Tune algorithms; leaving the parameter unset disables Auto Tune.
- stream_max_parallel_num: maximum parallelism of the AICPU/AICORE engines, for example:
  "DNN_VM_TF:10,DNN_V100:1"
  DNN_VM_TF is the AICPU engine (parallelism 10 here) and DNN_V100 the AICORE engine (parallelism 1 here). The default per engine is 1; the valid range is [1,13].
- is_tailing_optimization: whether to enable tail optimization of the gradient AllReduce (AR) communication in distributed training, overlapping the final AR segment with computation; True enables, False (default) disables. It takes effect only with NPUOptimizer, and the is_tailing_optimization argument of NPUOptimizer must be set to the same value.
- horovod_mode: whether the script runs under Horovod distribution; default False.
- graph_run_mode: graph execution mode: 0 for inference, 1 (default) for training.
Returns an NPURunConfig instance, which is passed to NPUEstimator.

Example (setting iterations_per_loop to 1000):

from npu_bridge.estimator.npu.npu_config import NPURunConfig
session_config = tf.ConfigProto()
config = NPURunConfig(session_config=session_config, mix_compile_mode=False, iterations_per_loop=1000)
8.2.3 ProfilingConfig

Definition:

def __init__(self, enable_profiling=False, enable_options=[])

Description: ProfilingConfig configures Profiling data collection.

Parameters:
- enable_profiling: whether to enable Profiling; True enables it (enable_options must also be configured), False (default) disables it.
- enable_options: Profiling collection options:
  - training_trace: iteration trace of the training task on the AI software stack.
  - task_trace: task trace, including HWTS/AICore hardware data.
  - op_trace: single-operator trace, built on training_trace and task_trace.
  Example: ['task_trace','training_trace']

Returns a ProfilingConfig instance, which is passed to NPURunConfig.
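A minimal usage sketch (Estimator mode; the option list is illustrative):

# Sketch: enable training_trace profiling through NPURunConfig.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_config import ProfilingConfig

profiling_config = ProfilingConfig(
    enable_profiling=True,
    enable_options=['training_trace'])   # iteration trace only

session_config = tf.ConfigProto()
run_config = NPURunConfig(session_config=session_config,
                          profiling_config=profiling_config)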
8.2.4 DumpConfig

Definition:

def __init__(self, enable_dump=False, dump_path=None, dump_step=None, dump_mode="output")

Description: DumpConfig configures data dump.

Parameters:
- enable_dump: whether to enable data dump; default False. When True, dump_path must be configured (it cannot be None).
- dump_path: dump storage path; default None. Dump data is written under /var/log/npu/ide_daemon/dump/{dump_path}. If the "DUMP_PATH" field is configured in ide_daemon.cfg of the ada process, the data is stored on the Host under {WORK_PATH}/ide_daemon/dump/{dump_path}, where {WORK_PATH} is the "WORK_PATH" field of ide_daemon.cfg ("~" stands for the home directory of the user running ada). The configured value is appended to the dump root directory.
- dump_step: iterations to dump; default None dumps all iterations. Separate multiple steps with "|", for example 0|5|10; use "-" for ranges, for example 0|3-5|10.
- dump_mode: dump content:
  - input: dump operator inputs only.
  - output (default): dump operator outputs only.
  - all: dump both inputs and outputs.

Returns a DumpConfig instance, which is passed to NPURunConfig.
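A minimal usage sketch (Estimator mode; path and steps are illustrative):

# Sketch: dump operator inputs and outputs at steps 0, 5 and 10.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_config import DumpConfig

dump_config = DumpConfig(
    enable_dump=True,
    dump_path="/tmp",       # appended to the dump root directory
    dump_step="0|5|10",     # "|"-separated steps, "-" for ranges
    dump_mode="all")        # input / output / all

session_config = tf.ConfigProto()
run_config = NPURunConfig(session_config=session_config,
                          dump_config=dump_config)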
8.2.5 NPUEstimator

Definition:

def __init__(self,
             model_fn=None,
             model_dir=None,
             config=None,
             params=None,
             job_start_file='')

Description: NPUEstimator inherits Estimator.

Parameters:
- model_fn: model function; it must return an NPUEstimatorSpec (see NPUEstimatorSpec).
- model_dir: directory for saving model parameters and graphs; if neither model_dir nor config.model_dir is set, a temporary directory under /tmp is used; if both are set, they must match.
- config: NPURunConfig run configuration (see NPURunConfig).
- params: dict of hyperparameters passed through to model_fn.
- job_start_file: job start file path, used in CSA (cluster scheduling) scenarios; default ''.

Returns an NPUEstimator instance.
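Putting the pieces together, a minimal construction sketch; model_fn and input_fn are assumed to be defined as in section 6.7:

# Sketch: build an NPUEstimator from an NPURunConfig and train it.
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator

run_config = NPURunConfig(iterations_per_loop=10)
classifier = NPUEstimator(model_fn=model_fn,   # returns an NPUEstimatorSpec
                          config=run_config,
                          model_dir="./model",
                          params={})
classifier.train(input_fn=lambda: input_fn(), max_steps=1000)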
8.2.6 NPUEstimatorSpec

Definition:

def __new__(cls,
            mode,
            predictions=None,
            loss=None,
            train_op=None,
            eval_metric_ops=None,
            export_outputs=None,
            training_chief_hooks=None,
            training_hooks=None,
            scaffold=None,
            evaluation_hooks=None,
            prediction_hooks=None,
            host_call=None)

Description: NPUEstimatorSpec inherits EstimatorSpec. Like EstimatorSpec, it is returned by model_fn and fully defines the model to the Estimator through mode, predictions, loss, train_op and export_outputs; NPUEstimatorSpec adds the host_call parameter.

Parameters inherited from EstimatorSpec:
- mode: ModeKeys.TRAIN, ModeKeys.EVAL or ModeKeys.PREDICT.
- predictions: predictions Tensor or dict of Tensors; required when mode is ModeKeys.PREDICT.
- loss: training loss Tensor.
- train_op: op that performs a training step.
- eval_metric_ops: dict of metric results keyed by name; each value is a (metric_tensor, update_op) tuple produced by a Metric.
- export_outputs: outputs describing how to export the model to SavedModel.
- training_chief_hooks: SessionRunHooks to run on the chief worker during training.
- training_hooks: SessionRunHooks to run on all workers during training.
- scaffold: Scaffold object carrying saver, init_op, summary_op, global_step and related collections.
- evaluation_hooks: SessionRunHooks to run during evaluation.
- prediction_hooks: SessionRunHooks to run during prediction.

Parameter added by NPUEstimatorSpec:
- host_call: sends per-step data such as Summary values from the Device back to the Host when iterations are offloaded; a tuple of (function, list or dict of tensors). It takes effect only in train() and evaluate(). See section 6.7.

Returns an NPUEstimatorSpec instance.
8.2.7 NPUCheckpointSaverHook

Definition:

def __init__(self,
             checkpoint_dir,
             save_secs=None,
             save_steps=None,
             saver=None,
             checkpoint_basename="model.ckpt",
             scaffold=None,
             listeners=None)

Description: NPUCheckpointSaverHook inherits CheckpointSaverHook; it saves a checkpoint every N steps or seconds. With NPUEstimator, this Hook is not supported when iterations_per_loop > 1.

Parameters:
- checkpoint_dir: directory in which checkpoints are saved.
- save_secs: save interval in seconds.
- save_steps: save interval in steps.
- saver: Saver object used for saving.
- checkpoint_basename: base name of the checkpoint files; default "model.ckpt".
- scaffold: Scaffold object; mutually exclusive with saver.
- listeners: list of CheckpointSaverListener objects notified around each checkpoint save.

Returns an NPUCheckpointSaverHook instance. Example:

from npu_bridge.estimator.npu.npu_hook import NPUCheckpointSaverHook
checkpoint_hook = NPUCheckpointSaverHook(checkpoint_dir='./ckpt', save_steps=2000)
...
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=2000,
    hooks=[checkpoint_hook])
8.2.8 NPUOutputTensorHook

Definition:

def __init__(self,
             tensors,
             dependencies=None,
             output_fn=None,
             output_every_n_steps=0)

Description: NPUOutputTensorHook inherits LoggingTensorHook. Passed to NPUEstimator train/evaluate/predict, it captures the given tensors and calls output_fn with them every N steps. When iterations_per_loop > 1, output_every_n_steps and output_fn do not take effect per step.

Parameters:
- tensors: the Tensors to capture, as a dict mapping names to tensors or as a list.
- dependencies: ops that must execute before the tensors are output.
- output_fn: callback invoked with the captured tensor values.
- output_every_n_steps: call output_fn every N steps.

Returns an NPUOutputTensorHook instance.
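A small attachment sketch; the callback and the loss tensor are illustrative:

# Sketch: hand the loss to a callback every 100 steps.
from npu_bridge.estimator.npu.npu_hook import NPUOutputTensorHook

def my_output_fn(tensor_values):
    # illustrative callback receiving the captured tensor values
    print("captured:", tensor_values)

output_hook = NPUOutputTensorHook(tensors={"loss": loss},
                                  output_fn=my_output_fn,
                                  output_every_n_steps=100)
estimator.train(input_fn=train_input_fn, steps=2000, hooks=[output_hook])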
8.2.9 NPUDistributedOptimizer

Definition:

def __init__(self, optimizer, name=None)

Description: NPUDistributedOptimizer wraps a standard optimizer for NPU distributed training, aggregating the gradients across Devices.

Parameters:
- optimizer: the optimizer to wrap.
- name: optional name for the wrapped optimizer.

Returns an NPUDistributedOptimizer instance. Example:

import tensorflow as tf
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
optimizer = NPUDistributedOptimizer(optimizer)
8.2.10 NPULossScaleOptimizer

Definition:

def __init__(self, opt, loss_scale_manager, is_distributed=False)

Description: NPULossScaleOptimizer inherits LossScaleOptimizer. It applies Loss Scaling during mixed precision training to keep float16 gradients from underflowing.

Parameters:
- opt: the optimizer to wrap.
- loss_scale_manager: the LossScale manager:
  - FixedLossScaleManager keeps the LossScale fixed; see FixedLossScaleManager.
  - ExponentialUpdateLossScaleManager updates the LossScale dynamically; see ExponentialUpdateLossScaleManager.
- is_distributed: whether Loss Scaling runs in a distributed setting; set True in distributed training; default False.

Returns an NPULossScaleOptimizer instance. Example:

from npu_bridge.estimator.npu.npu_loss_scale_optimizer import NPULossScaleOptimizer
from npu_bridge.estimator.npu.npu_loss_scale_manager import FixedLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_manager import ExponentialUpdateLossScaleManager

if FLAGS.use_fp16 and (FLAGS.npu_bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.npu_bert_loss_scale == 0:
        loss_scale_manager = ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32,
            incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2,
            decr_ratio=0.5)
    elif FLAGS.npu_bert_loss_scale >= 1:
        loss_scale_manager = FixedLossScaleManager(loss_scale=FLAGS.npu_bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.npu_bert_loss_scale)
    if ops_adapter.size() > 1:
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager, is_distributed=True)
    else:
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager)
8.2.11 NPUOptimizer

Definition:

def __init__(self,
             opt,
             loss_scale_manager=None,
             is_distributed=False,
             is_loss_scale=False,
             is_tailing_optimization=False,
             name=None)

Description: NPUOptimizer combines the capabilities of NPUDistributedOptimizer and NPULossScaleOptimizer: Loss Scaling for float16 mixed precision training, NPU distributed gradient aggregation across Devices, and tail optimization of the gradient AllReduce (AR) communication.

Parameters:
- opt: the optimizer to wrap.
- loss_scale_manager: required when is_loss_scale is True; as with NPULossScaleOptimizer:
  - FixedLossScaleManager keeps the LossScale fixed; see FixedLossScaleManager.
  - ExponentialUpdateLossScaleManager updates the LossScale dynamically; see ExponentialUpdateLossScaleManager.
- is_distributed: True enables distributed allreduce gradient aggregation; False (default) disables it.
- is_loss_scale: True enables Loss Scaling (loss_scale_manager must then not be None); False (default) disables it.
- is_tailing_optimization: True enables gradient tail communication optimization (requires is_distributed=True); False (default) disables it.
- name: optional name.

Returns an NPUOptimizer instance. Example:

import tensorflow as tf
from npu_bridge.estimator.npu.npu_optimizer import NPUOptimizer
from npu_bridge.estimator.npu.npu_loss_scale_manager import FixedLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_manager import ExponentialUpdateLossScaleManager

# Base optimizer
optimizer = LAMBOptimizer(
    learning_rate=learning_rate,
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-6,
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

# With Loss Scaling
if tf.flags.FLAGS.npu_bert_loss_scale not in [None, -1]:
    if tf.flags.FLAGS.npu_bert_loss_scale == 0:
        loss_scale_manager = lsm_lib.ExponentialUpdateLossScaleManager(
            init_loss_scale=tf.flags.FLAGS.init_loss_scale_value,
            incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2,
            decr_ratio=0.5)
    elif tf.flags.FLAGS.npu_bert_loss_scale >= 1:
        loss_scale_manager = lsm_lib.FixedLossScaleManager(
            loss_scale=tf.flags.FLAGS.npu_bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % tf.flags.FLAGS.npu_bert_loss_scale)
    optimizer = NPUOptimizer(optimizer, loss_scale_manager,
                             is_distributed=tf.flags.FLAGS.distributed,
                             is_loss_scale=True,
                             is_tailing_optimization=True)
# Without loss_scale
else:
    optimizer = NPUOptimizer(optimizer, is_distributed=tf.flags.FLAGS.distributed)
8.2.12 FixedLossScaleManager

Definition:

def __init__(self, loss_scale)

Description: FixedLossScaleManager manages a fixed LossScale value.

Parameters:
- loss_scale: float LossScale value, not less than 1. A value that is too small leaves Loss Scaling without effect, while one that is too large can make gradients overflow; a value that works for the same network on GPUs is a reasonable starting point.

Returns a FixedLossScaleManager instance.
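A minimal usage sketch; the loss scale value and base optimizer are illustrative:

# Sketch: fixed loss scale combined with NPULossScaleOptimizer.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_loss_scale_manager import FixedLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_optimizer import NPULossScaleOptimizer

base_opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
loss_scale_manager = FixedLossScaleManager(loss_scale=1024)  # illustrative value
opt = NPULossScaleOptimizer(base_opt, loss_scale_manager)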
8.2.13 ExponentialUpdateLossScaleManager

Definition:

def __init__(self, init_loss_scale, incr_every_n_steps, decr_every_n_nan_or_inf=2, incr_ratio=2, decr_ratio=0.8)

Description: ExponentialUpdateLossScaleManager adjusts the LossScale dynamically during training.

Parameters:
- init_loss_scale: initial float LossScale value.
- incr_every_n_steps: increase the LossScale after this many consecutive steps without overflow.
- decr_every_n_nan_or_inf: decrease the LossScale after this many steps with NaN or Inf gradients; default 2.
- incr_ratio: factor by which the LossScale is multiplied when increased; default 2.
- decr_ratio: factor by which the LossScale is multiplied when decreased; default 0.8.

Returns an ExponentialUpdateLossScaleManager instance.
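A minimal usage sketch; the manager arguments follow the BERT example in section 8.2.10, and the base optimizer is illustrative:

# Sketch: dynamic loss scaling with exponential updates.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_loss_scale_manager import ExponentialUpdateLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_optimizer import NPULossScaleOptimizer

base_opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
loss_scale_manager = ExponentialUpdateLossScaleManager(
    init_loss_scale=2**32,       # start high, back off on overflow
    incr_every_n_steps=1000,     # raise after 1000 clean steps
    decr_every_n_nan_or_inf=2,   # lower after 2 NaN/Inf steps
    decr_ratio=0.5)
opt = NPULossScaleOptimizer(base_opt, loss_scale_manager)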
8.2.14 dropout

Definition:

def dropout(x, keep_prob, noise_shape=None, seed=None, name=None)

Description: NPU implementation of tf.nn.dropout. With probability keep_prob an element is scaled up by 1/keep_prob, otherwise it is set to 0; the output Tensor has the same shape as the input Tensor.

Parameters:
- x: input Tensor (float).
- keep_prob: probability of keeping each element (float).
- noise_shape: int32 Tensor describing the shape of the randomly generated keep/drop flags.
- seed: random seed.
- name: op name.

Returns a Tensor of the same shape as x with dropout applied. Example:

from npu_bridge.estimator import npu_ops
layers = npu_ops.dropout(x, keep_prob=0.65)
8.2.15 LARSV2

Definition:

def LARSV2(input_weight,
           input_grad,
           weight_decay,
           learning_rate,
           hyperpara=0.001,
           epsilon=0.00001,
           use_clip=False,
           name=None)

Description: LARS optimizer operator; it gives each layer its own adaptive learning rate, which helps convergence in large batch size training.

Parameters:
- input_weight: weight Tensor (float).
- input_grad: gradient Tensor (float).
- weight_decay: weight decay Tensor (float).
- learning_rate: learning rate Tensor (float).
- hyperpara: LARS trust coefficient (float); default 0.001.
- epsilon: small value added to the denominator to avoid division by 0; default 1e-5.
- use_clip: bool, default False; True clamps the computed local learning rate.
- name: op name.

Returns the LARS-adjusted gradient Tensor. Example:

from npu_bridge.estimator import npu_ops
layers = npu_ops.LARSV2(input_weight, input_grad, weight_decay, learning_rate)
8.2.16 initialize_system

Definition:

def initialize_system(name = None)

Description: initializes GE and the collective communication environment. Call it in a dedicated session before training when collective communication APIs must be used outside the training session; that session may also carry the same configuration options as NPURunConfig.

Parameters:
- name: op name (optional).

Returns an op; running sess.run(op) performs the initialization.

If collective communication APIs such as get_local_rank_id / get_rank_size / get_rank_id must be called before sess.run or estimator.train, first run initialize_system in a session to initialize HCCL, and run shutdown_system in a matching session when training finishes.

Example using a dedicated init session:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_int = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

init_sess = tf.Session(config=config)
init_sess.run(npu_int)

# call HCCL APIs...
# ... training ...

init_sess.run(npu_shutdown)
init_sess.close()

Equivalent example using a with-session:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    sess.run(npu_init)
    # call HCCL APIs...
    # ... training ...
    sess.run(npu_shutdown)
8.2.17 shutdown_system

Definition:

def shutdown_system(name = None)

Description: shuts down the Device resources initialized by initialize_system; call it after all computation has finished.

Parameters:
- name: op name (optional).

Returns an op; run it with sess.run(op).
8.2.18 without_npu_compile_scope

Definition:

def without_npu_compile_scope()

Description: ops created inside this scope are excluded from NPU compilation and execute on the Host.
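A short usage sketch, mirroring the sess.run example in section 8.2.23:

# Sketch: keep a small subgraph on the Host while the rest is offloaded.
import tensorflow as tf
from npu_bridge.estimator.npu import npu_scope

X = tf.random_normal([2, ])
with npu_scope.without_npu_compile_scope():
    # these ops are executed by TensorFlow on the Host
    pred = tf.add(tf.multiply(X, 1.), 0.)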
8.2.19 set_iteration_per_loop

Definition:

def set_iteration_per_loop(sess, train_op, iterations_per_loop=1)

Description: sets how many iterations each sess.run executes on the Device and rewrites train_op accordingly, reducing Host-Device interaction. For sessions created through tf.train.Supervisor, set_iteration_per_loop cannot rewrite train_op; use create_iteration_per_loop_var and load_iteration_per_loop_var instead.

Parameters:
- sess: the TensorFlow session.
- train_op: op that performs one training iteration.
- iterations_per_loop: number of iterations per sess.run() on the Device; default 1. It must remain 1 when mix_compile_mode is True.

Returns the rewritten op; each sess.run(op) executes iterations_per_loop iterations.
8.2.20 create_iteration_per_loop_var

Definition:

def create_iteration_per_loop_var(self, train_op)

Description: used together with load_iteration_per_loop_var to set the number of iterations each sess.run() executes on the Device when the session is managed by tf.train.Supervisor. It creates the loop-count variable and rewrites train_op; call load_iteration_per_loop_var afterwards.

Parameters:
- train_op: op that performs one training iteration.

Returns the rewritten op for sess.run(op).
8.2.21 load_iteration_per_loop_var

Definition:

def load_iteration_per_loop_var(self, sess, iterations_per_loop=1)

Description: used after create_iteration_per_loop_var; loads the iteration count so that each sess.run() executes that many iterations on the Device.

Parameters:
- sess: the TensorFlow session.
- iterations_per_loop: number of iterations per sess.run() on the Device; default 1. It must remain 1 when mix_compile_mode is True.
8.2.22 model_to_npu_estimator

Definition:

def model_to_npu_estimator(keras_model=None,
                           keras_model_path=None,
                           custom_objects=None,
                           model_dir=None,
                           checkpoint_format='saver',
                           config=None,
                           job_start_file='')

Description: converts a Keras model to an NPUEstimator, which can then be trained with NPU features such as iteration offloading.

Parameters:
- keras_model: a compiled Keras model object; mutually exclusive with keras_model_path.
- keras_model_path: path of a compiled Keras model saved to HDF5 with save(); mutually exclusive with keras_model.
- custom_objects: dict of user-defined custom objects referenced by the Keras model.
- model_dir: directory for the Estimator output; if neither model_dir nor config.model_dir is set, a temporary directory under /tmp is used.
- checkpoint_format: format of the checkpoints saved when the Estimator is initialized: 'saver' (default) uses tf.train.Saver(), 'checkpoint' uses tf.train.Checkpoint(). tf.train.Checkpoint is the object-based alternative to the name-based tf.train.Saver.
- config: NPURunConfig run configuration passed to the NPUEstimator (see NPURunConfig).
- job_start_file: job start file path for CSA scenarios.

Returns an NPUEstimator converted from the Keras model.
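A minimal conversion sketch; the Keras model itself is illustrative:

# Sketch: compile a small Keras model and convert it to an NPUEstimator.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.keras_to_npu import model_to_npu_estimator

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation='softmax', input_shape=(784,))])
model.compile(optimizer='sgd', loss='categorical_crossentropy')

run_config = NPURunConfig(iterations_per_loop=10)
est = model_to_npu_estimator(keras_model=model, config=run_config)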
8.2.23 Session Configuration in sess.run Mode

In sess.run mode, the NPU options are configured through the session ConfigProto (the parameter_map of the NpuOptimizer custom optimizer). Table 8-6 describes the options.

Table 8-6 Session configuration options

- use_off_line: whether to run training on the AI Processor: True runs on the AI Processor, False (default) runs on the Host CPU.
- enable_data_pre_proc: whether to offload data preprocessing (getnext) to the Device: True enables it, False disables it.
- iterations_per_loop: number of iterations per sess.run() executed on the Device when set_iteration_per_loop is used; must match the value passed to set_iteration_per_loop.
- profiling_mode: whether to enable Profiling: True enables it (profiling_options must also be set), False disables it.
- profiling_options: Profiling collection options:
  - training_trace: iteration trace of the training task on the AI software stack.
  - task_trace: task trace, including HWTS/AICore hardware data.
  - op_trace: single-operator trace, built on training_trace and task_trace.
  Combine options with ":", for example "training_trace:task_trace".
- enable_dump: whether to enable data dump: True (dump_path must then be set, and cannot be None), False disables it.
- dump_path: dump storage path; required when enable_dump is True; appended to the dump root directory.
- dump_step: iterations to dump; None (default) dumps all iterations. Separate steps with "|", for example 0|5|10; "-" denotes ranges, for example 0|3-5|10.
- dump_mode: dump content: input (operator inputs only), output (default, outputs only), all (both).
- precision_mode: string selecting the precision mode: allow_fp32_to_fp16 (the default behavior when unset: fall back from float32 to float16 where float32 is unsupported), force_fp16, must_keep_origin_dtype, allow_mix_precision (automatic mixed precision; combine with Loss Scaling, see NPULossScaleOptimizer).
- enable_reduce_precision: whether an operator that does not support the current precision on the AI Processor may run at reduced precision: True/False.
- variable_format_optimize: whether to enable variable format optimization, converting variables from NCHW to the NPU-private NC1HWC0 format for performance: True/False. The converted format exists only on the Device; checkpoints remain standard TensorFlow format.
- mix_compile_mode: whether to enable mixed computing: True keeps operators that cannot run on the AI Processor on the Host under TensorFlow; False (default) offloads the whole graph.
- hcom_parallel: whether Allreduce runs in parallel with computation: True parallel, False (default) serial.
- graph_memory_max_size: maximum graph (feature map) memory in bytes; range [0, 256*1024*1024*1024] ([0, 274877906944]); graph_memory_max_size plus variable_memory_max_size must not exceed 31G; default 26GB.
- variable_memory_max_size: maximum variable memory in bytes; same range and 31G sum limit; default 5GB.
- auto_tune_mode: enables TBE Auto Tune for AI Core operator schedules, for example auto_tune_mode = "RL,GA".
- stream_max_parallel_num: AICPU/AICORE engine parallelism, for example "DNN_VM_TF:10,DNN_V100:1" (DNN_VM_TF is the AICPU engine, DNN_V100 the AICORE engine); default 1 per engine; range [1,13].
- is_tailing_optimization: whether to enable gradient tail communication optimization; takes effect only with NPUOptimizer, whose is_tailing_optimization argument must match: True/False (default False).
- graph_run_mode: graph execution mode: 0 inference, 1 training (default 1).
Example sess.run script exercising these options:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from npu_bridge.estimator.npu import npu_scope
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

X = tf.random_normal([2,])
Y = tf.random_normal([2,])

with npu_scope.without_npu_compile_scope():
    pred = tf.add(tf.multiply(X, 1.), 0.)

cost = tf.reduce_sum(tf.abs(pred-Y))

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["enable_data_pre_proc"].b = True
custom_op.parameter_map["profiling_mode"].b = True
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes("task_trace")
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
custom_op.parameter_map["variable_format_optimize"].b = True
custom_op.parameter_map["mix_compile_mode"].b = True
custom_op.parameter_map["enable_dump"].b = True
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/tmp/test")
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10")
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")
custom_op.parameter_map["hcom_parallel"].b = True
custom_op.parameter_map["graph_memory_max_size"].s = tf.compat.as_bytes(str(26*1024 * 1024 * 1024))
custom_op.parameter_map["variable_memory_max_size"].s = tf.compat.as_bytes(str(5*1024 * 1024 * 1024))
custom_op.parameter_map["iterations_per_loop"].i = 10
custom_op.parameter_map["auto_tune_mode"].s = tf.compat.as_bytes("RL,GA")
custom_op.parameter_map["stream_max_parallel_num"].s = tf.compat.as_bytes("DNN_VM_TF:10,DNN_V100:1")
custom_op.parameter_map["is_tailing_optimization"].b = True
custom_op.parameter_map["graph_run_mode"].i = 1
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    print(sess.run(cost))
8.3 Collective Communication APIs

8.3.1 Overview

Besides the gradient aggregation (allreduce) performed by NPUDistributedOptimizer, HCCL provides rank management and collective communication APIs for distributed training. Table 8-7 lists them.
Table 8-7 Collective communication APIs

Rank management ({install_path_fwkacllib}/fwkacllib/python/site-packages/hccl/hccl/manage/api.py):
- create_group: creates a collective communication group.
- destroy_group: destroys a group.
- get_rank_size: number of ranks (Devices) in a group.
- get_local_rank_size: number of local ranks of this device's server within a group.
- get_rank_id: rank id of this device within a group.
- get_local_rank_id: local rank id of this device within a group.
- get_world_rank_from_group_rank: converts a group rank id to a world rank id.
- get_group_rank_from_world_rank: converts a world rank id to a group rank id.

Gradient split strategy ({install_path_fwkacllib}/fwkacllib/python/site-packages/hccl/hccl/split/api.py):
- set_split_strategy_by_idx: sets the gradient split strategy by gradient index within a group.
- set_split_strategy_by_size: sets the gradient split strategy by gradient data size percentage within a group.

Collective communication operators ({install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/hccl/hccl_ops.py):
- allreduce: reduces a tensor across all ranks of a group.
- allgather: gathers the tensors of all ranks of a group.
- broadcast: broadcasts a tensor from the root rank to all ranks of a group.
- reduce_scatter: reduces tensors across a group and scatters the result.
- send: point-to-point send within a group.
- receive: point-to-point receive within a group.
8.3.2 Rank Management

8.3.2.1 create_group

Definition:

def create_group(group, rank_num, rank_ids)
Description: creates a collective communication group.

Parameters:
- group: String of at most 128 characters; name of the group. It must not be "hccl_world_group", the reserved default group that is created automatically from the ranktable and contains all Devices.
- rank_num: int; number of ranks in the group; at most 4096.
- rank_ids: list of the world rank ids that form the group. Constraints:
  - The number of ranks contributed by each Server must be 1, 2, 4 or 8.
  - When a Server contributes 2 or 4 ranks, they must lie within the same 4-device cluster: taking the rank id modulo 8, ids 0-3 form one cluster and ids 4-7 the other.
  - When a group spans Servers, the world rank ids of the Servers are {0,1,2,3,4,5,6,7}, {8,9,10,11,12,13,14,15}, {16,17,18,19,20,21,22,23}, and so on. Valid examples: rank_ids=[1,9,17] (one rank per Server), rank_ids=[1,2,9,10,17,18], rank_ids=[4,5,6,7,12,13,14,15,20,21,22,23].
  - On servers built from Atlas 300T 9000 cards, rank_ids must likewise use 1, 2, 4 or 8 ranks per Server.
Example:

from hccl.manage.api import create_group
create_group("myGroup", 4, [0, 1, 2, 3])

Constraints: collective communication must be initialized before calling this API (see initialize_system), and it must be called from ranks that belong to the group. A group built from ranktable ranks must contain 1, 2, 4 or 8 ranks per Server.
8.3.2.2 destroy_group

Definition:

def destroy_group(group)

Description: destroys a group created by create_group.

Parameters:
- group: String of at most 128 characters; name of the group to destroy.

Example:

from hccl.manage.api import create_group
from hccl.manage.api import destroy_group
create_group("myGroup", 4, [0, 1, 2, 3])
destroy_group("myGroup")

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from ranks that belong to the group. Only groups previously created with create_group can be destroyed; the default group hccl_world_group cannot be destroyed.
8.3.2.3 get_rank_size

Definition:

def get_rank_size(group="hccl_world_group")

Description: returns the number of ranks (Devices) in the group.

Parameters:
- group: String of at most 128 characters; group name; default "hccl_world_group".

Returns an int: the number of ranks in the group.

Example:

from hccl.manage.api import create_group
from hccl.manage.api import get_rank_size
create_group("myGroup", 4, [0, 1, 2, 3])
rankSize = get_rank_size("myGroup")  # rankSize = 4

Constraints: collective communication must be initialized first (see initialize_system). For a group created with create_group, call this API from a rank inside the group; with "hccl_world_group" it returns the rank count of the world group.
8.3.2.4 get_local_rank_size

Definition:

def get_local_rank_size(group="hccl_world_group")

Description: returns the number of devices (local ranks) that this device's server contributes to the group.

Parameters:
- group: String of at most 128 characters; group name; default "hccl_world_group".

Returns an int: the local rank size.

Example:

from hccl.manage.api import create_group
from hccl.manage.api import get_local_rank_size
create_group("myGroup", 4, [0, 1, 2, 3])
localRankSize = get_local_rank_size("myGroup")  # localRankSize = 1

Constraints: collective communication must be initialized first (see initialize_system). For a group created with create_group, call this API from a rank inside the group; with "hccl_world_group" it returns the local rank size within the world group.
8.3.2.5 get_rank_id

Definition:

def get_rank_id(group="hccl_world_group")

Description: returns the rank id of this device within the group.

Parameters:
- group: String of at most 128 characters; group name; default "hccl_world_group".

Returns an int: the rank id of this device in the group.

Example:

from hccl.manage.api import create_group
from hccl.manage.api import get_rank_id
create_group("myGroup", 4, [0, 1, 2, 3])
rankId = get_rank_id("myGroup")  # rankId = 0/1/2/3

Constraints: collective communication must be initialized first (see initialize_system). For a group created with create_group, call this API from a rank inside the group; with "hccl_world_group" it returns the world-group rank id.
8.3.2.6 get_local_rank_id

Definition:

def get_local_rank_id(group="hccl_world_group")

Description: returns the local rank id of this device within the group, that is, its index among the group's devices on the same server.

Parameters:
- group: String of at most 128 characters; group name; default "hccl_world_group".

Returns an int: the local rank id of this device.

Example:

from hccl.manage.api import create_group
from hccl.manage.api import get_local_rank_id
create_group("myGroup", 4, [0, 1, 2, 3])
localRankId = get_local_rank_id("myGroup")  # rankId = 0

Constraints: collective communication must be initialized first (see initialize_system). For a group created with create_group, call this API from a rank inside the group; with "hccl_world_group" it returns the local rank id within the world group.
8.3.2.7 get_world_rank_from_group_rank

Definition:

def get_world_rank_from_group_rank(group, group_rank_id)

Description: converts a rank id within a group into the corresponding world rank id.

Parameters:
- group: String of at most 128 characters; group name.
- group_rank_id: int; a rank id within the group.

Returns an int: the corresponding rank id in "hccl_world_group".

Example:

from hccl.manage.api import create_group
from hccl.manage.api import get_world_rank_from_group_rank
create_group("myGroup", 4, [0, 1, 2, 3])
worldRankId = get_world_rank_from_group_rank("myGroup", 1)  # worldRankId = 8

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group created with create_group.
8.3.2.8 get_group_rank_from_world_rank

Definition:

def get_group_rank_from_world_rank(world_rank_id, group)

Description: converts a world rank id into the rank id within the specified group.

Parameters:
- world_rank_id: int; a rank id in "hccl_world_group".
- group: String of at most 128 characters; name of the target group.

Returns an int: the rank id within the group.

Example:

from hccl.manage.api import create_group
from hccl.manage.api import get_group_rank_from_world_rank
create_group("myGroup", 4, [0, 1, 2, 3])
groupRankId = get_group_rank_from_world_rank(8, "myGroup")  # groupRankId = 1

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group created with create_group.
8.3.3 Gradient Split Strategy APIs

8.3.3.1 set_split_strategy_by_idx

Definition:

def set_split_strategy_by_idx(idxList, group="hccl_world_group")

Description: sets the allreduce gradient split strategy by gradient index within the group.

Parameters:
- idxList: list of gradient indices marking the end of each segment, in ascending order; the last element must be the index of the last gradient. The default segmentation can be read from the "segment result" field in the host INFO log, for example: segment index list: [0,107] [108,159], where 159 is the number of gradients minus 1. For example, with 160 gradients and desired segments [0,20], [21,100] and [101,159], set idxList=[20,100,159].
- group: String; group name; default "hccl_world_group".

Example:

from hccl.split.api import set_split_strategy_by_idx
set_split_strategy_by_idx([20, 100, 159], "group")

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group. For ResNet50, a two-segment split carrying 96.54% and 3.46% of the gradient data is recommended.
8.3.3.2 set_split_strategy_by_size

Definition:

def set_split_strategy_by_size(dataSizeList, group="hccl_world_group")

Description: sets the allreduce gradient split strategy by percentage of the gradient data volume within the group.

Parameters:
- dataSizeList: list of segment percentages, which must sum to 100. For example, to split a 150M model into segments of 90M, 30M and 30M, set dataSizeList=[60,20,20].
- group: String of at most 128 characters; group name; default "hccl_world_group".

Example:

from hccl.split.api import set_split_strategy_by_size
set_split_strategy_by_size([60, 20, 20], "group")

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group. For ResNet50, a two-segment split carrying 96.54% and 3.46% of the gradient data is recommended.
8.3.4 Collective Communication Operators

8.3.4.1 allreduce

Definition:

def allreduce(tensor, reduction, fusion=1, group = "hccl_world_group")

Description: performs an allreduce, reducing the tensor across all ranks of the group.

Parameters:
- tensor: a TensorFlow tensor; supported data types: int8, int32, float16, float32.
- reduction: String; the reduce op: "max", "min", "prod" or "sum".
- fusion: int; whether the allreduce is fused: 0 not fused, 1 (default) fused.
- group: String of at most 128 characters; group name; default "hccl_world_group".

Returns a tensor of the same shape holding the allreduce result.

Example:

from npu_bridge.hccl import hccl_ops
result = hccl_ops.allreduce(tensor, "sum")

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group. The input may not be a variable.
8.3.4.2 allgather

Definition:

def allgather(tensor, rank_size, group = "hccl_world_group")

Description: gathers the tensors of all ranks of the group into one tensor.

Parameters:
- tensor: a TensorFlow tensor; supported data types: int8, int32, float16, float32.
- rank_size: int; number of devices in the group; at most 4096.
- group: String of at most 128 characters; group name; default "hccl_world_group".

Returns the gathered tensor.

Example:

from npu_bridge.hccl import hccl_ops
rank_size = 2
result = hccl_ops.allgather(tensor, rank_size)

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group.
8.3.4.3 broadcast

Definition:

def broadcast(tensor, root_rank, group = "hccl_world_group")

Description: broadcasts the tensor from the root rank to all ranks of the group.

Parameters:
- tensor: a list of TensorFlow tensors; supported data types: int8, int32, float16, float32.
- root_rank: int; rank id of the root; for a user-created group this is the rank id within the group, not the world rank id.
- group: String of at most 128 characters; group name; default "hccl_world_group".

Returns the broadcast tensor.

Example:

from npu_bridge.hccl import hccl_ops
root = 0
inputs = [tensor]
result = hccl_ops.broadcast(inputs, root)

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group.
8.3.4.4 reduce_scatter

Definition:

def reduce_scatter(tensor, reduction, rank_size, group = "hccl_world_group")

Description: reduces the tensor across the group and scatters the result, each rank receiving one segment.

Parameters:
- tensor: a TensorFlow tensor; supported data types: int8, int32, float16, float32. The tensor size must be divisible by rank_size.
- reduction: String; the reduce op: "max", "min", "prod" or "sum".
- rank_size: int; number of devices in the group; at most 4096.
- group: String of at most 128 characters; group name; default "hccl_world_group".

Returns the reduce-scatter result tensor (1/rank_size of the input per rank); the per-rank segment must satisfy 32-byte alignment.

Example:

from npu_bridge.hccl import hccl_ops
rank_size = 2
result = hccl_ops.reduce_scatter(tensor, "sum", rank_size)

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group.
8.3.4.5 send

Definition:

def send(tensor, sr_tag, dest_rank, group = "hccl_world_group")

Description: point-to-point send within the group; must be paired with a matching receive.

Parameters:
- tensor: the TensorFlow tensor to send; supported data types: int8, int32, float16, float32.
- sr_tag: int; tag identifying a matched send/receive pair.
- dest_rank: int; rank id of the destination within the group.
- group: String of at most 128 characters; group name; default "hccl_world_group".

Example:

from npu_bridge.hccl import hccl_ops
sr_tag = 0
dest_rank = 1
hccl_ops.send(tensor, sr_tag, dest_rank)

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group. The rank id is the rank within the group, not the Device ID on the Server.
8.3.4.6 receive

Definition:

def receive(shape, data_type, sr_tag, src_rank, group = "hccl_world_group")

Description: point-to-point receive within the group; must be paired with a matching send.

Parameters:
- shape: shape of the tensor to receive.
- data_type: data type of the tensor: int8, int32, float16, float32.
- sr_tag: int; tag identifying the matched send/receive pair.
- src_rank: int; rank id of the sender within the group.
- group: String of at most 128 characters; group name; default "hccl_world_group".

Returns the received tensor.

Example:

from npu_bridge.hccl import hccl_ops
sr_tag = 0
src_rank = 0
tensor = hccl_ops.receive(tensor.shape, tensor.dtype, sr_tag, src_rank)

Constraints: collective communication must be initialized first (see initialize_system), and the API must be called from a rank inside the group. The rank id is the rank within the group, not the Device ID on the Server.
9 Migration Examples

This chapter uses ResNet50 and BERT, written against the TensorFlow Python API, to show how to migrate a model to the Ascend AI Processor.

9.1 Training ResNet50 on the imagenet Dataset
9.2 Training BERT on the bookscorpus Dataset with Estimator

9.1 Training ResNet50 on the imagenet Dataset

9.1.1 Overview

The imagenet dataset can be obtained from http://www.imagenet.org/.

ResNet50 is a classic convolutional network for image classification, commonly trained on CIFAR-10 or the 1000-class ImageNet dataset. The model code used here comes from the official TensorFlow models repository, ResNet directory: https://github.com/tensorflow/models/tree/r2.1_model_reference/official
The source layout is as follows:

r1                                    // model source directory
├── resnet                            // ResNet main directory
│   ├── __init__.py
│   ├── imagenet_main.py              // imagenet dataset processing
│   ├── imagenet_preprocessing.py     // imagenet data preprocessing
│   ├── resnet_model.py               // ResNet model definition
│   ├── resnet_run_loop.py            // training/evaluation run loop
│   └── README.md                     // documentation
utils
├── export.py                         // model export helpers
├── logs
│   ├── hooks_helper.py               // training hooks (log basic metrics every N steps on CPU/GPU)
│   └── logger.py                     // logging utilities
├── flags
│   └── core.py                       // flag definitions
└── misc
    ├── distribution_utils.py         // distribution utilities
    └── model_helpers.py              // model helpers
9.1.2 Estimator Training Workflow

The Estimator API is a high-level TensorFlow API, introduced with TensorFlow 1.10 in 2018, that simplifies the machine learning programming workflow. Estimator training follows these steps (Figure 9-1, diagram omitted):

1. Build input_fn for data input.
2. Build model_fn for the model.
3. Instantiate a Runconfig and construct the Estimator.
4. Call Estimator.train() to train the model.
9.1.3 Files to Modify

The migration touches the following files:

r1
└── resnet                            // ResNet main directory
    ├── imagenet_main.py              // imagenet dataset processing
    ├── imagenet_preprocessing.py     // imagenet data preprocessing
    ├── resnet_model.py               // ResNet model definition
    └── resnet_run_loop.py            // training/evaluation run loop
utils
└── flags
    └── _base.py                      // flag definitions
Table 9-2 Modified files

- imagenet_main.py: imagenet data processing: get_filenames(), parse_record(), input_fn(), get_synth_input_fn(), _parse_example_proto(); the ImagenetModel class, imagenet_model_fn(), run_cifar(), define_cifar_flags().
- imagenet_preprocessing.py: imagenet image preprocessing.
- resnet_model.py: the ResNet model: defines the ResNet block structure and assembles the blocks into the network.
- resnet_run_loop.py: the run loop: converts image/label inputs into Estimator-format data and defines the model function and the training/evaluation loop.
9.1.4 Adapting Data Preprocessing

To train on the Ascend 910 AI Processor, the imagenet data preprocessing, mainly input_fn, must be adapted. Table 9-3 lists the functions to modify.

Table 9-3 Functions to modify for data preprocessing

- input_fn(): dataset sharding for multi-device training — "/official/r1/resnet/imagenet_main.py"
- resnet_main(): Estimator training entry, including input_fn_train()/input_fn_eval() — "/official/r1/resnet/resnet_run_loop.py"
1. "official/r1/resnet/imagenet_main.py"
from hccl.manage.api import get_rank_size from hccl.manage.api import get_rank_id
2. id
"official/r1/resnet/imagenet_main.py"input_fn()
def input_fn(is_training, data_dir, batch_size, num_epochs=1, dtype=tf.float32, datasets_num_private_threads=None, parse_record_fn=parse_record, input_context=None, drop_remainder=False, tf_data_experimental_slack=False):
"""batches :
is_training: data_dir: batch_size: batch num_epochs: dtype: / datasets_num_private_threads: tf.data parse_record_fn: tfrecords input_context: 'tf.distribute.Strategy''tf.distribute.InputContext' drop_remainder: batchbatch_size True,batch tf_data_experimental_slack: tf.data'experimental_slack'
Returns:
""" #
filenames = get_filenames(is_training, data_dir) #
dataset = tf.data.Dataset.from_tensor_slices(filenames)
if input_context: # id
############## npu modify begin #############
dataset = dataset.shard(get_rank_size(),get_rank_id())
############## npu modify end ###############
#
# if input_context:
# tf.compat.v1.logging.info(
# 'Sharding the dataset: input_pipeline_id=%d num_input_pipelines=%d' % (
#
input_context.input_pipeline_id, input_context.num_input_pipelines))
# dataset = dataset.shard(input_context.num_input_pipelines,
#
input_context.input_pipeline_id)
if is_training: #
dataset = dataset.shuffle(buffer_size=_NUM_TRAIN_FILES)
# cycle_length = 10 10CPU dataset = dataset.interleave(
tf.data.TFRecordDataset, cycle_length=10, num_parallel_calls=tf.data.experimental.AUTOTUNE)
return resnet_run_loop.process_record_dataset( dataset=dataset, is_training=is_training, batch_size=batch_size, shuffle_buffer=_SHUFFLE_BUFFER, parse_record_fn=parse_record_fn, num_epochs=num_epochs, dtype=dtype, datasets_num_private_threads=datasets_num_private_threads, drop_remainder=drop_remainder, tf_data_experimental_slack=tf_data_experimental_slack,
)
3. Set drop_remainder to True so that every batch has a fixed shape. In "/official/r1/resnet/resnet_run_loop.py", resnet_main(), modify input_fn_train() and input_fn_eval():

def input_fn_train(num_epochs, input_context=None):
    ############## npu modify begin #############
    # dtype=tf.float16: feed the data as float16
    # drop_remainder=True: keep every batch exactly batch_size, so the graph
    # compiled for the Device sees a fixed shape
    return input_function(
        is_training=True,
        data_dir=flags_obj.data_dir,
        batch_size=flags_obj.batch_size,
        num_epochs=num_epochs,
        dtype=tf.float16,
        input_context=input_context,
        drop_remainder=True)

def input_fn_eval():
    # Same adaptation for evaluation: float16 data, fixed batch size
    return input_function(
        is_training=False,
        data_dir=flags_obj.data_dir,
        batch_size=flags_obj.batch_size,
        num_epochs=1,
        dtype=tf.float16,
        input_context=True,
        drop_remainder=True)
############## npu modify end ###############

# Original code:
# def input_fn_train(num_epochs, input_context=None):
#     return input_function(
#         is_training=True,
#         data_dir=flags_obj.data_dir,
#         batch_size=distribution_utils.per_replica_batch_size(
#             flags_obj.batch_size, flags_core.get_num_gpus(flags_obj)),
#         num_epochs=num_epochs,
#         dtype=flags_core.get_tf_dtype(flags_obj),
#         datasets_num_private_threads=flags_obj.datasets_num_private_threads,
#         input_context=input_context)
#
# def input_fn_eval():
#     return input_function(
#         is_training=False,
#         data_dir=flags_obj.data_dir,
#         batch_size=distribution_utils.per_replica_batch_size(
#             flags_obj.batch_size, flags_core.get_num_gpus(flags_obj)),
#         num_epochs=1,
#         dtype=flags_core.get_tf_dtype(flags_obj))
9.1.5
imagenet
Table 9-4 Modified functions of the imagenet model

- imagenet_model_fn() ("/official/r1/resnet/imagenet_main.py"): model function of the imagenet model.
- learning_rate_with_decay() ("/official/r1/resnet/resnet_run_loop.py"): learning-rate schedule with decay.
- resnet_model_fn() ("/official/r1/resnet/resnet_run_loop.py"): builds the EstimatorSpec returned to the Estimator.
- ImagenetModel ("/official/r1/resnet/imagenet_main.py"): subclass of resnet_model.Model that defines the ResNet used for imagenet.
- __call__() ("/official/r1/resnet/resnet_model.py"): forward pass of the network, including the NHWC-to-NCHW transpose on GPU, ResNet batch norm, pooling, and the residual blocks.
1. "/official/r1/resnet/resnet_run_loop.py"
from npu_bridge.hccl import hccl_ops
2. Replace max_pooling2d() with max_pool_with_argmax().

Modify __call__() in "official/r1/resnet/resnet_model.py":
if self.first_pool_size:
  ############## npu modify begin #############
  # Replace max_pooling2d with max_pool_with_argmax; the argmax output is discarded.
  inputs, argmax = tf.compat.v1.nn.max_pool_with_argmax(
      input=inputs,
      ksize=(1, self.first_pool_size, self.first_pool_size, 1),
      strides=(1, self.first_pool_stride, self.first_pool_stride, 1),
      padding='SAME',
      data_format='NCHW' if self.data_format == 'channels_first' else 'NHWC')
  ############## npu modify end ###############
  # Original code using max_pooling2d():
  # inputs = tf.compat.v1.layers.max_pooling2d(
  #     inputs=inputs, pool_size=self.first_pool_size,
  #     strides=self.first_pool_stride, padding='SAME',
  #     data_format=self.data_format)
  inputs = tf.identity(inputs, 'initial_max_pool')
3. Cast the input features to the target dtype instead of asserting the dtype.

Modify resnet_model_fn() in "official/r1/resnet/resnet_run_loop.py":
############## npu modify begin #############
# Cast the features to the target dtype if they do not match.
if features.dtype != dtype:
  features = tf.cast(features, dtype)
############## npu modify end ###############
# Original code:
# assert features.dtype == dtype
4. Cast labels to float32 when computing accuracy, and average the metrics across devices with allreduce.

Modify resnet_model_fn() in "official/r1/resnet/resnet_run_loop.py":
############## npu modify begin #############
# Cast labels to float32 before computing accuracy.
accuracy = tf.compat.v1.metrics.accuracy(
    tf.cast(labels, tf.float32), predictions['classes'])
############## npu modify end ###############
# Original accuracy computation:
# accuracy = tf.compat.v1.metrics.accuracy(labels, predictions['classes'])

accuracy_top_5 = tf.compat.v1.metrics.mean(
    tf.nn.in_top_k(predictions=logits, targets=labels, k=5, name='top_5_op'))

############## npu modify begin #############
# Average the accuracy metrics across all devices with allreduce.
rank_size = int(os.getenv('RANK_SIZE'))
newaccuracy = (hccl_ops.allreduce(accuracy[0], "sum") / rank_size, accuracy[1])
newaccuracy_top_5 = (hccl_ops.allreduce(accuracy_top_5[0], "sum") / rank_size,
                     accuracy_top_5[1])
metrics = {'accuracy': newaccuracy,
           'accuracy_top_5': newaccuracy_top_5}
############## npu modify end #############
# Original metrics:
# metrics = {'accuracy': accuracy,
#            'accuracy_top_5': accuracy_top_5}
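For intuition, the "sum" allreduce followed by the division by RANK_SIZE computes the mean metric value across devices. In plain Python, with illustrative numbers only:

# Illustrative only: what the allreduce/divide above computes.
per_device_accuracy = [0.74, 0.75, 0.76, 0.75, 0.74, 0.76, 0.75, 0.75]
rank_size = len(per_device_accuracy)                 # RANK_SIZE = 8
mean_accuracy = sum(per_device_accuracy) / rank_size  # 0.75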
1. Add the following import to the header of "official/r1/resnet/resnet_run_loop.py":

from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer

2. Wrap the original optimizer with NPUDistributedOptimizer.

Modify resnet_model_fn() in "official/r1/resnet/resnet_run_loop.py":
if flags.FLAGS.enable_lars:
  from tensorflow.contrib import opt as contrib_opt  # pylint: disable=g-import-not-at-top
  optimizer = contrib_opt.LARSOptimizer(
      learning_rate,
      momentum=momentum,
      weight_decay=weight_decay,
      skip_list=['batch_normalization', 'bias'])
else:
  optimizer = tf.compat.v1.train.MomentumOptimizer(
      learning_rate=learning_rate,
      momentum=momentum
  )
############## npu modify begin #############
# Wrap the original optimizer with NPUDistributedOptimizer.
optimizer = NPUDistributedOptimizer(optimizer)
############## npu modify end ###############

fp16_implementation = getattr(flags.FLAGS, 'fp16_implementation', None)
if fp16_implementation == 'graph_rewrite':
  optimizer = (
      tf.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite(
          optimizer, loss_scale=loss_scale))
9.1.6 Estimator Creation

Table 9-5 Modified function

- resnet_main() ("/official/r1/resnet/resnet_run_loop.py")
1. "/official/r1/resnet/resnet_run_loop.py"
from npu_bridge.estimator.npu.npu_config import NPURunConfig from npu_bridge.estimator.npu.npu_estimator import NPUEstimator
2. Replace RunConfig with NPURunConfig.

Modify resnet_main() in "official/r1/resnet/resnet_run_loop.py":
############## npu modify begin #############
# Replace RunConfig with NPURunConfig; save a checkpoint every 115200 steps
# and run 100 iterations per loop on the device.
run_config = NPURunConfig(
    model_dir=flags_obj.model_dir,
    session_config=session_config,
    save_checkpoints_steps=115200,
    enable_data_pre_proc=True,
    iterations_per_loop=100,
    # enable_auto_mix_precision=True,
    # precision_mode='allow_mix_precision',
    hcom_parallel=True
)
############## npu modify end ###############
# Original code:
# run_config = tf.estimator.RunConfig(
#     train_distribute=distribution_strategy,
#     session_config=session_config,
#     save_checkpoints_secs=60 * 60 * 24,
#     save_checkpoints_steps=None)
To enable mixed precision, set precision_mode='allow_mix_precision'.
3. Replace tf.estimator.Estimator with NPUEstimator.
"/official/r1/resnet/resnet_run_loop.py"resnet_main()
############## npu modify begin #############
# Replace tf.estimator.Estimator with NPUEstimator for the Ascend AI Processor.
classifier = NPUEstimator(
    model_fn=model_function, model_dir=flags_obj.model_dir, config=run_config,
    params={
        'resnet_size': int(flags_obj.resnet_size),
        'data_format': flags_obj.data_format,
        'batch_size': flags_obj.batch_size,
        'resnet_version': int(flags_obj.resnet_version),
        'loss_scale': flags_core.get_loss_scale(flags_obj,
                                                default_for_fp16=128),
        'dtype': flags_core.get_tf_dtype(flags_obj),
        'fine_tune': flags_obj.fine_tune,
        'num_workers': num_workers,
        'num_gpus': flags_core.get_num_gpus(flags_obj),
    })
############## npu modify end ###############
# Original Estimator:
# classifier = tf.estimator.Estimator(
#     model_fn=model_function, model_dir=flags_obj.model_dir, config=run_config,
#     warm_start_from=warm_start_settings, params={
#         'resnet_size': int(flags_obj.resnet_size),
#         'data_format': flags_obj.data_format,
#         'batch_size': flags_obj.batch_size,
#         'resnet_version': int(flags_obj.resnet_version),
#         'loss_scale': flags_core.get_loss_scale(flags_obj,
#                                                 default_for_fp16=128),
#         'dtype': flags_core.get_tf_dtype(flags_obj),
#         'fine_tune': flags_obj.fine_tune,
#         'num_workers': num_workers,
#     })
9.1.7 NPU Initialization and Shutdown

Table 9-6 Modified functions

- main() ("/official/r1/resnet/imagenet_main.py")
- run_imagenet() ("/official/r1/resnet/imagenet_main.py")
- resnet_main() ("/official/r1/resnet/resnet_run_loop.py")
1. "official/r1/resnet/resnet_run_loop.py"
from npu_bridge.estimator import npu_ops from tensorflow.core.protobuf import rewriter_config_pb2
2. Initialize the NPU system at the start of training.

Modify main() in "official/r1/resnet/imagenet_main.py":
def main(_):
  ############## npu modify begin #############
  # Initialize the NPU system so that HCCL collective communication is available.
  init_sess, npu_init = resnet_run_loop.init_npu()
  init_sess.run(npu_init)
  ############## npu modify end ###############
  with logger.benchmark_context(flags.FLAGS):
    run_imagenet(flags.FLAGS)
3. Define init_npu() in "official/r1/resnet/resnet_run_loop.py":
def resnet_main(flags_obj, model_function, input_function, dataset_name, shape=None):
  ...

############## npu modify begin #############
# Build the NPU initialization session and op.
def init_npu():
  """Initialize the NPU system.

  Returns:
    `init_sess`: session created with the NPU init session config.
    `npu_init`: NPU init ops.
  """
  npu_init = npu_ops.initialize_system()
  config = tf.ConfigProto()
  config.graph_options.rewrite_options.remapping = rewriter_config_pb2.RewriterConfig.OFF
  custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
  custom_op.name = "NpuOptimizer"
  # custom_op.parameter_map["precision_mode"].b = True
  custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
  custom_op.parameter_map["use_off_line"].b = True
  init_sess = tf.Session(config=config)
  return init_sess, npu_init
############## npu modify end ###############
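For orientation, a minimal sketch of the initialize/shutdown pairing used throughout this section, built from the functions above:

# Sketch: typical NPU session lifecycle using init_npu() and npu_ops.
init_sess, npu_init = init_npu()          # build session config and init op
init_sess.run(npu_init)                   # bring up the NPU/HCCL system
# ... train or evaluate with the Estimator ...
npu_shutdown = npu_ops.shutdown_system()  # build the shutdown op
init_sess.run(npu_shutdown)               # release NPU resources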
4. Re-initialize the NPU system between training and evaluation.

Modify resnet_main() in "official/r1/resnet/resnet_run_loop.py":
for cycle_index, num_train_epochs in enumerate(schedule):
  tf.compat.v1.logging.info('Starting cycle: %d/%d', cycle_index, int(n_loops))

  if num_train_epochs:
    # Since we are calling classifier.train immediately in each loop, the
    # value of num_train_epochs in the lambda function will not be changed
    # before it is used. So it is safe to ignore the pylint error here
    # pylint: disable=cell-var-from-loop
    classifier.train(
        input_fn=lambda input_context=None: input_fn_train(
            num_train_epochs, input_context=input_context),
        hooks=train_hooks, max_steps=flags_obj.max_train_steps)

  ############## npu modify begin #############
  # Shut down and re-initialize the NPU/HCCL system between train and evaluate.
  init_sess, npu_init = init_npu()
  npu_shutdown = npu_ops.shutdown_system()
  init_sess.run(npu_shutdown)
  init_sess.run(npu_init)
  ############## npu modify end ###############
  tf.compat.v1.logging.info('Starting to evaluate.')
  eval_results = classifier.evaluate(input_fn=input_fn_eval,
                                     steps=flags_obj.max_train_steps)
  benchmark_logger.log_evaluation_result(eval_results)

  if model_helpers.past_stop_threshold(
      flags_obj.stop_threshold, eval_results['accuracy']):
    break

  ############## npu modify begin #############
  # Shut down and re-initialize the NPU/HCCL system before the next cycle.
  init_sess, npu_init = init_npu()
  npu_shutdown = npu_ops.shutdown_system()
  init_sess.run(npu_shutdown)
  init_sess.run(npu_init)
  ############## npu modify end ###############
5. Shut down the NPU system with npu_ops.shutdown_system() after training and evaluation complete.

Modify resnet_main() in "official/r1/resnet/resnet_run_loop.py":
if flags_obj.export_dir is not None:
  # Exports a saved model for the given classifier.
  export_dtype = flags_core.get_tf_dtype(flags_obj)
  if flags_obj.image_bytes_as_serving_input:
    input_receiver_fn = functools.partial(
        image_bytes_serving_input_fn, shape, dtype=export_dtype)
  else:
    input_receiver_fn = export.build_tensor_serving_input_receiver_fn(
        shape, batch_size=flags_obj.batch_size, dtype=export_dtype)
  classifier.export_savedmodel(flags_obj.export_dir, input_receiver_fn,
                               strip_default_attrs=True)

############## npu modify begin #############
# Shut down the NPU system with npu_ops.shutdown_system() after all work is done.
npu_shutdown = npu_ops.shutdown_system()
init_sess.run(npu_shutdown)
############## npu modify end ###############

stats = {}
stats['eval_results'] = eval_results
stats['train_hooks'] = train_hooks
return stats
Loss scale setting

Set the default loss scale. Modify define_imagenet_flags() in "official/r1/resnet/imagenet_main.py":
def define_imagenet_flags():
  resnet_run_loop.define_resnet_flags(
      resnet_size_choices=['18', '34', '50', '101', '152', '200'],
      dynamic_loss_scale=True,
      fp16_implementation=True)
  flags.adopt_module_key_flags(resnet_run_loop)
  flags_core.set_defaults(train_epochs=90)

  ############## npu modify begin #############
  # Set the default loss_scale used on the Ascend AI Processor.
  flags_core.set_defaults(loss_scale='512')
  ############## npu modify end ###############
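For intuition, a static loss scale of 512 multiplies the loss before the backward pass and divides the gradients afterwards, which keeps small float16 gradients from flushing to zero. A minimal hand-rolled sketch, not the flags machinery above:

import tensorflow as tf  # TF 1.x style

x = tf.compat.v1.get_variable("x", initializer=tf.constant([1.0, 2.0]))
loss = tf.reduce_sum(x * x)              # stand-in for the training loss

loss_scale = 512.0
scaled_loss = loss * loss_scale          # amplify before the backward pass
grads = tf.gradients(scaled_loss, [x])
grads = [g / loss_scale for g in grads]  # undo the scaling on the gradients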
9.1.8 Running the Training

Store the imagenet dataset under /home/data/resnet50/imagenet and configure the ranktable file for the devices to be used (see the ranktable configuration description). Then start training:

python3 /home/official/r1/resnet/imagenet_main.py --batch_size=32 --hooks=ExamplesPerSecondHook --data_dir=/home/data/resnet50/imagenet
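If the HCCL environment has not been exported in the shell yet, the equivalent settings can also be applied from Python before the NPU is initialized. A hedged sketch; the variable names mirror the BERT example later in this chapter, and the values are placeholders:

import os

# Placeholder values; use the real rank table path and device IDs.
os.environ["RANK_TABLE_FILE"] = "configs/1p.json"
os.environ["RANK_SIZE"] = "1"
os.environ["RANK_ID"] = "0"
os.environ["DEVICE_ID"] = "0"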
9.2 Training BERT on the bookscorpus Dataset with the Estimator API
9.2.1 Data Preparation

Convert the bookscorpus dataset to tfrecord format and store it in "/home/data/bert/cn-clue-256/". This section uses BERT pre-training on bookscorpus as an example.

Download the BERT reference implementation from https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT.
The BERT directory is organized as follows:

BERT                        # BERT model directory
    __init__.py
    extract_features.py     // feature extraction
    fp16_utils.py           // fp16 utils
    fused_layer_norm.py     // fused layer norm
    gpu_environment.py      // gpu_environment
    modeling.py             // BERT model
    optimization.py         // optimizer
    run_pretraining.py      // pre-training entry script
    tf_metrics.py           // tf metrics
    tokenization.py         // tokenization
    scripts/                # scripts
        data_download.sh            // downloads data to data/
        run_pretraining_adam.sh     // runs run_pretraining.py with the Adam optimizer
        run_pretraining_lamb.sh     // runs run_pretraining.py with the LAMB optimizer
    data/                   # BERT data directory
    utils/
        utils.py            // utils
9.2.2 Training with the Estimator API

The Estimator API is a high-level TensorFlow API; Estimators became part of core TensorFlow around release 1.10 (2018) and hide much of the session and training-loop plumbing.

Training with an Estimator generally follows the workflow shown in Figure 9-7:

1. Define the input function input_fn.
2. Define the model function model_fn.
3. Create a RunConfig and instantiate the Estimator.
4. Start training by calling Estimator.train().
9.2.3 Migration Overview

To run the model on the Ascend 910 AI Processor, the following files of the BERT project are modified (see Table 9-8):

BERT
    gpu_environment.py      // gpu_environment
    modeling.py             // BERT model
    optimization.py         // optimizer
    run_pretraining.py      // pre-training entry script
    scripts/                # scripts
    utils/
        utils.py            // utils
Table 9-8 Modified .py files

- gpu_environment.py: utilities for casting tensors to tf.float16.
- modeling.py: BERT model definition.
- optimization.py: AdamWeightDecayOptimizer, LAMBOptimizer, and create_optimizer().
- run_pretraining.py: bookscorpus input pipeline (input_fn_builder(), _decode_record()); Estimator model function (model_fn_builder(), get_masked_lm_output(), get_next_sentence_output(), gather_indexes()); and main().
9.2.4 Data Preprocessing Modification

This section describes the input_fn changes required to pre-train on bookscorpus with the Ascend 910 AI Processor. The modified APIs are listed in Table 9-9.

Table 9-9 Modified APIs

- input_fn_builder() ("BERT/run_pretraining.py"): builds the input function used by the Estimator.
- _decode_record() ("BERT/run_pretraining.py"): casts tf.int64 tensors to tf.int32 tensors, which the Ascend 910 AI Processor handles more efficiently.
1. Shard the dataset by device rank ID.
"BERT/run_pretraining.py"input_fn_builder()input_fn()
if is_training:
  d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
  ############## npu modify begin #############
  # Shard the dataset by device rank ID.
  if FLAGS.distributed:
    rank_size = int(os.getenv('RANK_SIZE'))
    rank_id = int(os.getenv('RANK_INDEX'))
    device_id = int(os.getenv('DEVICE_ID'))
    local_rank = rank_id * 8 + device_id
    print('RANK_SIZE=', rank_size, ' RANK_ID=', local_rank)
    d = d.shard(rank_size, local_rank)
  ############## npu modify end #############
  # Original code:
  # if hvd is not None:
  #   d = d.shard(hvd.size(), hvd.rank())
2. "npu_bert_debug"
"BERT/run_pretraining.py"input_fn_builder()input_fn()
d = d.repeat()
############## npu modify begin #############
# Skip shuffling in "npu_bert_debug" mode to make runs reproducible.
if not FLAGS.npu_bert_debug:
  d = d.shuffle(buffer_size=len(input_files))
############## npu modify end #############
# Original code:
# d = d.shuffle(buffer_size=len(input_files))
3. "npu_bert_debug"cycle_length
"BERT/run_pretraining.py"input_fn_builder()input_fn()
############## npu modify begin #############
# `cycle_length` is the number of parallel files that get read.
# In "npu_bert_debug" mode, read files sequentially (cycle_length = 1).
if not FLAGS.npu_bert_debug:
  cycle_length = min(num_cpu_threads,
                     int(len(input_files) / int(os.getenv('RANK_SIZE'))))
else:
  cycle_length = 1
############## npu modify end #############
# Original code:
# cycle_length = min(num_cpu_threads, len(input_files))

############## npu modify begin #############
d = d.interleave(
    tf.data.TFRecordDataset,
    cycle_length=cycle_length,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
# Skip shuffling in "npu_bert_debug" mode.
if not FLAGS.npu_bert_debug:
  d = d.shuffle(buffer_size=100)
############## npu modify end #############
# Original code:
# # `sloppy` mode means that the interleaving is not exact. This adds
# # even more randomness to the training pipeline.
# d = d.apply(
#     tf.contrib.data.parallel_interleave(
#         tf.data.TFRecordDataset,
#         sloppy=is_training,
#         cycle_length=cycle_length))
# d = d.shuffle(buffer_size=100)
4. Set drop_remainder to True; the Ascend 910 AI Processor requires every batch to have exactly batch_size samples.

Modify input_fn() in input_fn_builder() in "BERT/run_pretraining.py":
############## npu modify begin #############
# Set drop_remainder to True so every batch has exactly batch_size samples.
d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True))
############## npu modify end ###############
# Original code:
# d = d.apply(
#     tf.contrib.data.map_and_batch(
#         lambda record: _decode_record(record, name_to_features),
#         batch_size=batch_size,
#         num_parallel_batches=num_cpu_threads,
#         drop_remainder=True if is_training else False))
9.2.5 Model Building Modification

This section describes the model-building changes required to pre-train BERT on bookscorpus with the Ascend 910 AI Processor. The modified functions are listed in Table 9-10.

Table 9-10 Modified functions

- model_fn_builder() ("BERT/run_pretraining.py"): builds the model_fn used by the Estimator for BERT pre-training on bookscorpus.
- get_masked_lm_output() ("BERT/run_pretraining.py"): computes the masked LM loss.
- get_next_sentence_output() ("BERT/run_pretraining.py"): computes the next-sentence prediction loss.
- BertConfig() ("BERT/modeling.py"): configuration consumed by BertModel.
- BertModel() ("BERT/modeling.py"): the bert model itself.
- embedding_lookup() ("BERT/modeling.py"): looks up embeddings by word ID and returns a tensor of embedding_size.
- embedding_postprocessor() ("BERT/modeling.py"): post-processes the embedding tensor.
- gather_npu() ("BERT/modeling.py"): replacement for tf.gather() adapted to the Ascend 910 AI Processor.
- gelu() ("BERT/modeling.py"): gelu activation adapted to the Ascend 910 AI Processor.
- dropout() ("BERT/modeling.py"): replacement for tf.nn.dropout() adapted to the Ascend 910 AI Processor.
- LogEvalRunHook() ("BERT/utils/utils.py"): evaluation run hook.
1. "BERT/run_pretraining.py"
import utils.dllogger_class from dllogger import Verbosity
2. "BERT/run_pretraining.py"
from gpu_environment import get_custom_getter
3. optimization.create_optimizer()FLAGS.use_fp16 FLAGS.ampFLAGS.init_loss_scale910 AI
"BERT/run_pretraining.py"model_fn_builder() model_fn()
############## npu modify begin #############
# Replace FLAGS.amp with FLAGS.use_fp16 and drop FLAGS.init_loss_scale.
if mode == tf.estimator.ModeKeys.TRAIN:
  train_op = optimization.create_optimizer(
      total_loss, learning_rate, num_train_steps, num_warmup_steps,
      hvd, FLAGS.manual_fp16, FLAGS.use_fp16, FLAGS.num_accumulation_steps,
      FLAGS.optimizer_type, FLAGS.allreduce_post_accumulation)
############## npu modify end ###############
# Original code:
# if mode == tf.estimator.ModeKeys.TRAIN:
#   train_op = optimization.create_optimizer(
#       total_loss, learning_rate, num_train_steps, num_warmup_steps,
#       hvd, FLAGS.manual_fp16, FLAGS.amp, FLAGS.num_accumulation_steps,
#       FLAGS.optimizer_type,
#       FLAGS.allreduce_post_accumulation, FLAGS.init_loss_scale)
4. "use_fp16_cls"input_tensortf.float16 input_tensor tf.float32
"BERT/run_pretraining.py"get_masked_lm_output()
def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
  """Get loss and log probs for the masked LM."""
  input_tensor = gather_indexes(input_tensor, positions)

  with tf.variable_scope("cls/predictions"):
    ############## npu modify begin #############
    # When use_fp16_cls is set, cast input_tensor to tf.float16 so that
    # tf.layers.dense() computes in float16; otherwise keep tf.float32.
    with tf.variable_scope("transform", custom_getter=get_custom_getter(
        compute_type=tf.float16 if FLAGS.use_fp16_cls else tf.float32)):
      if FLAGS.use_fp16_cls:
        input_tensor = tf.cast(input_tensor, tf.float16)
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = tf.cast(input_tensor, tf.float32)
      input_tensor = modeling.layer_norm(input_tensor)
    ############## npu modify end #############
    # Original code:
    # with tf.variable_scope("transform"):
    #   input_tensor = tf.layers.dense(
    #       input_tensor,
    #       units=bert_config.hidden_size,
    #       activation=modeling.get_activation(bert_config.hidden_act),
    #       kernel_initializer=modeling.create_initializer(
    #           bert_config.initializer_range))
    #   input_tensor = modeling.layer_norm(input_tensor)

    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())

    ############## npu modify begin #############
    # When use_fp16_cls is set, compute the matmul in tf.float16 and cast the
    # logits back to tf.float32; otherwise compute in tf.float32.
    if FLAGS.use_fp16_cls:
      input_tensor = tf.cast(input_tensor, tf.float16)
      logits = tf.matmul(input_tensor, tf.cast(output_weights, tf.float16),
                         transpose_b=True)
      logits = tf.cast(logits, tf.float32)
    else:
      logits = tf.matmul(tf.cast(input_tensor, tf.float32), output_weights,
                         transpose_b=True)
    ############## npu modify end ###############
    # Original code:
    # logits = tf.matmul(tf.cast(input_tensor, tf.float32), output_weights,
    #                    transpose_b=True)
5. "use_fp16_cls"input_tensortf.float16 input_tensor tf.float32
"BERT/run_pretraining.py"get_next_sentence_output()
def get_next_sentence_output(bert_config, input_tensor, labels):
  with tf.variable_scope("cls/seq_relationship"):
    output_weights = tf.get_variable(
        "output_weights",
        shape=[2, bert_config.hidden_size],
        initializer=modeling.create_initializer(bert_config.initializer_range))
    output_bias = tf.get_variable(
        "output_bias", shape=[2], initializer=tf.zeros_initializer())

    ############## npu modify begin #############
    # When use_fp16_cls is set, compute the matmul in tf.float16 and cast the
    # logits back to tf.float32; otherwise compute in tf.float32.
    if FLAGS.use_fp16_cls:
      input_tensor = tf.cast(input_tensor, tf.float16)
      logits = tf.matmul(input_tensor, tf.cast(output_weights, tf.float16),
                         transpose_b=True)
      logits = tf.cast(logits, tf.float32)
    else:
      logits = tf.matmul(tf.cast(input_tensor, tf.float32), output_weights,
                         transpose_b=True)
    ############## npu modify end #############
    # Original code:
    # logits = tf.matmul(tf.cast(input_tensor, tf.float32), output_weights,
    #                    transpose_b=True)
6. "use_fp16_cls"first_token_tensor tf.float16self.pooled_output tf.float32
"BERT/modeling.py"BertModel__init__()
############## npu modify begin #############
# When use_fp16_cls is set, cast first_token_tensor to tf.float16 and cast
# self.pooled_output back to tf.float32.
with tf.variable_scope("pooler"):
  first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
  if tf.flags.FLAGS.use_fp16_cls:
    first_token_tensor = tf.cast(first_token_tensor, tf.float16)
  self.pooled_output = tf.layers.dense(
      first_token_tensor,
      config.hidden_size,
      activation=tf.tanh,
      kernel_initializer=create_initializer(config.initializer_range))
  self.pooled_output = tf.cast(self.pooled_output, tf.float32)
############## npu modify end #############
# Original code:
# with tf.variable_scope("pooler"):
#   # We "pool" the model by simply taking the hidden state corresponding
#   # to the first token. We assume that this has been pre-trained
#   first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
#   self.pooled_output = tf.layers.dense(
#       first_token_tensor,
#       config.hidden_size,
#       activation=tf.tanh,
#       kernel_initializer=create_initializer(config.initializer_range))
7. Define gather_npu() as a replacement for tf.gather() adapted to the Ascend 910 AI Processor; its backward pass returns a dense gradient instead of IndexedSlices.

Modify embedding_lookup() and embedding_postprocessor() in "BERT/modeling.py". First, define gather_npu() in modeling.py:
@tf.custom_gradient
def gather_npu(params, indices):
  def grad(dy):
    params_shape = tf.shape(params, out_type=tf.int64)
    params_shape = tf.cast(params_shape, tf.int32)
    grad_gather = tf.unsorted_segment_sum(dy, indices, params_shape[0])
    return grad_gather, None
  return tf.gather(params, indices), grad
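For intuition, the custom gradient turns the sparse backward pass of a gather into a dense tensor: tf.unsorted_segment_sum scatters the rows of dy back into a zero tensor shaped like the embedding table. A minimal sketch with illustrative values (not from the original):

import tensorflow as tf  # TF 1.x graph mode, matching the surrounding code

params = tf.constant([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
indices = tf.constant([2, 0, 2])
out = gather_npu(params, indices)      # rows 2, 0, 2 of params
grads = tf.gradients(out, params)[0]   # dense gradient of shape (3, 2)
# With the default dy of all ones, the gradient is [[1,1],[0,0],[2,2]]:
# row 0 was gathered once, row 1 never, row 2 twice. Plain tf.gather would
# instead return an IndexedSlices object here.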
Then modify embedding_lookup():

if use_one_hot_embeddings:
  one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
  output = tf.matmul(one_hot_input_ids, embedding_table)
else:
  ############## npu modify begin #############
  # When the npu_gather flag is set, replace tf.gather() with gather_npu().
  if tf.flags.FLAGS.npu_gather:
    output = gather_npu(embedding_table, flat_input_ids)
  else:
    output = tf.gather(embedding_table, flat_input_ids)
  ############## npu modify end #############
  # Original code:
  # output = tf.gather(embedding_table, flat_input_ids)
Modify embedding_postprocessor():

if use_one_hot_embeddings:
  one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
  output = tf.matmul(one_hot_input_ids, embedding_table)
else:
  ############## npu modify begin #############
  # When the npu_gather flag is set, replace tf.gather() with gather_npu().
  if tf.flags.FLAGS.npu_gather:
    token_type_embeddings = gather_npu(token_type_table, flat_token_type_ids)
  else:
    token_type_embeddings = tf.gather(token_type_table, flat_token_type_ids)
  ############## npu modify end #############
  # Original code:
  # token_type_embeddings = tf.gather(token_type_table, flat_token_type_ids)
Operator replacement for the Ascend 910 AI Processor

1. Add the following imports to the header of "BERT/modeling.py":

from npu_bridge.estimator.npu_unary_ops import npu_unary_ops
from npu_bridge.estimator import npu_ops
2. Replace the gelu computation with npu_unary_ops.gelu(), the fused gelu operator optimized for the Ascend 910 AI Processor.

Modify gelu() in "BERT/modeling.py":
def gelu(x):
  """Gaussian error linear unit (gelu) activation.

  Args:
    x: input tensor.
  Returns:
    tensor with gelu applied.
  """
  ############## npu modify begin #############
  # When npu_bert_fused_gelu is set, use npu_unary_ops.gelu(), the fused
  # operator for the Ascend 910 AI Processor.
  if tf.flags.FLAGS.npu_bert_fused_gelu:
    return npu_unary_ops.gelu(x)
  else:
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf
  ############## npu modify end #############
  # Original code:
  # cdf = 0.5 * (1.0 + tf.tanh(
  #     (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
  # return x * cdf
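For reference, the fallback branch implements the usual tanh approximation of gelu. A minimal numeric check in plain NumPy (illustrative, independent of the fused operator):

import numpy as np

def gelu_tanh(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

print(gelu_tanh(np.array([-1.0, 0.0, 1.0])))  # approx [-0.1588, 0.0, 0.8412]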
3. Replace tf.nn.dropout() with npu_ops.dropout(), the dropout operator optimized for the Ascend 910 AI Processor.

Modify dropout() in "BERT/modeling.py":
def dropout(input_tensor, dropout_prob):
  """Perform dropout.

  Args:
    input_tensor: input tensor.
    dropout_prob: float, the probability of dropping an element.

  Returns:
    tensor with dropout applied.
  """
  ############## npu modify begin #############
  # When npu_bert_debug is True, disable dropout entirely.
  if tf.flags.FLAGS.npu_bert_debug:
    return input_tensor

  # If dropout_prob is None or 0.0, dropout is a no-op.
  if dropout_prob is None or dropout_prob == 0.0:
    return input_tensor

  # When npu_bert_npu_dropout is set, use npu_ops.dropout() for the Ascend 910
  # AI Processor. Both variants take keep_prob = 1 - dropout_prob.
  if tf.flags.FLAGS.npu_bert_npu_dropout:
    output = npu_ops.dropout(input_tensor, 1.0 - dropout_prob)
  else:
    output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
  return output
  ############## npu modify end #############
  # Original code:
  # if dropout_prob is None or dropout_prob == 0.0:
  #   return input_tensor
  # output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
  # return output
Optimizer modification

1. Delete the following Horovod import from the header of "BERT/optimization.py" (Horovod compression is not used on the Ascend AI Processor):

from horovod.tensorflow.compression import Compression

2. Add the following imports to the header of "BERT/optimization.py":

from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer
from npu_bridge.estimator.npu import npu_loss_scale_optimizer as lso
from npu_bridge.estimator.npu import npu_loss_scale_manager as lsm_lib
3. Modify create_optimizer() in "BERT/optimization.py".

Remove the init_loss_scale parameter from the create_optimizer() signature:
############## npu modify begin #############
# init_loss_scale is removed from the signature.
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
                     manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
                     optimizer_type="adam", allreduce_post_accumulation=False):
############## npu modify end #############
# Original code:
# def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
#                      manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
#                      optimizer_type="adam", allreduce_post_accumulation=False,
#                      init_loss_scale=2 ** 32):
Use init_lr directly as the learning rate:

############## npu modify begin #############
# Use init_lr directly.
learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
############## npu modify end #############
# Original code:
# learning_rate = tf.constant(value=adjusted_init_lr, shape=[], dtype=tf.float32)
Change the epsilon of AdamWeightDecayOptimizer from 1e-6 to 1e-4:

############## npu modify begin #############
optimizer = AdamWeightDecayOptimizer(
    learning_rate=learning_rate,
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-4,
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
############## npu modify end #############
# Original code:
# optimizer = AdamWeightDecayOptimizer(
#     learning_rate=learning_rate,
#     weight_decay_rate=0.01,
#     beta_1=0.9,
#     beta_2=0.999,
#     epsilon=1e-6,
#     exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
Wrap the optimizer with NPUDistributedOptimizer and, when loss scaling is enabled, with NPULossScaleOptimizer:

############## npu modify begin #############
# NPUDistributedOptimizer performs gradient allreduce across NPUs.
optimizer = NPUDistributedOptimizer(optimizer)

# Wrap with NPULossScaleOptimizer when the loss-scale flag is enabled.
if tf.flags.FLAGS.npu_bert_loss_scale not in [None, -1]:
  opt_tmp = optimizer
  if tf.flags.FLAGS.npu_bert_loss_scale == 0:
    # Dynamic loss scaling.
    loss_scale_manager = lsm_lib.ExponentialUpdateLossScaleManager(
        init_loss_scale=tf.flags.FLAGS.init_loss_scale_value,
        incr_every_n_steps=1000,
        decr_every_n_nan_or_inf=2,
        decr_ratio=0.5)
  elif tf.flags.FLAGS.npu_bert_loss_scale >= 1:
    # Static loss scaling.
    loss_scale_manager = lsm_lib.FixedLossScaleManager(
        loss_scale=tf.flags.FLAGS.npu_bert_loss_scale)
  else:
    raise ValueError("Invalid loss scale: %d" % tf.flags.FLAGS.npu_bert_loss_scale)

  # NPULossScaleOptimizer applies loss scaling on the NPU. Loss scaling keeps
  # small float16 gradient values from underflowing to zero.
  optimizer = lso.NPULossScaleOptimizer(opt_tmp, loss_scale_manager,
                                        is_distributed=tf.flags.FLAGS.distributed)
############## npu modify end #############
# Original code:
# if use_fp16:
#   loss_scaler = tf.train.experimental.DynamicLossScale(
#       initial_loss_scale=init_loss_scale, increment_period=1000, multiplier=2.0)
#   optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer,
#       loss_scaler)
#   loss_scale_value = tf.identity(loss_scaler(), name="loss_scale")
# if manual_fp16:
#   loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
#       init_loss_scale=init_loss_scale,
#       incr_every_n_steps=1000,
#       decr_every_n_nan_or_inf=2,
#       decr_ratio=0.5)
#   optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(optimizer, loss_scale_manager)
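To make the dynamic policy concrete, a plain-Python sketch of how an exponential-update loss-scale manager with the parameters above behaves (illustrative only; the library's exact bookkeeping may differ in detail):

# Start at init_loss_scale, double after 1000 consecutive finite steps,
# halve (decr_ratio = 0.5) after 2 consecutive steps with NaN/Inf gradients.
scale, good_steps, bad_steps = 2.0 ** 32, 0, 0

def update(grads_finite):
    global scale, good_steps, bad_steps
    if grads_finite:
        good_steps, bad_steps = good_steps + 1, 0
        if good_steps >= 1000:      # incr_every_n_steps
            scale, good_steps = scale * 2.0, 0
    else:
        bad_steps, good_steps = bad_steps + 1, 0
        if bad_steps >= 2:          # decr_every_n_nan_or_inf
            scale, bad_steps = scale * 0.5, 0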
Gate the finite-gradient check (tf.reduce_all) on the loss-scale flag:

grads_and_vars_and_accums = [(gv[0], gv[1], accum_vars[i])
                             for i, gv in enumerate(grads_and_vars)
                             if gv[0] is not None]
grads, tvars, accum_vars = list(zip(*grads_and_vars_and_accums))

############## npu modify begin #############
# Only check gradient finiteness when loss scaling is enabled.
all_are_finite = tf.reduce_all(
    [tf.reduce_all(tf.is_finite(g)) for g in grads]) if (
        tf.flags.FLAGS.npu_bert_loss_scale not in [None, -1]) and (
        manual_fp16 or use_fp16) else tf.constant(True, dtype=tf.bool)
############## npu modify end #############
# Original code:
# all_are_finite = tf.reduce_all(
#     [tf.reduce_all(tf.is_finite(g)) for g in grads]) if manual_fp16 or use_fp16 \
#     else tf.constant(True, dtype=tf.bool)
"npu_bert_clip_by_global_norm"
create_optimizer()
############## npu modify begin ############# # npu_bert_clip_by_global_norm if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
(clipped_grads, _) = tf.clip_by_global_norm( grads, clip_norm=1.0, use_norm=tf.cond( all_are_finite, lambda: tf.global_norm(grads), lambda: tf.constant(1.0)))
else: with tf.name_scope("clip_grads"): clipped_grads = [ (tf.clip_by_norm(grad, clip_norm=1.0)) if grad is not None else grad for grad in grads ]
############## npu modify end #############
#
# (clipped_grads, _) = tf.clip_by_global_norm(
# grads, clip_norm=1.0,
# use_norm=tf.cond(
#
all_are_finite,
#
lambda: tf.global_norm(grads),
#
lambda: tf.constant(1.0)))
Compute new_global_step:

############## npu modify begin #############
new_global_step = tf.cond(
    tf.math.logical_and(
        update_step,
        tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)), tf.bool)),
    lambda: global_step + 1,
    lambda: global_step)
############## npu modify end #############
# Original code:
# new_global_step = tf.cond(
#     tf.math.logical_and(
#         update_step,
#         tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)),
#                 tf.bool) if hvd is not None else batch_finite),
#     lambda: global_step + 1,
#     lambda: global_step)
"npu_bert_clip_by_global_norm"
grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None] grads, tvars = list(zip(*grads_and_vars))
############## npu modify begin ############# # "npu_bert_clip_by_global_norm" if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
all_are_finite = tf.reduce_all( [tf.reduce_all(tf.is_finite(g)) for g in grads]) if (tf.flags.FLAGS.npu_bert_loss_scale
not in [None, -1]) and (
use_fp16 or manual_fp16) else tf.constant(
True, dtype=tf.bool) ############## npu modify end #############
#
# all_are_finite = tf.reduce_all(
# [tf.reduce_all(tf.is_finite(g)) for g in grads]) if use_fp16 or manual_fp16 else
tf.constant(True,
#
dtype=tf.bool)
"global_step"
############## npu modify begin ############# # new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step) new_global_step = tf.identity(new_global_step, name='step_update') train_op = tf.group(train_op, [global_step.assign(new_global_step)]) ############## npu modify end ##############
return train_op
4. Modify the AdamWeightDecayOptimizer class in "BERT/optimization.py".

Modify __init__():
############## npu modify begin #############
# Change the default epsilon of __init__() to 1e-4.
def __init__(self,
             learning_rate,
             weight_decay_rate=0.0,
             beta_1=0.9,
             beta_2=0.999,
             epsilon=1e-4,
             exclude_from_weight_decay=None,
             name="AdamWeightDecayOptimizer"):
############## npu modify end #############
# Original code:
# def __init__(self,
#              learning_rate,
#              weight_decay_rate=0.0,
#              beta_1=0.9,
#              beta_2=0.999,
#              epsilon=1e-6,
#              exclude_from_weight_decay=None,
#              name="AdamWeightDecayOptimizer"):
"BERT/optimization.py"AdamWeightDecayOptimizer apply_gradients()
############## npu modify begin ############# # #
new_global_step = global_step + 1
new_global_step = tf.identity(new_global_step, name='step_update')
assignments.extend([global_step.assign(new_global_step)])
############## npu modify end #############
return tf.group(*assignments, name=name)
5. Modify the LAMBOptimizer class.
"BERT/optimization.py"LAMBOOptimizer__init__()
self.epsilon = epsilon
self.exclude_from_weight_decay = exclude_from_weight_decay
############## npu modify begin ############# # LAMBOOptimizer self.steps = 0 ############## npu modify end #############
"BERT/optimization.py"LAMBOptimizer apply_gradients()
"global_step"None ############## npu modify begin ############# # global_stepNone def apply_gradients(self, grads_and_vars, global_step=None, name=None, manual_fp16=False): ############## npu modify end #############
#
# def apply_gradients(self, grads_and_vars, global_step, name=None,
#
manual_fp16=False):
"global_step"float32
############## npu modify begin ############# # steps = tf.cast(global_step, tf.float32) ############## npu modify end #############
############## npu modify begin ############# # #
self.steps += 1 #
beta1_correction = (1 - self.beta_1 ** self.steps)
beta2_correction = (1 - self.beta_2 ** self.steps)
############## npu modify end #############
# # beta1_correction = (1 - self.beta_1 ** steps) # beta2_correction = (1 - self.beta_2 ** steps)
new_global_stepglobal_step ############## npu modify begin ############# # # new_global_step = global_step + 1 new_global_step = tf.identity(new_global_step, name='step_update') assignments.extend([global_step.assign(new_global_step)]) ############## npu modify end #############
return tf.group(*assignments, name=name)
"BERT/utils/utils.py"LogEvalRunHook
############## npu modify begin ############# #
class LogEvalRunHook(tf.estimator.SessionRunHook):
  def __init__(self, global_batch_size, hvd_rank=-1):
    self.global_batch_size = global_batch_size
    self.hvd_rank = hvd_rank
    self.total_time = 0.0
    self.count = 0
    self.skipped = 0
    self.time_list = []

  def before_run(self, run_context):
    self.t0 = time.time()

  def after_run(self, run_context, run_values):
    elapsed_secs = time.time() - self.t0
    self.count += 1
    # Removing first 2 (arbitrary) number of startup iterations from perf evaluations
    if self.count <= 2:
      print("Skipping time record for ", self.count, " due to overhead")
      self.skipped += 1
    else:
      self.time_list.append(elapsed_secs)
      self.total_time += elapsed_secs
############## npu modify end #############
# Original code:
# class LogEvalRunHook(tf.estimator.SessionRunHook):
#   def __init__(self, global_batch_size, hvd_rank=-1):
#     self.global_batch_size = global_batch_size
#     self.hvd_rank = hvd_rank
#     self.count = 0
#     self.time_list = []
#
#   def before_run(self, run_context):
#     self.t0 = time.time()
#
#   def after_run(self, run_context, run_values):
#     elapsed_secs = time.time() - self.t0
#     self.count += 1
#     self.time_list.append(elapsed_secs)
9.2.6 Estimator Creation Modification

The Estimator creation in main() of "BERT/run_pretraining.py" is modified as listed in Table 9-11.

Table 9-11 Modified function

- main() ("BERT/run_pretraining.py")
1. "BERT/run_pretraining.py"
from npu_bridge.estimator.npu.npu_config import * from npu_bridge.estimator.npu.npu_estimator import *
Alternatively, import the classes explicitly:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator
2. Replace RunConfig with NPURunConfig.

Modify main() in "BERT/run_pretraining.py":
############## npu modify begin #############
run_config = NPURunConfig(
    model_dir=FLAGS.output_dir,
    save_summary_steps=0,
    session_config=config,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
    # This variable controls how often estimator reports examples/sec.
    # Default value is every 100 steps.
    # When --report_loss is True, we set to very large value to prevent
    # default info reporting from estimator.
    # Ideally we should set it to None, but that does not work.
    log_step_count_steps=1 if FLAGS.report_loss else 100,
    enable_data_pre_proc=FLAGS.npu_bert_use_tdt,
    iterations_per_loop=FLAGS.iterations_per_loop,
    hcom_parallel=FLAGS.hcom_parallel)
############## npu modify end #############
# Original code:
# run_config = tf.estimator.RunConfig(
#     model_dir=FLAGS.output_dir,
#     session_config=config,
#     save_checkpoints_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
#     save_summary_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
#     # This variable controls how often estimator reports examples/sec.
#     # Default value is every 100 steps.
#     # When --report_loss is True, we set to very large value to prevent
#     # default info reporting from estimator.
#     # Ideally we should set it to None, but that does not work.
#     log_step_count_steps=10000 if FLAGS.report_loss else 100)
Note: NPURunConfig inherits from the native TensorFlow RunConfig, so the parameters of the original RunConfig can be passed to NPURunConfig unchanged.
3. Replace tf.estimator.Estimator with NPUEstimator.

Modify main() in "BERT/run_pretraining.py":
############## npu modify begin #############
estimator = NPUEstimator(
    model_fn=model_fn,
    config=run_config,
    job_start_file=FLAGS.npu_bert_job_start_file)
############## npu modify end #############
# Original Estimator:
# estimator = tf.estimator.Estimator(
#     model_fn=model_fn,
#     config=run_config)
Configuration files

1. Create a "configs" directory in the BERT directory.
2. Create the file 1p.json in "configs".
3. Write the single-NPU rank table into 1p.json:
{ "board_id": "0x002f", "chip_info": "910", "deploy_mode": "lab", "group_count": "1", "group_list": [ { "device_num": "1", "server_num": "1", "group_name": "", "instance_count": "1", "instance_list": [ { "devices": [ { "device_id": "0", "device_ip": "192.168.100.101" } ], "rank_id": "0", "server_id": "172.17.1.120" } ] } ], "para_plane_nic_location": "device", "para_plane_nic_name": [ "eth0" ], "para_plane_nic_num": "1", "status": "completed"
}
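A quick way to sanity-check the rank table before launching is to load it and print the device assignment (an illustrative sketch; the path is an example):

import json

with open("configs/1p.json") as f:
    rank_table = json.load(f)

instance = rank_table["group_list"][0]["instance_list"][0]
print("rank_id:", instance["rank_id"])
print("device_id:", instance["devices"][0]["device_id"])
print("device_ip:", instance["devices"][0]["device_ip"])
print("status:", rank_table["status"])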
4. Create the file bert_base_config.json in "configs".
5. Write the bert model configuration into bert_base_config.json:

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
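The training script consumes this file through modeling.BertConfig (see main() below). A minimal sketch of the call, assuming the file lives under configs/:

import modeling  # from the BERT repository

bert_config = modeling.BertConfig.from_json_file("configs/bert_base_config.json")
print(bert_config.hidden_size)        # 768
print(bert_config.num_hidden_layers)  # 12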
The parameters are described in Table 9-12.

Table 9-12 bert model configuration parameters

- attention_probs_dropout_prob: dropout probability applied to the attention tensors.
- hidden_act: activation function used by bert (gelu).
- hidden_dropout_prob: dropout probability applied to the hidden tensors.
- hidden_size: size of the hidden layers, for example 768 (base) or 1024 (large).
- initializer_range: range (standard deviation) of the weight initializer.
- intermediate_size: size of the intermediate feed-forward layer; larger than hidden_size.
- max_position_embeddings: maximum sequence length supported by bert.
- num_attention_heads: number of attention heads, for example 12.
- num_hidden_layers: number of transformer layers, for example 6 or 12.
- type_vocab_size: vocabulary size of the token types.
- vocab_size: size of the vocabulary, for example 4000 to 5000.
9.2.7 Training Script Modification

The remaining changes to main() of "BERT/run_pretraining.py" are listed in Table 9-13.

Table 9-13 Modified function

- main() ("BERT/run_pretraining.py")
1. Modify the flag definitions.

Set the required environment variables before the flags definitions in "BERT/run_pretraining.py":

############## npu modify begin #############
# Environment variables required by the Ascend platform.
os.environ['WHICH_OP'] = 'GEOP'
os.environ['NEW_GE_FE_ID'] = '1'
os.environ['GE_AICPU_FLAG'] = '1'
os.environ['GE_USE_STATIC_MEMORY'] = '1'
os.environ['OPTION_EXEC_HCCL_FLAG'] = '1'
os.environ['HCCL_CONNECT_TIMEOUT'] = '600'
############## npu modify end #############

flags = tf.flags
Change the default values of the following flags:

############## npu modify begin #############
flags.DEFINE_string(
    "input_files_dir", "./data",
    "Directory with input files, comma separated or single directory.")

flags.DEFINE_string(
    "output_dir", "./models",
    "The output directory where the model checkpoints will be written.")
############## npu modify end #############
# Original code:
# flags.DEFINE_string(
#     "input_files_dir", None,
#     "Directory with input files, comma separated or single directory.")
#
# flags.DEFINE_string(
#     "output_dir", None,
#     "The output directory where the model checkpoints will be written.")
The dllog_path flag is unchanged:

flags.DEFINE_string(
    "dllog_path", "/results/bert_dllog.json",
    "filename where dllogger writes to")
Change the default values of the following flags:

############## npu modify begin #############
flags.DEFINE_integer(
    "max_seq_length", 128,
    "The maximum total input sequence length after WordPiece tokenization. "
    "Sequences longer than this will be truncated, and sequences shorter "
    "than this will be padded. Must match data generation.")

flags.DEFINE_integer(
    "max_predictions_per_seq", 20,
    "Maximum number of masked LM predictions per sequence. "
    "Must match data generation.")

flags.DEFINE_bool("do_train", True, "Whether to run training.")

flags.DEFINE_integer("train_batch_size", 64, "Total batch size for training.")

flags.DEFINE_float("learning_rate", 1e-4, "The initial learning rate for Adam.")

flags.DEFINE_integer("num_train_steps", 1000000, "Number of training steps.")

flags.DEFINE_integer("save_checkpoints_steps", 10000,
                     "How often to save the model checkpoint.")

flags.DEFINE_integer("display_loss_steps", 10, "How often to print loss")

flags.DEFINE_bool("manual_fp16", True,
                  "Whether to use fp32 or fp16 arithmetic on GPU. "
                  "Manual casting is done instead of using AMP")
############## npu modify end #############
# Original code:
# flags.DEFINE_integer(
#     "max_seq_length", 512,
#     "The maximum total input sequence length after WordPiece tokenization. "
#     "Sequences longer than this will be truncated, and sequences shorter "
#     "than this will be padded. Must match data generation.")
#
# flags.DEFINE_integer(
#     "max_predictions_per_seq", 80,
#     "Maximum number of masked LM predictions per sequence. "
#     "Must match data generation.")
#
# flags.DEFINE_bool("do_train", False, "Whether to run training.")
#
# flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
#
# flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
#
# flags.DEFINE_integer("num_train_steps", 100000, "Number of training steps.")
#
# flags.DEFINE_integer("save_checkpoints_steps", 10000,
#                      "How often to save the model checkpoint.")
#
# flags.DEFINE_integer("display_loss_steps", 1,
#                      "How often to print loss")
#
# flags.DEFINE_bool("manual_fp16", False,
#                   "Whether to use fp32 or fp16 arithmetic on GPU. "
#                   "Manual casting is done instead of using AMP")
Delete the amp flag definition:

# flags.DEFINE_bool("amp", True, "Whether to enable AMP ops. When false, uses TF32 on A100 and FP
Change the default value of the use_xla flag:

############## npu modify begin #############
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
############## npu modify end #############
# Original code:
# flags.DEFINE_bool("use_xla", True, "Whether to enable XLA JIT compilation.")
############## npu modify begin #############
# New flags for the NPU adaptation.
flags.DEFINE_bool("use_fp16", False, "Whether to enable AMP ops.")

flags.DEFINE_bool("use_fp16_cls", True, "Whether to use fp16 in cls and pooler.")

flags.DEFINE_bool("distributed", True, "Whether to use multi-npu")

flags.DEFINE_bool('npu_bert_fused_gelu', True, 'Whether to use npu defined gelu op')

flags.DEFINE_bool('npu_bert_debug', False, 'If True, dropout and shuffle is disabled.')

flags.DEFINE_bool('npu_bert_use_tdt', True, 'Whether to use tdt as dataset')

flags.DEFINE_string("npu_bert_job_start_file", None, "CSA job start file path.")

flags.DEFINE_integer("npu_bert_loss_scale", 0,
                     "Whether to use loss scale, -1 is disable, "
                     "0 is dynamic loss scale, >=1 is static loss scale")

flags.DEFINE_bool("npu_bert_clip_by_global_norm", False,
                  "Use clip_by_global_norm if True, or use clip_by_norm "
                  "for each gradient")

flags.DEFINE_bool('npu_bert_npu_dropout', True, 'Whether to use npu defined dropout op')

flags.DEFINE_bool('npu_gather', True,
                  'Whether to use gather_npu whose backward propagation '
                  'avoids IndexedSlices')

flags.DEFINE_bool('hcom_parallel', True, 'Whether to use parallel allreduce')
############## npu modify end #############
Replace the init_loss_scale flag with init_loss_scale_value:

############## npu modify begin #############
flags.DEFINE_integer('init_loss_scale_value', 2**32,
                     'Initial loss scale value for loss scale optimizer')
############## npu modify end #############
# Original code:
# flags.DEFINE_integer("init_loss_scale", 2**32,
#                      "Initial value of loss scale if mixed precision training")
2. Modify the _LogSessionRunHook class in "BERT/run_pretraining.py".

Modify __init__():
############## npu modify begin #############
def __init__(self, global_batch_size, num_accumulation_steps,
             display_every=10, hvd_rank=-1):
  self.global_batch_size = global_batch_size
  self.display_every = display_every
  self.hvd_rank = hvd_rank
  self.num_accumulation_steps = num_accumulation_steps
############## npu modify end #############
# Original code:
# def __init__(self, global_batch_size, num_accumulation_steps, dllogging,
#              display_every=10, save_ckpt_steps=1000, report_loss=True,
#              hvd_rank=-1):
#   self.global_batch_size = global_batch_size
#   self.display_every = display_every
#   self.save_ckpt_steps = save_ckpt_steps
#   self.hvd_rank = hvd_rank
#   self.num_accumulation_steps = num_accumulation_steps
#   self.dllogging = dllogging
#   self.report_loss = report_loss
Modify after_create_session():

############## npu modify begin #############
def after_create_session(self, session, coord):
  self.elapsed_secs = 0.
  self.count = 0
  self.all_count = 0
  self.avg_loss = 0.0
############## npu modify end #############
# Original code:
# def after_create_session(self, session, coord):
#   self.elapsed_secs = 0.0  # elapsed seconds between every print
#   self.count = 0           # number of global steps between every print
#   self.all_count = 0       # number of steps (including accumulation) between every print
#   self.loss = 0.0          # accumulation of loss in each step between every print
#   # self.total_time = 0.0  # total time taken to train (excluding warmup + ckpt saving steps)
#   # self.step_time = 0.0   # time taken per step
#   # self.init_global_step = session.run(tf.train.get_global_step())
#   # self.skipped = 0
Modify before_run(). On the Ascend 910 AI Processor the if branches fetch 'global_step:0' (plus 'apply_grads/All:0' under dynamic loss scaling) through tf.estimator.SessionRunArgs:

############## npu modify begin #############
def before_run(self, run_context):
  self.t0 = time.time()
  if self.num_accumulation_steps <= 1:
    if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or
                                                      FLAGS.use_fp16):
      return tf.estimator.SessionRunArgs(
          fetches=['global_step:0', 'total_loss:0',
                   'learning_rate:0', 'nsp_loss:0',
                   'mlm_loss:0', 'loss_scale:0', 'apply_grads/All:0'])
    else:
      return tf.estimator.SessionRunArgs(
          fetches=['global_step:0', 'total_loss:0',
                   'learning_rate:0', 'nsp_loss:0', 'mlm_loss:0'])
  else:
    if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or
                                                      FLAGS.use_fp16):
      return tf.estimator.SessionRunArgs(
          fetches=['global_step:0', 'update_step:0', 'total_loss:0',
                   'learning_rate:0', 'nsp_loss:0',
                   'mlm_loss:0', 'loss_scale:0'])
    else:
      return tf.estimator.SessionRunArgs(
          fetches=['global_step:0', 'update_step:0', 'total_loss:0',
                   'learning_rate:0', 'nsp_loss:0', 'mlm_loss:0'])
############## npu modify end #############
# Original code:
# def before_run(self, run_context):
#   self.t0 = time.time()
#   if self.num_accumulation_steps <= 1:
#     if FLAGS.manual_fp16 or FLAGS.amp:
#       return tf.estimator.SessionRunArgs(
#           fetches=['step_update:0', 'total_loss:0',
#                    'learning_rate:0', 'nsp_loss:0',
#                    'mlm_loss:0', 'loss_scale:0'])
#     else:
#       return tf.estimator.SessionRunArgs(
#           fetches=['step_update:0', 'total_loss:0',
#                    'learning_rate:0', 'nsp_loss:0',
#                    'mlm_loss:0'])
#   else:
#     if FLAGS.manual_fp16 or FLAGS.amp:
#       return tf.estimator.SessionRunArgs(
#           fetches=['step_update:0', 'update_step:0', 'total_loss:0',
#                    'learning_rate:0', 'nsp_loss:0',
#                    'mlm_loss:0', 'loss_scale:0'])
#     else:
#       return tf.estimator.SessionRunArgs(
#           fetches=['step_update:0', 'update_step:0', 'total_loss:0',
#                    'learning_rate:0', 'nsp_loss:0',
#                    'mlm_loss:0'])
Modify after_run() of _LogSessionRunHook:
def after_run(self, run_context, run_values):
  self.elapsed_secs += time.time() - self.t0
  if self.num_accumulation_steps <= 1:
    if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or
                                                      FLAGS.use_fp16):
      global_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler, custom_arg = \
          run_values.results
    else:
      global_step, total_loss, lr, nsp_loss, mlm_loss = run_values.results
    update_step = True
  else:
    if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or
                                                      FLAGS.use_fp16):
      global_step, update_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = \
          run_values.results
    else:
      global_step, update_step, total_loss, lr, nsp_loss, mlm_loss = \
          run_values.results

  print_step = global_step + 1  # One-based index for printing.
  self.avg_loss += total_loss
  self.all_count += 1
  if update_step:
    self.count += 1
    dt = self.elapsed_secs / self.count
    sent_per_sec = self.global_batch_size / dt * FLAGS.iterations_per_loop
    avg_loss_step = self.avg_loss / self.all_count
    if self.hvd_rank >= 0:
      if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or
                                                        FLAGS.use_fp16):
        print('Rank = %2d :: Step = %6i Throughput = %11.1f MLM Loss = %10.4e '
              'NSP Loss = %10.4e Loss = %9.6f Average Loss = %9.6f LR = %6.4e '
              'Loss scale = %6.4e isFinite = %6i' %
              (self.hvd_rank, print_step, sent_per_sec, mlm_loss, nsp_loss,
               total_loss, avg_loss_step, lr, loss_scaler, custom_arg), flush=True)
      else:
        print('Rank = %2d :: Step = %6i Throughput = %11.1f MLM Loss = %10.4e '
              'NSP Loss = %10.4e Loss = %9.6f Average Loss = %9.6f LR = %6.4e' %
              (self.hvd_rank, print_step, sent_per_sec, mlm_loss, nsp_loss,
               total_loss, avg_loss_step, lr), flush=True)
    else:
      if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or
                                                        FLAGS.use_fp16):
        print('Step = %6i Throughput = %11.1f MLM Loss = %10.4e NSP Loss = %10.4e '
              'Loss = %9.6f Average Loss = %9.6f LR = %6.4e Loss scale = %6.4e '
              'isFinite = %6i' %
              (print_step, sent_per_sec, mlm_loss, nsp_loss, total_loss,
               avg_loss_step, lr, loss_scaler, custom_arg), flush=True)
      else:
        print('Step = %6i Throughput = %11.1f MLM Loss = %10.4e NSP Loss = %10.4e '
              'Loss = %9.6f Average Loss = %9.6f LR = %6.4e' %
              (print_step, sent_per_sec, mlm_loss, nsp_loss, total_loss,
               avg_loss_step, lr), flush=True)
    self.elapsed_secs = 0.
    self.count = 0
    self.avg_loss = 0.0
    self.all_count = 0
3. Modify main() in "BERT/run_pretraining.py".

Delete the "TF_XLA_FLAGS" environment variable setting:

############## npu modify begin #############
# Removed:
# os.environ["TF_XLA_FLAGS"] = " --tf_xla_enable_lazy_compilation false"  # causes memory fragmentation for bert leading to OOM
############## npu modify end #############
Print all flag values for debugging:

############## npu modify begin #############
for name, value in FLAGS.__flags.items():
  print("name:", name, " ", FLAGS[name].value)
############## npu modify end #############
Delete the utils.dllogger_class.dllogger_class() instantiation:

############## npu modify begin #############
# Removed:
# dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
############## npu modify end #############
Enable the auto mixed precision graph rewrite when use_fp16 is set:

if not FLAGS.do_train and not FLAGS.do_eval:
  raise ValueError("At least one of `do_train` or `do_eval` must be True.")

############## npu modify begin #############
if FLAGS.use_fp16:
  os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
############## npu modify end #############
Set the HCCL gradient split strategy when npu_gather is enabled on the Ascend 910 AI Processor:

bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)

############## npu modify begin #############
# Choose an allreduce split strategy that matches the model depth.
if FLAGS.npu_gather:
  if FLAGS.distributed and bert_config.num_hidden_layers == 24:
    from hccl.split.api import set_split_strategy_by_size
    set_split_strategy_by_size([10, 10, 10, 10, 15, 15, 15, 15])
  if FLAGS.distributed and bert_config.num_hidden_layers == 12:
    from hccl.split.api import set_split_strategy_by_idx
    set_split_strategy_by_idx([8, 56, 104, 152, 200, 205])
  if FLAGS.distributed and bert_config.num_hidden_layers == 6:
    from hccl.split.api import set_split_strategy_by_idx
    set_split_strategy_by_idx([8, 40, 72, 104, 109])
############## npu modify end #############
Sort the input files so that every device reads them in the same order:

############## npu modify begin #############
input_files.sort()
print("Input Files:", input_files)
############## npu modify end #############

if FLAGS.horovod and len(input_files) < hvd.size():
  raise ValueError("Input Files must be sharded")
flags"use_fp16""amp"910 AI
############## npu modify begin ############# # "use_fp16""amp"910 AI if FLAGS.use_fp16 and FLAGS.manual_fp16:
raise ValueError("AMP and Manual Mixed Precision Training are both activated! Error") ############## npu modify end #############
# # if FLAGS.amp and FLAGS.manual_fp16: # raise ValueError("AMP and Manual Mixed Precision Training are both activated! Error")
flags"amp"910 AI
if FLAGS.use_xla: config.graph_options.optimizer_options.global_jit_level =
tf.compat.v1.OptimizerOptions.ON_1 config.graph_options.rewrite_options.memory_optimization =
rewriter_config_pb2.RewriterConfig.NO_MEM_OPT
############## npu modify begin ############# # if FLAGS.amp: tf.enable_resource_variables() ############## npu modify end ###############
Read "RANK_SIZE" from the environment when distributed training is enabled on the Ascend 910 AI Processor. After the NPURunConfig creation shown earlier:

run_config = NPURunConfig(
    model_dir=FLAGS.output_dir,
    save_summary_steps=0,
    session_config=config,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
    # This variable controls how often estimator reports examples/sec.
    # Default value is every 100 steps.
    # When --report_loss is True, we set to very large value to prevent
    # default info reporting from estimator.
    # Ideally we should set it to None, but that does not work.
    log_step_count_steps=1 if FLAGS.report_loss else 100,
    enable_data_pre_proc=FLAGS.npu_bert_use_tdt,
    iterations_per_loop=FLAGS.iterations_per_loop,
    hcom_parallel=FLAGS.hcom_parallel)

############## npu modify begin #############
if FLAGS.distributed:
  rank_size = int(os.getenv('RANK_SIZE'))
############## npu modify end #############
Pass learning_rate to model_fn_builder() without Horovod scaling:

############## npu modify begin #############
model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=FLAGS.num_train_steps,
    num_warmup_steps=FLAGS.num_warmup_steps,
    use_one_hot_embeddings=False,
    hvd=None if not FLAGS.horovod else hvd)
############## npu modify end #############
# Original code:
# model_fn = model_fn_builder(
#     bert_config=bert_config,
#     init_checkpoint=FLAGS.init_checkpoint,
#     learning_rate=FLAGS.learning_rate if not FLAGS.horovod else FLAGS.learning_rate * hvd.size(),
#     num_train_steps=FLAGS.num_train_steps,
#     num_warmup_steps=FLAGS.num_warmup_steps,
#     use_one_hot_embeddings=False,
#     hvd=None if not FLAGS.horovod else hvd)
############## npu modify begin #############
# Compute the global batch size from RANK_SIZE and register the logging hook.
training_hooks = []
if FLAGS.report_loss:
  global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps \
      if not FLAGS.distributed \
      else FLAGS.train_batch_size * FLAGS.num_accumulation_steps * rank_size
  training_hooks.append(
      _LogSessionRunHook(global_batch_size, FLAGS.num_accumulation_steps,
                         FLAGS.display_loss_steps))
############## npu modify end #############
############## npu modify begin #############
# Removed original Horovod hook setup:
# training_hooks = []
# if FLAGS.horovod and hvd.size() > 1:
#   training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
# if (not FLAGS.horovod or hvd.rank() == 0):
#   global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps \
#       if not FLAGS.horovod \
#       else FLAGS.train_batch_size * FLAGS.num_accumulation_steps * hvd.size()
#   training_hooks.append(_LogSessionRunHook(global_batch_size,
#       FLAGS.num_accumulation_steps, dllogging, FLAGS.display_loss_steps,
#       FLAGS.save_checkpoints_steps, FLAGS.report_loss))
############## npu modify end #############
############## npu modify begin #############
# Removed:
# train_start_time = time.time()
############## npu modify end #############

estimator.train(input_fn=train_input_fn, hooks=training_hooks,
                max_steps=FLAGS.num_train_steps)

############## npu modify begin #############
# Removed:
# train_time_elapsed = time.time() - train_start_time
############## npu modify end #############

############## npu modify begin #############
# Removed the post-training throughput report:
# if (not FLAGS.horovod or hvd.rank() == 0):
#   train_time_wo_overhead = training_hooks[-1].total_time
#   avg_sentences_per_second = FLAGS.num_train_steps * global_batch_size * 1.0 / train_time_elapsed
#   ss_sentences_per_second = (FLAGS.num_train_steps - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
#
#   tf.compat.v1.logging.info("-----------------------------")
#   tf.compat.v1.logging.info("Total Training Time = %0.2f for Sentences = %d",
#                             train_time_elapsed, FLAGS.num_train_steps * global_batch_size)
#   tf.compat.v1.logging.info("Total Training Time W/O Overhead = %0.2f for Sentences = %d",
#                             train_time_wo_overhead,
#                             (FLAGS.num_train_steps - training_hooks[-1].skipped) * global_batch_size)
#   tf.compat.v1.logging.info("Throughput Average (sentences/sec) with overhead = %0.2f", avg_sentences_per_second)
#   tf.compat.v1.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
#   dllogging.logger.log(step=(), data={"throughput_train": ss_sentences_per_second}, verbosity=Verbosity.DEFAULT)
#   tf.compat.v1.logging.info("-----------------------------")
############## npu modify end #############
eval_time_elapsed = time.time() - eval_start_time

############## npu modify begin #############
# Use the hook's recorded total time directly.
eval_time_wo_overhead = eval_hooks[-1].total_time
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size
############## npu modify end #############
# Original code:
# time_list = eval_hooks[-1].time_list
# time_list.sort()
# # Removing outliers (init/warmup) in throughput computation.
# eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
# num_sentences = (int(len(time_list) * 0.99)) * FLAGS.eval_batch_size
############## npu modify begin #############
tf.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d",
                eval_time_wo_overhead,
                (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size)
############## npu modify end #############
# Original code:
# tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d",
#                           eval_time_wo_overhead, num_sentences)
tf.logging.info("Summary Inference Statistics on EVAL set") tf.logging.info("Batch size = %d", FLAGS.eval_batch_size) tf.logging.info("Sequence Length = %d", FLAGS.max_seq_length)
############## npu modify begin #############
tf.logging.info("Precision = %s", "fp16" if FLAGS.use_fp16 else "fp32")
############## npu modify end #############

# tf.compat.v1.logging.info("Precision = %s", "fp16" if FLAGS.amp else "fp32")
############## npu modify begin #############
# dllogging.logger.log(step=(), data={"throughput_val": ss_sentences_per_second}, verbosity=Verbosity.DEFAULT)
############## npu modify end #############
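The use_fp16/manual_fp16 flags select manual mixed precision: chosen tensors are cast to tf.float16 by hand instead of relying on AMP. A minimal TF 1.15-style sketch of the idea (variable names are illustrative, not the document's implementation):

import tensorflow as tf

# Keep master weights in float32, run the matmul in float16,
# and cast back to float32 before numerically sensitive ops.
x = tf.random.normal([2, 16])               # float32 activations
w = tf.Variable(tf.random.normal([16, 4]))  # float32 master weights
y16 = tf.matmul(tf.cast(x, tf.float16), tf.cast(w, tf.float16))
y = tf.cast(y16, tf.float32)                # back to float32 before the loss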
"BERT/run_pretraining.py"
############## npu modify begin #############
if __name__ == "__main__":
    flags.mark_flag_as_required("input_files_dir")
    # "eval_files_dir" is now required unconditionally (the do_eval check is removed)
    flags.mark_flag_as_required("eval_files_dir")
    flags.mark_flag_as_required("bert_config_file")
    flags.mark_flag_as_required("output_dir")
    flags.mark_flag_as_required("npu_bert_job_start_file")
    if FLAGS.use_xla and FLAGS.manual_fp16:
        print('WARNING! Combining --use_xla with --manual_fp16 may prevent convergence.')
        print('         This warning message will be removed when the underlying')
        print('         issues have been fixed and you are running a TF version')
        print('         that has that fix.')
    tf.compat.v1.app.run()
############## npu modify end #############
# Original code:
# if __name__ == "__main__":
#     flags.mark_flag_as_required("input_files_dir")
#     if FLAGS.do_eval:
#         flags.mark_flag_as_required("eval_files_dir")
#     flags.mark_flag_as_required("bert_config_file")
#     flags.mark_flag_as_required("output_dir")
#     if FLAGS.use_xla and FLAGS.manual_fp16:
#         print('WARNING! Combining --use_xla with --manual_fp16 may prevent convergence.')
#         print('         This warning message will be removed when the underlying')
#         print('         issues have been fixed and you are running a TF version')
#         print('         that has that fix.')
#     tf.compat.v1.app.run()
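The mark_flag_as_required calls are provided by the abseil flags library that backs tf.compat.v1.app.run(). A tiny standalone demonstration (the file name and flag subset are illustrative):

# demo_flags.py -- run as: python3.7 demo_flags.py --input_files_dir=/home/data/bert/cn-clue-256
from absl import app, flags

FLAGS = flags.FLAGS
flags.DEFINE_string("input_files_dir", None, "Directory containing the training tfrecords.")
flags.mark_flag_as_required("input_files_dir")   # app.run() exits with an error if the flag is unset

def main(argv):
    del argv  # unused
    print("training data:", FLAGS.input_files_dir)

if __name__ == "__main__":
    app.run(main)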
9.2.8 Executing Training

Prerequisites: the bookscorpus pretraining data has been converted to tfrecord format, for example under "/home/data/bert/cn-clue-256/", and the ranktable file describing the NPU devices has been prepared (see the earlier ranktable configuration).
1 "Bert/scripts"npu_set_env.shnpu_set_env.sh
# main env
export LD_LIBRARY_PATH=/usr/local/:/usr/local/lib/:/usr/lib/:/usr/local/Ascend/fwkacllib/lib64/:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/
export PYTHONPATH=$PYTHONPATH:/usr/local/Ascend/opp/op_impl/built-in/ai_core/tbe:/code
export PATH=$PATH:/usr/local/Ascend/fwkacllib/ccec_compiler/bin
export ASCEND_OPP_PATH=/usr/local/Ascend/opp
export SOC_VERSION=Ascend910
export HCCL_CONNECT_TIMEOUT=600

# user env
export JOB_ID=bert-base-1p
export RANK_TABLE_FILE=../configs/1p.json
export RANK_SIZE=1
export RANK_INDEX=0
export RANK_ID=0

# profiling env
export PROFILING_MODE=true
export AICPU_PROFILING_MODE=false
export PROFILING_OPTIONS=task_trace:training_trace
export FP_POINT=bert/embeddings/GatherV2
export BP_POINT=gradients/bert/embeddings/IdentityN_1_grad/UnsortedSegmentSum

# debug env
#export DUMP_GE_GRAPH=2
#export DUMP_OP=1
#export DUMP_OP_LESS=1
#export PRINT_MODEL=1
#export TE_PARALLEL_COMPILER=0

# system env
ulimit -c unlimited
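Before launching training it can be useful to confirm that the variables exported by npu_set_env.sh are actually visible to Python. This check is not part of the original scripts; it is a small optional sketch:

import os

# Variables npu_set_env.sh is expected to export for a 1P job.
required = ["JOB_ID", "RANK_TABLE_FILE", "RANK_SIZE", "RANK_INDEX", "RANK_ID"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError("npu_set_env.sh not sourced? missing: " + ", ".join(missing))
print("RANK_SIZE =", os.environ["RANK_SIZE"])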
2. In the "Bert/scripts" directory, configure run_pretraining.sh. The content of run_pretraining.sh is as follows:
#!/bin/sh
currentDir=$(cd "$(dirname "$0")"; pwd)
cd ${currentDir}
PWD=${currentDir}

device_id=0
if [ x"${device_id}" = x ] ; then
    echo "turing train fail" >> ${currentDir}/train_${device_id}.log
    exit
else
    export DEVICE_ID=${device_id}
fi

DEVICE_INDEX=$(( DEVICE_ID + RANK_INDEX * 8 ))
export DEVICE_INDEX=${DEVICE_INDEX}

env > ${currentDir}/env_${device_id}.log

#mkdir exec path
#mkdir -p ${currentDir}/${device_id}
#rm -rf ${currentDir}/${device_id}/*
cd ${currentDir}/
rm -rf kernel_meta
rm -rf output

#start exec
python3.7 ../run_pretraining.py --bert_config_file=../configs/bert_base_config.json --max_seq_length=128 \
    --max_predictions_per_seq=20 --train_batch_size=128 --learning_rate=1e-4 --num_warmup_steps=10000 \
    --num_train_steps=500000 --optimizer_type=adam --manual_fp16=True --use_fp16_cls=True \
    --input_files_dir=/home/data/bert/cn-clue-256 --eval_files_dir=/home/data/bert/cn-clue-256 \
    --npu_bert_use_tdt=True --do_train=True --num_accumulation_steps=1 --npu_bert_job_start_file= \
    --iterations_per_loop=100 --save_checkpoints_steps=10000 --npu_bert_clip_by_global_norm=False \
    --distributed=True --npu_bert_loss_scale=0 --output_dir=./output
Set --input_files_dir=/home/data/bert/cn-clue-256 and --eval_files_dir=/home/data/bert/cn-clue-256 to the actual dataset path.
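run_pretraining.sh computes DEVICE_INDEX as DEVICE_ID + RANK_INDEX * 8, that is, it assumes 8 devices per server. The same arithmetic in Python, as a sketch with that assumption made explicit:

def device_index(device_id, rank_index, devices_per_server=8):
    """Global device index as computed in run_pretraining.sh."""
    return device_id + rank_index * devices_per_server

assert device_index(0, 0) == 0    # first device on the first server
assert device_index(3, 2) == 19   # fourth device on the third server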
The parameters are described in Table 9-14.
Table 9-14 Parameter description

bert_config_file                BERT model configuration file
max_seq_length                  maximum sequence length
max_predictions_per_seq         maximum number of masked predictions per sequence
train_batch_size                training BatchSize per device
learning_rate                   learning rate
num_warmup_steps                number of warmup steps
num_train_steps                 total number of training steps
optimizer_type                  optimizer type
manual_fp16                     whether to enable manual mixed precision with tf.float16
use_fp16_cls                    whether to cast the cls/pooler tensor to tf.float16
input_files_dir                 directory of the training data
eval_files_dir                  directory of the evaluation data
npu_bert_use_tdt                whether to use the Ascend 910 tdt data channel
do_train                        whether to run training
num_accumulation_steps          number of gradient accumulation steps
npu_bert_job_start_file         job start file of the training job
iterations_per_loop             number of steps executed per loop on the device
save_checkpoints_steps          step interval for saving checkpoints
npu_bert_clip_by_global_norm    whether to clip gradients by global norm
distributed                     whether to run distributed training on Ascend 910
npu_bert_loss_scale             loss scaling value (0 enables dynamic loss scaling)
output_dir                      output directory
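To make the step-related flags concrete, a small calculation with the values used in run_pretraining.sh:

# Step bookkeeping implied by the flags above.
num_train_steps = 500000
iterations_per_loop = 100          # device-side steps per host-device loop
save_checkpoints_steps = 10000

print(num_train_steps // iterations_per_loop)      # 5000 loops in total
print(num_train_steps // save_checkpoints_steps)   # 50 checkpoints written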
----End

The BERT script directory is organized as follows:

BERT
├── configs                      # configuration directory (json files for the bert model)
│   ├── 1p.json                  // ranktable configuration for single-NPU (1P) training
│   └── bert_base_config.json    // BERT model configuration
├── scripts                      # script directory
│   ├── npu_set_env.sh           // environment variable settings
│   └── run_pretraining.sh       // training launch script
├── utils                        # utils directory
│   ├── utils.py
│   └── __init__.py
├── gpu_environment.py           // gpu_environment helpers
├── modeling.py                  // BERT model definition
├── optimization.py              // optimizer
├── run_pretraining.py           // pre-training entry script
├── CONTRIBUTING.md              // contribution guide
└── README.md                    // project description
10 FAQ

10.1 Upgrading gcc to 7.3.0
Perform the following operations as the root user.
1. Download gcc-7.3.0.tar.gz from https://mirrors.tuna.tsinghua.edu.cn/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.gz.
2. Building gcc consumes a large amount of temporary space, so clear the /tmp directory first:
sudo rm -rf /tmp/*
3. Install the bzip2 dependency.
On CentOS/BCLinux:
yum install bzip2
On Ubuntu/Debian:
apt-get install bzip2
4. Build and install gcc.
1. Extract gcc-7.3.0.tar.gz:
tar -zxvf gcc-7.3.0.tar.gz
2. Download the gcc dependency packages:
cd gcc-7.3.0
./contrib/download_prerequisites
If this command fails, download the dependencies manually in the "gcc-7.3.0/" directory:
wget http://gcc.gnu.org/pub/gcc/infrastructure/gmp-6.1.0.tar.bz2
wget http://gcc.gnu.org/pub/gcc/infrastructure/mpfr-3.1.4.tar.bz2
wget http://gcc.gnu.org/pub/gcc/infrastructure/mpc-1.0.3.tar.gz
wget http://gcc.gnu.org/pub/gcc/infrastructure/isl-0.16.1.tar.bz2
After the packages are downloaded, run the command again:
./contrib/download_prerequisites
3. Configure, build, and install:
./configure --enable-languages=c,c++ --disable-multilib --with-system-zlib --prefix=/usr/local/linux_gcc7.3.0
make -j15    # run grep -w processor /proc/cpuinfo|wc -l to query the number of CPUs; 15 is an example value
make install
The "--prefix" option sets the linux_gcc7.3.0 installation path and can be customized, but do not use "/usr/local" or "/usr", which would conflict with the gcc shipped with the system; this example uses "/usr/local/linux_gcc7.3.0".
5. Set the environment variable so that the runtime libraries of the new gcc are used:
export LD_LIBRARY_PATH=.../xxx/xxx/xxx/lib64
Replace ".../xxx/xxx/xxx/" with the gcc installation path configured in step 3, "/usr/local/linux_gcc7.3.0/" in this example.
----End
10.2 Change History

2020-07-22