Network Model Porting and Training Guide

Huawei Technologies Co., Ltd.


mprtg-A800 9000 9010
Atlas Data Center Solution V100R020C00


 

Issue 01 (2020-11-28)



 ©  2020   

 
   
 


Contents

4.1 Estimator
4.2 sess.run
4.3 Keras
4.3.1 Keras
4.3.2 Keras
4.3.3 Keras, NPUEstimator
5.2.1 Server
5.2.2 Server
5.2.3 Atlas 300T (9000)
5.4.3 Horovod
5.5.1 API
5.5.6 Tensor
6.2 Loss Scaling
6.4 Profiling
6.5 Dump
6.7 Log/Summary
8.1 Tensorflow
8.2 TF Adapter
8.2.2 NPURunConfig
8.2.3 ProfilingConfig
8.2.4 DumpConfig
8.2.5 NPUEstimator
8.2.6 NPUEstimatorSpec
8.2.7 NPUCheckpointSaverHook
8.2.8 NPUOutputTensorHook
8.2.9 NPUDistributedOptimizer
8.2.10 NPULossScaleOptimizer
8.2.11 NPUOptimizer
8.2.12 FixedLossScaleManager
8.2.13 ExponentialUpdateLossScaleManager
8.2.14 dropout
8.2.15 LARSV2
8.2.16 initialize_system
8.2.17 shutdown_system
8.2.18 without_npu_compile_scope
8.2.19 set_iteration_per_loop
8.2.20 create_iteration_per_loop_var
8.2.21 load_iteration_per_loop_var
8.2.22 model_to_npu_estimator
8.2.23 sess.run, session
8.3.2 group
8.3.2.1 create_group
8.3.2.2 destroy_group
8.3.2.3 get_rank_size
8.3.2.4 get_local_rank_size
8.3.2.5 get_rank_id
8.3.2.6 get_local_rank_id
8.3.2.7 get_world_rank_from_group_rank
8.3.2.8 get_group_rank_from_world_rank
8.3.3.1 set_split_strategy_by_idx
8.3.3.2 set_split_strategy_by_size
8.3.4.1 allreduce
8.3.4.2 allgather
8.3.4.3 broadcast
8.3.4.4 reduce_scatter
8.3.4.5 send
8.3.4.6 receive
9.1 imagenet, ResNet
10.1 gcc 7.3.0

1 
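This guide describes how to port training scripts written with the TensorFlow Python API so that they train on the Ascend AI Processor.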

2 
1. TensorFlow version: TensorFlow 1.15.
2. infershape: unknown shape.
3. Data formats: NCHW, NHWC, NC, HWCN, CN.
4. cast: float32, float16.
5. Control flow: TF.condition, TF.whileloop.
6. Multi-device (multi-P) training: NPURunConfig and save_checkpoints_secs.
7. Multi-device (multi-P) training: Summary.
8. iterations_per_loop > 1: constraints apply to save_checkpoints_steps, save_summary_steps and log_step_count_steps relative to iterations_per_loop; see Log/Summary.
9. gelu and dropout: use the Ascend implementations.
10. String type: summary/log/data.
11. inf/nan.
12. Collective communication:
   a. device
   b. 1/2/4/8P per server
   c. Data types: int8, int32, float16, float32
13. Data preprocessing:
   a. tf.data.make_initializable_iterator and getnext
   b. BatchDataset: set drop_remainder to True, so that every batch has the full batch size.

3 
Before porting, install the Ascend software and TensorFlow 1.15.0, and configure the device IP addresses.

4 

4.1 Estimator
4.2 sess.run
4.3 Keras

4.1 Estimator 

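About Estimator
The Estimator API is a high-level TensorFlow API, introduced in 2018 with TensorFlow 1.10, that simplifies much of the machine-learning programming workflow.
Developing a training script with Estimator usually involves these steps:
1. Preprocess the data and create the input function input_fn.
2. Build the model with the model function model_fn.
3. Configure the run with RunConfig and instantiate the Estimator.
4. Call Estimator.train() on the Estimator instance to run training.

The following describes how to port a training script written with the Estimator API so that it trains on the Ascend AI Processor.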


def train_input_fn(train_data, train_labels):
    # Build the input function from numpy arrays
    return tf.estimator.inputs.numpy_input_fn(
        x={"x": train_data},
        y=train_labels,
        batch_size=FLAGS.batch_size,
        num_epochs=None,  # number of epochs
        shuffle=True)

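Only fixed-shape training is supported: shapes must be known when the graph is compiled. dataset.batch(batch_size) can return a dynamic shape, because the number of samples left in the stream may be smaller than the batch size. When training on the Ascend AI Processor, set drop_remainder to True: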
dataset = dataset.batch(batch_size, drop_remainder=True)
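This may drop the last few samples so that every batch has the static shape (batch_size). During inference, if the last iteration has fewer samples than the batch size, pad the data up to the batch size, because some scripts end with an assertion that the number of results matches the number of input examples: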
assert num_written_lines == num_actual_predict_examples
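The effect of drop_remainder on batch shapes can be illustrated without TensorFlow. The following is a minimal sketch; the helper name batch_sizes and the sample counts are hypothetical and chosen only for illustration.

```python
def batch_sizes(num_examples, batch_size, drop_remainder):
    """Return the size of each batch produced from num_examples samples."""
    full, rest = divmod(num_examples, batch_size)
    sizes = [batch_size] * full
    if rest and not drop_remainder:
        # Trailing partial batch: its leading dimension differs from batch_size.
        sizes.append(rest)
    return sizes


# Hypothetical numbers: 10 samples, batch size 4.
print(batch_sizes(10, 4, drop_remainder=False))  # [4, 4, 2]
print(batch_sizes(10, 4, drop_remainder=True))   # [4, 4]
```

With drop_remainder=True every batch has the same size, at the cost of the two trailing samples, which is why a result-count assertion like the one above can fail unless the data is padded.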



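Replace dropout with the Ascend implementation.
Original TensorFlow code:
layers = tf.nn.dropout()
Code after migration:
from npu_bridge.estimator import npu_ops
layers = npu_ops.dropout()

Replace the gelu used in BERT-style networks with the Ascend implementation.
Original TensorFlow code:
def gelu(x):
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf
layers = gelu()
Code after migration:
from npu_bridge.estimator.npu_unary_ops import npu_unary_ops
layers = npu_unary_ops.gelu(x)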
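When an operator is swapped for another implementation, it can help to have a reference to compare outputs against. The sketch below evaluates the tanh-based gelu formula from the original script in plain Python and compares it with the erf-based definition of GELU. It makes no claim about how the Ascend gelu operator is implemented; the function names are hypothetical.

```python
import math


def gelu_tanh(x):
    """tanh approximation of GELU, the same formula as the gelu() in the script."""
    cdf = 0.5 * (1.0 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))
    return x * cdf


def gelu_erf(x):
    """GELU expressed through the Gaussian CDF (erf form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, gelu_tanh(x), gelu_erf(x))
```

The two forms agree closely over this range, so either can serve as a reference when checking a migrated network's activations.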
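TensorFlow usually configures distributed training through train_distribute in RunConfig. The Ascend platform does not support train_distribute; instead, wrap the optimizer in NPUDistributedOptimizer so that gradients are aggregated across NPU devices. In Estimator mode, NPUEstimator already includes NPUBroadcastGlobalVariablesHook, so no hand-written broadcast is needed.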



TensorFlow passes run parameters through RunConfig; replace RunConfig with NPURunConfig.
Original TensorFlow code:
config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))
Code after migration:
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator import npu_ops
npu_config = NPURunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
)

Create the Estimator
Replace the TensorFlow Estimator with NPUEstimator.
Original TensorFlow code:
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    config=config,
    model_dir="/tmp/mnist_convnet_model")
Code after migration:
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator
mnist_classifier = NPUEstimator(
    model_fn=cnn_model_fn,
    config=npu_config,
    model_dir="/tmp/mnist_convnet_model"
)

Run training
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook])

4.2 sess.run 

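About sess.run
The sess.run API is a low-level TensorFlow API. Compared with Estimator it is more flexible, but the model is more laborious to implement.
Developing a training script with the sess.run API usually involves these steps:
1. Preprocess the data.
2. Build the model, compute the loss, and update the gradients.
3. Create a session and initialize resources.
4. Run training.
The following describes how to port a training script written with the sess.run API so that it trains on the Ascend AI Processor.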





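Only fixed-shape training is supported: shapes must be known when the graph is compiled. dataset.batch(batch_size) can return a dynamic shape, because the number of samples left in the stream may be smaller than the batch size. When training on the Ascend AI Processor, set drop_remainder to True:
dataset = dataset.batch(batch_size, drop_remainder=True)
This may drop the last few samples so that every batch has the static shape (batch_size). During inference, if the last iteration has fewer samples than the batch size, pad the data up to the batch size, because some scripts end with an assertion that the number of results matches the number of input examples: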
assert num_written_lines == num_actual_predict_examples
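Model building / loss computation / gradient update

Replace dropout with the Ascend implementation.
Original TensorFlow code:
layers = tf.nn.dropout()
Code after migration:
from npu_bridge.estimator import npu_ops
layers = npu_ops.dropout()

Replace the gelu used in BERT-style networks with the Ascend implementation.
Original TensorFlow code:
def gelu(x):
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf
layers = gelu()
Code after migration:
from npu_bridge.estimator.npu_unary_ops import npu_unary_ops
layers = npu_unary_ops.gelu(x)

In sess.run mode, distributed training additionally requires the following:
- Broadcast the variables after initialization:
rank_size = os.environ.get('RANK_SIZE', '').strip()
if int(rank_size) > 1:
    input = tf.trainable_variables()
    bcast_global_variables_op = hccl_ops.broadcast(input, 0)
- After each device has computed its gradients, aggregate them with allreduce. This step is not needed when NPUDistributedOptimizer is used, because NPUDistributedOptimizer performs the allreduce itself:
rank_size = os.environ.get('RANK_SIZE', '').strip()
if int(rank_size) > 1:
    grads = [hccl_ops.allreduce(grad, "sum") for grad in grads]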
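The snippets above pass the RANK_SIZE value straight to int(), which raises ValueError when the variable is unset or empty. A guarded read could look like the following sketch; the helper name and the default of 1 (single-device training) are assumptions, not part of the original scripts.

```python
import os


def rank_size_from_env(default=1):
    """Read RANK_SIZE; fall back to `default` when it is unset or empty."""
    raw = os.environ.get("RANK_SIZE", "").strip()
    return int(raw) if raw else default


# Only take the distributed code path when more than one device participates.
if rank_size_from_env() > 1:
    print("distributed: broadcast variables and allreduce gradients")
else:
    print("single device: no collective communication needed")
```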
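Create the session
When a training script runs on the Ascend AI Processor in sess.run mode, note the following session configuration:
- The following option is disabled by default and must not be enabled:
  rewrite_options.disable_model_pruning
- The following options are enabled by default and must not be disabled:
  rewrite_options.function_optimization
  rewrite_options.constant_folding
  rewrite_options.shape_optimization
  rewrite_options.arithmetic_optimization
  rewrite_options.loop_optimization
  rewrite_options.dependency_optimization
  rewrite_options.layout_optimizer
  rewrite_options.memory_optimization
- The following option is enabled by default and must be explicitly disabled:
  rewrite_options.remapping
- For distributed training, add GradFusionOptimizer manually:
  rewrite_options.optimizers.extend(["GradFusionOptimizer"])
- The following option is disabled by default and must be explicitly enabled so that training runs on the Ascend AI Processor:
  custom_op.parameter_map["use_off_line"].b = True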
Original TensorFlow code:
# Create the iterator
iterator = Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
# Get the next batch
next_batch = iterator.get_next()
# Iterator initialization op
training_init_op = iterator.make_initializer(train_dataset)
# Variable initialization
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
# Get the number of training/validation steps per epoch
train_batches_per_epoch = int(np.floor(train_size/batch_size))

Code after migration:
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
# Create the iterator
iterator = Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
# Get the next batch
next_batch = iterator.get_next()
# Iterator initialization op
training_init_op = iterator.make_initializer(train_dataset)
# Variable initialization
init = tf.global_variables_initializer()
# Create the session
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # train on the Ascend AI Processor
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping
sess = tf.Session(config=config)
sess.run(init)
# Get the number of training/validation steps per epoch
train_batches_per_epoch = int(np.floor(train_size/batch_size))
Run training
After the session is created, training runs through sess.run in that session, as in the original script:
# Start training
for epoch in range(num_epochs):
    # Initialize iterator with the training dataset
    sess.run(training_init_op)
    for step in range(train_batches_per_epoch):
        # Get next batch of data
        img_batch, label_batch = sess.run(next_batch)
        # Run the training op
        _, train_loss = sess.run([train_op, loss],
                                 feed_dict={x: img_batch, y_: label_batch, is_training: True})

4.3 Keras 

4.3.1 Keras 
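Like Estimator, Keras is a high-level TensorFlow API. Developing a training script with the Keras API usually involves these steps:
1. Preprocess the data.
2. Build the model.
3. Compile the model.
4. Train the model.
Two ways of training a Keras script on the Ascend platform are supported:
- Use the native Keras API on the Ascend platform. In this mode each session.run call executes exactly 1 training iteration on the Ascend AI Processor.
- To reduce the number of Host-Device interactions and shorten training time, convert the Keras model into an NPUEstimator object with the model_to_npu_estimator interface, and use the iterations_per_loop parameter of NPURunConfig to set how many training iterations each session.run call executes on the Ascend AI Processor.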
4.3.2 Keras 
To train on the Ascend platform with the native Keras API (1 training iteration per session.run call), modify the script as follows:
1. Create a session with use_off_line enabled so that training runs on the Ascend AI Processor, and register it with Keras. Use the Keras that ships with TensorFlow.
import tensorflow as tf
import tensorflow.python.keras as keras
from tensorflow.python.keras import backend as K
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
from npu_bridge.estimator import npu_ops

sess_config = tf.ConfigProto()
custom_op = sess_config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
sess_config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
sess_config.graph_options.rewrite_options.optimizers.extend(["GradFusionOptimizer"])  # needed for distributed training
sess = tf.Session(config=sess_config)
K.set_session(sess)

# Preprocess the data...
# Build the model...
# Compile the model...
# Train the model...

sess.close()
2. For distributed training on the Ascend AI Processor, change the optimizer used when compiling the Keras model: use a TensorFlow optimizer (not a Keras optimizer) and wrap it in NPUDistributedOptimizer.
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer
opt = tf.compat.v1.train.AdamOptimizer(learning_rate=0.1)
opt = NPUDistributedOptimizer(opt)
keras_model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')
callback
4.3.3 Keras  NPUEstimator
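Convert the Keras model to an NPUEstimator in order to use iterations_per_loop.

Data preprocessing
The Keras data preprocessing has to be rewritten as an NPUEstimator input_fn.
In the example below, the Keras version reads images from a directory, labels them automatically, and rescales and flips them. The Estimator version reads from a list of file names: the file-name list and the label list must be created beforehand, and decoding, normalization, resizing and flipping are done explicitly.
Original TensorFlow code:
# Keras: read images from a directory
train_datagen = ImageDataGenerator(rescale=1./255,
                                   horizontal_flip=True)
# target_size is (height, width), without the channel dimension
train_generator = train_datagen.flow_from_directory('data/',
                                                    target_size=(224, 224),
                                                    batch_size=32,
                                                    class_mode='sparse')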

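Code after migration:
# Read an image by file name and preprocess it
def _parse_function(filename, label):
    image = tf.read_file(filename)
    # Assumes JPEG input; decode_jpeg returns a tensor of known rank, which resize_images requires
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.cast(image, tf.float32) / 255.0
    # The size argument is [height, width], without the channel dimension
    image = tf.image.resize_images(image, [224, 224])
    image = tf.image.random_flip_left_right(image)
    return image, label

def input_fn():
    # List of image file names
    filenames = tf.constant(["/data/image1.jpg", "/data/image2.jpg", ...])
    # labels[i] is the label of filenames[i]; labels is a list
    labels = tf.constant([0, 5, ...])
    # Each dataset element is (filename, label)
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels)).repeat(10)
    # Each dataset element is now (image_resized, label)
    dataset = dataset.map(_parse_function)
    # Each dataset element is now (image_resized_batch, label_batch);
    # the shuffle buffer size is an illustrative value
    dataset = dataset.shuffle(buffer_size=1000).batch(32, drop_remainder=True)
    return dataset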
Model building
Convert the model built with Keras into an NPUEstimator object through the model_to_npu_estimator interface.
Original TensorFlow code:
from tensorflow.keras.layers import Input
from tensorflow.keras.applications import ResNet50

# This returns a tensor
inputs = Input(shape=(224, 224, 3))

# This creates a ResNet50 model on top of the Input layer
keras_model = ResNet50(input_tensor=inputs, weights=None, include_top=True)
keras_model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

keras_model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=10)

Code after migration:
session_config = tf.ConfigProto()
run_config = NPURunConfig(enable_data_pre_proc=True,
                          session_config=session_config,
                          save_checkpoints_steps=2,
                          model_dir=model_path,
                          iterations_per_loop=10)
# Convert the Keras model to an NPUEstimator
est_resnet = model_to_npu_estimator(keras_model=keras_model, config=run_config)
# Run training
est_resnet.train(input_fn=lambda: input_fn(), max_steps=1000)
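The benefit of iterations_per_loop is fewer Host-Device round trips. As a rough illustration using the values from the example above (max_steps=1000, iterations_per_loop=10), the sketch below counts the session.run calls needed. The helper name is hypothetical, and rounding up for a final partial loop is an assumption.

```python
def session_run_calls(max_steps, iterations_per_loop):
    """Number of session.run calls needed to cover max_steps training iterations."""
    if max_steps < 0 or iterations_per_loop < 1:
        raise ValueError("max_steps must be >= 0 and iterations_per_loop >= 1")
    # Integer ceiling division: a trailing partial loop still costs one call.
    return (max_steps + iterations_per_loop - 1) // iterations_per_loop


print(session_run_calls(1000, 1))   # 1000
print(session_run_calls(1000, 10))  # 100
```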
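For distributed training on the Ascend AI Processor, change the optimizer used when compiling the Keras model: use a TensorFlow optimizer (not a Keras optimizer) and wrap it in NPUDistributedOptimizer.
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer
opt = tf.train.AdamOptimizer(0.01)
opt = NPUDistributedOptimizer(opt)
keras_model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')
Keras callbacks are not available once the model has been converted to an NPUEstimator.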

5 

5.1  5.2  5.3  5.4  5.5 

5.1 



AI device 

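Figure 5-1

Distributed training commonly uses one of two architectures: PS-workers or AllReduce. The PS-workers architecture is shown in Figure 5-2 and the AllReduce architecture in Figure 5-3. The Ascend platform uses the AllReduce architecture.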

Figure 5-2 PS-workers architecture

Figure 5-3 Allreduce architecture



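During distributed training, variables are synchronized across devices with broadcast and gradients are aggregated with allreduce.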

Two ways of implementing this are available:
- Use NPUDistributedOptimizer, which performs the allreduce of the gradients.
- Call the collective communication APIs directly: allreduce/broadcast/allgather/reduce_scatter/send/receive.
5.2 
5.2.1 Server 
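In the single-server scenario, the cluster consists of 1 server with up to 8 AI processors, and training can use 1, 2, 4 or 8 devices. Devices 0-3 and devices 4-7 form two groups, which constrains the device selection for 2-device and 4-device training.

Figure 5-4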
5.2.2 Server 

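A server cluster consists of a cluster management node plus a set of servers. A cluster holds up to 128 servers, each with 8 AI processors, so training uses 8*n devices, where n is the number of participating servers.
Besides using all 8 devices of every server, a cluster can also train on 1*n, 2*n or 4*n devices (n servers); in that case create the device group with create_group.

Figure 5-5

Figure 5-6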

Agent TensorFlowTensorFlow AI


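Besides all 8 devices, each server can contribute 1, 2 or 4 devices:
- Device selection within a server for the 1/2/4 cases:
  - 1 device
  - 2 devices: [0, 5], [1, 4], [2, 7], [3, 6]
  - 4 devices: [0, 2, 5, 7], [1, 3, 4, 6]
- broadcast/allreduce/reduce_scatter/allgather: 1/2/4 devices per server
- Device counts: 8*n (n servers), or 1*n/2*n/4*n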
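The device-count rules above can be captured in a small configuration check. This is a sketch only: the function names are hypothetical, the limits (1/2/4/8 devices per server, up to 128 servers) are taken from this section, and the membership test simply checks a selection against the combinations listed above without asserting anything further about the hardware topology.

```python
# Device combinations listed above for 2 and 4 devices per server.
LISTED_COMBINATIONS = {
    2: [(0, 5), (1, 4), (2, 7), (3, 6)],
    4: [(0, 2, 5, 7), (1, 3, 4, 6)],
}


def total_devices(devices_per_server, num_servers):
    """Return the cluster device count, e.g. 8*n for n fully used servers."""
    if devices_per_server not in (1, 2, 4, 8):
        raise ValueError("devices_per_server must be 1, 2, 4 or 8")
    if not 1 <= num_servers <= 128:
        raise ValueError("num_servers must be between 1 and 128")
    return devices_per_server * num_servers


def is_listed_combination(device_ids):
    """Check whether a 2- or 4-device selection is one of the listed combinations."""
    ids = tuple(sorted(device_ids))
    return ids in LISTED_COMBINATIONS.get(len(ids), [])


print(total_devices(8, 4))            # 32
print(is_listed_combination([5, 0]))  # True
print(is_listed_combination([0, 1]))  # False
```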
5.2.3 Atlas 300T training card (model 9000)
 AI



Servers are connected through 100G network ports, and collective communication uses the Ring + Halving-doubling algorithm.

Figure 5-7

1. Server
2. IP
3. allreduce/broadcast/allgather/reduce_scatter
4. HCCL_INTRA_PCIE_ENABLE and HCCL_INTRA_ROCE_ENABLE: select PCIe or RoCE
5. ranktable

5.3 


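Before distributed training, prepare a ranktable resource configuration file and point the RANK_TABLE_FILE environment variable at it.
The ranktable file is in JSON format. The following is a 2-device (2P) example, rank_table_2p.json.

Notes:
1. Single-server scenario: the ranktable describes 1/2/4/8 devices.
2. Server cluster scenario: the ranktable describes 8*n devices, where n is the number of servers.
3. The Atlas 300T training card (model 9000) has its own ranktable requirements.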



{
    "server_count": "1",                  // number of servers
    "server_list": [
        {
            "device": [                   // list of devices on the server
                {
                    "device_id": "0",             // HDC channel number
                    "device_ip": "192.168.0.2",   // IP address of the device NIC
                    "rank_id": "0"                // rank ID, starting from 0
                },
                {
                    "device_id": "1",
                    "device_ip": "192.168.1.2",
                    "rank_id": "1"
                }
            ],
            "server_id": "10.0.0.10"      // server ID, in IP address format
        }
    ],
    "status": "completed",                // ranktable status; completed means usable
    "version": "1.0"                      // ranktable template version, "1.0"
}
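A ranktable of this shape can be generated instead of written by hand. The following is a minimal sketch for the single-server case; the function name build_ranktable is hypothetical, and numbering device_id and rank_id by list position is an assumption that simply mirrors the 2P example above.

```python
import json


def build_ranktable(server_id, device_ips):
    """Build a single-server ranktable dict in the format shown above."""
    devices = [
        # Assumption: device_id and rank_id both follow the list position.
        {"device_id": str(i), "device_ip": ip, "rank_id": str(i)}
        for i, ip in enumerate(device_ips)
    ]
    return {
        "server_count": "1",
        "server_list": [{"device": devices, "server_id": server_id}],
        "status": "completed",
        "version": "1.0",
    }


table = build_ranktable("10.0.0.10", ["192.168.0.2", "192.168.1.2"])
print(json.dumps(table, indent=4))
```

The printed JSON (without the explanatory comments, which standard JSON does not allow) can be saved as rank_table_2p.json.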

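Table 5-1 ranktable fields

Field          Description
server_count   Number of servers participating in training.
status         Rank table status.
               completed: the rank table is usable.
               initializing: the rank table is not usable.
version        ranktable template version; set to "1.0".
server_list    List of servers.
server_id      Server ID, a string in IP address format, for example 10.0.0.10.
device_id      ID of the device (HDC channel number) on the server.
               Value range: [0-7].
device_ip      IP address of the device NIC, for example 192.168.1.2.
               Run cat /etc/hccn.conf on the server to query the IP addresses.
               For 8-device training, the ranktable must give the IP of every device under devices.
               For example, for device_id 0 with IP 192.168.100.101:
               "devices": [{"device_id": "0", "device_ip": "192.168.100.101"}]
               For single-server 1/2/4-device training, device_ip under devices may be the empty string "".
               For example, for device_id 0:
               "devices": [{"device_id": "0", "device_ip": ""}]
rank_id        Rank ID, starting from 0.
               Value range: [0, number of devices - 1].

A second ranktable format is also supported: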



{
    "status": "completed",                        // rank table status; completed means usable
    "group_count": "1",                           // number of groups; set to 1
    "group_list": [                               // list of groups
        {
            "group_name": "hccl_world_group",     // group name; hccl_world_group
            "instance_count": "2",                // number of instances
            "device_count": "2",                  // number of devices in the group
            "instance_list": [
                {
                    "pod_name": "tf-bae41",       // instance name
                    "server_id": "10.0.0.10",     // server ID, in IP address format
                    "devices": [                  // devices of the instance
                        {
                            "device_id": "0",             // HDC channel number
                            "device_ip": "192.168.0.2"    // IP address of the device NIC
                        }
                    ]
                },
                {
                    "pod_name": "tf-tbdf1",
                    "server_id": "10.0.0.10",
                    "devices": [
                        {
                            "device_id": "1",
                            "device_ip": "192.168.1.2"
                        }
                    ]
                }
            ]
        }
    ]
}

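Table 5-2 ranktable fields

Field           Description
status          Rank table status.
                completed: the rank table is usable.
                initializing: the rank table is not usable.
group_count     Number of groups; set to 1.
group_list      List of groups.
group_name      Group name. When group_count is 1, use hccl_world_group.
                hccl_world_group is the default group; other groups are created from the devices of hccl_world_group.
instance_count  Number of instances, that is, the number of pod_name entries in instance_list.
device_count    Number of devices in the group.
instance_list   List of instances.
pod_name        Instance name within instance_list.
server_id       Server ID, a string in IP address format, for example 10.0.0.10.
devices         List of devices of the instance.
device_id       ID of the device (HDC channel number) on the server.
                Value range: [0-7].
device_ip       IP address of the device NIC, for example 192.168.1.2.
                Run cat /etc/hccn.conf on the server to query the IP addresses.
                For 8-device training, the ranktable must give the IP of every device under devices.
                For example, for device_id 0 with IP 192.168.100.101:
                "devices": [{"device_id": "0", "device_ip": "192.168.100.101"}]
                For single-server 1/2/4-device training, device_ip under devices may be the empty string "".
                For example, for device_id 0:
                "devices": [{"device_id": "0", "device_ip": ""}]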

5.4 

5.4.1 
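In distributed training, shard the dataset so that each device reads its own part. Use get_rank_size for the number of devices and get_rank_id for the ID of the current device:
dataset = dataset.shard(get_rank_size(), get_rank_id())

dataset = dataset.repeat()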
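The sharding rule can be illustrated without TensorFlow: tf.data's shard(num_shards, index) keeps the elements whose position modulo num_shards equals index. The plain-Python sketch below mirrors that rule on a list; the helper name and the sample data are hypothetical.

```python
def shard(samples, rank_size, rank_id):
    """Keep every rank_size-th sample, starting at position rank_id."""
    return samples[rank_id::rank_size]


samples = list(range(10))  # stand-in for a dataset of 10 elements
for rank_id in range(4):   # 4 devices, i.e. rank_size = 4
    print(rank_id, shard(samples, 4, rank_id))
```

Every sample lands on exactly one device, although shard sizes can differ by one when the dataset size is not a multiple of the rank size.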
5.4.2 
TensorFlow usually configures distributed training through train_distribute in RunConfig. The Ascend platform does not support train_distribute; instead, wrap the optimizer in NPUDistributedOptimizer so that gradients are aggregated across NPU devices.
Original TensorFlow code:
def cnn_model_fn(features, labels, mode):
    # Build the network
    xxx
    # Compute the loss
    xxx
    # Configure the TrainingOp (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)  # SGD optimizer
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())  # minimize the loss
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

Code after migration:
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer

def cnn_model_fn(features, labels, mode):
    # Build the network
    xxx
    # Compute the loss
    xxx
    # Configure the TrainingOp (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)  # SGD optimizer
        distributedOptimizer = NPUDistributedOptimizer(optimizer)  # NPU distributed optimizer
        train_op = distributedOptimizer.minimize(loss=loss, global_step=tf.train.get_global_step())  # minimize the loss
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

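- If the original script computes gradients with the TensorFlow call grads = tf.gradients(loss, tvars), switch to NPUDistributedOptimizer and use its compute_gradients and apply_gradients methods instead.
- In Estimator mode, NPUDistributedOptimizer provides the allreduce, and NPUEstimator already includes NPUBroadcastGlobalVariablesHook, so no hand-written broadcast is needed. In sess.run mode, NPUDistributedOptimizer provides only the allreduce; the broadcast must be implemented in the script with the broadcast interface.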
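What the two collectives do to per-device data can be shown with a plain-Python simulation. This is a conceptual sketch only, with made-up numbers: it does not call HCCL, and it says nothing about how NPUDistributedOptimizer post-processes the aggregated gradients.

```python
def allreduce_sum(per_rank_values):
    """Simulate a "sum" allreduce: every rank ends up with the element-wise sum."""
    summed = [sum(vals) for vals in zip(*per_rank_values)]
    return [list(summed) for _ in per_rank_values]


def broadcast(per_rank_values, root=0):
    """Simulate a broadcast: every rank ends up with a copy of the root rank's values."""
    return [list(per_rank_values[root]) for _ in per_rank_values]


grads = [[1.0, 2.0], [3.0, 4.0]]     # two ranks, two gradient values each
print(allreduce_sum(grads))          # [[4.0, 6.0], [4.0, 6.0]]

weights = [[0.5, 0.5], [0.1, 0.9]]   # initial variables differ per rank
print(broadcast(weights, root=0))    # [[0.5, 0.5], [0.5, 0.5]]
```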
5.4.3 Horovod 
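Horovod is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. Unlike the traditional TensorFlow PS-worker architecture, Horovod aggregates gradients with Allreduce, which avoids the PS-worker bottleneck.
A script written for Horovod can be migrated to run distributed training on the Ascend AI Processor. The interface mapping is as follows.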


Table 5-3 Horovod interface mapping

Horovod interface                   Interface after migration
hvd.DistributedOptimizer            NPUDistributedOptimizer
hvd.init                            -
hvd.local_rank                      get_local_rank_id
hvd.size                            get_rank_size
hvd.rank                            get_rank_id
hvd.BroadcastGlobalVariablesHook    NPUDistributedOptimizer; GE Broadcast

In sess.run mode (as opposed to estimator.train), HCCL must be initialized before get_local_rank_id/get_rank_size/get_rank_id are called: run initialize_system in a session first, and after training run shutdown_system and close that session.



Original Horovod code:
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)

Code after migration:
import tensorflow as tf
from hccl.manage.api import get_local_rank_id
from hccl.manage.api import get_rank_size
from hccl.manage.api import get_rank_id
from npu_bridge.estimator import npu_ops
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

# Initialize HCCL before calling the HCCL group management interfaces
npu_int = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping
init_sess = tf.Session(config=config)
init_sess.run(npu_int)

# Pin GPU to be used to process local rank (one GPU per process)
config.gpu_options.visible_device_list = str(get_local_rank_id())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * get_rank_size())

# Add NPU Distributed Optimizer
opt = NPUDistributedOptimizer(opt)

# The Horovod broadcast hook is dropped; define an empty hook list so that
# the hooks argument below stays valid.
# hooks = [hvd.BroadcastGlobalVariablesHook(0)]
hooks = []

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if get_rank_id() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)

# Shut down HCCL and close the initialization session
init_sess.run(npu_shutdown)
init_sess.close()
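5.5 Collective Communication Interfaces
5.5.1 API Overview
In the common case NPUDistributedOptimizer performs the gradient allreduce. The interfaces in this section cover the remaining needs, such as querying rank information.

The interfaces are provided by the hccl Python package and by npu_bridge. Import them before use, for example:

from npu_bridge.estimator import npu_ops
from hccl.manage.api import get_rank_size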


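Table 5-4 Collective communication APIs

Rank and group management
Definition file: {install_path_fwkacllib}/fwkacllib/python/site-packages/hccl/hccl/manage/api.py
- create_group: creates a collective communication group.
- destroy_group: destroys a collective communication group.
- get_rank_size: returns the number of ranks (that is, Devices) in a group.
- get_local_rank_size: returns the number of local ranks of the group on the Server where the device resides.
- get_rank_id: returns the rank id of the device in a group.
- get_local_rank_id: returns the local rank id of the device in a group.
- get_world_rank_from_group_rank: converts a group rank id to the world rank id.
- get_group_rank_from_world_rank: converts a world rank id to the group rank id in a given group.

Gradient splitting
Definition file: {install_path_fwkacllib}/fwkacllib/python/site-packages/hccl/hccl/split/api.py
- set_split_strategy_by_idx: sets the gradient splitting strategy of a group based on gradient index ids.
- set_split_strategy_by_size: sets the gradient splitting strategy of a group based on gradient data size.

Collective communication operators
Definition file: {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/hccl/hccl_ops.py
- allreduce: allreduce within a group.
- allgather: allgather within a group; combines the input Tensors of all ranks.
- broadcast: broadcast within a group; sends the data of the root rank to the other ranks.
- reduce_scatter: reducescatter within a group.
- send: point-to-point send within a group.
- receive: point-to-point receive within a group.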


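Table 5-5 Concepts

- ranktable: the ranktable file, which describes the Servers and Devices that take part in collective communication.
- rank: a participant in collective communication; each rank corresponds to one Device.
- group: a set of ranks that communicate with each other.
  - hccl world group: the default group, containing every rank defined in the ranktable file.
  - user-defined group: a subset of the hccl world group. create_group turns ranks from the ranktable into a separate group.
- rank size:
  - rank size: the number of ranks in a group. The maximum is 4096.
  - local rank size: the number of ranks of the group on one Server. The value is 1, 2, 4, or 8.
- rank id:
  - rank id: the index of a rank within its group, in the range 0 ~ rank size-1. In a user-defined group the ranks are renumbered from 0. In the hccl world group the rank id equals the world rank id.
  - world rank id: the index of a rank within the hccl world group, in the range 0 ~ rank size-1.
  - local rank id: the index of a rank within its Server, in the range 0 ~ local rank size-1.
- gradient splitting: how the gradients are divided into segments for allreduce by the graph engine (GE).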
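The relationship between the three kinds of rank id can be illustrated without any Ascend software. The sketch below is plain Python, not the HCCL API: all three helper functions are hypothetical, and it assumes that world ranks are numbered contiguously Server by Server and that a user-defined group renumbers its members in the order they were listed.

```python
def local_rank_id(world_rank_id, local_rank_size):
    """Position of a rank inside its Server (0 ~ local_rank_size - 1).

    Assumes world ranks are assigned contiguously, Server by Server.
    """
    return world_rank_id % local_rank_size


def group_rank_from_world_rank(world_rank_id, group_world_ranks):
    """Rank id inside a user-defined group (0 ~ rank size - 1).

    Assumes the group renumbers its members in the order they were listed.
    """
    return group_world_ranks.index(world_rank_id)


def world_rank_from_group_rank(group_rank_id, group_world_ranks):
    """Inverse mapping: group rank id back to world rank id."""
    return group_world_ranks[group_rank_id]


# Two Servers with 8 Devices each: look up the local rank of world rank 11.
print(local_rank_id(11, local_rank_size=8))
# A hypothetical group built from world ranks 4, 5, 6 and 7.
group = [4, 5, 6, 7]
print(group_rank_from_world_rank(6, group), world_rank_from_group_rank(2, group))
```

On real hardware the same questions are answered by get_local_rank_id, get_group_rank_from_world_rank, and get_world_rank_from_group_rank.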

5.5.2 HCCL Initialization
If HCCL interfaces such as get_local_rank_id, get_rank_size, or get_rank_id are called before sess.run or estimator.train, first start a separate session that runs initialize_system. After training, run shutdown_system and close that session.

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

init_sess = tf.Session(config=config)
init_sess.run(npu_init)

# Call HCCL interfaces...
# Run training...

init_sess.run(npu_shutdown)
init_sess.close()

Alternatively:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    sess.run(npu_init)
    # Call HCCL interfaces...
    # Run training...
    sess.run(npu_shutdown)
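5.5.3 Creating a Group
create_group creates a user-defined group. If no group is created, the default group is hccl_world_group, which is built from the ranktable file.
The following example creates a group from two NPUs of hccl_world_group:

from hccl.manage.api import create_group
create_group("myGroup", 2, [0, 1])

5.5.4 Obtaining Group Information
The following interfaces return information about a group.

get_rank_size returns the number of NPUs in the group:

from hccl.manage.api import get_rank_size
rankSize = get_rank_size("myGroup")

get_local_rank_size returns the number of NPUs of the group on the Server where the current NPU resides:

from hccl.manage.api import get_local_rank_size
localRankSize = get_local_rank_size("myGroup")

get_rank_id returns the rank id of the current NPU in the group:

from hccl.manage.api import get_rank_id
rankId = get_rank_id("myGroup")

get_local_rank_id returns the local rank id of the current NPU within its Server:

from hccl.manage.api import get_local_rank_id
localRankId = get_local_rank_id("myGroup")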
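5.5.5 Gradient Splitting
The following interfaces set the gradient splitting strategy used for allreduce.

set_split_strategy_by_idx sets the splitting strategy of a group based on gradient index ids:

from hccl.split.api import set_split_strategy_by_idx
set_split_strategy_by_idx([20, 100, 159])

set_split_strategy_by_size sets the splitting strategy of a group based on the proportion of gradient data in each segment:

from hccl.split.api import set_split_strategy_by_size
set_split_strategy_by_size([60, 20, 20])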
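Both strategies describe a partition of the gradients into allreduce segments, only in different units. The plain-Python sketch below makes that partition explicit. The helpers are hypothetical and are not part of hccl. It assumes that the index list holds the last gradient index of each segment in ascending order, and that the size list holds percentages that sum to 100.

```python
def segments_from_idx(idx_list):
    """Convert last-index-per-segment into inclusive (first, last) ranges."""
    if idx_list != sorted(set(idx_list)):
        raise ValueError("indices must be strictly ascending")
    segments, first = [], 0
    for last in idx_list:
        segments.append((first, last))
        first = last + 1
    return segments


def check_size_strategy(percent_list):
    """Validate a by-size strategy and return the number of segments."""
    if any(p <= 0 for p in percent_list) or sum(percent_list) != 100:
        raise ValueError("percentages must be positive and sum to 100")
    return len(percent_list)


print(segments_from_idx([20, 100, 159]))
print(check_size_strategy([60, 20, 20]))
```

Under these assumptions, the index list [20, 100, 159] describes three segments covering gradients 0 to 20, 21 to 100, and 101 to 159.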
5.5.6 Collective Communication on Tensors
The following interfaces perform collective communication on tensors.

allreduce

allreduce provides the allreduce collective within a group. The reduce operation is selected by the reduction argument.

# allreduce test (2 npu)
import tensorflow as tf
from npu_bridge.hccl import hccl_ops
tensor = tf.random_uniform((1, 3), minval=1, maxval=10, dtype=tf.float32)
allreduce_test = hccl_ops.allreduce(tensor, "sum")

allgather

allgather provides the allgather collective within a group. It combines the input Tensors of all ranks.

# allgather test (2 npu)
import tensorflow as tf
from npu_bridge.hccl import hccl_ops
cCon = tf.constant([1.0, 2.0, 3.0])
allgather_test = hccl_ops.allgather(cCon, 2)
# rank 0/1: allgather_test = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

broadcast

broadcast provides the broadcast collective within a group. It sends the data of the root rank to the other ranks.

# broadcast test (2 npu)
import tensorflow as tf
from npu_bridge.hccl import hccl_ops
cCon = tf.Variable([1.0, 2.0, 3.0])
inputs = [cCon]
broadcast_test = hccl_ops.broadcast(inputs, 0)
# rank 0/1: broadcast_test = [1.0, 2.0, 3.0]

reducescatter

reduce_scatter provides the reducescatter collective within a group. The reduce operation is selected by the reduction argument.

# reducescatter test (2 npu)
import tensorflow as tf
from npu_bridge.hccl import hccl_ops
cCon = tf.constant([1.0, 2.0, 3.0, 4.0])
reducescatter_test = hccl_ops.reduce_scatter(cCon, "sum", 2)
# rank 0: reducescatter_test = [2.0, 4.0]
# rank 1: reducescatter_test = [6.0, 8.0]

send and receive

send provides point-to-point sending within a group.

# send test
from npu_bridge.hccl import hccl_ops
sr_tag = 0
dest_rank = 1
# tensor is the tf.Tensor to send, defined elsewhere in the script
hccl_ops.send(tensor, sr_tag, dest_rank)

receive provides point-to-point receiving within a group.

# receive test (2 npu)
from npu_bridge.hccl import hccl_ops
sr_tag = 0
src_rank = 0
tensor = hccl_ops.receive(tensor.shape, tensor.dtype, sr_tag, src_rank)
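The data movement behind these operators can be checked on a single machine. The sketch below simulates the four collectives in plain Python on lists, with one list per rank. It is not hccl_ops and the function names are hypothetical. It only models the "sum" reduction and assumes that reduce_scatter hands out equal slices in rank order.

```python
def allreduce_sum(inputs):
    """Every rank receives the element-wise sum over all ranks."""
    total = [sum(column) for column in zip(*inputs)]
    return [list(total) for _ in inputs]


def allgather(inputs):
    """Every rank receives the inputs of all ranks, concatenated in rank order."""
    gathered = [value for rank_input in inputs for value in rank_input]
    return [list(gathered) for _ in inputs]


def broadcast(inputs, root_rank):
    """Every rank receives the root rank's data."""
    return [list(inputs[root_rank]) for _ in inputs]


def reduce_scatter_sum(inputs):
    """Sum element-wise, then give each rank one equal slice of the result."""
    total = [sum(column) for column in zip(*inputs)]
    chunk = len(total) // len(inputs)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(inputs))]


# Two ranks, using the same inputs as the allgather and reducescatter tests.
print(allgather([[1.0, 2.0, 3.0]] * 2)[0])
print(reduce_scatter_sum([[1.0, 2.0, 3.0, 4.0]] * 2))
```

With two ranks the simulation reproduces the results given in the comments of the allgather and reducescatter tests.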


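6 More Features

6.1 Mixed Precision
6.2 Loss Scaling
6.3 Mixed Computing
6.4 Profiling
6.5 Dump
6.6 Iteration Offloading
6.7 Log/Summary
6.8 Data Preprocessing Performance
6.9 Gradient Splitting in Distributed Training
6.10 Converting ckpt to pb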

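6.1 Mixed Precision

Mixed precision training combines the float16 and float32 data types while largely keeping the accuracy of float32 training. The Ascend AI Processor supports the following precision modes:
- allow_fp32_to_fp16: keeps float32 where possible and lowers float32 to float16 only for operators that do not support float32, such as Conv2D and DepthwiseConv2D.
- force_fp16: when an operator supports both float16 and float32, float16 is forced.
- must_keep_origin_dtype: keeps the original data types. Conv2D supports only float16, so a network that feeds it float32 input cannot run in this mode.
- allow_mix_precision: automatic mixed precision. Selected float32 operators are automatically lowered from float32 to float16.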


AI float32float16 Loss Scaling castAI 

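Setting the precision mode in Estimator mode
In Estimator mode, set precision_mode through NPURunConfig:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator import npu_ops

npu_config = NPURunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False),
    precision_mode="allow_mix_precision"
)

In allow_mix_precision mode, the handling of an individual operator can be adjusted as follows:
1. Go to the /home/HwHiAiUser/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe/config/ directory.
"/home/HwHiAiUser/Ascend/nnae/latest/opp" is the opp installation path in this example.
2. Add write permission to aic-ascend910-ops-info.json:
chmod u+w aic-ascend910-ops-info.json
3. In aic-ascend910-ops-info.json, modify or add the precision_reduce field of the operator:
"precision_reduce":{
    "flag":"true"
}
- true: the operator may be lowered from float32 to float16.
- false: the operator must not be lowered from float32 to float16.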
­   

Setting the precision mode in sess.run mode
In sess.run mode, set precision_mode through the session configuration:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    print(sess.run(cost))

The operator-level adjustments for allow_mix_precision are the same as in Estimator mode.


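6.2 Loss Scaling

Loss Scaling addresses the loss of small gradient values that occurs when float16 is used in mixed precision training. The loss is multiplied by a Loss Scale factor S before the backward pass, and the gradients are divided by S again before the weights are updated.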
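The effect can be reproduced with the Python standard library alone, whose struct module can round a value to float16. The sketch below is only an illustration of the arithmetic. The gradient value and the factor 1024 are arbitrary assumptions, not values used by the Ascend software.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest float16 and return it as a float."""
    return struct.unpack("e", struct.pack("e", value))[0]


gradient = 1e-8   # below the smallest positive float16 value (about 6e-8)
scale = 1024.0    # hypothetical Loss Scale factor S

unscaled = to_float16(gradient)                 # underflows to zero
rescued = to_float16(gradient * scale) / scale  # representable after scaling

print(unscaled, rescued)
```

Without scaling the gradient is lost entirely. With scaling it survives the float16 round trip with a small rounding error.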

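Using Loss Scaling
If the original script uses LossScaleOptimizer, replace it with NPULossScaleOptimizer. NPUOptimizer can also be used in place of NPULossScaleOptimizer.
- Static Loss Scaling: the Loss Scale factor stays fixed during training. Before creating NPULossScaleOptimizer, instantiate a FixedLossScaleManager with the Loss Scale value.
- Dynamic Loss Scaling: the Loss Scale factor is adjusted during training. Before creating NPULossScaleOptimizer, instantiate an ExponentialUpdateLossScaleManager that describes how the Loss Scale is updated.
In distributed training, set is_distributed=True on NPULossScaleOptimizer so that Loss Scaling is supported.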
Original TensorFlow code:

if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.bert_loss_scale == 0:
        loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32, incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2, decr_ratio=0.5)
    elif FLAGS.bert_loss_scale >= 1:
        loss_scale_manager = tf.contrib.mixed_precision.FixedLossScaleManager(
            loss_scale=FLAGS.bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)
    opt = tf.contrib.mixed_precision.LossScaleOptimizer(opt_tmp, loss_scale_manager)

Code after migration:

from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer
from npu_bridge.estimator.npu.npu_loss_scale_optimizer import NPULossScaleOptimizer
from npu_bridge.estimator.npu.npu_loss_scale_manager import FixedLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_manager import ExponentialUpdateLossScaleManager

if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.bert_loss_scale == 0:
        loss_scale_manager = ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32, incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2, decr_ratio=0.5)
    elif FLAGS.bert_loss_scale >= 1:
        loss_scale_manager = FixedLossScaleManager(loss_scale=FLAGS.bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)
    # With more than one device, wrap the optimizer for distributed training
    if ops_adapter.size() > 1:
        opt_tmp = NPUDistributedOptimizer(opt_tmp)
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager, is_distributed=True)
    else:
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager)
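The parameters passed to ExponentialUpdateLossScaleManager describe a simple counting policy. The class below is a simplified plain-Python model of such a policy, written only to show how the parameters interact. It is not the npu_bridge or TensorFlow implementation. The class name, the incr_ratio default of 2, and the lower bound of 1 on the scale are assumptions.

```python
class DynamicLossScale:
    """Simplified model of an exponential-update loss scale policy."""

    def __init__(self, init_loss_scale, incr_every_n_steps,
                 decr_every_n_nan_or_inf, incr_ratio=2.0, decr_ratio=0.5):
        self.loss_scale = float(init_loss_scale)
        self.incr_every_n_steps = incr_every_n_steps
        self.decr_every_n_nan_or_inf = decr_every_n_nan_or_inf
        self.incr_ratio = incr_ratio
        self.decr_ratio = decr_ratio
        self.good_steps = 0
        self.bad_steps = 0

    def update(self, grads_are_finite):
        """Record one training step and adjust the scale if a threshold is hit."""
        if grads_are_finite:
            self.good_steps += 1
            self.bad_steps = 0
            if self.good_steps >= self.incr_every_n_steps:
                self.loss_scale *= self.incr_ratio
                self.good_steps = 0
        else:
            self.bad_steps += 1
            self.good_steps = 0
            if self.bad_steps >= self.decr_every_n_nan_or_inf:
                self.loss_scale = max(1.0, self.loss_scale * self.decr_ratio)
                self.bad_steps = 0
        return self.loss_scale


manager = DynamicLossScale(init_loss_scale=2**32, incr_every_n_steps=1000,
                           decr_every_n_nan_or_inf=2)
manager.update(False)          # first overflow: scale unchanged
print(manager.update(False))   # second overflow: scale is multiplied by decr_ratio
```

In this model the scale grows after a long run of finite gradients and shrinks quickly once overflows accumulate.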
Updating the global step
After Loss Scaling is enabled, the steps in which Loss Scaling overflows must be discarded. Whether a change is needed depends on how the optimizer in use updates the step.
- resnet50HC uses tf.train.MomentumOptimizer, which updates the global step inside apply_gradients, so the step is not updated when an overflow occurs.
- BERT updates the global step in create_optimizer, so the global step update must be moved into the optimizer.

Original TensorFlow code, where the global step is updated in create_optimizer:

def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
                     manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
                     optimizer_type="adam", allreduce_post_accumulation=False):
    ...
    if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
        new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
    else:
        new_global_step = global_step + 1
    new_global_step = tf.identity(new_global_step, name='step_update')
    train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op

After migration to the Ascend platform, move the global step update into the optimizer:
1. Comment out the global step update in create_optimizer:

def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
                     manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
                     optimizer_type="adam", allreduce_post_accumulation=False):
    ...
    #if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
    #    new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
    #else:
    #    new_global_step = global_step + 1
    #new_global_step = tf.identity(new_global_step, name='step_update')
    #train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op

2. In the apply_gradients function of the AdamWeightDecayOptimizer and LAMBOptimizer classes, update the global step before the return. apply_gradients is called only when Loss Scaling does not overflow.

def apply_gradients(self, grads_and_vars, global_step=None, name=None, manual_fp16=False):
    assignments = []
    for (grad, param) in grads_and_vars:
        ...
    new_global_step = global_step + 1
    new_global_step = tf.identity(new_global_step, name='step_update')
    assignments.extend([global_step.assign(new_global_step)])
    return tf.group(*assignments, name=name)

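6.3 Mixed Computing

By default, computation is offloaded to the Device side of the Ascend AI Processor. In mixed computing mode, selected operators stay in TensorFlow and run on the Host instead.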


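- Operators that are not offloaded to the Ascend AI Processor are executed by TensorFlow.
- In mixed computing mode, iterations_per_loop must be 1.
- Operators can also be kept off the Device explicitly with without_npu_compile_scope.

Enabling mixed computing in Estimator mode
In Estimator mode, set mix_compile_mode through NPURunConfig:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator import npu_ops

session_config = tf.ConfigProto()
config = NPURunConfig(session_config=session_config, mix_compile_mode=True, iterations_per_loop=1)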
Enabling mixed computing in sess.run mode
In sess.run mode, set mix_compile_mode through the session configuration and mark operators with without_npu_compile_scope:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from npu_bridge.estimator.npu import npu_scope
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

X = tf.random_normal([2,])
Y = tf.random_normal([2,])

with npu_scope.without_npu_compile_scope():
    pred = tf.add(tf.multiply(X, 1.), 0.)

cost = tf.reduce_sum(tf.abs(pred-Y))

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["mix_compile_mode"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    print(sess.run(cost))  # reduce_sum ... Host

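6.4 Profiling

Profiling data can be collected during training and then analyzed with the Profiling tool. The following kinds of data can be collected:
- training_trace: iteration trace data, that is, software information about the training job and the AI software stack.
- task_trace: task trace data, that is, hardware information from HWTS/AICore on the Ascend AI Processor.
- op_trace: single-operator trace data. It cannot be collected together with training_trace or task_trace.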


Profiling can be enabled in the following ways.

Collecting Profiling data in Estimator mode
In Estimator mode, enable Profiling through profiling_config in NPURunConfig:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_config import ProfilingConfig

profiling_options = ['task_trace', 'training_trace']
profiling_config = ProfilingConfig(enable_profiling=True, enable_options=profiling_options)
session_config = tf.ConfigProto()
config = NPURunConfig(profiling_config=profiling_config, session_config=session_config)

The following environment variables must also be set:

export FP_POINT=resnet_v1_50_1/conv1/Conv2D
export BP_POINT=add_1

Collecting Profiling data in sess.run mode
In sess.run mode, enable Profiling through the profiling_mode and profiling_options session parameters:

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["profiling_mode"].b = True
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes("task_trace:training_trace")

with tf.Session(config=config) as sess:
    print(sess.run(cost))

The following environment variables must also be set:

export FP_POINT=resnet_v1_50_1/conv1/Conv2D
export BP_POINT=add_1



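Profiling can also be enabled through environment variables:

export PROFILING_MODE=true
export PROFILING_OPTIONS=training_trace:task_trace
export FP_POINT=resnet_v1_50_1/conv1/Conv2D
export BP_POINT=add_1

Viewing Profiling data
Profiling data is written under /var/log/npu/profiling/ in a JOBxxxxAAA directory, which contains:
- first_runtime_task_trace_data: kernel name information.
- hwts.log.data.45.dev.profiler_default_tag: AICORE data.
- ts_track.data.44.dev.profiler_default_tag: AICPU data.
- training_trace.46.dev.profiler_default_tag: training trace data.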

Analyzing Profiling data
Use the Profiling tool to parse and analyze the collected Profiling data.

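6.5 Dump

Data Dump exports the input and output data of operators. The following Dump modes are supported:
- input: dump operator inputs only.
- output: dump operator outputs only.
- all: dump both operator inputs and outputs.

Note: dump data can reach gigabytes in size, so it is advisable to dump only the steps that are needed.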

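Collecting Dump data in Estimator mode
In Estimator mode, enable Dump through dump_config in NPURunConfig. Before creating the NPURunConfig, instantiate a DumpConfig that enables dump and sets the dump path, the steps to dump, and the dump mode.

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_config import DumpConfig

# enable_dump: whether to enable dump
# dump_path: dump path; dump files are generated under /var/log/npu/ide_daemon/dump/{dump_path}
# dump_step: steps to dump; None dumps every step; separate steps with "|",
#            for example 0|5|10; use "-" for a range, for example 0|3-5|10
# dump_mode: which part of each operator to dump: input/output/all
dump_config = DumpConfig(enable_dump=True, dump_path="/tmp", dump_step="0|5|10", dump_mode="all")

session_config = tf.ConfigProto()

config = NPURunConfig(dump_config=dump_config, session_config=session_config)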

dump_path"/" Dump
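The dump_step syntax ("|" between entries, "-" for a range) is easy to get wrong when it is assembled by a script. The helper below is a plain-Python sketch of how such a string expands into step numbers. The function name is hypothetical and the code is not part of npu_bridge.

```python
def parse_dump_step(dump_step):
    """Expand a dump_step string such as "0|3-5|10" into a sorted list of steps.

    Returns None when dump_step is None, meaning every step is dumped.
    """
    if dump_step is None:
        return None
    steps = set()
    for part in dump_step.split("|"):
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            steps.update(range(first, last + 1))
        else:
            steps.add(int(part))
    return sorted(steps)


print(parse_dump_step("0|3-5|10"))
```

Expanding the string before training starts gives a quick check of how many steps will produce dump files.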

Collecting Dump data in sess.run mode
In sess.run mode, enable Dump through the enable_dump, dump_path, dump_step, and dump_mode session parameters:

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# Whether to enable dump
custom_op.parameter_map["enable_dump"].b = True
# Dump path; dump files are generated under /var/log/npu/ide_daemon/dump/{dump_path}
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/tmp")
# Steps to dump; None dumps every step; separate steps with "|",
# for example 0|5|10; use "-" for a range, for example 0|3-5|10
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10")
# Which part of each operator to dump: input/output/all
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")

with tf.Session(config=config) as sess:
    print(sess.run(cost))
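Viewing Dump files
Dump files are generated under /var/log/npu/ide_daemon/dump/{dump_path}/{time}/deviceid/model_name/model_id/data_index. The GE graph is dumped to ge_proto_xxxxx_Bulid.txt.

- /var/log/npu/ide_daemon/dump is the path set by "DUMP_PATH" in the ada configuration file ide_daemon/ide_daemon.cfg. If "DUMP_PATH" is empty, dump files are saved on the Host under {WORK_PATH}/ide_daemon/dump/{dump_path}, where {WORK_PATH} is the "WORK_PATH" value in ide_daemon.cfg ("~" refers to the home directory of the ada user).
- {dump_path}: the configured dump path.
- {time}: the time stamp, for example 20200317020343.
- deviceid: the Device ID.
- model_name: the model name.
- model_id: the model ID.
- data_index: the dump step. If dump_step is set, data_index matches dump_step. If dump_step is not set, data_index starts at 0 and increases by 1 with each dump.
- Dump files are named {op_type}.{op_name}.{taskid}.{timestamp}.
- The characters ".", "/", and "\" in model_name, op_type, and op_name are replaced in the file name.

In multi-device (multi-P) scenarios, dump data is generated for each device.

Analyzing Dump files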


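6.6 Iteration Offloading

iterations_per_loop is the number of training iterations executed on the Device for each session.run call. The Device returns to the Host only after iterations_per_loop iterations, which reduces the interaction between Host and Device. Note the following:
- iterations_per_loop defaults to 1.
- When iterations_per_loop > 1, save_checkpoints_steps must be a multiple of iterations_per_loop. Otherwise checkpoints are not saved at the interval given by save_checkpoints_steps. Also, when iterations_per_loop > 1, save_summary_steps and log_step_count_steps may not take effect as configured; see Log/Summary.
- When mix_compile_mode is True, iterations_per_loop must be 1.
- The getnext operator is executed on the Device only when enable_data_pre_proc is enabled and tf.data.make_initializable_iterator() is used. Only then does iterations_per_loop greater than 1 take effect.
If enable_data_pre_proc is disabled or tf.data.make_one_shot_iterator() is used, getnext is not offloaded to the Device and iterations_per_loop must be 1.
In Estimator mode, when input_fn returns a dataset, Estimator uses tf.data.make_initializable_iterator().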
 iterations_per_loop1 iterations_per_loop
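The constraints above lend themselves to a check at script start-up. The function below is a plain-Python sketch of such a check. It is not part of npu_bridge, and its name and messages are hypothetical. It encodes only two rules as read from this section: mixed computing requires a value of 1, and save_checkpoints_steps should be a multiple of iterations_per_loop.

```python
def check_loop_config(iterations_per_loop, save_checkpoints_steps=None,
                      mix_compile_mode=False):
    """Return a list of problems found in an iterations_per_loop setup."""
    problems = []
    if iterations_per_loop < 1:
        problems.append("iterations_per_loop must be at least 1")
    if mix_compile_mode and iterations_per_loop != 1:
        problems.append("mix_compile_mode requires iterations_per_loop == 1")
    if (iterations_per_loop > 1 and save_checkpoints_steps is not None
            and save_checkpoints_steps % iterations_per_loop != 0):
        problems.append("save_checkpoints_steps should be a multiple of "
                        "iterations_per_loop")
    return problems


print(check_loop_config(10, save_checkpoints_steps=115))
```

An empty list means that neither of the two rules is violated. It does not replace the checks performed by the framework itself.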

Setting iterations_per_loop in Estimator mode
In Estimator mode, set iterations_per_loop through NPURunConfig:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator import npu_ops

session_config = tf.ConfigProto()
config = NPURunConfig(session_config=session_config, iterations_per_loop=10)

Setting iterations_per_loop in session.run mode
In session.run mode, set iterations_per_loop through set_iteration_per_loop. The value passed to set_iteration_per_loop must match the iterations_per_loop value in the session configuration.

from __future__ import print_function
import input_data
from npu_bridge.estimator.npu import util
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

mnist = input_data.read_data_sets("/test/", one_hot=True)

import tensorflow as tf

# Hyperparameters
learning_rate = 0.01   # learning rate
training_epochs = 10   # number of epochs
batch_size = 100       # batch size
display_step = 1       # display interval

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# Model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Model
pred = tf.nn.softmax(tf.matmul(x, W) + b)

# Loss
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))

# Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Variable initialization
init = tf.global_variables_initializer()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # train on the Ascend AI Processor
custom_op.parameter_map["mix_compile_mode"].b = False  # mixed computing disabled
custom_op.parameter_map["iterations_per_loop"].i = 10  # must match set_iteration_per_loop
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

# Training
with tf.Session(config=config) as sess:
    sess.run(init)
    # Each sess.run call executes 10 iterations on the Device
    train_op = util.set_iteration_per_loop(sess, optimizer, 10)

    for epoch in range(training_epochs):
        avg_cost = 0
        total_batch = int(mnist.train.num_examples / batch_size)

        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            _, c = sess.run([train_op, cost], feed_dict={x: batch_xs, y: batch_ys})
            avg_cost += c / total_batch
When the session is managed by tf.train.Supervisor, use create_iteration_per_loop_var and load_iteration_per_loop_var instead of set_iteration_per_loop. The imports, model definition, and session configuration are the same as in the previous example; only the session body changes:

# Training
with tf.Session(config=config) as sess:
    sess.run(init)
    # Each sess.run call executes 10 iterations on the Device
    iteration = util.IterationPerLoop()
    train_op = iteration.create_iteration_per_loop_var(optimizer)  # modifies the graph
    tf.train.Supervisor(logdir="/home/xxxx", init_op=init)  # freezes the graph
    iteration.load_iteration_per_loop_var(sess, 10)  # sets the number of iterations per loop

    for epoch in range(training_epochs):
        avg_cost = 0
        total_batch = int(mnist.train.num_examples / batch_size)

        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            _, c = sess.run([train_op, cost], feed_dict={x: batch_xs, y: batch_ys})
            avg_cost += c / total_batch
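Checking whether iterations_per_loop takes effect
If "Insert op success" appears in the Host log, iterations_per_loop has taken effect, as shown in Figure 6-1.

Figure 6-1 iterations_per_loop ... 1

For the other iterations_per_loop case, see Figure 6-2.

Figure 6-2 iterations_per_loop ... 1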

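6.7 Log/Summary

When training is offloaded to the Device, the Log and Summary information produced on the Device at each step must be sent back to the Host.

Log printing
In Estimator mode, the Log data is passed to the Host through dequeue, so the script only needs to print on the Device side:

print_op = tf.print(loss)
with tf.control_dependencies([print_op]):
    train_op = xxx  # train_op depends on print_op, so the print is executed

In sess.run mode, the Host must start its own dequeue thread to receive the Log data:

from threading import Thread
import sys
import tensorflow as tf
from npu_bridge.estimator import npu_ops

def dequeue():
    global config
    tf.reset_default_graph()
    outfeed_log_tensors = npu_ops.outfeed_dequeue_op(
        channel_name="_npu_log",
        output_types=[tf.string],
        output_shapes=[()])
    dequeue_ops = tf.print(outfeed_log_tensors, sys.stderr)
    with tf.Session() as sess:
        i = 0
        # get_next_times is the number of iterations, defined by the training script
        while i < get_next_times:
            sess.run(dequeue_ops)
            i = i + 1

t1 = Thread(target=dequeue)
t1.start()

On the Device side, use Assert or Print operators to output the Log: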

print_op = tf.print(loss)
with tf.control_dependencies([print_op]):
    train_op = xxx  # train_op depends on print_op, so the print is executed

Summary
In Estimator mode, use host_call to record Summary data:

def _host_call_fn(gs, loss):
    with summary.create_file_writer("./model", max_queue=1000).as_default():
        with summary.always_record_summaries():
            summary.scalar("host_call_loss", loss, step=gs)
            return summary.all_summary_ops()

Pass host_call to NPUEstimatorSpec. The Device enqueues the Summary data and the Host dequeues it, so that the Summary information of each step on the Device reaches the Host.
host_call is a tuple of a function and a list of tensors that are passed to the function. It applies to train() and evaluate().

from npu_bridge.estimator.npu.npu_estimator import NPUEstimatorSpec

host_call = (_host_call_fn, [global_step, loss])
return NPUEstimatorSpec(mode=tf.estimator.ModeKeys.TRAIN, loss=loss, train_op=train_op, host_call=host_call)

Complete example:

from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator
from npu_bridge.estimator.npu.npu_estimator import NPUEstimatorSpec

# Define host_call
from tensorflow.contrib import summary
def _host_call_fn(gs, loss):
    with summary.create_file_writer("./model", max_queue=1000).as_default():
        with summary.always_record_summaries():
            summary.scalar("host_call_loss", loss, step=gs)
            return summary.all_summary_ops()

def input_fn():
    "Build the dataset"

# Pass host_call in model_fn
def model_fn():
    "Build the model"
    model = ***
    loss = ***
    optimizer = tf.train.MomentumOptimizer(learning_rate=c, momentum=0.9)
    global_step = tf.train.get_or_create_global_step()
    grad_vars = optimizer.compute_gradients(loss)
    minimize_op = optimizer.apply_gradients(grad_vars, global_step)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    print_op = tf.print(loss)
    with tf.control_dependencies([print_op]):
        train_op = tf.group(minimize_op, update_ops)
    host_call = (_host_call_fn, [global_step, loss])
    return NPUEstimatorSpec(mode=tf.estimator.ModeKeys.TRAIN, loss=loss, train_op=train_op, host_call=host_call)

run_config = NPURunConfig()
classifier = NPUEstimator(model_fn=model_fn, config=run_config, params={})
classifier.train(input_fn=lambda: input_fn(), max_steps=get_next_times)


6.8 Data Preprocessing Performance

 Host CPUAI
AI AIAI Host CPU
AImapbatchmap_and_batch Host CPU
AI
TensorFlow
TFRecordDatasetshuffleAImapbatch AI
train_dataset = tf.contrib.data.TFRecordDataset("./train_new.tfrecords")
train_dataset = train_dataset.shuffle(1000)
train_dataset = train_dataset.map(parse_tf)
train_dataset = train_dataset.batch(batch_size)

train_dataset = tf.contrib.data.TFRecordDataset("./train_new.tfrecords")
train_dataset = train_dataset.shuffle(1000)
train_dataset = train_dataset.map(parse_tf)
train_dataset = train_dataset.batch(batch_size)
train_dataset = train_dataset.prefetch(buffer_size=buffer_size)
mapbatchprefetchprefetchAI Host CPU
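Binding CPU cores
In multi-device (multi-P) scenarios, the training processes compete for Host CPU resources. Binding each process to its own CPU cores avoids this. For a Host with 8 NPUs:
1. Query the total number of Host CPU cores, for example Total CPU = 96.
2. Calculate the number of Host CPU cores n available to each process:
n = Total CPU / 8 = 12
3. Launch each process with "taskset -c" so that it is bound to its own range of n Host CPU cores.
Device0:
taskset -c 0-11 python3.7 /home/test/xxx.py /
Device7:
taskset -c 84-95 python3.7 /home/test/xxx.py /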
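The core ranges for all devices follow from the same arithmetic. The short script below is a sketch that derives the taskset prefix for each device. The function name is hypothetical, and it assumes that cores are divided evenly and assigned in device order, as in the Device0 and Device7 commands above.

```python
def taskset_prefix(device_id, total_cpus, num_devices=8):
    """Return the taskset prefix that pins one training process to its share of cores."""
    cores_per_device = total_cpus // num_devices
    first = device_id * cores_per_device
    last = first + cores_per_device - 1
    return "taskset -c {}-{}".format(first, last)


for device_id in range(8):
    print(device_id, taskset_prefix(device_id, total_cpus=96))
```

With 96 cores and 8 devices, the first and last lines printed match the Device0 and Device7 commands.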



6 

6.9 Gradient Splitting in Distributed Training



Device   
96.54% 3.46%  


ProfilingTraining Trace 

ProfilingProfiling ProfilingProfiling
The following points are recorded in the trace: fp_start, bp_end, allreduce1_start, allreduce1_end, allreduce2_start, allreduce2_end, and Iteration_end.


 AR1BPFP  AR2


  
1AR1AR2 AR2
50%50%

80%20%

2AR1AR1BPFP AR1BPFP
90%10%

80%20%


3BPFP AR12 AR2AR21BPFP BPFP


Set the allreduce gradient splitting strategy with one of the following interfaces.

set_split_strategy_by_idx sets the splitting strategy of a group based on gradient index ids:

from hccl.split.api import set_split_strategy_by_idx
set_split_strategy_by_idx([20, 100, 159])

set_split_strategy_by_size sets the splitting strategy of a group based on the proportion of gradient data in each segment:

from hccl.split.api import set_split_strategy_by_size
set_split_strategy_by_size([60, 20, 20])

Complete example of tuning the allreduce splitting strategy:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from hccl.split.api import set_split_strategy_by_size
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# Enable Profiling
custom_op.parameter_map["profiling_mode"].b = True
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes("training_trace")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    sess.run(npu_init)
    # Set the gradient splitting strategy
    set_split_strategy_by_size([80, 20])
    # Call allreduce...
    # Run training...
    sess.run(npu_shutdown)

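6.10 Converting ckpt to pb

6.10.1 Overview
In TensorFlow training, the model is saved with saver = tf.train.Saver() and saver.save(). Each saver.save() call generates the following files:
- checkpoint: a text file that records the latest checkpoint and the list of other checkpoints.
- model.ckpt.data-00000-of-00001: the variable values.
- model.ckpt.index: the index of the variables.
- model.ckpt.meta: the graph structure.
The TensorFlow freeze_graph tool converts these files into a pb file.

To generate a pb file with freeze_graph:
1. Specify the checkpoint path.
2. Define the input node, replacing IteratorV2 with a placeholder.
3. Define the output node in place of the loss, for example Argmax or BiasAdd.
4. Handle BatchNorm and dropout, which behave differently in training and inference:
- BatchNorm: use the inference behaviour of batchnorm.
- dropout: set the dropout rate to 1 for inference.
if is_training:
    x = npu_ops.dropout(x, 0.65)
else:
    x = npu_ops.dropout(x, 1.0)
Usually this is done by building the network with is_training=False: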


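# Build the network with alexnet.inference in inference mode
logits = alexnet.inference(inputs, version="he_uniform", num_classes=1000, is_training=False)
5. Use tf.train.write_graph to generate a pb file that serves as the input of freeze_graph.
6. Use freeze_graph to merge the pb file generated by tf.train.write_graph with the checkpoint into the final pb file.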
6.10.2 Example Code

import tensorflow as tf
from tensorflow.python.tools import freeze_graph
from npu_bridge.estimator import npu_ops

# Import the network model
import alexnet
# Path of the checkpoint file
ckpt_path = "/opt/npu/model_ckpt/alexnet/model_8p/model.ckpt-0"

def main():
    tf.reset_default_graph()
    # Define the input node
    inputs = tf.placeholder(tf.float32, shape=[None, 224, 224, 3], name="input")
    # Build the inference graph
    logits = alexnet.inference(inputs, version="he_uniform", num_classes=1000, is_training=False)
    # Define the output node
    predict_class = tf.argmax(logits, axis=1, output_type=tf.int32, name="output")
    with tf.Session() as sess:
        # Write model.pb to ./pb_model;
        # model.pb is the input_graph passed to freeze_graph
        tf.train.write_graph(sess.graph_def, './pb_model', 'model.pb')
        freeze_graph.freeze_graph(
            input_graph='./pb_model/model.pb',   # model file generated by write_graph
            input_saver='',
            input_binary=False,
            input_checkpoint=ckpt_path,          # checkpoint file
            output_node_names='output',          # must match the output node defined above
            restore_op_name='save/restore_all',
            filename_tensor_name='save/Const:0',
            output_graph='./pb_model/alexnet.pb',  # generated pb file
            clear_devices=False,
            initializer_nodes='')
    print("done")

if __name__ == '__main__':
    main()
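Key freeze_graph parameters:
- input_graph: the model file generated by write_graph.
- input_binary: whether input_graph is a binary file. Use true for a binary file and false for a text file. The input_graph in this example is a text file, so the value is False.
- input_checkpoint: the checkpoint path.
- output_node_names: the names of the output nodes.
- output_graph: the path of the generated pb file.
After the script finishes, ./pb_model/alexnet.pb is the resulting pb file.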




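7 Executing Training

Set the environment variables and start the training script from a bash script, then run it, for example: bash run_npu.sh

The example below assumes that fwkacllib and opp are installed under /home/HwHiAiUser/Ascend/nnae/latest, the driver under /usr/local/Ascend, and tfplugin under /home/HwHiAiUser/Ascend/tfplugin/latest.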

# Environment variables
export install_path=/home/HwHiAiUser/Ascend
export LD_LIBRARY_PATH=/usr/local/lib/:/usr/lib/:$install_path/nnae/latest/fwkacllib/lib64/:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/
export PYTHONPATH=$install_path/nnae/latest/fwkacllib/python/site-packages/te:$install_path/nnae/latest/fwkacllib/python/site-packages/topi:$install_path/nnae/latest/fwkacllib/python/site-packages/hccl:$install_path/tfplugin/latest/tfplugin/python/site-packages:$install_path/nnae/latest/opp/op_impl/built-in/ai_core/tbe
export PATH=$install_path/nnae/latest/fwkacllib/ccec_compiler/bin:$PATH
export ASCEND_OPP_PATH=$install_path/nnae/latest/opp
export SOC_VERSION=Ascend910
export JOB_ID=10087
export DEVICE_ID=0
export RANK_TABLE_FILE=/home/test/rank_table_2p.json
# Rank settings
export RANK_ID=0
export RANK_SIZE=8
# Run the training script
python3.7 /home/test/xxx.py /

In multi-device (multi-P) scenarios, DEVICE_ID and RANK_ID must be set separately for each process, for example:

export DEVICE_ID=1
export RANK_ID=1
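When several processes are launched from one script, the per-process values can be generated rather than written by hand. The sketch below builds the settings for every process. The function name is hypothetical, and it assumes a single Server where RANK_ID and DEVICE_ID count up together from 0, as in the example above. Other layouts need a different mapping.

```python
def per_device_env(rank_size, first_device_id=0):
    """Build the DEVICE_ID / RANK_ID / RANK_SIZE settings for each process."""
    envs = []
    for rank_id in range(rank_size):
        envs.append({
            "DEVICE_ID": str(first_device_id + rank_id),
            "RANK_ID": str(rank_id),
            "RANK_SIZE": str(rank_size),
        })
    return envs


for env in per_device_env(rank_size=8):
    exports = " ".join("export {}={};".format(k, v) for k, v in sorted(env.items()))
    print(exports)
```

Each printed line can be placed in front of the launch command of the corresponding process.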
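Table 7-1 describes the environment variables.

Table 7-1 Environment variables
- LD_LIBRARY_PATH: search path for dynamic libraries. On CentOS, Debian, and BClinux systems where gcc has been upgraded, also add ".../xxx/xxx/xxx/lib64", where ".../xxx/xxx/xxx/" is the gcc installation path.
- PYTHONPATH: search path for Python modules.
- PATH: search path for executables.
- ASCEND_OPP_PATH: installation path of the operator package (opp).
- SOC_VERSION: version of the processor, Ascend910.
- JOB_ID: ID of the training job.
- DEVICE_ID: Device ID. The value range is [0,7].
- RANK_TABLE_FILE: path of the ranktable file.
- RANK_ID: rank identifier of the process, corresponding to rank_id or pod_name in the ranktable file.
- RANK_SIZE: number of Devices.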

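- GE_USE_STATIC_MEMORY: memory allocation mode used when the network runs. Set it to 1 for networks that need a large amount of feature map memory (25G), such as 24-layer BERT.
The total memory is 31G and is shared between graph_memory_max_size and variable_memory_max_size. In Estimator mode, graph_memory_max_size and variable_memory_max_size are set through NPURunConfig. In sess.run mode, they are set through the session configuration.
- TE_PARALLEL_COMPILER: number of processes used for parallel operator compilation. The default is 8, and 0 disables parallel compilation. In multi-P scenarios the value should not exceed (number of host cpu cores * 80%) / number of P.
- PROFILING_MODE: whether to enable Profiling.
  - true: Profiling is enabled. The data to collect is selected by PROFILING_OPTIONS.
  - false: Profiling is disabled.
- PROFILING_OPTIONS: the Profiling data to collect.
  - training_trace: iteration trace data, that is, software information about the training job and the AI software stack.
  - task_trace: task trace data, that is, hardware information from HWTS/AICore on the Ascend AI Processor.
  - op_trace: single-operator trace data. It cannot be collected together with training_trace or task_trace.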

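- FP_POINT: the operator that marks the start of forward propagation in the Profiling iteration trace. Save the graph with tf.io.write_graph and take the operator name from graph.pbtxt.
- BP_POINT: the operator that marks the end of backward propagation in the Profiling iteration trace. BP_POINT is used together with FP_POINT. Save the graph with tf.io.write_graph and take the operator name from graph.pbtxt.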

 01 (2020-11-28)

 © 

54

HCCL_INTRA_PCIE_ENABLE
    Atlas 300T training card (model 9000) scenario: whether communication inside a Server uses PCIe. Used together with HCCL_INTRA_ROCE_ENABLE:
    - HCCL_INTRA_PCIE_ENABLE and HCCL_INTRA_ROCE_ENABLE both 0: PCIe is used inside the Server.
    - HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=0: PCIe is used inside the Server.
    - HCCL_INTRA_PCIE_ENABLE=0 and HCCL_INTRA_ROCE_ENABLE=1: RoCE is used inside the Server.
    - HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=1
HCCL_INTRA_ROCE_ENABLE
    Atlas 300T training card (model 9000) scenario: whether communication inside a Server uses RoCE. See HCCL_INTRA_PCIE_ENABLE for the combinations.
SKT_ENABLE
    Whether tasks are fused into a super task before being delivered.
    - 1: superkernel is enabled and tasks are fused into a super task.
    - 0: superkernel is disabled and tasks are delivered one by one.

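OP_NO_REUSE_MEM
    Specifies one or more nodes whose memory is allocated separately instead of being reused. Nodes can be given by node name or by operator type; separate multiple entries with commas (",").
    By node name:
    export OP_NO_REUSE_MEM=gradients/logits/semantic/kernel/Regularizer/l2_regularizer_grad/Mul_1,resnet_v1_50/conv1_1/BatchNorm/AssignMovingAvg2
    By operator type:
    export OP_NO_REUSE_MEM=FusedMulAddN,BatchNorm
    Mixed:
    export OP_NO_REUSE_MEM=FusedMulAddN,resnet_v1_50/conv1_1/BatchNorm/AssignMovingAvg
DUMP_GE_GRAPH
    Dumps the graph description of each stage to files and controls how much content is dumped.
    - 1: full dump.
    - 2: basic dump without weights and other data.
    - 3: simplified dump that shows only node relationships.
DUMP_GRAPH_LEVEL
    Controls which graphs are dumped.
    - 1: dump all graphs.
    - 2: dump all graphs except subgraphs.
    - 3: dump only the final generated graph.
    Takes effect only when DUMP_GE_GRAPH is set. Default: 2.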

8 

8.1 Tensorflow 8.2 TF Adapter 8.3 

8.1  Tensorflow 
The Ascend AI Processor supports TensorFlow 1.15. Support for the TensorFlow Python APIs is listed below.

 Python API
TensorFlow Python API

Table 8-1 Python API (module: APIs)

tf: assert_same_float_dtype, assert_scalar, assert_type, dimension_at_index, dimension_value, get_logger, get_static_value, grad_pass_through, GradientTape, is_tensor, local_variables_initializer, make_ndarray, make_template, make_tensor_proto, min_max_variable_partitioner, no_gradient, NoGradient, RaggedTensorSpec, recompute_grad, resource_variables_enabled, TensorSpec, TypeSpec, UnconnectedGradients
tf.config.threading: get_inter_op_parallelism_threads, get_intra_op_parallelism_threads, set_inter_op_parallelism_threads, set_intra_op_parallelism_threads
tf.estimator: add_metrics, BestExporter, classifier_parse_example_spec, EstimatorSpec, EvalSpec, Exporter, FinalExporter, Head, LatestExporter, ModeKeys, regressor_parse_example_spec, RunConfig, train_and_evaluate, TrainSpec, WarmStartSettings
tf.estimator.export: ClassificationOutput, ExportOutput, PredictOutput, RegressionOutput, ServingInputReceiver, TensorServingInputReceiver
tf.keras: Sequential
tf.keras.datasets: boston_housing.load_data, cifar10.load_data, cifar100.load_data, fashion_mnist, fashion_mnist.load_data, imdb, mnist, mnist.load_data, reuters, reuters.get_word_index, reuters.load_data
tf.keras.datasets.imdb: get_word_index, load_data
tf.keras.layers: AbstractRNNCell, DenseFeatures, deserialize, serialize
tf.keras.metrics: Metric
tf.keras.optimizers: schedules
tf.keras.optimizers.schedules: deserialize, LearningRateSchedule, serialize
tf.nest: assert_same_structure, flatten, is_nested, map_structure, pack_sequence_as
tf.saved_model: Builder, contains_saved_model, load_v2, save
tf.saved_model.loader: load
tf.saved_model.signature_def_utils: build_signature_def
tf.saved_model.utils: build_tensor_info
tf.summary: Audio
tf.sysconfig: get_compile_flags, get_include, get_lib, get_link_flags
tf.train: assert_global_step, basic_train_loop, CheckpointManager, checkpoints_iterator, CheckpointSaverHook, CheckpointSaverListener, FeedFnHook, FinalOpsHook, LooperThread, remove_checkpoint, SessionRunHook, SessionRunValues, summary_iterator, VocabInfo, FeatureLists.FeatureListEntry, Features.FeatureEntry

 Python API
TensorFlow Python API

Table 8-2 Python API (module: API)

tf.keras: Model
    With native Keras, iteration_per_loop is limited to 1. To go beyond that, convert the model to an estimator with model_to_npu_estimator.
tf.keras.applications.densenet: decode_predictions
tf.keras.applications.imagenet_utils: decode_predictions
tf.keras.applications.inception_resnet_v2: decode_predictions
tf.keras.applications.inception_v3: decode_predictions
tf.keras.applications.mobilenet: decode_predictions
tf.keras.applications.mobilenet_v2: decode_predictions
tf.keras.applications.nasnet: decode_predictions
tf.keras.applications.resnet: decode_predictions
tf.keras.applications.resnet_v2: decode_predictions
tf.keras.applications.resnet50: decode_predictions
tf.keras.applications.vgg16: decode_predictions
tf.keras.applications.vgg19: decode_predictions
tf.keras.applications.xception: decode_predictions
tf.keras.backend: learning_phase_scope
tf.keras.layers: Layer
tf.keras.optimizers: Optimizer

 Python API
TensorFlow Python API

Table 8-3 Python API (module: APIs)

tf: enable_eager_execution, autograph, distribute, disable_v2_tensorshape, enable_control_flow_v2, enable_tensor_equality, enable_v2_behavior, enable_v2_tensorshape, CriticalSection, IndexedSlicesSpec, Module, OptionalSpec, RaggedTensor, function, disable_control_flow_v2, disable_eager_execution, disable_tensor_equality, disable_v2_behavior
tf.config: get_soft_device_placement, set_soft_device_placement
tf.config.optimizer: get_experimental_options, get_jit, set_experimental_options, set_jit
tf.estimator.tpu: TPUConfig, RunConfig, TPUEstimatorSpec, InputPipelineConfig
tf.keras.preprocessing.image: DirectoryIterator, ImageDataGenerator, Iterator, NumpyArrayIterator, apply_affine_transform, apply_brightness_shift, apply_channel_shift, array_to_img, img_to_array, load_img, random_brightness, random_zoom, save_img, random_rotation, random_shift, random_shear, random_channel_shift
tf.keras.preprocessing.sequence: pad_sequences, skipgrams, TimeseriesGenerator, make_sampling_table
tf.keras.preprocessing.text: hashing_trick, one_hot, text_to_word_sequence, Tokenizer
tf.keras.utils: model_to_dot
tf.profiler: AdviceProto, AdviceProto.Checker, AdviceProto.CheckersEntry, GraphNodeProto, GraphNodeProto.InputShapesEntry, MultiGraphNodeProto, OpLogProto, OpLogProto.IdToStringEntry, advise, write_op_log
tf.summary: all_v2_summary_ops
tf.train: ClusterDef, JobDef, JobDef.TasksEntry

 Python API
TensorFlowPython API

Table 8-4 Python API (module: APIs)

tf: disable_resource_variables
tf.train: input_producer, do_quantize_training_on_graphdef, limit_epochs, maybe_batch, maybe_batch_join, maybe_shuffle_batch, maybe_shuffle_batch_join, shuffle_batch_join
tf.train.queue_runner: QueueRunner, add_queue_runner



TensorFlow


8.2 TF Adapter 
8.2.1 
TensorflowTF AdapterTensorflow 
 8-1 TF Adapter 

Table 8-5 TF Adapter interfaces

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_config.py:
    NPURunConfig: constructor of the NPURunConfig class. NPURunConfig inherits from RunConfig.
    ProfilingConfig: constructor of the ProfilingConfig class, used to configure Profiling.
    DumpConfig: constructor of the DumpConfig class, used to configure dump.

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_estimator.py:
    NPUEstimator: constructor of the NPUEstimator class. NPUEstimator inherits from Estimator.
    NPUEstimatorSpec: constructor of the NPUEstimatorSpec class. NPUEstimatorSpec inherits from EstimatorSpec.

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_hook.py:
    NPUCheckpointSaverHook: constructor of the NPUCheckpointSaverHook class. NPUCheckpointSaverHook inherits from CheckpointSaverHook.
    NPUOutputTensorHook: Hook for NPUEstimator train, evaluate, and predict that calls output_fn every N steps to output tensors.

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:
    NPUDistributedOptimizer: wraps a single-device optimizer into an NPU distributed optimizer.
    NPUOptimizer: combines NPUDistributedOptimizer and NPULossScaleOptimizer.

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_loss_scale_optimizer.py:
    NPULossScaleOptimizer: enables Loss Scaling for training with float16. NPULossScaleOptimizer inherits from LossScaleOptimizer.

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_loss_scale_manager.py:
    FixedLossScaleManager: keeps LossScale fixed.
    ExponentialUpdateLossScaleManager: updates LossScale dynamically.

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu_ops.py:
    dropout: same function as tf.nn.dropout. Scales the input Tensor by 1/keep_prob; each element is kept with probability keep_prob and otherwise set to 0. The output Tensor has the same shape as the input Tensor.
    LARSV2: operator related to training with a large batch size.
    initialize_system: initializes GE.
    shutdown_system: releases Device resources; used as the counterpart of initialize_system.

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_scope.py:
    without_npu_compile_scope: keeps the enclosed operators on the Host.

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu/util.py:
    set_iteration_per_loop: in sess.run mode, sets the number of training iterations executed on the Device per sess.run() call.
    create_iteration_per_loop_var: used together with load_iteration_per_loop_var to set, in sess.run mode, the number of training iterations executed on the Device per sess.run() call.
    load_iteration_per_loop_var: used together with create_iteration_per_loop_var to set, in sess.run mode, the number of training iterations executed on the Device per sess.run() call.

Defined in {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/estimator/npu/keras_to_npu.py:
    model_to_npu_estimator: converts a Keras model into an NPUEstimator.

8.2.2 NPURunConfig 

 

def __init__(self,
             iterations_per_loop=1,
             profiling_config=None,
             model_dir=None,
             tf_random_seed=None,
             save_summary_steps=0,
             save_checkpoints_steps=None,
             save_checkpoints_secs=None,
             session_config=None,
             keep_checkpoint_max=5,
             keep_checkpoint_every_n_hours=10000,
             log_step_count_steps=100,
             enable_data_pre_proc=True,
             precision_mode=None,
             enable_reduce_precision=False,
             variable_format_optimize=True,
             mix_compile_mode=False,
             hcom_parallel=False,
             graph_memory_max_size=None,
             variable_memory_max_size=None,
             auto_tune_mode=None,
             dump_config=None,
             stream_max_parallel_num=None,
             is_tailing_optimization=False,
             horovod_mode=False,
             graph_run_mode=1)

Constructor of the NPURunConfig class. NPURunConfig inherits from RunConfig.



Device save_checkpoints_secs







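Parameters that NPURunConfig inherits from RunConfig:

model_dir
    Directory in which the model is saved. Default: None.
tf_random_seed
    Random seed. Default: None.
save_summary_steps
    Saves a Summary every this many steps. Default: 0.
    Applies when iterations_per_loop=1. When iterations_per_loop>1, see Log/Summary.
save_checkpoints_steps
    Saves a checkpoint every this many steps. Default: None.
    - Cannot be set together with save_checkpoints_secs.
    - If save_checkpoints_steps and save_checkpoints_secs are both None, a checkpoint is saved every 100 steps.
    - When iterations_per_loop>1, save_checkpoints_steps must be an integer multiple of iterations_per_loop; otherwise checkpoints are not saved at the interval given by save_checkpoints_steps.
save_checkpoints_secs
    Saves a checkpoint every this many seconds. Default: None.
    Cannot be set together with save_checkpoints_steps.
session_config
    ConfigProto object used to configure the session. Default: None.
keep_checkpoint_max
    Maximum number of checkpoint files to keep. Default: 5.
keep_checkpoint_every_n_hours
    Keeps one checkpoint every N hours. Default: 10000.
    Related to keep_checkpoint_max.
log_step_count_steps
    Logs the global step and loss every this many steps. Default: 100.
    Applies when iterations_per_loop=1. When iterations_per_loop>1, see Log/Summary.

RunConfig parameters that NPURunConfig does not support:

train_distribute
    Not supported, as with experimental_distribute. Distributed training is provided by NPUDistributedOptimizer in TF Adapter.
device_fn
    Function that returns the Device of each Operation.
protocol
    Protocol used to start the Server (GRPC).
eval_distribute
    Not supported, as with experimental_distribute.
experimental_distribute

Parameters added by NPURunConfig:

iterations_per_loop
    Number of training iterations executed on the Ascend AI Processor for each session.run call. Default: 1. The processor runs iterations_per_loop iterations before returning the result to the Host, which reduces Host-Device interaction.
    When mix_compile_mode is True, iterations_per_loop must be 1.
profiling_config
    Profiling switch. Pass a ProfilingConfig object; see the ProfilingConfig constructor.
dump_config
    Dump switch. Pass a DumpConfig object; see the DumpConfig constructor.
enable_data_pre_proc
    Whether data preprocessing runs on the Device.
    - True
    - False
precision_mode
    String that selects the operator precision mode.
    - allow_fp32_to_fp16 (used when the value is None): float32 may be lowered to float16.
    - force_fp16: float16 is used when an operator supports both float32 and float16.
    - must_keep_origin_dtype: the original precision is kept.
    - allow_mix_precision: mixed precision; some float32 operators are lowered to float16. Enable Loss Scaling, for example with NPULossScaleOptimizer, when using this mode.
enable_reduce_precision
    - True
    - False
variable_format_optimize
    Whether variable format optimization is enabled on the Ascend AI Processor.
    - True
    - False
mix_compile_mode
    Whether mixed computing is enabled, in which part of the graph runs on the Device (Ascend AI Processor) and the rest is executed by Tensorflow.
hcom_parallel
    Whether Allreduce runs in parallel.
    - True: Allreduce parallel execution is enabled.
    - False: Allreduce parallel execution is disabled.
graph_memory_max_size
    Unit: Byte. Value range: [0, 256*1024*1024*1024], that is, [0, 274877906944]. The sum of graph_memory_max_size and variable_memory_max_size must not exceed 31G. Default: 26GB.
variable_memory_max_size
    Unit: Byte. Value range: [0, 256*1024*1024*1024], that is, [0, 274877906944]. The sum of graph_memory_max_size and variable_memory_max_size must not exceed 31G. Default: 5GB.
auto_tune_mode
    Enables the TBE Auto Tune tool on the Ascend AI Processor, for example:
    auto_tune_mode = "RL,GA"
    See the Auto Tune documentation for details.
stream_max_parallel_num
    Parallelism of the AICPU/AICORE engines, for example:
    "DNN_VM_TF:10,DNN_V100:1"
    DNN_VM_TF is the AICPU engine (parallelism 10 in this example) and DNN_V100 is the AICORE engine (parallelism 1 in this example).
    The default AICPU/AICORE parallelism is 1. Value range: [1,13].
is_tailing_optimization
    Whether communication tailing optimization, which concerns the last AR (AllReduce), is enabled.
    - True
    - False (default)
    Must be used together with NPUOptimizer and must have the same value as is_tailing_optimization of NPUOptimizer.
horovod_mode
graph_run_mode
    - 0: online inference.
    - 1: training (default).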



Returns an NPURunConfig object, which is passed to NPUEstimator.

Example: setting iterations_per_loop to 1000 in config

import tensorflow as tf
from npu_bridge.estimator.npu.npu_config import NPURunConfig

session_config = tf.ConfigProto()
config = NPURunConfig(session_config=session_config,
                      mix_compile_mode=False,
                      iterations_per_loop=1000)
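The interaction between iterations_per_loop, save_checkpoints_steps, and mix_compile_mode can be checked before a job is launched. The sketch below is plain Python and does not call npu_bridge; the function names `check_loop_config` and `host_calls` are hypothetical, and the rules encoded are only the ones read from the parameter descriptions in this section.

```python
def check_loop_config(iterations_per_loop, save_checkpoints_steps=None,
                      mix_compile_mode=False):
    """Return a list of problems found in the loop-related settings."""
    problems = []
    if iterations_per_loop < 1:
        problems.append("iterations_per_loop must be >= 1")
    if mix_compile_mode and iterations_per_loop != 1:
        problems.append("mix_compile_mode=True requires iterations_per_loop=1")
    if (save_checkpoints_steps is not None and iterations_per_loop > 1
            and save_checkpoints_steps % iterations_per_loop != 0):
        problems.append(
            "save_checkpoints_steps should be a multiple of iterations_per_loop")
    return problems


def host_calls(total_steps, iterations_per_loop):
    """Number of Host-side run calls needed for total_steps (ceiling division)."""
    return -(-total_steps // iterations_per_loop)


print(check_loop_config(1000, save_checkpoints_steps=2000))
print(check_loop_config(1000, save_checkpoints_steps=1500))
print(host_calls(2000, 1000))
```

With iterations_per_loop=1000, a 2000-step job needs two Host-side calls, which is where the reduction in Host-Device interaction comes from.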
8.2.3 ProfilingConfig 


 

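def __init__(self, enable_profiling=False, enable_options=[])

Constructor of the ProfilingConfig class, used to configure Profiling.

Parameters

enable_profiling
    Whether to enable Profiling.
    - True: Profiling is enabled; the data to collect is selected by enable_options.
    - False: Profiling is disabled.
enable_options
    Profiling items to collect:
    - training_trace: iteration trace of the training job and the AI software stack.
    - task_trace: task trace of the HWTS/AICore hardware of the Ascend AI Processor.
    - op_trace: single-operator trace; it is configured separately from training_trace and task_trace.
    Example: ['task_trace','training_trace']

Return value

A ProfilingConfig object, which is passed to NPURunConfig.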
8.2.4 DumpConfig 



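def __init__(self, enable_dump=False, dump_path=None, dump_step=None, dump_mode="output")

Constructor of the DumpConfig class, used to configure dump.

Parameters

enable_dump
    Whether to enable dump. Default: False.
    - True: dump is enabled and the dump files are stored under dump_path; dump_path must not be None.
    - False: dump is disabled.
dump_path
    Path for dump files. Default: None.
    Dump files are stored in /var/log/npu/ide_daemon/dump/{dump_path}, where /var/log/npu/ide_daemon/dump is the "DUMP_PATH" value in the ada configuration file ide_daemon/ide_daemon.cfg. On the Host, the directory is {WORK_PATH}/ide_daemon/dump/{dump_path}, where {WORK_PATH} is the "WORK_PATH" value in ide_daemon.cfg ("~" by default).
dump_step
    Steps to dump. Default: None, meaning every step is dumped.
    Separate several steps with "|", for example 0|5|10. Use "-" for a range, for example 0|3-5|10.
dump_mode
    Which operator data is dumped.
    - input: only operator inputs are dumped.
    - output: only operator outputs are dumped (default).
    - all: both inputs and outputs are dumped.

Return value

A DumpConfig object, which is passed to NPURunConfig.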
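The dump_step syntax can be illustrated with a small parser. This is a sketch in plain Python rather than the adapter's own parsing code; the name `parse_dump_step` is hypothetical, and it assumes that a range such as 3-5 includes both endpoints.

```python
def parse_dump_step(dump_step):
    """Expand a dump_step string such as "0|3-5|10" into a sorted list of steps.

    None is returned unchanged because None means that every step is dumped.
    """
    if dump_step is None:
        return None
    steps = set()
    for part in dump_step.split("|"):
        part = part.strip()
        if "-" in part:
            first, last = (int(value) for value in part.split("-"))
            if first > last:
                raise ValueError("invalid range: %r" % part)
            steps.update(range(first, last + 1))  # assumed inclusive range
        else:
            steps.add(int(part))
    return sorted(steps)


print(parse_dump_step("0|5|10"))
print(parse_dump_step("0|3-5|10"))
```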
8.2.5 NPUEstimator 



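def __init__(self,
             model_fn=None,
             model_dir=None,
             config=None,
             params=None,
             job_start_file='')

Constructor of the NPUEstimator class. NPUEstimator inherits from Estimator.

Parameters

model_fn
    Model function. The function returns an NPUEstimatorSpec; see the NPUEstimatorSpec constructor.
model_dir
    Directory in which the model is saved. It relates to model_dir in config. If None, a directory under /tmp is used.
config
    NPURunConfig object; see the NPURunConfig constructor.
params
    Parameters passed to model_fn, as Python values.
job_start_file
    Path of the CSA job start file.

Return value

An NPUEstimator object.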
8.2.6 NPUEstimatorSpec 



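def __new__(cls,
            mode,
            predictions=None,
            loss=None,
            train_op=None,
            eval_metric_ops=None,
            export_outputs=None,
            training_chief_hooks=None,
            training_hooks=None,
            scaffold=None,
            evaluation_hooks=None,
            prediction_hooks=None,
            host_call=None)

Constructor of the NPUEstimatorSpec class. NPUEstimatorSpec inherits from EstimatorSpec.
An EstimatorSpec is returned by model_fn and carries the mode, predictions, loss, train_op, export_outputs and related objects that the Estimator uses. NPUEstimatorSpec extends EstimatorSpec.

Parameters that NPUEstimatorSpec inherits from EstimatorSpec:

mode
    Current mode:
    - ModeKeys.TRAIN
    - ModeKeys.EVAL
    - ModeKeys.PREDICT
predictions
    Predicted Tensor; used when mode is ModeKeys.PREDICT.
loss
    Training loss.
train_op
    Op that runs one training step.
eval_metric_ops
    Dict of evaluation results. Each value is one of:
    - a Metric instance
    - a (metric_tensor, update_op) tuple
export_outputs
    Outputs to export to a SavedModel.
training_chief_hooks
    SessionRunHooks run on the chief during training.
training_hooks
    SessionRunHooks run during training.
scaffold
    scaffold used to obtain the saver, init_op, summary_op, global_step, and so on.
evaluation_hooks
    SessionRunHooks run during evaluation.
prediction_hooks
    SessionRunHooks run during prediction.

Parameter added by NPUEstimatorSpec:

host_call
    Used to record Summary information per step on the Host; see Log/Summary.
    host_call is a tuple made of a function and a list or dict of tensors.
    host_call applies to train() and evaluate().

Return value

An NPUEstimatorSpec object.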
8.2.7 NPUCheckpointSaverHook 



def __init__(self,
             checkpoint_dir,
             save_secs=None,
             save_steps=None,
             saver=None,
             checkpoint_basename="model.ckpt",
             scaffold=None,
             listeners=None)

Constructor of the NPUCheckpointSaverHook class. NPUCheckpointSaverHook inherits from CheckpointSaverHook and is used to save checkpoints.
It is relevant when NPUEstimator runs with iteration_per_loop>1.

Parameters

checkpoint_dir
    Directory in which checkpoints are saved.
save_secs
    Saves a checkpoint every this many seconds.
save_steps
    Saves a checkpoint every this many steps.
saver
    Saver object.
checkpoint_basename
    Base name of the checkpoint files.
scaffold
    Scaffold used to obtain the saver.
listeners
    List of CheckpointSaverListener objects called around checkpoint saving.

Return value

An NPUCheckpointSaverHook object.

Example

from npu_bridge.estimator.npu.npu_hook import NPUCheckpointSaverHook

checkpoint_hook = NPUCheckpointSaverHook(checkpoint_dir='./ckpt', save_steps=2000)
...
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=2000,
    hooks=[checkpoint_hook])

8.2.8 NPUOutputTensorHook 

 

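def __init__(self,
             tensors,
             dependencies=None,
             output_fn=None,
             output_every_n_steps=0)

Constructor of the NPUOutputTensorHook class. NPUOutputTensorHook inherits from LoggingTensorHook.
The Hook is used in NPUEstimator train, evaluate, and predict and calls output_fn every N steps to output tensors.
When iterations_per_loop>1, output_every_n_steps and output_fn are affected by the loop size.

Parameters

tensors
    Tensors to output.
dependencies
    Ops that must run before tensors is fetched.
output_fn
    Function called with tensors every N steps.
output_every_n_steps
    The N above: output_fn is called every N steps.

Return value

An NPUOutputTensorHook object.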
8.2.9 NPUDistributedOptimizer 

 

def __init__(self, optimizer, name=None)

Wraps a single-device optimizer into an NPU distributed optimizer that aggregates gradients across Devices.

Parameters

optimizer
    Single-device optimizer used to compute the gradients.
name
    Name of the optimizer.

Return value

An NPUDistributedOptimizer object.

Example

import tensorflow as tf
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
optimizer = NPUDistributedOptimizer(optimizer)
8.2.10 NPULossScaleOptimizer 




def __init__(self, opt, loss_scale_manager, is_distributed=False)

Constructor of the NPULossScaleOptimizer class, which enables Loss Scaling to address numerical problems of float16 training. NPULossScaleOptimizer inherits from LossScaleOptimizer.

Parameters

opt
    Single-device optimizer used to compute the gradients and update the weights.
loss_scale_manager
    Determines how LossScale is updated.
    - Static LossScale: create a FixedLossScaleManager before creating the NPULossScaleOptimizer; see the FixedLossScaleManager constructor.
    - Dynamic LossScale: create an ExponentialUpdateLossScaleManager before creating the NPULossScaleOptimizer; see the ExponentialUpdateLossScaleManager constructor.
is_distributed
    Whether Loss Scaling is used in distributed training.
    - True: distributed training.
    - False

Return value

An NPULossScaleOptimizer object.

Example

from npu_bridge.estimator.npu.npu_loss_scale_optimizer import NPULossScaleOptimizer
from npu_bridge.estimator.npu.npu_loss_scale_manager import FixedLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_manager import ExponentialUpdateLossScaleManager

if FLAGS.use_fp16 and (FLAGS.npu_bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.npu_bert_loss_scale == 0:
        loss_scale_manager = ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32,
            incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2,
            decr_ratio=0.5)
    elif FLAGS.npu_bert_loss_scale >= 1:
        loss_scale_manager = FixedLossScaleManager(loss_scale=FLAGS.npu_bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.npu_bert_loss_scale)
    if ops_adapter.size() > 1:
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager, is_distributed=True)
    else:
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager)

8.2.11 NPUOptimizer 

 

def __init__(self,
             opt,
             loss_scale_manager=None,
             is_distributed=False,
             is_loss_scale=False,
             is_tailing_optimization=False,
             name=None)

Constructor of the NPUOptimizer class, which combines NPUDistributedOptimizer and NPULossScaleOptimizer and provides:
- Loss Scaling, which addresses numerical problems of float16 training.
- Distributed training, by wrapping a single-device optimizer into an NPU distributed optimizer across Devices.
- Communication tailing optimization, which concerns the last AR (AllReduce).

Parameters

opt
    Single-device optimizer used to compute the gradients and update the weights.
loss_scale_manager
    Needed only when is_loss_scale is True, that is, when Loss Scaling is enabled. Determines how LossScale is updated.
    - Static LossScale: create a FixedLossScaleManager before creating the NPUOptimizer; see the FixedLossScaleManager constructor.
    - Dynamic LossScale: create an ExponentialUpdateLossScaleManager before creating the NPUOptimizer; see the ExponentialUpdateLossScaleManager constructor.
is_distributed
    Whether distributed training is enabled.
    - True: gradients are aggregated with allreduce.
    - False (default)
is_loss_scale
    Whether Loss Scaling is enabled.
    - True: enabled; loss_scale_manager must not be None.
    - False (default)
is_tailing_optimization
    Whether communication tailing optimization is enabled. Takes effect only when is_distributed is True.
    - True
    - False (default)
name
    Name of the optimizer.

Return value

An NPUOptimizer object.

Example

import tensorflow as tf
from npu_bridge.estimator.npu.npu_optimizer import NPUOptimizer
from npu_bridge.estimator.npu.npu_loss_scale_manager import FixedLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_manager import ExponentialUpdateLossScaleManager

# Define the single-device optimizer (LAMBOptimizer comes from the model code).
optimizer = LAMBOptimizer(
    learning_rate=learning_rate,
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-6,
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

# Loss Scaling enabled
if tf.flags.FLAGS.npu_bert_loss_scale not in [None, -1]:
    if tf.flags.FLAGS.npu_bert_loss_scale == 0:
        loss_scale_manager = ExponentialUpdateLossScaleManager(
            init_loss_scale=tf.flags.FLAGS.init_loss_scale_value,
            incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2,
            decr_ratio=0.5)
    elif tf.flags.FLAGS.npu_bert_loss_scale >= 1:
        loss_scale_manager = FixedLossScaleManager(
            loss_scale=tf.flags.FLAGS.npu_bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % tf.flags.FLAGS.npu_bert_loss_scale)
    optimizer = NPUOptimizer(optimizer, loss_scale_manager,
                             is_distributed=tf.flags.FLAGS.distributed,
                             is_loss_scale=True,
                             is_tailing_optimization=True)
# Loss Scaling disabled
else:
    optimizer = NPUOptimizer(optimizer, is_distributed=tf.flags.FLAGS.distributed)
8.2.12 FixedLossScaleManager 

  

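def __init__(self, loss_scale)

Constructor of the FixedLossScaleManager class, which keeps LossScale fixed.

Parameters

loss_scale
    LossScale value, a float that must not be less than 1.
    When choosing the LossScale value, the value used on GPU is a reference point.

Return value

A FixedLossScaleManager object.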

8.2.13 ExponentialUpdateLossScaleManager 



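def __init__(self,
             init_loss_scale,
             incr_every_n_steps,
             decr_every_n_nan_or_inf=2,
             incr_ratio=2,
             decr_ratio=0.8)

Constructor of the ExponentialUpdateLossScaleManager class, which updates LossScale dynamically.

Parameters

init_loss_scale
    Initial LossScale value, a float.
incr_every_n_steps
    LossScale is increased after N consecutive steps without overflow.
decr_every_n_nan_or_inf
    LossScale is decreased after N steps with overflow.
incr_ratio
    Factor by which LossScale is multiplied when it is increased.
decr_ratio
    Factor by which LossScale is multiplied when it is decreased.

Return value

An ExponentialUpdateLossScaleManager object.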
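How the five parameters interact is easiest to see in a small simulation. The sketch below is plain Python and models the usual exponential update policy as implemented by the TensorFlow contrib manager of the same name; the class name `LossScaleSketch` is hypothetical, and the NPU implementation may differ in detail.

```python
class LossScaleSketch:
    """Toy model of a dynamic loss-scale policy (not the npu_bridge class)."""

    def __init__(self, init_loss_scale, incr_every_n_steps,
                 decr_every_n_nan_or_inf=2, incr_ratio=2, decr_ratio=0.8):
        self.loss_scale = float(init_loss_scale)
        self.incr_every_n_steps = incr_every_n_steps
        self.decr_every_n_nan_or_inf = decr_every_n_nan_or_inf
        self.incr_ratio = incr_ratio
        self.decr_ratio = decr_ratio
        self.good_steps = 0   # consecutive steps with finite gradients
        self.bad_steps = 0    # steps with nan/inf gradients since the last change

    def update(self, grads_are_finite):
        """Record one training step and return the loss scale for the next one."""
        if grads_are_finite:
            self.good_steps += 1
            if self.good_steps >= self.incr_every_n_steps:
                self.loss_scale *= self.incr_ratio
                self.good_steps = 0
                self.bad_steps = 0
        else:
            self.good_steps = 0
            self.bad_steps += 1
            if self.bad_steps >= self.decr_every_n_nan_or_inf:
                # Never let the scale drop below 1.
                self.loss_scale = max(1.0, self.loss_scale * self.decr_ratio)
                self.bad_steps = 0
        return self.loss_scale


manager = LossScaleSketch(1024, incr_every_n_steps=3,
                          decr_every_n_nan_or_inf=2, decr_ratio=0.5)
history = [manager.update(ok) for ok in (True, True, True, False, False)]
print(history)
```

Three clean steps double the scale, and the second overflow step halves it again, so the scale tracks the largest value that does not overflow.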

8.2.14 dropout



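def dropout(x, keep_prob, noise_shape=None, seed=None, name=None)

Same function as tf.nn.dropout. Scales the input Tensor by 1/keep_prob; each element is kept with probability keep_prob and otherwise set to 0. The output Tensor has the same shape as the input Tensor.

Parameters

x
    Input Tensor of type float.
keep_prob
    Tensor of type float: the probability that an element is kept.
noise_shape
    Tensor of type int32: shape of the randomly generated keep/drop flags.
seed
    Random seed.
name
    Name of the operation.

Return value

A Tensor holding the result of applying dropout to the input tensor x.

Example

from npu_bridge.estimator import npu_ops

# x and keep_prob are required arguments.
layers = npu_ops.dropout(x, keep_prob)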
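The scaling rule can be checked on plain Python lists. This is a sketch of the arithmetic only, not of the NPU operator; the function name `dropout_sketch` is hypothetical and the random source is Python's `random` module.

```python
import random


def dropout_sketch(values, keep_prob, seed=None):
    """Keep each element with probability keep_prob and scale it by 1/keep_prob."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = random.Random(seed)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


x = [1.0, 2.0, 3.0, 4.0]
y = dropout_sketch(x, keep_prob=0.5, seed=0)
print(y)
```

Scaling the kept elements by 1/keep_prob keeps the expected value of each output element equal to the corresponding input element.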

8.2.15 LARSV2


 

def LARSV2(input_weight,
           input_grad,
           weight_decay,
           learning_rate,
           hyperpara=0.001,
           epsilon=0.00001,
           use_clip=False,
           name=None)

Operator related to training with a large batch size.

Parameters

input_weight
    Tensor of type float.
input_grad
    Tensor of type float.
weight_decay
    Tensor of type float.
learning_rate
    Tensor of type float.
hyperpara
    float. Default: 0.001.
epsilon
    Small value that keeps the denominator from being 0. Default: 1e-5.
use_clip
    bool. Default: False.
    - True
name
    Name of the operation.

Return value

A Tensor holding the result computed from the input tensors.

Example

from npu_bridge.estimator import npu_ops

layers = npu_ops.LARSV2(input_weight, input_grad, weight_decay, learning_rate)

8.2.16 initialize_system

 

def initialize_system(name=None)

Initializes GE.

Parameters

name
    Name of the operation.

Return value

An op. Run it with sess.run(op) to initialize GE.

Example

If HCCL interfaces such as get_local_rank_id, get_rank_size, or get_rank_id have to be called before sess.run or estimator.train, start a separate session that runs initialize_system first; after training, run shutdown_system and close that session.

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remap

init_sess = tf.Session(config=config)
init_sess.run(npu_init)

# Call the HCCL interfaces...
# Run training...

init_sess.run(npu_shutdown)
init_sess.close()

Or:

import tensorflow as tf
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remap

with tf.Session(config=config) as sess:
    sess.run(npu_init)
    # Call the HCCL interfaces...
    # Run training...
    sess.run(npu_shutdown)

8.2.17 shutdown_system

 

def shutdown_system(name=None)

Releases Device resources. Used as the counterpart of initialize_system.

Parameters

name
    Name of the operation.

Return value

An op, to be run with sess.run(op).
8.2.18 without_npu_compile_scope

  

def without_npu_compile_scope()

Scope under which operators are kept on the Host.






8.2.19 set_iteration_per_loop



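def set_iteration_per_loop(sess, train_op, iterations_per_loop=1)

In sess.run mode, sets the number of training iterations executed on the Device for each sess.run() call, which reduces Host-Device interaction.
When the session is created in a way that prevents this, for example with tf.train.Supervisor, set_iteration_per_loop cannot be used; use create_iteration_per_loop_var and load_iteration_per_loop_var instead.

Parameters

sess
    Created TensorFlow session.
train_op
    Training operation.
iterations_per_loop
    Number of training iterations executed on the Device for each sess.run() call. Default: 1.
    When mix_compile_mode is True, iterations_per_loop must be 1.

Return value

An op, to be run with sess.run(op).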
8.2.20 create_iteration_per_loop_var



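def create_iteration_per_loop_var(self, train_op)

Used together with load_iteration_per_loop_var to set, in sess.run mode, the number of training iterations executed on the Device for each sess.run() call.

Parameters

train_op
    Training operation.

Return value

An op, to be run with sess.run(op).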
8.2.21 load_iteration_per_loop_var



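def load_iteration_per_loop_var(self, sess, iterations_per_loop=1)

Used together with create_iteration_per_loop_var to set, in sess.run mode, the number of training iterations executed on the Device for each sess.run() call.

Parameters

sess
    Created TensorFlow session.
iterations_per_loop
    Number of training iterations executed on the Device for each sess.run() call. Default: 1.
    When mix_compile_mode is True, iterations_per_loop must be 1.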



8.2.22 model_to_npu_estimator



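def model_to_npu_estimator(keras_model=None,
                           keras_model_path=None,
                           custom_objects=None,
                           model_dir=None,
                           checkpoint_format='saver',
                           config=None,
                           job_start_file='')

Converts a Keras model into an NPUEstimator.
A model built with Keras is passed to model_to_npu_estimator, and training then runs through the returned NPUEstimator.

Parameters

keras_model
    Keras model object. Alternative to keras_model_path.
keras_model_path
    Path of a Keras model saved in HDF5 format with the Keras save() method. Alternative to keras_model.
custom_objects
    Custom objects used by the keras model.
model_dir
    Directory in which the model is saved. It relates to model_dir in config. If None, a directory under /tmp is used.
checkpoint_format
    Checkpoint format used by the NPUEstimator.
    - saver: checkpoints are saved with tf.train.Saver().
    - checkpoint: checkpoints are saved with tf.train.Checkpoint().
config
    NPURunConfig object used to configure the NPUEstimator; see the NPURunConfig constructor.
job_start_file
    Path of the CSA job start file.

Return value

The NPUEstimator converted from the keras model.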
8.2.23 sess.run  session 
AIsess.run

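Table 8-6 session configuration parameters

use_off_line
    Whether training runs on the Ascend AI Processor.
    - True: training runs on the Ascend AI Processor.
    - False: training runs on the Host CPU.
enable_data_pre_proc
    Whether data preprocessing runs on the Device.
    - True
    - False
iterations_per_loop
    In sess.run mode, the number of training iterations executed on the Device per sess.run() call, as set with set_iteration_per_loop. The value must match iterations_per_loop passed to set_iteration_per_loop.
profiling_mode
    Whether to enable Profiling.
    - True: Profiling is enabled; the data to collect is selected by profiling_options.
    - False: Profiling is disabled.
profiling_options
    Profiling items to collect:
    - training_trace: iteration trace of the training job and the AI software stack.
    - task_trace: task trace of the HWTS/AICore hardware of the Ascend AI Processor.
    - op_trace: single-operator trace; it is configured separately from training_trace and task_trace.
    Separate several items with a colon, for example "training_trace:task_trace".
enable_dump
    Whether to enable dump.
    - True: dump is enabled and the dump files are stored under dump_path; dump_path must not be None.
    - False: dump is disabled.
dump_path
    Path for dump files. Default: None.
dump_step
    Steps to dump. Default: None, meaning every step is dumped. Separate several steps with "|", for example 0|5|10. Use "-" for a range, for example 0|3-5|10.
dump_mode
    Which operator data is dumped.
    - input: only operator inputs are dumped.
    - output: only operator outputs are dumped (default).
    - all: both inputs and outputs are dumped.
precision_mode
    String that selects the operator precision mode.
    - allow_fp32_to_fp16 (used when the value is None): float32 may be lowered to float16.
    - force_fp16: float16 is used when an operator supports both float32 and float16.
    - must_keep_origin_dtype: the original precision is kept.
    - allow_mix_precision: mixed precision; some float32 operators are lowered to float16. Enable Loss Scaling, for example with NPULossScaleOptimizer, when using this mode.
enable_reduce_precision
    - True
    - False
variable_format_optimize
    Whether variable format optimization is enabled on the Ascend AI Processor, for example converting NCHW to NC1HWC0.
    - True
    - False
mix_compile_mode
    Whether mixed computing is enabled, in which part of the graph runs on the Device (Ascend AI Processor) and the rest is executed by Tensorflow.
hcom_parallel
    Whether Allreduce runs in parallel.
    - True: Allreduce parallel execution is enabled.
    - False: Allreduce parallel execution is disabled.
graph_memory_max_size
    Unit: Byte. Value range: [0, 256*1024*1024*1024], that is, [0, 274877906944]. The sum of graph_memory_max_size and variable_memory_max_size must not exceed 31G. Default: 26GB.
variable_memory_max_size
    Unit: Byte. Value range: [0, 256*1024*1024*1024], that is, [0, 274877906944]. The sum of graph_memory_max_size and variable_memory_max_size must not exceed 31G. Default: 5GB.
auto_tune_mode
    Enables the TBE Auto Tune tool on the Ascend AI Processor, for example:
    auto_tune_mode = "RL,GA"
    See the Auto Tune documentation for details.
stream_max_parallel_num
    Parallelism of the AICPU/AICORE engines, for example:
    "DNN_VM_TF:10,DNN_V100:1"
    DNN_VM_TF is the AICPU engine (parallelism 10 in this example) and DNN_V100 is the AICORE engine (parallelism 1 in this example).
    The default AICPU/AICORE parallelism is 1. Value range: [1,13].
is_tailing_optimization
    Whether communication tailing optimization, which concerns the last AR (AllReduce), is enabled.
    - True
    - False (default)
    Must be used together with NPUOptimizer and must have the same value as is_tailing_optimization of NPUOptimizer.
graph_run_mode
    - 0: online inference.
    - 1: training (default).

Example of a sess.run configuration: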
import tensorflow as tf
from npu_bridge.estimator import npu_ops
from npu_bridge.estimator.npu import npu_scope
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

X = tf.random_normal([2,])
Y = tf.random_normal([2,])

with npu_scope.without_npu_compile_scope():
    pred = tf.add(tf.multiply(X, 1.), 0.)

cost = tf.reduce_sum(tf.abs(pred-Y))

config = tf.ConfigProto()

custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["enable_data_pre_proc"].b = True
custom_op.parameter_map["profiling_mode"].b = True
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes("task_trace")
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
custom_op.parameter_map["variable_format_optimize"].b = True
custom_op.parameter_map["mix_compile_mode"].b = True
custom_op.parameter_map["enable_dump"].b = True
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/tmp/test")
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10")
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")
custom_op.parameter_map["hcom_parallel"].b = True
custom_op.parameter_map["graph_memory_max_size"].s = tf.compat.as_bytes(str(26*1024 * 1024 * 1024))
custom_op.parameter_map["variable_memory_max_size"].s = tf.compat.as_bytes(str(5*1024 * 1024 * 1024))
custom_op.parameter_map["iterations_per_loop"].i = 10
custom_op.parameter_map["auto_tune_mode"].s = tf.compat.as_bytes("RL,GA")
custom_op.parameter_map["stream_max_parallel_num"].s = tf.compat.as_bytes("DNN_VM_TF:10,DNN_V100:1")
custom_op.parameter_map["is_tailing_optimization"].b = True
custom_op.parameter_map["graph_run_mode"].i = 1

config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable remapping

with tf.Session(config=config) as sess:
    print(sess.run(cost))
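Several of the options in the example above are passed as byte strings rather than numbers. The following is a minimal sketch, in plain Python with no NPU dependency, of how such strings can be built. The helper names gib_to_byte_string and dump_step_string are hypothetical and are not part of npu_bridge.

```python
GIB = 1024 * 1024 * 1024  # bytes in one GiB


def gib_to_byte_string(gib):
    """Return the byte count for `gib` GiB as a decimal string."""
    return str(gib * GIB)


def dump_step_string(steps):
    """Join step numbers with '|', the separator used by dump_step in the example."""
    return "|".join(str(step) for step in steps)


# Values matching the example: 26 GiB of graph memory, 5 GiB of variable memory.
print(gib_to_byte_string(26))
print(gib_to_byte_string(5))
print(dump_step_string([0, 5, 10]))
```

The resulting strings would then be wrapped with tf.compat.as_bytes() before being assigned to the parameter_map entry, as in the example.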

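8.3 Collective Communication Interfaces

8.3.1 Overview
During distributed training, NPUDistributedOptimizer performs the gradient allreduce. In addition, the interfaces below are provided for rank management, gradient splitting, and collective communication.

Table 8-7 Collective communication interfaces

Rank management
Definition file: {install_path_fwkacllib}/fwkacllib/python/site-packages/hccl/hccl/manage/api.py
- create_group: creates a user-defined collective communication group.
- destroy_group: destroys a user-defined collective communication group.
- get_rank_size: returns the number of ranks (that is, Devices) in a group.
- get_local_rank_size: returns the number of local ranks on the Server where the device resides, within a group.
- get_rank_id: returns the rank id of the device within a group.
- get_local_rank_id: returns the local rank id of the device within a group.
- get_world_rank_from_group_rank: converts a group rank id to the corresponding world rank id.
- get_group_rank_from_world_rank: converts a world rank id to the group rank id within a group.

Gradient splitting
Definition file: {install_path_fwkacllib}/fwkacllib/python/site-packages/hccl/hccl/split/api.py
- set_split_strategy_by_idx: sets the gradient split strategy of a group by gradient index id.
- set_split_strategy_by_size: sets the gradient split strategy of a group by share of gradient data size.

Collective communication operators
Definition file: {install_path_tfplugin}/tfplugin/python/site-packages/npu_bridge/hccl/hccl_ops.py
- allreduce: performs allreduce within a group.
- allgather: performs allgather within a group, gathering the input Tensors of all ranks.
- broadcast: performs broadcast within a group, sending the data of the root node to the other ranks.
- reduce_scatter: performs reducescatter within a group.
- send: sends data within a group.
- receive: receives data within a group.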

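8.3.2 group Management

8.3.2.1 create_group

Prototype
def create_group(group, rank_num, rank_ids)

Description
Creates a user-defined collective communication group.

Parameters
- group (input): String, at most 128 bytes including the terminator. Name that identifies the group. It must not be hccl_world_group. hccl_world_group is the default group created from the ranktable file and cannot be created through this interface; passing it makes group creation fail.
- rank_num (input): int. Number of ranks in the group. The maximum is 4096.
- rank_ids (input): list. world_rank_id values of the ranks that form the group.
  Single-Server scenario: the number of ranks in rank_ids must be 1, 2, 4, or 8.
  Multi-Server scenario: rank_ids must meet all of the following.
  - Every Server contributes the same number of ranks, and that number is 1, 2, 4, or 8.
  - The ranks chosen on every Server have the same rank id modulo 8.
  - If 2 or 4 ranks are chosen per Server, they must belong to the same npu cluster (rank id modulo 8 all less than 4, or all greater than or equal to 4).
  For example, to create a group across three Servers whose rank ids are {0,1,2,3,4,5,6,7}, {8,9,10,11,12,13,14,15}, and {16,17,18,19,20,21,22,23}, valid choices include:
    rank_ids=[1,9,17]
    rank_ids=[1,2,9,10,17,18]
    rank_ids=[4,5,6,7,12,13,14,15,20,21,22,23]
  Atlas 300T (model 9000): the number of ranks per Server in rank_ids must be 1, 2, 4, or 8.

Return value
None.

Example
from hccl.manage.api import create_group
create_group("myGroup", 4, [0, 1, 2, 3])

Constraints
- Call this interface after collective communication initialization (initialize_system) has completed.
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- The ranks of the group come from the ranktable; if the number of ranks is not 1, 2, 4, or 8, group creation can fail.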
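The rank_ids rules for the multi-Server case can be checked before calling create_group. The sketch below is a plain-Python illustration, not part of HCCL. It assumes 8 devices per Server (as in the three-Server example), and the function name check_rank_ids is hypothetical. The rules it encodes are: the same number of ranks on every Server, that number being 1, 2, 4, or 8; the same rank id modulo 8 on every Server; and, for 2 or 4 ranks per Server, all positions in the same half (below 4, or 4 and above).

```python
from collections import defaultdict

RANKS_PER_SERVER = 8  # assumption: 8 devices per Server


def check_rank_ids(rank_ids):
    """Return True if rank_ids satisfies the multi-Server rules described above."""
    if not rank_ids or len(set(rank_ids)) != len(rank_ids):
        return False
    # Group the in-Server positions (rank id modulo 8) by Server index.
    per_server = defaultdict(list)
    for rank in rank_ids:
        per_server[rank // RANKS_PER_SERVER].append(rank % RANKS_PER_SERVER)
    positions = [sorted(p) for p in per_server.values()]
    first = positions[0]
    # Rule 1: the per-Server count must be 1, 2, 4, or 8.
    if len(first) not in (1, 2, 4, 8):
        return False
    # Rule 2: identical positions on every Server (this also forces equal counts).
    if any(p != first for p in positions):
        return False
    # Rule 3: with 2 or 4 ranks per Server, stay inside one half of the Server.
    if len(first) in (2, 4):
        if not (all(x < 4 for x in first) or all(x >= 4 for x in first)):
            return False
    return True


print(check_rank_ids([1, 9, 17]))
print(check_rank_ids([3, 4, 11, 12]))
```

The three rank_ids lists given as valid in the parameter description pass this check, while a list such as [3, 4, 11, 12] fails because positions 3 and 4 fall in different halves.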

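8.3.2.2 destroy_group

Prototype
def destroy_group(group)

Description
Destroys a user-defined collective communication group.

Parameters
- group (input): String, at most 128 bytes including the terminator. Name that identifies the group.

Return value
None.

Example
from hccl.manage.api import create_group
from hccl.manage.api import destroy_group
create_group("myGroup", 4, [0, 1, 2, 3])
destroy_group("myGroup")

Constraints
- Call this interface after collective communication initialization (initialize_system) has completed.
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- The group passed to destroy_group must have been created with create_group.
- hccl_world_group is not created through create_group, so passing it as group fails.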

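8.3.2.3 get_rank_size

Prototype
def get_rank_size(group="hccl_world_group")

Description
Returns the number of ranks (that is, Devices) in the group.

Parameters
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
int: the number of ranks in the group.

Example
from hccl.manage.api import create_group
from hccl.manage.api import get_rank_size
create_group("myGroup", 4, [0, 1, 2, 3])
rankSize = get_rank_size("myGroup")
#rankSize = 4

Constraints
- Call this interface after collective communication initialization (initialize_system) has completed.
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- Call this interface after create_group to get the rank count of a user-defined group. If "hccl_world_group" is passed, the rank count of the world group is returned.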

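8.3.2.4 get_local_rank_size

Prototype
def get_local_rank_size(group="hccl_world_group")

Description
Returns the number of local ranks on the Server where the device resides, within the group.

Parameters
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
int: the number of local ranks on the Server where the device resides.

Example
from hccl.manage.api import create_group
from hccl.manage.api import get_local_rank_size
create_group("myGroup", 4, [0, 1, 2, 3])
localRankSize = get_local_rank_size("myGroup")
#localRankSize = 1

Constraints
- Call this interface after collective communication initialization (initialize_system) has completed.
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- Call this interface after create_group to get the local rank count of a user-defined group. If "hccl_world_group" is passed, the local rank count of the world group is returned.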

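8.3.2.5 get_rank_id

Prototype
def get_rank_id(group="hccl_world_group")

Description
Returns the rank id of the device within the group.

Parameters
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
int: the rank id of the device within the group.

Example
from hccl.manage.api import create_group
from hccl.manage.api import get_rank_id
create_group("myGroup", 4, [0, 1, 2, 3])
rankId = get_rank_id("myGroup")
#rankId = 0/1/2/3

Constraints
- Call this interface after collective communication initialization (initialize_system) has completed.
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- Call this interface after create_group to get the rank id within a user-defined group. If "hccl_world_group" is passed, the rank id within the world group is returned.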

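8.3.2.6 get_local_rank_id

Prototype
def get_local_rank_id(group="hccl_world_group")

Description
Returns the local rank id of the device within the group.

Parameters
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
int: the local rank id of the device.

Example
from hccl.manage.api import create_group
from hccl.manage.api import get_local_rank_id
create_group("myGroup", 4, [0, 1, 2, 3])
localRankId = get_local_rank_id("myGroup")
#localRankId = 0

Constraints
- Call this interface after collective communication initialization (initialize_system) has completed.
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- Call this interface after create_group to get the local rank id within a user-defined group. If "hccl_world_group" is passed, the local rank id within the world group is returned.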

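8.3.2.7 get_world_rank_from_group_rank

Prototype
def get_world_rank_from_group_rank(group, group_rank_id)

Description
Converts a group rank id to the corresponding world rank id.

Parameters
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".
- group_rank_id (input): int. Rank id within the group.

Return value
int: the rank id within "hccl_world_group".

Example
from hccl.manage.api import create_group
from hccl.manage.api import get_world_rank_from_group_rank
create_group("myGroup", 4, [0, 1, 2, 3])
worldRankId = get_world_rank_from_group_rank("myGroup", 1)
#worldRankId = 1

Constraints
- Call this interface after collective communication initialization (initialize_system) has completed.
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- Call this interface after create_group to convert a group rank id to a world rank id.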

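8.3.2.8 get_group_rank_from_world_rank

Prototype
def get_group_rank_from_world_rank(world_rank_id, group)

Description
Converts a world rank id to the group rank id within the given group.

Parameters
- world_rank_id (input): int. Rank id within "hccl_world_group".
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
int: the rank id within the group.

Example
from hccl.manage.api import create_group
from hccl.manage.api import get_group_rank_from_world_rank
create_group("myGroup", 4, [0, 1, 2, 3])
groupRankId = get_group_rank_from_world_rank(1, "myGroup")
#groupRankId = 1

Constraints
- Call this interface after collective communication initialization (initialize_system) has completed.
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- Call this interface after create_group to convert a world rank id to a group rank id.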
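The relationship between the two conversion interfaces can be pictured with a small plain-Python model. This is only a sketch under one assumption that the text does not state explicitly: a rank's group rank id is its position in the rank_ids list passed to create_group. The helper names below are hypothetical and do not call HCCL.

```python
def world_rank_from_group_rank(rank_ids, group_rank_id):
    """Model: the world rank id is the entry at position group_rank_id."""
    return rank_ids[group_rank_id]


def group_rank_from_world_rank(rank_ids, world_rank_id):
    """Model: the group rank id is the position of world_rank_id in rank_ids."""
    return rank_ids.index(world_rank_id)


# A group built from one rank on each of three Servers (assumed example).
rank_ids = [1, 9, 17]
print(world_rank_from_group_rank(rank_ids, 1))
print(group_rank_from_world_rank(rank_ids, 17))
```

Under this model the two functions are inverses of each other for every rank in the group, and a world rank id that is not in rank_ids has no group rank id.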

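8.3.3 Gradient Splitting

8.3.3.1 set_split_strategy_by_idx

Prototype
def set_split_strategy_by_idx(idxList, group="hccl_world_group")

Description
Sets the gradient split strategy of the group by gradient index id; the segments determine how gradients are fused for allreduce.

Parameters
- idxList (input): list. List of gradient index ids.
  - The index ids must be non-negative and in ascending order.
  - The index ids are based on the total number of gradients in the model. They start at 0, and the largest id is the total number of gradients minus 1.
    - If the total number of gradients is unknown, the strategy can be set with set_split_strategy_by_size instead.
    - The total can also be read from the INFO-level host log by searching for "segment result". For example, "segment index list: [0,107] [108,159]" means the largest index id is 159.
  - At most 8 segments are supported.
  - For example, if a model has 160 gradients to be split into the three segments [0,20], [21,100], and [101,159], set idxList=[20,100,159].
- group (input): String. Group name: a user-defined group or "hccl_world_group". Defaults to "hccl_world_group".

Return value
None.

Example
from hccl.split.api import set_split_strategy_by_idx
set_split_strategy_by_idx([20, 100, 159], "group")

Constraints
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- If no split strategy is set, the default strategy is used: gradients are split by data size into 2 segments, the first holding 96.54% of the data and the second 3.46%.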
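As a worked illustration of the idxList rule (each entry is the last gradient index of a segment), the sketch below converts a list of (start, end) segments into an idxList and checks the constraints named in the parameter description. It is plain Python; the helper name idx_list_from_segments is hypothetical and is not part of hccl.split.api.

```python
def idx_list_from_segments(segments, total_grads):
    """Turn [(start, end), ...] gradient index segments into an idxList."""
    if not 1 <= len(segments) <= 8:
        raise ValueError("between 1 and 8 segments are supported")
    expected_start = 0
    idx_list = []
    for start, end in segments:
        # Segments must cover the indices contiguously, in ascending order.
        if start != expected_start or end < start:
            raise ValueError("segments must be contiguous and ascending")
        idx_list.append(end)
        expected_start = end + 1
    # The last segment must end at the largest gradient index.
    if idx_list[-1] != total_grads - 1:
        raise ValueError("the last index must be total_grads - 1")
    return idx_list


print(idx_list_from_segments([(0, 20), (21, 100), (101, 159)], 160))
```

The printed list is what would be passed as the first argument of set_split_strategy_by_idx for the 160-gradient example.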


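8.3.3.2 set_split_strategy_by_size

Prototype
def set_split_strategy_by_size(dataSizeList, group="hccl_world_group")

Description
Sets the gradient split strategy of the group by share of gradient data size; the segments determine how gradients are fused for allreduce.

Parameters
- dataSizeList (input): list. List of gradient data size percentages.
  - The percentages must add up to 100.
  - At most 8 segments are supported.
  - For example, if a model has 150M of gradient data to be split into segments of 90M, 30M, and 30M, set dataSizeList=[60,20,20].
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group". Defaults to "hccl_world_group".

Return value
None.

Example
from hccl.split.api import set_split_strategy_by_size
set_split_strategy_by_size([60, 20, 20], "group")

Constraints
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- If no split strategy is set, the default strategy is used. It is tuned for ResNet50: gradients are split by data size into 2 segments, the first holding 96.54% of the data and the second 3.46%.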
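The percentages in dataSizeList can be derived from the intended segment sizes. The sketch below is a plain-Python illustration with a hypothetical helper name, size_list_from_segments; it rounds each share to a whole percent and assigns the rounding remainder to the last segment so that the list always sums to 100.

```python
def size_list_from_segments(segment_sizes):
    """Convert absolute segment sizes into whole percentages that sum to 100."""
    if not 1 <= len(segment_sizes) <= 8:
        raise ValueError("between 1 and 8 segments are supported")
    total = sum(segment_sizes)
    if total <= 0:
        raise ValueError("total size must be positive")
    percents = [round(100 * size / total) for size in segment_sizes]
    # Give the rounding remainder to the last segment so the sum is exactly 100.
    percents[-1] = 100 - sum(percents[:-1])
    return percents


# 150M of gradient data split into 90M, 30M, and 30M.
print(size_list_from_segments([90, 30, 30]))
```

For the 150M example this reproduces the [60, 20, 20] list used in the call above.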

8.3.4 Collective Communication Operators

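8.3.4.1 allreduce

Prototype
def allreduce(tensor, reduction, fusion=1, group = "hccl_world_group")

Description
Performs allreduce within the group; the kind of reduce operation is selected by reduction.

Parameters
- tensor (input): a tensorflow tensor. Supported data types: int8, int32, float16, float32.
- reduction (input): String. The reduce op type: "max", "min", "prod", or "sum".
- fusion (input): int. allreduce fusion flag.
  - 0: this allreduce is not fused with other allreduce operators.
  - 1: this allreduce is fused according to the gradient split strategy.
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
tensor: the result of performing allreduce on the input tensor.

Example
from npu_bridge.hccl import hccl_ops
result = hccl_ops.allreduce(tensor, "sum")

Constraints
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.
- The node upstream of allreduce cannot be a variable operator.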
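To make the allreduce semantics concrete, the sketch below simulates the operation on plain Python lists, one list per rank. It does not use hccl_ops or any device; simulate_allreduce is a hypothetical helper that shows only the data movement: every rank ends up with the same element-wise reduction over all ranks' inputs.

```python
from functools import reduce

# The four reduction types accepted by the `reduction` parameter.
REDUCERS = {
    "sum": lambda a, b: a + b,
    "prod": lambda a, b: a * b,
    "max": max,
    "min": min,
}


def simulate_allreduce(per_rank_tensors, reduction):
    """Reduce element-wise across ranks and hand every rank the same result."""
    op = REDUCERS[reduction]
    reduced = [reduce(op, column) for column in zip(*per_rank_tensors)]
    return [list(reduced) for _ in per_rank_tensors]


# Two ranks, each holding a two-element tensor.
print(simulate_allreduce([[1, 2], [3, 4]], "sum"))
print(simulate_allreduce([[1, 2], [3, 4]], "max"))
```

Dividing a "sum" result by the number of ranks gives the mean across ranks, which is a common way to average a metric or gradient in data-parallel training.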


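8.3.4.2 allgather

Prototype
def allgather(tensor, rank_size, group = "hccl_world_group")

Description
Performs allgather within the group, gathering the input Tensors of all ranks.

Parameters
- tensor (input): a tensorflow tensor. Supported data types: int8, int32, float16, float32.
- rank_size (input): int. Number of devices in the group. The maximum is 4096.
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
tensor: the result of performing allgather on the input tensor.

Example
from npu_bridge.hccl import hccl_ops
rank_size = 2
result = hccl_ops.allgather(tensor, rank_size)

Constraints
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.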

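8.3.4.3 broadcast

Prototype
def broadcast(tensor, root_rank, group = "hccl_world_group")

Description
Performs broadcast within the group, sending the data of the root node to the other ranks.

Parameters
- tensor (input): a list of tensorflow tensors. Supported data types: int8, int32, float16, float32.
- root_rank (input): int. rank_id of the root node, given as the rank id within the group.
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
tensor: the result of performing broadcast on the input tensor.

Example
from npu_bridge.hccl import hccl_ops
root = 0
inputs = [tensor]
result = hccl_ops.broadcast(inputs, root)

Constraints
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.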

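8.3.4.4 reduce_scatter

Prototype
def reduce_scatter(tensor, reduction, rank_size, group = "hccl_world_group")

Description
Performs reducescatter within the group; the kind of reduce operation is selected by reduction.

Parameters
- tensor (input): a tensorflow tensor. Supported data types: int8, int32, float16, float32. The first dimension of the tensor must be a multiple of rank size.
- reduction (input): String. The reduce op type: "max", "min", "prod", or "sum".
- rank_size (input): int. Number of devices in the group. The maximum is 4096.
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
tensor: the result of performing reducescatter on the input tensor. The data size of this tensor is subject to a 32Byte alignment requirement.

Example
from npu_bridge.hccl import hccl_ops
rank_size = 2
result = hccl_ops.reduce_scatter(tensor, "sum", rank_size)

Constraints
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.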
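The difference between allgather and reducescatter is easiest to see side by side. The sketch below simulates both on plain Python lists, one list per rank, without hccl_ops or any device. The helper names are hypothetical, and only the "sum" reduction is modelled for reducescatter.

```python
def simulate_allgather(per_rank_tensors):
    """Every rank receives the concatenation of all ranks' inputs, in rank order."""
    gathered = [value for tensor in per_rank_tensors for value in tensor]
    return [list(gathered) for _ in per_rank_tensors]


def simulate_reduce_scatter(per_rank_tensors, rank_size):
    """Sum element-wise across ranks, then give rank r the r-th equal slice."""
    length = len(per_rank_tensors[0])
    if length % rank_size != 0:
        raise ValueError("first dimension must be a multiple of rank_size")
    chunk = length // rank_size
    summed = [sum(column) for column in zip(*per_rank_tensors)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(rank_size)]


print(simulate_allgather([[1], [2]]))
print(simulate_reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]], 2))
```

The length check mirrors the requirement that the first dimension of the input be divisible by the rank size, since each rank keeps exactly one slice of the reduced result.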

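8.3.4.5 send

Prototype
def send(tensor, sr_tag, dest_rank, group = "hccl_world_group")

Description
Sends data to another rank within the group.

Parameters
- tensor (input): a tensorflow tensor. Supported data types: int8, int32, float16, float32.
- sr_tag (input): int. Message tag. A send and the receive that matches it use the same sr_tag.
- dest_rank (input): int. Destination rank, given as the rank id within the group.
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
None.

Example
from npu_bridge.hccl import hccl_ops
sr_tag = 0
dest_rank = 1
hccl_ops.send(tensor, sr_tag, dest_rank)

Constraints
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.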
 rankServerID

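8.3.4.6 receive

Prototype
def receive(shape, data_type, sr_tag, src_rank, group = "hccl_world_group")

Description
Receives data from another rank within the group.

Parameters
- shape (input): shape of the tensor to receive.
- data_type (input): data type of the tensor to receive. Supported data types: int8, int32, float16, float32.
- sr_tag (input): int. Message tag. A send and the receive that matches it use the same sr_tag.
- src_rank (input): int. Source rank, given as the rank id within the group.
- group (input): String, at most 128 bytes including the terminator. Group name: a user-defined group or "hccl_world_group".

Return value
tensor: the tensor obtained by the receive operation.

Example
from npu_bridge.hccl import hccl_ops
sr_tag = 0
src_rank = 0
tensor = hccl_ops.receive(tensor.shape, tensor.dtype, sr_tag, src_rank)

Constraints
- The calling rank must be within the group passed to this interface; a call from a rank outside that group fails.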
 rankServerID
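The pairing of send and receive through sr_tag can be modelled with an in-memory mailbox. The sketch below is plain Python and does not call hccl_ops; the Mailbox class is hypothetical and assumes that a message is identified by the triple (source rank, destination rank, sr_tag).

```python
from collections import defaultdict, deque


class Mailbox:
    """Toy model of point-to-point messages keyed by (src, dest, sr_tag)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def send(self, tensor, sr_tag, src_rank, dest_rank):
        # Queue the payload under the key that the matching receive will use.
        self._queues[(src_rank, dest_rank, sr_tag)].append(tensor)

    def receive(self, sr_tag, src_rank, dest_rank):
        # Pop the oldest payload sent with the same source, destination, and tag.
        return self._queues[(src_rank, dest_rank, sr_tag)].popleft()


box = Mailbox()
box.send([1.0, 2.0], sr_tag=0, src_rank=0, dest_rank=1)
box.send([9.0], sr_tag=1, src_rank=0, dest_rank=1)
print(box.receive(sr_tag=1, src_rank=0, dest_rank=1))
print(box.receive(sr_tag=0, src_rank=0, dest_rank=1))
```

Because messages are matched by tag, the two receives above can be issued in either order and still obtain the payload that was sent with the same sr_tag.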


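9 Examples

This chapter uses ResNet50 and BERT training scripts developed with the TensorFlow Python API to show how to port a network to the Ascend AI Processor.
9.1 Training ResNet50 on the imagenet Dataset
9.2 Training BERT on the bookscorpus Dataset (Estimator Mode)

9.1 Training ResNet50 on the imagenet Dataset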

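9.1.1 Sample Overview

Dataset
This sample uses the imagenet dataset, which can be obtained from http://www.imagenet.org/.

Model
This sample uses the ResNet50 network. The original scripts support the CIFAR-10 and ImageNet datasets; ImageNet has 1000 classes.

Original scripts
https://github.com/tensorflow/models/tree/r2.1_model_reference/official (the Resnet sample)

Directory structure
r1                                  // original model directory
  resnet                            // resnet main directory
    __init__.py
    imagenet_main.py                // trains the network on the Imagenet dataset
    imagenet_preprocessing.py       // Imagenet data preprocessing
    resnet_model.py                 // resnet model definition
    resnet_run_loop.py              // data input processing and run loop
    README.md                       // project description
  utils
    export.py                       // export helpers
    logs
      hooks_helper.py               // hooks for training/evaluation, including CPU/GPU profiling
      logger.py                     // logging utilities
utils
  flags
    core.py                         // common flag definitions
  misc
    distribution_utils.py           // helpers for distributed runs
    model_helpers.py                // helper functions used by the models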

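9.1.2 Training Workflow

Estimator overview
The Estimator API is a high-level TensorFlow API, introduced in TensorFlow 1.10, which was released in 2018.
Training with Estimator follows the steps below.

Table 9-1 Training workflow with Estimator
- Data preprocessing: create the input function input_fn.
- Model construction: build the model function model_fn.
- Run configuration: instantiate Estimator and pass in a Runconfig object.
- Training: call Estimator.train() on the instance.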

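9.1.3 Script Files

The files involved in the port are listed below.

r1
  resnet                            // resnet main directory
    imagenet_main.py                // trains the network on the Imagenet dataset
    imagenet_preprocessing.py       // Imagenet data preprocessing
    resnet_model.py                 // resnet model definition
    resnet_run_loop.py              // data input processing and run loop
utils
  flags
    _base.py

Table 9-2 py files
- imagenet_main.py: imagenet data handling and model functions, including get_filenames(), parse_record(), input_fn(), get_synth_input_fn(), _parse_example_proto(), ImagenetModel, imagenet_model_fn(), run_imagenet(), and define_imagenet_flags().
- imagenet_preprocessing.py: imagenet image preprocessing.
- resnet_model.py: the ResNet model, including the ResNet block definitions.
- resnet_run_loop.py: data input processing (image and label) and the Estimator run loop.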

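9.1.4 Data Preprocessing
Data preprocessing follows the original model; parts of the code are modified to suit the Ascend 910 AI Processor.

Defining the input function input_fn
The imagenet data preprocessing code in the following py files is adapted for the Ascend 910 AI Processor.

Table 9-3 Modified interfaces
- input_fn() in "/official/r1/resnet/imagenet_main.py": input function that feeds data to the Estimator.
- resnet_main() in "/official/r1/resnet/resnet_run_loop.py": main run loop, which contains input_fn_train() and input_fn_eval().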

1. Add the following imports to "official/r1/resnet/imagenet_main.py".
from hccl.manage.api import get_rank_size
from hccl.manage.api import get_rank_id
2. Get the number of devices and the device id to support data parallelism.
Modify input_fn() in "official/r1/resnet/imagenet_main.py" as follows.
def input_fn(is_training, data_dir, batch_size, num_epochs=1,
             dtype=tf.float32, datasets_num_private_threads=None,
             parse_record_fn=parse_record, input_context=None,
             drop_remainder=False, tf_data_experimental_slack=False):
  """Input function that returns batches of data.

  Args:
    is_training: whether the input is used for training.
    data_dir: directory that contains the input dataset.
    batch_size: batch size.
    num_epochs: number of times to repeat the dataset.
    dtype: data type of the images/features.
    datasets_num_private_threads: number of private threads for tf.data.
    parse_record_fn: function that parses tfrecords.
    input_context: the 'tf.distribute.InputContext' passed in by 'tf.distribute.Strategy'.
    drop_remainder: whether to drop the last batch when it has fewer than
      batch_size elements. True keeps the batch shape fixed.
    tf_data_experimental_slack: whether to enable the tf.data 'experimental_slack' option.

  Returns:
    A dataset that can be iterated over.
  """
  # Get the file paths and names.
  filenames = get_filenames(is_training, data_dir)
  # Split the file list along its first dimension.
  dataset = tf.data.Dataset.from_tensor_slices(filenames)

  if input_context:
    # Get the number of devices and the device id to support data parallelism.
    ############## npu modify begin #############
    dataset = dataset.shard(get_rank_size(), get_rank_id())
    ############## npu modify end ###############

  # Original code for data parallelism:
  # if input_context:
  #   tf.compat.v1.logging.info(
  #       'Sharding the dataset: input_pipeline_id=%d num_input_pipelines=%d' % (
  #           input_context.input_pipeline_id, input_context.num_input_pipelines))
  #   dataset = dataset.shard(input_context.num_input_pipelines,
  #                           input_context.input_pipeline_id)

  if is_training:
    # Shuffle the input files.
    dataset = dataset.shuffle(buffer_size=_NUM_TRAIN_FILES)

  # cycle_length = 10 reads and deserializes 10 files in parallel; raise it if CPU resources allow.
  dataset = dataset.interleave(
      tf.data.TFRecordDataset,
      cycle_length=10,
      num_parallel_calls=tf.data.experimental.AUTOTUNE)

  return resnet_run_loop.process_record_dataset(
      dataset=dataset,
      is_training=is_training,
      batch_size=batch_size,
      shuffle_buffer=_SHUFFLE_BUFFER,
      parse_record_fn=parse_record_fn,
      num_epochs=num_epochs,
      dtype=dtype,
      datasets_num_private_threads=datasets_num_private_threads,
      drop_remainder=drop_remainder,
      tf_data_experimental_slack=tf_data_experimental_slack,
  )
3. Set drop_remainder to True.
Modify input_fn_train() and input_fn_eval() in resnet_main() of "/official/r1/resnet/resnet_run_loop.py" as follows.
def input_fn_train(num_epochs, input_context=None):
  ############## npu modify begin #############
  # Use dtype=tf.float16.
  # drop_remainder must be set to True.
  # batch_size here is the batch size of a single device, not the global batch size.
  return input_function(
      is_training=True,
      data_dir=flags_obj.data_dir,
      batch_size=flags_obj.batch_size,
      num_epochs=num_epochs,
      dtype=tf.float16,
      input_context=input_context,
      drop_remainder=True)

def input_fn_eval():
  # Use dtype=tf.float16.
  # drop_remainder must be set to True.
  # batch_size here is the batch size of a single device, not the global batch size.
  return input_function(
      is_training=False,
      data_dir=flags_obj.data_dir,
      batch_size=flags_obj.batch_size,
      num_epochs=1,
      dtype=tf.float16,
      input_context=True,
      drop_remainder=True)
  ############## npu modify end ###############

# Original input_fn_train() and input_fn_eval():
# def input_fn_train(num_epochs, input_context=None):
#   return input_function(
#       is_training=True,
#       data_dir=flags_obj.data_dir,
#       batch_size=distribution_utils.per_replica_batch_size(
#           flags_obj.batch_size, flags_core.get_num_gpus(flags_obj)),
#       num_epochs=num_epochs,
#       dtype=flags_core.get_tf_dtype(flags_obj),
#       datasets_num_private_threads=flags_obj.datasets_num_private_threads,
#       input_context=input_context)
#
# def input_fn_eval():
#   return input_function(
#       is_training=False,
#       data_dir=flags_obj.data_dir,
#       batch_size=distribution_utils.per_replica_batch_size(
#           flags_obj.batch_size, flags_core.get_num_gpus(flags_obj)),
#       num_epochs=1,
#       dtype=flags_core.get_tf_dtype(flags_obj))
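The two data-pipeline changes in this section, sharding by rank and dropping the last partial batch, can be illustrated without TensorFlow. The sketch below reproduces on plain Python lists the element selection that tf.data.Dataset.shard(num_shards, index) performs (every element whose position modulo num_shards equals index) and the effect of drop_remainder=True when batching. The helper names shard and batch are hypothetical.

```python
def shard(items, num_shards, index):
    """Keep the elements whose position modulo num_shards equals index."""
    return [item for pos, item in enumerate(items) if pos % num_shards == index]


def batch(items, batch_size, drop_remainder):
    """Group items into batches; optionally drop a final short batch."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


files = list(range(10))                      # stand-in for ten input files
print(shard(files, 4, 1))                    # files seen by rank 1 of 4
print(batch(list(range(7)), 3, True))        # every batch has exactly 3 elements
print(batch(list(range(7)), 3, False))       # the last batch is shorter
```

With drop_remainder enabled every batch has the same shape, which is why the option matters when the compiled graph expects a fixed batch dimension.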

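9.1.5 Model Construction

The functions below take part in building the imagenet model.

Table 9-4 Model construction functions
- imagenet_model_fn() in "/official/r1/resnet/imagenet_main.py": model function for imagenet.
- learning_rate_with_decay() in "/official/r1/resnet/resnet_run_loop.py": builds the learning rate function with decay.
- resnet_model_fn() in "/official/r1/resnet/resnet_run_loop.py": builds the EstimatorSpec used by the Estimator.
- ImagenetModel() in "/official/r1/resnet/imagenet_main.py": subclass of the resnet_model Model that defines the imagenet ResNet network.
- __call__() in "/official/r1/resnet/resnet_model.py": runs the model in 7 steps: (1) convert the input from NHWC to NCHW for GPU performance; (2) initial convolution; (3) batch norm, depending on the ResNet version; (4) first pooling; (5) stacked block layers; (6) average pooling; (7) fully connected layer.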



1. Add the following import to "/official/r1/resnet/resnet_run_loop.py".
from npu_bridge.hccl import hccl_ops
2. Replace max_pooling2d with max_pool_with_argmax.
Modify __call__() in "official/r1/resnet/resnet_model.py" as follows.
# Add the pooling layer if required.
if self.first_pool_size:
  ############## npu modify begin #############
  # Use max_pool_with_argmax in place of max_pooling2d.
  inputs, argmax = tf.compat.v1.nn.max_pool_with_argmax(
      input=inputs,
      ksize=(1, self.first_pool_size, self.first_pool_size, 1),
      strides=(1, self.first_pool_stride, self.first_pool_stride, 1),
      padding='SAME',
      data_format='NCHW' if self.data_format == 'channels_first' else 'NHWC')
  ############## npu modify end ###############

  # Original code, using max_pooling2d():
  # inputs = tf.compat.v1.layers.max_pooling2d(
  #     inputs=inputs, pool_size=self.first_pool_size,
  #     strides=self.first_pool_stride, padding='SAME',
  #     data_format=self.data_format)

  inputs = tf.identity(inputs, 'initial_max_pool')
3. Cast features to the compute dtype.
Modify resnet_model_fn() in "official/r1/resnet/resnet_run_loop.py" as follows.
############## npu modify begin #############
# Cast features when their dtype differs from the compute dtype.
if features.dtype != dtype:
  features = tf.cast(features, dtype)
############## npu modify end ###############

# Original code:
# assert features.dtype == dtype
4. Cast labels to float32 when computing accuracy.
Modify resnet_model_fn() in "official/r1/resnet/resnet_run_loop.py" as follows.
############## npu modify begin #############
# Cast labels to float32.
accuracy = tf.compat.v1.metrics.accuracy(tf.cast(labels, tf.float32), predictions['classes'])
############## npu modify end ###############

# Original accuracy computation:
# accuracy = tf.compat.v1.metrics.accuracy(labels, predictions['classes'])

accuracy_top_5 = tf.compat.v1.metrics.mean(
    tf.nn.in_top_k(predictions=logits, targets=labels, k=5, name='top_5_op'))

############## npu modify begin #############
# Average the accuracy over all devices: allreduce with "sum", then divide by rank_size.
rank_size = int(os.getenv('RANK_SIZE'))
newaccuracy = (hccl_ops.allreduce(accuracy[0], "sum") / rank_size, accuracy[1])
newaccuracy_top_5 = (hccl_ops.allreduce(accuracy_top_5[0], "sum") / rank_size, accuracy_top_5[1])
metrics = {'accuracy': newaccuracy,
           'accuracy_top_5': newaccuracy_top_5}
############## npu modify end #############

# Original metrics:
# metrics = {'accuracy': accuracy,
#            'accuracy_top_5': accuracy_top_5}

Using NPUDistributedOptimizer
1. Add the following import to "official/r1/resnet/resnet_run_loop.py".
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer
2. Wrap the optimizer with NPUDistributedOptimizer.
Modify resnet_model_fn() in "official/r1/resnet/resnet_run_loop.py" as follows.
if flags.FLAGS.enable_lars:
  from tensorflow.contrib import opt as contrib_opt  # pylint: disable=g-import-not-at-top
  optimizer = contrib_opt.LARSOptimizer(
      learning_rate,
      momentum=momentum,
      weight_decay=weight_decay,
      skip_list=['batch_normalization', 'bias'])
else:
  optimizer = tf.compat.v1.train.MomentumOptimizer(
      learning_rate=learning_rate,
      momentum=momentum
  )

############## npu modify begin #############
# Use the distributed training optimizer.
optimizer = NPUDistributedOptimizer(optimizer)
############## npu modify end ###############

fp16_implementation = getattr(flags.FLAGS, 'fp16_implementation', None)
if fp16_implementation == 'graph_rewrite':
  optimizer = (
      tf.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite(
          optimizer, loss_scale=loss_scale))

9.1.6 Run Configuration



The run configuration is set in resnet_main().

Table 9-5 Run configuration function
- resnet_main() in "/official/r1/resnet/resnet_run_loop.py": main run loop.

1. Add the following imports to "/official/r1/resnet/resnet_run_loop.py".
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator
2. Replace Runconfig with NPURunconfig.
Modify resnet_main() in "official/r1/resnet/resnet_run_loop.py" as follows.
############## npu modify begin #############
# Use NPURunconfig in place of Runconfig to suit the Ascend AI Processor.
# A checkpoint is saved every 115200 steps and a summary every 10000 steps.
run_config = NPURunConfig(
    model_dir=flags_obj.model_dir,
    session_config=session_config,
    save_checkpoints_steps=115200,
    enable_data_pre_proc=True,
    iterations_per_loop=100,
    # enable_auto_mix_precision=True,
    # Enable mixed precision.
    precision_mode='allow_mix_precision',
    hcom_parallel=True
)
############## npu modify end ###############

# Original run configuration:
# run_config = tf.estimator.RunConfig(
#     train_distribute=distribution_strategy,
#     session_config=session_config,
#     save_checkpoints_secs=60 * 60 * 24,
#     save_checkpoints_steps=None)

Setting precision_mode='allow_mix_precision' enables mixed precision.
3. Replace tf.estimator.Estimator with NPUEstimator.
Modify resnet_main() in "/official/r1/resnet/resnet_run_loop.py" as follows.
############## npu modify begin #############
# Use `NPUEstimator` in place of tf.estimator.Estimator to suit the Ascend AI Processor.
classifier = NPUEstimator(
    model_fn=model_function, model_dir=flags_obj.model_dir, config=run_config,
    params={
        'resnet_size': int(flags_obj.resnet_size),
        'data_format': flags_obj.data_format,
        'batch_size': flags_obj.batch_size,
        'resnet_version': int(flags_obj.resnet_version),
        'loss_scale': flags_core.get_loss_scale(flags_obj,
                                                default_for_fp16=128),
        'dtype': flags_core.get_tf_dtype(flags_obj),
        'fine_tune': flags_obj.fine_tune,
        'num_workers': num_workers,
        'num_gpus': flags_core.get_num_gpus(flags_obj),
    })
############## npu modify end ###############

# Original Estimator:
# classifier = tf.estimator.Estimator(
#     model_fn=model_function, model_dir=flags_obj.model_dir, config=run_config,
#     warm_start_from=warm_start_settings, params={
#         'resnet_size': int(flags_obj.resnet_size),
#         'data_format': flags_obj.data_format,
#         'batch_size': flags_obj.batch_size,
#         'resnet_version': int(flags_obj.resnet_version),
#         'loss_scale': flags_core.get_loss_scale(flags_obj,
#                                                 default_for_fp16=128),
#         'dtype': flags_core.get_tf_dtype(flags_obj),
#         'fine_tune': flags_obj.fine_tune,
#         'num_workers': num_workers,
#     })

9.1.7 Training Execution

The functions below drive the training run.

Table 9-6 Training execution functions
- main() in "/official/r1/resnet/imagenet_main.py": program entry.
- run_imagenet() in "/official/r1/resnet/imagenet_main.py": runs the imagenet training.
- resnet_main() in "/official/r1/resnet/resnet_run_loop.py": main run loop.

1. Add the following imports to "official/r1/resnet/resnet_run_loop.py".
from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf import rewriter_config_pb2

2. Initialize the NPU before training.
Modify main() in "official/r1/resnet/imagenet_main.py" as follows.
def main(_):
  ############## npu modify begin #############
  # Initialize the NPU and HCCL before training starts.
  init_sess, npu_init = resnet_run_loop.init_npu()
  init_sess.run(npu_init)
  ############## npu modify end ###############

  with logger.benchmark_context(flags.FLAGS):
    run_imagenet(flags.FLAGS)
3. Add the initialization function.
Add init_npu() to "official/r1/resnet/resnet_run_loop.py" as follows.
def resnet_main(flags_obj, model_function, input_function, dataset_name, shape=None):
  ...

############## npu modify begin #############
# Add the following function.
def init_npu():
  """Initializes the NPU.

  Returns:
    `init_sess` npu init session config.
    `npu_init` npu init ops.
  """
  npu_init = npu_ops.initialize_system()
  config = tf.ConfigProto()

  config.graph_options.rewrite_options.remapping = rewriter_config_pb2.RewriterConfig.OFF
  custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
  custom_op.name = "NpuOptimizer"
  # custom_op.parameter_map["precision_mode"].b = True
  custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
  custom_op.parameter_map["use_off_line"].b = True

  init_sess = tf.Session(config=config)
  return init_sess, npu_init
############## npu modify end ###############
4. Shut down and re-initialize the NPU after each training phase and each evaluation phase.
Modify resnet_main() in "official/r1/resnet/resnet_run_loop.py" as follows.
for cycle_index, num_train_epochs in enumerate(schedule):
  tf.compat.v1.logging.info('Starting cycle: %d/%d', cycle_index, int(n_loops))

  if num_train_epochs:
    # Since we are calling classifier.train immediately in each loop, the
    # value of num_train_epochs in the lambda function will not be changed
    # before it is used. So it is safe to ignore the pylint error here
    # pylint: disable=cell-var-from-loop
    classifier.train(
        input_fn=lambda input_context=None: input_fn_train(
            num_train_epochs, input_context=input_context),
        hooks=train_hooks,
        max_steps=flags_obj.max_train_steps)

  ############## npu modify begin #############
  # After training, shut down the npu and hccl resources and initialize them again.
  init_sess, npu_init = init_npu()
  npu_shutdown = npu_ops.shutdown_system()
  init_sess.run(npu_shutdown)
  init_sess.run(npu_init)
  ############## npu modify end ###############

  tf.compat.v1.logging.info('Starting to evaluate.')
  eval_results = classifier.evaluate(input_fn=input_fn_eval,
                                     steps=flags_obj.max_train_steps)

  benchmark_logger.log_evaluation_result(eval_results)

  if model_helpers.past_stop_threshold(
      flags_obj.stop_threshold, eval_results['accuracy']):
    break

  ############## npu modify begin #############
  # After evaluation, shut down the npu and hccl resources and initialize them again.
  init_sess, npu_init = init_npu()
  npu_shutdown = npu_ops.shutdown_system()
  init_sess.run(npu_shutdown)
  init_sess.run(npu_init)
  ############## npu modify end ###############
5. Release the resources when training and evaluation are complete.
When all training and evaluation has finished, release the resources with npu_ops.shutdown_system.
Modify resnet_main() in "official/r1/resnet/resnet_run_loop.py" as follows.
if flags_obj.export_dir is not None:
  # Exports a saved model for the given classifier.
  export_dtype = flags_core.get_tf_dtype(flags_obj)
  if flags_obj.image_bytes_as_serving_input:
    input_receiver_fn = functools.partial(
        image_bytes_serving_input_fn, shape, dtype=export_dtype)
  else:
    input_receiver_fn = export.build_tensor_serving_input_receiver_fn(
        shape, batch_size=flags_obj.batch_size, dtype=export_dtype)
  classifier.export_savedmodel(flags_obj.export_dir, input_receiver_fn,
                               strip_default_attrs=True)

############## npu modify begin #############
# Release the resources with npu_ops.shutdown_system once training and evaluation are complete.
npu_shutdown = npu_ops.shutdown_system()
init_sess.run(npu_shutdown)
############## npu modify end ###############

stats = {}
stats['eval_results'] = eval_results
stats['train_hooks'] = train_hooks

return stats

Setting the loss scale
Set the default loss scale.
Modify define_imagenet_flags() in "official/r1/resnet/imagenet_main.py" as follows.
def define_imagenet_flags():
  resnet_run_loop.define_resnet_flags(
      resnet_size_choices=['18', '34', '50', '101', '152', '200'],
      dynamic_loss_scale=True,
      fp16_implementation=True)
  flags.adopt_module_key_flags(resnet_run_loop)
  flags_core.set_defaults(train_epochs=90)

  ############## npu modify begin #############
  # Set the default loss_scale used on the Ascend AI Processor.
  flags_core.set_defaults(loss_scale='512')
  ############## npu modify end ###############
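9.1.8 Running the Training

Prepare the dataset
This sample assumes that the dataset is stored in /home/data/resnet50/imagenet.

Configure the ranktable file
Prepare the ranktable file for the devices that take part in training.

Run the training script
python3 /home/official/r1/resnet/imagenet_main.py --batch_size=32 --hooks=ExamplesPerSecondHook --data_dir=/home/data/resnet50/imagenet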

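9.2 Training BERT on the bookscorpus Dataset (Estimator Mode)

9.2.1 Sample Overview

Dataset
This sample uses the bookscorpus dataset converted to tfrecord format. It assumes that the data is stored in "/home/data/bert/cn-clue-256/".

Model
This sample pretrains the BERT network on the bookscorpus dataset.

Original scripts
https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT (the BERT sample)

Directory structure
BERT                                # BERT main directory
  __init__.py
  extract_features.py               // feature extraction
  fp16_utils.py                     // fp16 utils
  fused_layer_norm.py               // layer norm
  gpu_environment.py                // gpu_environment
  modeling.py                       // BERT model
  optimization.py                   // optimizers
  run_pretraining.py                // pretraining entry script
  tf_metrics.py                     // tf metrics
  tokenization.py                   // tokenization
  scripts/                          # scripts
    data_download.sh                // downloads the data into data/
    run_pretraining_adam.sh         // runs run_pretraining.py with the Adam optimizer
    run_pretraining_lamb.sh         // runs run_pretraining.py with the LAMB optimizer
  data/                             # data used by BERT
  utils/
    utils.py                        // utils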

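9.2.2 Training Workflow

Estimator overview
The Estimator API is a high-level TensorFlow API, introduced in TensorFlow 1.10, which was released in 2018.
Training with Estimator follows the steps below.

Table 9-7 Training workflow with Estimator
- Data preprocessing: create the input function input_fn.
- Model construction: build the model function model_fn.
- Run configuration: instantiate Estimator and pass in a Runconfig object.
- Training: call Estimator.train() on the instance.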

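9.2.3 Script Files

The files involved in porting to the Ascend 910 AI Processor are listed below and described in Table 9-8.

BERT
  gpu_environment.py                // gpu_environment
  modeling.py                       // BERT model
  optimization.py                   // optimizers
  run_pretraining.py                // pretraining entry script
  scripts/                          # scripts
  utils/
    utils.py                        // utils

Table 9-8 py files
- gpu_environment.py: custom getter that handles tensors in tf.float16.
- modeling.py: definition of the BERT model.
- optimization.py: optimizers, including AdamWeightDecayOptimizer, LAMBOptimizer, and create_optimizer().
- run_pretraining.py: bookscorpus data preprocessing with input_fn_builder() and _decode_record(); Estimator model functions model_fn_builder(), get_masked_lm_output(), get_next_sentence_output(), and gather_indexes(); and the entry function main().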

9.2.4 Data Preprocessing
Data preprocessing follows the original model; parts of the code are modified to suit the Ascend 910 AI Processor.

Defining the input function input_fn
The bookscorpus data preprocessing code is adapted for the Ascend 910 AI Processor. The py file changes are listed in Table 9-9.

Table 9-9 Modified interfaces
- input_fn_builder() in "BERT/run_pretraining.py": builds the input_fn that feeds data to the Estimator.
- _decode_record() in "BERT/run_pretraining.py": converts tf.int64 tensors to tf.int32 tensors for the Ascend 910 AI Processor.

1. Get the number of devices and the device id to support data parallelism.
Modify input_fn() in input_fn_builder() of "BERT/run_pretraining.py" as follows.
if is_training:
  d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
  ############## npu modify begin #############
  # Get the number of devices and the device id to support data parallelism.
  if FLAGS.distributed:
    rank_size = int(os.getenv('RANK_SIZE'))
    rank_id = int(os.getenv('RANK_INDEX'))
    device_id = int(os.getenv('DEVICE_ID'))
    local_rank = rank_id * 8 + device_id
    print('RANK_SIZE=', rank_size, ' RANK_ID=', local_rank)
    d = d.shard(rank_size, local_rank)
  ############## npu modify end #############

  # Original code:
  # if hvd is not None:
  #   d = d.shard(hvd.size(), hvd.rank())
2. Shuffle the input files only when "npu_bert_debug" is off.
Modify input_fn() in input_fn_builder() of "BERT/run_pretraining.py" as follows.
  d = d.repeat()
  ############## npu modify begin #############
  # Skip the shuffle when "npu_bert_debug" is set.
  if not FLAGS.npu_bert_debug:
    d = d.shuffle(buffer_size=len(input_files))
  ############## npu modify end #############

  # Original code:
  # d = d.shuffle(buffer_size=len(input_files))
3. Set cycle_length according to "npu_bert_debug".
Modify input_fn() in input_fn_builder() of "BERT/run_pretraining.py" as follows.
  ############## npu modify begin #############
  # `cycle_length` is the number of parallel files that get read.
  # It is 1 when "npu_bert_debug" is set.
  if not FLAGS.npu_bert_debug:
    cycle_length = min(num_cpu_threads, int(len(input_files)/int(os.getenv('RANK_SIZE'))))
  else:
    cycle_length = 1
  ############## npu modify end #############

  # Original code:
  # cycle_length = min(num_cpu_threads, len(input_files))

  ############## npu modify begin #############
  # Read the files in parallel with interleave.
  d = d.interleave(
      tf.data.TFRecordDataset,
      cycle_length=cycle_length,
      num_parallel_calls=tf.data.experimental.AUTOTUNE)
  # Skip the shuffle when "npu_bert_debug" is set.
  if not FLAGS.npu_bert_debug:
    d = d.shuffle(buffer_size=100)
  ############## npu modify end #############

  # Original code:
  # # `sloppy` mode means that the interleaving is not exact. This adds
  # # even more randomness to the training pipeline.
  # d = d.apply(
  #     tf.contrib.data.parallel_interleave(
  #         tf.data.TFRecordDataset,
  #         sloppy=is_training,
  #         cycle_length=cycle_length))
  # d = d.shuffle(buffer_size=100)
4. Set drop_remainder to True so that every batch on the Ascend 910 AI Processor has exactly batch_size elements.
Modify input_fn() in input_fn_builder() of "BERT/run_pretraining.py" as follows.
############## npu modify begin #############
# Set drop_remainder to True.
d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True))
############## npu modify end ###############

# Original code:
# d = d.apply(
#     tf.contrib.data.map_and_batch(
#         lambda record: _decode_record(record, name_to_features),
#         batch_size=batch_size,
#         num_parallel_batches=num_cpu_threads,
#         drop_remainder=True if is_training else False))
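The shard index and cycle_length computed in steps 1 and 3 are simple arithmetic on environment variables. The sketch below restates that arithmetic in plain Python so it can be checked in isolation. The function names are hypothetical, and the factor of 8 is kept as a parameter that reflects the assumption of 8 devices per server in the modified code. Although the modified code stores the result in a variable named local_rank, the value it computes is the global position of the device across all servers.

```python
import os


def global_rank(env=os.environ, devices_per_server=8):
    """Shard index used in step 1: server index times 8 plus device id."""
    rank_index = int(env["RANK_INDEX"])   # index of the server
    device_id = int(env["DEVICE_ID"])     # device id within the server
    return rank_index * devices_per_server + device_id


def cycle_length(num_cpu_threads, num_input_files, rank_size, debug=False):
    """Number of files read in parallel in step 3; 1 in debug mode."""
    if debug:
        return 1
    return min(num_cpu_threads, num_input_files // rank_size)


# Device 3 on the second server (index 1) of a 16-device job with 64 input files.
fake_env = {"RANK_INDEX": "1", "DEVICE_ID": "3"}
print(global_rank(fake_env))
print(cycle_length(num_cpu_threads=8, num_input_files=64, rank_size=16))
```

Dividing the file count by RANK_SIZE reflects that, after sharding, each device only sees its own share of the input files.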

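9.2.5 Model Construction

Model construction follows the original model; parts of the code are modified to suit the Ascend 910 AI Processor.

The functions involved in building the bookscorpus BERT model are listed in Table 9-10.

Table 9-10 Model construction functions
- model_fn_builder() in "BERT/run_pretraining.py": builds the model_fn of the bookscorpus BERT model for the Estimator.
- get_masked_lm_output() in "BERT/run_pretraining.py": computes the loss of the Masked LM task.
- get_next_sentence_output() in "BERT/run_pretraining.py": computes the loss of the next sentence prediction task.
- BertConfig() in "BERT/modeling.py": configuration class of the bert model, used by BertModel.
- BertModel() in "BERT/modeling.py": builds the bert model.
- embedding_lookup() in "BERT/modeling.py": looks up the embedding of each id, with width embedding_size.
- embedding_postprocessor() in "BERT/modeling.py": post-processes the tensor returned by embedding_lookup().
- gather_npu() in "BERT/modeling.py": replaces tf.gather() on the Ascend 910 AI Processor.
- gelu() in "BERT/modeling.py": replaces tf.nn.gelu() on the Ascend 910 AI Processor.
- dropout() in "BERT/modeling.py": replaces tf.nn.dropout() on the Ascend 910 AI Processor.
- LogEvalRunHook() in "BERT/utils/utils.py": run hook that logs evaluation.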



1. "BERT/run_pretraining.py":
import utils.dllogger_class
from dllogger import Verbosity
2. "BERT/run_pretraining.py":
from gpu_environment import get_custom_getter
3. In the optimization.create_optimizer() call, pass FLAGS.use_fp16 in place of FLAGS.amp and drop the FLAGS.init_loss_scale argument, to adapt to the Ascend 910 AI processor.
In "BERT/run_pretraining.py", in model_fn() inside model_fn_builder():
############## npu modify begin #############
# Pass FLAGS.use_fp16 instead of FLAGS.amp; FLAGS.init_loss_scale is no longer passed.
if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = optimization.create_optimizer(
        total_loss, learning_rate, num_train_steps, num_warmup_steps,
        hvd, FLAGS.manual_fp16, FLAGS.use_fp16, FLAGS.num_accumulation_steps,
        FLAGS.optimizer_type, FLAGS.allreduce_post_accumulation)
############## npu modify end ###############

# Original code:
# if mode == tf.estimator.ModeKeys.TRAIN:
#     train_op = optimization.create_optimizer(
#         total_loss, learning_rate, num_train_steps, num_warmup_steps,
#         hvd, FLAGS.manual_fp16, FLAGS.amp, FLAGS.num_accumulation_steps,
#         FLAGS.optimizer_type,
#         FLAGS.allreduce_post_accumulation, FLAGS.init_loss_scale)


4. Add a check on the "use_fp16_cls" flag: when it is set, input_tensor is cast to tf.float16 for the computation and the result is cast back to tf.float32.
In "BERT/run_pretraining.py", modify get_masked_lm_output() as follows:
def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
    """Get loss and log probs for the masked LM."""
    input_tensor = gather_indexes(input_tensor, positions)

    with tf.variable_scope("cls/predictions"):
        ############## npu modify begin #############
        # Before tf.layers.dense(), cast input_tensor to tf.float16 when use_fp16_cls is set.
        # After tf.layers.dense(), cast input_tensor back to tf.float32.
        with tf.variable_scope("transform", custom_getter=get_custom_getter(
                compute_type=tf.float16 if FLAGS.use_fp16_cls else tf.float32)):
            if FLAGS.use_fp16_cls:
                input_tensor = tf.cast(input_tensor, tf.float16)
            input_tensor = tf.layers.dense(
                input_tensor,
                units=bert_config.hidden_size,
                activation=modeling.get_activation(bert_config.hidden_act),
                kernel_initializer=modeling.create_initializer(
                    bert_config.initializer_range))
            input_tensor = tf.cast(input_tensor, tf.float32)
            input_tensor = modeling.layer_norm(input_tensor)
        ############## npu modify end #############

        # Original code:
        # with tf.variable_scope("transform"):
        #     input_tensor = tf.layers.dense(
        #         input_tensor,
        #         units=bert_config.hidden_size,
        #         activation=modeling.get_activation(bert_config.hidden_act),
        #         kernel_initializer=modeling.create_initializer(
        #             bert_config.initializer_range))
        #     input_tensor = modeling.layer_norm(input_tensor)

        output_bias = tf.get_variable(
            "output_bias",
            shape=[bert_config.vocab_size],
            initializer=tf.zeros_initializer())

        ############## npu modify begin #############
        # When use_fp16_cls is set, run the matmul on tf.float16 inputs,
        # then cast the logits back to tf.float32.
        if FLAGS.use_fp16_cls:
            input_tensor = tf.cast(input_tensor, tf.float16)
            logits = tf.matmul(input_tensor, tf.cast(output_weights, tf.float16), transpose_b=True)
            logits = tf.cast(logits, tf.float32)
        else:
            logits = tf.matmul(tf.cast(input_tensor, tf.float32), output_weights, transpose_b=True)
        ############## npu modify end ###############

        # Original code:
        # logits = tf.matmul(tf.cast(input_tensor, tf.float32), output_weights, transpose_b=True)
5. Add a check on the "use_fp16_cls" flag: when it is set, input_tensor is cast to tf.float16 for the computation and the result is cast back to tf.float32.
In "BERT/run_pretraining.py", modify get_next_sentence_output() as follows:
def get_next_sentence_output(bert_config, input_tensor, labels):
    with tf.variable_scope("cls/seq_relationship"):
        output_weights = tf.get_variable(
            "output_weights",
            shape=[2, bert_config.hidden_size],
            initializer=modeling.create_initializer(bert_config.initializer_range))
        output_bias = tf.get_variable(
            "output_bias", shape=[2], initializer=tf.zeros_initializer())
        ############## npu modify begin #############
        # When use_fp16_cls is set, run the matmul on tf.float16 inputs,
        # then cast the logits back to tf.float32.
        if FLAGS.use_fp16_cls:
            input_tensor = tf.cast(input_tensor, tf.float16)
            logits = tf.matmul(input_tensor, tf.cast(output_weights, tf.float16), transpose_b=True)
            logits = tf.cast(logits, tf.float32)
        else:
            logits = tf.matmul(tf.cast(input_tensor, tf.float32), output_weights, transpose_b=True)
        ############## npu modify end #############

        # Original code:
        # logits = tf.matmul(tf.cast(input_tensor, tf.float32), output_weights, transpose_b=True)
6. Add a check on the "use_fp16_cls" flag: when it is set, first_token_tensor is cast to tf.float16, and self.pooled_output is cast back to tf.float32.
In "BERT/modeling.py", modify BertModel.__init__() as follows:
############## npu modify begin #############
# When use_fp16_cls is set, cast first_token_tensor to tf.float16;
# self.pooled_output is cast back to tf.float32.
with tf.variable_scope("pooler"):
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    if tf.flags.FLAGS.use_fp16_cls:
        first_token_tensor = tf.cast(first_token_tensor, tf.float16)
    self.pooled_output = tf.layers.dense(
        first_token_tensor,
        config.hidden_size,
        activation=tf.tanh,
        kernel_initializer=create_initializer(config.initializer_range))
    self.pooled_output = tf.cast(self.pooled_output, tf.float32)
############## npu modify end #############

# Original code:
# with tf.variable_scope("pooler"):
#     # We "pool" the model by simply taking the hidden state corresponding
#     # to the first token. We assume that this has been pre-trained
#     first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
#     self.pooled_output = tf.layers.dense(
#         first_token_tensor,
#         config.hidden_size,
#         activation=tf.tanh,
#         kernel_initializer=create_initializer(config.initializer_range))
7. Add gather_npu() as a replacement for tf.gather(), to adapt to the Ascend 910 AI processor.
In "BERT/modeling.py", modify embedding_lookup() and embedding_postprocessor().
- Add the gather_npu() function to modeling.py:
@tf.custom_gradient
def gather_npu(params, indices):
    def grad(dy):
        params_shape = tf.shape(params, out_type=tf.int64)
        params_shape = tf.cast(params_shape, tf.int32)
        grad_gather = tf.unsorted_segment_sum(dy, indices, params_shape[0])
        return grad_gather, None
    return tf.gather(params, indices), grad
- Modify embedding_lookup():
if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
else:
    ############## npu modify begin #############
    # When the npu_gather flag is set, use gather_npu() instead of tf.gather().
    if tf.flags.FLAGS.npu_gather:
        output = gather_npu(embedding_table, flat_input_ids)
    else:
        output = tf.gather(embedding_table, flat_input_ids)
    ############## npu modify end #############

    # Original code:
    # output = tf.gather(embedding_table, flat_input_ids)
- Modify embedding_postprocessor():
if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
else:
    ############## npu modify begin #############
    # When the npu_gather flag is set, use gather_npu() instead of tf.gather().
    if tf.flags.FLAGS.npu_gather:
        token_type_embeddings = gather_npu(token_type_table, flat_token_type_ids)
    else:
        token_type_embeddings = tf.gather(token_type_table, flat_token_type_ids)
    ############## npu modify end #############

    # Original code:
    # token_type_embeddings = tf.gather(token_type_table, flat_token_type_ids)
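The effect of the custom gradient in gather_npu() can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the library implementation: gather and gather_grad_dense are hypothetical helper names, and nested lists stand in for tensors. It shows how an unsorted segment sum turns the upstream gradient into a dense gradient with the same number of rows as the embedding table, which is what lets the backward pass avoid IndexedSlices (as the help text of the npu_gather flag states).

```python
def gather(params, indices):
    """Forward pass: pick rows of params by index (rows may repeat)."""
    return [list(params[i]) for i in indices]


def gather_grad_dense(dy, indices, num_rows):
    """Backward pass written as an unsorted segment sum.

    Every upstream gradient row dy[k] is added into row indices[k] of a
    dense, zero-initialised gradient with as many rows as params.
    """
    width = len(dy[0])
    grad = [[0.0] * width for _ in range(num_rows)]
    for row, idx in zip(dy, indices):
        for col, value in enumerate(row):
            grad[idx][col] += value
    return grad


table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy embedding table with 3 rows
ids = [2, 0, 2]                                # row 2 is looked up twice
out = gather(table, ids)
dy = [[1.0, 1.0] for _ in ids]                 # pretend upstream gradient of ones
grad = gather_grad_dense(dy, ids, len(table))
```

Row 2 receives the sum of two upstream rows because it was looked up twice, and row 1 stays at zero because it was never read.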
Operators adapted to the Ascend 910 AI processor
1. Add the following imports to "BERT/modeling.py":
from npu_bridge.estimator.npu_unary_ops import npu_unary_ops
from npu_bridge.estimator import npu_ops
2. Use npu_unary_ops.gelu() in place of tf.nn.gelu(); npu_unary_ops.gelu() is the GELU operator provided for the Ascend 910 AI processor.
In "BERT/modeling.py", modify gelu() as follows:
def gelu(x):
    """Gaussian Error Linear Unit activation.

    Args:
      x: input tensor.
    Returns:
      tensor with the GELU activation applied.
    """
    ############## npu modify begin #############
    # When the npu_bert_fused_gelu flag is set, use npu_unary_ops.gelu(),
    # which is adapted to the Ascend 910 AI processor.
    if tf.flags.FLAGS.npu_bert_fused_gelu:
        return npu_unary_ops.gelu(x)
    else:
        cdf = 0.5 * (1.0 + tf.tanh(
            (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
        return x * cdf
    ############## npu modify end #############

    # Original code:
    # cdf = 0.5 * (1.0 + tf.tanh(
    #     (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    # return x * cdf
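The fallback branch of gelu() uses the tanh approximation of GELU. As a quick check of how close that approximation is to the exact definition, here is a standard-library sketch; gelu_tanh and gelu_exact are hypothetical names introduced for illustration, and scalars stand in for tensors.

```python
import math


def gelu_tanh(x):
    """Tanh approximation, the same formula as the fallback branch of gelu()."""
    cdf = 0.5 * (1.0 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))
    return x * cdf


def gelu_exact(x):
    """Reference definition: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


samples = [-3.0, -1.0, 0.0, 0.5, 1.0, 3.0]
# Largest absolute difference between the two formulas on the sample points.
max_gap = max(abs(gelu_tanh(v) - gelu_exact(v)) for v in samples)
```

On these sample points the two formulas differ by less than 1e-3, so switching between the fused operator and the tanh formula should not change the model definition in any material way, assuming the fused operator computes the same function.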
3. Use npu_ops.dropout() in place of tf.nn.dropout(); npu_ops.dropout() is the dropout operator provided for the Ascend 910 AI processor.
In "BERT/modeling.py", modify dropout() as follows:
def dropout(input_tensor, dropout_prob):
    """Perform dropout.

    Args:
      input_tensor: input tensor.
      dropout_prob: float, the probability of dropping out a value.
    Returns:
      tensor with dropout applied.
    """
    ############## npu modify begin #############
    # When npu_bert_debug is True, dropout is disabled.
    if tf.flags.FLAGS.npu_bert_debug:
        return input_tensor

    # When dropout_prob is None or 0.0, dropout is skipped.
    if dropout_prob is None or dropout_prob == 0.0:
        return input_tensor

    # When npu_bert_npu_dropout is set, use npu_ops.dropout(),
    # which is adapted to the Ascend 910 AI processor.
    if tf.flags.FLAGS.npu_bert_npu_dropout:
        output = npu_ops.dropout(input_tensor, 1.0 - dropout_prob)
    else:
        output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
    return output
    ############## npu modify end #############

    # Original code:
    # if dropout_prob is None or dropout_prob == 0.0:
    #     return input_tensor
    #
    # output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
    # return output
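Both branches above pass 1.0 - dropout_prob, that is, a keep probability rather than a drop probability. The sketch below mimics the control flow of the modified dropout() on a flat Python list. It is an illustration under stated assumptions (inverted dropout with rescaling by 1/keep_prob, a hypothetical debug argument in place of the npu_bert_debug flag), not the NPU operator.

```python
import random


def dropout(values, dropout_prob, rng, debug=False):
    """Inverted dropout on a flat list of floats.

    A debug switch or a drop probability of None/0.0 returns the input
    untouched, mirroring the early returns in the modified dropout().
    """
    if debug or dropout_prob is None or dropout_prob == 0.0:
        return values
    keep_prob = 1.0 - dropout_prob
    # Kept elements are scaled by 1/keep_prob so the expected value is unchanged.
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)        # fixed seed, only to make the sketch repeatable
x = [1.0] * 10000
y = dropout(x, 0.1, rng)
mean_after = sum(y) / len(y)  # stays close to 1.0 because of the rescaling
```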
Optimizer changes
1. "BERT/optimization.py":
from horovod.tensorflow.compression import Compression
2. Add the following imports to "BERT/optimization.py":
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer
from npu_bridge.estimator.npu import npu_loss_scale_optimizer as lso
from npu_bridge.estimator.npu import npu_loss_scale_manager as lsm_lib
3. In "BERT/optimization.py", modify create_optimizer() as follows.
- Remove the init_loss_scale parameter from create_optimizer():
############## npu modify begin #############
# create_optimizer() no longer takes init_loss_scale.
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
                     manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
                     optimizer_type="adam", allreduce_post_accumulation=False):
############## npu modify end #############


# Original code:
# def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
#                      manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
#                      optimizer_type="adam", allreduce_post_accumulation=False,
#                      init_loss_scale=2 ** 32):

- Initialize learning_rate from init_lr:
############## npu modify begin #############
# learning_rate is built from init_lr.
learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
############## npu modify end #############

# Original code:
# learning_rate = tf.constant(value=adjusted_init_lr, shape=[], dtype=tf.float32)

- Change the epsilon value passed to AdamWeightDecayOptimizer:
############## npu modify begin #############
# epsilon changes from 1e-6 to 1e-4.
optimizer = AdamWeightDecayOptimizer(
    learning_rate=learning_rate,
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-4,
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
############## npu modify end #############

# Original code:
# optimizer = AdamWeightDecayOptimizer(
#     learning_rate=learning_rate,
#     weight_decay_rate=0.01,
#     beta_1=0.9,
#     beta_2=0.999,
#     epsilon=1e-6,
#     exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

- Wrap the optimizer with NPUDistributedOptimizer, the NPU distributed optimizer:
############## npu modify begin #############
# Use NPUDistributedOptimizer, the NPU distributed optimizer.
optimizer = NPUDistributedOptimizer(optimizer)

# Use NPULossScaleOptimizer, the NPU loss scale optimizer.
if tf.flags.FLAGS.npu_bert_loss_scale not in [None, -1]:
    opt_tmp = optimizer
    if tf.flags.FLAGS.npu_bert_loss_scale == 0:
        loss_scale_manager = lsm_lib.ExponentialUpdateLossScaleManager(
            init_loss_scale=tf.flags.FLAGS.init_loss_scale_value,
            incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2,
            decr_ratio=0.5)
    elif tf.flags.FLAGS.npu_bert_loss_scale >= 1:
        loss_scale_manager = lsm_lib.FixedLossScaleManager(
            loss_scale=tf.flags.FLAGS.npu_bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % tf.flags.FLAGS.npu_bert_loss_scale)

    # NPULossScaleOptimizer applies loss scaling on the NPU.
    # Loss scaling compensates for the limited numeric range of float16.
    optimizer = lso.NPULossScaleOptimizer(opt_tmp, loss_scale_manager,
                                          is_distributed=tf.flags.FLAGS.distributed)
############## npu modify end #############

# Original code:
# if use_fp16:
#     loss_scaler = tf.train.experimental.DynamicLossScale(
#         initial_loss_scale=init_loss_scale, increment_period=1000, multiplier=2.0)
#     optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer, loss_scaler)
#     loss_scale_value = tf.identity(loss_scaler(), name="loss_scale")
# if manual_fp16:
#     loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
#         init_loss_scale=init_loss_scale,
#         incr_every_n_steps=1000,
#         decr_every_n_nan_or_inf=2,
#         decr_ratio=0.5)
#     optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(optimizer, loss_scale_manager)
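To make the dynamic loss scale parameters concrete (incr_every_n_steps, decr_every_n_nan_or_inf, decr_ratio), here is a small pure-Python sketch. ExponentialLossScale is a hypothetical class modelled on the behaviour documented for tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager; the incr_ratio default of 2.0 is taken from that TensorFlow class, and the NPU manager in npu_bridge may differ in detail.

```python
class ExponentialLossScale:
    """Toy dynamic loss scale with the same knobs as the manager in the text."""

    def __init__(self, init_loss_scale, incr_every_n_steps=1000,
                 decr_every_n_nan_or_inf=2, incr_ratio=2.0, decr_ratio=0.5):
        self.loss_scale = float(init_loss_scale)
        self.incr_every_n_steps = incr_every_n_steps
        self.decr_every_n_nan_or_inf = decr_every_n_nan_or_inf
        self.incr_ratio = incr_ratio
        self.decr_ratio = decr_ratio
        self.good_steps = 0
        self.bad_steps = 0

    def update(self, grads_are_finite):
        """Record one training step and adjust the scale if a threshold is hit."""
        if grads_are_finite:
            if self.good_steps + 1 >= self.incr_every_n_steps:
                self.loss_scale *= self.incr_ratio
                self.good_steps = 0
                self.bad_steps = 0
            else:
                self.good_steps += 1
        else:
            if self.bad_steps + 1 >= self.decr_every_n_nan_or_inf:
                # The scale never drops below 1.
                self.loss_scale = max(1.0, self.loss_scale * self.decr_ratio)
                self.good_steps = 0
                self.bad_steps = 0
            else:
                self.bad_steps += 1
                self.good_steps = 0
        return self.loss_scale


# incr_every_n_steps is shortened to 3 here so the growth is visible quickly.
scaler = ExponentialLossScale(init_loss_scale=2 ** 32, incr_every_n_steps=3)
history = [scaler.update(ok) for ok in (False, False, True, True, True)]
```

Two overflowing steps halve the scale, and three finite steps in a row double it again. With the values used in the text (1000 and 2), growth is much slower than shrinkage.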
- Modify the condition under which tf.reduce_all is evaluated:
grads_and_vars_and_accums = [(gv[0], gv[1], accum_vars[i])
                             for i, gv in enumerate(grads_and_vars) if gv[0] is not None]
grads, tvars, accum_vars = list(zip(*grads_and_vars_and_accums))

############## npu modify begin #############
# tf.reduce_all is evaluated only when loss scaling is enabled.
all_are_finite = tf.reduce_all(
    [tf.reduce_all(tf.is_finite(g)) for g in grads]
) if (tf.flags.FLAGS.npu_bert_loss_scale not in [None, -1]) and (
    manual_fp16 or use_fp16) else tf.constant(True, dtype=tf.bool)
############## npu modify end #############

# Original code:
# all_are_finite = tf.reduce_all(
#     [tf.reduce_all(tf.is_finite(g)) for g in grads]
# ) if manual_fp16 or use_fp16 else tf.constant(True, dtype=tf.bool)

- Add a check on the "npu_bert_clip_by_global_norm" flag.
In create_optimizer(), modify the code as follows:
############## npu modify begin #############
# Check the npu_bert_clip_by_global_norm flag.
if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
    (clipped_grads, _) = tf.clip_by_global_norm(
        grads, clip_norm=1.0,
        use_norm=tf.cond(
            all_are_finite,
            lambda: tf.global_norm(grads),
            lambda: tf.constant(1.0)))
else:
    with tf.name_scope("clip_grads"):
        clipped_grads = [
            (tf.clip_by_norm(grad, clip_norm=1.0))
            if grad is not None else grad
            for grad in grads
        ]
############## npu modify end #############

# Original code:
# (clipped_grads, _) = tf.clip_by_global_norm(
#     grads, clip_norm=1.0,
#     use_norm=tf.cond(
#         all_are_finite,
#         lambda: tf.global_norm(grads),
#         lambda: tf.constant(1.0)))
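The two branches selected by npu_bert_clip_by_global_norm do not produce the same gradients. The following pure-Python sketch contrasts them on a toy example; the three helper functions are hypothetical stand-ins for tf.global_norm, tf.clip_by_global_norm and tf.clip_by_norm, with lists in place of tensors.

```python
import math


def global_norm(grads):
    """L2 norm over all gradient vectors taken together."""
    return math.sqrt(sum(v * v for g in grads for v in g))


def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by one shared factor, preserving relative sizes."""
    scale = clip_norm / max(global_norm(grads), clip_norm)
    return [[v * scale for v in g] for g in grads]


def clip_each_by_norm(grads, clip_norm):
    """Clip each gradient independently against its own L2 norm."""
    clipped = []
    for g in grads:
        norm = math.sqrt(sum(v * v for v in g))
        scale = clip_norm / max(norm, clip_norm)
        clipped.append([v * scale for v in g])
    return clipped


grads = [[3.0, 4.0], [0.3, 0.4]]  # individual norms are 5.0 and 0.5
by_global = clip_by_global_norm(grads, 1.0)
by_tensor = clip_each_by_norm(grads, 1.0)
```

Global-norm clipping shrinks both gradients by the same factor, so the small gradient shrinks too. Per-tensor clipping leaves the small gradient untouched and only rescales the large one, which changes the direction of the combined update.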

- Modify new_global_step:
############## npu modify begin #############
# Updated definition of new_global_step.
new_global_step = tf.cond(
    tf.math.logical_and(
        update_step,
        tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)), tf.bool)),
    lambda: global_step + 1,
    lambda: global_step)
############## npu modify end #############

# Original code:
# new_global_step = tf.cond(
#     tf.math.logical_and(
#         update_step,
#         tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)),
#                 tf.bool) if hvd is not None else batch_finite),
#     lambda: global_step + 1,
#     lambda: global_step)

- Add a check on the "npu_bert_clip_by_global_norm" flag:
grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
grads, tvars = list(zip(*grads_and_vars))

############## npu modify begin #############
# Check the "npu_bert_clip_by_global_norm" flag.
if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
    all_are_finite = tf.reduce_all(
        [tf.reduce_all(tf.is_finite(g)) for g in grads]
    ) if (tf.flags.FLAGS.npu_bert_loss_scale not in [None, -1]) and (
        use_fp16 or manual_fp16) else tf.constant(True, dtype=tf.bool)
############## npu modify end #############

# Original code:
# all_are_finite = tf.reduce_all(
#     [tf.reduce_all(tf.is_finite(g)) for g in grads]
# ) if use_fp16 or manual_fp16 else tf.constant(True, dtype=tf.bool)

- Update "global_step":
############## npu modify begin #############
new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
new_global_step = tf.identity(new_global_step, name='step_update')
train_op = tf.group(train_op, [global_step.assign(new_global_step)])
############## npu modify end ##############

return train_op

4. Modify AdamWeightDecayOptimizer.
- In "BERT/optimization.py", change the default epsilon in AdamWeightDecayOptimizer.__init__():
############## npu modify begin #############
# The default value of epsilon changes from 1e-6 to 1e-4.
def __init__(self,
             learning_rate,
             weight_decay_rate=0.0,
             beta_1=0.9,
             beta_2=0.999,
             epsilon=1e-4,
             exclude_from_weight_decay=None,
             name="AdamWeightDecayOptimizer"):
############## npu modify end #############

# Original code:
# def __init__(self,
#              learning_rate,
#              weight_decay_rate=0.0,
#              beta_1=0.9,
#              beta_2=0.999,
#              epsilon=1e-6,
#              exclude_from_weight_decay=None,
#              name="AdamWeightDecayOptimizer"):

- In "BERT/optimization.py", in AdamWeightDecayOptimizer.apply_gradients():
############## npu modify begin #############
new_global_step = global_step + 1
new_global_step = tf.identity(new_global_step, name='step_update')
assignments.extend([global_step.assign(new_global_step)])
############## npu modify end #############

return tf.group(*assignments, name=name)
5. Modify LAMBOptimizer.
- In "BERT/optimization.py", in LAMBOptimizer.__init__():
self.epsilon = epsilon
self.exclude_from_weight_decay = exclude_from_weight_decay

############## npu modify begin #############
# Add a steps attribute to LAMBOptimizer.
self.steps = 0
############## npu modify end #############
- In "BERT/optimization.py", in LAMBOptimizer.apply_gradients():
Set the default value of "global_step" to None:
############## npu modify begin #############
# global_step defaults to None.
def apply_gradients(self, grads_and_vars, global_step=None, name=None, manual_fp16=False):
############## npu modify end #############

# Original code:
# def apply_gradients(self, grads_and_vars, global_step, name=None,
#                     manual_fp16=False):

Code that casts "global_step" to float32:
############## npu modify begin #############
steps = tf.cast(global_step, tf.float32)
############## npu modify end #############

Bias correction terms:
############## npu modify begin #############
self.steps += 1
beta1_correction = (1 - self.beta_1 ** self.steps)
beta2_correction = (1 - self.beta_2 ** self.steps)
############## npu modify end #############

# Original code:
# beta1_correction = (1 - self.beta_1 ** steps)
# beta2_correction = (1 - self.beta_2 ** steps)
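The bias-correction terms above depend only on the decay rates and the step count. The sketch below, with a hypothetical helper name bias_corrections, evaluates them for the first step and for a late step using the beta values that appear elsewhere in this section (0.9 and 0.999). It is an illustration of the formula, not of the optimizer itself.

```python
def bias_corrections(beta_1, beta_2, step):
    """Return (1 - beta_1**step, 1 - beta_2**step) for a 1-based step count."""
    return 1 - beta_1 ** step, 1 - beta_2 ** step


beta_1, beta_2 = 0.9, 0.999
first = bias_corrections(beta_1, beta_2, 1)      # roughly (0.1, 0.001)
late = bias_corrections(beta_1, beta_2, 10000)   # both terms approach 1.0
at_zero = bias_corrections(beta_1, beta_2, 0)    # (0, 0): unusable as a divisor
```

At step 0 both terms are zero, which is why the counter has to be incremented before the corrections are computed.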
Assign new_global_step to global_step:
############## npu modify begin #############
new_global_step = global_step + 1
new_global_step = tf.identity(new_global_step, name='step_update')
assignments.extend([global_step.assign(new_global_step)])
############## npu modify end #############

return tf.group(*assignments, name=name)

Log hook changes
In "BERT/utils/utils.py", modify LogEvalRunHook as follows:
############## npu modify begin #############
class LogEvalRunHook(tf.estimator.SessionRunHook):
    def __init__(self, global_batch_size, hvd_rank=-1):
        self.global_batch_size = global_batch_size
        self.hvd_rank = hvd_rank
        self.total_time = 0.0
        self.count = 0
        self.skipped = 0
        self.time_list = []

    def before_run(self, run_context):
        self.t0 = time.time()

    def after_run(self, run_context, run_values):
        elapsed_secs = time.time() - self.t0
        self.count += 1

        # Removing first 2 (arbitrary) number of startup iterations from perf evaluations
        if self.count <= 2:
            print("Skipping time record for ", self.count, " due to overhead")
            self.skipped += 1
        else:
            self.time_list.append(elapsed_secs)
            self.total_time += elapsed_secs
############## npu modify end #############

# Original code:
# class LogEvalRunHook(tf.estimator.SessionRunHook):
#     def __init__(self, global_batch_size, hvd_rank=-1):
#         self.global_batch_size = global_batch_size
#         self.hvd_rank = hvd_rank
#         self.count = 0
#         self.time_list = []
#
#     def before_run(self, run_context):
#         self.t0 = time.time()
#
#     def after_run(self, run_context, run_values):
#         elapsed_secs = time.time() - self.t0
#         self.count += 1
#         self.time_list.append(elapsed_secs)
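The modified hook excludes the first two iterations from the timing statistics. The same bookkeeping can be exercised without a TensorFlow session; StepTimer below is a hypothetical stand-in that drops the SessionRunHook base class and the run_context arguments, and makes the warm-up count a parameter.

```python
import time


class StepTimer:
    """Stand-in for LogEvalRunHook without the TensorFlow base class."""

    def __init__(self, warmup_steps=2):
        self.warmup_steps = warmup_steps
        self.count = 0
        self.skipped = 0
        self.total_time = 0.0
        self.time_list = []

    def before_run(self):
        self.t0 = time.time()

    def after_run(self):
        elapsed_secs = time.time() - self.t0
        self.count += 1
        if self.count <= self.warmup_steps:
            self.skipped += 1  # start-up iterations are not timed
        else:
            self.time_list.append(elapsed_secs)
            self.total_time += elapsed_secs


timer = StepTimer()
for _ in range(5):
    timer.before_run()
    timer.after_run()
```

After five iterations only three samples are recorded, so an average computed as total_time divided by the number of recorded samples is not skewed by start-up overhead.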

9.2.6 Run Configuration

The run configuration is set in main() of "BERT/run_pretraining.py"; see Table 9-11.

Table 9-11 Run configuration function

Function   Description                                 Location
main()     Main entry point of the training script.    "BERT/run_pretraining.py"

1. Add the following imports to "BERT/run_pretraining.py":
from npu_bridge.estimator.npu.npu_config import *
from npu_bridge.estimator.npu.npu_estimator import *
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator
2. Replace RunConfig with NPURunConfig.
In "BERT/run_pretraining.py", modify main() as follows:
############## npu modify begin #############
run_config = NPURunConfig(
    model_dir=FLAGS.output_dir,
    save_summary_steps=0,
    session_config=config,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
    # This variable controls how often estimator reports examples/sec.
    # Default value is every 100 steps.
    # When --report_loss is True, we set to very large value to prevent
    # default info reporting from estimator.
    # Ideally we should set it to None, but that does not work.
    log_step_count_steps=1 if FLAGS.report_loss else 100,
    enable_data_pre_proc=FLAGS.npu_bert_use_tdt,
    iterations_per_loop=FLAGS.iterations_per_loop,
    hcom_parallel=FLAGS.hcom_parallel)
############## npu modify end #############

# Original code:
# run_config = tf.estimator.RunConfig(
#     model_dir=FLAGS.output_dir,
#     session_config=config,
#     save_checkpoints_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
#     save_summary_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
#     # This variable controls how often estimator reports examples/sec.
#     # Default value is every 100 steps.
#     # When --report_loss is True, we set to very large value to prevent
#     # default info reporting from estimator.
#     # Ideally we should set it to None, but that does not work.
#     log_step_count_steps=10000 if FLAGS.report_loss else 100)

In TensorFlow, the run configuration is specified through RunConfig; here RunConfig is replaced by NPURunConfig.
3. Create an NPUEstimator, using NPUEstimator in place of tf.estimator.Estimator.
In "BERT/run_pretraining.py", modify main() as follows:
############## npu modify begin #############
estimator = NPUEstimator(
    model_fn=model_fn,
    config=run_config,
    job_start_file=FLAGS.npu_bert_job_start_file)
############## npu modify end #############

# Original code:
# estimator = tf.estimator.Estimator(
#     model_fn=model_fn,
#     config=run_config)
Creating the configs directory
1. Create the configs directory.
2. Create the file 1p.json in the configs directory.
3. Write the NPU configuration to 1p.json:
{
    "board_id": "0x002f",
    "chip_info": "910",
    "deploy_mode": "lab",
    "group_count": "1",
    "group_list": [
        {
            "device_num": "1",
            "server_num": "1",
            "group_name": "",
            "instance_count": "1",
            "instance_list": [
                {
                    "devices": [
                        {
                            "device_id": "0",
                            "device_ip": "192.168.100.101"
                        }
                    ],
                    "rank_id": "0",
                    "server_id": "172.17.1.120"
                }
            ]
        }
    ],
    "para_plane_nic_location": "device",
    "para_plane_nic_name": [
        "eth0"
    ],
    "para_plane_nic_num": "1",
    "status": "completed"
}
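Because every count in 1p.json is stored as a string, a quick consistency check before launching a job can catch a mismatch between device_num and the devices actually listed. The sketch below uses only the standard json module on an abbreviated copy of the file; the check itself is an illustration and is not part of the toolkit.

```python
import json

# Abbreviated copy of the 1p.json content shown above.
RANK_TABLE = """
{
  "chip_info": "910",
  "group_count": "1",
  "group_list": [
    {
      "device_num": "1",
      "server_num": "1",
      "instance_count": "1",
      "instance_list": [
        {
          "devices": [{"device_id": "0", "device_ip": "192.168.100.101"}],
          "rank_id": "0",
          "server_id": "172.17.1.120"
        }
      ]
    }
  ],
  "status": "completed"
}
"""

table = json.loads(RANK_TABLE)
# Flatten every device entry across groups and instances.
devices = [dev
           for group in table["group_list"]
           for inst in group["instance_list"]
           for dev in inst["devices"]]
# Counts are strings in the file, so convert before comparing.
declared = sum(int(group["device_num"]) for group in table["group_list"])
consistent = declared == len(devices)
```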
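4. Create the file bert_base_config.json in the configs directory.
5. Write the BERT model configuration to bert_base_config.json:
{
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30522
}
The parameters are described in Table 9-12.

Table 9-12 BERT configuration parameters

Parameter                      Description
attention_probs_dropout_prob   Dropout probability applied to the attention tensor.
hidden_act                     Activation function used by BERT; gelu here.
hidden_dropout_prob            Dropout probability applied to the hidden tensor.
hidden_size                    Hidden size, typically 768 or 1024.
initializer_range              Range used by the weight initializer.
intermediate_size              Size of the intermediate layer after attention; the tensor goes from hidden_size to intermediate_size and back to hidden_size.
max_position_embeddings        Maximum sequence length supported by the BERT model.
num_attention_heads            Number of attention heads; 12 here.
num_hidden_layers              Number of transformer layers, typically 6 or 12.
type_vocab_size                Vocabulary size of the token type ids.
vocab_size                     Size of the vocabulary (values cited: 4000, 5000).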
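Some of the configuration values are tied to each other: the hidden size must be divisible by the number of attention heads, and in the configuration above the intermediate size is four times the hidden size. A short standard-library sketch that loads the same JSON and derives these quantities (the variable names are illustrative):

```python
import json

config = json.loads("""
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
""")

# Each attention head works on an equal slice of the hidden vector.
divisible = config["hidden_size"] % config["num_attention_heads"] == 0
head_size = config["hidden_size"] // config["num_attention_heads"]
# Expansion factor of the feed-forward layer relative to the hidden size.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]
```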

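9.2.7 Training Execution

The functions involved are listed in Table 9-13.

Table 9-13 Training execution function

Function   Description                                 Location
main()     Main entry point of the training script.    "BERT/run_pretraining.py"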

1. Modify the flags definitions.
In "BERT/run_pretraining.py", modify the flags as follows.
- Set the following environment variables:
############## npu modify begin #############
os.environ['WHICH_OP'] = 'GEOP'
os.environ['NEW_GE_FE_ID'] = '1'
os.environ['GE_AICPU_FLAG'] = '1'
os.environ['GE_USE_STATIC_MEMORY'] = '1'
os.environ['OPTION_EXEC_HCCL_FLAG'] = '1'
os.environ['HCCL_CONNECT_TIMEOUT'] = '600'
############## npu modify end #############
flags = tf.flags
- Change the default values of the following flags:
############## npu modify begin #############
flags.DEFINE_string(
    "input_files_dir", "./data",
    "Directory with input files, comma separated or single directory.")
flags.DEFINE_string(
    "output_dir", "./models",
    "The output directory where the model checkpoints will be written.")
############## npu modify end #############

# Original code:
# flags.DEFINE_string(
#     "input_files_dir", None,
#     "Directory with input files, comma separated or single directory.")
#
# flags.DEFINE_string(
#     "output_dir", None,
#     "The output directory where the model checkpoints will be written.")
- dllog_path flag:
flags.DEFINE_string(
    "dllog_path", "/results/bert_dllog.json",
    "filename where dllogger writes to")
- Change the default values of the following flags:
############## npu modify begin #############
flags.DEFINE_integer(
    "max_seq_length", 128,
    "The maximum total input sequence length after WordPiece tokenization. "
    "Sequences longer than this will be truncated, and sequences shorter "
    "than this will be padded. Must match data generation.")
flags.DEFINE_integer(
    "max_predictions_per_seq", 20,
    "Maximum number of masked LM predictions per sequence. "
    "Must match data generation.")
flags.DEFINE_bool("do_train", True, "Whether to run training.")
flags.DEFINE_integer("train_batch_size", 64, "Total batch size for training.")
flags.DEFINE_float("learning_rate", 1e-4, "The initial learning rate for Adam.")
flags.DEFINE_integer("num_train_steps", 1000000, "Number of training steps.")
flags.DEFINE_integer("save_checkpoints_steps", 10000,
                     "How often to save the model checkpoint.")
flags.DEFINE_integer("display_loss_steps", 10, "How often to print loss")
flags.DEFINE_bool("manual_fp16", True, "Whether to use fp32 or fp16 arithmetic on GPU. "
                  "Manual casting is done instead of using AMP")
############## npu modify end #############

# Original code:
# flags.DEFINE_integer(
#     "max_seq_length", 512,
#     "The maximum total input sequence length after WordPiece tokenization. "
#     "Sequences longer than this will be truncated, and sequences shorter "
#     "than this will be padded. Must match data generation.")
#
# flags.DEFINE_integer(
#     "max_predictions_per_seq", 80,
#     "Maximum number of masked LM predictions per sequence. "
#     "Must match data generation.")
#
# flags.DEFINE_bool("do_train", False, "Whether to run training.")
#
# flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
#
# flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
#
# flags.DEFINE_integer("num_train_steps", 100000, "Number of training steps.")
#
# flags.DEFINE_integer("save_checkpoints_steps", 10000,
#                      "How often to save the model checkpoint.")
#
# flags.DEFINE_integer("display_loss_steps", 1,
#                      "How often to print loss")
#
# flags.DEFINE_bool("manual_fp16", False, "Whether to use fp32 or fp16 arithmetic on GPU. "
#                   "Manual casting is done instead of using AMP")

­ amp
flags.DEFINE_bool("amp", True, "Whether to enable AMP ops. When false, uses TF32 on A100 and FP

- Change the default value of the use_xla flag:
############## npu modify begin #############
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
############## npu modify end #############

# Original code:
# flags.DEFINE_bool("use_xla", True, "Whether to enable XLA JIT compilation.")
- Add the following flags:
############## npu modify begin #############
flags.DEFINE_bool("use_fp16", False, "Whether to enable AMP ops.")

flags.DEFINE_bool("use_fp16_cls", True, "Whether to use fp16 in cls and pooler.")

flags.DEFINE_bool("distributed", True, "Whether to use multi-npu")

flags.DEFINE_bool('npu_bert_fused_gelu', True, 'Whether to use npu defined gelu op')

flags.DEFINE_bool('npu_bert_debug', False, 'If True, dropout and shuffle is disabled.')

flags.DEFINE_bool('npu_bert_use_tdt', True, 'Whether to use tdt as dataset')

flags.DEFINE_string("npu_bert_job_start_file", None, "CSA job start file path.")

flags.DEFINE_integer("npu_bert_loss_scale", 0, "Whether to use loss scale, -1 is disable, 0 is dynamic loss scale, >=1 is static loss scale")

flags.DEFINE_bool("npu_bert_clip_by_global_norm", False, "Use clip_by_global_norm if True, or use clip_by_norm for each gradient")

flags.DEFINE_bool('npu_bert_npu_dropout', True, 'Whether to use npu defined dropout op')

flags.DEFINE_bool('npu_gather', True, 'Whether to use gather_npu whose backward propagation avoids IndexedSlices')

flags.DEFINE_bool('hcom_parallel', True, 'Whether to use parallel allreduce')
############## npu modify end #############
- Replace the init_loss_scale flag with init_loss_scale_value:
############## npu modify begin #############
flags.DEFINE_integer('init_loss_scale_value', 2**32, 'Initial loss scale value for loss scale optimizer')
############## npu modify end #############

# Original code:
# flags.DEFINE_integer("init_loss_scale", 2**32, "Initial value of loss scale if mixed precision training")
2. Modify the _LogSessionRunHook class.
In "BERT/run_pretraining.py", modify _LogSessionRunHook as follows.
- Modify _LogSessionRunHook.__init__():
############## npu modify begin #############
def __init__(self, global_batch_size, num_accumulation_steps, display_every=10, hvd_rank=-1):
    self.global_batch_size = global_batch_size
    self.display_every = display_every
    self.hvd_rank = hvd_rank
    self.num_accumulation_steps = num_accumulation_steps
############## npu modify end #############

# Original code:
# def __init__(self, global_batch_size, num_accumulation_steps, dllogging, display_every=10,
#              save_ckpt_steps=1000, report_loss=True, hvd_rank=-1):
#     self.global_batch_size = global_batch_size
#     self.display_every = display_every
#     self.save_ckpt_steps = save_ckpt_steps
#     self.hvd_rank = hvd_rank
#     self.num_accumulation_steps = num_accumulation_steps
#     self.dllogging = dllogging
#     self.report_loss = report_loss

- Modify _LogSessionRunHook.after_create_session():
############## npu modify begin #############
def after_create_session(self, session, coord):
    self.elapsed_secs = 0.
    self.count = 0
    self.all_count = 0
    self.avg_loss = 0.0
############## npu modify end #############

# Original code:
# def after_create_session(self, session, coord):
#     self.elapsed_secs = 0.0  # elapsed seconds between every print
#     self.count = 0  # number of global steps between every print
#     self.all_count = 0  # number of steps (including accumulation) between every print
#     self.loss = 0.0  # accumulation of loss in each step between every print
#
#     self.total_time = 0.0  # total time taken to train (excluding warmup + ckpt saving steps)
#     self.step_time = 0.0  # time taken per step
#     self.init_global_step = session.run(tf.train.get_global_step())  # training starts at init_global_step
#     self.skipped = 0
- Modify _LogSessionRunHook.before_run():
Modify the if conditions; tf.estimator.SessionRunArgs is used (rather than tf.train.SessionRunArgs) to adapt to the Ascend 910 AI processor.
############## npu modify begin #############
def before_run(self, run_context):
    self.t0 = time.time()
    if self.num_accumulation_steps <= 1:
        if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or FLAGS.use_fp16):
            return tf.estimator.SessionRunArgs(
                fetches=['global_step:0', 'total_loss:0',
                         'learning_rate:0', 'nsp_loss:0',
                         'mlm_loss:0', 'loss_scale:0', 'apply_grads/All:0'])
        else:
            return tf.estimator.SessionRunArgs(
                fetches=['global_step:0', 'total_loss:0',
                         'learning_rate:0', 'nsp_loss:0',
                         'mlm_loss:0'])
    else:
        if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or FLAGS.use_fp16):
            return tf.estimator.SessionRunArgs(
                fetches=['global_step:0', 'update_step:0', 'total_loss:0',
                         'learning_rate:0', 'nsp_loss:0',
                         'mlm_loss:0', 'loss_scale:0'])
        else:
            return tf.estimator.SessionRunArgs(
                fetches=['global_step:0', 'update_step:0', 'total_loss:0',
                         'learning_rate:0', 'nsp_loss:0',
                         'mlm_loss:0'])
############## npu modify end #############

# Original code:
# def before_run(self, run_context):
#     self.t0 = time.time()
#     if self.num_accumulation_steps <= 1:
#         if FLAGS.manual_fp16 or FLAGS.amp:
#             return tf.estimator.SessionRunArgs(
#                 fetches=['step_update:0', 'total_loss:0',
#                          'learning_rate:0', 'nsp_loss:0',
#                          'mlm_loss:0', 'loss_scale:0'])
#         else:
#             return tf.estimator.SessionRunArgs(
#                 fetches=['step_update:0', 'total_loss:0',
#                          'learning_rate:0', 'nsp_loss:0',
#                          'mlm_loss:0'])
#     else:
#         if FLAGS.manual_fp16 or FLAGS.amp:
#             return tf.estimator.SessionRunArgs(
#                 fetches=['step_update:0', 'update_step:0', 'total_loss:0',
#                          'learning_rate:0', 'nsp_loss:0',
#                          'mlm_loss:0', 'loss_scale:0'])
#         else:
#             return tf.estimator.SessionRunArgs(
#                 fetches=['step_update:0', 'update_step:0', 'total_loss:0',
#                          'learning_rate:0', 'nsp_loss:0',
#                          'mlm_loss:0'])
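The four branches of the modified before_run() differ only in which tensor names they request. The following sketch condenses that selection into one function so the four combinations can be compared side by side; select_fetches and its arguments are hypothetical names, and the function returns plain lists rather than tf.estimator.SessionRunArgs objects.

```python
def select_fetches(num_accumulation_steps, loss_scale_mode, fp16_enabled):
    """Return the tensor names the modified before_run() requests.

    loss_scale_mode stands for the npu_bert_loss_scale flag (0 = dynamic),
    fp16_enabled for (manual_fp16 or use_fp16).
    """
    accumulating = num_accumulation_steps > 1
    fetches = ['global_step:0']
    if accumulating:
        fetches.append('update_step:0')
    fetches += ['total_loss:0', 'learning_rate:0', 'nsp_loss:0', 'mlm_loss:0']
    if loss_scale_mode == 0 and fp16_enabled:
        fetches.append('loss_scale:0')
        if not accumulating:
            # Only the non-accumulating branch also fetches the finite check.
            fetches.append('apply_grads/All:0')
    return fetches


no_accum_dynamic = select_fetches(1, 0, True)
accum_dynamic = select_fetches(4, 0, True)
accum_static = select_fetches(4, 128, True)
```

The order of the names matters, because after_run() unpacks run_values.results positionally.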

- after_run() method of the _LogSessionRunHook class. The after_run() method is as follows:
def after_run(self, run_context, run_values):
  self.elapsed_secs += time.time() - self.t0
  if self.num_accumulation_steps <= 1:
    if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or FLAGS.use_fp16):
      global_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler, custom_arg = run_values.results
    else:
      global_step, total_loss, lr, nsp_loss, mlm_loss = run_values.results
    update_step = True
  else:
    if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or FLAGS.use_fp16):
      global_step, update_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = run_values.results
    else:
      global_step, update_step, total_loss, lr, nsp_loss, mlm_loss = run_values.results
  print_step = global_step + 1  # One-based index for printing.
  self.avg_loss += total_loss
  self.all_count += 1
  if update_step:
    self.count += 1
    dt = self.elapsed_secs / self.count
    sent_per_sec = self.global_batch_size / dt * FLAGS.iterations_per_loop
    avg_loss_step = self.avg_loss / self.all_count
    if self.hvd_rank >= 0:
      if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or FLAGS.use_fp16):
        print('Rank = %2d :: Step = %6i Throughput = %11.1f MLM Loss = %10.4e NSP Loss = %10.4e Loss = %9.6f Average Loss = %9.6f LR = %6.4e Loss scale = %6.4e isFinite = %6i' %
              (self.hvd_rank, print_step, sent_per_sec, mlm_loss, nsp_loss, total_loss, avg_loss_step, lr, loss_scaler, custom_arg), flush=True)
      else:
        print('Rank = %2d :: Step = %6i Throughput = %11.1f MLM Loss = %10.4e NSP Loss = %10.4e Loss = %9.6f Average Loss = %9.6f LR = %6.4e' %
              (self.hvd_rank, print_step, sent_per_sec, mlm_loss, nsp_loss, total_loss, avg_loss_step, lr), flush=True)
    else:
      if (tf.flags.FLAGS.npu_bert_loss_scale == 0) and (FLAGS.manual_fp16 or FLAGS.use_fp16):
        print('Step = %6i Throughput = %11.1f MLM Loss = %10.4e NSP Loss = %10.4e Loss = %9.6f Average Loss = %9.6f LR = %6.4e Loss scale = %6.4e isFinite = %6i' %
              (print_step, sent_per_sec, mlm_loss, nsp_loss, total_loss, avg_loss_step, lr, loss_scaler, custom_arg), flush=True)
      else:
        print('Step = %6i Throughput = %11.1f MLM Loss = %10.4e NSP Loss = %10.4e Loss = %9.6f Average Loss = %9.6f LR = %6.4e' %
              (print_step, sent_per_sec, mlm_loss, nsp_loss, total_loss, avg_loss_step, lr), flush=True)
    self.elapsed_secs = 0.
    self.count = 0
    self.avg_loss = 0.0
    self.all_count = 0
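The throughput value printed by after_run() is derived from the elapsed time per update, the global batch size, and iterations_per_loop. The following is a minimal sketch of that arithmetic only; the helper name sentences_per_second and the input values are illustrative assumptions, not part of run_pretraining.py.

```python
def sentences_per_second(elapsed_secs, update_count, global_batch_size,
                         iterations_per_loop):
    """Same arithmetic as `self.global_batch_size / dt * FLAGS.iterations_per_loop`."""
    if elapsed_secs <= 0 or update_count <= 0:
        raise ValueError("elapsed_secs and update_count must be positive")
    dt = elapsed_secs / update_count  # seconds per counted update
    return global_batch_size / dt * iterations_per_loop


# Illustrative values: one hook call that covered 100 iterations in 20 seconds.
throughput = sentences_per_second(20.0, 1, 128, 100)
print("Throughput = %11.1f" % throughput)
```

The multiplication by iterations_per_loop reflects that one hook invocation can cover several training iterations, so the time measured between two hook calls spans more than one batch.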
3. main function
In "BERT/run_pretraining.py", the main() function is changed as follows.
- Environment variable "TF_XLA_FLAGS":
############## npu modify begin #############
#
os.environ["TF_XLA_FLAGS"] = " --tf_xla_enable_lazy_compilation false"  # causes memory fragmentation for bert leading to OOM
############## npu modify end #############
- Flags:
############## npu modify begin #############
# flag
#
for name, value in FLAGS.__flags.items():
  print("name:", name, " ", FLAGS[name].value)
############## npu modify end #############
- utils.dllogger_class.dllogger_class():
############## npu modify begin #############
#
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
############## npu modify end #############
- use_fp16:
if not FLAGS.do_train and not FLAGS.do_eval:
  raise ValueError("At least one of `do_train` or `do_eval` must be True.")
############## npu modify begin #############
# use_fp16
if FLAGS.use_fp16:
  os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
############## npu modify end #############
- npu_gather (Ascend 910 AI Processor):
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
############## npu modify begin #############
# npu_gather
#
if FLAGS.npu_gather:
  if FLAGS.distributed and bert_config.num_hidden_layers == 24:
    from hccl.split.api import set_split_strategy_by_size
    set_split_strategy_by_size([10,10,10,10,15,15,15,15])
  if FLAGS.distributed and bert_config.num_hidden_layers == 12:
    from hccl.split.api import set_split_strategy_by_idx
    set_split_strategy_by_idx([8,56,104,152,200,205])
  if FLAGS.distributed and bert_config.num_hidden_layers == 6:
    from hccl.split.api import set_split_strategy_by_idx
    set_split_strategy_by_idx([8,40,72,104,109])
############## npu modify end #############
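The npu_gather branch picks a split strategy from the number of hidden layers. The selection logic can be sketched as a lookup table, which keeps the three cases in one place. This sketch does not import or call hccl; the names SPLIT_STRATEGIES and choose_split_strategy are hypothetical, and it assumes that all three layer-count checks sit inside the npu_gather condition.

```python
# Hypothetical lookup table; the values are copied from the excerpt above.
SPLIT_STRATEGIES = {
    24: ("by_size", [10, 10, 10, 10, 15, 15, 15, 15]),
    12: ("by_idx", [8, 56, 104, 152, 200, 205]),
    6: ("by_idx", [8, 40, 72, 104, 109]),
}


def choose_split_strategy(npu_gather, distributed, num_hidden_layers):
    """Return (mode, values), or None when no strategy applies."""
    if not (npu_gather and distributed):
        return None
    return SPLIT_STRATEGIES.get(num_hidden_layers)


print(choose_split_strategy(True, True, 12))
print(choose_split_strategy(True, False, 24))
```

In a real script the returned mode would decide whether set_split_strategy_by_size or set_split_strategy_by_idx is called with the returned values.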
- Input files:
############## npu modify begin #############
#
#
input_files.sort()
print("Input Files:", input_files)
############## npu modify end #############
if FLAGS.horovod and len(input_files) < hvd.size():
  raise ValueError("Input Files must be sharded")
- flags parameters "use_fp16" and "amp" (Ascend 910 AI Processor):
############## npu modify begin #############
# "use_fp16", "amp" (Ascend 910 AI Processor)
if FLAGS.use_fp16 and FLAGS.manual_fp16:
  raise ValueError("AMP and Manual Mixed Precision Training are both activated! Error")
############## npu modify end #############
# Original code:
# if FLAGS.amp and FLAGS.manual_fp16:
#   raise ValueError("AMP and Manual Mixed Precision Training are both activated! Error")
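The check above rejects a configuration in which automatic and manual mixed precision are both requested. A minimal standalone sketch of the same rule follows; the function name check_precision_flags and the returned mode labels are assumptions made for illustration.

```python
def check_precision_flags(use_fp16, manual_fp16):
    """Reject the conflicting combination, otherwise name the active mode."""
    if use_fp16 and manual_fp16:
        raise ValueError(
            "AMP and Manual Mixed Precision Training are both activated! Error")
    if manual_fp16:
        return "manual_fp16"
    if use_fp16:
        return "use_fp16"
    return "fp32"


print(check_precision_flags(False, True))
```

Running the check before any session configuration is built means a conflicting command line fails immediately rather than after graph construction.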
- Delete the check that uses the flags parameter "amp", to adapt to the Ascend 910 AI Processor:
if FLAGS.use_xla:
  config.graph_options.optimizer_options.global_jit_level = tf.compat.v1.OptimizerOptions.ON_1
  config.graph_options.rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.NO_MEM_OPT
  ############## npu modify begin #############
  #
  if FLAGS.amp:
    tf.enable_resource_variables()
  ############## npu modify end #############
- distributed and "RANK_SIZE" (Ascend 910 AI Processor):
run_config = NPURunConfig(
    model_dir=FLAGS.output_dir,
    save_summary_steps=0,
    session_config=config,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
    # This variable controls how often estimator reports examples/sec.
    # Default value is every 100 steps.
    # When --report_loss is True, we set to very large value to prevent
    # default info reporting from estimator.
    # Ideally we should set it to None, but that does not work.
    log_step_count_steps=1 if FLAGS.report_loss else 100,
    enable_data_pre_proc=FLAGS.npu_bert_use_tdt,
    iterations_per_loop=FLAGS.iterations_per_loop,
    hcom_parallel=FLAGS.hcom_parallel)
############## npu modify begin #############
#
if FLAGS.distributed:
  rank_size = int(os.getenv('RANK_SIZE'))
############## npu modify end #############
- model_fn_builder() and learning_rate:
############## npu modify begin #############
# model_fn_builder(), learning_rate
model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=FLAGS.num_train_steps,
    num_warmup_steps=FLAGS.num_warmup_steps,
    use_one_hot_embeddings=False,
    hvd=None if not FLAGS.horovod else hvd)
############## npu modify end #############
# Original code:
# model_fn = model_fn_builder(
#     bert_config=bert_config,
#     init_checkpoint=FLAGS.init_checkpoint,
#     learning_rate=FLAGS.learning_rate if not FLAGS.horovod else FLAGS.learning_rate*hvd.size(),
#     num_train_steps=FLAGS.num_train_steps,
#     num_warmup_steps=FLAGS.num_warmup_steps,
#     use_one_hot_embeddings=False,
#     hvd=None if not FLAGS.horovod else hvd)
- Training hooks:
model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=FLAGS.num_train_steps,
    num_warmup_steps=FLAGS.num_warmup_steps,
    use_one_hot_embeddings=False,
    hvd=None if not FLAGS.horovod else hvd)
############## npu modify begin #############
#
training_hooks = []
if FLAGS.report_loss:
  global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps if not FLAGS.distributed else FLAGS.train_batch_size * FLAGS.num_accumulation_steps * rank_size
  training_hooks.append(
      _LogSessionRunHook(global_batch_size, FLAGS.num_accumulation_steps,
                         FLAGS.display_loss_steps))
############## npu modify end #############
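The excerpts above read RANK_SIZE from the environment and scale the global batch size by it in distributed mode. Note that int(os.getenv('RANK_SIZE')) raises a TypeError when the variable is unset. The sketch below shows one defensive way to do the same computation; the helper names and the fallback to 1 are assumptions, not behaviour of the original script.

```python
import os


def read_rank_size(environ=None, default=1):
    """Read RANK_SIZE, falling back to `default` when unset (an assumption)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("RANK_SIZE")
    if raw is None or raw.strip() == "":
        return default
    rank_size = int(raw)
    if rank_size < 1:
        raise ValueError("RANK_SIZE must be >= 1, got %d" % rank_size)
    return rank_size


def global_batch_size(train_batch_size, num_accumulation_steps,
                      distributed, rank_size):
    """Same expression as in the excerpt, split into two steps."""
    size = train_batch_size * num_accumulation_steps
    return size * rank_size if distributed else size


rank_size = read_rank_size({"RANK_SIZE": "8"})
print(global_batch_size(128, 1, True, rank_size))
```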
- Horovod-based training hooks:
############## npu modify begin #############
#
training_hooks = []
if FLAGS.horovod and hvd.size() > 1:
  training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
if (not FLAGS.horovod or hvd.rank() == 0):
  global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps if not FLAGS.horovod else FLAGS.train_batch_size * FLAGS.num_accumulation_steps * hvd.size()
  training_hooks.append(_LogSessionRunHook(global_batch_size, FLAGS.num_accumulation_steps, dllogging, FLAGS.display_loss_steps, FLAGS.save_checkpoints_steps, FLAGS.report_loss))
############## npu modify end #############
- Training time:
############## npu modify begin #############
#
train_start_time = time.time()
############## npu modify end #############
estimator.train(input_fn=train_input_fn, hooks=training_hooks,
                max_steps=FLAGS.num_train_steps)
############## npu modify begin #############
#
train_time_elapsed = time.time() - train_start_time
############## npu modify end #############
- Training throughput statistics:
############## npu modify begin #############
#
if (not FLAGS.horovod or hvd.rank() == 0):
  train_time_wo_overhead = training_hooks[-1].total_time
  avg_sentences_per_second = FLAGS.num_train_steps * global_batch_size * 1.0 / train_time_elapsed
  ss_sentences_per_second = (FLAGS.num_train_steps - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
  tf.compat.v1.logging.info("-----------------------------")
  tf.compat.v1.logging.info("Total Training Time = %0.2f for Sentences = %d", train_time_elapsed,
                            FLAGS.num_train_steps * global_batch_size)
  tf.compat.v1.logging.info("Total Training Time W/O Overhead = %0.2f for Sentences = %d", train_time_wo_overhead,
                            (FLAGS.num_train_steps - training_hooks[-1].skipped) * global_batch_size)
  tf.compat.v1.logging.info("Throughput Average (sentences/sec) with overhead = %0.2f", avg_sentences_per_second)
  tf.compat.v1.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
  dllogging.logger.log(step=(), data={"throughput_train": ss_sentences_per_second}, verbosity=Verbosity.DEFAULT)
  tf.compat.v1.logging.info("-----------------------------")
############## npu modify end #############
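The block above reports two throughput figures: an average over the whole training time, and a second value that excludes skipped steps and uses the time recorded by the hook. A minimal sketch of the two formulas, using a hypothetical function name and made-up numbers:

```python
def training_throughput(num_train_steps, skipped_steps, global_batch_size,
                        train_time_elapsed, train_time_wo_overhead):
    """Return (average with overhead, average without overhead) in sentences/sec."""
    avg = num_train_steps * global_batch_size * 1.0 / train_time_elapsed
    steady = ((num_train_steps - skipped_steps) * global_batch_size * 1.0
              / train_time_wo_overhead)
    return avg, steady


# Illustrative values: 1000 steps, 10 of them skipped, global batch size 128.
avg, steady = training_throughput(1000, 10, 128, 250.0, 198.0)
print("with overhead = %0.2f, without overhead = %0.2f" % (avg, steady))
```

The second figure is normally the higher one, because startup and warm-up time is left out of its denominator.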
- Evaluation time and sentence count:
eval_time_elapsed = time.time() - eval_start_time
############## npu modify begin #############
eval_time_wo_overhead = eval_hooks[-1].total_time
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size
############## npu modify end #############
# Original code:
# time_list = eval_hooks[-1].time_list
# time_list.sort()
# # Removing outliers (init/warmup) in throughput computation.
# eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
# num_sentences = (int(len(time_list) * 0.99)) * FLAGS.eval_batch_size
- Evaluation log output:
############## npu modify begin #############
#
tf.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
                (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size)
############## npu modify end #############
# Original code:
# tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead, num_sentences)
tf.logging.info("Summary Inference Statistics on EVAL set")
tf.logging.info("Batch size = %d", FLAGS.eval_batch_size)
tf.logging.info("Sequence Length = %d", FLAGS.max_seq_length)
############## npu modify begin #############
#
tf.logging.info("Precision = %s", "fp16" if FLAGS.use_fp16 else "fp32")
############## npu modify end #############
# Original code:
# tf.compat.v1.logging.info("Precision = %s", "fp16" if FLAGS.amp else "fp32")
- throughput_val logging:
############## npu modify begin #############
#
dllogging.logger.log(step=(), data={"throughput_val": ss_sentences_per_second}, verbosity=Verbosity.DEFAULT)
############## npu modify end #############



"BERT/run_pretraining.py"



############## npu modify begin #############
if __name__ == "__main__":
  flags.mark_flag_as_required("input_files_dir")
  # "eval_files_dir"
  flags.mark_flag_as_required("eval_files_dir")
  flags.mark_flag_as_required("bert_config_file")
  flags.mark_flag_as_required("output_dir")
  flags.mark_flag_as_required("npu_bert_job_start_file")
  if FLAGS.use_xla and FLAGS.manual_fp16:
    print('WARNING! Combining --use_xla with --manual_fp16 may prevent convergence.')
    print('         This warning message will be removed when the underlying')
    print('         issues have been fixed and you are running a TF version')
    print('         that has that fix.')
  tf.compat.v1.app.run()
############## npu modify end #############
# Original code:
# if __name__ == "__main__":
#   flags.mark_flag_as_required("input_files_dir")
#   if FLAGS.do_eval:
#     flags.mark_flag_as_required("eval_files_dir")
#   flags.mark_flag_as_required("bert_config_file")
#   flags.mark_flag_as_required("output_dir")
#   if FLAGS.use_xla and FLAGS.manual_fp16:
#     print('WARNING! Combining --use_xla with --manual_fp16 may prevent convergence.')
#     print('         This warning message will be removed when the underlying')
#     print('         issues have been fixed and you are running a TF version')
#     print('         that has that fix.')
#   tf.compat.v1.app.run()
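The modified entry point marks five flags as required, including "eval_files_dir" unconditionally. The sketch below mimics that check in plain Python, without absl. REQUIRED_FLAGS and missing_required are hypothetical names; the rule that only a value of None counts as missing is how absl's required-flag validation is understood to behave and is stated here as an assumption.

```python
REQUIRED_FLAGS = (
    "input_files_dir",
    "eval_files_dir",
    "bert_config_file",
    "output_dir",
    "npu_bert_job_start_file",
)


def missing_required(flag_values):
    """Return the required flag names whose value is None or absent."""
    return [name for name in REQUIRED_FLAGS if flag_values.get(name) is None]


# Example values only; an empty string still counts as "set" in this sketch.
example = {
    "input_files_dir": "/home/data/bert/cn-clue-256",
    "eval_files_dir": "/home/data/bert/cn-clue-256",
    "bert_config_file": "../configs/bert_base_config.json",
    "output_dir": "./output",
    "npu_bert_job_start_file": "",
}
print(missing_required(example))
```

Under this assumption, passing a required flag with an empty value satisfies the check even though no file is named.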

9.2.8 





Data set: tfrecord files (bookscorpus), stored in "/home/data/bert/cn-clue-256/".

ranktable file: see RANK_TABLE_FILE in npu_set_env.sh.







Step 1 Configure npu_set_env.sh in the "Bert/scripts" directory. An example npu_set_env.sh is as follows:
# main env
export LD_LIBRARY_PATH=/usr/local/:/usr/local/lib/:/usr/lib/:/usr/local/Ascend/fwkacllib/lib64/:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/
export PYTHONPATH=$PYTHONPATH:/usr/local/Ascend/opp/op_impl/built-in/ai_core/tbe:/code
export PATH=$PATH:/usr/local/Ascend/fwkacllib/ccec_compiler/bin
export ASCEND_OPP_PATH=/usr/local/Ascend/opp
export SOC_VERSION=Ascend910
export HCCL_CONNECT_TIMEOUT=600

# user env
export JOB_ID=bert-base-1p
export RANK_TABLE_FILE=../configs/1p.json
export RANK_SIZE=1
export RANK_INDEX=0
export RANK_ID=0

# profiling env
export PROFILING_MODE=true
export AICPU_PROFILING_MODE=false
export PROFILING_OPTIONS=task_trace:training_trace
export FP_POINT=bert/embeddings/GatherV2
export BP_POINT=gradients/bert/embeddings/IdentityN_1_grad/UnsortedSegmentSum

# debug env
#export DUMP_GE_GRAPH=2
#export DUMP_OP=1
#export DUMP_OP_LESS=1
#export PRINT_MODEL=1
#export TE_PARALLEL_COMPILER=0

# system env
ulimit -c unlimited

Step 2 Configure run_pretraining.sh in the "Bert/scripts" directory. An example run_pretraining.sh is as follows:
#!/bin/sh
currentDir=$(cd "$(dirname "$0")"; pwd)
cd ${currentDir}

PWD=${currentDir}

device_id=0
if [ x"${device_id}" = x ] ; then
    echo "turing train fail" >> ${currentDir}/train_${device_id}.log
    exit
else
    export DEVICE_ID=${device_id}
fi

DEVICE_INDEX=$(( DEVICE_ID + RANK_INDEX * 8 ))
export DEVICE_INDEX=${DEVICE_INDEX}

env > ${currentDir}/env_${device_id}.log

#mkdir exec path
#mkdir -p ${currentDir}/${device_id}
#rm -rf ${currentDir}/${device_id}/*
cd ${currentDir}/
rm -rf kernel_meta
rm -rf output
#start exec
python3.7 ../run_pretraining.py --bert_config_file=../configs/bert_base_config.json --max_seq_length=128 --max_predictions_per_seq=20 --train_batch_size=128 --learning_rate=1e-4 --num_warmup_steps=10000 --num_train_steps=500000 --optimizer_type=adam --manual_fp16=True --use_fp16_cls=True --input_files_dir=/home/data/bert/cn-clue-256 --eval_files_dir=/home/data/bert/cn-clue-256 --npu_bert_use_tdt=True --do_train=True --num_accumulation_steps=1 --npu_bert_job_start_file= --iterations_per_loop=100 --save_checkpoints_steps=10000 --npu_bert_clip_by_global_norm=False --distributed=True --npu_bert_loss_scale=0 --output_dir=./output

--input_files_dir=/home/data/bert/cn-clue-256 and --eval_files_dir=/home/data/bert/cn-clue-256 are the data set paths.
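The parameters are described in Table 9-14.

Table 9-14 Parameter description
Parameter                       Description
bert_config_file                Configuration file of the BERT model.
max_seq_length                  Maximum input sequence length of the BERT model.
max_predictions_per_seq         Maximum number of masked predictions per sequence.
train_batch_size                Training BatchSize.
learning_rate                   Initial learning rate.
num_warmup_steps                Number of warm-up steps.
num_train_steps                 Number of training steps.
optimizer_type                  Optimizer type.
manual_fp16                     Whether to use tf.float16.
use_fp16_cls                    Whether the cls and pooler tensors use tf.float16 (Ascend 910).
input_files_dir                 Directory of the training input files.
eval_files_dir                  Directory of the evaluation input files.
npu_bert_use_tdt                Whether to use tdt.
do_train                        Whether to run training.
num_accumulation_steps          Number of accumulation steps.
npu_bert_job_start_file         Job start file (left empty in the example command).
iterations_per_loop             Number of steps per loop.
save_checkpoints_steps          Checkpoint saving interval, in steps.
npu_bert_clip_by_global_norm    Whether to clip by global norm (Ascend 910).
distributed                     Whether to run distributed training.
npu_bert_loss_scale             loss scaling setting.
output_dir                      Output directory.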


----
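run_pretraining.sh derives DEVICE_INDEX from DEVICE_ID and RANK_INDEX with the shell expression DEVICE_ID + RANK_INDEX * 8. The same arithmetic in Python is sketched below. The function name and the parameter name devices_per_server are assumptions; the script itself only hard-codes the constant 8.

```python
def device_index(device_id, rank_index, devices_per_server=8):
    """Python equivalent of DEVICE_INDEX=$(( DEVICE_ID + RANK_INDEX * 8 ))."""
    if device_id < 0 or rank_index < 0:
        raise ValueError("device_id and rank_index must be non-negative")
    return device_id + rank_index * devices_per_server


# With the example environment (DEVICE_ID=0, RANK_INDEX=0) the index is 0.
print(device_index(0, 0))
# Illustrative multi-server case: device 3 at rank index 1.
print(device_index(3, 1))
```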





BERT
  configs                     # JSON configuration files for BERT
    1p.json                   // NPU 1P
    bert_base_config.json     // BERT
  scripts
    npu_set_env.sh
    run_pretraining.sh
  utils                       # utils
    utils.py
  __init__.py
  gpu_environment.py          // gpu_environment
  modeling.py                 // BERT
  optimization.py
  run_pretraining.py
  CONTRIBUTING.md             // CONTRIBUTING.md
  README.md



 


10 

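10.1 Installing gcc 7.3.0
10.2

10.1 Installing gcc 7.3.0
Perform the following steps as the root user.
Step 1 Download gcc-7.3.0.tar.gz from https://mirrors.tuna.tsinghua.edu.cn/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.gz.
Step 2 Clear the /tmp directory used by the gcc installation:
sudo rm -rf /tmp/*
Step 3 Install the dependency.
On centos/bclinux:
yum install bzip2
On ubuntu/debian:
apt-get install bzip2
Step 4 Build and install gcc.
1. Extract gcc-7.3.0.tar.gz:
tar -zxvf gcc-7.3.0.tar.gz
2. Go to the extracted directory and download the gcc dependency packages:
cd gcc-7.3.0
./contrib/download_prerequisites
If the command fails, download the packages into the "gcc-7.3.0/" directory manually:
wget http://gcc.gnu.org/pub/gcc/infrastructure/gmp-6.1.0.tar.bz2
wget http://gcc.gnu.org/pub/gcc/infrastructure/mpfr-3.1.4.tar.bz2
wget http://gcc.gnu.org/pub/gcc/infrastructure/mpc-1.0.3.tar.gz
wget http://gcc.gnu.org/pub/gcc/infrastructure/isl-0.16.1.tar.bz2
Then run the following command again:
./contrib/download_prerequisites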


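3. Configure, build, and install:
./configure --enable-languages=c,c++ --disable-multilib --with-system-zlib --prefix=/usr/local/linux_gcc7.3.0
make -j15    # Run "grep -w processor /proc/cpuinfo|wc -l" to check the number of CPUs (15 in this example).
make install

"--prefix" specifies the linux_gcc7.3.0 installation path. Do not set it to "/usr/local" or "/usr", which would conflict with the gcc installed by the system. This example uses "/usr/local/linux_gcc7.3.0".
Step 5 Configure the environment variable before using gcc:
export LD_LIBRARY_PATH=.../xxx/xxx/xxx/lib64
".../xxx/xxx/xxx/" is the installation path set in 3. In this example it is "/usr/local/linux_gcc7.3.0/".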

gcc
----
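After the installation, it can be useful to confirm that the gcc found on the path reports version 7.3.0. The sketch below only parses a version string; the function names and the sample strings are illustrative assumptions, and the first x.y.z number found in the text is taken as the version.

```python
import re


def parse_gcc_version(version_output):
    """Extract the first x.y.z version number from `gcc --version` style text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return tuple(int(part) for part in match.groups())


def is_expected_gcc(version_output, expected=(7, 3, 0)):
    """True when the parsed version equals the expected one."""
    return parse_gcc_version(version_output) == expected


# Sample strings for illustration only, not captured from a real system.
print(is_expected_gcc("gcc (GCC) 7.3.0"))
print(is_expected_gcc("gcc (GCC) 4.8.5"))
```

In practice the input string would come from running the gcc binary under the installation path with the --version option, for example through subprocess.run.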

10.2 
 2020-07-22

 
