Neural Machine Translation (NMT) aims to translate a source-language sentence into the target language with deep learning models while preserving the semantics of the source sentence and producing correct syntax. The Transformer has become one of the most commonly used models in recent years: its Self-Attention Mechanism captures the global information of a sentence, and it performs well on many Natural Language Processing (NLP) tasks. However, prior studies have indicated that the Self-Attention Mechanism tends to learn repetitive information and cannot effectively learn local information in text. Therefore, this work modifies the Self-Attention Mechanism in the Transformer and proposes Gated Attention and Clustered Attention by adding a gate mechanism and the K-means clustering algorithm, respectively; Gated Attention further includes a Top-k% method and a Threshold method. These approaches centralize the Attention Map, strengthening the model's ability to capture local information and to learn more diverse relations within a sentence, and thereby improve translation quality.

We apply the Top-k% and Threshold methods of Gated Attention, as well as Clustered Attention, to a Chinese-to-English translation task, obtaining 25.30, 24.69, and 24.69 BLEU, respectively. A hybrid model that combines both attention mechanisms reaches at best 24.88 BLEU, which does not surpass using a single attention mechanism alone. The experiments confirm that the proposed models outperform the vanilla Transformer, and further show that using only one of the attention mechanisms helps the Transformer learn textual information better while achieving the goal of Attention Map centralization.
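The two Gated Attention variants can be pictured as masking the softmax attention weights of each query: the Top-k% method keeps only the largest k% of weights per query, while the Threshold method keeps only weights above a fixed cut-off, after which the surviving weights are renormalized. The sketch below is a minimal illustration of this gating idea under our own assumptions; the function name, tensor shapes, and renormalization step are hypothetical and not the thesis's actual implementation.

```python
import torch


def gated_attention_weights(scores, top_k_ratio=None, threshold=None):
    """Concentrate an Attention Map by keeping only its strongest entries.

    `scores` is a (batch, heads, query_len, key_len) tensor of raw attention
    logits. Exactly one of `top_k_ratio` (Top-k% gating) or `threshold`
    (Threshold gating) should be given. Illustrative sketch only.
    """
    assert (top_k_ratio is None) != (threshold is None)
    weights = torch.softmax(scores, dim=-1)
    if top_k_ratio is not None:
        # Top-k%: keep the largest top_k_ratio fraction of keys per query.
        k = max(1, int(weights.size(-1) * top_k_ratio))
        kth_largest = weights.topk(k, dim=-1).values[..., -1:]
        mask = weights >= kth_largest
    else:
        # Threshold: keep only weights above a fixed cut-off.
        mask = weights >= threshold
    gated = weights * mask
    # Renormalize so each query's surviving weights sum to 1 (assumed here).
    return gated / gated.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```

For example, `gated_attention_weights(scores, top_k_ratio=0.25)` keeps roughly the top quarter of keys for each query, which is one simple way to make each row of the Attention Map more concentrated on local, high-scoring positions.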