Active Learning for Classical Chinese Word Segmentation (主動式學習之古漢語斷詞)


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/75924

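Title:	Active Learning for Classical Chinese Word Segmentation (主動式學習之古漢語斷詞)
Author:	Tsai, Jung-Yi (蔡融易)
Contributor:	Graduate Institute of Software Engineering
Keywords:	Natural Language Processing; Active Learning; Classical Chinese Word Segmentation
Date:	2018-01-25
Uploaded:	2018-04-13 11:17:47 (UTC+8)
Publisher:	National Central University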
Abstract:	Advanced natural language processing (NLP) techniques such as event extraction, event classification, and automatic summarization would greatly help historians if they could be applied to classical Chinese. However, most NLP work on classical Chinese still addresses basic tasks (sentence segmentation, word segmentation, and named entity recognition) using supervised learning. Because few people can annotate classical Chinese and the barrier to entry is high, building training data for supervised methods takes considerable time, which in turn slows the development of more advanced NLP systems. The basic unit on which these advanced techniques are built is the semantic word, so inaccurate word segmentation directly lowers their accuracy. We therefore build a word segmentation system for classical Chinese that, unlike traditional approaches, requires no training data before segmentation.

Existing Chinese word segmentation modules are not suitable for classical Chinese because the grammar and vocabulary differ too much, so they cannot be used directly. Training a supervised model, on the other hand, requires substantial time and labor to define and annotate semantic words. Annotators need historical expertise, and annotating passages that lack punctuation further increases the manual effort. For these reasons, building a supervised model for classical Chinese is costly. We therefore segment with an unsupervised model and then use active learning to find fragments that are likely wrong, presenting them to a human for correction. Annotators no longer need to check the parts that are already highly accurate, which improves annotation efficiency.

This thesis implements active-learning-based word segmentation for classical Chinese and applies it to the Ming Shilu (明實錄). We use active learning in place of supervised learning, which needs a large amount of manual annotation, and we mitigate the drawback that an unsupervised model needs a large amount of data to reach satisfactory accuracy. A web interface presents the fragments that are likely wrong, reducing the number of corrections annotators must make.
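The workflow described in the abstract (segment without training data, then route only the doubtful fragments to a human) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record does not say which unsupervised model or selection criterion the thesis uses. Here a simple character-pair cohesion score stands in for the unsupervised model, and the gaps whose score lies closest to the cut threshold stand in for the "likely wrong" fragments. The names `build_cohesion`, `segment`, and `most_uncertain`, the threshold of 0.5, and the toy strings are all hypothetical and are not taken from the Ming Shilu.

```python
from collections import Counter

def build_cohesion(corpus):
    """Return a function scoring how strongly two adjacent characters cohere.

    Assumed stand-in score: min(P(b|a), P(a|b)), estimated from raw counts.
    """
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))

    def cohesion(a, b):
        ab = bigrams[a + b]
        if ab == 0:  # unseen pair: no evidence that the characters belong together
            return 0.0
        return min(ab / unigrams[a], ab / unigrams[b])

    return cohesion

def gap_scores(line, cohesion):
    """Score every gap between adjacent characters in a line."""
    return [cohesion(line[i], line[i + 1]) for i in range(len(line) - 1)]

def segment(line, cohesion, threshold=0.5):
    """Cut the line at every gap whose cohesion falls below the threshold."""
    words, start = [], 0
    for i, score in enumerate(gap_scores(line, cohesion)):
        if score < threshold:
            words.append(line[start:i + 1])
            start = i + 1
    words.append(line[start:])
    return words

def most_uncertain(corpus, cohesion, threshold=0.5, k=2):
    """Return the k (line index, gap index) pairs closest to the threshold.

    These are the decisions the model is least sure about, so they are
    the ones shown to the annotator first.
    """
    candidates = []
    for li, line in enumerate(corpus):
        for gi, score in enumerate(gap_scores(line, cohesion)):
            candidates.append((abs(score - threshold), li, gi))
    candidates.sort()
    return [(li, gi) for _, li, gi in candidates[:k]]

# Toy strings invented for this sketch.
toy_corpus = ["皇帝詔曰", "皇帝敕曰", "太祖皇帝"]
cohesion = build_cohesion(toy_corpus)
segmented = [segment(line, cohesion) for line in toy_corpus]
to_review = most_uncertain(toy_corpus, cohesion)
```

In a loop of this kind, the annotator's corrections for the gaps in `to_review` would be fed back (for example as fixed boundaries) before the next round of selection, so that human effort goes to low-confidence decisions rather than to the whole text.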
Appears in Collections:	[Graduate Institute of Software Engineering] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	337

All items in NCUIR are protected by the original copyright.
