Instance Selection with Unsupervised Learning under Cost Constraints (在成本限制下，以非監督式學習進行樣本選取之研究)

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/74818

Title:	Instance Selection with Unsupervised Learning under Cost Constraints (在成本限制下，以非監督式學習進行樣本選取之研究)
Authors:	張詒鈞;Zhang, Yi-Jun
Contributors:	Department of Information Management
Keywords:	Document classification; Unsupervised instance selection; Cost
Date:	2017-08-17
Issue Date:	2017-10-27 14:40:56 (UTC+8)
Publisher:	National Central University
Abstract:	With advances in technology and the rise of big data, the importance and practical value of data have become increasingly recognized. Many researchers have therefore turned to data mining, hoping to uncover the value hidden in large volumes of data and to build applications on it, such as using classifiers to predict the category of an article. For a classifier, the more representative the training data are of the whole data set, the better the training result. Building a classifier requires labeling the training data manually, and because articles vary in length, the cost of labeling is not the same for every instance. This study focuses on selecting instances with unsupervised learning under a cost constraint: each instance in the experiments is assigned a selection cost, and the total cost of the selected training data is capped. The thesis uses two clustering algorithms, Bisecting K-means and Hierarchical Clustering, and selects data in two ways: by best point, and by best point under cost considerations. The selected training data are then used to build models with five different classifiers in order to measure the classification results obtained from the selected data. The experimental results show that, for each of the five classifiers, at least one of the proposed methods performs relatively better than random selection. With the proposed approach, more representative data can be selected from unlabeled data without exceeding the cost limit; if these data are handed to experts for labeling, a better classification model can be trained and the cost of labeling is substantially reduced.
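The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define "best point", so the sketch assumes it means the cluster member closest to the cluster centroid, and it assumes the cost-aware variant scores each member as (1 + distance) × cost. The function names `centroid` and `select_under_budget` are hypothetical, and the cluster labels are assumed to come from a separate clustering step such as Bisecting K-means or Hierarchical Clustering.

```python
import math


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def select_under_budget(points, costs, labels, budget, cost_aware=False):
    """Pick at most one representative per cluster without exceeding budget.

    Assumption (not from the thesis): the "best point" of a cluster is the
    member nearest its centroid; the cost-aware score is (1 + distance) * cost.
    """
    # Group instance indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    # Choose one candidate per cluster.
    candidates = []
    for members in clusters.values():
        center = centroid([points[i] for i in members])

        def score(i):
            dist = math.dist(points[i], center)
            return (1.0 + dist) * costs[i] if cost_aware else dist

        candidates.append(min(members, key=score))

    # Greedily keep the cheapest candidates while the budget allows.
    chosen, spent = [], 0.0
    for i in sorted(candidates, key=lambda i: costs[i]):
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return sorted(chosen), spent


# Toy data: two clusters on a line; the central point of each is expensive.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
labels = [0, 0, 0, 1, 1, 1]
costs = [1, 5, 1, 2, 6, 2]

print(select_under_budget(points, costs, labels, budget=6))
print(select_under_budget(points, costs, labels, budget=6, cost_aware=True))
```

On this toy data, the plain variant picks the expensive central points and can afford only one cluster within the budget, while the cost-aware variant picks cheaper near-central points and covers both clusters.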
Appears in Collections:	[Graduate Institute of Information Management] Electronic Thesis & Dissertation
