Abstract: | Data discretization (or simply discretization) is the process of converting continuous attribute values into discrete ones. Discretization simplifies the data and makes the results of data analysis easier to interpret. In addition, many well-known data mining algorithms, such as C4.5/5.0 decision trees and naïve Bayes, are better suited to handling discrete data. In practice, real-world datasets usually contain some noise, for example redundant or irrelevant features and outliers, which can negatively affect the final mining results. Moreover, the collected datasets may have missing (attribute) values. In the literature, data cleaning techniques, including feature selection, instance selection, and missing value imputation, have been widely used to solve these problems. However, a collected dataset of continuous values may require discretization for a specific mining purpose while also containing too many features, some outliers, and/or missing values. In that case, both discretization and the relevant data cleaning technique must be performed as data pre-processing steps. Very few studies in the related literature have investigated the interaction effects between discretization and the different data cleaning techniques.
Therefore, the aim of this three-year research project is to identify, for each of the three types of data cleaning techniques, the best order and combination in which to apply it together with discretization, in order to provide guidelines for processing and cleaning real-world data. In other words, the research question of this project is whether performing discretization first and data cleaning second (i.e., feature selection, instance selection, or missing value imputation) or performing data cleaning first and discretization second produces the better mining result. |
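
The ordering question in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the project's method: the abstract names no specific algorithms, so equal-width binning stands in for discretization, mean imputation (on continuous values) and mode imputation (on bin labels) stand in for missing value imputation, and the function names and the toy `ages` column are hypothetical.

```python
from statistics import mean, mode

def equal_width_bins(values, n_bins):
    """Map each non-missing value to a bin index 0..n_bins-1; None stays None."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant columns
    return [
        None if v is None else min(int((v - lo) / width), n_bins - 1)
        for v in values
    ]

def impute_mean(values):
    """Fill missing continuous values with the mean of the observed ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Fill missing discrete values with the most frequent observed bin."""
    fill = mode([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

def clean_then_discretize(values, n_bins):
    # Order A: impute on the continuous scale, then discretize.
    return equal_width_bins(impute_mean(values), n_bins)

def discretize_then_clean(values, n_bins):
    # Order B: discretize the observed values, then impute on the bin labels.
    return impute_mode(equal_width_bins(values, n_bins))

# Hypothetical toy column with one missing value.
ages = [1.0, 1.5, 2.0, None, 9.0, 10.0]
print(clean_then_discretize(ages, 3))
print(discretize_then_clean(ages, 3))
```

On this toy column the two orders assign the missing entry to different bins, which is the kind of interaction effect the project sets out to study. Which order gives the better mining result is the open question and is not settled by this sketch.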