NCU Institutional Repository (theses and dissertations, past exam papers, journal articles, and research projects available for download): Item 987654321/84014


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/84014


    Title: Data Augmentation for Imbalanced Classification with Synthetic Text (人工合成文本之資料增益於不平衡文字分類問題)
    Authors: Huang, Chun-Ru (黃軍儒)
    Contributors: Department of Information Management
    Keywords: natural language generation; class imbalance; text classification; data augmentation
    Date: 2020-07-16
    Issue Date: 2020-09-02 17:55:09 (UTC+8)
    Publisher: National Central University
    Abstract: Class imbalance arises when class distributions are heavily skewed, and it is common in real-world text classification tasks. Because training data for the minority classes is scarce, text classifiers tend to overfit to the majority classes and underperform on the minority classes, which is especially undesirable when the minority classes are the ones of interest.
    In this thesis, we propose using several text generation models (MLE, SeqGAN, VAE, GPT-2) to generate synthetic text for data augmentation on the minority classes. In our experiments, we compare augmentation with synthetic text against augmentation with real-world text, and we evaluate synthetic text against a traditional sampling method and a synonym replacement method in terms of classification performance. We also examine the effect of different text representations.
    Our results show that augmenting with synthetic text from text generation models can address both the class imbalance problem in text classification and the shortage of minority-class data. Our approach outperforms a previous oversampling method (SMOTE) and the synonym replacement method.
    Looking at both long and short texts, we also find that the augmentation performance of each text generation model varies with the dataset size and the text length.
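The augmentation step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `augment_minority` and `make_bigram_generator` are hypothetical, and the word-level bigram sampler is only a toy stand-in for the generation models named in the abstract (MLE, SeqGAN, VAE, GPT-2). The sketch assumes that every class smaller than the largest one is topped up with synthetic text until the classes are balanced; the abstract does not state the augmentation ratio actually used.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A generator takes the real texts of one class and a count n,
# and returns n synthetic texts for that class.
Generator = Callable[[List[str], int], List[str]]


def make_bigram_generator(seed: int = 0, max_len: int = 12) -> Generator:
    """Toy stand-in for a text generation model (hypothetical helper).

    Samples word sequences from bigram transitions observed in the
    real texts. It is NOT one of the models evaluated in the thesis.
    """
    rng = random.Random(seed)

    def generate(real_texts: List[str], n: int) -> List[str]:
        # Record, for every word, the words that followed it.
        successors: Dict[str, List[str]] = {}
        for text in real_texts:
            words = ["<s>"] + text.split() + ["</s>"]
            for prev, nxt in zip(words, words[1:]):
                successors.setdefault(prev, []).append(nxt)
        samples = []
        for _ in range(n):
            word, sentence = "<s>", []
            while len(sentence) < max_len:
                word = rng.choice(successors[word])
                if word == "</s>":
                    break
                sentence.append(word)
            samples.append(" ".join(sentence))
        return samples

    return generate


def augment_minority(
    texts: List[str], labels: List[str], generate: Generator
) -> Tuple[List[str], List[str]]:
    """Add synthetic texts to each smaller class until all classes
    match the size of the largest class (assumed balancing target)."""
    counts = Counter(labels)
    target = max(counts.values())
    aug_texts, aug_labels = list(texts), list(labels)
    for label, count in counts.items():
        need = target - count
        if need > 0:
            real = [t for t, y in zip(texts, labels) if y == label]
            synthetic = generate(real, need)
            aug_texts.extend(synthetic)
            aug_labels.extend([label] * len(synthetic))
    return aug_texts, aug_labels


if __name__ == "__main__":
    texts = ["the order arrived on time", "great service and fast delivery",
             "item works as described", "refund was never issued"]
    labels = ["ok", "ok", "ok", "complaint"]
    aug_texts, aug_labels = augment_minority(
        texts, labels, make_bigram_generator(seed=0))
    print(Counter(aug_labels))  # both classes now have 3 examples
```

Passing the generator in as a callable keeps the balancing logic independent of the generation model, so a stronger model could be swapped in without changing `augment_minority`. The augmented set would then be used to train the classifier in place of the original imbalanced set.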
    Appears in Collections: [Graduate Institute of Information Management] Electronic Thesis & Dissertation

    Files in This Item:

    File          Description   Size   Format
    index.html                  0Kb    HTML


    All items in NCUIR are protected by copyright, with all rights reserved.
