NCU Institutional Repository (中大機構典藏; theses and dissertations, past exam papers, journal articles, and research projects available for download): Item 987654321/77510


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/77510


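    Title: Clustering of Template Page for Data Extraction (樣板網頁結構自動分群)
    Author: Wu, Jia-Ru (吳佳儒)
    Contributor: Executive Master Program, Department of Computer Science and Information Engineering
    Keywords: feature selection; template web page extraction; hierarchical clustering; unsupervised clustering
    Date: 2018-07-23
    Uploaded: 2018-08-31 14:46:33 (UTC+8)
    Publisher: National Central University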
    Abstract: In Web data extraction, the diversity of page content and the complexity of page structure make it a persistent challenge to extract data automatically from pages built on different templates.
    Web data extraction systems fall into two main categories, record-level and page-level. Both take pages that share one template as input and perform data extraction or schema induction; research on clustering pages that come from different templates is comparatively rare.
    This thesis proposes a method that automatically clusters web pages by the similarity of their structure, which simplifies extraction across different templates. Using a purpose-designed set of page features, we implement both unsupervised and supervised clustering and compare their performance. Although overall clustering performance falls short of expectations, the target cluster reaches a precision of 99% and a recall of approximately 78% with unsupervised clustering, and a precision of 97% and a recall above 80% with supervised clustering.
    Finally, the clustering result can be combined with the Page-Level Information Extraction System (UWIDE) to generate a complete page schema and extract the required POI-related information. The extracted data can then be accumulated in a database to improve the efficiency and quality of related value-added services.
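The idea summarized in the abstract (group pages by structural similarity, then score the target cluster by precision and recall) can be illustrated with a minimal sketch. This record does not describe the thesis's actual features or algorithm settings, so everything below is an assumption made for illustration: pages are represented by their set of root-to-node tag paths, similarity is the Jaccard index, clustering is average-linkage agglomerative with a hypothetical threshold of 0.6, and the names `tag_paths`, `cluster`, and `precision_recall` are invented here.

```python
from html.parser import HTMLParser
from itertools import combinations


class TagPathCollector(HTMLParser):
    """Collects root-to-node tag paths such as 'html/body/div' (assumed feature)."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(features, threshold):
    """Average-linkage agglomerative clustering over page indices.

    Repeatedly merges the most similar pair of clusters until the best
    pair's similarity drops below `threshold`.
    """
    clusters = [[i] for i in range(len(features))]

    def sim(c1, c2):
        total = sum(jaccard(features[i], features[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        if sim(clusters[a], clusters[b]) < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def precision_recall(predicted, relevant):
    """Precision and recall of one predicted cluster against the true target set."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy pages: two share a div-based layout, two share a table-based layout.
pages = [
    "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
    "<html><body><div><h1>B</h1><p>y</p></div></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
    "<html><body><table><tr><td>2</td></tr><tr><td>3</td></tr></table></body></html>",
]
features = [tag_paths(p) for p in pages]
clusters = cluster(features, threshold=0.6)
print(clusters)  # [[0, 1], [2, 3]]
```

Scoring only the target cluster, as `precision_recall` does, mirrors the way the abstract reports results: a cluster can be nearly pure (high precision) while still missing some pages of the target template (lower recall), even when the overall clustering is weak.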
    Appears in collections: [Executive Master of Computer Science and Information Engineering] Theses and Dissertations

    Files in this item:

    File          Description   Size   Format   Views
    index.html    -             0Kb    HTML     238


    All items in NCUIR are protected by their original copyright.
