Schema Mining and Information Extraction for PDF Documents (基於資料結構探勘 PDF 文本資訊擷取系統之設計與開發)


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/95397

Title:	Schema Mining and Information Extraction for PDF Documents (基於資料結構探勘 PDF 文本資訊擷取系統之設計與開發)
Author:	Peng, Hsiu-Wen (彭綉雯)
Contributor:	In-service Program, Department of Computer Science and Information Engineering
Keywords:	Sequential Pattern Mining; In-context Learning; Online Learning; Large Language Model
Date:	2024-04-26
Uploaded:	2024-10-09 16:46:09 (UTC+8)
Publisher:	National Central University
Abstract:	A large amount of information on the internet is stored as PDF, such as court judgments, financial reports, and admission brochures. Many applications and services need to convert this information into a structured format for subsequent use. Typically, this involves manually defining the data structure and then extracting data according to that structure to train models, which is labor-intensive and time-consuming. How to define data structures efficiently and extract data accurately is therefore the main focus of this study. This thesis combines two tasks, data mining and data extraction, to develop an interactive online-learning data extraction system. The former uses PrefixSpan to help users find patterns in the target documents, so that they can efficiently define the data structure of those documents. The latter adopts a finite-state transducer (FST), a traditional machine learning approach, which learns extraction rules from a small amount of labeled data according to the defined data structure and completes the data extraction task using those rules.
Because data mining can uncover too many patterns, we reduce the number of patterns by excluding items (for example, page numbers or line numbers in the documents) and further analyze different document format types. For the data extraction task, we implemented two LLM-based extraction methods: LangChain and ChatGPT-QA. Experimental results show that LangChain outperforms ChatGPT-QA in extraction performance, with average F1 scores of 0.77 and 0.63, respectively. We also compared two labeling methods, manual labeling and LangChain labeling, to evaluate whether LangChain can replace manual labeling. Experiments using the FST for data extraction show that it cannot: the average F1 scores are 0.91 with manual labeling and 0.70 with LangChain labeling.
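Appears in Collections:	[In-service Master Program, Department of Computer Science and Information Engineering] Theses and Dissertations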

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	39

All items in NCUIR are protected by original copyright.
