National Taiwan University of Science and Technology, Department of Electronic Engineering
Fuzzy Neuron Laboratory Research Paper

Paper by 王柏翔, master's graduate, class of 92


QueryFind: A Web Page Ranking Method Based on User Feedback and Expert Recommendation

Abstract (translated from Chinese)

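The rapid growth of on-line information on the Internet has drawn the attention of many research groups. Traditional information retrieval techniques were developed for small, high-quality, homogeneous document collections, whereas on-line information is a large, heterogeneous collection of uneven quality, made up of semi-structured documents written in HyperText Markup Language (HTML). As a result, traditional information retrieval techniques perform poorly when applied to on-line information. Moreover, HTML is mainly used to define how a document is displayed through tags: authors use tags to define display styles that emphasize the important information in a document. Although researchers have used tag information to improve information retrieval, most of this work considers only a few specific tags and the meaning of the tags themselves. Besides being incomplete, it overlooks the changes in display effect that tags produce, and the applications derived from tag information are not broad enough. We therefore propose the concept of Style Mining and the technique of Style Retrieval, considering 40 tags that change how text is displayed together with the tag attributes that change text color and size, and we interpret the meaning and purpose of tag usage from the standpoint of human writing and reading habits. In addition, we propose five style-based applications: Style-based Feature Selection, Style Indexing, Style Cluster, Style Generator, and Style Locator. Style-based feature selection helps an automatic document classifier obtain high-quality document features and improves classification accuracy. Style indexing assists the ranking mechanism of a full-text search engine by ranking on style changes. The style cluster groups documents by the similarity of their design styles, which reduces the volume of query results and reorganizes them. The style generator applies relatively strong style changes to important keywords so that they stand out, improving the readability of the document. The style locator uses the rules of style change within a document to help a wrapper program extract specific content.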
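As a rough illustration of ranking on style changes, the sketch below scores each word in an HTML document by the strongest emphasis tag enclosing it. The tag list, the weight values, and the names `TAG_WEIGHTS`, `StyleWeigher`, and `style_scores` are hypothetical choices made for this example; the abstract gives neither the 40 tags nor any weighting scheme.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical emphasis weights; the abstract does not specify any values.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0,
               "b": 2.0, "strong": 2.0, "em": 1.5}


class StyleWeigher(HTMLParser):
    """Accumulates a per-word score weighted by the enclosing style tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # currently open emphasis tags
        self.scores = Counter()  # word -> accumulated style weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the most recently opened occurrence of this tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # Plain text counts 1.0; emphasized text takes the strongest open tag.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for word in data.lower().split():
            self.scores[word] += weight


def style_scores(html):
    """Return a Counter mapping each word to its style-weighted score."""
    parser = StyleWeigher()
    parser.feed(html)
    parser.close()
    return parser.scores


scores = style_scores("<h1>Fuzzy ranking</h1><p>ranking with <b>fuzzy</b> logic</p>")
```

Under these assumed weights, a word that appears in a heading or in bold outranks the same word in plain body text, which is the intuition behind both style indexing and style-based feature selection.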


Abstract

Given a query word, search engines can retrieve a vast number of Web pages from the World Wide Web for users. The main challenge for search engines, however, is to rank the retrieved Web pages effectively to meet users’ needs. Traditional ranking methods take content-oriented approaches to give each Web page a ranking score; the score is calculated by sophisticated approaches and is independent of users’ query words. As a result, Web pages cannot be fully matched to the information users require, and the pages most relevant to a query might not appear at the top of the search result list. That is, users still need to spend time seeking out the Web pages they require. We therefore propose a novel ranking method named QueryFind, based on learning from historical query logs, to predict users’ information needs and reduce the time spent searching the result list. Our method uses not only users’ feedback but also the source search engine’s recommendation: we exploit users’ feedback to implicitly judge the quality of Web pages, and we apply the meta-search concept to give each Web page a content-based ranking score. Consequently, the time users spend seeking their required information in the search result list can be reduced, and more relevant Web pages can be presented. In our experiments, one week of the Yam Search Engine’s query log is used for evaluation. We also propose a novel evaluation approach to verify the feasibility of our ranking method: it captures the ranking order and the Web pages that users clicked in the search result list. Finally, our experiments show that the time users spend seeking their required information can be reduced significantly.
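The idea of combining click feedback with the source engine's ordering can be sketched as follows. This is only an illustration under stated assumptions, not the QueryFind algorithm itself: the log record format, the reciprocal-rank engine score, the linear mix controlled by `alpha`, and the function names `click_rates`, `rerank`, and `mean_clicked_rank` are all hypothetical, since the abstract does not give the actual formulas.

```python
from collections import defaultdict


def click_rates(log):
    """Return clicks / impressions for each (query, url) pair.

    `log` is an iterable of (query, url, was_clicked) records
    (an assumed format, not the Yam query log layout).
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, url, was_clicked in log:
        shown[(query, url)] += 1
        if was_clicked:
            clicked[(query, url)] += 1
    return {key: clicked[key] / shown[key] for key in shown}


def rerank(query, engine_results, rates, alpha=0.5):
    """Reorder URLs by mixing user feedback with the source engine's order.

    `engine_results` lists URLs in the source engine's order; the engine
    score is assumed to be the reciprocal of the 1-based position.
    """
    def score(item):
        position, url = item
        engine_score = 1.0 / position
        feedback = rates.get((query, url), 0.0)
        return alpha * feedback + (1 - alpha) * engine_score

    ranked = sorted(enumerate(engine_results, start=1), key=score, reverse=True)
    return [url for _, url in ranked]


def mean_clicked_rank(ranking, clicked_urls):
    """Average 1-based position of clicked pages; lower means less seeking."""
    ranks = [ranking.index(u) + 1 for u in clicked_urls if u in ranking]
    return sum(ranks) / len(ranks) if ranks else None


log = [("fuzzy", "a", False), ("fuzzy", "b", False),
       ("fuzzy", "c", True), ("fuzzy", "c", True), ("fuzzy", "a", False)]
rates = click_rates(log)
original = ["a", "b", "c"]
reranked = rerank("fuzzy", original, rates)
```

In this toy log the page users actually clicked moves from the third position to the first, so `mean_clicked_rank` falls from 3.0 to 1.0. Comparing the positions of clicked pages before and after reranking is in the spirit of the evaluation approach described above.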