National Taiwan University of Science and Technology, Department of Computer Science and Information Engineering
智慧型系統實驗室 研究論文
Intelligent System Laboratory Paper

Thesis by 陳威達, master's graduate, class of 96


使用查詢擴展技術及支援向量機由網路資料集挖掘中文姓名翻譯

摘要

    中文姓名翻譯是屬於專名實體翻譯中的一種特殊案例。 因為在翻譯中文姓名的方法中存在許多不同種類的羅馬拼音系統,且許多人會在所翻譯過的名字中添加額外與本身中文姓名不相關的字。 而將某學者的姓名正確的翻譯成英文將能夠對人們在網路上尋找此學者的相關學術成就有很大的幫助,因此中文姓名的翻譯成為一個重要的議題。 在這篇論文中,我們首先提出一個為中文姓名之翻譯分類的方法,接著提出一個新的方法來從網路資料集中挖掘出中文姓名的翻譯。 我們的方法利用查詢擴展技術及支援向量機與“發音”與“距離”這兩種特徵來設法取得可能的姓名翻譯。利用查詢擴展技術能夠有效且更精確的回收同時含有輸入人名與其英文翻譯的網頁, 而利用支援向量機透過範例的訓練學習來判別姓名翻譯候選的正確與否可減少使用啟發式法則時因主觀判斷而產生的副作用。我們將中文姓名依其相對應的英文翻譯分成八種類型, 實驗結果顯示我們的方法可將三種較常見的類型有效的翻譯。


Mining Translations of Chinese Names from Web Corpora by Using a Query Expansion Technique and Support Vector Machine

Abstract

    Chinese name translation is a special case of named entity translation. It is challenging because many Romanization systems exist and because some people add words unrelated to their Chinese names to their English names. Because correctly translating a scholar’s name into English helps people find information about that scholar’s academic achievements on the Web, Chinese name translation is an important problem. In this thesis, we first propose a classification of Chinese names and then propose a novel method for mining Chinese name translations from Web corpora. Our method uses a query expansion technique and a Support Vector Machine (SVM), together with two kinds of features, phonetic and distance features, to extract name translation candidates. Query expansion retrieves Web pages that contain both the input Chinese name and its translation more effectively and precisely. Using an SVM to learn the verification rule for name translation candidates from training samples reduces the side effects of the subjective judgment involved in heuristic rules. We classify Chinese names into eight types according to their corresponding translations. The experimental results show that our method can effectively mine the correct translations for three of the more common name types.
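    The two feature kinds named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the thesis's actual definitions: the toy romanization table TOY_PINYIN, the function names, the use of difflib string similarity as a stand-in for the phonetic feature, and character offset as a stand-in for the distance feature. In a full pipeline, vectors like these would be passed to an SVM classifier that accepts or rejects each candidate.

```python
from difflib import SequenceMatcher

# Hypothetical toy romanization table (Hanyu Pinyin only). The thesis deals
# with many Romanization systems, which this sketch does not model.
TOY_PINYIN = {"陳": "chen", "威": "wei", "達": "da"}


def phonetic_feature(chinese_name, candidate):
    """Similarity in [0, 1] between the toy romanization and the candidate."""
    romanized = "".join(TOY_PINYIN.get(ch, "") for ch in chinese_name)
    # Keep only letters so spaces and hyphens do not affect the comparison.
    letters = "".join(c for c in candidate.lower() if c.isalpha())
    return SequenceMatcher(None, romanized, letters).ratio()


def distance_feature(snippet, chinese_name, candidate):
    """Character offset between name and candidate, or -1 if either is absent."""
    name_pos = snippet.find(chinese_name)
    cand_pos = snippet.find(candidate)
    if name_pos < 0 or cand_pos < 0:
        return -1
    return abs(cand_pos - name_pos)


def feature_vector(snippet, chinese_name, candidate):
    """Two-element vector: [phonetic similarity, distance in characters]."""
    return [
        phonetic_feature(chinese_name, candidate),
        distance_feature(snippet, chinese_name, candidate),
    ]


if __name__ == "__main__":
    snippet = "陳威達 (Chen Wei-Da) 教授"
    print(feature_vector(snippet, "陳威達", "Chen Wei-Da"))
```

    A candidate that sounds like the input name and appears close to it in a retrieved snippet yields a high similarity and a small distance. Learning the decision boundary from labeled examples, rather than fixing thresholds by hand, is the role the abstract assigns to the SVM.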