Editor: kieth 2019-07-09

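It is therefore also inferred that feature selection is unnecessary when low-frequency features dominate and the Simple Good-Turing smoothing strategy is used. With all features in use, overall classification accuracy reaches 80%. The method was also applied to build a semi-automatic resource classification tool; when the resource granularity is specified manually, classifying resources takes 45%-50% of the baseline time. For resource collections whose original quality is good, a directory-merging model that exploits the original organizational knowledge is proposed to integrate resources; the two-stage working mode of coarse classification and fine inspection is characterized, and the efficiency of the model is evaluated. The coarse classification stage loses some precision, but the time to complete the task is 1/2a of the baseline approach (a is the number of resources processed per batch, a≥1); the fine inspection stage builds on the first stage, guarantees no loss of precision, and completes the task in about 1/2 of the baseline time. By continuously collecting from the Internet and applying the directory-merging model, a massive network resource repository system with a capacity of 7.5 TB was built efficiently and at low cost. By mapping the classification scheme onto file directories, and by designing the storage and organization functions modularly at both the server and disk levels, the system copes well with incremental storage, organization, and service demands. The system also draws on Ontology ideas to extend resources in popular categories with related descriptive information from the Internet.

Keywords: network resources, naming analysis, organization, automatic classification, directory merging

On the Name Characteristics of Digital Resources and Their Applications in Resource Organization
Chong Chen (Computer Science and Technology)
Directed by Professor Xiaoming Li

Abstract

In this dissertation, the term "digital resource" refers to non-web-page data that is:

1) usually composed of one or more files of various data types, existing within some directory structure;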

2) representing a single independent topic;

3) widely shared and distributed through FTP sites or P2P file systems;

4) organized by Internet users at will rather than in well-defined styles.

Internet users are increasingly concerned with digital resources. At the same time, digital resources are characterized by massive volume, disorder, and confusion. Collecting and organizing digital resources widely is a fundamental requirement for many applications. In this work, the most basic element is the resource name. On the one hand, names provide clues to the meaning of resources; on the other hand, they are used to identify resources. This dissertation first studies the disordered naming of digital resources and tries to identify the general naming habits of Internet users. Second, it studies how to segment resource names according to semantic meaning. Third, it studies how to use resource names in automatic resource classification and analyzes the factors that affect performance. Noting that there are many well-organized digital resources on the Web, we propose a method to reorganize resources from different file directories into a coherent classification framework, and we evaluate the efficiency of the integration process. As a practical application of the research above, we designed and implemented a scalable digital resource library that supports a massive volume of digital resources and can provide data and services to many academic institutions.

The contributions of this dissertation are as follows:

1) Study the disordered naming of digital resources and identify the general naming habits of Internet users. By examining name length, character type, fragment frequency distribution, the pointwise mutual information between file extensions and resource categories, and semantic information, we obtain an overall picture of the disorder and chaos in resource names.
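As a rough illustration of one of the measures named above, the pointwise mutual information between a file extension e and a resource category c can be estimated from co-occurrence counts as PMI(e, c) = log2(p(e, c) / (p(e) p(c))). The sketch below is not the dissertation's implementation; the function name `pmi_table` and the toy observations are hypothetical, chosen only to show the calculation.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Return {(extension, category): PMI in bits} for every observed pair.

    `pairs` is a list of (extension, category) tuples, one per resource file.
    Pairs that never co-occur are simply absent from the result.
    """
    pair_counts = Counter(pairs)
    ext_counts = Counter(ext for ext, _ in pairs)
    cat_counts = Counter(cat for _, cat in pairs)
    total = len(pairs)

    table = {}
    for (ext, cat), n in pair_counts.items():
        p_joint = n / total                # p(e, c)
        p_ext = ext_counts[ext] / total    # p(e)
        p_cat = cat_counts[cat] / total    # p(c)
        table[(ext, cat)] = math.log2(p_joint / (p_ext * p_cat))
    return table


# Hypothetical toy observations, not data from the dissertation.
observations = [
    ("mp3", "music"), ("mp3", "music"), ("mp3", "music"),
    ("avi", "movie"), ("avi", "movie"),
    ("rar", "music"), ("rar", "movie"), ("rar", "software"),
]
scores = pmi_table(observations)
```

A positive score means the extension appears with the category more often than independence would predict, and a negative score means less often. In this toy data, `scores[("mp3", "music")]` is 1.0 bit, while the generic `rar` extension scores below zero against `music`.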
For example, the information entropy of character types shows that resource names act as an expressive medium to which Internet users tend to add information about the digital resource, such as short descriptions and personal viewpoints. From the occurrence of symbols, we find that Internet users often use explicit or implicit separators within name text to mark transitions between different semantic meanings. These studies form the basis of the later research in this dissertation.

2) Propose a segmentation approach that detects semantic snippets in digital resource names without any lexicon. The approach is based on the idea of Transformation-Based Error-Driven Learning and on the assumption that name strings are split at positions of character-type transition. The same practice can be applied to similar problems in which texts are composed of various symbols and letters and densely express several types of semantic information. The method takes full advantage of context and does not require large-scale training data. Training on
