Editor: kieth | 2019-07-09
800 samples, we achieve 81% precision and 83% recall over all semantic fragments.
3) We propose a method that automatically categorizes resources using the names of the resources and of their members. We study performance factors such as the feature distribution, the smoothing method used in probability estimation, and the number of samples. We find that the large number of low-frequency features, especially those that appear in only one resource, contribute to classification accuracy by helping to produce reasonable probability estimates for unobserved features. Consequently, the usual feature selection procedures of text categorization are unnecessary in this setting. When all features extracted from the name strings are used, the overall accuracy of the proposed classification method reaches 80%. As an application of this method, we implemented a semi-automatic classification tool that classifies resources at only 45% to 55% of the time cost of the benchmark method.
4) We propose a tree-merge model that maps resources, originally organized in file system directories and dispersed across the Internet, into a coherent classification architecture. The model performs well when the original organization is of good, usable quality. The model defines two phases: the first is a rough classification that trades a small loss of precision for speed, and the second is a refinement phase that corrects misclassifications until the required quality is reached. In the first phase, the time cost is only 1/(2a) of the baseline, where a is the average number of resources classified in one judgement (a ≥ 1). In the refinement phase, the time cost is only half that of the baseline.
5) We continuously collected digital resources from the Internet and built a 7.5 TB resource library system based on the directory tree-merge scheme, a low-cost and highly efficient approach.
By mapping the classification architecture onto the file system directory tree, we implemented the system with a modular design. The scheme meets the demands of incremental storage, organization, and service. Additionally, we obtain resource descriptions from the Web with the aid of purified resource names, category information, and the expansion terms of each category.
Keywords: Digital resource, Naming analysis, Resource organization, Classification, Directory tree-merge
Contents
Chapter 1 Introduction
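1.1 Research Background
1.2 Research Objectives
1.2.1 Difficulties and Challenges
1.2.2 Research Roadmap
1.3 Overview of Network Resources
1.3.1 Definition of Resources in This Thesis
1.3.2 Data Model of Resources
1.3.3 Storage Organization Model of Network Resources
1.4 Organization of Network Resources and the Significance of This Research
1.5 Main Work of This Thesis
1.6 Main Contributions of This Thesis
1.7 Structure of This Thesis
Chapter 2 Disorder in Network Resource Names and User Naming Behavior
2.1 Introduction
2.2 Basic Concepts
2.3 Quantitative Assessment of the Degree of Naming Disorder
2.3.1 Distribution of Name Lengths
2.3.2 Complementarity in Expressing Resource Information Through Names
2.3.3 File Name Suffixes
2.3.4 Relationship Between File Suffixes and Resource Categories
2.3.5 Character Composition
2.3.6 Name Fragment Frequency
2.3.7 Semantic Fragments
2.4 Related Work
2.5 Chapter Summary
Chapter 3 Segmentation of Semantic Fragments in Network Resource Names
3.1 Introduction
3.2 Overview of Semantic Information Segmentation
3.2.1 Two-Level Mapping Strategy
3.2.2 The Character-Type Transition Segmentation Hypothesis
3.2.3 Related Work on Automatic Segmentation Methods
3.3 Automatic Segmentation Based on Error-Driven Transformation Learning
3.3.1 Basic Idea of Automatic Learning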