Technical Report

 

Cost Effective Conceptual Design for Semantic Annotation (Extended Version) Público Deposited

Conteúdo disponível para baixar

Baixar PDF
https://ir.library.oregonstate.edu/concern/technical_reports/br86b495m

Descriptions

Attribute NameValues
Creator
Abstract
  • It is well established that annotating occurrences of entities in a collection of unstructured or semi-structured text documents with their concepts (i.e. entity sets), called semantic annotation, improves the effectiveness of answers to users’ queries. However, an enterprise has to spend large amount of financial, computational, and human resources to develop and deploy an annotation program to semantically annotate a concept in a large collection. Moreover, since the structure and content of the documents may evolve over time, the annotation programs should be often rewritten and repaired. These efforts are even more costly for concepts that are defined in specific domains, such as medicine and the law, as they require extensive collaboration between developers and domain experts. Since the available resources in an enterprise are limited, it has to select only a subset of relevant concepts for annotation. We call this subset a conceptual design for the annotated collection. To the best of our knowledge, finding a conceptual design for semantic annotation are generally left to intuition. In this paper, we introduce and formally define the problem of cost effective conceptual design, where given a set of relevant concepts and a fixed budget, one likes to find a conceptual design that improves the effectiveness of answers to users’ queries the most. We prove that the problem is generally NP-hard in the number of relevant concepts and propose two efficient approximation algorithms: Approximate Popularity Maximization (APM for short) and Annotation-Benefit Maximization (AAM for short). We prove that APM has a constant approximation ratio and AAM is a fully polynomial time approximation scheme algorithm. We validate our analysis using extensive experiments over Wikipedia articles and queries form a real-world search engine query log. Our results indicate that AAM generally returns optimal or near-optimal conceptual designs that are more effective than the solutions provided by APM in most cases. Since the precise values of the input parameters for APM and AAM may not be available at the design time, we analyze the sensitivity of AAM to the estimation errors in their input parameters. Our results show that using input parameters computed over small samples of the collection, AAM generally return the same answers as the cases where they have full information about the values of their input parameters.
Resource Type
Date Available
Date Issued
Palavra-chave
Declaração de direitos
Publisher
Peer Reviewed
Language
Replaces

Relações

Parents:

This work has no parents.

Itens