
Data Quality is in its Fitness to the Beholder

A few weeks ago Leigh Dodds began a thoughtful discussion on SemanticOverflow with the question:

There’s an increasing variety of data available as Linked Data coming from a range of different sources. I’m wondering what indicators we might use to judge the “quality” of a dataset…Clearly quality is a subjective thing, but I’d be interested to know what factors people might use to indicate whether a dataset was trustworthy, well modelled, sustainable, etc.

For starters, I think we can all agree at the highest level that the measure of data quality is subjective and that “beauty is in the eye of the beholder”: the quality of a dataset is measured by its fitness for use in specific applications. This question of determining and disseminating “fitness” scores is the rub!

In his answer to Leigh’s question, Tim Finin proposes adopting a PageRank-like mechanism, “LODrank”, based on measured usage:

We could define LODrank as a PageRank-like measure that was a function of the number of links to/from other LOD datasets weighted by their LODrank. Alternatively, it might be divided by the number of linkable instances in the collection, so that large datasets did not have an advantage…

This approach scores data quality based on observed fitness as evidenced by discovered use and has the advantage of automation.
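
A minimal sketch of what such a computation might look like, assuming we already have a cross-link graph between datasets; the function name, the damping factor, and the size normalization are illustrative choices, not details from Finin’s answer:

```python
# Hypothetical sketch of an "LODrank"-style score: a PageRank-like power
# iteration over cross-links between LOD datasets, optionally divided by
# dataset size so large collections don't dominate. Names are illustrative.

def lodrank(links, sizes, damping=0.85, iterations=50, normalize=True):
    """links: dict mapping dataset -> set of datasets it links to.
    sizes: dict mapping dataset -> number of linkable instances."""
    datasets = set(links) | {d for targets in links.values() for d in targets}
    n = len(datasets)
    rank = {d: 1.0 / n for d in datasets}

    for _ in range(iterations):
        new_rank = {d: (1.0 - damping) / n for d in datasets}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank

    if normalize:
        # Divide by the number of linkable instances so sheer volume alone
        # is not rewarded.
        rank = {d: r / max(sizes.get(d, 1), 1) for d, r in rank.items()}
    return rank


if __name__ == "__main__":
    links = {"dbpedia": {"geonames"}, "geonames": {"dbpedia"}, "smallset": {"dbpedia"}}
    sizes = {"dbpedia": 4_000_000, "geonames": 7_000_000, "smallset": 10_000}
    print(lodrank(links, sizes))
```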

My replies went in a different direction, focusing instead on the subjective nature of data quality and the need to aggregate consumer-space rankings of datasets across a set of dimensions. In his 2005 white paper Principles of Data Quality [1], Arthur D. Chapman writes:

Data quality is multidimensional, and involves data management, modelling and analysis, quality control and assurance, storage and presentation. As independently stated by Chrisman [2] and Strong et al. [3], data quality is related to use and cannot be assessed independently of the user. In a database, the data have no actual quality or value [4]; they only have potential value that is realized only when someone uses the data to do something useful. Information quality relates to its ability to satisfy its customers and to meet customers’ needs [5].

Chapman goes on to enumerate a set of factors that contribute to fitness-for-use, citing Redman [6]:

  • Accessibility
  • Accuracy
  • Timeliness
  • Completeness
  • Consistency with other sources
  • Relevance
  • Comprehensiveness
  • Providing a proper level of detail
  • Easy to “read”
  • Easy to “interpret”

Each of these factors is fundamentally subjective, even if mechanisms exist within particular domains to take their measure “objectively.” Indeed, in some domains such ratings might only be done by humans, either through voting mechanisms or by individual reviewers.

I believe the greater linked data community needs to develop vocabulary terms for expressing data quality metrics (consider the ten points above), and then individual communities need to agree on the means of determining those values. Arguably this is a “Dublin Core” approach to the problem, in the sense that terms like completeness or consistency would be reused across domains with inherently different domain-specific meanings, but such reuse would help consumers from other communities choose datasets outside their expertise. A non-physicist might then say, “The physics community says this dataset is accurate, by their measures.”
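
To make the idea concrete, here is a minimal sketch using rdflib and an invented dq: namespace showing how per-dimension quality assertions might be attached to a dataset description. The property names, URIs, and scores are hypothetical illustrations, not an existing vocabulary; only the voiD namespace is real.

```python
# Hypothetical sketch: expressing per-dimension quality assertions about a
# dataset with a made-up "dq:" vocabulary (namespace, property names, and
# values below are illustrative, not an existing standard).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

DQ = Namespace("http://example.org/data-quality#")   # invented vocabulary
VOID = Namespace("http://rdfs.org/ns/void#")          # real voiD namespace

g = Graph()
g.bind("dq", DQ)
g.bind("void", VOID)

dataset = URIRef("http://example.org/datasets/particle-physics")
g.add((dataset, RDF.type, VOID.Dataset))

# One assertion per fitness-for-use dimension; how each value is determined
# would be agreed within the publishing community (physicists, in this case).
g.add((dataset, DQ.accuracy, Literal(0.97, datatype=XSD.decimal)))
g.add((dataset, DQ.completeness, Literal(0.82, datatype=XSD.decimal)))
g.add((dataset, DQ.timeliness, Literal("updated-weekly")))
g.add((dataset, DQ.assessedBy, URIRef("http://example.org/communities/physics")))

print(g.serialize(format="turtle"))
```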

Some of these factors are even more deeply subjective and must be evaluated dynamically, based on the consumer’s immediate context. An example of this is relevance, which could be interpreted as equivalent to a recommendation.
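
One way to picture that is as a consumer-side aggregation: published per-dimension scores are combined with weights drawn from the consumer’s immediate context, so the same dataset scores differently for different applications. The dimension names and numbers below are purely illustrative.

```python
# Hypothetical sketch: a consumer-side "fitness for use" score combining
# community-published per-dimension ratings with weights from the consumer's
# own context. Dimension names and numbers are illustrative only.

def fitness_for_use(scores, context_weights):
    """Weighted average over the dimensions this consumer cares about."""
    relevant = {d: w for d, w in context_weights.items() if d in scores}
    total = sum(relevant.values())
    if total == 0:
        return 0.0
    return sum(scores[d] * w for d, w in relevant.items()) / total


published = {"accuracy": 0.97, "completeness": 0.82, "timeliness": 0.40}

# The same dataset is judged very differently by two consumers:
archival_app = {"accuracy": 0.5, "completeness": 0.5}   # timeliness ignored
news_dashboard = {"accuracy": 0.2, "timeliness": 0.8}   # freshness matters

print(fitness_for_use(published, archival_app))    # ~0.895
print(fitness_for_use(published, news_dashboard))  # ~0.514
```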

If you have thoughts on data quality as it applies to linked data, consider answering Leigh’s question at SemanticOverflow!

References (as cited by Chapman):

  1. Chapman, A.D. 2005. Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen.
  2. Chrisman, N.R. 1991. The Error Component in Spatial Data. pp. 165-174 in: Maguire, D.J., Goodchild, M.F. and Rhind, D.W. (eds), Geographical Information Systems Vol. 1, Principles. Longman Scientific and Technical.
  3. Strong, D.M., Lee, Y.W. and Wang, R.Y. 1997. Data quality in context. Communications of the ACM 40(5): 103-110.
  4. Dalcin, E.C. 2004. Data Quality Concepts and Techniques Applied to Taxonomic Databases. Thesis for the degree of Doctor of Philosophy, School of Biological Sciences, Faculty of Medicine, Health and Life Sciences, University of Southampton. November 2004. 266 pp.
  5. English, L.P. 1999. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. New York: John Wiley & Sons, Inc. 518 pp.
  6. Redman, T.C. 2001. Data Quality: The Field Guide. Boston, MA: Digital Press.
