This document provides statistics about the structure and content of the crawlable subset of the Linked Open Data (LOD) cloud in April 2014 and analyzes the extent to which crawlable Linked Data sources implement the Linked Data best practices.
This document updates the findings of the original State of the LOD Cloud report published in 2011. The 2011 report was based on information that was provided by the dataset publishers themselves via the datahub.io Linked Data catalog. This report is based on a crawl of the Web of Linked Data conducted in April 2014. This document is a shortened version of the ISWC2014 paper Adoption of the Linked Data Best Practices in Different Topical Domains. The paper provides more details about the crawling process as well as a deeper discussion of the results. In contrast, this document links the statistics to the Mannheim Linked Data catalog and enables the reader to drill down and explore information about the datasets behind each statistical result.
1. The Linked Data Crawl
In order to discover as many Linked Data sources as possible, we crawled a snapshot of the Linked Data Web using the LDSpider Linked Data crawler. We seeded LDSpider with 560 thousand seed URIs originating from the datahub.io dataset catalog, the Billion Triple Challenge 2012 dataset, and datasets advertised on the public-lod@w3.org mailing list. With those seeds, we performed crawls during April 2014 to retrieve entities from every dataset using a breadth-first crawling strategy. Datasets not allowing crawlers were not included in our corpus. Altogether, we crawled 900,129 documents describing 8,038,396 resources. In general, we assume that all data from one pay-level domain (PLD) belongs to one dataset. As an exception to this rule, we partitioned datasets in the datahub.io lod-cloud group where multiple datasets per PLD were defined. Furthermore, we removed datasets that only contained vocabulary definitions. We provide the crawled data for download, so that all results presented in the following can be verified.
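The rule that all data from one PLD belongs to one dataset can be illustrated with a minimal Python sketch. It is not the code used for the crawl: the function names, the example URIs, and the small hand-picked suffix set are assumptions made for illustration, and a faithful implementation would consult the full Public Suffix List to determine PLDs.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Illustrative assumption: a tiny hand-picked set of multi-label public
# suffixes. A faithful implementation would use the full Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "ac.uk", "gov.uk", "org.uk", "com.au"}


def pay_level_domain(uri: str) -> str:
    """Return an approximate pay-level domain (PLD) for a URI."""
    host = (urlsplit(uri).hostname or "").lower()
    labels = host.split(".")
    if len(labels) <= 2:
        return host
    last_two = ".".join(labels[-2:])
    if last_two in MULTI_LABEL_SUFFIXES:
        # Suffixes such as "co.uk" need one more label to form the PLD.
        return ".".join(labels[-3:])
    return last_two


def group_by_pld(document_uris):
    """Group crawled document URIs into candidate datasets, one per PLD."""
    datasets = defaultdict(list)
    for uri in document_uris:
        datasets[pay_level_domain(uri)].append(uri)
    return dict(datasets)


if __name__ == "__main__":
    crawled = [
        "http://data.example.org/resource/1",
        "http://www.example.org/about",
        "http://stats.example.co.uk/dataset/7",
    ]
    for pld, documents in group_by_pld(crawled).items():
        print(pld, len(documents))
```

The exception described above (several datasets defined per PLD in the datahub.io lod-cloud group) is not covered by this sketch and would need an additional mapping from URI prefixes to datasets.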
2. Linked Data by Domain
Linked Data technologies are being used to share data covering a wide range of different topical domains. The table below gives an overview of the topical domains of the 1014 datasets that were discovered by our crawl.
Topic | Datasets | % |
---|---|---|
Government | 183 | 18.05% |
Publications | 96 | 9.47% |
Life sciences | 83 | 8.19% |
User-generated content | 48 | 4.73% |
Cross-domain | 41 | 4.04% |
Media | 22 | 2.17% |
Geographic | 21 | 2.07% |
Social web | 520 | 51.28% |
Total | 1014 |
3. Crawlable LOD Cloud Diagram
The image below gives an overview of the linkage relationships between datasets. Clicking the image will take you to an image map, which allows you to explore the metadata for each dataset in the Mannheim Linked Data Catalog.
4. Best Practices
The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying with a set of best practices in the areas of linking, vocabulary usage, and metadata provision. This section analyzes the extent to which the crawled data sources implement these best practices.
4.1 Interlinking Best Practice
By setting RDF links, data providers connect their datasets into a single global data graph which can be navigated by applications and enables the discovery of additional data by following RDF links.
In total, 56.11% of the crawled datasets link to at least one other dataset. The remaining datasets are only targets of RDF links.
The table below categorizes the datasets by the number of other data sources that are targets of their outgoing RDF links.
Number of linked datasets | Number of datasets |
---|---|
more than 10 | 79 (7.79%) |
6 to 10 | 81 (7.99%) |
5 | 31 (3.06%) |
4 | 42 (4.14%) |
3 | 54 (5.33%) |
2 | 106 (10.45%) |
1 | 176 (17.36%) |
0 | 445 (43.89%) |
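The categorization above can be reproduced from a list of dataset-level links. The following Python sketch shows one way to do so; the function names and the input format (pairs of source and target dataset identifiers) are assumptions for illustration, not the tooling used for this report.

```python
from collections import Counter


def outdegree_buckets(dataset_ids, links):
    """Count datasets per outdegree bucket.

    `links` is an iterable of (source_dataset, target_dataset) pairs.
    Several RDF links to the same target count once, and links that stay
    inside a dataset are ignored.
    """
    targets = {dataset: set() for dataset in dataset_ids}
    for source, target in links:
        if source != target and source in targets:
            targets[source].add(target)

    buckets = Counter()
    for linked in targets.values():
        n = len(linked)
        if n > 10:
            label = "more than 10"
        elif n >= 6:
            label = "6 to 10"
        else:
            label = str(n)
        buckets[label] += 1
    return buckets


def share(count, total):
    """Percentage of `total`, rounded to two decimals as in the table."""
    return round(100.0 * count / total, 2)


if __name__ == "__main__":
    datasets = ["a", "b", "c"]
    links = [("a", "b"), ("a", "c"), ("b", "a"), ("a", "b")]
    print(outdegree_buckets(datasets, links))
    # Share of datasets without outgoing links, using the table's figures.
    print(share(445, 1014))
```

With the figures from the table, 445 of 1014 datasets have no outgoing links, which leaves 569 datasets (56.11%) that link to at least one other dataset.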
The tables below show the ten datasets with the highest in- and outdegrees.
Category | Vocabulary | Usage | Category | Vocabulary | Usage |
---|---|---|---|---|---|
social web | foaf | 86.12% | life sciences | dct | 66.29% |
social web | dct | 40.65% | life sciences | foaf | 41.57% |
social web | wgs84 | 36.99% | life sciences | void | 31.46% |
publications | dct | 81.73% | government | dct | 63.98% |
publications | foaf | 69.23% | government | cube | 60.75% |
publications | bibo | 41.34% | government | odc* | 46.24% |
user-generated content | dct | 81.91% | geographic | dct | 82.93% |
user-generated content | foaf | 74.55% | geographic | foaf | 65.85% |
user-generated content | sioc | 43.63% | geographic | skos | 48.78% |
media | foaf | 75.67% | cross-domain | dct | 72.73% |
media | dct | 54.05% | cross-domain | foaf | 72.73% |
media | mo | 18.91% | cross-domain | skos | 38.63% |
4.2.2 Usage of Dereferencable Vocabularies
For proprietary vocabularies in particular, it is essential that they are dereferencable and linked to other vocabularies, so that agents can interpret their semantics.
To assess whether a vocabulary is dereferencable, we collected the terms for each proprietary vocabulary encountered in our corpus. For every term, we requested its URI via an HTTP GET request. We define the dereferencability quota of a vocabulary as the number of dereferencable terms divided by all terms collected from the vocabulary.
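The quota defined above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the analysis: the function names are hypothetical, and treating an HTTP 200 response as "dereferencable" is an assumption, since the exact success criterion is not stated here. The labels full, partial, and none follow the breakdown reported in this section.

```python
import urllib.request


def http_get_succeeds(term_uri, timeout=10.0):
    """Assumption: a term counts as dereferencable if a GET returns 200."""
    request = urllib.request.Request(
        term_uri,
        headers={"Accept": "application/rdf+xml, text/turtle;q=0.9, */*;q=0.1"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Covers HTTP errors, network failures, timeouts, and malformed URIs.
        return False


def dereferencability_quota(term_uris, is_dereferencable=http_get_succeeds):
    """Dereferencable terms divided by all terms collected for a vocabulary."""
    terms = set(term_uris)
    if not terms:
        raise ValueError("a vocabulary needs at least one collected term")
    hits = sum(1 for term in terms if is_dereferencable(term))
    return hits / len(terms)


def classify(quota):
    """Map a quota to the labels full, partial, or none."""
    if quota == 1.0:
        return "full"
    if quota == 0.0:
        return "none"
    return "partial"


if __name__ == "__main__":
    # Stand-in for real HTTP requests: only terms ending in "a" resolve.
    fake_check = lambda uri: uri.endswith("a")
    terms = ["http://example.org/vocab#a", "http://example.org/vocab#b"]
    quota = dereferencability_quota(terms, fake_check)
    print(quota, classify(quota))
```

Passing the check as a parameter keeps the quota computation separate from network access, so the success criterion can be swapped without touching the arithmetic.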
In total, 19.25% of all proprietary vocabularies are fully dereferencable (i.e., their quota is 1.0). On the other hand, 72.75% of all proprietary vocabularies are not dereferencable at all. The remaining 8.00% of proprietary vocabularies are partially dereferencable, meaning that a definition could be retrieved for some terms but not for all.
The following table shows the dereferencability of proprietary vocabularies by category.
Category | Different Prop. Vocabs Used (% of all Prop. Vocabs) | # of Datasets Using Prop. Vocabs (% of all datasets) | Dereferencability: Full | Dereferencability: Partial | Dereferencability: None |
---|---|---|---|---|---|
social web | 128 (33.86%) | 83 (15.99%) | 16.41% | 6.25% | 77.78% |
government | 48 (12.70%) | 35 (18.82%) | 20.83% | 12.50% | 66.67% |
publications | 58 (15.34%) | 35 (33.65%) | 20.69% | 6.90% | 72.41% |
life sciences | 35 (9.25%) | 26 (29.21%) | 28.57% | 5.71% | 65.71% |
user-gen. cnt. | 30 (7.93%) | 26 (47.27%) | 13.33% | 10.00% | 76.67% |
cross-domain | 55 (14.55%) | 16 (36.36%) | 27.27% | 10.91% | 61.82% |
media | 22 (5.82%) | 21 (56.76%) | 0.00% | 9.09% | 90.91% |
geographic | 24 (6.34%) | 16 (39.02%) | 20.83% | 4.17% | 75.00% |
Total | 378 (58.24%) | 241 (23.17%) | 19.25% | 8.00% | 72.75% |
4.3 Adoption of Metadata Best Practices
Metadata helps make datasets self-descriptive. Best practices for providing metadata as Linked Data include provenance and licensing information, dataset-level metadata, and information about additional access methods.
4.3.1 Providing Provenance Information
Our analysis is based on a set of 26 vocabularies that we identified as usable for providing provenance information. The set was assembled from information provided by the W3C working group on provenance, the LOV vocabulary catalog, and our own experience. In each dataset, we searched for triples that use one of these vocabularies and have a document's URI as the subject.
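The search described above can be sketched as a filter over triples. The Python snippet below is a hypothetical illustration: it lists only five well-known namespaces rather than the full set of 26 vocabularies (which is not enumerated here), and it assumes triples are available as plain (subject, predicate, object) string tuples.

```python
# Illustrative subset only; the analysis used a set of 26 vocabularies.
PROVENANCE_NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "admin": "http://webns.net/mvcb/",
    "prv": "http://purl.org/net/provenance/ns#",
    "prov": "http://www.w3.org/ns/prov#",
}


def provenance_vocabularies_used(triples, document_uris):
    """Return prefixes of provenance vocabularies used to describe documents.

    Only triples whose subject is the URI of a crawled document count.
    """
    documents = set(document_uris)
    used = set()
    for subject, predicate, _obj in triples:
        if subject not in documents:
            continue
        for prefix, namespace in PROVENANCE_NAMESPACES.items():
            if predicate.startswith(namespace):
                used.add(prefix)
    return used


if __name__ == "__main__":
    doc = "http://example.org/data/alice.rdf"
    triples = [
        (doc, "http://purl.org/dc/terms/creator", "http://example.org/bob"),
        # Subject is a resource, not the document, so this is ignored.
        ("http://example.org/alice",
         "http://www.w3.org/ns/prov#wasDerivedFrom",
         "http://example.org/source"),
    ]
    print(provenance_vocabularies_used(triples, [doc]))
```

A dataset would then count as using some provenance vocabulary if the returned set is non-empty for at least one of its documents.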
In summary, 35.77% of all datasets use some provenance vocabulary. Looking at individual vocabularies, 28.37% of all datasets use DC or DCTerms, 10.77% use MetaVocab (prefix admin), and 0.77% use prv or prov.
The following table shows an overview of provenance vocabulary use for different topical domains.
Category | Any provenance vocabulary | Using Dublin Core | Using admin | Using prv or prov |
---|---|---|---|---|
social web | 169 (32.56%) | 56.21% | 58.58% | 1.18% |
government | 77 (41.40%) | 100.00% | 0.00% | 1.30% |
publications | 39 (37.50%) | 94.87% | 5.13% | 2.56% |
life sciences | 21 (23.60%) | 100.00% | 0.00% | 2.56% |
user-gen. content | 11 (20.00%) | 90.91% | 54.55% | 0.00% |
cross-domain | 8 (18.18%) | 100.00% | 12.50% | 0.00% |
media | 5 (13.51%) | 100.00% | 0.00% | 0.00% |
geographic | 4 (9.76%) | 100.00% | 0.00% | 25.00% |
Total | 372 (35.77%) | 28.37% | 10.77% | 0.77% |
4.3.2 Providing Licensing Information
With the help of licensing information, agents can assess whether they may use the data for the purpose at hand.
To evaluate whether a dataset provides license information, we again searched for triples which have the document as their subject, this time with a predicate containing the string "licen". To this list of predicates, we added dc:rights and dct:rights as well as the waiver vocabulary, which leads to a total of 47 terms.
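The predicate selection described above can be sketched as follows. This Python snippet is a hedged illustration, not the original tooling: the function names are hypothetical, the case-insensitive match on "licen" is an assumption, and the waiver vocabulary is matched by its namespace rather than by an explicit term list.

```python
RIGHTS_PREDICATES = {
    "http://purl.org/dc/elements/1.1/rights",  # dc:rights
    "http://purl.org/dc/terms/rights",         # dct:rights
}
WAIVER_NAMESPACE = "http://vocab.org/waiver/terms/"


def is_license_predicate(predicate):
    """True if a predicate URI is treated as carrying license information."""
    return (
        "licen" in predicate.lower()  # matches both "license" and "licence"
        or predicate in RIGHTS_PREDICATES
        or predicate.startswith(WAIVER_NAMESPACE)
    )


def license_statements(triples, document_uri):
    """Select license-related triples that have the document as subject."""
    return [
        (s, p, o)
        for s, p, o in triples
        if s == document_uri and is_license_predicate(p)
    ]


if __name__ == "__main__":
    doc = "http://example.org/data/alice.rdf"
    triples = [
        (doc, "http://creativecommons.org/ns#license",
         "http://creativecommons.org/licenses/by/3.0/"),
        (doc, "http://purl.org/dc/terms/creator", "http://example.org/bob"),
    ]
    print(license_statements(triples, doc))
```

Matching on the substring "licen" catches predicates from vocabularies that were not known in advance, at the cost of possibly including predicates that only mention the word without stating a license.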
In total, 7.85% of all datasets provide licensing information in RDF. The most important predicates for indicating the license are dc/dct:license (7.98%), cc:license (2.02%) and dc/dct:rights (1.63%).
Category | Licensing Information |
---|---|
social web | 5.20% |
government | 29.57% |
publications | 3.85% |
life sciences | 3.37% |
user-gen. content | 10.91% |
cross-domain | 11.36% |
media | 5.41% |
geographic | 0.00% |
Total | 7.85% |
4.3.3 Providing Dataset Level Metadata
Dataset level metadata is provided by using the VoID vocabulary, either as inline statements in the dataset or in a separate VoID file.
In the latter case, that file has to be linked from the data via a backlink, or provided at a well-known location as defined by RFC 5785, which is created by appending /.well-known/void to the host part of the URI. As the latter condition is often too strict for data providers, who frequently lack root-level access to their servers, we relax the search for VoID files at well-known locations by appending /.well-known/void to any portion of the URI.
In total, 140 (13.46%) of all datasets use the VoID vocabulary, of which 48 (4.62%) use a backlinking mechanism; 34 of these link to a retrievable VoID file.
Category | Total | Link | Well-known | Inline |
---|---|---|---|---|
social web | 6 (1.16%) | 0.58% | 0.19% | 0.58% |
government | 75 (40.32%) | 6.99% | 3.23% | 31.18% |
publications | 14 (13.46%) | 6.73% | 2.88% | 5.77% |
life sciences | 29 (32.58%) | 19.10% | 4.49% | 12.36% |
user-gen. content | 6 (10.91%) | 5.45% | 0.00% | 5.45% |
cross-domain | 5 (11.36%) | 9.09% | 2.27% | 2.27% |
media | 2 (5.41%) | 2.70% | 0.00% | 2.70% |
geographic | 15 (36.59%) | 14.63% | 12.20% | 12.20% |
Total | 140 (13.46%) | 4.62% | 1.44% | 8.27% |
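The strict and the relaxed lookup of VoID files at well-known locations, as described in this section, can be sketched in Python as follows. The function names are hypothetical, and reading "any portion of the URI" as "every path prefix, including the full path" is an assumption of this sketch.

```python
from urllib.parse import urlsplit, urlunsplit

WELL_KNOWN_VOID = "/.well-known/void"


def strict_void_location(uri):
    """Well-known location directly under the host part of the URI."""
    parts = urlsplit(uri)
    return urlunsplit((parts.scheme, parts.netloc, WELL_KNOWN_VOID, "", ""))


def relaxed_void_locations(uri):
    """Candidate locations obtained by appending to every path prefix.

    The first candidate equals the strict location; later candidates go
    one path segment deeper each.
    """
    parts = urlsplit(uri)
    segments = [segment for segment in parts.path.split("/") if segment]
    candidates = []
    for depth in range(len(segments) + 1):
        prefix = "/" + "/".join(segments[:depth]) if depth else ""
        candidates.append(
            urlunsplit((parts.scheme, parts.netloc, prefix + WELL_KNOWN_VOID, "", ""))
        )
    return candidates


if __name__ == "__main__":
    for candidate in relaxed_void_locations("http://example.org/data/people/alice"):
        print(candidate)
```

A crawler following this sketch would request each candidate in turn and stop at the first one that returns a VoID description.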
4.3.4 Providing Alternative Access Methods
When looking at the availability of alternative access methods, we restricted ourselves to those access methods which are stated in the dataset-level metadata, namely VoID files and triples with VoID statements.
In total, we found alternative access methods for 48 (5.89%) of all datasets. SPARQL endpoints are denoted by 4.54% of all datasets, while dumps are denoted by 3.80%. The following table gives an overview of SPARQL endpoints and dumps by category.
Category | Any | SPARQL | Dump |
---|---|---|---|
social web | 6 (1.16%) | 1.16% | 0.39% |
government | 61 (32.80%) | 30.11% | 30.65% |
publications | 10 (10.58%) | 9.62% | 3.85% |
life sciences | 19 (21.35%) | 20.22% | 16.85% |
user-gen. content | 3 (5.45%) | 5.45% | 1.82% |
cross-domain | 4 (9.09%) | 4.55% | 6.82% |
media | 1 (2.70%) | 0.00% | 2.70% |
geographic | 8 (19.51%) | 12.20% | 12.20% |
Total | 48 (5.89%) | 4.54% | 3.80% |
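Access methods stated in VoID metadata can be extracted with a simple filter. The Python sketch below assumes that the relevant statements use the VoID properties void:sparqlEndpoint and void:dataDump, which is an assumption because the exact properties checked are not named above; the function name and the tuple-based triple format are likewise hypothetical.

```python
VOID = "http://rdfs.org/ns/void#"
SPARQL_ENDPOINT = VOID + "sparqlEndpoint"
DATA_DUMP = VOID + "dataDump"


def access_methods(triples):
    """Collect SPARQL endpoints and dumps announced via VoID statements."""
    methods = {"sparql": set(), "dump": set()}
    for _subject, predicate, obj in triples:
        if predicate == SPARQL_ENDPOINT:
            methods["sparql"].add(obj)
        elif predicate == DATA_DUMP:
            methods["dump"].add(obj)
    return methods


if __name__ == "__main__":
    dataset = "http://example.org/void#dataset"
    triples = [
        (dataset, SPARQL_ENDPOINT, "http://example.org/sparql"),
        (dataset, DATA_DUMP, "http://example.org/dumps/all.nt.gz"),
        (dataset, "http://purl.org/dc/terms/title", "Example dataset"),
    ]
    found = access_methods(triples)
    print(sorted(found["sparql"]), sorted(found["dump"]))
```

A dataset would fall into the "Any" column if either set is non-empty.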
5. Downloads
The crawl dumps and other files that form the basis of this analysis can be downloaded here.
6. Feedback
For feedback, please contact Max Schmachtenberg or Chris Bizer.
7. References
- Auer, S., Demter, J., Martin, M., Lehmann, J.: LODStats - an extensible framework for high-performance dataset analytics. In: EKAW, pp. 353–362 (2012)
- Bizer, C., Eckert, K., Meusel, R., Mühleisen, H., Schuhmacher, M., Völker, J.: Deployment of RDFa, Microdata, and Microformats on the Web - a quantitative analysis. In: The Semantic Web–ISWC 2013, pp. 17–32. Springer Berlin Heidelberg (2013)
- Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - the story so far. International Journal on Semantic Web and Information Systems 5(3), 1–22 (2009)
- Heath, T., Bizer, C.: Linked Data: Evolving the Web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology 1(1), 1–136 (2011)
- Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S.: An empirical survey of Linked Data conformance. J. Web Sem. 14, 14–44 (2012)
- Isele, R., Umbrich, J., Bizer, C., Harth, A.: LDSpider: An open-source crawling framework for the Web of Linked Data. In: Proceedings of the ISWC 2010 Posters and Demonstrations Track (2010)
- Jentzsch, A., Cyganiak, R., Bizer, C.: State of the LOD Cloud (September 2011)
- Paulheim, H., Hertling, S.: Discoverability of SPARQL endpoints in Linked Open Data. In: Proceedings of the Posters and Demos Track of ISWC 2013 (2013)
- Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014) - RDB Track (2014)
8. Credits
The work was supported by the EU research project PlanetData.
