Version 0.4, 08/30/2014
Max Schmachtenberg
Christian Bizer
Heiko Paulheim

This document provides statistics about the structure and content of the crawlable subset of the Linked Open Data (LOD) cloud in April 2014 and analyzes to which extent crawlable Linked Data sources implement the Linked Data best practices.

This document updates the findings of the original State of the LOD Cloud report published in 2011. The 2011 report was based on information that was provided by the dataset publishers themselves via the Linked Data catalog. This report is based on a crawl of the Web of Linked Data conducted in April 2014. This document is a shortened version of the ISWC 2014 paper Adoption of the Linked Data Best Practices in Different Topical Domains. The paper provides more details about the crawling process as well as a deeper discussion of the results. In contrast, this document links the statistics to the Mannheim Linked Data catalog and enables the reader to drill down and explore information about the datasets behind each statistical result.


1. The Linked Data Crawl

In order to discover as many Linked Data sources as possible, we have crawled a snapshot of the Linked Data Web. We used the LDSpider linked data crawler. We seeded LDSpider with 560 thousand seed URIs originating from the dataset catalog, the Billion Triple Challenge 2012 dataset as well as from datasets being advertised on the mailing list. With those seeds, we performed crawls during April 2014 to retrieve entities from every dataset using a breadth-first crawling strategy. Datasets not allowing crawlers were not included in our corpus. Altogether, we crawled 900,129 documents describing 8,038,396 resources. In general, we assume that all data from one PLD belongs to one dataset. As an exception to this rule, we partitioned datasets in the lod-cloud group where multiple datasets per PLD were defined. Furthermore, we removed datasets that only contained vocabulary definitions. We provide the crawled data for download, so that all results presented in the following can be verified.
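The rule that all data from one PLD (pay-level domain) belongs to one dataset can be illustrated with a short sketch. This is not the tooling used for the crawl: the helper names naive_pld and group_by_pld are hypothetical, and the PLD is only approximated by the last two labels of the host name. That approximation is wrong for multi-label suffixes such as co.uk; a real implementation would consult the Public Suffix List.

```python
from collections import defaultdict
from urllib.parse import urlparse


def naive_pld(uri: str) -> str:
    """Approximate the pay-level domain by the last two host labels.

    Assumption for illustration only: multi-label public suffixes
    (for example "co.uk") are not handled.
    """
    host = (urlparse(uri).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def group_by_pld(document_uris):
    """Map each approximate PLD to the list of document URIs it hosts."""
    groups = defaultdict(list)
    for uri in document_uris:
        groups[naive_pld(uri)].append(uri)
    return dict(groups)
```

Under this sketch, documents from a.example.org and b.example.org would end up in the same dataset, which mirrors the default assumption stated above; the exception for the lod-cloud group would need an explicit override table.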

2. Linked Data by Domain

Linked Data technologies are being used to share data covering a wide range of different topical domains. The table below gives an overview of the topical domains of the 1014 datasets that were discovered by our crawl.

Datasets by topical domain.
Topic Datasets %
Government 183 18.05%
Publications 96 9.47%
Life sciences 83 8.19%
User-generated content 48 4.73%
Cross-domain 41 4.04%
Media 22 2.17%
Geographic 21 2.07%
Social web 520 51.28%
Total 1014


3. Crawlable LOD Cloud Diagram

The image below gives an overview of the linkage relationships between datasets. Clicking the image will take you to an image map, which allows you to explore the metadata for each dataset in the Mannheim Linked Data Catalog.

Crawlable LOD Cloud in April 2014

4. Best Practices

The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying with a set of best practices in the areas of linking, vocabulary usage, and metadata provision. This section analyzes to what extent the crawled data sources implement these best practices.

4.1 Interlinking Best Practice

By setting RDF links, data providers connect their datasets into a single global data graph which can be navigated by applications and enables the discovery of additional data by following RDF links.

In total, 56.11% of the crawled datasets link to at least one other dataset. The remaining datasets are only targets of RDF links.

The table below categorizes the datasets by the number of other data sources that are targets of their outgoing RDF links.

Categorization by number of linked datasets
Number of linked datasets Number of datasets
more than 10 79 (7.79%)
6 to 10 81 (7.99%)
5 31 (3.06%)
4 42 (4.14%)
3 54 (5.33%)
2 106 (10.45%)
1 176 (17.36%)
0 445 (43.89%)
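The categorization above can be reproduced from a list of dataset-level links. The following is a minimal sketch under stated assumptions: links are given as (source dataset, target dataset) pairs, self-links are ignored, and the function names are hypothetical rather than taken from the analysis code.

```python
from collections import Counter


def outdegrees(datasets, links):
    """Count, per dataset, the distinct other datasets it links to."""
    targets = {d: set() for d in datasets}
    for source, target in links:
        # Ignore self-links and links from unknown datasets.
        if source != target and source in targets:
            targets[source].add(target)
    return {d: len(t) for d, t in targets.items()}


def bucket(outdegree: int) -> str:
    """Map an outdegree to the row labels used in the table."""
    if outdegree > 10:
        return "more than 10"
    if outdegree >= 6:
        return "6 to 10"
    return str(outdegree)


def categorize(datasets, links):
    """Return a Counter from row label to number of datasets."""
    return Counter(bucket(n) for n in outdegrees(datasets, links).values())
```

Datasets that fall into the "0" bucket correspond to the 43.89% that set no outgoing links; the remaining buckets add up to the 56.11% reported above.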

The tables below show the ten datasets with the highest in- and outdegrees.

Datasets with the ten highest indegrees
Dataset Category Indegree
cross-domain 207
geographic 141
cross-domain 117
social web 64
social web 63
social web 56
social web 55
government 45
publications 44
social web 41
cross-domain 37
Datasets with the ten highest outdegrees
Dataset Category Outdegree
publications 91
user-generated content 88
social web 71
social web 68
social web 67
user-generated content 64
social web 60
social web 59
social web 47
cross-domain 45
publications 45

The table below lists the most frequently used linking predicates for each topical domain.

Three most used predicates for interlinking by category.
Category Predicate Usage Category Predicate Usage
social web foaf:knows 60.27% life sciences owl:sameAs 52.17%
foaf:based_near 35.69% rdfs:seeAlso 48.48%
sioc:follows 34.34% dct:creator 21.74%
publications owl:sameAs 32.20% government dct:publisher 47.57%
dct:language 25.42% dct:spatial 30.10%
rdfs:seeAlso 23.73% owl:sameAs 24.27%
user-generated content owl:sameAs 53.13% geographic owl:sameAs 64.29%
rdfs:seeAlso 21.88% skos:exactMatch 21.43%
dct:source 18.75% skos:closeMatch 21.43%
media owl:sameAs 81.25% cross-domain owl:sameAs 80.00%
rdfs:seeAlso 18.75% rdfs:seeAlso 52.00%
foaf:based_near 18.75% dct:creator 20.00%

4.2 Vocabulary Best Practices

In order to make it easier for applications to understand Linked Data, data providers should use terms from widely deployed vocabularies to represent data wherever possible.

4.2.1 Usage of Proprietary Vocabularies

We define a vocabulary as non-proprietary if there are at least two datasets using the vocabulary.

Of all 649 vocabularies encountered, 378 (58.24%) are proprietary according to our definition, while 271 (41.76%) are non-proprietary.

In total, 241 (23.17%) datasets use proprietary vocabularies, while nearly all (99.87%) datasets use non-proprietary vocabularies.
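The definition of proprietary and non-proprietary vocabularies can be expressed as a short sketch. It assumes that vocabulary usage is available as a mapping from dataset name to the set of vocabulary namespaces the dataset uses; the function name and the data layout are hypothetical.

```python
from collections import defaultdict


def split_vocabularies(usage):
    """Split vocabularies into (proprietary, non_proprietary) sets.

    `usage` maps a dataset name to the set of vocabularies it uses.
    A vocabulary used by at least two datasets is non-proprietary.
    """
    users = defaultdict(set)
    for dataset, vocabularies in usage.items():
        for vocabulary in vocabularies:
            users[vocabulary].add(dataset)
    non_proprietary = {v for v, d in users.items() if len(d) >= 2}
    proprietary = set(users) - non_proprietary
    return proprietary, non_proprietary
```

Note that under this definition a vocabulary's status depends on the corpus: a vocabulary used by a single crawled dataset counts as proprietary even if it is used elsewhere on the Web.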

The table below lists the non-proprietary vocabularies that are used by at least 1% of the datasets and provides links to the data sources that use a specific vocabulary.

Vocabulary Prefix Vocabulary Link Number of Datasets Datasets that use the Vocabulary
rdf (98.22%) Datasets that use rdf
rdfs (72.58%) Datasets that use rdfs
foaf (69.13%) Datasets that use foaf
dcterm (56.02%) Datasets that use dcterm
owl (36.49%) Datasets that use owl
pos (25.05%) Datasets that use pos
sioc (17.65%) Datasets that use sioc
admin (15.48%) Datasets that use admin
skos (14.10%) Datasets that use skos
void (13.51%) Datasets that use void
bio (12.33%) Datasets that use bio
cube (11.24%) Datasets that use cube
rss (9.76%) Datasets that use rss
w3con (7.59%) Datasets that use w3con
doap (6.41%) Datasets that use doap
bibo (6.11%) Datasets that use bibo
dcat (5.82%) Datasets that use dcat
cert (5.03%) Datasets that use cert
sdmxd (4.73%) Datasets that use sdmxd
airport (4.44%) Datasets that use airport
wot (4.34%) Datasets that use wot
content (4.24%) Datasets that use content
cc (3.85%) Datasets that use cc
ref (3.55%) Datasets that use ref
wn (3.25%) Datasets that use wn
tsioc (3.25%) Datasets that use tsioc
vcard2006 (2.86%) Datasets that use vcard2006
sdmxa (2.86%) Datasets that use sdmxa
gn (2.66%) Datasets that use gn
swc (2.66%) Datasets that use swc
dctypes (2.56%) Datasets that use dctypes
hartigprov (2.56%) Datasets that use hartigprov
sd (2.47%) Datasets that use sd
open (2.17%) Datasets that use open
prov (2.07%) Datasets that use prov
resource (1.97%) Datasets that use resource
rda (1.87%) Datasets that use rda
prvt (1.78%) Datasets that use prvt
c4dm (1.78%) Datasets that use c4dm
gr (1.68%) Datasets that use gr
rsa (1.68%) Datasets that use rsa
aiiso (1.68%) Datasets that use aiiso
pingback (1.58%) Datasets that use pingback
time (1.38%) Datasets that use time
org (1.38%) Datasets that use org
wdrs (1.28%) Datasets that use wdrs
vs (1.18%) Datasets that use vs
vann (1.18%) Datasets that use vann
icaltzd (1.08%) Datasets that use icaltzd
frbrcore (1.08%) Datasets that use frbrcore
xhv (1.08%) Datasets that use xhv
lcy (0.99%) Datasets that use lcy
rdfg (0.99%) Datasets that use rdfg
mo (0.89%) Datasets that use mo
cal (0.89%) Datasets that use cal
sdmx (0.89%) Datasets that use sdmx
skosxl (0.79%) Datasets that use skosxl
visit (0.79%) Datasets that use visit
timeline (0.79%) Datasets that use timeline
coun (0.69%) Datasets that use coun
wn20schema (0.59%) Datasets that use wn20schema
spatial (0.59%) Datasets that use spatial
dcam (0.49%) Datasets that use dcam
adms (0.49%) Datasets that use adms
voaf (0.49%) Datasets that use voaf
xkos (0.49%) Datasets that use xkos
rev (0.49%) Datasets that use rev
api (0.39%) Datasets that use api
rdarel (0.30%) Datasets that use rdarel
geom (0.30%) Datasets that use geom
oo (0.30%) Datasets that use oo
log (0.30%) Datasets that use log
wordnet (0.30%) Datasets that use wordnet
formats (0.30%) Datasets that use formats
exif (0.30%) Datasets that use exif
wlo (0.30%) Datasets that use wlo
gold (0.30%) Datasets that use gold
xtypes (0.30%) Datasets that use xtypes
doc (0.30%) Datasets that use doc
book (0.20%) Datasets that use book
po (0.20%) Datasets that use po
rdag1 (0.20%) Datasets that use rdag1
taxo (0.20%) Datasets that use taxo
label (0.20%) Datasets that use label
wv (0.20%) Datasets that use wv
daml (0.20%) Datasets that use daml
ctorg (0.20%) Datasets that use ctorg
prog (0.20%) Datasets that use prog
cs (0.20%) Datasets that use cs
opmv (0.20%) Datasets that use opmv
coin (0.20%) Datasets that use coin
admssw (0.20%) Datasets that use admssw
library (0.10%) Datasets that use library
fresnel (0.10%) Datasets that use fresnel
scv (0.10%) Datasets that use scv
re (0.10%) Datasets that use re
cidoccrm (0.10%) Datasets that use cidoccrm
grddl (0.10%) Datasets that use grddl
lyou (0.10%) Datasets that use lyou
te (0.10%) Datasets that use te
gadm (0.10%) Datasets that use gadm
being (0.10%) Datasets that use being
ann (0.10%) Datasets that use ann
bookmark (0.10%) Datasets that use bookmark
rad (0.10%) Datasets that use rad
link (0.10%) Datasets that use link
oa (0.10%) Datasets that use oa
asn (0.10%) Datasets that use asn
swid (0.10%) Datasets that use swid
radion (0.10%) Datasets that use radion
gbv (0.10%) Datasets that use gbv
ssn (0.10%) Datasets that use ssn
wdr (0.10%) Datasets that use wdr
gso (0.10%) Datasets that use gso
amalgame (0.10%) Datasets that use amalgame
emp (0.10%) Datasets that use emp
conversion (0.10%) Datasets that use conversion
acl (0.10%) Datasets that use acl
psych (0.10%) Datasets that use psych
places (0.10%) Datasets that use places
hcard (0.10%) Datasets that use hcard
cito (0.10%) Datasets that use cito
rov (0.10%) Datasets that use rov
identity (0.10%) Datasets that use identity
flow (0.10%) Datasets that use flow
b2bo (0.10%) Datasets that use b2bo
swrl (0.10%) Datasets that use swrl
transit (0.10%) Datasets that use transit
rdafrbr (0.10%) Datasets that use rdafrbr
fowl (0.10%) Datasets that use fowl
cv (0.10%) Datasets that use cv
sim (0.10%) Datasets that use sim
wordmap (0.10%) Datasets that use wordmap
frbre (0.10%) Datasets that use frbre

The following table displays the three most used vocabularies for each topical domain, excluding the ubiquitous vocabularies rdf, rdfs and owl. The prefix odc denotes the vocabulary from

Category Vocabulary Usage Category Vocabulary Usage
social web foaf 86.12% life sciences dct 66.29%
dct 40.65% foaf 41.57%
wgs84 36.99% void 31.46%
publications dct 81.73% government dct 63.98%
foaf 69.23% cube 60.75%
bibo 41.34% odc* 46.24%
user-generated content dct 81.91% geographic dct 82.93%
foaf 74.55% foaf 65.85%
sioc 43.63% skos 48.78%
media foaf 75.67% cross-domain dct 72.73%
dct 54.05% foaf 72.73%
mo 18.91% skos 38.63%

4.2.2 Usage of Dereferencable Vocabularies

In particular for proprietary vocabularies, it is essential that they are dereferencable and linked to other vocabularies, so that agents can interpret their semantics.

To assess whether a vocabulary is dereferencable, we collected the terms for each proprietary vocabulary encountered in our corpus. For every term, we requested its URI via an HTTP GET request. We define the dereferencability quota of a vocabulary as the number of dereferencable terms divided by all terms collected from the vocabulary.

In total, 19.25% of all proprietary vocabularies are fully dereferencable (i.e., their quota is 1.0). On the other hand, 72.75% of all proprietary vocabularies are not dereferencable at all. The remaining vocabularies, which are 8.00% of all proprietary ones, are partially dereferencable, meaning that for some terms, but not for all, a definition could be retrieved.
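The dereferencability quota and the three resulting classes can be sketched as follows. The check that decides whether a single term is dereferencable is passed in as a function, because the exact success criterion for the HTTP GET request (status codes, accepted content types) is not specified here; both function names are hypothetical.

```python
from typing import Callable, Iterable


def dereferencability_quota(
    terms: Iterable[str],
    is_dereferencable: Callable[[str], bool],
) -> float:
    """Number of dereferencable terms divided by all collected terms."""
    terms = list(terms)
    if not terms:
        raise ValueError("vocabulary has no collected terms")
    hits = sum(1 for term in terms if is_dereferencable(term))
    return hits / len(terms)


def classify(quota: float) -> str:
    """Map a quota to the three classes used in the text."""
    if quota == 1.0:
        return "fully dereferencable"
    if quota == 0.0:
        return "not dereferencable"
    return "partially dereferencable"
```

In practice, is_dereferencable would issue the HTTP GET request for the term URI; injecting it keeps the quota computation testable without network access.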

The dereferencability of the proprietary vocabularies used in each category is shown in the following table.

Usage and Dereferencability of Proprietary Vocabularies per Category
Category Different Prop. Vocabs Used (% of all Prop. Vocabs) # of Datasets Using Prop. Vocabs (% of all datasets) Fully Dereferencable Partially Dereferencable Not Dereferencable
social web 128 (33.86%) 83 (15.99%) 16.41% 6.25% 77.78%
government 48 (12.70%) 35 (18.82%) 20.83% 12.50% 66.67%
publications 58 (15.34%) 35 (33.65%) 20.69% 6.90% 72.41%
life sciences 35 (9.25%) 26 (29.21%) 28.57% 5.71% 65.71%
user-gen. content 30 (7.93%) 26 (47.27%) 13.33% 10.00% 76.67%
cross-domain 55 (14.55%) 16 (36.36%) 27.27% 10.91% 61.82%
media 22 (5.82%) 21 (56.76%) 0.00% 9.09% 90.91%
geographic 24 (6.34%) 16 (39.02%) 20.83% 4.17% 75.00%
Total 378 (58.24%) 241 (23.17%) 19.25% 8.00% 72.75%

4.3 Adoption of Metadata Best Practices

Metadata helps make datasets self-descriptive. Best practices for providing metadata as Linked Data include provenance and licensing information, dataset-level metadata, and information about additional access methods.

4.3.1 Providing Provenance Information

Our analysis is based on a set of 26 vocabularies we identified to be usable for providing provenance information. It was assembled from information provided by the W3C working group on provenance, the LOV vocabulary catalog, as well as our own experience. Using those vocabularies, we searched each dataset for triples that use one of the vocabularies and have a document's URI as the subject.

In summary, 35.77% of all datasets use some provenance vocabulary. Looking at individual vocabularies, 28.37% of all datasets use DC or DCTerms, 10.77% use MetaVocab, and 0.77% use prv or prov.

The following table shows an overview of provenance vocabulary use for different topical domains.

Datasets Providing Provenance Information By Category Including the Vocabulary Used
Category Any provenance vocabulary Using Dublin Core Using admin Using prv or prov
social web 169 (32.56%) 56.21% 58.58% 1.18%
government 77 (41.40%) 100.00% 0.00% 1.30%
publications 39 (37.50%) 94.87% 5.13% 2.56%
life sciences 21 (23.60%) 100.00% 0.00% 2.56%
user-gen. content 11 (20.00%) 90.91% 54.55% 0.00%
cross-domain 8 (18.18%) 100.00% 12.50% 0.00%
media 5 (13.51%) 100.00% 0.00% 0.00%
geographic 4 (9.76%) 100.00% 0.00% 25.00%
Total 372 (35.77%) 28.37% 10.77% 0.77%

4.3.2 Providing Licensing Information

With the help of licensing information, agents can assess whether they may use the data for the purpose at hand.

To evaluate whether a dataset provides license information, we again searched for triples which have the document as their subject and a predicate containing the string "licen". To this list, we added all predicates containing the string dc:/dct:rights and the waiver vocabulary, which leads to a total of 47 terms.
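The predicate filter described above can be sketched as follows. Several details are assumptions made for illustration: triples are plain (subject, predicate, object) tuples of URI strings, the "licen" match is case-insensitive, the dc:/dct:rights match is a prefix match on the two Dublin Core URIs, and the waiver vocabulary is assumed to live under the namespace http://vocab.org/waiver/terms/. The function names are hypothetical.

```python
# Dublin Core "rights" predicates (elements and terms namespaces).
DC_RIGHTS = (
    "http://purl.org/dc/elements/1.1/rights",
    "http://purl.org/dc/terms/rights",
)
# Assumed namespace of the waiver vocabulary.
WAIVER_NS = "http://vocab.org/waiver/terms/"


def is_license_predicate(predicate: str) -> bool:
    """True if the predicate URI looks like a licensing predicate."""
    p = predicate.lower()
    return "licen" in p or p.startswith(DC_RIGHTS) or p.startswith(WAIVER_NS)


def license_predicates(triples, document_uri):
    """Licensing predicates used in triples whose subject is the document."""
    return sorted({
        p for s, p, o in triples
        if s == document_uri and is_license_predicate(p)
    })
```

Because the rights match is a prefix match, a predicate such as dct:rightsHolder is also accepted by this sketch, which is in line with "containing the string" in the description but may be broader than intended.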

In total, 7.85% of all datasets provide licensing information in RDF. The most important predicates for indicating the license are dc/dct:license (7.98%), cc:license (2.02%) and dc/dct:rights (1.63%).

Category Licensing Information
social web 5.20%
government 29.57%
publications 3.85%
life sciences 3.37%
user-gen. content 10.91%
cross-domain 11.36%
media 5.41%
geographic 0.00%
Total 7.85%

4.3.3 Providing Dataset Level Metadata

Dataset level metadata is provided by using the VoID vocabulary, either as inline statements in the dataset or in a separate VoID file.

In the latter case, that file has to be linked from the data via a backlink, or provided at a well-known location, as defined by RFC5785, which is created by appending /.well-known/void to the host part of the URI. As the latter condition is often too strict for data providers due to missing root-level access to the servers, we relax the search for VoID files at well-known locations, appending /.well-known/void to any portion of the URI.
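The relaxed lookup can be illustrated by generating the candidate locations for a given URI. This is one possible reading of "any portion of the URI": the suffix is appended to the host part first and then to every path prefix, including the full path. The function name is hypothetical, and query strings and fragments are ignored.

```python
from urllib.parse import urlsplit

WELL_KNOWN_SUFFIX = "/.well-known/void"


def void_candidates(uri: str):
    """Candidate VoID locations: host level first, then each path prefix."""
    parts = urlsplit(uri)
    base = f"{parts.scheme}://{parts.netloc}"
    candidates = [base + WELL_KNOWN_SUFFIX]
    prefix = ""
    # Walk the non-empty path segments and extend the prefix step by step.
    for segment in (s for s in parts.path.split("/") if s):
        prefix += "/" + segment
        candidates.append(base + prefix + WELL_KNOWN_SUFFIX)
    return candidates
```

Only the first candidate is a well-known location in the strict sense; the others are the relaxation, which a crawler would request in order until one of them returns a VoID description.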

In total, 140 (13.46%) of all datasets use the VoID vocabulary of which 48 (4.62%) use a backlinking mechanism, 34 of which link to a retrievable VoID file.

Category Total Link Well-known Inline
social web 6 (1.16%) 0.58% 0.19% 0.58%
government 75 (40.32%) 6.99% 3.23% 31.18%
publications 14 (13.46%) 6.73% 2.88% 5.77%
life sciences 29 (32.58%) 19.10% 4.49% 12.36%
user-gen. content 6 (10.91%) 5.45% 0.00% 5.45%
cross-domain 5 (11.36%) 9.09% 2.27% 2.27%
media 2 (5.41%) 2.70% 0.00% 2.70%
geographic 15 (36.59%) 14.63% 12.20% 12.20%
Total 140 (13.46%) 4.62% 1.44% 8.27%

4.3.4 Providing Alternative Access Methods

When looking at the availability of alternative access methods, we restricted ourselves to those access methods which are stated in the dataset-level metadata, namely VoID files and triples with VoID statements.

In total, we found alternative access methods for 48 (5.89%) of all datasets. SPARQL endpoints are denoted by 4.54% of all datasets, while dumps are denoted by 3.80%. The following table gives an overview of SPARQL endpoints and dumps by category.

Category Any SPARQL Dump
social web 6 (1.16%) 1.16% 0.39%
government 61 (32.80%) 30.11% 30.65%
publications 10 (10.58%) 9.62% 3.85%
life sciences 19 (21.35%) 20.22% 16.85%
user-gen. content 3 (5.45%) 5.45% 1.82%
cross-domain 4 (9.09%) 4.55% 6.82%
media 1 (2.70%) 0.00% 2.70%
geographic 8 (19.51%) 12.20% 12.20%
Total 48 (5.89%) 4.54% 3.80%


The crawl dumps and other files which are the basis of this analysis can be downloaded here.

5. Feedback

For feedback, please contact Max Schmachtenberg or Chris Bizer

6. References

7. Credits

The work was supported by the EU research project PlanetData.
