State of the LOD Cloud

This document provides statistics about the structure and content of the crawlable subset of the Linked Open Data (LOD) cloud as of April 2014 and analyzes the extent to which crawlable Linked Data sources implement the Linked Data best practices.

This document updates the findings of the original State of the LOD Cloud report published in 2011. The 2011 report was based on information that the dataset publishers themselves provided via the datahub.io Linked Data catalog. This report is instead based on a crawl of the Web of Linked Data conducted in April 2014. This document is a shortened version of the ISWC2014 paper Adoption of the Linked Data Best Practices in Different Topical Domains. The paper provides more details about the crawling process as well as a deeper discussion of the results. In contrast, this document links the statistics to the Mannheim Linked Data catalog and enables the reader to drill down and explore information about the datasets behind each statistical result.

1. The Linked Data Crawl

In order to discover as many Linked Data sources as possible, we crawled a snapshot of the Linked Data Web using the LDSpider Linked Data crawler. We seeded LDSpider with 560 thousand seed URIs originating from the datahub.io dataset catalog, the Billion Triple Challenge 2012 dataset, and datasets advertised on the public-lod@w3.org mailing list. With those seeds, we performed crawls during April 2014 to retrieve entities from every dataset using a breadth-first crawling strategy. Datasets that do not allow crawlers were not included in our corpus. Altogether, we crawled 900,129 documents describing 8,038,396 resources. In general, we assume that all data from one pay-level domain (PLD) belongs to one dataset. As an exception to this rule, we partitioned datasets in the datahub.io lod-cloud group where multiple datasets per PLD were defined. Furthermore, we removed datasets that only contained vocabulary definitions. We provide the crawled data for download, so that all results presented in the following can be verified.
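The grouping rule described above (one PLD, one dataset) can be illustrated with a minimal sketch. It is not the code used for the crawl: the helper names `naive_pld` and `group_by_pld` are hypothetical, and the pay-level domain is only approximated with a small hand-picked suffix list, whereas a real implementation would consult the full Public Suffix List.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: a tiny, hand-picked set of multi-label public suffixes.
# A real implementation would use the full Public Suffix List instead.
MULTI_LABEL_SUFFIXES = {"gov.uk", "co.uk", "ac.uk"}


def naive_pld(uri):
    """Approximate the pay-level domain of a URI (hypothetical helper)."""
    host = (urlsplit(uri).hostname or "").lower()
    labels = host.split(".")
    if len(labels) <= 2:
        return host
    # Keep three labels when the last two form a known multi-label suffix.
    if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def group_by_pld(document_uris):
    """Group crawled document URIs into candidate datasets, one per PLD."""
    groups = defaultdict(list)
    for uri in document_uris:
        groups[naive_pld(uri)].append(uri)
    return dict(groups)
```

Under these assumptions, `http://www.w3.org/People/Berners-Lee/card` and `http://w3.org/ns/org` fall into the same group, `w3.org`. The exception for datahub.io entries that define several datasets per PLD is not modeled here.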

2. Linked Data by Domain

Linked Data technologies are being used to share data covering a wide range of different topical domains. The table below gives an overview of the topical domains of the 1014 datasets that were discovered by our crawl.

**Datasets by topical domain.**
Topic	Datasets	%
Government	183	18.05%
Publications	96	9.47%
Life sciences	83	8.19%
User-generated content	48	4.73%
Cross-domain	41	4.04%
Media	22	2.17%
Geographic	21	2.07%
Social web	520	51.28%
Total	1014
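
The percentage column above is each domain's dataset count divided by the total of 1014. As a small check, the shares can be recomputed from the counts in the table; the variable names are illustrative only.

```python
# Dataset counts per topical domain, copied from the table above.
counts = {
    "Government": 183,
    "Publications": 96,
    "Life sciences": 83,
    "User-generated content": 48,
    "Cross-domain": 41,
    "Media": 22,
    "Geographic": 21,
    "Social web": 520,
}

total = sum(counts.values())

# Share of each domain in percent, rounded to two decimals as in the table.
shares = {topic: round(100 * n / total, 2) for topic, n in counts.items()}

for topic, share in shares.items():
    print(f"{topic}: {share:.2f}%")
```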

3. Crawlable LOD Cloud Diagram

The image below gives an overview of the linkage relationships between datasets. Clicking the image will take you to an image map, which allows you to explore the metadata for each dataset in the Mannheim Linked Data Catalog.

4. Best Practices

The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying with a set of best practices in the areas of linking, vocabulary usage, and metadata provision. This section analyzes the extent to which the crawled data sources implement these best practices.

4.1 Interlinking Best Practice

By setting RDF links, data providers connect their datasets into a single global data graph which can be navigated by applications and enables the discovery of additional data by following RDF links.

In total, 56.11% of the crawled datasets link to at least one other dataset. The remaining datasets are only targets of RDF links.

The table below categorizes the datasets by the number of other data sources that are targets of their outgoing RDF links.

**Categorization by number of linked datasets**
Number of linked datasets	Number of datasets
more than 10	79 (7.79%)
6 to 10	81 (7.99%)
5	31 (3.06%)
4	42 (4.14%)
3	54 (5.33%)
2	106 (10.45%)
1	176 (17.36%)
0	445 (43.89%)
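
The categorization above can be sketched as follows, assuming the links have already been aggregated to the dataset level as (source dataset, target dataset) pairs. The function names are hypothetical, and the bucket labels simply mirror the table rows.

```python
from collections import defaultdict


def outdegrees(links, datasets):
    """Count, per dataset, the distinct other datasets it links to."""
    targets = defaultdict(set)
    for source, target in links:
        if source != target:  # links within a dataset do not count
            targets[source].add(target)
    return {d: len(targets[d]) for d in datasets}


def bucket(outdegree):
    """Map an outdegree to the row label used in the table above."""
    if outdegree > 10:
        return "more than 10"
    if outdegree >= 6:
        return "6 to 10"
    return str(outdegree)
```

A dataset with outdegree 0 ends up in the last row of the table, that is, among the datasets that are only targets of RDF links.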

The tables below show the ten datasets with the highest in- and outdegrees.

**Datasets with the ten highest indegrees**
Dataset	Category	Indegree
dbpedia.org	cross-domain	207
geonames.org	geographic	141
w3.org	cross-domain	117
quitter.se	social web	64
status.net	social web	63
postblue.info	social web	56
skilledtests.com	social web	55
reference.data.gov.uk	government	45
data.semanticweb.org	publications	44
fragdev.com	social web	41
lexvo.org	cross-domain	37

**Datasets with the ten highest outdegrees**
Dataset	Category	Outdegree
bibsonomy.org	publications	91
semanlink.net	user-generated content	88
deri.org	social web	71
harth.org	social web	68
quitter.se	social web	67
semanticweb.org	user-generated content	64
skilledtests.com	social web	60
postblue.info	social web	59
status.net	social web	47
w3.org	cross-domain	45
data.semanticweb.org	publications	45
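
A ranking like the indegree table above can be derived from the same kind of dataset-level link pairs. The sketch below is an illustration under that assumption, not the original analysis code; `top_by_indegree` is a hypothetical name.

```python
from collections import Counter


def top_by_indegree(links, k=10):
    """Return the k datasets that are linked from the most other datasets."""
    # Deduplicate so that each (source, target) pair counts once,
    # and ignore links that stay within a single dataset.
    distinct = {(s, t) for s, t in links if s != t}
    indegree = Counter(target for _, target in distinct)
    return indegree.most_common(k)
```

Swapping the roles of source and target in the counter yields the corresponding outdegree ranking. The order of datasets with equal degree is not defined by this sketch.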

The table below lists the most frequently used linking predicates for each topical domain.

**Three most used predicates for interlinking by category.**
Category	Predicate	Usage	Category	Predicate	Usage
social web	foaf:knows	60.27%	life sciences	owl:sameAs	52.17%
	foaf:based_near	35.69%		rdfs:seeAlso	48.48%
	sioc:follows	34.34%		dct:creator	21.74%
publications	owl:sameAs	32.20%	government	dct:publisher	47.57%
	dct:language	25.42%		dct:spatial	30.10%
	rdfs:seeAlso	23.73%		owl:sameAs	24.27%
user-generated content	owl:sameAs	53.13%	geographic	owl:sameAs	64.29%
	rdfs:seeAlso	21.88%		skos:exactMatch	21.43%
	dct:source	18.75%		skos:closeMatch	21.43%
media	owl:sameAs	81.25%	cross-domain	owl:sameAs	80.00%
	rdfs:seeAlso	18.75%		rdfs:seeAlso	52.00%
	foaf:based_near	18.75%		dct:creator	20.00%

4.2 Vocabulary Best Practices

In order to make it easier for applications to understand Linked Data, data providers should use terms from widely deployed vocabularies to represent data wherever possible.

4.2.1 Usage of Proprietary Vocabularies

We define a vocabulary as non-proprietary if there are at least two datasets using the vocabulary.

Of all 649 vocabularies encountered, 378 (58.24%) are proprietary according to our definition, while 271 (41.76%) are non-proprietary.

In total, 241 (23.17%) datasets use proprietary vocabularies, while nearly all (99.87%) datasets use non-proprietary vocabularies.
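
The definition above translates directly into a count of datasets per vocabulary. The following sketch assumes the input is a mapping from dataset to the set of vocabulary namespaces it uses; the function name and input shape are assumptions made for illustration.

```python
from collections import defaultdict


def classify_vocabularies(usage):
    """Split vocabularies into (proprietary, non_proprietary).

    usage: dict mapping a dataset name to the set of vocabulary
    namespace URIs used by that dataset.
    """
    users = defaultdict(set)
    for dataset, vocabularies in usage.items():
        for vocabulary in vocabularies:
            users[vocabulary].add(dataset)
    # Non-proprietary: used by at least two datasets; everything else
    # encountered in the corpus is proprietary.
    proprietary = {v for v, ds in users.items() if len(ds) < 2}
    non_proprietary = set(users) - proprietary
    return proprietary, non_proprietary
```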

The table below lists the non-proprietary vocabularies that are used by at least 1% of the datasets and provides links to the data sources that use a specific vocabulary.

Vocabulary Prefix	Vocabulary Link	Number of Datasets	Datasets that use the Vocabulary
rdf	http://www.w3.org/1999/02/22-rdf-syntax-ns#	996 (98.22%)	Datasets that use rdf
rdfs	http://www.w3.org/2000/01/rdf-schema#	736 (72.58%)	Datasets that use rdfs
foaf	http://xmlns.com/foaf/0.1/	701 (69.13%)	Datasets that use foaf
dcterm	http://purl.org/dc/terms/	568 (56.02%)	Datasets that use dcterm
owl	http://www.w3.org/2002/07/owl#	370 (36.49%)	Datasets that use owl
pos	http://www.w3.org/2003/01/geo/wgs84_pos#	254 (25.05%)	Datasets that use pos
sioc	http://rdfs.org/sioc/ns#	179 (17.65%)	Datasets that use sioc
admin	http://webns.net/mvcb/	157 (15.48%)	Datasets that use admin
skos	http://www.w3.org/2004/02/skos/core#	143 (14.10%)	Datasets that use skos
void	http://rdfs.org/ns/void#	137 (13.51%)	Datasets that use void
bio	http://purl.org/vocab/bio/0.1/	125 (12.33%)	Datasets that use bio
cube	http://purl.org/linked-data/cube#	114 (11.24%)	Datasets that use cube
rss	http://purl.org/rss/1.0/	99 (9.76%)	Datasets that use rss
w3con	http://www.w3.org/2000/10/swap/pim/contact#	77 (7.59%)	Datasets that use w3con
doap	http://usefulinc.com/ns/doap#	65 (6.41%)	Datasets that use doap
bibo	http://purl.org/ontology/bibo/	62 (6.11%)	Datasets that use bibo
dcat	http://www.w3.org/ns/dcat#	59 (5.82%)	Datasets that use dcat
cert	http://www.w3.org/ns/auth/cert#	51 (5.03%)	Datasets that use cert
sdmxd	http://purl.org/linked-data/sdmx/2009/dimension#	48 (4.73%)	Datasets that use sdmxd
airport	http://www.daml.org/2001/10/html/airport-ont#	45 (4.44%)	Datasets that use airport
wot	http://xmlns.com/wot/0.1/	44 (4.34%)	Datasets that use wot
content	http://purl.org/rss/1.0/modules/content/	43 (4.24%)	Datasets that use content
cc	http://creativecommons.org/ns#	39 (3.85%)	Datasets that use cc
ref	http://purl.org/vocab/relationship/	36 (3.55%)	Datasets that use ref
wn	http://xmlns.com/wordnet/1.6/	33 (3.25%)	Datasets that use wn
tsioc	http://rdfs.org/sioc/types#	33 (3.25%)	Datasets that use tsioc
vcard2006	http://www.w3.org/2006/vcard/ns#	29 (2.86%)	Datasets that use vcard2006
sdmxa	http://purl.org/linked-data/sdmx/2009/attribute#	29 (2.86%)	Datasets that use sdmxa
gn	http://www.geonames.org/ontology#	27 (2.66%)	Datasets that use gn
swc	http://data.semanticweb.org/ns/swc/ontology#	27 (2.66%)	Datasets that use swc
dctypes	http://purl.org/dc/dcmitype/	26 (2.56%)	Datasets that use dctypes
hartigprov	http://purl.org/net/provenance/ns#	26 (2.56%)	Datasets that use hartigprov
sd	http://www.w3.org/ns/sparql-service-description#	25 (2.47%)	Datasets that use sd
open	http://open.vocab.org/terms/	22 (2.17%)	Datasets that use open
prov	http://www.w3.org/ns/prov#	21 (2.07%)	Datasets that use prov
resource	http://purl.org/vocab/resourcelist/schema#	20 (1.97%)	Datasets that use resource
rda	http://rdvocab.info/elements/	19 (1.87%)	Datasets that use rda
prvt	http://purl.org/net/provenance/types#	18 (1.78%)	Datasets that use prvt
c4dm	http://purl.org/NET/c4dm/event.owl#	18 (1.78%)	Datasets that use c4dm
gr	http://purl.org/goodrelations/v1#	17 (1.68%)	Datasets that use gr
rsa	http://www.w3.org/ns/auth/rsa#	17 (1.68%)	Datasets that use rsa
aiiso	http://purl.org/vocab/aiiso/schema#	17 (1.68%)	Datasets that use aiiso
pingback	http://purl.org/net/pingback/	16 (1.58%)	Datasets that use pingback
time	http://www.w3.org/2006/time#	14 (1.38%)	Datasets that use time
org	http://www.w3.org/ns/org#	14 (1.38%)	Datasets that use org
wdrs	http://www.w3.org/2007/05/powder-s#	13 (1.28%)	Datasets that use wdrs
vs	http://www.w3.org/2003/06/sw-vocab-status/ns#	12 (1.18%)	Datasets that use vs
vann	http://purl.org/vocab/vann/	12 (1.18%)	Datasets that use vann
icaltzd	http://www.w3.org/2002/12/cal/icaltzd#	11 (1.08%)	Datasets that use icaltzd
frbrcore	http://purl.org/vocab/frbr/core#	11 (1.08%)	Datasets that use frbrcore
xhv	http://www.w3.org/1999/xhtml/vocab#	11 (1.08%)	Datasets that use xhv
lcy	http://purl.org/vocab/lifecycle/schema#	10 (0.99%)	Datasets that use lcy
rdfg	http://www.w3.org/2004/03/trix/rdfg-1/	10 (0.99%)	Datasets that use rdfg
mo	http://purl.org/ontology/mo/	9 (0.89%)	Datasets that use mo
cal	http://www.w3.org/2002/12/cal/ical#	9 (0.89%)	Datasets that use cal
sdmx	http://purl.org/linked-data/sdmx#	9 (0.89%)	Datasets that use sdmx
skosxl	http://www.w3.org/2008/05/skos-xl#	8 (0.79%)	Datasets that use skosxl
visit	http://purl.org/net/vocab/2004/07/visit#	8 (0.79%)	Datasets that use visit
timeline	http://purl.org/NET/c4dm/timeline.owl#	8 (0.79%)	Datasets that use timeline
coun	http://www.daml.org/2001/09/countries/iso-3166-ont#	7 (0.69%)	Datasets that use coun
wn20schema	http://www.w3.org/2006/03/wn/wn20/schema/	6 (0.59%)	Datasets that use wn20schema
spatial	http://geovocab.org/spatial#	6 (0.59%)	Datasets that use spatial
dcam	http://purl.org/dc/dcam/	5 (0.49%)	Datasets that use dcam
adms	http://www.w3.org/ns/adms#	5 (0.49%)	Datasets that use adms
voaf	http://purl.org/vocommons/voaf#	5 (0.49%)	Datasets that use voaf
xkos	http://purl.org/linked-data/xkos#	5 (0.49%)	Datasets that use xkos
rev	http://purl.org/stuff/rev#	5 (0.49%)	Datasets that use rev
api	http://purl.org/linked-data/api/vocab#	4 (0.39%)	Datasets that use api
rdarel	http://rdvocab.info/RDARelationshipsWEMI/	3 (0.30%)	Datasets that use rdarel
geom	http://geovocab.org/geometry#	3 (0.30%)	Datasets that use geom
oo	http://purl.org/openorg/	3 (0.30%)	Datasets that use oo
log	http://www.w3.org/2000/10/swap/log#	3 (0.30%)	Datasets that use log
wordnet	http://purl.org/vocabularies/princeton/wordnet/schema#	3 (0.30%)	Datasets that use wordnet
formats	http://www.w3.org/ns/formats/	3 (0.30%)	Datasets that use formats
exif	http://www.w3.org/2003/12/exif/ns#	3 (0.30%)	Datasets that use exif
wlo	http://purl.org/ontology/wo/	3 (0.30%)	Datasets that use wlo
gold	http://purl.org/linguistics/gold/	3 (0.30%)	Datasets that use gold
xtypes	http://purl.org/xtypes/	3 (0.30%)	Datasets that use xtypes
doc	http://www.w3.org/2000/10/swap/pim/doc#	3 (0.30%)	Datasets that use doc
book	http://purl.org/NET/book/vocab#	2 (0.20%)	Datasets that use book
po	http://purl.org/ontology/po/	2 (0.20%)	Datasets that use po
rdag1	http://rdvocab.info/Elements/	2 (0.20%)	Datasets that use rdag1
taxo	http://purl.org/rss/1.0/modules/taxonomy/	2 (0.20%)	Datasets that use taxo
label	http://purl.org/net/vocab/2004/03/label#	2 (0.20%)	Datasets that use label
wv	http://vocab.org/waiver/terms/	2 (0.20%)	Datasets that use wv
daml	http://www.daml.org/2001/03/daml+oil#	2 (0.20%)	Datasets that use daml
ctorg	http://purl.org/ctic/infraestructuras/organizacion#	2 (0.20%)	Datasets that use ctorg
prog	http://purl.org/prog/	2 (0.20%)	Datasets that use prog
cs	http://purl.org/vocab/changeset/schema#	2 (0.20%)	Datasets that use cs
opmv	http://purl.org/net/opmv/ns#	2 (0.20%)	Datasets that use opmv
coin	http://purl.org/court/def/2009/coin#	2 (0.20%)	Datasets that use coin
admssw	http://purl.org/adms/sw/	2 (0.20%)	Datasets that use admssw
library	http://purl.org/library/	1 (0.10%)	Datasets that use library
fresnel	http://www.w3.org/2004/09/fresnel#	1 (0.10%)	Datasets that use fresnel
scv	http://purl.org/NET/scovo#	1 (0.10%)	Datasets that use scv
re	http://www.w3.org/2000/10/swap/reason#	1 (0.10%)	Datasets that use re
cidoccrm	http://purl.org/NET/cidoc-crm/core#	1 (0.10%)	Datasets that use cidoccrm
grddl	http://www.w3.org/2003/g/data-view#	1 (0.10%)	Datasets that use grddl
lyou	http://purl.org/linkingyou/	1 (0.10%)	Datasets that use lyou
te	http://www.w3.org/2006/time-entry#	1 (0.10%)	Datasets that use te
gadm	http://gadm.geovocab.org/ontology#	1 (0.10%)	Datasets that use gadm
being	http://purl.org/ontomedia/ext/common/being#	1 (0.10%)	Datasets that use being
ann	http://www.w3.org/2000/10/annotation-ns#	1 (0.10%)	Datasets that use ann
bookmark	http://www.w3.org/2002/01/bookmark#	1 (0.10%)	Datasets that use bookmark
rad	http://www.w3.org/ns/rad#	1 (0.10%)	Datasets that use rad
link	http://www.w3.org/2006/link#	1 (0.10%)	Datasets that use link
oa	http://www.w3.org/ns/oa#	1 (0.10%)	Datasets that use oa
asn	http://purl.org/ASN/schema/core/	1 (0.10%)	Datasets that use asn
swid	http://semanticweb.org/id/	1 (0.10%)	Datasets that use swid
radion	http://www.w3.org/ns/radion#	1 (0.10%)	Datasets that use radion
gbv	http://purl.org/ontology/gbv/	1 (0.10%)	Datasets that use gbv
ssn	http://www.w3.org/2005/Incubator/ssn/ssnx/ssn#	1 (0.10%)	Datasets that use ssn
wdr	http://www.w3.org/2007/05/powder#	1 (0.10%)	Datasets that use wdr
gso	http://www.w3.org/2006/gen/ont#	1 (0.10%)	Datasets that use gso
amalgame	http://purl.org/vocabularies/amalgame#	1 (0.10%)	Datasets that use amalgame
emp	http://purl.org/ctic/empleo/oferta#	1 (0.10%)	Datasets that use emp
conversion	http://purl.org/twc/vocab/conversion/	1 (0.10%)	Datasets that use conversion
acl	http://www.w3.org/ns/auth/acl#	1 (0.10%)	Datasets that use acl
psych	http://purl.org/vocab/psychometric-profile/	1 (0.10%)	Datasets that use psych
places	http://purl.org/ontology/places#	1 (0.10%)	Datasets that use places
hcard	http://purl.org/uF/hCard/terms/	1 (0.10%)	Datasets that use hcard
cito	http://purl.org/spar/cito/	1 (0.10%)	Datasets that use cito
rov	http://www.w3.org/ns/regorg#	1 (0.10%)	Datasets that use rov
identity	http://purl.org/twc/ontologies/identity.owl#	1 (0.10%)	Datasets that use identity
flow	http://www.w3.org/2005/01/wf/flow#	1 (0.10%)	Datasets that use flow
b2bo	http://purl.org/b2bo#	1 (0.10%)	Datasets that use b2bo
swrl	http://www.w3.org/2003/11/swrl#	1 (0.10%)	Datasets that use swrl
transit	http://vocab.org/transit/terms/	1 (0.10%)	Datasets that use transit
rdafrbr	http://rdvocab.info/uri/schema/FRBRentitiesRDA/	1 (0.10%)	Datasets that use rdafrbr
fowl	http://www.w3.org/TR/2003/PR-owl-guide-20031209/food#	1 (0.10%)	Datasets that use fowl
cv	http://purl.org/captsolo/resume-rdf/0.2/cv#	1 (0.10%)	Datasets that use cv
sim	http://purl.org/ontology/similarity/	1 (0.10%)	Datasets that use sim
wordmap	http://purl.org/net/ns/wordmap#	1 (0.10%)	Datasets that use wordmap
frbre	http://purl.org/vocab/frbr/extended#	1 (0.10%)	Datasets that use frbre

The following table shows the three most used vocabularies for each topical domain, excluding the ubiquitous vocabularies rdf, rdfs and owl. The prefix odc denotes the vocabulary from opendatacommunities.org.

Category	Vocabulary	Usage	Category	Vocabulary	Usage
social web	foaf	86.12%	life sciences	dct	66.29%
	dct	40.65%		foaf	41.57%
	wgs84	36.99%		void	31.46%
publications	dct	81.73%	government	dct	63.98%
	foaf	69.23%		cube	60.75%
	bibo	41.34%		odc*	46.24%
user-generated content	dct	81.91%	geographic	dct	82.93%
	foaf	74.55%		foaf	65.85%
	sioc	43.63%		skos	48.78%
media	foaf	75.67%	cross-domain	dct	72.73%
	dct	54.05%		foaf	72.73%
	mo	18.91%		skos	38.63%

4.2.2 Usage of Dereferencable Vocabularies

For proprietary vocabularies in particular, it is essential that they are dereferencable and linked to other vocabularies, so that agents can interpret their semantics.

To assess whether a vocabulary is dereferencable, we collected the terms for each proprietary vocabulary encountered in our corpus. For every term, we requested its URI via an HTTP GET request. We define the dereferencability quota of a vocabulary as the number of dereferencable terms divided by all terms collected from the vocabulary.
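
The quota defined above can be sketched as follows. The check itself is passed in as a function so that the calculation can be tried without network access. `http_get_ok` is one possible check and rests on assumptions that the text does not state: it treats any 2xx response as dereferencable and sends an RDF-oriented Accept header.

```python
import http.client
import urllib.request


def http_get_ok(uri, timeout=10):
    """Assumed check: an HTTP GET that ends in a 2xx response counts."""
    request = urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml, text/turtle;q=0.9"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, unreachable hosts, timeouts and malformed URIs.
        return False


def dereferencability_quota(term_uris, is_dereferencable=http_get_ok):
    """Number of dereferencable terms divided by all collected terms."""
    terms = list(term_uris)
    if not terms:
        raise ValueError("no terms collected for this vocabulary")
    return sum(1 for t in terms if is_dereferencable(t)) / len(terms)


def classify(quota):
    """Label a quota as fully, partially or not dereferencable."""
    if quota == 1.0:
        return "full"
    if quota == 0.0:
        return "none"
    return "partial"
```

For example, a vocabulary with four collected terms of which two can be retrieved has a quota of 0.5 and counts as partially dereferencable.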

In total, 19.25% of all proprietary vocabularies are fully dereferencable (i.e., their quota is 1.0). On the other hand, 72.75% of all proprietary vocabularies are not dereferencable at all. The remaining vocabularies, which are 8.00% of all proprietary ones, are partially dereferencable, meaning that for some terms, but not for all, a definition could be retrieved.

The following table shows the dereferencability of proprietary vocabularies for the individual categories.

**Usage and Dereferencability of Proprietary Vocabularies per Category**
Category	Different Prop. Vocabs Used (% of all Prop. Vocabs)	# of Datasets Using Prop. Vocabs (% of all datasets)	Dereferencability: Full	Dereferencability: Partial	Dereferencability: None
social web	128 (33.86%)	83 (15.99%)	16.41%	6.25%	77.78%
government	48 (12.70%)	35 (18.82%)	20.83%	12.50%	66.67%
publications	58 (15.34%)	35 (33.65%)	20.69%	6.90%	72.41%
life sciences	35 (9.25%)	26 (29.21%)	28.57%	5.71%	65.71%
user-gen. cnt.	30 (7.93%)	26 (47.27%)	13.33%	10.00%	76.67%
cross-domain	55 (14.55%)	16 (36.36%)	27.27%	10.91%	61.82%
media	22 (5.82%)	21 (56.76%)	0.00%	9.09%	90.91%
geographic	24 (6.34%)	16 (39.02%)	20.83%	4.17%	75.00%
Total	378 (58.24%)	241 (23.17%)	19.25%	8.00%	72.75%

4.3 Adoption of Metadata Best Practices

Metadata helps make datasets self-descriptive. Best practices for providing metadata as Linked Data include provenance and licensing information, dataset-level metadata, and information about additional access methods.

4.3.1 Providing Provenance Information

Our analysis is based on a set of 26 vocabularies that we identified as usable for providing provenance information. The set was assembled from information provided by the W3C working group on provenance, the LOV vocabulary catalog, as well as our own experience. Using those vocabularies, we searched each dataset for triples that use one of the vocabularies and have a document's URI as the subject.

In summary, 35.77% of all datasets use some provenance vocabulary. Looking at individual vocabularies, 28.37% of all datasets use DC or DCTerms, 10.77% use MetaVocab, and 0.77% use prv or prov.
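
The search described above can be sketched as a filter over (subject, predicate, object) triples. The namespace tuple below is only an illustrative subset taken from vocabularies named in this document, not the full set of 26, and the function name is hypothetical.

```python
# Assumption: an illustrative subset of provenance-related namespaces
# (Dublin Core, admin/MetaVocab, prov, prv); the analysis used 26.
PROVENANCE_NAMESPACES = (
    "http://purl.org/dc/elements/1.1/",
    "http://purl.org/dc/terms/",
    "http://webns.net/mvcb/",
    "http://www.w3.org/ns/prov#",
    "http://purl.org/net/provenance/ns#",
)


def uses_provenance(triples, document_uris):
    """True if any triple describes a document with a provenance term."""
    documents = set(document_uris)
    return any(
        subject in documents and predicate.startswith(PROVENANCE_NAMESPACES)
        for subject, predicate, _ in triples
    )
```

Triples whose subject is an entity rather than a crawled document are ignored, which matches the restriction to document URIs as subjects.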

The following table shows an overview of provenance vocabulary use for different topical domains.

**Datasets Providing Provenance Information By Category Including the Vocabulary Used**
Category	Any provenance vocabulary	Using Dublin Core	Using admin	Using prv or prov
social web	169 (32.56%)	56.21%	58.58%	1.18%
government	77 (41.40%)	100.00%	0.00%	1.30%
publications	39 (37.50%)	94.87%	5.13%	2.56%
life sciences	21 (23.60%)	100.00%	0.00%	2.56%
user-gen. content	11 (20.00%)	90.91%	54.55%	0.00%
cross-domain	8 (18.18%)	100.00%	12.50%	0.00%
media	5 (13.51%)	100.00%	0.00%	0.00%
geographic	4 (9.76%)	100.00%	0.00%	25.00%
Total	372 (35.77%)	28.37%	10.77%	0.77%

4.3.2 Providing Licensing Information

With the help of licensing information, agents can assess whether they may use the data for the purpose at hand.

To evaluate whether a dataset provides license information, we again searched for triples that have the document as their subject and a predicate containing the string "licen". To this list, we added the dc:rights and dct:rights predicates as well as the terms of the waiver vocabulary, which leads to a total of 47 terms.

In total, 7.85% of all datasets provide licensing information in RDF. The most important predicates for indicating the license are dc/dct:license (7.98%), cc:license (2.02%) and dc/dct:rights (1.63%).

Category	Licensing Information
social web	5.20%
government	29.57%
publications	3.85%
life sciences	3.37%
user-gen. content	10.91%
cross-domain	11.36%
media	5.41%
geographic	0.00%
Total	7.85%
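
The predicate matching used for the licensing analysis can be sketched as below. The sketch does not reproduce the list of 47 terms; it applies the three stated rules directly. The case-insensitive match and the function names are assumptions.

```python
RIGHTS_PREDICATES = {
    "http://purl.org/dc/elements/1.1/rights",
    "http://purl.org/dc/terms/rights",
}
WAIVER_NAMESPACE = "http://vocab.org/waiver/terms/"


def is_license_predicate(predicate):
    """Apply the three rules: 'licen' substring, dc/dct rights, waiver."""
    return (
        "licen" in predicate.lower()  # assumption: match case-insensitively
        or predicate in RIGHTS_PREDICATES
        or predicate.startswith(WAIVER_NAMESPACE)
    )


def provides_license(triples, document_uris):
    """True if a crawled document is the subject of a licensing triple."""
    documents = set(document_uris)
    return any(
        subject in documents and is_license_predicate(predicate)
        for subject, predicate, _ in triples
    )
```

The substring "licen" covers both the spelling "license" and "licence".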

4.3.3 Providing Dataset Level Metadata

Dataset level metadata is provided by using the VoID vocabulary, either as inline statements in the dataset or in a separate VoID file.

In the latter case, the file has to be linked from the data via a backlink, or provided at a well-known location as defined by RFC5785, which is created by appending /.well-known/void to the host part of the URI. As the well-known location requirement is often too strict for data providers who lack root-level access to their servers, we relax the search for VoID files at well-known locations by appending /.well-known/void to any portion of the URI.
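
The relaxed lookup can be illustrated by generating candidate locations for a given URI. This is a sketch of one reading of "any portion of the URI": it is assumed here that every path prefix, including the full path, is tried. The function name is hypothetical.

```python
from urllib.parse import urlsplit

WELL_KNOWN_SUFFIX = "/.well-known/void"


def void_candidates(uri):
    """List candidate VoID locations, starting with the strict one."""
    parts = urlsplit(uri)
    root = f"{parts.scheme}://{parts.netloc}"
    # Strict location: the suffix appended to the host part of the URI.
    candidates = [root + WELL_KNOWN_SUFFIX]
    # Relaxed locations: the suffix appended to each path prefix.
    prefix = ""
    for segment in (s for s in parts.path.split("/") if s):
        prefix += "/" + segment
        candidates.append(root + prefix + WELL_KNOWN_SUFFIX)
    return candidates
```

For `http://example.org/data/people`, the candidates are `http://example.org/.well-known/void`, `http://example.org/data/.well-known/void` and `http://example.org/data/people/.well-known/void`.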

In total, 140 (13.46%) of all datasets use the VoID vocabulary of which 48 (4.62%) use a backlinking mechanism, 34 of which link to a retrievable VoID file.

Category	Total	Link	Well-known	Inline
social web	6 (1.16%)	0.58%	0.19%	0.58%
government	75 (40.32%)	6.99%	3.23%	31.18%
publications	14 (13.46%)	6.73%	2.88%	5.77%
life sciences	29 (32.58%)	19.10%	4.49%	12.36%
user-gen. content	6 (10.91%)	5.45%	0.00%	5.45%
cross-domain	5 (11.36%)	9.09%	2.27%	2.27%
media	2 (5.41%)	2.70%	0.00%	2.70%
geographic	15 (36.59%)	14.63%	12.20%	12.20%
Total	140 (13.46%)	4.62%	1.44%	8.27%

4.3.4 Providing Alternative Access Methods

When looking at the availability of alternative access methods, we restricted ourselves to those access methods that are stated in the dataset-level metadata, namely VoID files and triples with VoID statements.

In total, we found alternative access methods for 48 (5.89%) of all datasets. SPARQL endpoints are denoted by 4.54% of all datasets, while dumps are denoted by 3.80%. The following table gives an overview of SPARQL endpoints and dumps by category.

Category	Any	SPARQL	Dump
social web	6 (1.16%)	1.16%	0.39%
government	61 (32.80%)	30.11%	30.65%
publications	10 (10.58%)	9.62%	3.85%
life sciences	19 (21.35%)	20.22%	16.85%
user-gen. content	3 (5.45%)	5.45%	1.82%
cross-domain	4 (9.09%)	4.55%	6.82%
media	1 (2.70%)	0.00%	2.70%
geographic	8 (19.51%)	12.20%	12.20%
Total	48 (5.89%)	4.54%	3.80%
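
Access methods stated in VoID metadata can be collected with a simple scan over triples. The sketch below assumes that the two VoID properties `void:sparqlEndpoint` and `void:dataDump` are the statements of interest; the function name and return shape are illustrative.

```python
VOID = "http://rdfs.org/ns/void#"
SPARQL_ENDPOINT = VOID + "sparqlEndpoint"
DATA_DUMP = VOID + "dataDump"


def access_methods(triples):
    """Collect SPARQL endpoint and dump URLs stated via VoID."""
    found = {"sparql": set(), "dump": set()}
    for _, predicate, obj in triples:
        if predicate == SPARQL_ENDPOINT:
            found["sparql"].add(obj)
        elif predicate == DATA_DUMP:
            found["dump"].add(obj)
    return found
```

A dataset would count toward the "Any" column if either set is non-empty.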

5. Downloads

The crawl dumps and other files which are the basis of this analysis can be downloaded here.

6. Feedback

For feedback, please contact Max Schmachtenberg or Chris Bizer.

7. References

Auer, S., Demter, J., Martin, M., Lehmann, J.: LODStats - an extensible framework for high-performance dataset analytics. In: EKAW. pp. 353–362 (2012)
Bizer, C., Eckert, K., Meusel, R., Mühleisen, H., Schuhmacher, M., Völker, J.: Deployment of RDFa, Microdata, and Microformats on the Web - a quantitative analysis. In: The Semantic Web - ISWC 2013, pp. 17–32. Springer Berlin Heidelberg (2013)
Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - the story so far. International Journal on Semantic Web and Information Systems 5(3), 1–22 (2009)
Heath, T., Bizer, C.: Linked Data: Evolving the Web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology 1(1), 1–136 (2011)
Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S.: An empirical survey of Linked Data conformance. J. Web Sem. 14, 14–44 (2012)
Isele, R., Umbrich, J., Bizer, C., Harth, A.: LDSpider: An open-source crawling framework for the Web of Linked Data. In: Proceedings of the ISWC 2010 Posters and Demonstrations Track (2010)
Jentzsch, A., Cyganiak, R., Bizer, C.: State of the LOD Cloud (September 2011)
Paulheim, H., Hertling, S.: Discoverability of SPARQL endpoints in Linked Open Data. In: Proceedings of the Posters and Demos Track of ISWC 2013 (2013)
Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014) - RDB Track (2014)

8. Credits

The work was supported by the EU research project PlanetData.
