This document defines the Information model for cross-domain data queries - Geospatial (IMX-Geo): an overarching information model describing the relationships between object types from heterogeneous governmental data sources ("Base registries") for geospatial data. The goal is to support users in asking cross-registry questions.
Status of this document
This is a stable draft, published for public comment. Comments regarding this document may be sent to
1. Introduction
This document defines the Information model for cross-domain data queries - Geospatial (IMX-Geo): an overarching information model describing the relationships between object types from heterogeneous governmental data sources ("Base registries") for geospatial data. The goal is to support users in asking cross-registry questions. The data is served from the source registries on demand via orchestration mechanisms, using product API interfaces that cross source data boundaries.
The cross-domain model should:
Guide the way for people: people who design product APIs or other products based on the orchestration endpoint should be able to read IMX-Geo to understand what data is available, which objects are related to which, and how to navigate the paths through the data to assemble the data they want.
Inform machines: the orchestration engine should be able to read IMX-Geo to understand the relationships between objects and the navigation paths that exist through the data. E.g., IMX-Geo can keep the orchestration engine up to date if any changes in the underlying data sources occur.
The data sources are:
Addresses and buildings (BAG)
Large scale topography (BGT)
Small scale topography (BRT)
Cadastral registry: only the Cadastral Map (BRK, only DKK)
2. Requirements
2.1 User-friendliness
The information model should be user-friendly, i.e.:
Simplified structures for common use cases, like the Address class
Names that correspond with common language, e.g. straatnaam instead of Openbare ruimte naam.
2.2 Cherry-picking
It should be possible to leave out properties that are present in the source model but are not of interest to the users.
2.3 Aid data discovery
The semantic model should be user-friendly as well. In this case that means developers should be able to read the model easily to discover the paths leading to the information they want.
2.4 Coherence between objects from different source models
The SAM information model should add useful relationships that exist inherently between objects, but are not currently defined in the source models.
Background: because the source models were originally defined as silos, relationships between objects from different source registries are mostly unavailable right now. We have a requirement to add those relationships that are of value to users (definitely not all).
2.5 Coherence in extra layer
The added relationships between SAM and source models cannot change the source object types as this goes against maintenance and ownership principles. The additional relationships must be added in a separate semantic layer.
2.6 Link with source models
The SAM information model should not be completely independent / a completely new model, but should be linked to the source models.
The SAM model will re-use classes and properties from source models, and derive information from source models. SAM classes and properties should contain information about the source they depend on.
2.7 Machine readability
The links between objects/properties in the SAM model and objects/properties in source models should be machine readable. This opens up possibilities to automate things for the knowledge graph and the orchestration.
2.8 Maintainability
It should be possible to keep the SAM information model in sync / up to date with the source models without too much effort / impact.
When changes in source models are published and implemented in data sources, the SAM information model should be updated to take those changes into account.
Note that base registry models do not change frequently.
3. Design of IMX-Geo
We will create a concept scheme ([MIM11] level 1) in SKOS, a UML conceptual model (MIM level 2), and a UML logical model (MIM level 3). In very general terms, the roles of these models are:
The SKOS concept scheme describes the concepts that play a role in our universe of discourse.
The conceptual model defines the classes of our universe of discourse.
The logical model describes the shapes of the data and how the data is derived from the source registries.
The most important one is the logical model, which sits between the source registry models on the one hand, and on the other hand the product models that define what data is served to the users.
3.1 How does the model satisfy the requirements
2.1 User-friendliness - the model introduces classes that correspond to the common point of view of users. Mappings can be used to add user-friendly names for things. These names will also be added in SKOS (MIM level 1).
2.2 Cherry-picking - the model is loosely coupled to the source classes and requires a mapping for each source class.
2.3 Aid data discovery - the model shows how classes are related, including cross-registry relationships.
2.6 Link with source models - the model does contain links to the source classes to indicate which data is derived from which source class.
2.7 Machine readability - the model is expressed in SKOS and MIM. MIM can be expressed in UML, but also in XML and RDF.
2.8 Maintainability - the model is loosely coupled so changes in the source models do not directly impact it, unless changes include the addition or removal of classes. Mappings can be impacted by changes in properties, but the mappings are maintained outside of the model.
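The maintainability argument above can be illustrated with a minimal sketch. It assumes a source model can be summarised as a mapping from class name to a set of property names; the function name `impact_of_change` and the example properties are hypothetical and not part of IMX-Geo.

```python
def impact_of_change(old_model, new_model):
    """Compare two versions of a source model.

    Both arguments map a class name to the set of its property names
    (an assumed, simplified representation of a source model).
    Returns a tuple (model_impact, mapping_impact):
    - model_impact: classes that were added or removed; per the text,
      only these changes touch the cross-domain model itself.
    - mapping_impact: classes whose properties changed; these may
      affect the mappings, which are maintained outside the model.
    """
    old_classes, new_classes = set(old_model), set(new_model)
    model_impact = old_classes ^ new_classes
    mapping_impact = {
        name
        for name in old_classes & new_classes
        if old_model[name] != new_model[name]
    }
    return model_impact, mapping_impact


# Illustrative data only: the property sets are assumptions.
old = {"Pand": {"identificatie", "status"}, "Woonplaats": {"naam"}}
new = {
    "Pand": {"identificatie", "status", "bouwjaar"},
    "Woonplaats": {"naam"},
    "Ligplaats": {"identificatie"},
}
model_impact, mapping_impact = impact_of_change(old, new)
```

In this example the added class ends up in `model_impact`, while the class with an extra property only shows up in `mapping_impact`.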
3.2 Concept scheme
The concept scheme is a MIM level 1 model and introduces user-friendly concepts to talk about the universe of discourse.
URI: https://begrippen.geostandaarden.nl/sm/nl/.
sm stands for 'samenhangende begrippen' which is 'coherent concepts' in English.
The concept scheme contains only those concepts that play a role in the IMX-Geo universe of discourse but have not been coined elsewhere in the context of the Dutch base registries. I.e.: we only coin those concepts that do NOT have an exact match with an existing concept (again, in the context of the Dutch base registries). This saves work and maintenance. We will find out if this is workable.
We create the concept scheme manually, we do not generate it from a UML model. The reason is that we want to be able to link related concepts in ways not supported in UML (see next point).
Concepts will have matching relationships (broadMatch, narrowMatch, closeMatch, relatedMatch) with existing concepts from the Dutch base registries where appropriate. Note: exactMatch is excluded, because concepts that have an exact match with an existing concept are not coined in this concept scheme at all.
Both the conceptual and the logical model have annotations containing the uris of concepts from a Dutch base registry or from the IMX-Geo sm concept scheme. These are entered in the MIM metaproperty begrip. Every class and property has this metadata.
Note
The work-in-progress version of the concept scheme can be viewed here.
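The rule that only broadMatch, narrowMatch, closeMatch, and relatedMatch are used, and never exactMatch, could be checked mechanically. The following is a minimal sketch under the assumption that mapping links are available as simple triples; the prefixes `sm:` and `ex:` and the example links are hypothetical.

```python
# SKOS mapping properties that the text allows between IMX-Geo concepts
# and existing base registry concepts. exactMatch is deliberately absent.
ALLOWED_MATCHES = {"broadMatch", "narrowMatch", "closeMatch", "relatedMatch"}


def invalid_matches(links):
    """Return the links whose mapping property is not allowed.

    Each link is a (concept, mapping_property, target_concept) tuple.
    """
    return [link for link in links if link[1] not in ALLOWED_MATCHES]


# Hypothetical links, for illustration only.
links = [
    ("sm:Gebouw", "broadMatch", "ex:Bouwwerk"),
    ("sm:Gebouw", "exactMatch", "ex:Pand"),
]
rejected = invalid_matches(links)
```

Here `rejected` contains only the exactMatch link, which signals a concept that should not have been coined in the sm concept scheme.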
3.3 Conceptual model
In this section, we list a set of design principles, and describe a first modeling attempt. After creating a first attempt at a conceptual model, we decided to focus on the logical level first, in order to see how it can play a role in the orchestration engine. We retain the old design principles and modeling attempt for now - they may or may not still be valid.
Conceptual modeling principles:
The conceptual model will be a valid UML [MIM11] model on MIM level 2, with the exception that we may define extensions of MIM if we need them.
The conceptual model defines the classes of our universe of discourse (see scope). It identifies the object types and their inherent properties, including relationships with other objects.
The conceptual model describes how the classes in our universe of discourse relate to each other. These relationships can cross the boundaries of individual base registries.
We relate all object types and properties to corresponding SKOS concepts as described in the previous section.
In this conceptual model we will define relationships between objects if they are relevant for users, even though they may not be present in the source datasets (which were designed as silos).
We model the relationships between object types in different source registries by adding these to the copied classes. More on these last two points below.
This preliminary, partial sketch of the conceptual model contains a few object types from BAG, BRK, and DiSGeo. It is only a sketch of how object types from different base registries could be related.
An assumption is that we have access to the information models for all source datasets. These are created using the modeling language UML.
3.3.1 Modeling the relationships between object types in source registries
We considered two options:
Create subclasses (UML specialisations) of the source classes and add the relationships between these subclasses.
Don't use the source models directly, but make copies of all object types that we need in the overarching model. We then add the relationships between these copy classes.
We decided to go for the second option, at least in the conceptual model. In the case of subclasses, in the MIM paradigm we would 'inherit' all properties of the superclasses, while we want only a selection of relevant properties. Also, conceptually we are not creating subclasses. What we actually want is to derive data from source data. We will model the dependency of our object types on source object types at the logical level, because there we are considering data. At the conceptual level we are only considering objects.
3.4 Logical model
The logical model defines the shapes of the data. On this level we add data-registration concepts like history and provenance. The logical model also specifies how orchestrated data is related to source data, at least on the class level.
This logical model must satisfy all requirements in 2. Requirements. I.e., we want to be able to add relationships without changing source models, while retaining a link to the source model classes we derive information from; in a machine-readable way, but also usable for developers to discover the available information and how it can be integrated. The maintenance requirement is less important, because the source models do not change often once standardized.
We are planning to introduce a generic modeling pattern on the MIM level (i.e., in the metamodel) for provenance that can be applied to describe how orchestrated data was created from source data. A first version of this was created as part of our first use case, Addresses. This provenance or lineage information will also be available to users on request. The Lineage model is developed separately in the IMX-LineageModel repository.
This is the current attempt at creating the cross-domain model on the logical level:
Cluster classes are classes we add in the model. In the example above, Gebouw and Perceel are cluster classes. Cluster classes represent a logical unit of information about a real-world object, in which data from different source models can be integrated. They are not present in a source model, but are related to one or more classes in source models. They always have a MIM Begrip metadata field containing the URI of either a concept in a source model, or a concept coined in the context of IMX-Geo.
Trace links: Links between cluster classes and source classes are expressed using UML <<trace>> relationships. They indicate that the cluster class is derived from these source classes - one at minimum, but since we are integrating data from different sources there will often be several trace links originating from one cluster class. Every <<trace>> relationship indicates that data is retrieved from a certain source class, and that a mapping is needed to specify how this is done. The mapping is maintained outside of the model.
Generalization: In an earlier stage of our thinking, we wanted to use <<generalization>> relationships between a cluster class and a source class in cases where the cluster class has only one related source class, and the desired behaviour is to copy all properties from the source object and to add one or more properties.
However, during the second High 5 we discussed this and concluded that these cases would probably be few, and this approach also had the disadvantage of linking the cluster classes more tightly to the source classes. When using generalisation, IMX-Geo effectively becomes an extension of the source models, while we want to specify IMX-Geo as a separate layer which is only loosely coupled. Therefore, we decided to use only trace relationships.
Associations: Whenever we want to introduce a relationship to cross-link two classes from different source models, we add this relationship between the corresponding cluster classes.
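The building blocks described above (cluster classes, trace links, and associations) can be sketched as plain data structures. This is an illustration only, not the MIM or UML representation: the class and function names, the example.org URIs, and the choice of source classes traced by Gebouw and Perceel are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceClass:
    registry: str  # e.g. "BAG"
    name: str      # class name within that source model


@dataclass
class ClusterClass:
    name: str
    begrip: str  # URI of the concept (MIM metaproperty "begrip")
    traces: list = field(default_factory=list)        # <<trace>> links to SourceClass
    associations: dict = field(default_factory=dict)  # role name -> target cluster class name


def problems(cluster):
    """List violations of the rules stated in the text for one cluster class."""
    found = []
    if not cluster.begrip:
        found.append(f"{cluster.name}: missing begrip URI")
    if not cluster.traces:
        found.append(f"{cluster.name}: needs at least one trace link")
    return found


# Hypothetical example: URIs and traced source classes are assumptions.
gebouw = ClusterClass(
    "Gebouw",
    "https://example.org/begrip/Gebouw",
    traces=[SourceClass("BAG", "Pand"), SourceClass("BGT", "Pand")],
)
perceel = ClusterClass(
    "Perceel",
    "https://example.org/begrip/Perceel",
    traces=[SourceClass("BRK", "Perceel")],
    associations={"bevatBouwwerk": "Bouwwerk"},  # cross-registry association
)
```

Note that the association lives on the cluster class only; the source classes are referenced but never modified, which mirrors the loose coupling that trace relationships are meant to provide.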
4. Mapping properties to external standardized properties
4.1 Spatial relationships
The IMX-Geo model introduces several spatial relationships between object types. We want to model these according to the Simple Features topological relationships (also included in NEN 3610 and GeoSPARQL), so we have to consider for each of these relationships, which topological relationship is the correct one.
Contains: a geometry contains another geometry. There are no points of the contained geometry that lie outside the boundary of the containing geometry.
Overlaps: two overlapping geometries have some points in common, and have the same dimension; their intersection also has the same dimension.
Intersects is very broad: two geometries have at least one point in common.
Touches: two geometries have at least one point in common, but the interiors do not intersect.
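To make the differences between these four relationships concrete, the following sketch implements them for the simplest possible case: closed, axis-aligned rectangles with non-zero width and height. This is a simplification for illustration; real geometries would be evaluated with a spatial library or a GeoSPARQL-capable store. The `Box` type and the example coordinates are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Closed axis-aligned rectangle; assumes xmin < xmax and ymin < ymax."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a, b):
    """At least one point in common (boundaries included)."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def interiors_intersect(a, b):
    """The open interiors share at least one point."""
    return (a.xmin < b.xmax and b.xmin < a.xmax
            and a.ymin < b.ymax and b.ymin < a.ymax)


def contains(a, b):
    """No point of b lies outside a."""
    return (a.xmin <= b.xmin and b.xmax <= a.xmax
            and a.ymin <= b.ymin and b.ymax <= a.ymax)


def touches(a, b):
    """Points in common, but only on the boundaries."""
    return intersects(a, b) and not interiors_intersect(a, b)


def overlaps(a, b):
    """Interiors intersect, and neither box contains the other."""
    return interiors_intersect(a, b) and not contains(a, b) and not contains(b, a)


# Hypothetical coordinates, for illustration only.
perceel = Box(0, 0, 10, 10)
gebouw = Box(2, 2, 5, 5)        # entirely inside the parcel
buurperceel = Box(10, 0, 20, 10)  # shares only an edge with the parcel
schuur = Box(8, 8, 12, 12)      # straddles the parcel boundary
```

With these boxes, `gebouw` is contained in `perceel`, `buurperceel` touches it, and `schuur` overlaps it. All three also satisfy `intersects`, which shows why Intersects is the broadest of the four.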
The IMX-Geo currently has these spatial relationships:
ligtInGemeentegebied: Perceel objects within a Gemeentegebied. They never overlap.
gebouwBinnenKlicMelding: Gebouw objects within a Klicmelding area, or having overlap with it.
adresBinnenKlicMelding: Adres objects within a Klicmelding area, or having overlap with it.
bevatBouwwerk: Perceel objects or Terrein objects that contain a Bouwwerk or have Bouwwerk objects that overlap with it.
ligtInWaterschap: Perceel objects that are within a Waterschapsgebied.
heeftBestemming: Perceel objects that are within a Bestemming or overlap with it.
Terrein bevat: Terrein objects that contain Landschapselement objects.
heeftAlsOndergrondOnderTerrein: Terrein objects that contain or have overlap with Ondergrond objects.
heeftAlsNabijWater: Terrein objects that are nearby Water objects. However, there is no GeoSPARQL relation that has this meaning. As far as I know only http://www.geonames.org/ontology#nearby expresses this.
heeftBestemming: Perceel objects that contain or have overlap with Bestemming objects.
heeftBeperking: Probably not a spatial relationship. Todo check.
RegistratieveRuimte bevat: Registratieve ruimte objects (subclasses) that contain Perceel objects.
ligtAan: Adres objects that touch a Weg object? I am not sure this is what is meant.
4.2 Property mapping
This section lists the properties contained in IMX-Geo that are mapped to external, standardized properties. This mapping is present in the IMX-Geo model. A tagged value uri is used for this.
Generic mapping of properties that occur in different classes:
geometrie: geosparql:hasGeometry
identificatie: nen3610:identificatie
domein: nen3610:domein
ligtOp/ligtIn/binnen: geosparql:sfWithin or geosparql:sfIntersects (see previous section)
bevat: geosparql:sfContains or geosparql:sfIntersects
naam: rdfs:label
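The generic mapping listed above can be held in a simple lookup table. The sketch below is one possible representation, not the way the mapping is stored in the IMX-Geo model (where a tagged value uri is used); the names `GENERIC_PROPERTY_MAPPING` and `external_properties` are hypothetical. Where the text offers two candidate properties, both are listed, since the correct one depends on the specific relationship.

```python
# Generic mapping of IMX-Geo property names to external, standardized
# properties, written as prefixed names exactly as they appear in the text.
GENERIC_PROPERTY_MAPPING = {
    "geometrie": ["geosparql:hasGeometry"],
    "identificatie": ["nen3610:identificatie"],
    "domein": ["nen3610:domein"],
    "ligtOp": ["geosparql:sfWithin", "geosparql:sfIntersects"],
    "ligtIn": ["geosparql:sfWithin", "geosparql:sfIntersects"],
    "binnen": ["geosparql:sfWithin", "geosparql:sfIntersects"],
    "bevat": ["geosparql:sfContains", "geosparql:sfIntersects"],
    "naam": ["rdfs:label"],
}


def external_properties(name):
    """Return the candidate external properties for an IMX-Geo property.

    An empty list means the property is not mapped to an external standard.
    """
    return GENERIC_PROPERTY_MAPPING.get(name, [])


naam_mapping = external_properties("naam")
unmapped = external_properties("huisnummer")  # illustrative unmapped property
```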
Specific mapping of properties of certain classes:
Other properties besides those mentioned above are not mapped to external standardized properties.
5. IMX-Geo completeness analysis
Analysis of the completeness of the IMX-Geo model.
5.1 Comparison to "stelselcatalogus clusterbegrippen"
The Stelselcatalogus provides insight into
the Dutch base registries / governmental data.
It is a collection of concepts that describe the governmental data that exists
within the different base registries. It also contains a short list of "cluster
concepts", concepts that were
added to serve as entry points to the more detailed base registry concepts.
As part of the analysis, we compared the classes in IMX-Geo to the cluster
concepts in the Stelselcatalogus.
| Stelselcatalogus | IMX-Geo | Remarks |
|---|---|---|
| Adres | Adres | Looks equivalent |
| Bouwwerk | Bouwwerk | In IMX-Geo this is a broader concept than in the Stelselcatalogus. Gebouw is a narrower concept of Bouwwerk. This conflicts with the Stelselcatalogus definition. |
| X | Gebouw | Present in IMX-Geo because it is a concept in high demand and there is a lot of data about buildings. It makes sense in IMX-Geo to define it separately as Bouwwerk is too generic for many use cases. |
| Ondergrond | Ondergrond | Looks equivalent |
| Onroerende zaak | X | IMX-Geo has the more specific concept Perceel. |
| X | Perceel | See below for rationale |
| Terrein | Terrein | The definitions / populations differ. IMX-Geo Terrein covers physical terrain and its function only. |
| Water | Water | Looks equivalent |
| Weg | Weg | Almost equivalent. IMX-Geo also relates this to Openbare Ruimte. |
| X | Registratieve Ruimte | Possible addition for Stelselcatalogus. Defined in NEN 3610. |
| X | Beperking | Source is BRK, so in scope for Stelselcatalogus, but maybe too detailed. |
| X | Grondwaterstand | Narrower concept of Ondergrond |
| X | Bodemsamenstelling | Narrower concept of Ondergrond |
| X | Bestemming | Not in a base registry; out of scope for Stelselcatalogus |
| X | Landschapselement | Not in a base registry; present in IMX-Geo because of ideas to set up a registry for this. |
| X | Spoorweg | Narrower concept of Weg |
| X | Bestuurlijk Gebied | Narrower concept of Registratieve Ruimte |
| X | Provincie | Narrower concept of Openbaar Lichaam (defined in DisGeo) |
| X | Gemeente | Narrower concept of Openbaar Lichaam (defined in DisGeo) |
| X | Gemeentegebied | Narrower concept of Bestuurlijk gebied |
| X | Waterschap | Narrower concept of Openbaar Lichaam |
| X | Waterschapsgebied | Narrower concept of Bestuurlijk gebied |
| X | Veiligheidsregio | Narrower concept of Openbaar Lichaam |
| X | Buurt | Too detailed as Cluster concept |
| X | Wijk | Too detailed as Cluster concept |
| X | Woonplaats | Not a cluster concept; but note that in IMX-Geo this falls under Registratieve Ruimte, while in the Stelselcatalogus it falls under Terrein. |
| Locatie | X | Too generic for IMX-Geo (see below) |
| Aansprakelijkheid | X | Not in scope IMX-Geo |
| Inkomen | X | Not in scope IMX-Geo |
| Natuurlijk persoon | X | Not in scope IMX-Geo |
| Omzet | X | Not in scope IMX-Geo |
| Organisatie | X | Not in scope IMX-Geo |
| Roerende zaak | X | Not in scope IMX-Geo |
| Vestiging | X | Not in scope IMX-Geo |
Adres, Bouwwerk, Ondergrond, Terrein, Water, and Weg
match Cluster concepts pretty closely.
Aansprakelijkheid, Inkomen, Locatie, Natuurlijk persoon,
Omzet, Organisatie, Roerende zaak, and Vestiging are Cluster
concepts that don't match any class in IMX-Geo. This is okay; most of them are
not in scope for IMX-Geo, because they are not about spatial objects.
The Cluster concept Locatie is too generic to use in IMX-Geo; everything has
location in IMX-Geo so it doesn't make sense to have this class there.
Perceel in IMX-Geo falls under the Cluster concept Onroerende zaak. For now,
we only need Perceel, not the other subclasses of Onroerende Zaak
(Appartementsrecht and Leidingnetwerk) but it might be interesting to add those
later. If that's the case, it would be better to add the Onroerende Zaak class
to IMX-Geo instead.
IMX-Geo has quite a few concepts that don't match a Cluster concept. Several of
these classes are a bit more specific than the Cluster concepts are, so it makes
sense they are missing from the Cluster concepts in the Stelselcatalogus. Others,
like Bestemming, are not in scope for the stelselcatalogus, because the data
for these things is not in a base registry.
For Bestuurlijk gebied it could be argued that it should be a Cluster
concept - or its superclass in NEN 3610, Registratieve
ruimte.
Gemeentegebied and Waterschapsgebied would fall under this as well.
Gemeente and Waterschap are subclasses of Openbaar
Lichaam
(introduced in DisGeo but also in TOOI); this could also be a Cluster concept as
there is government data about these governmental organisations.
5.2 Analysis of KKG and other use cases
We looked at Kadaster Knowledge Graph (KKG) use cases that were assembled in
earlier projects, as well as questions about datasets posed by users at the
Geoforum. Approximately 50% of these use cases were covered by IMX-Geo (at
analysis time, during the fourth high5 in May 2023).
Some highlights of this analysis:
Many questions used neighbourhoods (wijken/buurten)
A number of use cases needed data which is within scope of IMX-Geo, but was
not yet included in the model: e.g. trees, wells, statuses, swimming pools,
sports halls, terrain functions, playground equipment, publiekrechtelijke
beperkingen.
A number of use cases needed data that was out of scope of IMX-Geo, such as
risk objects, CBS kerncijfers, floor height.
As a result of this, Neighbourhood (buurt/wijk) was added as well as
Publiekrechtelijke beperking.
Finally, we compared the IMX-Geo model to the source models that are within
scope, to find any other classes we wanted to support.
When developing the IMX-Geo, we chose to only add those objects to the model
that are relevant to a certain use case. The IMX-Geo therefore only contains
those objects that are mentioned in a use case.
As a result of this, the IMX-Geo now contains 60% of all objects/concepts
that appear in the source models. See table below.
BAG and BGT are almost fully supported. Because KLIC is recognized as a consumer
of data by means of the IMX-Geo, IMKL is not part of IMX-Geo anymore.
The IMX-Geo can grow over time and support more objects/concepts from source
models as the IMX-Geo is expanded based on new use cases, hitting new objects.
Only the main statuses according to the concept Levensfase from the DisGeo concept framework:
Aanwezige toestand (present state), Afwezige toestand (absent state), Plantoestand (planned state). See https://begrippen.geostandaarden.nl/disgeo/nl/page/levensfase
An addition to a house number, in the form of an alphanumeric character, assigned by or on behalf of the municipal administration with respect to an addressable object.
A further addition to a house number, or to a combination of house number and house letter, assigned by or on behalf of the municipal administration with respect to an addressable object.
The set of values that data of this attribute type can have, i.e. the value range, expressed in a specific structure.
Indication of cardinality: 1
Explanation: The area that is governed by a public body has a geometric representation in the form of one or more polygons (two-dimensional surfaces).
Indication classifying: No
Possibly no value: No
Indication identifying: No
Every nummeraanduiding (number designation) whose data is recorded in the base registry for addresses and buildings (BAG) is uniquely identified by an identification code.
Indication classifying: No
Possibly no value: No
Indication identifying: Yes
Every nummeraanduiding (number designation) whose data is recorded in the base registry for addresses and buildings (BAG) is uniquely identified by an identification code.
Indication classifying: No
Possibly no value: No
Indication identifying: No
Question: How do we model, in the SAM logical information model, the missing relationships that we want to add between object types in source registries?
Arnoud and Linda each created a first attempt at modeling this.
Linda's model
In this model a WaU product model "SAM" is defined for the use case 'answer all questions' (the diagram is only showing a small part). In it, dependency relationships are used to express that the logical model SAM contains classes, which depend on (more specifically: are derived from) classes in source models.
So e.g. WaU-SAM Woonplaats is derived from BAG Woonplaats. It does not inherit its attributes. And WaU-SAM Gemeentegebied is derived from a class with the same name in the source model DisGeo Bestuurlijke gebieden.
WaU-SAM Woonplaats and Gemeentegebied also have the modeling pattern for specifying provenance on the data level.
The special thing here is that WaU-SAM Woonplaats and Gemeentegebied have an association binnen. This is not present in the BAG or DiSGeo Bestuurlijke gebieden, but an addition owned by the WaU-SAM model.
Arnoud's model
In this model generalisation relationships are used to explain how classes that are needed on the product level are related to classes in source models.
E.g. a new class Windturbine which is declared a subclass of BGT Pand.
It seems to be doable to script a translation from this model, which is strictly not a correct UML model, to a working OWL ontology + SHACL. The UML model provides a nice visualization, the generated OWL/SHACL would then be the "real", formal logical model. This would also allow us to use the UML editor as an OWL editor.
Jesse's model
The SAM model is a product model requiring flexibility in terms of vocabulary and schema. It is a model tailored to specific user requirements but it is nonetheless based on the models (MIM-1 to 3) as defined in the source models. SAM does not change the meaning of source data but instead introduces a new projection of the data, where data might be transformed or inferred. We want to exchange data about e.g. Woonplaats; but not redefine woonplaats.
In order to stay true to the original semantics the SAM product model links to the concepts defined in the respective contexts where possible. In some cases there might not be available concepts in the source context. Here new concepts must be defined. These should be matched to concepts in the respective contexts.