Hm, no offence, but this level of duplication seems rather careless. Certain kinds of data in CDDA are probably better not matched and imported automatically. I see that these exact law references are given in CDDA, but they are nonetheless arbitrary text, with most legalReferenceName values referring to a unique paragraph in some legislative act. Some kinds of data would probably give much better results if handled in smaller batches, e.g. you can treat sites of one designation in one way and sites of another designation in another way.
I'm not sure which 19789 sites exactly you have in mind. I do notice that CDDA includes 12703 sites of designation type EE23, none of which even have a name, i.e. their siteName values are just numbered, like "VEP nr.102011", "VEP nr.130107" etc. Such sites, comparable to cadastral parcels and the like, might not even meet WD:N. I also checked that WD currently has 1305 protected natural objects with Environmental Register code (Estonia) (P4689), of which 177 are without a WDPA id. Some of these may be former protected objects that are no longer available in WDPA/CDDA, but the others should be matched. I don't know if there is some external dataset that matches P4689 values directly to P809 values. If not, then you can use CDDA's "nationalId" column and match it to the properties > id value in queries like this to get the P4689 value.
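A rough sketch of that two-step matching in Python, assuming the CDDA rows and the query results are already loaded as lists of dicts, and that you have a separate mapping from P4689 values to existing item ids. Only `nationalId` and the `properties` > `id` path come from the above; the key `register_code`, the function names and all sample values are made up for illustration.

```python
def build_register_lookup(features):
    """Map each properties.id value to the register code (the P4689 value)."""
    lookup = {}
    for feature in features:
        props = feature.get("properties", {})
        # "register_code" is a hypothetical key name: use whichever field
        # of the query result actually carries the P4689 value.
        if "id" in props and props.get("register_code"):
            lookup[str(props["id"])] = props["register_code"]
    return lookup


def match_sites(cdda_rows, features, items_by_register_code):
    """Split CDDA rows into those matched to an existing item and the rest."""
    lookup = build_register_lookup(features)
    matched, unmatched = [], []
    for row in cdda_rows:
        national_id = str(row["nationalId"])
        code = lookup.get(national_id)
        qid = items_by_register_code.get(code)
        if qid:
            matched.append((national_id, code, qid))
        else:
            # Leave these for manual review instead of creating new items.
            unmatched.append(national_id)
    return matched, unmatched
```

Anything that ends up in `unmatched` is probably better reviewed by hand than turned into a new item automatically.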
If you plan to import more data about Estonian protected objects, then the following would be appreciated. Otherwise I feel that the manual cleanup work you leave to others, including myself, is tremendous. In summary:
- drop misleading/dubious inception/start time and official name statements
- use P31=protected area (Q473972) statement only for sites of certain designations like nature reserve and nature park (protected landscape), and not for other designations that are *not* for protected areas like individual protected natural object (protected nature monument)
- drop area statement for nature monuments (trees, boulders) and other sites with meaningless "0 ha" value
- if the preliminary Estonian-language description for an individual protected object is its designation, then it should be in lower case, e.g. "kaitstav looduse üksikobjekt"
- if data is matched to existing item, then don't import duplicating values to statements that already have value that is expected to be valid, and don't overwrite existing descriptions that are expected to be more accurate
- avoid some weird stuff such as setting designation code as alias for individual protected object, like here lately
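To make the points above concrete, here is a minimal sketch in Python of how an import script could apply them per site. All field names (`designation`, `area_ha`, `designation_et`), the function name and the designation labels are assumptions for illustration, not actual CDDA column names; only P31 and Q473972 are taken from the list.

```python
# Designations that really are protected areas (assumed labels; extend as needed).
PROTECTED_AREA_DESIGNATIONS = {"nature reserve", "nature park"}
INDIVIDUAL_OBJECT = "individual protected natural object"


def plan_import(site, existing_statements=None, existing_description_et=None):
    """Return (statements to add, Estonian description to set or None)."""
    existing_statements = existing_statements or {}
    designation = site["designation"]
    planned = {}

    # P31 = protected area only for designations that are protected areas.
    if designation in PROTECTED_AREA_DESIGNATIONS:
        planned["P31"] = "Q473972"

    # Skip the area for individual objects and for meaningless "0 ha" values.
    area = site.get("area_ha") or 0
    if area > 0 and designation != INDIVIDUAL_OBJECT:
        planned["area_ha"] = area

    # Inception/start time and official name are deliberately never copied.

    # Do not duplicate statements that already have a value on the item.
    planned = {k: v for k, v in planned.items() if k not in existing_statements}

    # Only set a description if none exists, and keep it in lower case.
    description = None
    if existing_description_et is None and designation == INDIVIDUAL_OBJECT:
        description = site.get("designation_et", "").lower() or None

    return planned, description
```

The alias point needs no code: the sketch simply never produces aliases.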
Also, it would be appreciated if CDDA sites in Estonia were matched to the designations that are currently used consistently as heritage designation (P1435) values for protected natural objects in Estonia (see query), and if these designations were used for new items as well.