Wikidata:ScienceSource project/MeSH and cleanup dashboard


In the last quarter of 2019, a major part of the work on ScienceSource was matching the main Medical Subject Headings (MeSH) to Wikidata items. These headings are identified by the D-numbers of the MeSH system. They should be matched 1-to-1 here, so that the "reverse lookup" from D-number to Wikidata Q-number can work.

Example: starting from D012343, we can get to transfer RNA (Q201448). There are good tools for this, e.g. the dedicated Resolver tool at https://tools.wmflabs.org/wikidata-todo/resolver.php, where you enter 486 for the property and D012343 for the value.
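The reverse lookup can also be phrased as a one-line query for the Wikidata Query Service. The following is a minimal Python sketch that only builds the query text; the function name and the D-number pattern ("D" followed by 6 or 9 digits) are assumptions made for illustration, not part of any ScienceSource tool.

```python
import re

# Assumed shape of a MeSH descriptor ID: "D" plus 6 or 9 digits.
D_NUMBER = re.compile(r"D\d{6}(\d{3})?")

def reverse_lookup_query(d_number: str) -> str:
    """Build a SPARQL query listing every item that carries this P486 value."""
    if not D_NUMBER.fullmatch(d_number):
        raise ValueError(f"not a MeSH D-number: {d_number!r}")
    return f'SELECT ?item WHERE {{ ?item wdt:P486 "{d_number}" . }}'

print(reverse_lookup_query("D012343"))
```

If the matching is 1-to-1, running the generated query should return exactly one item.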

We want this basic operation to work in all cases, with no ambiguities arising. This is a requirement for the NCBI2wikidata tool to operate correctly. Of course the D-numbers must also be on the correct items, which raises another set of issues. Many matches have been inaccurate: the label may have been OK, but the item may actually be, for example, for an article rather than the required topic (this has been quite common).
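The 1-to-1 requirement can be checked offline on any exported list of (Q-number, D-number) pairs. The sketch below is one way to do that in Python; the function name and the sample identifiers (Q_A, Q_B, D_X, D_Y, D_Z) are placeholders, not real data.

```python
from collections import defaultdict

def find_ambiguities(pairs):
    """Return (D-numbers on several items, items with several D-numbers)."""
    items_by_d = defaultdict(set)
    ds_by_item = defaultdict(set)
    for q_number, d_number in pairs:
        items_by_d[d_number].add(q_number)
        ds_by_item[q_number].add(d_number)
    shared = {d: sorted(qs) for d, qs in items_by_d.items() if len(qs) > 1}
    multiple = {q: sorted(ds) for q, ds in ds_by_item.items() if len(ds) > 1}
    return shared, multiple

# Placeholder pairs: D_X sits on two items, and Q_B carries two D-numbers.
sample = [("Q_A", "D_X"), ("Q_B", "D_X"), ("Q_B", "D_Y"), ("Q_C", "D_Z")]
shared, multiple = find_ambiguities(sample)
print(shared)
print(multiple)
```

Both dictionaries are empty when the matching is unambiguous in both directions.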

Table

These stats relate to the issues of:

  1. Getting a complete set of MeSH descriptor ID (P486) D-number statements into Wikidata. All the mix'n'match catalogs at https://tools.wmflabs.org/mix-n-match/#/group/medical are now complete. Since those catalogs were created towards the end of 2017, there are now MeSH updates to take into account. The current total from mix'n'match is around 28.5K, but the actual current total might be nearer 30K.
  2. Removing the database constraint violations caused by D-number duplications.[1]
  3. Creating the MeSH tree code (P672) statements that go with MeSH descriptor ID (P486).[2]
  4. Building up the tagging of items for review articles with copyright license (P275) statements.
  5. Creating at least one main subject (P921) statement for all such items.

See the footnotes for further details.

MeSH has a major application to searching PubMed (Q180686), which is how it enters ScienceSource, at the very start of the software pipeline. To automate batch searches, the term itself is being entered as a subject named as (P1810) qualifier on the MeSH descriptor ID (P486) statements. The effect is to allow lists of MeSH terms to be gathered from SPARQL queries.
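As a rough illustration of how the qualifier could feed a batch search, the Python sketch below holds a query that gathers the P1810 qualifier values and a helper that joins terms into a PubMed search expression using the `[MeSH Terms]` field tag. The helper name and the way the expression is combined are assumptions; the actual ScienceSource pipeline is not described here.

```python
# Wikidata query: each D-number together with its "subject named as" qualifier.
TERMS_QUERY = """
SELECT ?item ?mesh ?term WHERE {
  ?item p:P486 ?statement .
  ?statement ps:P486 ?mesh ;
             pq:P1810 ?term .
  FILTER(STRSTARTS(?mesh, "D"))
}
"""

def pubmed_mesh_search(terms):
    """Join MeSH terms into one OR-ed PubMed search expression."""
    cleaned = [t.replace('"', "").strip() for t in terms]
    return " OR ".join(f'"{t}"[MeSH Terms]' for t in cleaned if t)

# Illustrative terms only; in practice they would come from TERMS_QUERY results.
print(pubmed_mesh_search(["RNA, Transfer", "Ribosomes"]))
```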

Date (DD/MM/YYYY) | MeSH D-number total[3] | MeSH descriptor ID (P486) unique value constraint violations[4][5][6] | MeSH D-number MeSH descriptor ID (P486) lacking MeSH tree code (P672)[7] | review article (Q7318358) with copyright license (P275)[8] | Review articles with license lacking main subject (P921)[9]
18/08/2019 | 19779 | 2420 | 11324 | 57360 | 4720
20/09/2019 | 19820 | 2322 | 11204 | 57547 | 4686
27/09/2019 | 20046 | 2195 | 11143 | 60297 | 3582
3/10/2019 | 21110 | 2058 | 12074 | 62062 | 3215
22/10/2019 | 23512 | 1979 | 14212 | 64494 | 2542
4/11/2019 | 24712 | 1820 | 15326 | 66292 | 2052
8/12/2019 | 28205 | 1729 | 18320 | 68043 | 1751
15/12/2019 | 28635 | 1717 | 18842 | 68277 | 1663
24/12/2019 | 28706 | 1688 | 18335 | 55230 | 1561
11/07/2020 | 28701 | 1616 | 18052 | 60762 | 1634
1/08/2020 | 28689 | 1354 | 17643 | 62439 | 1459
23/08/2020 | 28699 | 1189 (463 C-numbers, 726 D-numbers) | 17111 | 63877 | 1288
13/10/2020 | 28716 | 1052 (462 C-numbers, 590 D-numbers) | 16373 | 69676 | 805
13/11/2020 | 28697 | 456 (446 C-numbers, 10 D-numbers) | 16003 | 72656 | 597
18/01/2021 | 28726 | 454 (440 C-numbers, 14 D-numbers) | 15674 | 75915 | 465
1/01/2022 | 28883 | 470 (436 C-numbers, 34 D-numbers) | 13227 | 87496 | 248
10/05/2022 | 29125 | 584 (442 C-numbers, 144 D-numbers) | 12218 | 87523 | 222
17/11/2022 | 30302 | 577 (442 C-numbers, 135 D-numbers) | 11219 | 94418 | 1782
2/07/2023 | 30535 | 560 (443 C-numbers, 117 D-numbers) | 11041 | 133275 | 18595

Reviews with license

Date (DD/MM/YYYY) | CC0[10] | public domain sign[11] | open access[12] | CC[13] | CC-BY[14] | CC-BY-SA[15] | CC-BY-SA 2.0[16] | CC-BY 2.5[17] | CC-BY-SA 2.5[18] | CC-BY 3.0[19] | CC-BY-SA 3.0[20] | CC-BY-SA 4.0[21] | CC-BY-SA 4.0 Int[22] | CC-BY-NC[23] | CC-BY-NC 2.5[24] | CC-BY-NC-SA 2.5[25] | CC-BY-NC-ND[26] | CC-BY-NC-ND 3.0[27] | CC-BY-NC 4.0[28]
2/07/2023 | 222 | 661 | 1 | 2 | 22718 | 0 | 8308 | 2149 | 1 | 14230 | 0 | 208 | 51008 | 6620 | 248 | 30 | 2849 | 1867 | 6704

MeSH updates

The mix'n'match catalogs were compiled in 2017. Supplementary work then added the MeSH D-numbers introduced in 2018-2022. The following type of query, run at https://id.nlm.nih.gov/mesh/query, can retrieve subsequent additions:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>

SELECT DISTINCT ?s
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  ?s meshv:dateEstablished "2023-01-01"^^xsd:date
}

ORDER BY ASC(?s)
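To repeat this check for each annual MeSH release, the query text can be generated per year. The Python sketch below is a hypothetical helper; it assumes that each year's new descriptors carry a dateEstablished of 1 January, as in the queries on this page.

```python
def established_query(year: int) -> str:
    """Build the dateEstablished query for descriptors added in a given year."""
    return (
        "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
        "PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>\n"
        "\n"
        "SELECT DISTINCT ?s\n"
        "FROM <http://id.nlm.nih.gov/mesh>\n"
        "WHERE {\n"
        f'  ?s meshv:dateEstablished "{year}-01-01"^^xsd:date\n'
        "}\n"
        "ORDER BY ASC(?s)"
    )

print(established_query(2024))
```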

Version to search by initial letter of tree code:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX mesh2024: <http://id.nlm.nih.gov/mesh/2024/>
PREFIX mesh2023: <http://id.nlm.nih.gov/mesh/2023/>
PREFIX mesh2022: <http://id.nlm.nih.gov/mesh/2022/>

SELECT DISTINCT ?d ?name ?tn
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
?d meshv:treeNumber ?tn;
meshv:dateEstablished "2024-01-01"^^xsd:date.
?d rdfs:label ?name .
 FILTER (regex (?tn, "Z"))
 
}

ORDER BY ASC(?d)

MeSH tree comparison

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX mesh2024: <http://id.nlm.nih.gov/mesh/2024/>
PREFIX mesh2023: <http://id.nlm.nih.gov/mesh/2023/>
PREFIX mesh2022: <http://id.nlm.nih.gov/mesh/2022/>

SELECT ?d ?name ?tn 
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
?d meshv:treeNumber ?tn.
?d rdfs:label ?name .
 FILTER (regex (?tn, "D03.132"))
 }

order by asc(?tn)

#Corresponding Wikidata query, same tree code prefix
SELECT ?item ?mesh ?itemLabel ?string
WHERE {
  ?item wdt:P672 ?string ;
        wdt:P486 ?mesh .
  FILTER (STRSTARTS(?string, "D03.132"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
order by asc(?string)
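Once both result sets are exported, the comparison itself is a set difference. The Python sketch below assumes that the MeSH endpoint returns descriptors and tree numbers as URIs under http://id.nlm.nih.gov/mesh/, while Wikidata returns plain strings; the function names and the sample rows (D_A, D_B, D_C) are placeholders.

```python
def last_segment(value: str) -> str:
    """Strip a URI down to its final path segment; plain strings pass through."""
    return value.rsplit("/", 1)[-1]

def compare_trees(mesh_rows, wikidata_rows):
    """Return (pairs only in MeSH, pairs only in Wikidata) as sorted lists."""
    mesh = {(last_segment(d), last_segment(tn)) for d, tn in mesh_rows}
    wikidata = {(mesh_id, code) for mesh_id, code in wikidata_rows}
    return sorted(mesh - wikidata), sorted(wikidata - mesh)

# Placeholder rows: (descriptor, tree number) from each source.
mesh_rows = [
    ("http://id.nlm.nih.gov/mesh/D_A", "http://id.nlm.nih.gov/mesh/D03.132.1"),
    ("http://id.nlm.nih.gov/mesh/D_B", "http://id.nlm.nih.gov/mesh/D03.132.2"),
]
wikidata_rows = [("D_A", "D03.132.1"), ("D_C", "D03.132.9")]

missing, extra = compare_trees(mesh_rows, wikidata_rows)
print(missing)  # tree codes still to add on Wikidata
print(extra)    # candidates for obsolete or misplaced codes
```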

Large items queries

#Megabyte range scholarly article items, review articles with CC license
SELECT ?item ?statementcount ?itemLabel
   WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 [ ];
                wikibase:statements ?statementcount.
    FILTER (?statementcount > 600 )
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }

For referencing buildup:

#Locating P31 statements referenced to PubMed, multiple retrievals
SELECT ?statement (COUNT(?statement) AS ?count)
  WHERE {?reference pr:P248 wd:Q180686;
                    pr:P813 ?date.
         ?statement prov:wasDerivedFrom ?reference.
         ?statement ps:P31 wd:Q7318358.
        }
 GROUP BY ?statement
 HAVING (COUNT(?statement) > 10)

#Locating P921 statements referenced to PubMed, specific value
SELECT ?statement 
  WHERE {?reference pr:P248 wd:Q180686.
         ?statement prov:wasDerivedFrom ?reference.
         ?statement ps:P921 wd:Q2335423.
        }

#Locate all statements, items and retrieval dates for a property given reference type
SELECT DISTINCT ?date ?statement ?item 

WHERE {?reference pr:P248 wd:Q180686;
                    pr:P813 ?date.
         ?statement prov:wasDerivedFrom ?reference.
         ?statement ps:P921 wd:Q8084905.
         ?item p:P921 ?statement}

ORDER BY ASC(?date)

Lone MeSH statements

#Lone D-number MeSH statements
SELECT DISTINCT ?item ?subject ?itemLabel

WHERE {?item wdt:P486 ?subject.
     ?item wikibase:statements ?statementcount.
   FILTER ( ?statementcount = 1 )
   FILTER (STRSTARTS(?subject, "D"))
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  }

Notes

  1. As far as is reasonable. Some edge cases are mentioned on Wikidata talk:Database reports/Constraint violations/P486.
  2. There are often multiple statements, which can be incomplete here. There are also obsolete codes to delete, and the codes are updated.
  3. #All P486 identifiers starting with D
    SELECT ?item ?mesh
      WHERE {?item wdt:P486 ?mesh.
      
             FILTER(strstarts(?mesh, 'D'))
             }
    
  4. Wikidata:Database reports/Constraint violations/P486
  5. # Quick query for items with most values of the property P486 after User:Infovarius, 2019-07-15
    SELECT ?item ?itemLabel ?cnt
    {
      {
           SELECT ?item (COUNT(?value) AS ?cnt)
           {
              ?item wdt:P486 ?value
              FILTER(STRSTARTS(?value, 'D'))
           }
           GROUP BY ?item ORDER BY DESC(?cnt) LIMIT 100
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }
    ORDER BY DESC(?cnt)
    
  6. #Unique value checker, for P486
    SELECT DISTINCT ?item1 ?item1Label ?item2 ?item2Label ?value 
    {
    	?item1 wdt:P486 ?value .
    	?item2 wdt:P486 ?value .
    	FILTER( ?item1 != ?item2 && STR( ?item1 ) < STR( ?item2 ) ) .
        FILTER( STRSTARTS(?value, 'D') )
    	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
    }
    LIMIT 1000
    
  7. #Topics with MeSH descriptor ID (D-number) lacking MeSH tree code (P672)
    SELECT DISTINCT ?item ?itemLabel
      WHERE {?item wdt:P486 ?meshid.
             FILTER(STRSTARTS(?meshid,"D"))
             MINUS {?item wdt:P672 [ ]}
             
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
             }
    
  8. #Article items, reviews with license
    SELECT ?item ?itemLabel
       WHERE {?item wdt:P31 wd:Q7318358;
                wdt:P275 [ ].
           }
    
  9. #Article items, reviews with license, lacking main subject (NB the date ordering, while convenient, introduces duplications on the basis of e-publication being earlier than hard copy - the table uses the number calculated without ?date)
    SELECT ?item ?itemLabel ?date
       WHERE {?item wdt:P31 wd:Q7318358;
                    wdt:P577 ?date;
                wdt:P275 [ ].
              
                MINUS{?item wdt:P921 [ ]}
             SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
           }
    
    ORDER BY ASC(?date)
    
  10. #Article items, reviews with CC0 license
    SELECT ?item 
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q6938433.
              }
    
  11. #Article items, reviews with PDM license
    SELECT ?item 
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q7257361.
              }
    
  12. #Article items, reviews with "open access" license
    SELECT ?item 
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q232932.
              }
    
  13. #Article items, reviews with CC license
    SELECT ?item 
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q284742.
              }
    
  14. #Article items, reviews with CC-BY license
    SELECT ?item 
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q6905323.
              }
    
  15. #Article items, reviews with CC-BY-SA license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q6905942.
              }
  16. #Article items, reviews with CC-BY-SA 2.0 license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q19125117.
              }
  17. #Article items, reviews with CC-BY 2.5 license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q18810333.
              }
  18. #Article items, reviews with CC-BY-SA 2.5 license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q19113751.
              }
  19. #Article items, reviews with CC-BY 3.0 license
    SELECT ?item 
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q14947546.
              }
    
  20. #Article items, reviews with CC-BY-SA 3.0 license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q14946043.
              }
  21. #Article items, reviews with CC-BY-SA 4.0 license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q18199165.
              }
  22. #Article items, reviews with CC-BY-SA 4.0 Int license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q20007257.
              }
  23. #Article items, reviews with CC-BY-NC license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q6936496.
              }
  24. #Article items, reviews with CC-BY-NC 2.5 license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q19113746.
              }
  25. #Article items, reviews with CC-BY-NC-SA 2.5 license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q19068212.
              }
  26. #Article items, reviews with CC-BY-NC-ND license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q6937225.
              }
  27. #Article items, reviews with CC-BY-NC-ND 3.0 license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q19125045.
              }
  28. #Article items, reviews with CC-BY-NC 4.0 license
    SELECT ?item
       WHERE {?item wdt:P31 wd:Q13442814;
                wdt:P31 wd:Q7318358;
                wdt:P275 wd:Q34179348.
              }