Wikidata:WikiProject LD4 Wikidata Affinity Group/Affinity Group Calls/Meeting Notes/2022-07-12

Call Details

Presentation Materials

Notes

  • Created in 2021, first research structure in Tunisia
    • Team composed of members from various disciplines, universities
    • Adapting Wikidata to support clinical practice
  • Introduction
    • Items are aligned to external biomedical resources (PubMed, MeSH, etc.)
    • Wikidata statements supported by references
  • Biomedical Knowledge in WD
    • Various types of biomedical items
    • Multiple languages, mostly European and Asian
    • Uneven coverage of natural languages for biomedical entities
  • Parsing WD
    • Use Wikidata Query Service, MediaWiki API
    • Finding insights
    • Synthesizing data based on integrating information
  • Easily extensible
    • Everyone can create new items and apply for new properties; easy creation of data models; easy alignment to external resources; intuitive embedding in bots; data models can change upon community consent
  • Biomedical entities dominated by genes and proteins
    • Many classes of biomedical items poorly supported by references
  • What Wikidata really needs
    • A way to allow relation extraction, relation classification…
  • Can use Scholarly Publications to do this
    • 1.6 million papers issued, indexed in PubMed, also Web of Science, PubMed Central, DataCite
  • Research publications in brief
    • Full texts cannot be analyzed easily (huge size; they include natural language, tables, etc.)
    • But bibliographic data in references is limited in size, structured, and annotated by design
  • PubMed search tags
    • Can be used to enrich bibliographic metadata in WD despite several legal concerns
    • Processing data can be used to enrich scientific knowledge in Wikidata
  • MeSH Keywords
    • Controlled keywords assigned to PubMed records by the curators of NCBI databases
    • Biopython Python library
  • Relation classification
    • Relation classification based on MeSH keywords
    • Tried to find associations between keywords
    • Need a dataset of biomedical relations
    • Concepts assigned labels
    • Taxonomic relations and non-taxonomic relations
    • Property constraints
    • Aligned to MeSH terms
  • Formulated a SPARQL query accordingly
  • Biomedical Relation Classification
    • Machine Learning Models => Evaluation Metrics
  • Machine Learning Models
    • D-Net: Fully Connected or Dense Model
    • C-Net: Convolutional Neural Networks (CNNs)
    • Evaluation Metrics
      • Accuracy
      • Precision
      • Recall
  • MeSH2 Matrix Generation
    • Biomedical Relation Classification
      • 29,000 samples
      • Good results, accuracy 78%, 83%
    • Data availability
      • Available via GitHub
  • Relation Extraction and Validation
    • Pointwise Mutual Information
      • A simple measure of association between entities
      • Used in computational linguistics for finding associations between words
      • MeSH Keywords are predefined and formatted
  • Process for relation extraction and validation
    • Extract and compute PMI between MeSH keywords
    • Find relation types between MeSH keywords
    • Formulate query and search in PubMed
  • Finding relations between keywords
    • 30% as a training set
    • 70% as a test set
    • Classifying extracted association
    • Human validation
  • Reference Identification
    • Process: Extract unreferenced WD statements => Identify the most relevant PubMed Central publications => Find the supporting sentence for each claim => Align PMC ID with WD ID => Add the obtained references to WD
  • Principles
    • Find MeSH equivalent of subject and object, for relation type
  • Tools for Bot Creation
    • Recommend:
      • Wikibase Integrator: Python library for working with statements in WD
      • Wikidata Hub
      • Wikidata Query Service: Can submit some queries in Python
      • Biopython: Python library for working with bibliographic info, mainly from PubMed and PubMed Central
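The pointwise mutual information (PMI) step noted above, "Extract and compute PMI between MeSH keywords", can be illustrated with a minimal sketch. This is not the presenters' code: the function name pmi_scores, the use of log base 2, and the toy records are assumptions made for illustration. It treats each publication as a set of keywords and computes PMI(a, b) = log2( p(a, b) / (p(a) p(b)) ), where the probabilities are the fractions of records containing the keyword or the pair.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(records):
    """Compute PMI (in bits) for every keyword pair that co-occurs at least once.

    records: iterable of keyword collections, one collection per publication.
    Returns a dict mapping an alphabetically sorted (keyword_a, keyword_b)
    tuple to its PMI value.
    """
    keyword_counts = Counter()
    pair_counts = Counter()
    total = 0
    for keywords in records:
        unique = sorted(set(keywords))  # count each keyword once per record
        total += 1
        keyword_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    scores = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / total
        p_a = keyword_counts[a] / total
        p_b = keyword_counts[b] / total
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy records invented for illustration only (not real PubMed data).
toy_records = [
    ["Aspirin", "Headache"],
    ["Aspirin", "Headache"],
    ["Aspirin", "Fever"],
    ["Insulin", "Diabetes Mellitus"],
]
scores = pmi_scores(toy_records)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

A higher PMI means the two keywords appear together more often than their individual frequencies would predict. Pairs that never co-occur are simply absent from the result. In practice a minimum co-occurrence count would probably be needed, since PMI tends to overrate rare pairs.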

Questions

  • Q: How did you gather the group of people? Through conferences. Because people are shy, form consortiums (e.g., the machine learning community) so people can join. There is a huge amount of work that many people can do, so share the work in groups, and then you get collaborators.
  • Q: Did you need to address potential "spam" or manipulation of MeSH headings in the medical literature, or did you just rely on reputable domains? It depends. The approach is mainly probabilistic: when many papers show a given association, it is probably correct. Manipulation is possible at the micro scale but difficult at the macro scale. The MeSH headings are assigned by NCBI curators or staff, so they are fairly reliable. Retracted and partially retracted literature was also taken into account: PubMed flags retractions, and that tag is used to avoid relying on information that is not relevant. This is important.
  • Q: Have you used any pre-ingestion tool to clean the data? Keyword assignment is a drag-and-drop process: a keyword is not free text typed by a human but mainly an item chosen from a list. Irrelevant or wrong words, typos, etc. are therefore not a problem. Terms are predefined and controlled, so there is no redundancy or messy data.
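The answer about retractions above, together with the "Formulate query and search in PubMed" step in the notes, can be sketched as a small query builder. This is a hedged illustration, not the presenters' implementation: the function names build_pubmed_query and search_pubmed are hypothetical. It assumes the standard PubMed field tags [MeSH Terms] and [Publication Type], and the "Retracted Publication" publication type as the retraction flag. The optional search helper uses Biopython's Entrez.esearch and Entrez.read, needs network access and a contact e-mail, and is not called here.

```python
def build_pubmed_query(keyword_a, keyword_b, exclude_retracted=True):
    """Build a PubMed query for records indexed with both MeSH keywords.

    When exclude_retracted is True, records flagged with the
    "Retracted Publication" publication type are filtered out.
    """
    query = f'"{keyword_a}"[MeSH Terms] AND "{keyword_b}"[MeSH Terms]'
    if exclude_retracted:
        query = f'({query}) NOT "Retracted Publication"[Publication Type]'
    return query


def search_pubmed(query, email, max_results=20):
    """Run the query against PubMed and return a list of PMIDs.

    Requires Biopython and network access; imported lazily so that the
    query builder above works without Biopython installed.
    """
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return list(result["IdList"])


# Example keywords chosen for illustration only.
query = build_pubmed_query("Aspirin", "Headache")
print(query)
```

Keeping the query construction separate from the network call lets the retraction filter be checked offline before any request is sent to NCBI.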