User:DwivediAntima/Implementation and use case of Research Information System Files

From Wikidata

Asked mentors about an item (scientific article) to work on

The article suggested by @Mike_Peel is Tunneling/shear force microscopy using piezoelectric tuning forks for characterization of topography and local electric surface properties (Q33562513).

Importing libraries:

import pywikibot
import re

# Connect to Wikidata (the variable name ptwiki is kept as-is)
ptwiki = pywikibot.Site('wikidata:wikidata', user='DwivediAntima')
# and then get its data repository
ptwiki_repo = ptwiki.data_repository()

Loading the suggested page and printing out the author's information from it

def name_information(author, pid):
    return author.claims[pid][0].getTarget().labels['en']

def label_of_property(pid):
    property_page = pywikibot.PropertyPage(ptwiki_repo, pid)
    return property_page.labels['en']

def printwikidata(wd_item):
    # Load the item's data (labels, claims, ...) from Wikidata
    item_dict = wd_item.get()
    try:
        print('Name: ' + item_dict['labels']['en'])
    except Exception:
        print('No English label!')

    # Authors that have their own Wikidata item: author (P50)
    try:
        for claim in item_dict['claims']['P50']:
            p50_value = claim.getTarget()
            p50_item_dict = p50_value.get()
            qid = p50_value.title()
            print('author            :' + p50_item_dict['labels']['en'])
            # Follow the link to the author's item and read the name properties
            author = pywikibot.ItemPage(ptwiki_repo, qid)
            if author.get():
                print('Redirecting.......')
                try:
                    # given name (P735)
                    print(f"{label_of_property('P735')}: {name_information(author, 'P735')}")
                except Exception:
                    print("That didn't work.(given name information is not available)")
                try:
                    # family name (P734)
                    print(f"{label_of_property('P734')}: {name_information(author, 'P734')}")
                except Exception:
                    print("That didn't work.(family name information is not available)")
    except Exception:
        print("That didn't work!")
    # Authors stored only as plain text: author name string (P2093)
    try:
        for claim in item_dict['claims']['P2093']:
            p2093_value = claim.getTarget()
            print(f"{label_of_property('P2093')}: {p2093_value}")
    except Exception:
        print("That didn't work!")
    return 0

page = pywikibot.ItemPage(ptwiki_repo,'Q33562513')
test = printwikidata(page)
Output
Name: Tunneling/shear force microscopy using piezoelectric tuning forks for characterization of topography and local electric surface properties
author            :Grzegorz Jóźwiak
Redirecting.......
given name: Grzegorz
family name: Jóźwiak
author name string: Mirosław Woszczyna
author name string: Paweł Zawierucha
author name string: Agata Masalska
author name string: Elzbieta Staryga
author name string: Teodor Gotszalk


Printing out the author information contained in the RIS file

# Use regular expressions to extract the information from the RIS file.
with open('C:\\Users\\Antima Dwivedi\\Downloads\\core_stable\\risfile.ris', encoding="utf8") as f:
    contents = f.read()
    print(contents)
# Print all the authors' names from the RIS file.
p = re.compile(r"AU  - (.*)")
print(p.findall(contents))
Output
['Woszczyna, MirosŁaw', 'Zawierucha, PaweŁ', 'Masalska, Agata', 'Jóźwiak, Grzegorz', 'Staryga, Elżbieta', 'Gotszalk, Teodor']
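
The regular expression above only picks out the AU lines. As a rough sketch of how the same idea could be extended to every tag in the file, the snippet below groups all values by their two-character RIS tag. The function name parse_ris and the sample text are illustrative assumptions; the TY and ER lines in the sample are not taken from the original file and are there only to make the sample look like a complete record.

```python
import re
from collections import defaultdict

# Author lines copied from the output above. The TY and ER lines are
# assumed, only so that the sample resembles a complete RIS record.
SAMPLE_RIS = """TY  - JOUR
AU  - Woszczyna, MirosŁaw
AU  - Zawierucha, PaweŁ
AU  - Masalska, Agata
AU  - Jóźwiak, Grzegorz
AU  - Staryga, Elżbieta
AU  - Gotszalk, Teodor
ER  -
"""

# A RIS line is a two-character tag, two spaces, a hyphen, then the value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  -\s*(.*)$")


def parse_ris(text):
    """Group the values of a RIS record by tag (hypothetical helper)."""
    fields = defaultdict(list)
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
            fields[tag].append(value.strip())
    return dict(fields)


record = parse_ris(SAMPLE_RIS)
print(record["AU"])
```

Repeated tags such as AU keep their order of appearance, so the author order of the file is preserved.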

Comparing outputs of both:

author names in item                     author names in RIS file
author            : Grzegorz Jóźwiak     Jóźwiak, Grzegorz
author name string: Mirosław Woszczyna   Woszczyna, MirosŁaw
author name string: Paweł Zawierucha     Zawierucha, PaweŁ
author name string: Agata Masalska       Masalska, Agata
author name string: Elzbieta Staryga     Staryga, Elżbieta
author name string: Teodor Gotszalk      Gotszalk, Teodor
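Observations:
Comparing the author names in the item with the author names in the RIS file shows that the order is reversed: the item stores each name as "first name last name", while the RIS file stores it as "last name, first name". The spelling also differs slightly in places, for example "Elzbieta" in the item versus "Elżbieta" in the RIS file, and "Mirosław" versus "MirosŁaw".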
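
One possible way to check this comparison automatically is sketched below. It reorders each RIS name into "first name last name" and then compares the two spellings after ignoring letter case and combining diacritics. The helper names ris_to_display_order and comparison_key are hypothetical, and ignoring diacritics is an assumption made for this comparison only; it is not a claim about how Wikidata matches names.

```python
import unicodedata


def ris_to_display_order(ris_name):
    """Turn 'Last, First' into 'First Last' (hypothetical helper)."""
    family, _, given = ris_name.partition(",")
    return f"{given.strip()} {family.strip()}".strip()


def comparison_key(name):
    """Lower-case the name and drop combining marks such as the dot in 'ż'."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Pairs copied from the comparison table above: (name in item, name in RIS file)
pairs = [
    ("Grzegorz Jóźwiak", "Jóźwiak, Grzegorz"),
    ("Mirosław Woszczyna", "Woszczyna, MirosŁaw"),
    ("Paweł Zawierucha", "Zawierucha, PaweŁ"),
    ("Agata Masalska", "Masalska, Agata"),
    ("Elzbieta Staryga", "Staryga, Elżbieta"),
    ("Teodor Gotszalk", "Gotszalk, Teodor"),
]

results = []
for item_name, ris_name in pairs:
    reordered = ris_to_display_order(ris_name)
    same = comparison_key(item_name) == comparison_key(reordered)
    results.append(same)
    print(f"{item_name:20} {reordered:20} match={same}")
```

The letter "Ł" has no Unicode decomposition, so it is handled by case folding rather than by removing a combining mark.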

Printing out authors' first names

with open('C:\\Users\\Antima Dwivedi\\Downloads\\core_stable\\risfile.ris', encoding="utf8") as fd:
    for line in fd:
        # Only look at author lines ("AU  - Last, First") and capture the text after the comma
        match = re.search(r'^AU  - [^,]+,\s*(.+)$', line)
        if match:
            # group(1) is the captured first name; group(0) would be the whole match
            author = match.group(1)
            print('first name: {}'.format(author))
Output
first name: MirosŁaw
first name: PaweŁ
first name: Agata
first name: Grzegorz
first name: Elżbieta
first name: Teodor

Printing out authors' last names

with open('C:\\Users\\Antima Dwivedi\\Downloads\\core_stable\\risfile.ris', encoding="utf8") as fd:
    for line in fd:
        # Capture the text between the "AU  - " tag and the comma
        match = re.search(r'^AU  - ([^,]+),', line)
        if match:
            # group(1) is the last name without the tag and the trailing comma
            author = match.group(1)
            print('last name: {}'.format(author))
Output
last name: Woszczyna
last name: Zawierucha
last name: Masalska
last name: Jóźwiak
last name: Staryga
last name: Gotszalk
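
The two loops above each read the file once. As a minimal sketch, the same information can be collected in a single pass by using one regular expression with two named groups. The function name split_authors and the list ris_lines are assumptions for illustration; the list stands in for the lines of the RIS file so that the example runs without the file.

```python
import re

# One pattern for both parts of an author line: "AU  - Last, First"
AU_LINE = re.compile(r"^AU  - (?P<family>[^,]+),\s*(?P<given>.+)$")


def split_authors(lines):
    """Return (given name, family name) pairs for every AU line."""
    authors = []
    for line in lines:
        match = AU_LINE.match(line.strip())
        if match:
            authors.append((match.group("given"), match.group("family")))
    return authors


# Stand-in for the lines of the RIS file (author lines copied from above)
ris_lines = [
    "AU  - Woszczyna, MirosŁaw\n",
    "AU  - Zawierucha, PaweŁ\n",
    "AU  - Masalska, Agata\n",
    "AU  - Jóźwiak, Grzegorz\n",
    "AU  - Staryga, Elżbieta\n",
    "AU  - Gotszalk, Teodor\n",
]

authors = split_authors(ris_lines)
for given, family in authors:
    print(f"first name: {given}, last name: {family}")
```

With a real file, the open file object can be passed to split_authors in place of ris_lines, since iterating over it also yields one line at a time.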

Improvements:

I have added the author given names (P9687) and author last names (P9688) properties for Grzegorz Jóźwiak (Q63660686).

My understanding of the internship tasks

  • In the RIS file, the names appear in reverse order: every last name is written before the first (given) name, which left me unsure about the correct first name and last name combinations. Some samples of the RIS file data:
    • AU - Woszczyna, MirosŁaw
    • AU - Zawierucha, PaweŁ
    • AU - Masalska, Agata
    • AU - Jóźwiak, Grzegorz
    • AU - Staryga, Elżbieta
    • AU - Gotszalk, Teodor
  • Some samples of the author information as stored in the item:
    • author  : Grzegorz Jóźwiak
    • author name string: Mirosław Woszczyna
    • author name string: Paweł Zawierucha
    • author name string: Agata Masalska
    • author name string: Elzbieta Staryga
    • author name string: Teodor Gotszalk
  • People across the globe have different origins, and naming conventions differ accordingly. The Wikipedia pages that contain an author's information usually also mention their country, and with the help of the country name we can read about the conventions for first names and last names there. For example, Chinese surnames usually come first, followed by the given name: in Chan Tai Man, Chan is the surname and Tai Man is the given name.

Example-1!
Example-2!

  • This research will be useful for distinguishing authors' first names from their last names based on their countries.
  • We can collect text from articles to build a corpus, which will then go through some preprocessing steps. After that, we can train a decision tree classifier to identify first and last names, using cross-validation to evaluate it.
  • The code written in task 2 and task 3 can be modified to find a more efficient solution (with lower time complexity).
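
As a small illustration of the name-order point above, the sketch below splits a full name according to a flag that says whether the family name comes first. The function name split_by_convention and the flag family_name_first are hypothetical, and the sketch assumes that the family name is a single word, which will not hold for every name.

```python
def split_by_convention(full_name, family_name_first):
    """Split a full name into family and given parts (hypothetical helper).

    Assumes the family name is exactly one word; everything else is
    treated as the given name.
    """
    parts = full_name.split()
    if len(parts) < 2:
        # A single word cannot be split, so keep it as the family name
        return {"family": full_name.strip(), "given": ""}
    if family_name_first:
        return {"family": parts[0], "given": " ".join(parts[1:])}
    return {"family": parts[-1], "given": " ".join(parts[:-1])}


# Family name first, as in the Chinese example above
print(split_by_convention("Chan Tai Man", family_name_first=True))
# Given name first, as in the item's author label
print(split_by_convention("Grzegorz Jóźwiak", family_name_first=False))
```

In practice the flag would have to come from some other source, such as the author's country, which is exactly the information the research above is meant to provide.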

Conclusion

  • I really enjoyed working on the tasks and learning about Wikidata. Whatever the results, it would be a pleasure to keep contributing to the Wikimedia community in the future, beyond this Outreachy program. Kindly let me know about future opportunities.
  • Gmail-antimadwivedi28@gmail.com
  • Github