User:Glorian WD/Clustering Result v2

The result described on this page is an iteration based on the feedback on User:Glorian_WD/Clustering_Result.

Background

  • I extracted a new sample. Initially, I extracted 1000 items that have at least 60 property pairs (~100k in size/rev_len), because the example items in the existing scale B generally have a size of around 100k.
However, the clustering result for these items was not very good. I suspected this was because of the large gap in the number of property pairs per item. Since I only required at least 60 property pairs, one item could have 60 property pairs while another could have 90.
Hence, I tried to narrow the size gap. I used the simple LEN function in LibreOffice Calc to measure the property pairs contained in each item, and trimmed my sample down to 333 items whose property pairs have a LEN of ~1000 to ~2000. I am aware that LEN may not be a good measure of the number of property pairs per item, because the property pair 31-49999999999999 is equal to 31-5 in terms of quantity (both count as 1 property pair, although 31-49999999999999 has a larger LEN than 31-5).
Nevertheless, I thought this would not have much impact, so I decided to use these 333 items. After all, this is still only my hypothesis. If the clustering problem with the initial 1000 items was indeed caused by the large size gap, I can always extract new items with a tighter size gap (e.g. items with 60-70 property pairs).
  • This time, in addition to experimenting with the number of clusters and displayed centroids, I also tweaked the TF-IDF Vectorizer parameters.
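The difference between character length (LEN) and the actual number of property pairs can be illustrated with a minimal sketch. It assumes each item's property pairs are stored as one whitespace-separated string such as "31-5 17-142". That storage format, the helper names, and the toy item IDs are assumptions for illustration, not the format of the actual extraction.

```python
from typing import Dict


def count_property_pairs(pairs_text: str) -> int:
    """Count property pairs by token, regardless of how long each token is.

    Assumption: pairs are separated by whitespace, e.g. "31-5 17-142".
    """
    return len(pairs_text.split())


def filter_by_pair_count(items: Dict[str, str], low: int, high: int) -> Dict[str, str]:
    """Keep only the items whose number of property pairs lies in [low, high]."""
    return {
        item_id: text
        for item_id, text in items.items()
        if low <= count_property_pairs(text) <= high
    }


# The two example pairs from the text: different LEN, same pair count.
short_pair = "31-5"
long_pair = "31-49999999999999"
print(len(short_pair), count_property_pairs(short_pair))
print(len(long_pair), count_property_pairs(long_pair))

# Hypothetical toy sample (made-up IDs and pairs), filtered to exactly 2 pairs.
toy_items = {"item_a": "31-5 17-142", "item_b": "31-5"}
print(filter_by_pair_count(toy_items, 2, 2))
```

With real data, calling filter_by_pair_count(items, 60, 70) would select the tighter 60-70 band mentioned above by pair count rather than by string length.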

Achieved Result

Below is the clustering result plot using the new sample.

Below are the details of the items in each cluster.

Below is information about some of the items from the above clusters.

| Item | Cluster # | Item name | Item description |
|---|---|---|---|
| Q3271551 | 0 | Smoothened, frizzled class receptor | mammalian protein found in Homo sapiens |
| Q357439 | 0 | Adiponectin, C1Q and collagen domain containing | mammalian protein found in Homo sapiens |
| Q21172390 | 0 | Kinase insert domain receptor | mammalian protein found in Homo sapiens |
| Q287961 | 0 | Adrenoceptor beta 2 | mammalian protein found in Homo sapiens |
| Q4734884 | 0 | Adrenoceptor alpha 1A | mammalian protein found in Homo sapiens |
| Q14062179 | 1 | Gaolou | place in China |
| Q169140 | 1 | Daotian | town in China |
| Q10902759 | 1 | Beimengzhen | town in Weifang, Shandong, China |
| Q11113438 | 1 | Qiaotouhe | urban town in Lianyuan, Loudi City, Hunan Province, People's Republic of China |
| Q14773949 | 1 | Kanjiazhen | town in Weifang, Shandong, China |
| Q21118717 | 2 | Myocyte enhancer factor 2C | mammalian protein found in Homo sapiens |
| Q254943 | 2 | Retinoic acid receptor alpha | mammalian protein found in Homo sapiens |
| Q21108503 | 2 | Paired like homeodomain 2 | mammalian protein found in Homo sapiens |
| Q423698 | 2 | Interleukin 10 | mammalian protein found in Homo sapiens |
| Q21116377 | 2 | SIX homeobox 1 | mammalian protein found in Homo sapiens |
| Q23757242 | 3 | A comprehensive glossary of autophagy-related molecules and processes (2nd edition) | scientific article |
| Q26801759 | 3 | Genetic Diversity Underlying the Envelope Glycoproteins of Hepatitis C Virus: Structural and Functional Consequences and the Implications for Vaccine Design | scientific article |
| Q26829681 | 3 | Ion Channels in the Heart | scientific article |
| Q24650179 | 3 | Early steps in the DNA base excision/single-strand interruption repair pathway in mammalian cells | scientific article |
| Q24629682 | 3 | PDZ domains and their binding partners: structure, specificity, and modification | scientific article |
| Q142 | 4 | France | republic in Western Europe |
| Q787127 | 4 | BAFTA Award for Best Special Visual Effects | no description |
| Q838329 | 4 | Chrudim District | district of Czech Republic |
| Q1404656 | 4 | We Are All Murderers | 1952 French-Italian film by André Cayatte |
| Q16199 | 4 | Province of Lecco | province of Italy |
| Q14911646 | 5 | Fibroblast growth factor receptor 2 | mammalian protein found in Mus musculus |
| Q14913903 | 5 | Transcription factor 7 like 2, T cell specific, HMG box | mammalian protein found in Mus musculus |
| Q21497006 | 5 | PYD and CARD domain containing | mammalian protein found in Mus musculus |
| Q21498877 | 5 | Signal transducer and activator of transcription 5A | mammalian protein found in Mus musculus |
| Q21980001 | 5 | Myelocytomatosis oncogene | mammalian protein found in Mus musculus |
| Q22695891 | 6 | Leptin | mammalian protein found in Mus musculus |
| Q21494863 | 6 | Adenomatous polyposis coli protein | mouse protein (annotated by UniProtKB/Swiss-Prot Q61315) |
| Q14908176 | 6 | Transient receptor potential cation channel, subfamily V, member 1 | mammalian protein found in Mus musculus |
| Q21984751 | 6 | Fyn proto-oncogene | mammalian protein found in Mus musculus |
| Q14905733 | 6 | Sodium channel, voltage-gated, type V, alpha | mammalian protein found in Mus musculus |
| Q15270647 | 7 | Zootopia | 2016 computer animated film by Walt Disney Animation Studios |
| Q113110 | 7 | All Saints | Australian television medical drama |
| Q18786473 | 7 | Pete's Dragon | 2016 live action-animated fantasy adventure film |
| Q110203 | 7 | Collateral | 2004 American thriller |
| Q468484 | 7 | The Longest Day | 1962 war film |
| Q416356 | 8 | Phosphatase and tensin homolog | mammalian protein found in Homo sapiens |
| Q4769148 | 8 | Annexin A1 | mammalian protein found in Homo sapiens |
| Q634510 | 8 | Calreticulin | mammalian protein found in Homo sapiens |
| Q21100382 | 8 | Calcium/calmodulin dependent protein kinase II delta | mammalian protein found in Homo sapiens |
| Q21171798 | 8 | Glutathione S-transferase pi 1 | mammalian protein found in Homo sapiens |
| Q10944828 | 9 | Sunwu Subdistrict | no description. But it is the "instance of" subdistrict of China |
| Q11061325 | 9 | No label | no description. But it is the "instance of" subdistrict of China |
| Q11060876 | 9 | No label | no description. But it is the "instance of" subdistrict of China |
| Q14609895 | 9 | No label | no description. But it is the "instance of" subdistrict of China |
| Q13780692 | 9 | No label | no description. But it is the "instance of" subdistrict of China |

Discussion

I simply could not split cluster 4 into 3 different clusters. When I increased the number of clusters, the new clusters seemed to break other clusters into smaller ones (e.g. cluster 5 and 6) instead.
My hypothesis is that the items in cluster 4 simply do not have property pairs that can be clustered well. The cluster is a mix of various item types (e.g. events, people, places). Another hypothesis is that I have to narrow the size gap of the sample further, because the current gap is still too large.
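One possible way to examine the first hypothesis is to tally item types per cluster and see how dominant the most common type is. The sketch below is only an illustration. The coarse type labels are hand-assigned from the item descriptions listed for clusters 3 and 4 in the table above, not actual "instance of" values, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def type_mix_by_cluster(assignments):
    """Tally item types per cluster from (cluster_id, item_type) tuples."""
    mix = defaultdict(Counter)
    for cluster_id, item_type in assignments:
        mix[cluster_id][item_type] += 1
    return mix


def dominant_share(type_counts):
    """Fraction of a cluster's items that belong to its most common type."""
    total = sum(type_counts.values())
    return max(type_counts.values()) / total if total else 0.0


# Hand-assigned coarse types for the five listed items of clusters 3 and 4
# (an assumption for illustration, not real "instance of" statements).
assignments = [
    (3, "scientific article"), (3, "scientific article"),
    (3, "scientific article"), (3, "scientific article"),
    (3, "scientific article"),
    (4, "country"), (4, "award"), (4, "district"),
    (4, "film"), (4, "province"),
]

mix = type_mix_by_cluster(assignments)
for cluster_id in sorted(mix):
    print(cluster_id, round(dominant_share(mix[cluster_id]), 2), dict(mix[cluster_id]))
```

On these ten listed items, cluster 3 has a dominant share of 1.0 and cluster 4 a dominant share of 0.2. Running the same tally over all items of each cluster would indicate whether cluster 4 is genuinely a mix of unrelated types.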