User:Glorian WD/Clustering Result v2
The result described on this page is an iteration based on the feedback on User:Glorian_WD/Clustering_Result.
Background
- I extracted a new sample. Initially, I extracted 1000 items that have at least 60 property pairs (~100k in size/rev_len). I chose this threshold because the example items in the existing scale B generally have a size of around 100k.
- However, I found that the clustering result for these items was not very good. I suspected this was because of the large gap in the number of property pairs between items. Because the only constraint was a minimum of 60 property pairs, one item could have 60 property pairs whereas another could have 90.
- Hence, I tried to tighten the size gap. I used the simple LEN function in LibreOffice Calc to measure the number of property pairs contained in each item. As a result, I trimmed my sample down to 333 items whose property pairs have a LEN of ~1000 to ~2000. I am aware that LEN may not be a good way to measure the number of property pairs for each item, because the property pair 31-49999999999999 is equal to 31-5 in terms of quantity (i.e. each counts as 1 property pair), although 31-49999999999999 has a larger LEN than 31-5.
- Nevertheless, I thought this would probably not have much impact, so I decided to just use these 333 items. After all, this is still only my hypothesis. If it is true that the clustering problem with the initial 1000 items was caused by the large size gap, I can always extract new items that have a tighter size gap (e.g. items with 60-70 property pairs).
- This time, in addition to experimenting with the number of clusters and displayed centroids, I also tweaked the TF-IDF vectorizer parameters.
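The LEN caveat above could be avoided by counting the pairs directly. The following is only a minimal sketch, assuming each item's property pairs are stored as one whitespace-separated string of `property-value` tokens. That storage format, the function names, and the sample items are all hypothetical and not taken from the actual sample.

```python
def count_property_pairs(pairs_text):
    """Count property pairs, assuming one whitespace-separated token per pair."""
    return len(pairs_text.split())


def filter_by_pair_count(items, low, high):
    """Keep only items whose number of property pairs lies in [low, high]."""
    return {
        qid: text
        for qid, text in items.items()
        if low <= count_property_pairs(text) <= high
    }


# Hypothetical items: item_a and item_b both hold two pairs,
# but item_b has a much longer string because of one long value.
items = {
    "item_a": "31-5 17-142",
    "item_b": "31-49999999999999 17-142",
    "item_c": "31-5",
}

for qid, text in items.items():
    # Character length (what LEN measures) versus the actual pair count.
    print(qid, len(text), count_property_pairs(text))

kept = filter_by_pair_count(items, 2, 2)
```

With a count like this, a range such as 60-70 property pairs could be applied directly instead of approximating it through string length.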
Achieved Result
Below is the clustering result plot using the new sample.
Below are the details of the items in each cluster.
Below is information on some items from the above clusters.
Item | Cluster # | Item name | Item Description |
---|---|---|---|
Q3271551 | 0 | Smoothened, frizzled class receptor | mammalian protein found in Homo sapiens |
Q357439 | 0 | Adiponectin, C1Q and collagen domain containing | mammalian protein found in Homo sapiens |
Q21172390 | 0 | Kinase insert domain receptor | mammalian protein found in Homo sapiens |
Q287961 | 0 | Adrenoceptor beta 2 | mammalian protein found in Homo sapiens |
Q4734884 | 0 | Adrenoceptor alpha 1A | mammalian protein found in Homo sapiens |
Q14062179 | 1 | Gaolou | place in China |
Q169140 | 1 | Daotian | town in China |
Q10902759 | 1 | Beimengzhen | town in Weifang, Shandong, China |
Q11113438 | 1 | Qiaotouhe | urban town in Lianyuan, Loudi City, Hunan Province, People's Republic of China |
Q14773949 | 1 | Kanjiazhen | town in Weifang, Shandong, China |
Q21118717 | 2 | Myocyte enhancer factor 2C | mammalian protein found in Homo sapiens |
Q254943 | 2 | Retinoic acid receptor alpha | mammalian protein found in Homo sapiens |
Q21108503 | 2 | Paired like homeodomain 2 | mammalian protein found in Homo sapiens |
Q423698 | 2 | Interleukin 10 | mammalian protein found in Homo sapiens |
Q21116377 | 2 | SIX homeobox 1 | mammalian protein found in Homo sapiens |
Q23757242 | 3 | A comprehensive glossary of autophagy-related molecules and processes (2nd edition) | scientific article |
Q26801759 | 3 | Genetic Diversity Underlying the Envelope Glycoproteins of Hepatitis C Virus: Structural and Functional Consequences and the Implications for Vaccine Design | scientific article |
Q26829681 | 3 | Ion Channels in the Heart | scientific article |
Q24650179 | 3 | Early steps in the DNA base excision/single-strand interruption repair pathway in mammalian cells | scientific article |
Q24629682 | 3 | PDZ domains and their binding partners: structure, specificity, and modification | scientific article |
Q142 | 4 | France | republic in Western Europe |
Q787127 | 4 | BAFTA Award for Best Special Visual Effects | no description |
Q838329 | 4 | Chrudim District | district of Czech Republic |
Q1404656 | 4 | We Are All Murderers | 1952 French-Italian film by André Cayatte |
Q16199 | 4 | Province of Lecco | province of Italy |
Q14911646 | 5 | Fibroblast growth factor receptor 2 | mammalian protein found in Mus musculus |
Q14913903 | 5 | Transcription factor 7 like 2, T cell specific, HMG box | mammalian protein found in Mus musculus |
Q21497006 | 5 | PYD and CARD domain containing | mammalian protein found in Mus musculus |
Q21498877 | 5 | Signal transducer and activator of transcription 5A | mammalian protein found in Mus musculus |
Q21980001 | 5 | Myelocytomatosis oncogene | mammalian protein found in Mus musculus |
Q22695891 | 6 | Leptin | mammalian protein found in Mus musculus |
Q21494863 | 6 | Adenomatous polyposis coli protein | mouse protein (annotated by UniProtKB/Swiss-Prot Q61315) |
Q14908176 | 6 | Transient receptor potential cation channel, subfamily V, member 1 | mammalian protein found in Mus musculus |
Q21984751 | 6 | Fyn proto-oncogene | mammalian protein found in Mus musculus |
Q14905733 | 6 | Sodium channel, voltage-gated, type V, alpha | mammalian protein found in Mus musculus |
Q15270647 | 7 | Zootopia | 2016 computer animated film by Walt Disney Animation Studios |
Q113110 | 7 | All Saints | Australian television medical drama |
Q18786473 | 7 | Pete's Dragon | 2016 live action-animated fantasy adventure film |
Q110203 | 7 | Collateral | 2004 American thriller |
Q468484 | 7 | The Longest Day | 1962 war film |
Q416356 | 8 | Phosphatase and tensin homolog | mammalian protein found in Homo sapiens |
Q4769148 | 8 | Annexin A1 | mammalian protein found in Homo sapiens |
Q634510 | 8 | Calreticulin | mammalian protein found in Homo sapiens |
Q21100382 | 8 | Calcium/calmodulin dependent protein kinase II delta | mammalian protein found in Homo sapiens |
Q21171798 | 8 | Glutathione S-transferase pi 1 | mammalian protein found in Homo sapiens |
Q10944828 | 9 | Sunwu Subdistrict | no description, but it is an "instance of" subdistrict of China |
Q11061325 | 9 | No label | no description, but it is an "instance of" subdistrict of China |
Q11060876 | 9 | No label | no description, but it is an "instance of" subdistrict of China |
Q14609895 | 9 | No label | no description, but it is an "instance of" subdistrict of China |
Q13780692 | 9 | No label | no description, but it is an "instance of" subdistrict of China |
Discussion
I just cannot split cluster 4 into 3 different clusters. When I increased the number of clusters, the new clusters seemed to break other clusters (e.g. clusters 5 and 6) into smaller ones instead.
My hypothesis is that the items in cluster 4 simply do not have property pairs that lend themselves to clustering. The cluster is a mix of various item types (e.g. events, people, places). Another hypothesis is that I need to tighten the size gap of the sample further, because the current gap is still too large.
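One way to watch how clusters split as the number of clusters grows is to compare cluster sizes across several values. The sketch below is only an illustration: it assumes scikit-learn's `TfidfVectorizer` and `KMeans` (this page mentions a TF-IDF vectorizer but does not name the clustering algorithm, so k-means is an assumption), and the property-pair strings are made-up placeholders, not items from the sample.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical property-pair strings, one per item (placeholder values).
docs = [
    "31-8054 703-5 352-100",
    "31-8054 703-5 352-200",
    "31-8054 703-5 352-300",
    "31-13442814 1433-180445 577-1",
    "31-13442814 1433-180445 577-2",
    "31-13442814 1433-180445 577-3",
    "31-11424 17-142 57-1",
    "31-515 17-148 131-2",
]


def cluster_sizes(docs, n_clusters):
    """Vectorize property pairs with TF-IDF and return the size of each cluster."""
    # Treat every whitespace-separated token as one term, so that a pair
    # such as "31-5" is not split at the hyphen by the default tokenizer.
    vectorizer = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
    matrix = vectorizer.fit_transform(docs)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(matrix)
    return Counter(labels.tolist())


# Compare how the items are distributed for several cluster counts.
sizes_by_k = {k: cluster_sizes(docs, k) for k in (2, 3, 4)}
for k, sizes in sizes_by_k.items():
    print(k, sorted(sizes.values(), reverse=True))
```

If a mixed cluster keeps its size while the homogeneous clusters shrink as the number of clusters increases, that would be consistent with the first hypothesis above, although a toy example like this cannot confirm it for the real sample.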