User:Glorian WD/Clustering Result v2

The result described on this page is an iteration based on the feedback on User:Glorian_WD/Clustering_Result.

Background

I extracted a new sample. Initially, I extracted 1000 items that have at least 60 property pairs (~100k in size/rev_len). I chose this threshold because the example items in the existing scale B generally have a size of around 100k.

However, I found that the clustering result for these items was not very good. I suspected this was because of the large gap in the number of property pairs between items. Remember that I extracted items with at least 60 property pairs, so one item could have 60 property pairs whereas another could have 90.

Hence, I tried to tighten the size gap. I used the simple LEN function in LibreOffice Calc to approximate the number of property pairs contained in each item. As a result, I trimmed my sample down to 333 items whose property pairs have a LEN of ~1000 to ~2000. Yes, I am aware that LEN may not be a good way to measure the number of property pairs of an item, because the property pair 31-49999999999999 is equal to 31-5 in terms of quantity (i.e. both count as 1 property pair, although 31-49999999999999 has a bigger LEN than 31-5).
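The difference between LEN and an actual pair count can be illustrated with a minimal sketch. It assumes that each item's property pairs are stored as one whitespace-separated string of tokens such as 31-5; the storage format, the helper name count_property_pairs and the toy values are assumptions for illustration only.

```python
def count_property_pairs(pairs_text: str) -> int:
    """Count whitespace-separated property pairs such as '31-5'."""
    return len(pairs_text.split())

# Two toy items with the same number of pairs but different string lengths.
short_item = "31-5 21-6581097"
long_item = "31-49999999999999 21-6581097"

# Character length (what LEN measures) differs between the two items...
print(len(short_item), len(long_item))
# ...while the number of property pairs is the same.
print(count_property_pairs(short_item), count_property_pairs(long_item))
```

Counting tokens instead of characters would make the sample selection independent of how long the individual values are.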

Nevertheless, I thought this would probably not have much impact, so I decided to just use these 333 items. After all, this was still only my hypothesis. If it is true that the clustering problem on the initial 1000 items occurred because of the large size gap, I can always extract new items that have a tighter size gap (e.g. items with 60-70 property pairs).
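Selecting items by a tighter range of property pairs (such as the 60-70 mentioned above) could look roughly like the following sketch. The dictionary layout, the item IDs Q_A, Q_B, Q_C and the function name filter_by_pair_count are hypothetical; only the idea of filtering on the pair count comes from the text.

```python
def filter_by_pair_count(items, low, high):
    """Keep items whose number of property pairs lies in [low, high]."""
    return {
        qid: text
        for qid, text in items.items()
        if low <= len(text.split()) <= high
    }

# Toy data: hypothetical item IDs mapped to space-separated property pairs.
sample = {
    "Q_A": " ".join(f"{p}-1" for p in range(60)),  # 60 pairs
    "Q_B": " ".join(f"{p}-1" for p in range(65)),  # 65 pairs
    "Q_C": " ".join(f"{p}-1" for p in range(90)),  # 90 pairs
}

kept = filter_by_pair_count(sample, 60, 70)
print(sorted(kept))
```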

This time, in addition to playing around with the number of clusters and the displayed centroids, I also tweaked the TF-IDF Vectorizer parameter.

Achieved Result

Below is the clustering result plot using the new sample.

Below are the details of the items in each cluster.

Below is information on some items from the above clusters.

Item	Cluster #	Item name	Item Description
Q3271551	0	Smoothened, frizzled class receptor	mammalian protein found in Homo sapiens
Q357439	0	Adiponectin, C1Q and collagen domain containing	mammalian protein found in Homo sapiens
Q21172390	0	Kinase insert domain receptor	mammalian protein found in Homo sapiens
Q287961	0	Adrenoceptor beta 2	mammalian protein found in Homo sapiens
Q4734884	0	Adrenoceptor alpha 1A	mammalian protein found in Homo sapiens
Q14062179	1	Gaolou	place in China
Q169140	1	Daotian	town in China
Q10902759	1	Beimengzhen	town in Weifang, Shandong, China
Q11113438	1	Qiaotouhe	urban town in Lianyuan, Loudi City, Hunan Province, People's Republic of China
Q14773949	1	Kanjiazhen	town in Weifang, Shandong, China
Q21118717	2	Myocyte enhancer factor 2C	mammalian protein found in Homo sapiens
Q254943	2	Retinoic acid receptor alpha	mammalian protein found in Homo sapiens
Q21108503	2	Paired like homeodomain 2	mammalian protein found in Homo sapiens
Q423698	2	Interleukin 10	mammalian protein found in Homo sapiens
Q21116377	2	SIX homeobox 1	mammalian protein found in Homo sapiens
Q23757242	3	A comprehensive glossary of autophagy-related molecules and processes (2nd edition)	scientific article
Q26801759	3	Genetic Diversity Underlying the Envelope Glycoproteins of Hepatitis C Virus: Structural and Functional Consequences and the Implications for Vaccine Design	scientific article
Q26829681	3	Ion Channels in the Heart	scientific article
Q24650179	3	Early steps in the DNA base excision/single-strand interruption repair pathway in mammalian cells	scientific article
Q24629682	3	PDZ domains and their binding partners: structure, specificity, and modification	scientific article
Q142	4	France	republic in Western Europe
Q787127	4	BAFTA Award for Best Special Visual Effects	no description
Q838329	4	Chrudim District	district of Czech Republic
Q1404656	4	We Are All Murderers	1952 French-Italian film by André Cayatte
Q16199	4	Province of Lecco	province of Italy
Q14911646	5	Fibroblast growth factor receptor 2	mammalian protein found in Mus musculus
Q14913903	5	Transcription factor 7 like 2, T cell specific, HMG box	mammalian protein found in Mus musculus
Q21497006	5	PYD and CARD domain containing	mammalian protein found in Mus musculus
Q21498877	5	Signal transducer and activator of transcription 5A	mammalian protein found in Mus musculus
Q21980001	5	Myelocytomatosis oncogene	mammalian protein found in Mus musculus
Q22695891	6	Leptin	mammalian protein found in Mus musculus
Q21494863	6	Adenomatous polyposis coli protein	mouse protein (annotated by UniProtKB/Swiss-Prot Q61315)
Q14908176	6	Transient receptor potential cation channel, subfamily V, member 1	mammalian protein found in Mus musculus
Q21984751	6	Fyn proto-oncogene	mammalian protein found in Mus musculus
Q14905733	6	Sodium channel, voltage-gated, type V, alpha	mammalian protein found in Mus musculus
Q15270647	7	Zootopia	2016 computer animated film by Walt Disney Animation Studios
Q113110	7	All Saints	Australian television medical drama
Q18786473	7	Pete's Dragon	2016 live action-animated fantasy adventure film
Q110203	7	Collateral	2004 American thriller
Q468484	7	The Longest Day	1962 war film
Q416356	8	Phosphatase and tensin homolog	mammalian protein found in Homo sapiens
Q4769148	8	Annexin A1	mammalian protein found in Homo sapiens
Q634510	8	Calreticulin	mammalian protein found in Homo sapiens
Q21100382	8	Calcium/calmodulin dependent protein kinase II delta	mammalian protein found in Homo sapiens
Q21171798	8	Glutathione S-transferase pi 1	mammalian protein found in Homo sapiens
Q10944828	9	Sunwu Subdistrict	no description ("instance of": subdistrict of China)
Q11061325	9	No label	no description ("instance of": subdistrict of China)
Q11060876	9	No label	no description ("instance of": subdistrict of China)
Q14609895	9	No label	no description ("instance of": subdistrict of China)
Q13780692	9	No label	no description ("instance of": subdistrict of China)

Discussion

I just cannot split cluster 4 into 3 different clusters. When I tried to increase the number of clusters, the new clusters seemed to break other clusters into smaller ones (e.g. clusters 5 and 6).
My hypothesis is that the items in cluster 4 simply do not have appropriate property pairs that can be clustered. They are just a mix of various item types (e.g. events, people, place data). Another hypothesis is that I have to tighten the size gap of the sample further, because the current size gap is still too large.
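One way to probe the first hypothesis would be to tally the "instance of" pairs per cluster and see how mixed each cluster is. The sketch below is only an illustration: it assumes that pairs starting with 31- carry the "instance of" property (P31 in Wikidata), that each item is a whitespace-separated string of pairs, and that cluster labels are available as a parallel list. The function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def instance_of_by_cluster(items, labels, prefix="31-"):
    """Tally the pairs starting with `prefix` for each cluster label."""
    tallies = defaultdict(Counter)
    for text, label in zip(items, labels):
        for pair in text.split():
            if pair.startswith(prefix):
                tallies[label][pair] += 1
    return tallies

# Toy data: four items with made-up property pairs and cluster labels.
items = ["31-8054 703-5", "31-8054 702-9", "31-11424 577-1", "31-3624078 17-142"]
labels = [0, 0, 4, 4]

tallies = instance_of_by_cluster(items, labels)
for label in sorted(tallies):
    print(label, dict(tallies[label]))
```

A cluster whose tally is spread over many different values, as cluster 4 is in this toy data, would be consistent with the items simply being of mixed types.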
