This just seems like an advert for a mailing list and some ML platform I've never heard of; the article content doesn't offer anything new over the million other beginner sklearn tutorials out there.
Wow, thanks so much for that. I was trying to figure out how to do clustering for geographic place names (from AIS data) and that one image answers so many questions for me.
I printed this out and put it on the wall by my desk a while back because of the number of questions people were asking me about various clustering algorithms.
Really depends on your data and on what clustering you want. There isn't one "best" clustering algorithm. Sometimes you really DO want partitioning, and k-means works better. Sometimes it's agglomerative clustering for connecting thin threads. What I've found is that HDBSCAN is too conservative in forming clusters. Usually it's just a matter of running the data through numerous models and seeing which are the most stable after parameter tuning, and which are usable by marketing.
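To make the "run it through several models" point concrete, here's a rough sketch (mine, not from the article) of fitting a couple of scikit-learn clusterers on the same toy data; the parameters are just illustrative, and HDBSCAN is left out since it needs scikit-learn >= 1.3 or the separate hdbscan package:

```python
# Sketch: try multiple clustering models on the same data and compare.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

# Toy data with 4 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

models = {
    # Partitioning: good when you want roughly spherical groups.
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    # Single-linkage agglomerative: good at connecting thin threads.
    "agglomerative": AgglomerativeClustering(n_clusters=4, linkage="single"),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, "found", len(set(labels)), "clusters")
```

In practice you'd then compare label stability across parameter settings and resamples, not just the cluster counts.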
I see a lot of people asking for more advanced notebooks.
Recently I was asked to participate in a competition to identify brain hemorrhages (https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detect...). It turns out jhoward has published a lot of Kaggle notebooks walking through the entire process of gathering the data, cleaning it, implementing a learning algorithm, and submitting an entry.
If you're trying to do practical end to end machine learning, these are definitely worth studying.
I haven't looked at them carefully but two things jump out at me:
1) there is some very useful stuff there, particularly for someone new to consuming medical imaging data. The write-ups aim to be fairly complete.
2) there are some places where jhoward is being naive, e.g. CT image scaling, where the advice could get you into trouble.
You can implement it, albeit more slowly, in pure Python using just 20-30 lines of code. I wrote a blog post a while back showing how k-means can be used to identify the dominant colors in images. It has many applications and is a handy tool for roughly grouping data. Care is needed to pick the optimal starting centroids and k.
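Something along these lines (this is my own minimal sketch, not the blog post's code) — a learning exercise, not a replacement for a tuned library implementation:

```python
# Minimal pure-Python k-means (Lloyd's algorithm), stdlib only.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init; k-means++ picks better starts
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # assignments stopped changing: converged
            break
        centroids = new
    return centroids, clusters

# Two obvious groups; k-means should separate them.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
```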
> You can implement it, albeit more slowly, in pure Python using just 20-30 lines of code.
This is a good exercise for anyone starting out learning about machine learning, but I'd always stick to a well-known library if I were actually using it for something else.
> Care is needed to pick the optimal starting centroids and k.
Definitely, and I think that speaks to the laziness of the linked article: it just says "use the elbow method" for choosing k. In my ~4 years as a data scientist I've never seen or heard of this working for any "real world" problem. Metrics like the silhouette score are much more useful and quantifiable.
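For example, a quick sketch (my own, not from the article) of scoring a range of k values with scikit-learn's silhouette_score on well-separated toy blobs:

```python
# Pick k by silhouette score instead of eyeballing an elbow plot.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three clearly separated blobs, so the "true" k is 3.
centers = [(-5, -5), (0, 0), (5, 5)]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=0.6,
                  random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # with these well-separated blobs, k=3 scores highest
```

Unlike an elbow plot, this gives you an actual number to compare across k, which you can also report alongside the clustering.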
There are like a million blog-style posts doing this basic intro-to-ML-with-pandas-and-Python thing. It's so prevalent that I'm starting to wonder what the motivation is.
They're basically just ads for whatever the author is selling. Since you obviously couldn't post a literal ad to Hacker News, people write simple how-to articles (they have to be simple to reach the most people) and post those instead.
Should I start calling myself an ML expert? I'm 32 years old, and most of these "ML algorithms" were just called statistics when I learned them. Maybe I'm missing something: where is the LEARNING in k-means? Numerical solutions seem to be on the rise now.
Nice write-up for someone starting out; more detail about the individual algorithm steps would attract more readers.
>Should I start calling myself an ML expert? I'm 32 years old, and most of these "ML algorithms" were just called statistics when I learned them. Maybe I'm missing something: where is the LEARNING in k-means? Numerical solutions seem to be on the rise now.
Yes, you can. I have studied statistics, and I cringe at the watering down of what "machine learning" and "AI" have become: simple statistics.
>Nice write-up for someone starting out; more detail about the individual algorithm steps would attract more readers.
I disagree; making it even simpler would attract more readers. You see the same with YouTube tutorials that have 22 parts: the first part has 200,000 views, the second 150,000, and the 20th part only has 400 or so.
Some of the simpler stuff, sure, but more advanced techniques may require a deeper understanding of more complex data and relationships. A lot of this is, as you suspect, straightforward, and it is being automated by things like AutoML. But doing things "better", for some definition of better, will probably always be an expert thing. K-means as this article outlines it is maybe a beginner thing? I guess you could get complicated with it as well, though.
This was one of the first lectures in my ML course back in university. We didn't even have pictures of grapes or the local farmers market back then. Also, we implemented this stupidly easy algorithm without any libraries.
Sorry if I've hijacked the article, but is k-means really still relevant? My teacher in 2008 loved using it for everything, and it is very basic and easy to grasp; perhaps not even worth an article. Is anyone writing amazing AI with it? Why do people like writing about it so much?
Yes, very relevant. You don't always need advanced models; you can accomplish quite a bit with classic ones. I used it for two separate customer projects recently, in two different industries.
There are many developers who might not have done k-means clustering, or unsupervised learning at all. We should think about them as well. And I think the article did a good job of explaining the related concepts. I liked it.
I don't know. He didn't explain how the algorithm works, what it does, or anything you couldn't have gotten straight out of the scikit docs. It would have been much better if it had actually described the algorithm or presented an implementation of it (which isn't that complex, BTW).
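For reference, it really is short. Here's a rough NumPy sketch of the Lloyd iteration (my own, not the article's, and with naive random init rather than k-means++):

```python
# Vectorized k-means in a handful of lines of NumPy.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Naive init: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return centroids, labels

# Two obvious groups of two points each.
X = np.array([[0.0, 0.0], [0.2, 0.1], [9.0, 9.0], [9.1, 8.8]])
centroids, labels = kmeans(X, k=2)
```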