This just seems like an advert for a mailing list and some ML platform I've never heard of; the article content doesn't offer anything new over the million other beginner sklearn tutorials out there.
Wow, thanks so much for that. I was trying to figure out how to do clustering for geographic place names (from AIS data) and that one image answers so many questions for me.
I printed this out and put it on the wall by my desk a while back because of the number of questions people were asking me about various clustering algorithms.
Really depends on your data and on what clustering you want. There isn't one "best" clustering algorithm. Sometimes you really DO want partitioning, and k-means works better. Sometimes it's agglomerative clustering for connecting thin threads. What I've found is that HDBSCAN is too conservative in forming clusters. Usually it's just a matter of running the data through numerous models and seeing which are the most stable after parameter tuning, and which are usable by marketing.
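To make the "run it through several models" point concrete, here's a rough sketch (mine, not from the article) of fitting a couple of scikit-learn clusterers on the same toy data; the parameters are just illustrative, and HDBSCAN is left out since it needs scikit-learn >= 1.3 or the separate hdbscan package:

```python
# Sketch: try multiple clustering models on the same data and compare.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

# Toy data with 4 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

models = {
    # Partitioning: good when you want roughly spherical groups.
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    # Single-linkage agglomerative: good at connecting thin threads.
    "agglomerative": AgglomerativeClustering(n_clusters=4, linkage="single"),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, "found", len(set(labels)), "clusters")
```

In practice you'd then compare label stability across parameter settings and resamples, not just the cluster counts.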
I see a lot of people asking for more advanced notebooks.
Recently I was asked to participate in a competition to identify brain hemorrhages (https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detect...). It turns out jhoward has published a lot of Kaggle notebooks walking through the entire process of gathering the data, cleaning it, implementing a learning algorithm, and submitting an entry.
If you're trying to do practical end to end machine learning, these are definitely worth studying.
I haven't looked at them carefully but two things jump out at me:
1) there is some very useful stuff there, particularly for someone new to consuming medical imaging data. The write-ups aim to be fairly complete.
2) there are some places where jhoward is being naive, e.g. CT image scaling, where the advice could get you into trouble.
You can implement it, albeit more slowly, in pure Python using just 20-30 lines of code. I wrote a blog post a while back showing how k-means can be used to identify the dominant colors in images. It has many applications and is a handy tool for roughly grouping data. Care is needed to pick the optimal starting centroids and k.
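Something along these lines (this is my own minimal sketch, not the blog post's code) — a learning exercise, not a replacement for a tuned library implementation:

```python
# Minimal pure-Python k-means (Lloyd's algorithm), stdlib only.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init; k-means++ picks better starts
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # assignments stopped changing: converged
            break
        centroids = new
    return centroids, clusters

# Two obvious groups; k-means should separate them.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
```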
> You can implement it, albeit more slowly, in pure Python using just 20-30 lines of code.
This is a good exercise for anyone starting out learning about machine learning, but I'd always stick to a well-known library if I were actually using it for something else.
> Care is needed to pick the optimal starting centroids and k.
Definitely, and I think that speaks to the laziness of the linked article: it just says "use the elbow method" for choosing k. In my ~4 years as a data scientist I've never seen or heard of this working for any "real world" problem. Metrics like the silhouette score are much more useful and quantifiable.
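For example, a quick sketch (my own, not from the article) of scoring a range of k values with scikit-learn's silhouette_score on well-separated toy blobs:

```python
# Pick k by silhouette score instead of eyeballing an elbow plot.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three clearly separated blobs, so the "true" k is 3.
centers = [(-5, -5), (0, 0), (5, 5)]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=0.6,
                  random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # with these well-separated blobs, k=3 scores highest
```

Unlike an elbow plot, this gives you an actual number to compare across k, which you can also report alongside the clustering.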
There are like a million blog-style posts doing this basic intro-to-ML-with-pandas-and-Python thing. It's so prevalent that I'm starting to wonder what the motivation is.
They're basically just ads for whatever the author is selling. Since you obviously couldn't post a literal ad to Hacker News, people write simple how-to articles (they have to be simple to reach the most people) and post those instead.
Should I start calling myself an ML expert? I'm 32 years old, and most of these "ML algorithms" were just called statistics when I learned them. Maybe I'm missing something: where is the LEARNING in k-means? Numerical solutions seem to be on the rise now.
Nice write-up for someone starting out; more detail about the individual algorithm steps would attract more readers.
>Should I start calling myself an ML expert? I'm 32 years old, and most of these "ML algorithms" were just called statistics when I learned them. Maybe I'm missing something: where is the LEARNING in k-means? Numerical solutions seem to be on the rise now.
Yes, you can. I have studied statistics, and I cringe at the watering down of what "machine learning" and "AI" have become: simple statistics.
>Nice write-up for someone starting out; more detail about the individual algorithm steps would attract more readers.
I disagree; making it even simpler would attract more readers. You see the same with YouTube tutorials that have 22 parts: the first part has 200,000 views, the second 150,000, and the 20th part only has 400 or so.
Some of the simpler stuff, sure, but more advanced techniques may require a deeper understanding of more complex data and relationships. A lot of this is, as you suspect, straightforward, and it is being automated by things like AutoML. But doing things "better", for some definition of better, will probably always be an expert thing. K-means as this article outlines it is maybe a beginner thing? I guess you could get complicated with it as well, though.
This was one of the first lectures in my ML course back in university. We didn't even have pictures of grapes or the local farmers market back then. Also, we implemented this stupidly easy algorithm without any libraries.
Sorry if I've hijacked the article, but is k-means really still relevant? My teacher in 2008 loved using it for everything, and it is very basic and easy to grasp; perhaps not even worth an article. Is anyone writing amazing AI with it? Why do people like writing about it so much?
Yes, very relevant. You don't always need advanced models; you can accomplish quite a bit with classic ones. I used it for two separate customer projects recently, in two different industries.
There are many developers who might not have done k-means clustering, or unsupervised learning at all. We should think about them as well. And I think the article did a good job of explaining the related concepts. I liked it.
I don't know. He didn't explain how the algorithm works, what it does, or anything you couldn't have gotten straight out of the scikit docs. It would have been much better if it had actually described the algorithm or presented an implementation of it (which isn't that complex, BTW).
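For reference, it really is short. Here's a rough NumPy sketch of the Lloyd iteration (my own, not the article's, and with naive random init rather than k-means++):

```python
# Vectorized k-means in a handful of lines of NumPy.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Naive init: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return centroids, labels

# Two obvious groups of two points each.
X = np.array([[0.0, 0.0], [0.2, 0.1], [9.0, 9.0], [9.1, 8.8]])
centroids, labels = kmeans(X, k=2)
```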