Xanda BI Toolkit: clustering

In the previous post we introduced the open-source release of the toolkit and the general idea behind the project. Now I would like to share the clustering implementation.

At this point we have implemented three clustering algorithms:

  • K-means
  • DBSCAN
  • Hierarchical clustering

K-means

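A very straightforward algorithm: the step fits a KMeans model with the configured parameters and writes each row's cluster label into the target column.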

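#clustering algorithms
# Imports assumed here: the calls below match scikit-learn's API;
# Step and settings come from the toolkit itself.
import inspect
from pprint import pprint

from sklearn.cluster import KMeans


class KMeansAlgorithm(Step):
    def __init__(self):
        self.params = settings["clustering_settings"]["kmeans_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        # Debug output: class name and current method name.
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        # Fit K-means with the configured parameters and store each row's label.
        km = KMeans(**self.params)
        km.fit(df)
        clusters = km.labels_.tolist()
        df[self.newColumn] = clusters
        pprint(df.head(settings["rows_to_debug"]))
        return df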

K-means is memory-friendly and produces good results.
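To make the step above more concrete, here is a minimal standalone sketch of the same idea outside the toolkit. The settings dictionary, its parameter values, and the sample data are made up for illustration, and scikit-learn's KMeans is assumed; the real toolkit loads its own configuration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical settings, shaped like the keys the step reads.
settings = {
    "clustering_settings": {
        "kmeans_params": {"n_clusters": 2, "n_init": 10, "random_state": 0},
        "target_column": "cluster",
    }
}

# Made-up data: two well-separated groups of points.
df = pd.DataFrame({
    "x": [1.0, 1.2, 0.8, 8.0, 8.2, 7.9],
    "y": [1.0, 0.9, 1.1, 8.0, 8.1, 7.8],
})

params = settings["clustering_settings"]["kmeans_params"]
target = settings["clustering_settings"]["target_column"]

# The parameter dictionary is unpacked straight into the constructor,
# so any KMeans option can be set from configuration.
km = KMeans(**params)
km.fit(df)

# One label per row, written into the target column.
df[target] = km.labels_.tolist()
print(df)
```

Which group gets label 0 and which gets label 1 is arbitrary, so only the grouping itself should be relied on.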

DBSCAN

Although DBSCAN is a density-based algorithm designed to cope with noise, it is also able to organise clusters on its own, without being told how many to look for.

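# Imports assumed here: the calls below match scikit-learn's API;
# Step and settings come from the toolkit itself.
import inspect
from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


class DBScanAlgorithm(Step):
    def __init__(self):
        self.params = settings["clustering_settings"]["dbscan_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        # Debug output: class name and current method name.
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        # fit_transform returns a NumPy array, so wrap it back into a
        # DataFrame to keep the column names and allow adding a new column.
        loc_df = pd.DataFrame(StandardScaler().fit_transform(df),
                              columns=df.columns, index=df.index)
        db = DBSCAN(**self.params).fit(loc_df)

        # Boolean mask of core samples (not used further in this step).
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True

        # Label -1 marks points that DBSCAN treats as noise.
        clusters = db.labels_.tolist()
        print(clusters)

        loc_df[self.newColumn] = clusters
        pprint(loc_df.head(settings["rows_to_debug"]))

        return loc_df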
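
As a small illustration of how DBSCAN finds the clusters by itself and sets noise aside, here is a standalone sketch. The data and the eps and min_samples values are made up for the example, and scikit-learn is assumed; they are not the toolkit's defaults.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Made-up data: two tight groups plus one isolated point.
df = pd.DataFrame({
    "x": [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0, 20.0],
    "y": [1.0, 0.9, 1.1, 1.0, 5.0, 5.1, 4.9, 5.0, 20.0],
})

# Scale first, as the step does, so eps is measured in the same units
# for every column.
scaled = StandardScaler().fit_transform(df)

# Illustrative parameters: a point needs at least 3 points (itself
# included) within distance 0.3 to count as a core sample.
db = DBSCAN(eps=0.3, min_samples=3).fit(scaled)
labels = db.labels_.tolist()

# scikit-learn marks noise with the label -1; every other label is a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = labels.count(-1)

print("labels:", labels)
print("clusters found:", n_clusters, "noise points:", n_noise)
```

Note that the number of clusters is never passed in: it falls out of the density parameters, and the isolated point is reported as noise instead of being forced into a cluster.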