Xanda BI Toolkit: clustering

In the previous post we introduced the open-source release of the toolkit and the general idea behind the project; now I would like to share the clustering implementation.

At this point we have implemented three clustering algorithms:

  • K-means
  • DBSCAN
  • Hierarchical clustering

K-means

K-means is a very straightforward algorithm:

# clustering algorithms
import inspect
from pprint import pprint

from sklearn.cluster import KMeans

# `Step` (the pipeline base class) and `settings` (the parsed JSON config)
# are provided by the toolkit's own modules
class KMeansAlgorithm(Step):
    def __init__(self):
        # parameters are passed straight through to sklearn's KMeans constructor
        self.params = settings["clustering_settings"]["kmeans_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        # debug output: which class and method is being executed
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        km = KMeans(**self.params)
        km.fit(df)
        clusters = km.labels_.tolist()
        df[self.newColumn] = clusters              # attach cluster labels
        pprint(df.head(settings["rows_to_debug"]))
        return df

K-means is memory-friendly and produces good results.
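
To give an idea of how the step is driven, here is a minimal sketch; the kmeans_params values and the toy DataFrame are my own assumptions, and every key in that block is forwarded to sklearn's KMeans constructor:

import pandas as pd

# toy numeric data; in the real pipeline this is the preprocessed dataset
numeric_df = pd.DataFrame({"price": [1, 2, 40, 41], "rating": [3, 4, 9, 10]})

# hypothetical parameter block, mirroring the settings format from the previous post
settings["clustering_settings"]["kmeans_params"] = {"n_clusters": 2, "random_state": 42}

clustered_df = KMeansAlgorithm().execute(numeric_df)   # adds the "Cluster" column
pprint(clustered_df)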

DBSCAN

Although DBSCAN is built around density-based noise filtering, it is capable of self-organising clusters, with no need to specify their number up front.

# continues the same module as the K-means step above
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

class DBScanAlgorithm(Step):
    def __init__(self):
        self.params = settings["clustering_settings"]["dbscan_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        # debug output: which class and method is being executed
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        # DBSCAN is distance based, so the features are standardised first;
        # the scaled array is wrapped back into a DataFrame so the cluster
        # column can be attached to it
        scaled = StandardScaler().fit_transform(df)
        loc_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)

        db = DBSCAN(**self.params).fit(loc_df)
        # mark the core samples (points with enough neighbours within eps)
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True

        clusters = db.labels_.tolist()             # noise points get label -1
        pprint(clusters)

        loc_df[self.newColumn] = clusters
        pprint(loc_df.head(settings["rows_to_debug"]))
        return loc_df
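
Hierarchical clustering is the third algorithm from the list above; its code is not shown in this post. Purely for illustration, a step in the same style could look like the sketch below, assuming a hypothetical hierarchical_params settings key and scikit-learn's AgglomerativeClustering; this is not necessarily how the toolkit implements it:

from sklearn.cluster import AgglomerativeClustering

class HierarchicalAlgorithm(Step):
    def __init__(self):
        # hypothetical settings key, mirroring kmeans_params / dbscan_params
        self.params = settings["clustering_settings"]["hierarchical_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        pprint(self.__class__.__name__)

        # agglomerative clustering builds the hierarchy bottom-up and cuts it
        # at the requested number of clusters
        model = AgglomerativeClustering(**self.params)
        df[self.newColumn] = model.fit_predict(df)
        pprint(df.head(settings["rows_to_debug"]))
        return df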

Xandra BI Toolkit powered by ML released to Open Source

We are happy to announce that we will be partially releasing our Python Business Intelligence Toolkit, powered by machine learning algorithms, to open source.

Idea

The idea behind the toolkit is to provide an easy way for companies to arrange, process, and visualise business data. Thanks to the machine learning algorithms applied, users will be able to solve prediction, classification, and clustering problems.

The visual part will also be a priority for us, so that users can conduct a quick review of their data.

Development

The development is done in Python using the pandas, seaborn and, of course, scikit-learn libraries. Since the product will bear a graceful name, we will put our best effort into a modular architecture, a lightweight code style, and test coverage.

Fine-tuning parameters will also be easy, using a settings file:

{
  "dataset_path": "trained_all.csv",
  "dataset_separator": ";",
  "columns_to_remove": ["Unnamed: 0", "Autoclass", "Color 1", "Color 2", "Image", "Images", "Description", "Overview"],
  "columns_to_encode": ["Category"],
  "columns_to_do_tfidf": ["Product name"],
  "should_purify": true,
  "problem": "clustering",
  "clustering_settings": {
    "algorithm": "kmeans",
    "number_of_cluster": 30,
    "target_column": "Cluster"
  },
  "rows_to_debug": 5
}
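
Loading the file is plain JSON parsing; a minimal sketch, assuming the file is called settings.json (the actual file name in the toolkit may differ):

import json

# parse the settings file into the `settings` dict used across the pipeline steps
with open("settings.json") as f:
    settings = json.load(f)

print(settings["problem"])                                   # "clustering"
print(settings["clustering_settings"]["number_of_cluster"])  # 30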

The following design patterns will be used:

  • Pipeline / Chain of responsibility – to build the pipeline of execution (see the sketch after this list)
  • Abstract factory – to dynamically generate objects responsible for the picked algorithms
  • Decorator – to provide additional functionality to existing classes
  • MVC – to serve as the architectural pattern for web applications later on
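
For the pipeline pattern in particular, here is a minimal sketch of how the step interface and the chain of execution could be wired up; the class and method names are illustrative assumptions, not the toolkit's final interfaces:

class Step:
    """A single pipeline step: takes a DataFrame, returns a transformed DataFrame."""
    def execute(self, df):
        raise NotImplementedError

class Pipeline:
    """Chains steps together, passing the DataFrame from one step to the next."""
    def __init__(self, steps):
        self.steps = steps

    def run(self, df):
        for step in self.steps:
            df = step.execute(df)
        return df

A run would then chain the preprocessing steps and the picked algorithm step into one Pipeline object.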

Roadmap

At this point, data preprocessing is implemented: label encoding, TF-IDF transformation of textual fields, and removal of excessive columns.
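
A minimal sketch of those three preprocessing steps, driven by the column lists from the settings file above; the helper name and the use of scikit-learn's LabelEncoder and TfidfVectorizer are assumptions for illustration:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

def preprocess(df, settings):
    # excessive columns removal
    df = df.drop(columns=settings["columns_to_remove"], errors="ignore")

    # label encoding of categorical columns
    for col in settings["columns_to_encode"]:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))

    # tf-idf transformation of textual columns
    for col in settings["columns_to_do_tfidf"]:
        tfidf = TfidfVectorizer()
        matrix = tfidf.fit_transform(df[col].astype(str))
        terms = [f"{col}_{t}" for t in tfidf.get_feature_names_out()]
        tfidf_df = pd.DataFrame(matrix.toarray(), columns=terms, index=df.index)
        df = pd.concat([df.drop(columns=[col]), tfidf_df], axis=1)

    return df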

The steps to follow are:

  • To implement clustering algorithms
  • To implement classification algorithms
  • To implement regression algorithms
  • To add visualization
  • To add support for different data sources (.txt, SQL, etc.)
  • To wrap inside web application

Please follow our GitHub repo or contact us at [email protected].