Xanda BI Toolkit: clustering

In the previous post we introduced the open-source release of the toolkit and the general idea behind the project; now I would like to share the clustering implementation.

At this point we have implemented three clustering algorithms:

  • K-means
  • DBSCAN
  • Hierarchical clustering

K-means

K-means is a very straightforward algorithm:

# clustering algorithms
import inspect
from pprint import pprint

from sklearn.cluster import KMeans

# `Step` (the pipeline base class) and `settings` (the parsed JSON config)
# are provided by the toolkit's own modules
class KMeansAlgorithm(Step):
    def __init__(self):
        # parameters are passed straight through to sklearn's KMeans constructor
        self.params = settings["clustering_settings"]["kmeans_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        # debug output: which class and method is being executed
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        km = KMeans(**self.params)
        km.fit(df)
        clusters = km.labels_.tolist()
        df[self.newColumn] = clusters              # attach cluster labels
        pprint(df.head(settings["rows_to_debug"]))
        return df

K-means is memory-friendly and produces good results.
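
To give an idea of how the step is driven, here is a minimal sketch; the kmeans_params values and the toy DataFrame are my own assumptions, and every key in that block is forwarded to sklearn's KMeans constructor:

import pandas as pd

# toy numeric data; in the real pipeline this is the preprocessed dataset
numeric_df = pd.DataFrame({"price": [1, 2, 40, 41], "rating": [3, 4, 9, 10]})

# hypothetical parameter block, mirroring the settings format from the previous post
settings["clustering_settings"]["kmeans_params"] = {"n_clusters": 2, "random_state": 42}

clustered_df = KMeansAlgorithm().execute(numeric_df)   # adds the "Cluster" column
pprint(clustered_df)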

DBSCAN

Although DBSCAN is built around density-based noise filtering, it is capable of self-organising clusters, with no need to specify their number up front.

# continues the same module as the K-means step above
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

class DBScanAlgorithm(Step):
    def __init__(self):
        self.params = settings["clustering_settings"]["dbscan_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        # debug output: which class and method is being executed
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        # DBSCAN is distance based, so the features are standardised first;
        # the scaled array is wrapped back into a DataFrame so the cluster
        # column can be attached to it
        scaled = StandardScaler().fit_transform(df)
        loc_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)

        db = DBSCAN(**self.params).fit(loc_df)
        # mark the core samples (points with enough neighbours within eps)
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True

        clusters = db.labels_.tolist()             # noise points get label -1
        pprint(clusters)

        loc_df[self.newColumn] = clusters
        pprint(loc_df.head(settings["rows_to_debug"]))
        return loc_df
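
Hierarchical clustering is the third algorithm from the list above; its code is not shown in this post. Purely for illustration, a step in the same style could look like the sketch below, assuming a hypothetical hierarchical_params settings key and scikit-learn's AgglomerativeClustering; this is not necessarily how the toolkit implements it:

from sklearn.cluster import AgglomerativeClustering

class HierarchicalAlgorithm(Step):
    def __init__(self):
        # hypothetical settings key, mirroring kmeans_params / dbscan_params
        self.params = settings["clustering_settings"]["hierarchical_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        pprint(self.__class__.__name__)

        # agglomerative clustering builds the hierarchy bottom-up and cuts it
        # at the requested number of clusters
        model = AgglomerativeClustering(**self.params)
        df[self.newColumn] = model.fit_predict(df)
        pprint(df.head(settings["rows_to_debug"]))
        return df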

Xandra BI Toolkit powered by ML released to Open Source

We are happy to announce that we will be partially releasing our Python Business Intelligence Toolkit, powered by machine learning algorithms, to open source.

Idea

The idea behind the toolkit is to provide an easy way for companies to arrange, process, and visualise business data. Thanks to the machine learning algorithms applied, users will be able to solve prediction, classification, and clustering problems.

The visual part will also be a priority for us, so that users can conduct a quick review of their data.

Development

The development is done in Python using the pandas, seaborn and, of course, scikit-learn libraries. Since the product will bear a graceful name, we will put our best effort into a modular architecture, a lightweight code style, and test coverage.

Fine-tuning parameters will also be easy, using a settings file:

{
  "dataset_path": "trained_all.csv",
  "dataset_separator": ";",
  "columns_to_remove": ["Unnamed: 0", "Autoclass", "Color 1", "Color 2", "Image", "Images", "Description", "Overview"],
  "columns_to_encode": ["Category"],
  "columns_to_do_tfidf": ["Product name"],
  "should_purify": true,
  "problem": "clustering",
  "clustering_settings": {
    "algorithm": "kmeans",
    "number_of_cluster": 30,
    "target_column": "Cluster"
  },
  "rows_to_debug": 5
}
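
Loading the file is plain JSON parsing; a minimal sketch, assuming the file is called settings.json (the actual file name in the toolkit may differ):

import json

# parse the settings file into the `settings` dict used across the pipeline steps
with open("settings.json") as f:
    settings = json.load(f)

print(settings["problem"])                                   # "clustering"
print(settings["clustering_settings"]["number_of_cluster"])  # 30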

The following design patterns will be used:

  • Pipeline / Chain of responsibility – to build the pipeline of execution (see the sketch after this list)
  • Abstract factory – to dynamically generate objects responsible for the picked algorithms
  • Decorator – to provide additional functionality to existing classes
  • MVC – to serve as the architectural pattern for web applications later on
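
For the pipeline pattern in particular, here is a minimal sketch of how the step interface and the chain of execution could be wired up; the class and method names are illustrative assumptions, not the toolkit's final interfaces:

class Step:
    """A single pipeline step: takes a DataFrame, returns a transformed DataFrame."""
    def execute(self, df):
        raise NotImplementedError

class Pipeline:
    """Chains steps together, passing the DataFrame from one step to the next."""
    def __init__(self, steps):
        self.steps = steps

    def run(self, df):
        for step in self.steps:
            df = step.execute(df)
        return df

A run would then chain the preprocessing steps and the picked algorithm step into one Pipeline object.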

Roadmap

At this point, data preprocessing is implemented: label encoding, TF-IDF transformation of textual fields, and removal of excessive columns.
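
A minimal sketch of those three preprocessing steps, driven by the column lists from the settings file above; the helper name and the use of scikit-learn's LabelEncoder and TfidfVectorizer are assumptions for illustration:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

def preprocess(df, settings):
    # excessive columns removal
    df = df.drop(columns=settings["columns_to_remove"], errors="ignore")

    # label encoding of categorical columns
    for col in settings["columns_to_encode"]:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))

    # tf-idf transformation of textual columns
    for col in settings["columns_to_do_tfidf"]:
        tfidf = TfidfVectorizer()
        matrix = tfidf.fit_transform(df[col].astype(str))
        terms = [f"{col}_{t}" for t in tfidf.get_feature_names_out()]
        tfidf_df = pd.DataFrame(matrix.toarray(), columns=terms, index=df.index)
        df = pd.concat([df.drop(columns=[col]), tfidf_df], axis=1)

    return df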

The steps to follow are:

  • To implement clustering algorithms
  • To implement classification algorithms
  • To implement regression algorithms
  • To add visualization
  • To add support for different data sources (.txt, SQL, etc.)
  • To wrap inside web application

Please follow our GitHub repo or contact us at [email protected].