Unsupervised Feature Selection for Time-Series Sensor Data with the MSDA Package

What is MSDA?

MSDA is an open-source multidimensional multi-sensor data analysis framework, written in Python.

Why MSDA?

MSDA is a simple and intuitive Python package that makes it easier to explore, plot, and visualize multidimensional, multi-sensor time-series data, with the aim of selecting appropriate features/sensors, whether the task is unsupervised or supervised.

Basics Revisited

Before we delve into the usability of the package, let's go over a few basic concepts in layman's terms. Also, before going deeper into PCA, let's first discuss a few important concepts of linear algebra.

  1. Time series analysis.
  2. Identifying the variation of each sensor column with respect to time (increasing, decreasing, equal).
  3. Identifying how each column's values vary with respect to the other columns, and the maximum variation ratio between each column and the others.
  4. Establishing the relationship with the trend array to identify the most appropriate sensor.
  5. Letting the user select a window length and then check the average value and standard deviation across each window for each sensor column.
  6. Providing a count of growth/decay values for each sensor column above or below a threshold value.
  7. Feature engineering.

EXAMPLE USECASE — Unsupervised Feature Selection

High-dimensional data is very hard to process and visualize. Reducing the dimensionality by extracting a smaller set of important features (fewer than the overall number of features) that is still enough to cover the variation in the data therefore helps reduce the data size and, in turn, makes processing easier.

PCA Evaluation

Steps:

  1. Import libraries
  • You can get the eigenvectors using pca.components_
  • You can get the eigenvalues using pca.explained_variance_
  • You can get the percentage of variance explained by each of the selected components using pca.explained_variance_ratio_
['net_in', 'cpu_util_percent', 'mem_util_percent', 'cpu_util_percent']
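
The PCA attributes listed above can be exercised with a minimal sketch like the following. It assumes scikit-learn and uses a small synthetic table in place of the article's dataset: the column names are borrowed from the output above, but the values are random and purely illustrative. Picking, for each component, the feature with the largest absolute loading is one common way to rank features and may differ from the exact steps used here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor table (values are made up for illustration).
rng = np.random.default_rng(0)
cols = ["cpu_util_percent", "mem_util_percent", "net_in", "net_out", "disk_io_percent"]
df = pd.DataFrame(rng.normal(size=(500, len(cols))), columns=cols)

# Standardize so that no column dominates purely because of its scale.
X = StandardScaler().fit_transform(df)

pca = PCA(n_components=4)
pca.fit(X)

eigenvectors = pca.components_            # shape: (n_components, n_features)
eigenvalues = pca.explained_variance_     # variance captured by each component
ratios = pca.explained_variance_ratio_    # fraction of total variance per component

# For each component, take the feature with the largest absolute loading.
top_idx = np.abs(eigenvectors).argmax(axis=1)
top_features = [cols[i] for i in top_idx]
print(top_features)
```

With this approach the same feature can dominate more than one component, so the resulting list may contain a repeated name.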

IPCA Evaluation

Steps:

  1. Import libraries
['net_in', 'cpu_util_percent', 'mem_util_percent', 'cpu_util_percent']
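
Incremental PCA follows the same idea but fits the model in batches, which helps when the data does not fit in memory. Below is a minimal sketch, assuming scikit-learn's IncrementalPCA and, again, a synthetic table with borrowed column names and random values. The batch count and the largest-absolute-loading ranking are illustrative choices, not necessarily the ones used for the result above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor table (values are made up for illustration).
rng = np.random.default_rng(1)
cols = ["cpu_util_percent", "mem_util_percent", "net_in", "net_out", "disk_io_percent"]
df = pd.DataFrame(rng.normal(size=(600, len(cols))), columns=cols)
X = StandardScaler().fit_transform(df)

ipca = IncrementalPCA(n_components=4)

# Feed the data chunk by chunk; each chunk must have at least n_components rows.
for batch in np.array_split(X, 6):
    ipca.partial_fit(batch)

# Rank features the same way as with plain PCA: largest absolute loading per component.
top_idx = np.abs(ipca.components_).argmax(axis=1)
top_features = [cols[i] for i in top_idx]
print(top_features)
```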

MSDA Evaluation

Note: Here, I explicitly walk you through each of the algorithms available in the module, without showing them being called directly from the package. To use it as a package, follow the package's demo tutorial.

  1. Import libraries
Index(['timestamp', 'machine_id', 'cpu_util_percent', 'mem_util_percent','mem_gps', 'mkpi', 'net_in', 'net_out', 'disk_io_percent', 'Date', 'Time'],
dtype='object')
Max. Variation Involved in each Sensor Column values are:
Note: Inc-Increasing ; Dec-Decreasing ; Eq-Equal
For CPU UTIL PERCENT Column: Dec
For MEM UTILPERCENT Column: Eql
For NET IN Column: Eql
For NET OUT Column: Eql
For DISK IO Column: Eql
[['Eq' 'Inc' 'Dec' ... 'Eq' 'Eq' 'Eq']
['Eq' 'Inc' 'Inc' ... 'Eq' 'Eq' 'Eq']
['Eq' 'Inc' 'Inc' ... 'Eq' 'Eq' 'Eq']
...
['Eq' 'Inc' 'Dec' ... 'Eq' 'Eq' 'Eq']
['Eq' 'Inc' 'Dec' ... 'Eq' 'Eq' 'Inc']
['Eq' 'Inc' 'Inc' ... 'Eq' 'Eq' 'Eq']]
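
A trend array like the one above can be approximated with a few lines of pandas and NumPy. This is a minimal sketch rather than MSDA's own code: `trend_labels` is a hypothetical helper, the tiny table is made up, and "most frequent label" is only an assumed reading of the per-column summary printed earlier.

```python
import numpy as np
import pandas as pd

def trend_labels(df):
    """Label each step of each column as 'Inc', 'Dec', or 'Eq' (hypothetical helper)."""
    diffs = df.diff().iloc[1:]                 # first row has no predecessor, so drop it
    d = diffs.to_numpy()
    labels = np.where(d > 0, "Inc", np.where(d < 0, "Dec", "Eq"))
    return pd.DataFrame(labels, columns=df.columns, index=diffs.index)

# Made-up readings, one row per timestamp.
df = pd.DataFrame({
    "cpu_util_percent": [10, 12, 12, 9, 8],
    "mem_util_percent": [50, 50, 51, 51, 51],
})

trends = trend_labels(df)

# Most frequent label per column (an assumed definition of the dominant variation).
dominant = {col: trends[col].value_counts().idxmax() for col in trends.columns}
print(trends)
print(dominant)
```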
** Ratios of Variations Of Values of Each Sensor Column wrt other Sensor Column **
Note: Inc-Increasing ; Dec-Decreasing ; Eq-Equal
For Sensor Column:- cpu_util_percent
Ratio is: 0.6909658204509999
When Sensor Column 'cpu_util_percent' values are Eq , Sensor Column 'mem_util_percent' values are Eq
------------------------
For Sensor Column:- mem_util_percent
Ratio is: 1.0
When Sensor Column 'mem_util_percent' values are Eq , Sensor Column 'cpu_util_percent' values are Eq
------------------------
For Sensor Column:- net_in
Ratio is: 0.5092423319114361
When Sensor Column 'net_in' values are Inc , Sensor Column 'cpu_util_percent' values are Eq
------------------------
--------------------------------------------------------------------
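
The article does not spell out how the ratio above is computed. One plausible reading, sketched below under that assumption, is a co-occurrence ratio: take a column's most frequent trend label, then find which other column and label accompany it most often, and report that share. `cooccurrence_ratio` is a hypothetical helper and the labels are made up; MSDA's actual definition may differ.

```python
import pandas as pd

def cooccurrence_ratio(trends, col):
    """Strongest co-occurrence for the most frequent label of `col` (hypothetical helper).

    Returns (label, ratio, other_column, other_label).
    """
    label = trends[col].value_counts().idxmax()
    mask = trends[col] == label
    best_ratio, best_col, best_label = 0.0, None, None
    for other in trends.columns:
        if other == col:
            continue
        # value_counts() sorts by count, so the first entry is the most common label.
        counts = trends.loc[mask, other].value_counts()
        ratio = counts.iloc[0] / mask.sum()
        if ratio > best_ratio:
            best_ratio, best_col, best_label = float(ratio), other, counts.index[0]
    return label, best_ratio, best_col, best_label

# Made-up trend labels, one row per time step.
trends = pd.DataFrame({
    "cpu_util_percent": ["Eq", "Eq", "Inc", "Eq", "Dec"],
    "mem_util_percent": ["Eq", "Eq", "Eq", "Eq", "Eq"],
    "net_in": ["Inc", "Eq", "Inc", "Dec", "Inc"],
})

label, ratio, other, other_label = cooccurrence_ratio(trends, "cpu_util_percent")
print(f"When 'cpu_util_percent' is {label}, '{other}' is {other_label} (ratio {ratio})")
```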
** Avg. and Standard deviations for each Sensor Column **
Enter Time in Seconds for the Window: (Must be a Multiple of 2):20
Rate of Change of AVG Across Window For Sensor Column cpu_util_percent: 34.61231884057971
Rate of Change of AVG Across Window For Sensor Column mem_util_percent: 89.74639837819186
Rate of Change of AVG Across Window For Sensor Column net_in: 37.58424969806764
Count of Growth/Decay value for each Sensor Column Values above or below a threshold value:
{'cpu_util_percent': 19224, 'mem_util_percent': 8965, 'net_in': 864}
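
The windowed statistics and the threshold count can be sketched along the following lines. The assumptions are stated up front: the window is measured in rows rather than seconds, windows do not overlap, and the count covers values strictly above the threshold. `window_stats` and `count_above` are hypothetical helpers and the table is made up, so treat this as an illustration rather than MSDA's implementation.

```python
import numpy as np
import pandas as pd

def window_stats(series, window):
    """Mean and standard deviation over consecutive, non-overlapping windows of rows."""
    groups = np.arange(len(series)) // window   # 0,0,...,1,1,... one id per window
    return series.groupby(groups).agg(["mean", "std"])

def count_above(df, threshold):
    """Number of values per column strictly above `threshold`."""
    return {col: int(n) for col, n in (df > threshold).sum().items()}

# Made-up readings, one row per time step.
df = pd.DataFrame({
    "cpu_util_percent": [10, 20, 30, 40, 50, 60],
    "mem_util_percent": [80, 80, 82, 82, 90, 90],
})

stats = window_stats(df["cpu_util_percent"], window=2)
counts = count_above(df, threshold=45)
print(stats)
print(counts)
```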

MSDA Conclusion

Plots of each sensor's values are provided, along with the features and their correlation (slope).

Most Important Features — Comparison of PCA, IPCA, MSDA

The top-n variables in the order of importance using the different approaches are given below.

CONTACT

You can reach me at ajay.arunachalam08@gmail.com

Ajay Arunachalam


Data Science Manager; AWS Certified ML Specialist; AWS Certified Cloud Solution Architect; Power BI Certified https://www.linkedin.com/in/ajay-ph-d-4744581a/