Tabnet — Deep Learning for Tabular data: Architecture Overview

Vigneshwar Ilango
5 min read · Apr 11, 2021


Interest in solving tabular data problems with deep learning models has been growing in recent years. So far, XGBoost, RFE, and LightGBM have been ruling this space because of their effective feature selection and feature importance.

Tabnet, originally developed by Arik and Pfister at Google Cloud AI, has recently been used in Kaggle competitions and shown some promising results. I have linked the paper and a code repo at the end. The paper is very self-explanatory; this article focuses on the working architecture of Tabnet for a better understanding.

Top Advantages of Tabnet:

  1. It can encode multiple data types (for example, images alongside tabular data) and uses nonlinearity to model them.
  2. No need for feature engineering: you can throw in all the columns and the model will pick the best features, and it is also interpretable.

Jumping into Architecture:

Image from the paper[1]

a) Tabnet Encoder Architecture

The architecture basically consists of multiple sequential decision steps, each passing its outputs on to the next. Tips on choosing the number of steps are also mentioned in the paper. If we take a single step, three processes happen:

  1. A feature transformer, which consists of four consecutive GLU decision blocks.
  2. An attentive transformer, which uses sparsemax to perform sparse feature selection; this enables interpretability and better learning, since the model's capacity is spent on the most salient features.
  3. A mask, which applies the attentive transformer's sparse selection to the features; the masked features then pass through the feature transformer, whose outputs n(d) and n(a) are fed onward to the next step.

Initially, the whole dataset with all the features is taken without any feature engineering. It is batch normalized (BN) and passed to the feature transformer, where it goes through the four GLU decision blocks to give two outputs:

  • n(d), the decision output of that particular step, giving its prediction (continuous values in regression, classes in classification).
  • n(a), which goes as input to the next attentive transformer, where the next cycle begins.

After the attentive transformer, where the different features and their importance for that step are figured out, the step's feature importance is aggregated with that of the other steps. The individual importance from a step, f(i), is multiplied by that step's importance s(i) and added to the other steps' importances S(n) and values F(n) to give the final feature importance of the model, which can help explain the model without using SHAP or LIME.

The decision outputs from the feature transformers (n(d), or d[i]) are also aggregated: each d[i] passes through a ReLU and the results are summed across steps, d_out = Σ_i ReLU(d[i]). A final linear mapping, W_final · d_out, is then applied to get the output decision.
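To make the step-wise flow concrete, here is a minimal PyTorch-flavored sketch of the encoder loop. The `feature_transformers`, `attentive_transformers`, and `final_linear` modules are hypothetical stand-ins for the blocks described above, not the paper's reference implementation:

```python
import torch
import torch.nn.functional as F

def tabnet_encoder_sketch(x_bn, feature_transformers, attentive_transformers,
                          final_linear, n_d, gamma=1.3):
    """Sequential decision steps as described above.

    x_bn: batch-normalized input features, shape (batch, n_features).
    feature_transformers[i]: maps masked features to a (batch, n_d + n_a) tensor.
    attentive_transformers[i]: maps (a, prior) to a sparse mask over features.
    """
    prior = torch.ones_like(x_bn)                       # prior scales start at 1
    d_out = torch.zeros(x_bn.shape[0], n_d, device=x_bn.device)
    # step 0 only produces the attention features a[0]
    a = feature_transformers[0](x_bn)[:, n_d:]
    for step in range(1, len(feature_transformers)):
        mask = attentive_transformers[step](a, prior)   # sparse feature selection
        prior = prior * (gamma - mask)                  # remember which features were used
        out = feature_transformers[step](mask * x_bn)   # masked features -> feature transformer
        d, a = out[:, :n_d], out[:, n_d:]               # split into n(d) and n(a)
        d_out = d_out + F.relu(d)                       # aggregate the decision outputs
    return final_linear(d_out)                          # final linear mapping W_final · d_out
```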

b) Feature Transformer

The Feature Transformer consists of four consecutive blocks, each made of a fully connected (FC) layer followed by batch normalization and a GLU activation. GLU stands for Gated Linear Unit: the FC layer produces twice the required number of units, and the block's output is the first half multiplied element-wise by the sigmoid of the second half, GLU(a, b) = a ⊗ σ(b). The first two blocks are shared across decision steps and the last two are step-specific. For robust and parameter-efficient learning, the shared layers reuse the same weights at every decision step, since the same input features are used in different steps. Normalization with √0.5 after each residual connection helps to stabilize learning by ensuring that the variance throughout the network does not change dramatically. The transformer's output is split into the two parts n(d) and n(a), as mentioned before.
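A minimal sketch of this block structure, assuming hypothetical class names (`GLUBlock`, `FeatureTransformerSketch`); it illustrates the FC → BN → GLU pattern, the shared versus step-specific blocks, and the √0.5-scaled residuals, and is not the official implementation:

```python
import math
import torch
import torch.nn as nn

class GLUBlock(nn.Module):
    """One block of the feature transformer: FC -> BN -> GLU.
    The FC layer outputs 2 * out_dim units; the GLU keeps the first half,
    gated by the sigmoid of the second half."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, 2 * out_dim)
        self.bn = nn.BatchNorm1d(2 * out_dim)
        self.out_dim = out_dim

    def forward(self, x):
        x = self.bn(self.fc(x))
        a, b = x[:, :self.out_dim], x[:, self.out_dim:]
        return a * torch.sigmoid(b)                     # GLU(a, b) = a * sigmoid(b)

class FeatureTransformerSketch(nn.Module):
    """Two shared blocks (reused by every decision step) followed by two
    step-specific blocks, with sqrt(0.5)-scaled residual connections."""
    def __init__(self, shared_blocks, hidden_dim):
        super().__init__()
        self.shared = shared_blocks                     # e.g. an nn.ModuleList of two GLUBlocks
        self.specific = nn.ModuleList([GLUBlock(hidden_dim, hidden_dim),
                                       GLUBlock(hidden_dim, hidden_dim)])
        self.scale = math.sqrt(0.5)

    def forward(self, x):
        x = self.shared[0](x)                           # first block: no residual (width changes)
        x = (x + self.shared[1](x)) * self.scale
        for block in self.specific:
            x = (x + block(x)) * self.scale
        return x                                        # caller splits this into n(d) and n(a)
```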

c) Attentive Transformer:

As you can see, the Attentive Transformer consists of an FC layer, a BN layer, a prior-scales layer, and a sparsemax layer. The n(a) input is passed into a fully connected layer followed by batch normalization. The result is then multiplied by the prior scales, which encode how much we already know about each feature from the previous steps, i.e., how much each feature has already been used. If the prior scales are all 1 (as at the start), no feature has been used yet and every feature is equally available. The main advantage of Tabnet is that it employs soft feature selection with controllable sparsity in end-to-end learning: a single model jointly performs feature selection and output mapping. This is controlled by the relaxation parameter γ through the prior-scale update:

P[i] = Π_{j=1}^{i} (γ − M[j])   (formula illustration from Sebastien Fischman)

If γ is close to 1, each feature is effectively forced to be used at only one step, so different features get selected at every step; if it is larger than 1, the same features can be reused across multiple steps. Sparsemax is like softmax, but instead of every feature getting a nonzero weight that sums to 1, some weights become exactly 0 and only the rest add up to 1.

(Sparsemax illustration from Sebastien Fischman)
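A hedged sketch of this attentive transformer logic, with a plain sparsemax implementation (following Martins & Astudillo, 2016). The class and function names are assumptions for illustration, not the reference code:

```python
import torch
import torch.nn as nn

def sparsemax(z):
    """Plain sparsemax over the last dimension: like softmax, but many
    outputs are exactly 0 and the remaining ones sum to 1."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    cumsum = z_sorted.cumsum(dim=-1)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    support = 1 + k * z_sorted > cumsum                 # entries that stay nonzero
    k_z = support.sum(dim=-1, keepdim=True)
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z        # per-row threshold
    return torch.clamp(z - tau, min=0)

class AttentiveTransformerSketch(nn.Module):
    """FC -> BN -> multiply by prior scales -> sparsemax, producing the mask."""
    def __init__(self, n_a, n_features):
        super().__init__()
        self.fc = nn.Linear(n_a, n_features)
        self.bn = nn.BatchNorm1d(n_features)

    def forward(self, a, prior):
        return sparsemax(self.bn(self.fc(a)) * prior)

# After each step, the caller updates the prior scales with the relaxation
# factor gamma, so features that were already used are penalized next time:
#   prior = prior * (gamma - mask)
```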

This makes it an instance-wise feature selection, where different features are taken at different steps and for different samples. The masks are then used to identify the selected features. Quoting from the paper: if M(b,j)[i] = 0, then the j-th feature of the b-th sample has no contribution to the decision at step i. If f[i] were a linear function, the coefficient M(b,j)[i] would correspond to the feature importance of feature j for sample b. Although each decision step employs non-linear processing, the step outputs are later combined in a linear way, which lets us quantify aggregate feature importance in addition to analyzing each step. Combining the masks at different steps requires a coefficient that weighs the relative importance of each step in the decision. The paper simply proposes

η(b)[i] = Σ_c ReLU(d(b,c)[i])

to denote the aggregate decision contribution at the i-th decision step for the b-th sample. Intuitively, if d(b,c)[i] < 0, the features at the i-th step contribute nothing to the overall decision for that sample; as its value increases, that step plays a bigger role in the overall linear combination. Scaling the mask of each step by η(b)[i], the aggregate feature importance mask is

M(agg-b,j) = Σ_i η(b)[i] · M(b,j)[i] / Σ_j Σ_i η(b)[i] · M(b,j)[i]
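The same aggregation in a few lines of NumPy; `masks` and `d` are hypothetical arrays holding the per-step masks and decision outputs, shaped as noted in the docstring:

```python
import numpy as np

def aggregate_feature_importance(masks, d):
    """Sketch of the aggregation above.

    masks: array of shape (n_steps, batch, n_features), the masks M[i]
    d:     array of shape (n_steps, batch, n_d), the decision outputs d[i]
    Returns M_agg of shape (batch, n_features), each row summing to 1."""
    eta = np.maximum(d, 0).sum(axis=-1)                      # eta(b)[i] = sum_c ReLU(d(b,c)[i])
    weighted = (eta[..., None] * masks).sum(axis=0)          # sum_i eta(b)[i] * M(b,j)[i]
    return weighted / weighted.sum(axis=-1, keepdims=True)   # normalize over features j
```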

Some findings from my experiments:

  1. On bigger datasets, Tabnet seems to perform better than XGBoost and LightGBM.
  2. Not good with rare-event classification problems; class weights can be passed as a parameter, just like in XGBoost, but it still didn't perform well in the case I experimented with.
  3. Worked well on multiclass classification problems with a good dataset.
  4. Feature importances were about 75% in common with those from decision-tree algorithms; the non-overlapping ones were not among the top features, though.

Original Paper Reference:

[1] Sercan Ö. Arik and Tomasz Pfister, "TabNet: Attentive Interpretable Tabular Learning", https://arxiv.org/pdf/1908.07442.pdf

There are many codebases available for Tabnet. I would recommend the following:
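One widely used open-source implementation is the pytorch-tabnet package from dreamquark-ai (whether that is the specific repo meant here is not stated). A minimal usage sketch with it, with hyperparameter values chosen purely for illustration, looks roughly like this:

```python
from pytorch_tabnet.tab_model import TabNetClassifier

# X_train, y_train, X_valid, y_valid are NumPy arrays of your tabular data
clf = TabNetClassifier(
    n_d=8, n_a=8,   # widths of the decision (n_d) and attention (n_a) outputs
    n_steps=3,      # number of sequential decision steps
    gamma=1.3,      # relaxation factor: >1 allows feature reuse across steps
)
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    max_epochs=100,
    patience=20,
)
preds = clf.predict(X_valid)
importances = clf.feature_importances_   # aggregate per-column feature importance
```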

Please feel free to connect on any further discussions:
LinkedIn : https://www.linkedin.com/in/vigneshwarilango/
Gmail: mr.vigneshwarilango@gmail.com

Regards,
Vigneshwar Ilango
