Correlation matrix analysis with the nested stochastic block model:the graph structure of stock market.

Table of Contents

In this post, I’ll show you how the nested stochastic block model (nSBM) can extract insights from correlation matrices using a combination with a filtering edge graph technique called disparity filter. I’ll use as a case study a correlation matrix derived from the stock market data.

Don’t be afraid, you will reproduce the beautiful plot presented below and understand what it means.

If you want to reproduce, use a conda environment because of the graph-tool dependency. Besides that, check if you have the following packages installed:


matplotlib, pandas, yfinance

Probably you will need to install the igraph package


$ pip install python-igraph

Now, install the graph-tool made by Tiago Peixoto.


$ conda install -c conda-forge graph-tool

No less important, we will need to remove weakling edges from the graph. To do so, we will use my package, edgeseraser. The repository of the project is available here: devmessias/edgeseraser. To install the edgeseraser run the following command:


$ pip install edgeseraser

Introduction

Exploratory data analysis allows us to gain insights or improve our choices of the next steps in the data science process. In some cases, we can explore the data assuming that all the entities are independent of each other. What happens when it’s not the case?

Graphs

A graph is a data structure used to store objects (vertexes) with binary relationships. Those relationships are called edges.

Each edge can have generic data associated with it. Like a numeric value.

Maybe you can think that this data structure is rare in the world of data. But

if you look carefully at the data spread on the internet, graphs are everywhere: social networks, economic transactions, correlation matrices and many more.

Correlation matrices

A correlation matrix is a storing mechanism that quantifies relationships between pairs of random variables.

Thus, a correlation matrix is at least a weighted graph. The main point of this post is to show how we can use tools of network science to analyze the correlation matrix. Ok, but maybe you’re asking: Why do we need to do this?

Correlation matrices are commonly used to discover and segment the universe of possible random variables. With this segmentation into groups (clustering) we can analyze the dataset in a granular way. Therefore, improving our efficiency in discarding unnecessary information.

A widely spread way to analyze correlation matrices using graph analytics is through the use of minimum spanning trees. I won’t go into details about this method (check the following posts Networks with MlFinLab: Minimum Spanning and [Dynamic spanning trees in stock market networks:

The case of Asia-Pacific]( https://www.bcb.gov.br/pec/wps/ingl/wps351.pdf)) because I want to show how we can use a different technique that at the same time is more flexible and powerful when we want to understand the hidden group structure of a correlation matrix.

This technique is called the nested stochastic block model (nSBM) and it’s a probabilistic model that allows us to infer the hidden group structure of a graph. The nSBM is a generalization of the stochastic block model (SBM) that assumes that the graph is composed of a set of communities (groups) that are connected by a set of edges. The SBM is a generative model that allows us to generate graphs with several communities. This generative model also creates those communities using parameters that describe the probability of a connection between vertices living in the same or different communities. Although the SBM is a powerful tool to analyze the structure of a graph, it’s not enough to infer the small communities (small groups of stocks) in our graph. To do so, we need to use the nSBM.

Ok, let’s go to the actual stuff and code.

Downloading and preprocessing the data

Closing prices of the S&P 500

Let’s import what we need


import yfinance as yf

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import matplotlib as mpl

import igraph as ig

from edgeseraser.disparity import filter_ig_graph

To keep things simple, I choose to use the following data available here. This data encompasses the S&P500 from a specific time with symbols and sectors like this:

| Symbol | Name | Sector |

| —— | ——————- | ———– |

| MMM | 3M | Industrials |

| AOS | A. O. Smith | Industrials |

| ABT | Abbott Laboratories | Health Care |

| ABBV | AbbVie | Health Care |


!wget https://raw.githubusercontent.com/datasets/s-and-p-500-companies/master/data/constituents.csv

After downloading the data we just need to load it into a pandas dataframe.


df = pd.read_csv(“constituents.csv”)

all_symbols = df[‘Symbol’].values

all_sectors = df[‘Sector’].values

all_names = df[‘Name’].values

The following dictionaries will map the symbols to sectors and names later on


symbol2sector = dict(zip(all_symbols, all_sectors))

symbol2name = dict(zip(all_symbols, all_names))

Time to download the information about each symbol. Let’s request one semester

of data from yahoo finance API


start_date = ‘2018-01-01’

end_date = ‘2018-06-01’

try:

    prices = pd.read_csv(

        f"sp500_prices_{start_date}_{end_date}.csv", index_col="Date")

    tickers_available = prices.columns.values

except FileNotFoundError:

    df = yf.download(

        list(all_symbols),

        start=start_date,

        end=end_date,

        interval="1d",

        group_by=’ticker’,

        progress=True

    )

    tickers_available = list(

        set([ticket for ticket, _ in df.columns.T.to_numpy()]))

    prices = pd.DataFrame.from_dict(

        {

            ticker: df[ticker][“Adj Close”].to_numpy()

            for ticker in tickers_available

        }

    )

    prices.index = df.index

    prices = prices.iloc[:-1]

    del df

    prices.to_csv(

        f"sp500_prices_{start_date}_{end_date}.csv")

Returns and correlation matrix

We first start by computing the percentage of price changes for each stock, the return of each stock.


returns_all = prices.pct_change()

# a primeira linha não faz sentido, não existe retorno no primeiro dia

returns_all = returns_all.iloc[1:, :]

returns_all.dropna(axis=1, thresh=len(returns_all.index)//2., inplace=True)

returns_all.dropna(axis=0, inplace=True)

symbols = returns_all.columns.values

Now, we compute the correlation matrix using the returns


# plot the correlation matrix with ticks at each item

correlation_matrix = returns_all.corr()

plt.title(f"Correlation matrix from {start_date} to {end_date}")

plt.imshow(correlation_matrix)

plt.colorbar()

plt.savefig(“correlation.png”, dpi=150)

plt.clf()

Correlation matrix for S&P500 off the first 2018 semester. Yes, it’s a mess!

Ok, look into the image above. If you try to find groups of stocks that are less correlated just using visual inspection, you will have a hard time. This works just for a small set of stocks. Obviously, that is another approach like using linked cluster algorithms, but they are still hard to analyze when the number of stocks increases.

Removing irrelevant correlations with the edgeseraser library

Our goal here is to understand the structure of the stock market communities,

so we will keep just the matrix elements with positive values.


pos_correlation = correlation_matrix.copy()

# vamos considerar apenas as correlações positivas pois queremos

# apenas as comunidade>

pos_correlation[pos_correlation < 0.] = 0

# diagonal principal é setada a 0 para evitar auto-arestas

np.fill_diagonal(pos_correlation.values, 0)

Now, we create a weighted graph where each vertex is a stock and each positive correlation is a weight associated to the edge between two stocks.


g = ig.Graph.Weighted_Adjacency(pos_correlation.values, mode=’undirected’)

# criamos uma feature symbol para cada vértice

g.vs[“symbol”] = returns_all.columns

# o grafo pode estar desconectado. Portanto, extraímos a componente gigante

cl = g.clusters()

g = cl.giant()

n_edges_before = g.ecount()

The graph above has too much information, which is problematic for graph analysis. So, we must first identify and keep only the relevant edges. To do so, we use the disparity filter.

available in my edgeseraser package.


_ = filter_ig_graph(g, .25, cond="both", field="weight")

cl = g.clusters()

g = cl.giant()

n_edges_after = g.ecount()


print(f"Percentage of edges removed: {(n_edges_before - n_edges_after)/n_edges_before*100:.2f}%")

print(f"Number of remained stocks: {len(symbols)}")

...

Percentage of edges removed: 95.76%

Number of remained stocks: 492

We have deleted most of the edges and only a backbone remained in our graph. This backbone structure can now be used to extract.

insights from the stock market.

nSBM: fiding the hierarquical structure

Converting from igraph to graph-tool

Here, we will use the graph-tool lib to infer the hierarchical structure of the graph. Thus, we must first convert the graph from igraph to graph-tool. We can convert that using the following code


import graph_tool.all as gt

gnsbm = gt.Graph(directed=False)

# iremos adicionar os vértices

for v in g.vs:

    gnsbm.add_vertex()

# e as arestas

for e in g.es:

    gnsbm.add_edge(e.source, e.target)

How to use the graph-tool nSBM?

With the sparsified correlation graph that encompasses the stock relationships, we can ask ourselves how those relationships ties stocks together. Here, we will use the nSBM to answer this.

The nSBM

is an inference method that optimizes a set of parameters that defines a generative graph model for our data. In our case, to find this model for the sparsified s&p500 correlation graph, we will cal the method minimize_nested_blockmodel_dl


state = gt.minimize_nested_blockmodel_dl(gnsbm)

We can use the code below to define the colors for our plot.


symbols = g.vs[“symbol”]

sectors = [symbol2sector[symbol] for symbol in symbols]

u_sectors = np.sort(np.unique(sectors))

u_colors = [plt.cm.tab10(i/len(u_sectors))

          for i in range(len(u_sectors))]

# a primeira cor da lista era muito similar a segunda,

u_colors[0] = [0, 1, 0, 1]

sector2color = {sector: color for sector, color in zip(u_sectors, u_colors)}

rgba = gnsbm.new_vertex_property(“vector<double>“)

gnsbm.vertex_properties[‘rgba’] = rgba

for i, symbol in enumerate(symbols):

    c = sector2color[symbol2sector[symbol]]

    rgba[i] = [c[0], c[1], c[2], .5]

Calling the draw method, we create a plot with the nSBM result. You can play a little, changing the `$\beta \in (0, 1)$ parameter to see how the things change. This parameter controls the strength of the edge-bundling algorithm.


options = {

    ‘output’: f’nsbm_{start_date}_{end_date}.png’,

    ‘beta’: .9,

    ‘bg_color’: ‘w’,

   #’output_size’: (1500, 1500),

    ‘vertex_color’: gnsbm.vertex_properties[‘rgba’],

    ‘vertex_fill_color’: gnsbm.vertex_properties[‘rgba’],

    ‘hedge_pen_width’: 2,

    ‘hvertex_fill_color’: np.array([0., 0., 0., .5]),

    ‘hedge_color’: np.array([0., 0., 0., .5]),

    ‘hedge_marker_size’: 20,

    ‘hvertex_size’:20

}

state.draw(**options)

Finally, we can see the hierarchical structure got from our filtered graph using

the nSBM algorithm.


plt.figure(dpi=150)

plt.title(f”Sectors of the S&P 500 from {start_date} to {end_date}")

legend = plt.legend(

    [plt.Line2D([0], [0], color=c, lw=10)

        for c in list(sector2color.values())],

    list(sector2color.keys()),

    bbox_to_anchor=(1.05, 1),

    loc=2,

    borderaxespad=0.)

plt.imshow(plt.imread(f’nsbm_{start_date}_{end_date}.png’))

plt.xticks([])

plt.yticks([])

plt.axis(‘off’)

plt.savefig(f’nsbm_final_{start_date}_{end_date}.png’, bbox_inches=’tight’,

            dpi=150, bbox_extra_artists=(legend,), facecolor=’w’, edgecolor=’w’)

plt.show()

nSBM result for the correlation matrix of S&P500 returns.

Wow, beautiful, isn’t it? But maybe you can only see a beautiful picture and not be

able to understand what is happening. So, let’s try to understand what this plot is

telling us.

How to analyze?

Each circle (leaf) represents a stock. A vertex in the original graph.
Each group of leafs which resembles a broom is a community detected by the nSBM.

Navigating backwards through the arrows, we can see the hierarchical community structure. The image above shows three communities that belong to the same grandfather community.

We can see interesting stuff in the community structure. For example, most stocks related to Consumer staples, Real state, and Utilities belongs to the same community.

What is the meaning of the edges?

A larger number of edges between Financials, Industrials and Information technology survive the filtering technique, which shows a strong relationship between these sectors.

The visual inspection of the edges relationships is meaningful only because of the use of edge bundling algorithm. Let’s see how a weak bundling turns our job into a hard one, setting $\beta = 0.5$

Can you understand what is happening? Me neither.

A visual inspection is sometimes not enough, and maybe you need to create an automatic analysis, which is easy to do with the graph-tool. For example, to get a summary of the hierarchical community structure got by the nSBM algorithm, we can call the print_summary method.


state.print_summary()

l: 0, N: 483, B: 25

l: 1, N: 25, B: 6

l: 2, N: 6, B: 2

l: 3, N: 2, B: 1

l: 4, N: 1, B: 1

In the first level, we have 21 communities for all the 11 sectors.

If we want to know which communities the TSLA stock belongs to, we go through

the hierarchy in the reverse order,


# esse é o indice da TSLA no nosso grafo original

symbol = “TSLA”

index_tesla = symbols.index(symbol)

print(symbol, symbol2sector[symbol], symbol2name[symbol])

r0 = state.levels[0].get_blocks()[index_tesla]

r1 = state.levels[1].get_blocks()[r0]

r2 = state.levels[2].get_blocks()[r1]

r3 = state.levels[3].get_blocks()[r2]

(r1, r2, r3)


(‘TSLA’, ‘Consumer Discretionary’, ‘Tesla’)

(19, 0, 0)

Conclusion

The aim of this post was to give you a first impression of a method that combines the use of nSBM and edge filtering techniques to enhance our understanding of the correlation structure of the data, especially with financial data. However, we can also use this method in different scenarios, as

topic modeling in NLP and survey analysis.

grafos comunidades nsbm finanças mercado-de-ações python