Why Nonadditive Entropy Is Important for Big Data Corpora Combinations

Nonadditive Entropy – A Crucial Predictive Analysis Measure for Data Mining in Multiple Large Data Corpora

Statistical mechanics has an important role to play in big data analytics. Until now, however, there has been little understanding of how statistical mechanics provides both practical value and a theoretical framework for data analysis, and even for predictive intelligence (sometimes called predictive analysis).

This blogpost focuses on a related – and crucially important – issue: How can we determine the value of combining two or more data corpora when answering a query?

Getting a Cost Function for Corpora Combination

Let us formalize what we are seeking as a Value measure, and identify V(A+B) as the value associated with merging corpus B with corpus A, where the values of each corpus independently are V(A) and V(B), respectively. (I will relate this value to entropy and information content in a later post.)

What we are seeking here is an actual theoretical framework for producing V(A+B), even if we know beforehand that certain crucial parameters will have to be determined empirically.

The reason this is so important is that if, a priori, we can estimate the information value gained by merging two (or more) corpora, and if we can also estimate the work involved (as computational tasking), then we can answer crucial questions about the cost associated with the complexity and completeness of multiple-corpora access. This cost can be interpreted in multiple ways, ranging from computer resource allocation up to the real cost billed to a client.

This will have a profound impact, as both corporations and government organizations seek to extract meaningful answers from multiple large data corpora.

The larger the corpora, and the more we need to access multiple corpora to get complete answers, the more important it becomes to have a Value function V(A+B).

Role of Statistical Mechanics in Big Data Analytics

The theoretical formalism that will contribute to our Value measure, V(A+B), comes to us from statistical mechanics. Specifically, it relates to the entropy of combining corpus A with corpus B in order to extract answers to a query from both corpora.

Until now, those of us with a strong background in physics, physical chemistry, or information theory have had an innate sense that statistical mechanics should play a role in big data analytics. Despite our strong intuition that the connection should be obvious, the actual means has remained elusive.

The reason for this is that when we apply the simplest statistical mechanics formalisms to modeling behavior in big data corpora, we do not get the kind of results that we would like. This is largely because the statistical mechanics equations that most researchers have investigated are too simple.

However, it is possible to find a very useful and rich application through the cluster variation method. (This discussion will be deferred to a separate blogpost and a related White Paper; links to be provided.)

Nonadditive Entropy

Over the past few decades, the notions of nonadditive entropy and nonextensive entropy have emerged and become the subject of much excited attention. This revolution in the basic formulation of entropy rivals the breakthroughs that quantum mechanics and relativity made beyond classical Newtonian mechanics.

A good reference on this subject is:

Tsallis, C. (2011). The nonadditive entropy S_q and its applications in physics and elsewhere: some remarks. Entropy, 13(10), 1765-1804.

Specifically, the nonadditive entropy is given as

S_q = k \frac{1 - \sum_{i=1}^{W} p_i^q}{q - 1}

where the p_i are the probabilities of the W discrete states and k is a positive constant.

When all probabilities are equal, for the discrete case, we have

S_q = k \frac{W^{1-q} - 1}{1 - q} = k \ln_q W

where \ln_q denotes the q-logarithm, \ln_q x = (x^{1-q} - 1)/(1 - q).

In the limit q = 1, we recover the classical Boltzmann-Gibbs-Shannon formalism for entropy, S_1 = -k \sum_{i=1}^{W} p_i \ln p_i.
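
To make this concrete, here is a minimal Python sketch (my own illustration; the function name tsallis_entropy and the toy distribution are not from the Tsallis paper) that computes S_q for a discrete distribution and shows numerically that, for equal probabilities, S_q approaches the classical ln W as q goes to 1:

```python
import numpy as np

def tsallis_entropy(p, q, k=1.0):
    """Tsallis nonadditive entropy S_q of a discrete distribution p.

    The general expression is singular at q = 1, so we fall back to
    the Boltzmann-Gibbs-Shannon form there.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # states with zero probability contribute nothing
    if np.isclose(q, 1.0):
        return -k * np.sum(p * np.log(p))
    return k * (1.0 - np.sum(p ** q)) / (q - 1.0)

# Equal probabilities over W states: S_q = k * (W**(1-q) - 1) / (1 - q)
W = 8
p_uniform = np.full(W, 1.0 / W)
for q in (0.5, 0.9, 0.999, 1.0, 1.5):
    print(f"q = {q:5.3f}  S_q = {tsallis_entropy(p_uniform, q):.4f}")
# As q -> 1, the values approach ln(8) = 2.0794, the classical result.
```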

The nonadditivity that gives S_q its name appears when we combine two probabilistically independent systems, A and B. The q-generalized entropy S_q then composes according to the pseudo-additivity rule

\frac{S_q(A+B)}{k} = \frac{S_q(A)}{k} + \frac{S_q(B)}{k} + (1-q) \frac{S_q(A)}{k} \frac{S_q(B)}{k}

so that for q \neq 1 the entropy of the combined system is not simply the sum of its parts.

Our task becomes determining the index q, a real number. In the context of corpora merging, it will reflect the degree of overlap between two corpora.

There has been substantial work in physics and related fields on the q-generalized entropy S_q. We can use this as a springboard for determining q as relevant to the world of data mining when the task involves combining large data corpora.
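
As a toy illustration of how q might be determined, the sketch below (again my own construction, assuming k = 1, the tsallis_entropy helper from the sketch above, and two hypothetical, fully independent distributions) verifies the pseudo-additive composition rule on a product distribution and then inverts that rule to recover an effective q from the three entropies:

```python
import numpy as np

def tsallis_entropy(p, q, k=1.0):
    """Tsallis entropy S_q (same helper as in the previous sketch)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return -k * np.sum(p * np.log(p))
    return k * (1.0 - np.sum(p ** q)) / (q - 1.0)

q = 1.3
p_a = np.array([0.5, 0.3, 0.2])   # hypothetical "corpus A" distribution
p_b = np.array([0.6, 0.4])        # hypothetical "corpus B" distribution

# For independent systems, the joint distribution is the outer product.
p_joint = np.outer(p_a, p_b).ravel()

s_a = tsallis_entropy(p_a, q)
s_b = tsallis_entropy(p_b, q)
s_ab = tsallis_entropy(p_joint, q)

# Pseudo-additivity: S_q(A+B) = S_q(A) + S_q(B) + (1 - q) S_q(A) S_q(B)
print(s_ab, s_a + s_b + (1.0 - q) * s_a * s_b)  # the two values agree

# Inverting the same relation gives an effective q from the entropies:
q_eff = 1.0 - (s_ab - s_a - s_b) / (s_a * s_b)
print(q_eff)  # recovers 1.3 in this idealized, independent case
```

In this idealized case the inversion simply returns the q used to build the joint distribution. With real, overlapping corpora, the joint entropy falls below what independence would predict, and the resulting shift in the effective q is one way the degree of overlap could show up.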

The special value of this in predictive intelligence comes as we assess differences in corpora over time, e.g., how corpus B differs from corpus A, which should show up in the q values. One of the greatest points of interest will be seeing how the cluster variables z(i) change over time; this will give indications about how relevant data items are embedded in different contexts. (This will be the subject of a future blogpost.)
