Sunday, December 9, 2018

Detecting Correlation Relationships with Python


Have you ever visualized a data set and thought you saw a relationship between two data streams?  When one data stream seems to move with another, this is often referred to as a correlation.  Detecting a correlation between two data streams is surprisingly easy with the numpy library.  This post covers what I've recently learned on the matter.

Let's create a list of objects with 5 numeric attributes.  We'll assign the first 4 attributes random values; the 5th attribute will, 94% of the time, be copied from the 4th attribute.  Because the 4th and 5th attributes are correlated, we expect a strong correlation coefficient when we compare them and a much lower coefficient (likely near zero) when we compare the unrelated attributes.  It's worth noting that you likely need a reasonably large data set to get a trustworthy coefficient; on small data sets the coefficient is noisy and can look strong or weak by chance.
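To get a feel for the data set size point, here is a minimal sketch that repeats the experiment many times at a few sizes and reports how far the coefficient wanders.  The helper names (simulated_coeff, spread), the sizes, and the trial count are my own choices for illustration, and it assumes a numpy recent enough to provide numpy.random.default_rng.

```python
import numpy as np

def simulated_coeff(n, p=0.94, seed=0):
    """Correlation between c4 and e1 for n samples, where e1 copies c4 with probability p."""
    rng = np.random.default_rng(seed)
    c4 = rng.integers(0, 6, size=n)       # random values 0..5
    noise = rng.integers(0, 6, size=n)    # independent replacement values 0..5
    copy = rng.random(n) < p              # True where e1 copies c4
    e1 = np.where(copy, c4, noise)
    return np.corrcoef(c4, e1)[0, 1]

def spread(n, trials=200):
    """Smallest and largest coefficient seen over repeated experiments of size n."""
    coeffs = [simulated_coeff(n, seed=s) for s in range(trials)]
    return min(coeffs), max(coeffs)

for n in (20, 200, 10000):
    lo, hi = spread(n)
    print("n=%5d  min=%.3f  max=%.3f" % (n, lo, hi))
```

The gap between min and max should shrink as n grows, which is the reason to prefer a larger data set before trusting a coefficient.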

Let's create a list of 10,000 elements and calculate the correlation coefficient between each pair of attributes.  We expect the coefficient between the 4th and 5th attributes to be about 0.94, as that is the correlation we are forcing.
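Why should the coefficient land on the copy probability itself?  Here is a short derivation under my own notation (none of these symbols appear in the script): X is the 4th attribute, Z is an independent draw from the same distribution, B is a coin flip that comes up 1 with probability p, and Y is the 5th attribute.  X and Z share mean \mu and variance \sigma^2, and since Y is always drawn from that same distribution, Var(Y) = \sigma^2 as well.

```latex
\begin{aligned}
Y &= B X + (1 - B) Z, \qquad B \sim \mathrm{Bernoulli}(p) \\
\mathrm{E}[XY] &= p\,\mathrm{E}[X^2] + (1 - p)\,\mathrm{E}[X]\,\mathrm{E}[Z] = p\,\mathrm{E}[X^2] + (1 - p)\,\mu^2 \\
\mathrm{Cov}(X, Y) &= \mathrm{E}[XY] - \mu^2 = p\,(\mathrm{E}[X^2] - \mu^2) = p\,\sigma^2 \\
\rho(X, Y) &= \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{p\,\sigma^2}{\sigma^2} = p
\end{aligned}
```

So with p = 0.94 the expected coefficient is 0.94, give or take sampling noise.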



$ cat demoCorrelation 
#!/usr/bin/env python3
import numpy
import uuid

def run():
  L = []
  DiceCoeff = 0.94  # probability that e1 is copied from c4
  for i in range(0, 10000):
    e = dict()
    e['id'] = str(uuid.uuid4())
    for k in ['c1', 'c2', 'c3', 'c4']:
      # randint's upper bound is exclusive, so this draws from 0..5
      e[k] = numpy.random.randint(0, 6)
    #--add a bit of randomness to the correlation
    if numpy.random.random() < DiceCoeff:
      e['e1'] = e['c4']
    else:
      e['e1'] = numpy.random.randint(0, 6)
    L.append(e)

  MinCorrCoeff = 0.75

  print("DiceCoeff: %f" % (DiceCoeff))
  print("MinCorrCoeff: %f" % (MinCorrCoeff))
  # skip 'id': it is a string, so it has no correlation coefficient
  keys = sorted(k for k in L[0].keys() if k != 'id')
  for k1 in keys:
    L1 = [e[k1] for e in L]
    for k2 in keys:
      L2 = [e[k2] for e in L]
      # corrcoef returns a 2x2 matrix; [0][1] is the L1-vs-L2 entry
      coeff = numpy.corrcoef(L1, L2)[0][1]
      if abs(coeff) > MinCorrCoeff and k1 != k2:
        print("%s/%s : %f" % (k1, k2, coeff))


#--main--
run()

When we run this beast you'll see that the forced correlation is detected.


$ ./demoCorrelation 
DiceCoeff: 0.940000
MinCorrCoeff: 0.750000
c4/e1 : 0.942737
e1/c4 : 0.942737


The correlation coefficient is symmetric and says nothing about cause and effect, so we expect each pairing to be reported in both directions.
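If the duplicate pairing bothers you, one option is to compute the whole correlation matrix in a single call and walk each unordered pair once.  This is only a sketch: strong_pairs and its min_coeff parameter are hypothetical names of mine, and the data is generated with numpy.random.default_rng rather than taken from the script above.

```python
import itertools
import numpy as np

def strong_pairs(columns, min_coeff=0.75):
    """Return (name1, name2, coeff) for each unordered pair with |coeff| above min_coeff."""
    names = sorted(columns)
    # corrcoef treats each row as one variable and returns a symmetric matrix
    matrix = np.corrcoef([columns[n] for n in names])
    pairs = []
    for i, j in itertools.combinations(range(len(names)), 2):
        if abs(matrix[i, j]) > min_coeff:
            pairs.append((names[i], names[j], matrix[i, j]))
    return pairs

rng = np.random.default_rng(1)
cols = {k: rng.integers(0, 6, size=10000) for k in ['c1', 'c2', 'c3', 'c4']}
# e1 copies c4 about 94% of the time, otherwise takes a fresh random value
cols['e1'] = np.where(rng.random(10000) < 0.94, cols['c4'], rng.integers(0, 6, size=10000))

for k1, k2, coeff in strong_pairs(cols):
    print("%s/%s : %f" % (k1, k2, coeff))
```

Because the matrix is symmetric, checking only the pairs where the first index is smaller loses nothing, and c4/e1 shows up once instead of twice.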

Pretty cool, huh?  Use this power for good, dear reader.
