
Sunday, December 9, 2018

Detecting Correlation Relationships with Python


Have you ever visualized a data set and thought you saw a relationship between two data streams?  When one data stream seems to move with another, this is often referred to as a correlation.  Detecting a correlation between two data streams is surprisingly easy with the numpy library.  This post shows what I've recently learned on the matter.

Let's create a list of objects with 5 numeric attributes.  Let's assign the first 4 attributes random values; the 5th attribute will be identical to the 4th attribute 94% of the time.  Because the values of the 4th and 5th attributes are correlated, we expect a strong correlation coefficient when we compare them and a much lower coefficient (likely near zero) when comparing the unrelated attributes.  It's worth noting that you likely need a reasonably large data set to get a trustworthy coefficient; with a small one, chance alone can produce a misleading value.
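To get a feel for the sample-size caveat, here is a minimal sketch that compares pairs of unrelated columns at a few sizes and keeps the largest coefficient that shows up purely by chance.  The helper name worst_chance_coeff, the seed, and the trial count are my own choices for illustration and aren't part of the script in this post.

```python
import numpy

rng = numpy.random.default_rng(42)

def worst_chance_coeff(n, trials=200):
    """Largest |coefficient| seen between two unrelated columns of length n."""
    worst = 0.0
    for _ in range(trials):
        # Two independent columns of random integers 0..5 (upper bound is exclusive).
        a = rng.integers(0, 6, size=n)
        b = rng.integers(0, 6, size=n)
        worst = max(worst, abs(numpy.corrcoef(a, b)[0][1]))
    return worst

results = {n: worst_chance_coeff(n) for n in (10, 100, 10000)}
for n, worst in results.items():
    print("n=%-6d worst unrelated coeff: %f" % (n, worst))
```

With only a handful of rows, unrelated columns can occasionally look strongly related; as the row count grows, the chance coefficients shrink toward zero.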

Let's create a list of 10,000 elements and calculate the correlation coefficient between each pair of attributes.  We expect the coefficient between the 4th and 5th attributes to be roughly 0.94, as that is the correlation we are forcing.
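Why should the coefficient land near the forcing percentage?  If the 5th attribute copies the 4th with probability p and is otherwise an independent draw from the same range, the two have the same variance and their covariance is p times that variance, so the coefficient works out to p.  The following is a rough vectorized sketch to check this, not the script used in this post; forced_coeff, its default sample size, and the seed are assumptions of mine.

```python
import numpy

def forced_coeff(p, n=100000, seed=0):
    """Coefficient between c4 and a column that copies c4 with probability p."""
    rng = numpy.random.default_rng(seed)
    c4 = rng.integers(0, 6, size=n)      # random integers 0..5
    other = rng.integers(0, 6, size=n)   # independent fallback values
    copy = rng.random(n) < p             # True where e1 should copy c4
    e1 = numpy.where(copy, c4, other)
    return numpy.corrcoef(c4, e1)[0][1]

for p in (0.25, 0.5, 0.94):
    print("p=%.2f  coeff=%f" % (p, forced_coeff(p)))
```

Each printed coefficient should sit close to the p that produced it.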



$ cat demoCorrelation 
#!/usr/bin/env python3
import numpy
import uuid

def run():
  L=[]
  DiceCoeff=0.94
  for i in range(0,10000):
    e=dict()
    e['id']=str(uuid.uuid4())
    #--four independent attributes, each a random integer 0..5
    for k in ['c1','c2','c3','c4']:
      e[k]=numpy.random.randint(0,6)
    #--add a bit of randomness to the correlation
    #--roll is 0..100 inclusive, so e1 copies c4 on 95 of 101 outcomes (about 94%)
    roll=numpy.random.randint(0,101)
    if roll<=DiceCoeff*100:
      e['e1']=e['c4']
    else:
      e['e1']=numpy.random.randint(0,6)
    L.append(e)

  MinCorrCoeff=0.75

  print("DiceCoeff: %f"%(DiceCoeff))
  print("MinCorrCoeff: %f"%(MinCorrCoeff))
  #--skip 'id': it is a string, so it has no correlation coefficient
  keys=sorted(k for k in L[0].keys() if k!='id')
  for k1 in keys:
    L1=[e[k1] for e in L]
    for k2 in keys:
      L2=[e[k2] for e in L]
      coeff=numpy.corrcoef(L1,L2)[0][1]
      #--report only strong correlations between two different attributes
      if abs(coeff)>MinCorrCoeff and k1!=k2:
        print("%s/%s : %f"%(k1,k2,coeff))


#--main--
run()

When we run this beast you'll see that the forced correlation is detected.


$ ./demoCorrelation 
DiceCoeff: 0.940000
MinCorrCoeff: 0.750000
c4/e1 : 0.942737
e1/c4 : 0.942737


The correlation coefficient is symmetric and says nothing about cause and effect, so the pairing shows up in both directions with the same value.
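If the duplicate lines bother you, one option is to hand numpy.corrcoef all the columns at once and walk only the upper triangle of the resulting matrix.  This is a sketch under my own assumptions (three columns built with numpy's Generator API, the same 0.75 threshold, a made-up seed), not a drop-in replacement for the script above.

```python
import numpy

rng = numpy.random.default_rng(1)
n = 10000
c3 = rng.integers(0, 6, size=n)
c4 = rng.integers(0, 6, size=n)
# e1 copies c4 about 94% of the time, otherwise takes a fresh random value.
e1 = numpy.where(rng.random(n) < 0.94, c4, rng.integers(0, 6, size=n))

# Each row passed to corrcoef is one variable, so this yields a 3x3 matrix.
names = ['c3', 'c4', 'e1']
M = numpy.corrcoef([c3, c4, e1])

# Walk only the upper triangle so each pair is reported once.
strong = []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(M[i][j]) > 0.75:
            strong.append((names[i], names[j]))
            print("%s/%s : %f" % (names[i], names[j], M[i][j]))
```

The matrix equals its own transpose, which is why the original loop reports c4/e1 and e1/c4 with identical values.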

Pretty cool, huh?  Use this power for good, dear reader.