Have you ever visualized a data set and thought you saw a relationship between two data streams? When one data stream seems to move with another, the relationship is often referred to as a correlation. Detecting a correlation between two data streams is surprisingly easy with the numpy library. This post shows what I've recently learned on the matter.
Let's create a list of objects with 5 numeric attributes. We'll assign the first 4 attributes random values; the 5th attribute will be identical to the 4th about 94% of the time and random otherwise. Because the 4th and 5th attributes are correlated, we expect a strong correlation coefficient for that pair and a much lower coefficient (likely near zero) when comparing the unrelated attributes. It's worth noting that you need a reasonably large data set to get a trustworthy coefficient: in a small sample, even unrelated attributes can show a sizable coefficient purely by chance.
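To get a feel for the sample-size point, here is a rough sketch that is separate from the script in this post. It repeatedly draws two unrelated columns of random integers in 0..5 and measures how much the estimated coefficient wanders around zero. The helper name `spread_of_estimates`, the trial count, and the seed are all assumptions made for illustration.

```python
import numpy as np


def spread_of_estimates(n, trials=200, seed=0):
    """Std. dev. of the correlation estimate for two unrelated columns of length n."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(trials):
        a = rng.integers(0, 6, size=n)  # random integers in 0..5
        b = rng.integers(0, 6, size=n)  # drawn independently of a
        estimates.append(np.corrcoef(a, b)[0, 1])
    return float(np.std(estimates))


small = spread_of_estimates(20)
large = spread_of_estimates(20000)
print("n=20:    spread %.3f" % small)
print("n=20000: spread %.3f" % large)
```

The spread for the short columns should come out far larger than for the long ones, which is why a coefficient computed from a handful of rows is hard to trust.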
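Let's create a list of 10,000 elements and calculate the correlation coefficient between each pair of attributes. We expect the coefficient between the 4th and 5th attributes to be about 0.94: the 5th attribute copies the 4th with that probability and is otherwise an independent draw from the same range, so the expected coefficient equals the copy probability.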
$ cat demoCorrelation
#!/usr/bin/env python3
import uuid

import numpy


def run():
    L = []
    DiceCoeff = 0.94
    for i in range(0, 10000):
        e = dict()
        e['id'] = str(uuid.uuid4())
        # four independent attributes, each a random integer in 0..5
        for k in ['c1', 'c2', 'c3', 'c4']:
            e[k] = numpy.random.randint(0, 6)
        #--add a bit of randomness to the correlation
        roll = numpy.random.randint(0, 101)
        if roll <= DiceCoeff * 100:
            e['e1'] = e['c4']
        else:
            e['e1'] = numpy.random.randint(0, 6)
        L.append(e)

    MinCorrCoeff = 0.75
    print("DiceCoeff: %f" % DiceCoeff)
    print("MinCorrCoeff: %f" % MinCorrCoeff)
    # 'id' is a string, so leave it out of the numeric comparison
    keys = sorted(k for k in L[0] if k != 'id')
    for k1 in keys:
        L1 = [e[k1] for e in L]
        for k2 in keys:
            L2 = [e[k2] for e in L]
            # off-diagonal entry of the 2x2 correlation matrix
            coeff = numpy.corrcoef(L1, L2)[0][1]
            if abs(coeff) > MinCorrCoeff and k1 != k2:
                print("%s/%s : %f" % (k1, k2, coeff))


#--main--
run()
When we run this beast you'll see that the forced correlation is detected.
$ ./demoCorrelation
DiceCoeff: 0.940000
MinCorrCoeff: 0.750000
c4/e1 : 0.942737
e1/c4 : 0.942737
Correlation doesn't determine cause/effect, and the coefficient is symmetric, so we expect each pairing to be reported in both directions.
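The bi-directional pairing can be checked directly. The following is a minimal sketch, not the script above: it assumes the columns are held as NumPy arrays rather than a list of dicts, and it copies `c4` into `e1` with probability `p` using `rng.random()`, which is a simplification of the dice roll. The names `make_columns` and `strong_pairs` are hypothetical. Passing all columns to `numpy.corrcoef` at once returns the full coefficient matrix, so only the upper triangle needs scanning.

```python
import numpy as np


def make_columns(n=10000, p=0.94, seed=0):
    """Four independent columns plus e1, which copies c4 with probability p."""
    rng = np.random.default_rng(seed)
    cols = {k: rng.integers(0, 6, size=n) for k in ["c1", "c2", "c3", "c4"]}
    copy = rng.random(n) < p  # True where e1 should mirror c4
    cols["e1"] = np.where(copy, cols["c4"], rng.integers(0, 6, size=n))
    return cols


def strong_pairs(cols, min_coeff=0.75):
    """Return (name1, name2, coeff) for each pair whose |coeff| exceeds min_coeff."""
    names = sorted(cols)
    # Each row is one variable, so R[i, j] pairs names[i] with names[j].
    R = np.corrcoef(np.array([cols[k] for k in names]))
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle: each pair once
            if abs(R[i, j]) > min_coeff:
                pairs.append((names[i], names[j], float(R[i, j])))
    return pairs, R


pairs, R = strong_pairs(make_columns())
print("symmetric:", np.allclose(R, R.T))
for k1, k2, coeff in pairs:
    print("%s/%s : %f" % (k1, k2, coeff))
```

Because the matrix equals its own transpose, reporting `c4/e1` already tells you everything `e1/c4` would.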
Pretty cool, huh? Use this power for good, dear reader.