pivoting in Python
I needed to do some pre-processing of some data which involved transposing column names to values. The condition was that the value for each respective column (frequency count) had to be > 1.
My input was a csv file, and my goal was an output csv file which would feed into a batch database import process.
ID,DA,NL,PHENOM1,PHENOM2,PHENOM3,PHENOM4 233,99,44,0.00,27.00,12.00,0.00
The other interesting bit was that only a range of columns applied to the condition; the other columns represented ancillary data.
Enter Python:
#!/usr/bin/python
import sys
import csv
# open file and read headers
fPhenomenon = open("phenomenon.txt","r")
sHeaders = fPhenomenon.readline().replace(r'"','')
aHeaders = sHeaders.split(",")
# feed the rest to csv
csvIn = csv.reader(fPhenomenon)
csvOut = csv.writer(sys.stdout)
for sRowIn in csvIn:
aRowOut = []
aPhenomenon = []
aRowOut.append(sRowIn[0]) # procedure ID
aRowOut.append(sRowIn[1]) # major drainage area ID
for nIndexTupleVal, tupleVal in enumerate(sRowIn[3:-1]):
if (float(tupleVal) > 0): # phenomenon measured at least once
# add phenomenon name to list
aPhenomenon.append(aHeaders[nIndexTupleVal+3])
# add phenomenon list to record
aRowOut.append(",".join(aPhenomenon))
csvOut.writerow(aRowOut)
Notes
- hooray for raw strings!
- enumerate() is great and saves you the trouble of declaring your own counter
- like any language, modules/libraries makes things so easy to work with
- I wish the header stuff was a bit cleaner (I should look further into the csv module w.r.t. headers
That’s my hack for the day. Have a good weekend!
UPDATE: ah, the csv module has a .next() method, which can be used instead of the shoemaker attempt I made above to regularize / split / store the header list.
