Exclude Columns From Genfromtxt With Numpy
Solution 1:
You could filter out the columns that contain nan after reading.
In [51]: import numpy as np
In [52]: txt=b'antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1'
In [53]: txt=[txt,txt]
In [54]: A=np.genfromtxt(txt, dtype=float, names=None,delimiter=',')
In [55]: A
Out[55]:
array([[ nan, 1., 0., 0., 1., 0., 0., 0., 1., 1., 1.,
0., 0., 4., 1., 0., 1., 1.],
[ nan, 1., 0., 0., 1., 0., 0., 0., 1., 1., 1.,
0., 0., 4., 1., 0., 1., 1.]])
Next, build a mask of the columns that are nan in all rows; I could use .any instead to catch columns with any nan. Other tests are possible.
In [56]: ind=np.isnan(A).all(axis=0)
In [57]: ind
Out[57]:
array([ True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False], dtype=bool)
In [58]: A[:,~ind]
Out[58]:
array([[ 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 4.,
1., 0., 1., 1.],
[ 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 4.,
1., 0., 1., 1.]])
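For reference, the same steps can be put into a standalone script. This is only a sketch under a few assumptions: the sample rows are passed as str with an explicit encoding="utf-8" (rather than bytes), and the variable names all_nan, numeric and direct are made up for illustration. The last call shows the usecols alternative for the case where the positions of the unwanted columns are already known.

```python
import numpy as np

# Two identical sample rows, as in the session above (str instead of bytes).
lines = ["antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1"] * 2

# Parse everything as float; the text column cannot be converted and becomes nan.
A = np.genfromtxt(lines, dtype=float, delimiter=",", encoding="utf-8")

# Mask of columns that are nan in every row
# (use .any(axis=0) to flag columns with nan in any row).
all_nan = np.isnan(A).all(axis=0)
numeric = A[:, ~all_nan]

# If the unwanted column positions are known up front,
# usecols skips them while reading, so no nan filtering is needed.
direct = np.genfromtxt(lines, dtype=float, delimiter=",",
                       usecols=range(1, 18), encoding="utf-8")
```

Filtering after the read costs one extra pass over the array but needs no knowledge of the file layout; usecols is simpler when the layout is fixed.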
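Another idea is to read the file once with dtype=None, letting genfromtxt choose the dtype for each column. The resulting compound dtype can then be filtered to find the columns of the desired type. Note that '<i4' in the test below is the integer type inferred on my platform; on others it may be '<i8'.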
In [118]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',')
In [119]: ind=[i for i, d in enumerate(A.dtype.descr) if d[1]=='<i4']
In [120]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',',usecols=ind)
In [121]: A
Out[121]:
array([[1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1],
[1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1]])
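Comparing against the literal string '<i4' ties the filter to one integer width and byte order. A possibly more portable variant, sketched below, tests each field with np.issubdtype instead. The names probe, int_cols and B are hypothetical, and the input is given as str lines with an explicit encoding, which is an assumption not made in the session above.

```python
import numpy as np

lines = ["antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1"] * 2

# First pass: let genfromtxt infer a dtype per column (gives a structured array).
probe = np.genfromtxt(lines, dtype=None, delimiter=",", encoding="utf-8")

# Keep the positions of integer fields, whatever their width or byte order.
int_cols = [i for i, name in enumerate(probe.dtype.names)
            if np.issubdtype(probe.dtype[name], np.integer)]

# Second pass: read only those columns into a plain 2-D integer array.
B = np.genfromtxt(lines, dtype=int, delimiter=",",
                  usecols=int_cols, encoding="utf-8")
```

Swapping np.integer for np.number would also keep float columns.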
The dtype could also be filtered to collect the names of the columns that have the correct type:
In [128]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',')
In [129]: ind=[d[0] for d in A.dtype.descr if d[1]=='<i4']
In [130]: A[ind]
Out[130]:
array([(1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1),
(1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1)],
dtype=[('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<i4'), ('f8', '<i4'), ('f9', '<i4'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4'), ('f17', '<i4')])
Consolidating this structured array into a 2d array with a single dtype (int) is a bit of a pain, though (I could go into the details if needed).
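One way to do that consolidation, assuming a NumPy version that ships numpy.lib.recfunctions.structured_to_unstructured (1.16 or later), is sketched here. This is not necessarily what the answer had in mind; the names S, int_names and flat are illustrative only.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

lines = ["antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1"] * 2

# Structured array with one field per column (f0, f1, ...).
S = np.genfromtxt(lines, dtype=None, delimiter=",", encoding="utf-8")

# Names of the integer fields only.
int_names = [name for name in S.dtype.names
             if np.issubdtype(S.dtype[name], np.integer)]

# Select those fields, then collapse them into an ordinary 2-D array.
flat = rfn.structured_to_unstructured(S[int_names])
```

The result has one row per record and one column per selected field, with a single common dtype.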
Solution 2:
pandas has a DataFrame.select_dtypes method that will let you do this pretty easily. You can get the data into a DataFrame either directly (as in the example below), or using one of the various read methods (e.g., pd.read_csv()):
In [21]: import pandas as pd
In [22]: df = pd.DataFrame({'a': [1,2,3,4,5], 'b': ['a','b','c','d','e'], 'c': [1.1, 2.2, 3.3, 4.4, 5.5]})
In [23]: df
Out[23]:
   a  b    c
0  1  a  1.1
1  2  b  2.2
2  3  c  3.3
3  4  d  4.4
4  5  e  5.5
In [24]: df.select_dtypes([int, float])
Out[24]:
   a    c
0  1  1.1
1  2  2.2
2  3  3.3
3  4  4.4
4  5  5.5
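Applied to the data from the question, the read_csv route might look like the sketch below. It assumes the file has no header row (hence header=None) and uses an in-memory string in place of a real file; text, numeric_df and values are made-up names.

```python
import io

import pandas as pd

# Stand-in for the real file: two rows, no header line.
text = (
    "antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1\n"
    "antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1\n"
)

# header=None makes pandas label the columns 0, 1, 2, ...
df = pd.read_csv(io.StringIO(text), header=None)

# Keep only the numeric columns; the text column is dropped.
numeric_df = df.select_dtypes(include="number")

# Convert to a plain NumPy array if that is what the later code expects.
values = numeric_df.to_numpy()
```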
Solution 3:
What worked for me, especially in this context of excluding just the first column, was:
import csv
with open("file.csv") as f:
    # csv.QUOTE_NONNUMERIC makes the reader convert every unquoted field to float,
    # so text fields have to be quoted in the file
    cr = csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)
    next(cr)  # skip the first row (assumed to be a header)
    matrix = [tuple(line[1:]) for line in cr]  # excluding the first column
I hope this helps if anyone else comes across this issue ('cause pandas and slicing didn't work properly for me).
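If the text fields in the file are not quoted, QUOTE_NONNUMERIC will fail on them, so a variant that converts the numeric fields explicitly may be easier. The sketch below assumes a header row and uses made-up sample data held in an in-memory buffer instead of file.csv.

```python
import csv
import io

# Hypothetical sample standing in for file.csv: a header plus two data rows.
sample = io.StringIO("name,a,b,c\nantelope,1,0,4\nbear,0,1,2\n")

reader = csv.reader(sample)
next(reader)  # skip the header row

# Drop the first (text) column and convert the remaining fields to float.
matrix = [tuple(float(x) for x in row[1:]) for row in reader]
```

With a real file, replace the StringIO object with open("file.csv", newline="") inside a with block.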