
Exclude Columns From Genfromtxt With Numpy

Is it possible to exclude all string columns when using genfromtxt from the numpy library? I have a CSV file with this type of data from a machine learning website: antelope,1,0,

Solution 1:

You could read everything as float and filter out the nan columns afterwards (the string column becomes nan when forced to float).

In [52]: txt=b'antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1'
In [53]: txt=[txt,txt]
In [54]: A=np.genfromtxt(txt, dtype=float, names=None,delimiter=',')
In [55]: A
Out[55]: 
array([[ nan,   1.,   0.,   0.,   1.,   0.,   0.,   0.,   1.,   1.,   1.,
          0.,   0.,   4.,   1.,   0.,   1.,   1.],
       [ nan,   1.,   0.,   0.,   1.,   0.,   0.,   0.,   1.,   1.,   1.,
          0.,   0.,   4.,   1.,   0.,   1.,   1.]])

Then find the columns that are nan in all rows; or use .any for columns with any nan. Other tests are possible.

In [56]: ind=np.isnan(A).all(axis=0)
In [57]: ind
Out[57]: 
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False], dtype=bool)
In [58]: A[:,~ind]
Out[58]: 
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  4.,
         1.,  0.,  1.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  4.,
         1.,  0.,  1.,  1.]])

Another idea is to read the file once with dtype=None, letting genfromtxt choose the dtype for each column. The resulting compound dtype can be filtered to find the columns of the desired type.

In [118]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',')
In [119]: ind=[i for i, d in enumerate(A.dtype.descr) if d[1]=='<i4']
In [120]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',',usecols=ind) 
In [121]: A
Out[121]: 
array([[1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1],
       [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1]])

The dtype could also be filtered to collect the names of the columns that have the correct type:

In [128]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',')
In [129]: ind=[d[0] for d in A.dtype.descr if d[1]=='<i4']
In [130]: A[ind]
Out[130]: 
array([(1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1),
       (1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1)], 
      dtype=[('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<i4'), ('f8', '<i4'), ('f9', '<i4'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4'), ('f17', '<i4')])

Consolidating this structured array into a 2d array with a single dtype (int) is a bit of a pain, though (I could go into the details if needed).
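
For completeness, one way that consolidation could look is sketched below. It assumes numpy >= 1.16, where numpy.lib.recfunctions.structured_to_unstructured is available; on older versions, np.stack over the selected fields does the same job.

import numpy as np
from numpy.lib import recfunctions as rfn

A = np.genfromtxt(txt, dtype=None, names=None, delimiter=',')
ind = [d[0] for d in A.dtype.descr if d[1] == '<i4']  # names of the int columns

# Collapse the selected fields of the structured array into a plain 2d int array
B = rfn.structured_to_unstructured(A[ind])            # shape (n_rows, n_int_columns)

# Equivalent without recfunctions:
# B = np.stack([A[name] for name in ind], axis=-1)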

Solution 2:

pandas has a DataFrame.select_dtypes method that will let you do this pretty easily. You can get the data into a DataFrame either directly (as in the example below), or using one of the various read methods (e.g., pd.read_csv()):

In [21]: import pandas as pd

In [22]: df = pd.DataFrame({'a': [1,2,3,4,5], 'b': ['a','b','c','d','e'], 'c': [1.1, 2.2, 3.3, 4.4, 5.5]})

In [23]: df
Out[23]:
   a  b    c
0  1  a  1.1
1  2  b  2.2
2  3  c  3.3
3  4  d  4.4
4  5  e  5.5

In [24]: df.select_dtypes([int, float])
Out[24]:
   a    c
0  1  1.1
1  2  2.2
2  3  3.3
3  4  4.4
4  5  5.5
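
Applied directly to the CSV from the question, the same idea might look like the sketch below; the file name "file.csv" and header=None are assumptions (I'm guessing the data has no header row), and .to_numpy() needs a reasonably recent pandas (older versions can use .values instead):

import pandas as pd

# Read the CSV, let pandas infer a dtype per column,
# then keep only the numeric columns and hand the result back to numpy
df = pd.read_csv("file.csv", header=None)
numeric = df.select_dtypes(include=["number"])  # drops the string (object) columns
arr = numeric.to_numpy()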

Solution 3:

What worked for me, especially in this context of excluding just the first column, was:

import csv

with open("file.csv") as f:
    # csv.QUOTE_NONNUMERIC converts every unquoted field to a float,
    # so the numeric columns come back as numbers instead of strings
    cr = csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)

    next(cr)  # skip the header row

    matrix = [tuple(line[1:]) for line in cr]  # exclude the first column
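
If a plain numpy array is the end goal, the resulting list of tuples converts directly; a minimal follow-up sketch:

import numpy as np

data = np.array(matrix)  # 2d float array, one row per remaining CSV line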

I hope this helps if anyone else comes across this issue (because pandas and slicing didn't work properly for me).
