Skip to content Skip to sidebar Skip to footer

How Do I Add A 'rownumber' Field To A Structured Numpy Array?

I am using genfromtxt to load large csv files into structured arrays. I need to sort the data (using multiple fields), do some work and then restore the data to the original orderi

Solution 1:

You don't need to fill rowNums iteratively:

In [93]: rowNums=np.zeros(10,dtype=[('RowID','f8')])
In [94]: for i in range(0,10):
   ....:     rowNums['RowID'][i]=i
   ....:     
In [95]: rowNums
Out[95]: 
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
       (8.0,), (9.0,)], 
      dtype=[('RowID', '<f8')])

Just assign the range values to the field:

In [96]: rowNums['RowID']=np.arange(10)
In [97]: rowNums
Out[97]: 
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
       (8.0,), (9.0,)], 
      dtype=[('RowID', '<f8')])

rfn.merge_arrays shouldn't be that slow - unless csvData.dtype has a great number of fields. This function creates a new dtype that merges the fields of the 2 inputs, and then copies data field by field. For many rows, and just a few fields, that is quite fast.

But you should be able to get the original order back without adding this extra field.

A 2 field 1d array:

In [118]: x = np.array([(4,2),(1, 0), (0, 1),(1,2),(3,1)], dtype=[('x', '<i4'), ('y', '<i4')])
In [119]: i = np.argsort(x, order=('y','x'))
In [120]: i
Out[120]: array([1, 2, 4, 3, 0], dtype=int32)
In [121]: x[i]
Out[121]: 
array([(1, 0), (0, 1), (3, 1), (1, 2), (4, 2)], 
      dtype=[('x', '<i4'), ('y', '<i4')])

The same values are now sorted first on y, then on x.

In [122]: j=np.argsort(i)
In [123]: j
Out[123]: array([4, 0, 1, 3, 2], dtype=int32)
In [124]: x[i][j]
Out[124]: 
array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)], 
      dtype=[('x', '<i4'), ('y', '<i4')])

Back to the original order

I could have added a row index array to x, and then done a sort on that. But why add it; why not just apply i to a separate array:

In [127]: np.arange(5)[i]
Out[127]: array([1, 2, 4, 3, 0])

But sorting that is just the same as sorting i.


merge_arrays is doing essentially the following:

Union dtype:

In [139]: dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
In [140]: y=np.zeros((5,),dtype=dt)

fill in the values:

In [141]: y['RowID']=np.arange(5)
In [143]: for name in x.dtype.names:
    y[name]=x[name]
In [144]: y
Out[144]: 
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)], 
      dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])

And to test my argsort of argsort idea:

In[145]: y[i]Out[145]: 
array([(1.0, 1, 0), (2.0, 0, 1), (4.0, 3, 1), (3.0, 1, 2), (0.0, 4, 2)], 
      dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In[146]: np.argsort(y[i],order=('RowID'))
Out[146]: array([4, 0, 1, 3, 2], dtype=int32)
In[147]: jOut[147]: array([4, 0, 1, 3, 2], dtype=int32)

Sorting on the reordered RowID is the same as sorting on i.


Curiously merge_arrays is quite a bit slower than my reconstruction:

In [163]: rfn.merge_arrays([rowNums,x],flatten=True)
Out[163]: 
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)], 
      dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [164]: timeit rfn.merge_arrays([rowNums,x],flatten=True)
10000 loops, best of 3: 161 µs per loop

In [165]: %%timeit 
dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
y=np.zeros((5,),dtype=dt)
y['RowID']=rowNums['RowID']
for name in x.dtype.names:
    y[name]=x[name]
10000 loops, best of 3: 38.4 µs per loop

Solution 2:

rowNums = np.zeros(len(csvData),dtype=[('RowID','f8')])
rowNums['RowID']=np.arange(len(csvData))

The above saves approx half a second per file with the csv files I am using. Very good so far.

However the key thing was how to efficiently obtain a record of the sort order. This is most elegantly solved using;

sortorder = np.argsort(csvData, 'col_1','col_2','col_3','col_4','col_5')

giving an array that lists the order of items in CsvData when sorted by cols 1 through 5. This negates the need to make, populate and merge a RowID column, saving me around 15s per csv file (over 6hrs across my entire dataset.)

Thank you very much @hpaulj

Post a Comment for "How Do I Add A 'rownumber' Field To A Structured Numpy Array?"