How Do I Add A 'rownumber' Field To A Structured Numpy Array?
Solution 1:
You don't need to fill rowNums iteratively:
In [93]: rowNums=np.zeros(10,dtype=[('RowID','f8')])
In [94]: for i in range(0,10):
   ....:     rowNums['RowID'][i]=i
   ....:
In [95]: rowNums
Out[95]:
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
(8.0,), (9.0,)],
dtype=[('RowID', '<f8')])
Just assign the range values to the field:
In [96]: rowNums['RowID']=np.arange(10)
In [97]: rowNums
Out[97]:
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
(8.0,), (9.0,)],
dtype=[('RowID', '<f8')])
rfn.merge_arrays shouldn't be that slow, unless csvData.dtype has a great number of fields. This function creates a new dtype that merges the fields of the 2 inputs, and then copies the data field by field. For many rows and just a few fields, that is quite fast.
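For reference, here is a minimal sketch of the merge_arrays call being discussed. The array named data is a hypothetical stand-in for csvData, and rfn is assumed to be the usual alias for numpy.lib.recfunctions (the import is not shown in the question).

```python
import numpy as np
import numpy.lib.recfunctions as rfn  # assumed meaning of the `rfn` alias

# Hypothetical stand-in for csvData: a small structured array with two fields.
data = np.array([(4, 2), (1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])

# One-field array holding the row numbers, filled with a single assignment.
row_nums = np.zeros(len(data), dtype=[('RowID', 'f8')])
row_nums['RowID'] = np.arange(len(data))

# flatten=True gives a flat dtype (RowID, x, y) rather than nested fields.
merged = rfn.merge_arrays([row_nums, data], flatten=True)
print(merged.dtype.names)
print(merged)
```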
But you should be able to get the original order back without adding this extra field.
A 2 field 1d array:
In [118]: x = np.array([(4,2),(1, 0), (0, 1),(1,2),(3,1)], dtype=[('x', '<i4'), ('y', '<i4')])
In [119]: i = np.argsort(x, order=('y','x'))
In [120]: i
Out[120]: array([1, 2, 4, 3, 0], dtype=int32)
In [121]: x[i]
Out[121]:
array([(1, 0), (0, 1), (3, 1), (1, 2), (4, 2)],
dtype=[('x', '<i4'), ('y', '<i4')])
The same values are now sorted first on y, then on x.
In [122]: j=np.argsort(i)
In [123]: j
Out[123]: array([4, 0, 1, 3, 2], dtype=int32)
In [124]: x[i][j]
Out[124]:
array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)],
dtype=[('x', '<i4'), ('y', '<i4')])
Back to the original order.
I could have added a row index array to x, and then done a sort on that. But why add it; why not just apply i to a separate array:
In [127]: np.arange(5)[i]
Out[127]: array([1, 2, 4, 3, 0])
But sorting that is just the same as sorting i.
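The argsort-of-argsort idea can be checked in a standalone script. This is only a sketch using the same five-row example; the scatter-based inverse (the name inv is mine) is an alternative I am adding for comparison, not something from the session above.

```python
import numpy as np

x = np.array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)],
             dtype=[('x', '<i4'), ('y', '<i4')])

i = np.argsort(x, order=('y', 'x'))  # permutation that sorts x on y, then x
j = np.argsort(i)                    # inverse permutation of i

# Alternative way to build the inverse without a second sort:
# place position k at slot i[k].
inv = np.empty_like(i)
inv[i] = np.arange(len(i))

sorted_x = x[i]
restored = sorted_x[j]               # back to the original row order

print(i, j, inv)
print(restored)
```

Both j and inv undo the sort, so no RowID field is needed just to recover the original order.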
merge_arrays is doing essentially the following:
Union dtype:
In [139]: dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
In [140]: y=np.zeros((5,),dtype=dt)
Fill in the values:
In [141]: y['RowID']=np.arange(5)
In [143]: for name in x.dtype.names:
   .....:     y[name]=x[name]
In [144]: y
Out[144]:
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
And to test my argsort of argsort idea:
In [145]: y[i]
Out[145]:
array([(1.0, 1, 0), (2.0, 0, 1), (4.0, 3, 1), (3.0, 1, 2), (0.0, 4, 2)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [146]: np.argsort(y[i],order=('RowID'))
Out[146]: array([4, 0, 1, 3, 2], dtype=int32)
In [147]: j
Out[147]: array([4, 0, 1, 3, 2], dtype=int32)
Sorting on the reordered RowID is the same as sorting on i.
Curiously, merge_arrays is quite a bit slower than my reconstruction:
In [163]: rfn.merge_arrays([rowNums,x],flatten=True)
Out[163]:
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [164]: timeit rfn.merge_arrays([rowNums,x],flatten=True)
10000 loops, best of 3: 161 µs per loop
In [165]: %%timeit
   .....: dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
   .....: y=np.zeros((5,),dtype=dt)
   .....: y['RowID']=rowNums['RowID']
   .....: for name in x.dtype.names:
   .....:     y[name]=x[name]
   .....:
10000 loops, best of 3: 38.4 µs per loop
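If the RowID field is still wanted, the manual reconstruction timed above could be wrapped in a small helper. This is a sketch only: the name add_row_id is hypothetical, it assumes a 1d structured array, and it builds the field list from arr.dtype[name] rather than dtype.descr, which is my own choice rather than what the timing above measured.

```python
import numpy as np

def add_row_id(arr, name='RowID'):
    """Return a copy of 1d structured array `arr` with a leading row-number field."""
    # New dtype: the row-number field first, then the existing fields in order.
    fields = [(name, 'f8')] + [(n, arr.dtype[n]) for n in arr.dtype.names]
    out = np.zeros(arr.shape, dtype=fields)
    out[name] = np.arange(len(arr))
    # Copy the data one field at a time.
    for n in arr.dtype.names:
        out[n] = arr[n]
    return out

x = np.array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)],
             dtype=[('x', '<i4'), ('y', '<i4')])
y = add_row_id(x)
print(y)
```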
Solution 2:
rowNums = np.zeros(len(csvData),dtype=[('RowID','f8')])
rowNums['RowID']=np.arange(len(csvData))
The above saves approx half a second per file with the csv files I am using. Very good so far.
However, the key thing was how to efficiently obtain a record of the sort order. This is most elegantly solved using:
sortorder = np.argsort(csvData, order=('col_1','col_2','col_3','col_4','col_5'))
giving an array that lists the order of items in csvData when sorted by cols 1 through 5.
This removes the need to make, populate and merge a RowID column, saving me around 15 s per csv file (over 6 hrs across my entire dataset).
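A small sketch of that approach, using a made-up four-row array in place of the real csvData and only two of the five sort columns (both are assumptions for illustration):

```python
import numpy as np

# Hypothetical stand-in for csvData with two of the sort columns.
csvData = np.array([(2, 1.5), (1, 3.0), (2, 0.5), (1, 1.0)],
                   dtype=[('col_1', 'i4'), ('col_2', 'f8')])

# Field names go in the `order` keyword, not as positional arguments.
sortorder = np.argsort(csvData, order=('col_1', 'col_2'))
sortedData = csvData[sortorder]

# sortorder[k] is the original row number of the k-th sorted row,
# which is the information the RowID column would have carried.
print(sortorder)
print(sortedData)
```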
Thank you very much @hpaulj