
How To Create A Custom Numpy Dtype Using Cython

There are examples for creating custom numpy dtypes using C here: Additionally, it seems to be possible to create custom ufuncs in cython: It seems like it should also be possible

Solution 1:

Numpy arrays are most suitable for data types with a fixed size. If the objects in the array are not fixed size (such as your MultiEvent), they have to be stored as generic Python objects, and operations on them can become much slower.

I would recommend storing all of the survival times in a 1-D record array with three fields: event_id, time, and period. Each event can appear multiple times in the array:

>>> import numpy as np
>>> rawdata = [(1, 0.4, 4), (1, 0.6, 6), (2, 2.6, 6)]
>>> npdata = np.rec.fromrecords(rawdata, names='event_id,time,period')
>>> print(npdata)
[(1, 0.4, 4) (1, 0.6, 6) (2, 2.6, 6)]

To get the data for a specific event, you can use fancy indexing with a boolean mask:

>>> eventdata = npdata[npdata.event_id == 1]
>>> print(eventdata)
[(1, 0.4, 4) (1, 0.6, 6)]
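The same layout can also be sketched with an explicit structured dtype, which makes the fixed record size visible and allows grouping all events in one pass. The field types (int32, float64, int32) and the helper name `times_by_event` are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np

# Assumed field types; the answer above lets NumPy infer them instead.
event_dtype = np.dtype([("event_id", np.int32),
                        ("time", np.float64),
                        ("period", np.int32)])

rawdata = [(1, 0.4, 4), (1, 0.6, 6), (2, 2.6, 6)]
npdata = np.array(rawdata, dtype=event_dtype)


def times_by_event(records):
    """Group the `time` field by `event_id` (hypothetical helper)."""
    # A stable sort keeps the original order of rows within each event.
    order = np.argsort(records["event_id"], kind="stable")
    sorted_records = records[order]
    # `starts` holds the index of the first row of each distinct event_id.
    ids, starts = np.unique(sorted_records["event_id"], return_index=True)
    groups = np.split(sorted_records["time"], starts[1:])
    return {int(i): g for i, g in zip(ids, groups)}


grouped = times_by_event(npdata)
```

Here `grouped[1]` holds the two survival times recorded for event 1, while every row in `npdata` still occupies the same number of bytes (`npdata.itemsize`).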

The advantage of this approach is that you can easily integrate it with your ndarray-based functions. You can also access these arrays from Cython, as described in the manual:

import numpy as np
cimport numpy as np

# Packed struct whose fields mirror the record dtype used in f() below.
cdef packed struct Event:
    np.int32_t event_id
    np.float64_t time
    np.float64_t period

def f():
    cdef np.ndarray[Event] b = np.zeros(10,
        dtype=np.dtype([('event_id', np.int32),
                        ('time', np.float64),
                        ('period', np.float64)]))
    <...>

Solution 2:

I apologise for not answering the question directly, but I've had similar problems before. If I understand correctly, the real problem is that you have variable-length data, which is really not one of numpy's strengths and is the reason you're running into performance issues. Unless you know in advance the maximum number of entries for a multievent, you'll have problems, and even then you'll waste a lot of memory and disk space on zero padding for the events that aren't multi-events.
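To make the padding cost concrete, here is a small sketch that compares a zero-padded 2-D array with a flat record array holding the same values. The `events` list is made-up toy data, and the field names are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical toy data: most events have few entries, one has many.
events = [[0.4], [0.6, 1.1], [2.6],
          [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]]

# Padded layout: every row is as wide as the longest event.
max_len = max(len(e) for e in events)
padded = np.zeros((len(events), max_len), dtype=np.float64)
for row, times in enumerate(events):
    padded[row, :len(times)] = times

# Flat layout: one fixed-size record per entry, tagged with its event id.
flat = np.array([(i, t) for i, times in enumerate(events) for t in times],
                dtype=[("event_id", np.int32), ("time", np.float64)])

padded_bytes = padded.nbytes
flat_bytes = flat.nbytes
```

With this toy data the padded array takes 256 bytes and the flat one 144 bytes. The gap widens as the longest event grows relative to the typical one.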

You have data points with more than one field, some of which are related to other fields, and some of which need to be identified in groups. This hints strongly that you should consider a database of some form for storing this information, for performance, memory, space-on-disk and sanity reasons.

It will be much easier for a person new to your code to understand a simple database schema than a complicated, hacked-on-numpy structure that will be frustratingly slow and bloated. SQL queries are quick and easy to write in comparison.

Based on my understanding of your explanation, I would suggest having Event and MultiEvent tables, where each Event entry has a foreign key into the MultiEvent table where relevant.
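A minimal sketch of such a schema, using Python's built-in sqlite3 module, might look like the following. The column names (`time`, `period`, `multi_event_id`) and the sample rows are assumptions for illustration; the real schema depends on what an Event and a MultiEvent actually carry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Assumed columns; adapt them to the real data.
conn.executescript("""
CREATE TABLE MultiEvent (
    id INTEGER PRIMARY KEY
);
CREATE TABLE Event (
    id INTEGER PRIMARY KEY,
    time REAL NOT NULL,
    period INTEGER NOT NULL,
    multi_event_id INTEGER REFERENCES MultiEvent(id)  -- NULL for single events
);
""")

conn.execute("INSERT INTO MultiEvent (id) VALUES (1)")
conn.executemany(
    "INSERT INTO Event (time, period, multi_event_id) VALUES (?, ?, ?)",
    [(0.4, 4, 1), (0.6, 6, 1), (2.6, 6, None)],
)

# All entries belonging to multi-event 1, ordered by time.
rows = conn.execute(
    "SELECT time, period FROM Event WHERE multi_event_id = ? ORDER BY time",
    (1,),
).fetchall()
```

Events that are not part of a multi-event simply leave `multi_event_id` as NULL, so no padding is stored for them.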
