How to hstack arrays of numpy records?
[An earlier version of this post had the inaccurate title "How do I add a single column to a numpy record array?" The question asked by that earlier title has already been partially answered, but that answer is not quite what the body of the earlier version of this post asked for. I have reworded the title and edited the post substantially to make the distinction clearer. I also explain why the answer I mentioned earlier falls short of what I'm looking for.]
Suppose I have two numpy arrays x and y, each consisting of r "record" (aka "structured") arrays. Let the shape of x be (r, c_x), and the shape of y be (r, c_y). Let's also assume that there is no overlap between x.dtype.names and y.dtype.names.
For example, for r = 2, c_x = 2 and c_y = 1:
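import numpy as np
# list(...) is needed on Python 3, where zip returns an iterator
x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
# 'S10' is the current spelling of the older 'a10' alias (10-byte string)
y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])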
I would like to "horizontally" concatenate x and y to create a new array of records z, having shape (r, c_x + c_y). This operation should not modify x or y at all.
In general, z = np.hstack((x, y)) won't work, because the dtypes of x and y won't necessarily match. For example, continuing with the example above:
z = np.hstack((x, y))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-def477e6c8bf> in <module>()
----> 1 z = np.hstack((x, y))
TypeError: invalid type promotion
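A minimal sketch for reproducing this failure on Python 3, assuming a reasonably recent NumPy. The exact exception message varies between NumPy versions, so the sketch only relies on the error being a TypeError (or a subclass of it); the flag name hstack_failed is made up for illustration.

```python
import numpy as np

x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])

# Both arrays are one-dimensional: each element is one record.
print(x.shape, y.shape)

# hstack has to find a common dtype for x and y, and there is none,
# because the two structured dtypes have different fields.
try:
    np.hstack((x, y))
    hstack_failed = False
except TypeError as exc:
    hstack_failed = True
    print(type(exc).__name__, exc)
```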
Now, there is a function numpy.lib.recfunctions.append_fields that looks like it might do something close to what I'm looking for, but I haven't been able to get anything out of it: everything I've tried with it either fails with an error or produces something other than what I'm trying to get.
Can someone please show me explicitly the code (using n.l.r.append_fields or otherwise [1]) that would generate, from the x and y defined in the example above, a new array of records z that is equivalent to the horizontal concatenation of x and y, and do so without modifying either x or y?
I am guessing it only takes one or two lines of code to do this. Of course, I'm looking for code that doesn't require building z record by record, by iterating over x and y. Also, the code may assume that x and y have the same number of records, and that there is no overlap between x.dtype.names and y.dtype.names. Other than that, the code I'm looking for should not need to know anything about x and y. Ideally, it should also be agnostic about the number of arrays to join. IOW, leaving out error checking, the code I'm looking for could be the body of a function hstack_rec, so that the new array z would be the result of hstack_rec((x, y)).
[1] ...although I have to admit that, after my record of perfect failure with numpy.lib.recfunctions.append_fields, I have become a little curious about how this function could be used at all, regardless of its relevance to this post.
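For what it's worth, here is a minimal sketch of one way append_fields can be called on the example arrays, assuming Python 3 and a recent NumPy. It passes the field name and the field's data separately, and sets usemask=False so that the result is a plain ndarray rather than a masked array. It is only an illustration for the single-field y above, not a general hstack.

```python
import numpy as np
from numpy.lib.recfunctions import append_fields

x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])

# append_fields(base, names, data): add the column y['s'] to x under the
# name 's'. A new array is returned; x and y are left untouched.
z = append_fields(x, 's', y['s'], usemask=False)

print(z.dtype.names)
print(z)
```

For a y with several fields, the same call should accept a list of names and a matching list of columns, e.g. list(y.dtype.names) and [y[n] for n in y.dtype.names].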
I never use recarrays, so someone else will probably come up with something slicker, but maybe merge_arrays would work?
>>> import numpy as np
>>> import numpy.lib.recfunctions as nlr
>>> x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
>>> y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])
>>> x
array([(1, 3.0), (2, 4.0)],
dtype=[('i', '<i4'), ('f', '<f4')])
>>> y
array([('a',), ('b',)],
dtype=[('s', '|S10')])
>>> z = nlr.merge_arrays([x, y], flatten=True)
>>> z
array([(1, 3.0, 'a'), (2, 4.0, 'b')],
dtype=[('i', '<i4'), ('f', '<f4'), ('s', '|S10')])
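Wrapped up as the hstack_rec function the question asks for, this could look roughly like the sketch below (Python 3 assumed). hstack_rec is the name proposed in the question; hstack_rec_manual is a hypothetical alternative added here for comparison, which allocates the output with the combined dtype and copies one field at a time instead of relying on merge_arrays. Neither version does any error checking.

```python
import numpy as np
from numpy.lib.recfunctions import merge_arrays


def hstack_rec(arrays):
    """Join structured arrays field-wise using merge_arrays."""
    # flatten=True yields one flat record per row instead of nested records.
    return merge_arrays(list(arrays), flatten=True)


def hstack_rec_manual(arrays):
    """Same idea without recfunctions: copy column by column."""
    arrays = list(arrays)
    # Combined dtype: every field of every input, in order.
    descr = [(name, a.dtype[name]) for a in arrays for name in a.dtype.names]
    out = np.empty(arrays[0].shape, dtype=descr)
    for a in arrays:
        for name in a.dtype.names:
            out[name] = a[name]  # vectorised copy of one whole column
    return out


x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])

z = hstack_rec((x, y))
z_manual = hstack_rec_manual((x, y))
print(z.dtype.names)
print(z_manual.dtype.names)
```

Both functions return a new array, so x and y are not modified. The manual version loops over fields, not over records, so it stays within the "no record-by-record iteration" constraint.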
This is a very late answer, but maybe it will be useful to someone else. I used this solution when I had the same question, with most of the same criteria.
It doesn't generate a new numpy array, but using zip and itertools.chain is much faster. In my case, I needed to access each value of each row in sequential order. Here is a benchmark that mimics this use case:
import numpy
from numpy.lib.recfunctions import merge_arrays
from itertools import chain
a = numpy.empty(3, [("col1", int), ("col2", float)])
b = numpy.empty(3, [("col3", int), ("col4", "U1")])
Results:
%timeit [i for i in (row for row in merge_arrays([a,b], flatten=True))]
52.9 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [i for i in (row for row in (chain(i,k) for i,k in zip(a,b)))]
3.47 µs ± 52 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
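One thing to keep in mind when reading these timings: chain objects are lazy, so the second expression only builds the iterators, and the field values are read when each chain is actually consumed. A minimal sketch of consuming them follows, using small hand-filled arrays (the values are made up for illustration) instead of numpy.empty so that the output is well defined.

```python
from itertools import chain

import numpy as np

a = np.array([(1, 1.5), (2, 2.5)], dtype=[("col1", int), ("col2", float)])
b = np.array([(10, "x"), (20, "y")], dtype=[("col3", int), ("col4", "U1")])

# zip pairs up the records row by row; chain walks the fields of the
# record from a, then the fields of the record from b.
rows = [tuple(chain(ra, rb)) for ra, rb in zip(a, b)]

for row in rows:
    print(row)
```

Each row is an ordinary tuple of NumPy scalars; no combined structured array is ever allocated, which is where the time saving comes from.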