Pandas convert dataframe to array of tuples

Question:

I have manipulated some data using pandas and now I want to carry out a batch save back to the database. This requires me to convert the dataframe into an array of tuples, with each tuple corresponding to a “row” of the dataframe.

My DataFrame looks something like:

In [182]: data_set
Out[182]: 
  index data_date   data_1  data_2
0  14303 2012-02-17  24.75   25.03 
1  12009 2012-02-16  25.00   25.07 
2  11830 2012-02-15  24.99   25.15 
3  6274  2012-02-14  24.68   25.05 
4  2302  2012-02-13  24.62   24.77 
5  14085 2012-02-10  24.38   24.61 

I want to convert it to an array of tuples like:

[(datetime.date(2012,2,17),24.75,25.03),
(datetime.date(2012,2,16),25.00,25.07),
...etc. ]

Any suggestion on how I can efficiently do this?

Question comments:

1  
For those coming to this answer in 2017+, there is a new idiomatic solution below. You can just use list(df.itertuples(index=False, name=None))

Answers:

Answer 1:

How about:

subset = data_set[['data_date', 'data_1', 'data_2']]
tuples = [tuple(x) for x in subset.values]
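A minimal sketch of the approach above, run on a hypothetical three-row reconstruction of the question's frame (the construction of data_set here is an assumption, not the asker's real data). One thing worth checking is the type of the date element: with a datetime64 column next to float columns, the rows of subset.values typically carry pandas Timestamp objects (a datetime.datetime subclass) rather than datetime.date.

```python
import pandas as pd

# Hypothetical reconstruction of the first three rows of the question's frame.
data_set = pd.DataFrame({
    "index": [14303, 12009, 11830],
    "data_date": pd.to_datetime(["2012-02-17", "2012-02-16", "2012-02-15"]),
    "data_1": [24.75, 25.00, 24.99],
    "data_2": [25.03, 25.07, 25.15],
})

# Select the wanted columns, then turn each row of the NumPy array into a tuple.
subset = data_set[["data_date", "data_1", "data_2"]]
tuples = [tuple(x) for x in subset.values]

# Inspect the element types of the first row.
first = tuples[0]
print(type(first[0]).__name__, first[1], first[2])
```

If the database driver needs plain datetime.date values, the date column would have to be converted before building the tuples.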

Answer comments:

1  
Thanks a lot Wes, much cleaner than the solution that I came up with. Great work on pandas in general; I’ve just started scratching the surface, but it looks great.
– enrishi
Mar 20 ’12 at 7:08

Answer 2:

A generic way:

[tuple(x) for x in data_set.to_records(index=False)]

Answer comments:

Answer 3:

list(data_set.itertuples(index=False))

As of pandas 0.17.1, the above returns a list of namedtuples.
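As a rough illustration of the difference between the two return types, here is a sketch on a small hypothetical frame (dates are kept as strings for simplicity). With the default name, each row is a namedtuple whose fields can be accessed by column name; passing name=None yields ordinary tuples instead.

```python
import pandas as pd

# Hypothetical two-row frame; dates are plain strings here.
df = pd.DataFrame({
    "data_date": ["2012-02-17", "2012-02-16"],
    "data_1": [24.75, 25.00],
    "data_2": [25.03, 25.07],
})

# Default name: each row is a namedtuple with one field per column.
named = list(df.itertuples(index=False))

# name=None: each row is a regular tuple.
plain = list(df.itertuples(index=False, name=None))

print(named[0].data_1)   # attribute access by column name
print(plain[0])
```

Plain tuples are usually what a database driver's batch-insert call expects, so name=None is likely the better fit for the use case in the question.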

Answer comments:

1  
This should be the accepted answer IMHO (now that a dedicated feature exists). BTW, if you want normal tuples in your zip iterator (instead of namedtuples), then call: data_set.itertuples(index=False, name=None)

Answer 4:

Here’s a vectorized approach (assuming the dataframe data_set is named df instead) that returns a list of tuples as shown:

>>> df.set_index(['data_date'])[['data_1', 'data_2']].to_records().tolist()

produces:

[(datetime.datetime(2012, 2, 17, 0, 0), 24.75, 25.03),
 (datetime.datetime(2012, 2, 16, 0, 0), 25.0, 25.07),
 (datetime.datetime(2012, 2, 15, 0, 0), 24.99, 25.15),
 (datetime.datetime(2012, 2, 14, 0, 0), 24.68, 25.05),
 (datetime.datetime(2012, 2, 13, 0, 0), 24.62, 24.77),
 (datetime.datetime(2012, 2, 10, 0, 0), 24.38, 24.61)]

The idea of setting the datetime column as the index is to aid the conversion of each Timestamp value to its corresponding datetime.datetime equivalent, by making use of the convert_datetime64 argument of DataFrame.to_records, which performs that conversion for a dataframe with a DatetimeIndex.

This returns a recarray, which can then be turned into a list using .tolist().


A more general solution, depending on the use case, would be:

df.to_records().tolist()                              # Supply index=False to exclude index
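To make the effect of the index argument concrete, here is a small sketch on a hypothetical frame with string dates. It deliberately avoids datetime columns: how those come out of .tolist() appears to depend on the pandas and NumPy versions in use (datetime64[ns] values may come back as integers), so that part is best checked on your own setup.

```python
import pandas as pd

# Hypothetical two-row frame; dates are plain strings to keep the output version-independent.
df = pd.DataFrame({
    "data_date": ["2012-02-17", "2012-02-16"],
    "data_1": [24.75, 25.00],
    "data_2": [25.03, 25.07],
})

# Default (index=True): the row label is prepended to every tuple.
with_index = df.to_records().tolist()

# index=False: only the column values are kept.
without_index = df.to_records(index=False).tolist()

print(with_index[0])
print(without_index[0])
```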

Answer comments:

Answer 5:

Motivation


Many data sets are large enough that we need to concern ourselves with speed/efficiency. So I offer this solution in that spirit. It happens to also be succinct.

For the sake of comparison, let’s drop the index column:

df = data_set.drop('index', axis=1)

Solution


I’ll propose the use of zip and a comprehension:

list(zip(*[df[c].values.tolist() for c in df]))

[('2012-02-17', 24.75, 25.03),
 ('2012-02-16', 25.0, 25.07),
 ('2012-02-15', 24.99, 25.15),
 ('2012-02-14', 24.68, 25.05),
 ('2012-02-13', 24.62, 24.77),
 ('2012-02-10', 24.38, 24.61)]

It happens to also be flexible if we wanted to deal with a specific subset of columns. We’ll assume the columns we’ve already displayed are the subset we want.

list(zip(*[df[c].values.tolist() for c in ['data_date', 'data_1', 'data_2']]))

[('2012-02-17', 24.75, 25.03),
 ('2012-02-16', 25.0, 25.07),
 ('2012-02-15', 24.99, 25.15),
 ('2012-02-14', 24.68, 25.05),
 ('2012-02-13', 24.62, 24.77),
 ('2012-02-10', 24.38, 24.61)]

All the following produce the same results

  • [tuple(x) for x in df.values]
  • df.to_records(index=False).tolist()
  • list(map(tuple,df.values))
  • list(map(tuple, df.itertuples(index=False)))
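The claim that these approaches agree can be checked with a short sketch like the one below. The frame is a hypothetical two-row stand-in with string dates, and the dictionary labels are made up for illustration. Equality here compares values only; the element types (for example NumPy scalars versus Python floats) can still differ between approaches on other data.

```python
import pandas as pd

# Hypothetical two-row frame with the index column already dropped.
df = pd.DataFrame({
    "data_date": ["2012-02-17", "2012-02-16"],
    "data_1": [24.75, 25.00],
    "data_2": [25.03, 25.07],
})

# One entry per approach discussed in this answer.
results = {
    "zip + comprehension": list(zip(*[df[c].values.tolist() for c in df])),
    "tuple over values": [tuple(x) for x in df.values],
    "to_records": df.to_records(index=False).tolist(),
    "map over values": list(map(tuple, df.values)),
    "map over itertuples": list(map(tuple, df.itertuples(index=False))),
}

# Compare every result against the zip-based one.
expected = results["zip + comprehension"]
all_equal = all(rows == expected for rows in results.values())
print(all_equal)
```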

What is quicker?

zip and comprehension is faster by a large margin

%timeit [tuple(x) for x in df.values]
%timeit list(map(tuple, df.itertuples(index=False)))
%timeit df.to_records(index=False).tolist()
%timeit list(map(tuple,df.values))
%timeit list(zip(*[df[c].values.tolist() for c in df]))

small data

10000 loops, best of 3: 55.7 µs per loop
1000 loops, best of 3: 596 µs per loop
10000 loops, best of 3: 38.2 µs per loop
10000 loops, best of 3: 54.3 µs per loop
100000 loops, best of 3: 12.9 µs per loop

large data

10 loops, best of 3: 58.8 ms per loop
10 loops, best of 3: 43.9 ms per loop
10 loops, best of 3: 29.3 ms per loop
10 loops, best of 3: 53.7 ms per loop
100 loops, best of 3: 6.09 ms per loop
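For readers who want to rerun the comparison outside IPython, a rough sketch using the standard timeit module follows. The frame size, the random data, and the labels are assumptions made for illustration, and the sketch also includes the itertuples(index=False, name=None) variant raised in the comments. Absolute numbers will vary by machine and by pandas version, so no expected timings are given.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test frame: 5,000 rows of random floats in three columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5_000, 3)), columns=["a", "b", "c"])

# Each candidate is a zero-argument callable that builds the list of tuples.
candidates = {
    "[tuple(x) for x in df.values]": lambda: [tuple(x) for x in df.values],
    "map(tuple, itertuples(index=False))": lambda: list(map(tuple, df.itertuples(index=False))),
    "to_records(index=False).tolist()": lambda: df.to_records(index=False).tolist(),
    "map(tuple, df.values)": lambda: list(map(tuple, df.values)),
    "zip + comprehension": lambda: list(zip(*[df[c].values.tolist() for c in df])),
    "itertuples(index=False, name=None)": lambda: list(df.itertuples(index=False, name=None)),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats, 5 calls per repeat, reported per call.
    best = min(timeit.repeat(func, number=5, repeat=3))
    timings[label] = best / 5
    print(f"{label:<40} {timings[label] * 1e3:8.3f} ms per call")
```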

Answer comments:

    
You didn’t make a fair comparison. Your solution isn’t any faster than list(df.itertuples(index=False, name=None)). This answer will just confuse people. I would delete it if I were you.
    
@TedPetrou why isn’t it fair? No one proposed what you suggested. Why don’t you put it as an answer. The two answers help illuminate the entire problem.
    
    
You suggested name=None. Does that not make a difference?

Answer 6:

A more Pythonic way:

df = data_set[['data_date', 'data_1', 'data_2']]
list(map(tuple, df.values))  # in Python 3, map returns an iterator, so wrap it in list()

Answer comments:

Answer 7:

#try this one:

tuples = list(zip(data_set["data_date"], data_set["data_1"],data_set["data_2"]))
print(tuples)
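The question asks for datetime.date values in the first position. One way to get them with this zip approach, sketched on a hypothetical three-row reconstruction of data_set, is to pass the column through the .dt.date accessor first. This assumes data_date is a datetime64 column; the expected_first name is introduced here only for the check.

```python
import datetime

import pandas as pd

# Hypothetical reconstruction of the first three rows of the question's frame.
data_set = pd.DataFrame({
    "index": [14303, 12009, 11830],
    "data_date": pd.to_datetime(["2012-02-17", "2012-02-16", "2012-02-15"]),
    "data_1": [24.75, 25.00, 24.99],
    "data_2": [25.03, 25.07, 25.15],
})

# .dt.date converts each Timestamp to a datetime.date before zipping the columns.
tuples = list(zip(
    data_set["data_date"].dt.date,
    data_set["data_1"],
    data_set["data_2"],
))

# Compare the first row with the shape requested in the question.
expected_first = (datetime.date(2012, 2, 17), 24.75, 25.03)
print(tuples[0] == expected_first)
```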

Answer comments:

    
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. – From Review

Original source:

https://stackoverflow.com/questions/47756553/convert-dataframe-into-list-python
