## The Question

I wonder if there is a direct way to import the contents of a CSV file into a record array, much in the way that R’s `read.table()`, `read.delim()`, and `read.csv()` family imports data into R’s data frame?

Or is the best way to use `csv.reader()` and then apply something like `numpy.core.records.fromrecords()`?

## The Answer 1

You can use NumPy’s `genfromtxt()` function to do so, by setting the `delimiter` kwarg to a comma.

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')

More information on the function can be found in its documentation.
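As a minimal sketch of how this call behaves when the file has a header row and a missing field, the snippet below parses an in-memory CSV rather than `my_file.csv`. The sample text is made up for illustration; `skip_header` is a standard `genfromtxt` parameter.

```python
import io

import numpy as np

# Hypothetical sample data standing in for the contents of a CSV file.
csv_text = "x,y,z\n1.0,2,3\n4,,6\n"

# skip_header=1 drops the column-name line. With the default float dtype,
# the empty field in the second data row is read as NaN.
my_data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(my_data.shape)  # (2, 3)
print(my_data)
```

Passing a real path such as `'my_file.csv'` in place of the `StringIO` object should work the same way.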

## The Answer 2

I would recommend the `read_csv` function from the `pandas` library:

import pandas as pd
df=pd.read_csv('myfile.csv', sep=',',header=None)
df.values
array([[ 1. , 2. , 3. ],
[ 4. , 5.5, 6. ]])

This gives a pandas DataFrame – allowing many useful data manipulation functions which are not directly available with numpy record arrays.

DataFrame is a 2-dimensional labeled data structure with columns of
potentially different types. You can think of it like a spreadsheet or
SQL table…
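If a record array is what you ultimately need, a DataFrame can be converted with `to_records()`. The following is a small sketch under assumed sample data (the column names `label`, `height`, and `qty` are hypothetical), not part of the original answer:

```python
import io

import pandas as pd

# Hypothetical CSV content with a header row and mixed column types.
csv_text = "label,height,qty\nalice,1.5,3\nbob,2.25,6\n"

df = pd.read_csv(io.StringIO(csv_text))

# index=False leaves the DataFrame index out of the result, so the
# record array has exactly one field per CSV column.
records = df.to_records(index=False)

print(records.dtype.names)  # ('label', 'height', 'qty')
print(records.height)       # fields are also reachable as attributes
```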

I would also recommend `genfromtxt`. However, since the question asks for a record array, as opposed to a normal array, the `dtype=None` parameter needs to be added to the `genfromtxt` call.

Given an input file, `myfile.csv`:

1.0, 2, 3
4, 5.5, 6

the call

import numpy as np
np.genfromtxt('myfile.csv',delimiter=',')

gives an array:

array([[ 1. , 2. , 3. ],
[ 4. , 5.5, 6. ]])

and

np.genfromtxt('myfile.csv',delimiter=',',dtype=None)

gives a record array (strictly speaking, a NumPy structured array with one typed field per column):

array([(1.0, 2.0, 3), (4.0, 5.5, 6)],
dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])

This has the advantage that files with multiple data types (including strings) can be easily imported.
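To illustrate the mixed-type case, here is a hedged sketch that reads a header row as field names and then views the result as a `np.recarray` for attribute access. The sample data and field names are invented; `names=True` and `encoding` are documented `genfromtxt` parameters.

```python
import io

import numpy as np

# Hypothetical CSV content: one string column, one float, one integer.
csv_text = "label,height,qty\nalice,1.5,3\nbob,2.25,6\n"

structured = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    names=True,        # take field names from the first line
    dtype=None,        # infer a separate type for each column
    encoding="utf-8",  # read string columns as str rather than bytes
)

print(structured.dtype.names)  # ('label', 'height', 'qty')
print(structured["height"])    # field access by name

# Viewing the structured array as a recarray adds attribute-style access.
records = structured.view(np.recarray)
print(records.qty)
```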

## The Answer 3

I timed

from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))

versus

import csv
import numpy as np
with open(dest_file,'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter = delimiter,
                           quotechar = '"')
    # collect every parsed row into a list of lists of strings
    data = [row for row in data_iter]
data_array = np.asarray(data, dtype = <whatever options>)

on 4.6 million rows with about 70 columns and found that the NumPy path took 2 min 16 secs and the csv-list comprehension method took 13 seconds.

I would recommend the csv-list comprehension method, as it most likely relies on pre-compiled libraries, and less on the interpreter, than the NumPy path does. I suspect the pandas method would have similar interpreter overhead.
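A comparison like this can be repeated on your own machine with a small harness. The sketch below is only an illustration: it writes a hypothetical synthetic file (20,000 rows of 5 floats, far smaller than the data above), the helper names are made up, and the timings it prints will vary with hardware and library versions.

```python
import csv
import os
import tempfile
import time

import numpy as np

# Hypothetical synthetic data: 20,000 rows x 5 columns of floats.
expected = np.arange(100_000, dtype=float).reshape(-1, 5)

# Write the data to a temporary CSV file that both loaders will read.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    csv.writer(tmp).writerows(expected.tolist())
    dest_file = tmp.name


def load_with_genfromtxt(path):
    """Parse the file with NumPy's genfromtxt."""
    return np.genfromtxt(path, delimiter=",")


def load_with_csv_reader(path):
    """Parse the file with csv.reader, then convert the strings to floats."""
    with open(path, newline="") as dest_f:
        rows = list(csv.reader(dest_f, delimiter=",", quotechar='"'))
    return np.asarray(rows, dtype=float)


results = {}
for loader in (load_with_genfromtxt, load_with_csv_reader):
    start = time.perf_counter()
    results[loader.__name__] = loader(dest_file)
    elapsed = time.perf_counter() - start
    print(f"{loader.__name__}: {elapsed:.3f} s")

os.remove(dest_file)  # clean up the temporary file
```

Both loaders should return the same array, so any difference in the printed times comes from the parsing path alone.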

## The Answer 4

You can also try `recfromcsv()`, which can guess data types and return a properly formatted record array.

## The Answer 5

Having tried both NumPy and pandas, I found that pandas has several advantages:

- Faster
- Less CPU usage
- About 1/3 of the RAM used by NumPy's `genfromtxt`

This is my test code:

$ for f in test_pandas.py test_numpy_csv.py ; do /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps
23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps

### test_numpy_csv.py

from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')

### test_pandas.py

from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')

### Data file:

du -h ~/me/notebook/train.csv
59M /home/hvn/me/notebook/train.csv

With NumPy and pandas at versions:

$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2

## The Answer 6

You can use this code to send CSV file data into an array:

import numpy as np
csv = np.genfromtxt('test.csv', delimiter=",")
print(csv)

## The Answer 7

I would suggest using tables (`pip3 install tables`). You can save your `.csv` file to `.h5` using pandas (`pip3 install pandas`):

import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()

You can then load your data into a *NumPy array* easily and quickly, even for huge amounts of data:

import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()
# Data in NumPy format
data = data.values

## The Answer 8

This is the easiest way:

import csv
with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

Now each entry in `data` is a record, represented as a list of strings, so you have a 2D list. It saved me so much time.
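Because `csv.reader` yields strings, a conversion step is still needed to get a numeric NumPy array. A minimal sketch, assuming an invented two-column file with a header row:

```python
import csv
import io

import numpy as np

# Hypothetical CSV content; io.StringIO stands in for an open file.
csv_text = "a,b\n1,2.5\n3,4\n"

rows = list(csv.reader(io.StringIO(csv_text)))

# The first row holds the column names; the rest hold the values as strings.
header, body = rows[0], rows[1:]
print(type(body[0][0]))  # <class 'str'>

# NumPy parses the numeric strings when an explicit dtype is requested.
array = np.array(body, dtype=float)
print(array.shape)  # (2, 2)
```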

## The Answer 9

I tried this:

import pandas as p
import numpy as n
closingValue = p.read_csv("<FILENAME>", usecols=[4], dtype=float)
print(closingValue)
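To make the effect of `usecols` concrete, here is a sketch on invented price data in which the fifth column (index 4) is the closing value. The column names are assumptions, and the sketch uses a per-column `dtype` mapping rather than the blanket `dtype=float` above.

```python
import io

import pandas as pd

# Hypothetical price data; the column at index 4 is "close".
csv_text = (
    "date,open,high,low,close\n"
    "2020-01-02,1.0,2.0,0.5,1.5\n"
    "2020-01-03,1.5,3.0,1.0,2.5\n"
)

# usecols=[4] keeps only the fifth column of the file.
closing_value = pd.read_csv(
    io.StringIO(csv_text), usecols=[4], dtype={"close": float}
)

# Convert the single remaining column to a plain NumPy array.
closing_array = closing_value["close"].to_numpy()
print(closing_array)  # [1.5 2.5]
```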

## The Answer 10

Using `numpy.loadtxt` is quite a simple method, but it requires every element to be numeric (float, int, and so on).

import numpy as np
data = np.loadtxt('c:\\1.csv',delimiter=',',skiprows=0)
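The numeric-only restriction can be seen in a short sketch. The sample strings below are hypothetical; `skiprows` and `unpack` are standard `loadtxt` parameters.

```python
import io

import numpy as np

# Hypothetical all-numeric data with a header line.
numeric_csv = "t,v\n0,1.5\n1,2.5\n2,4.0\n"

# skiprows=1 skips the header so that only numbers are parsed.
data = np.loadtxt(io.StringIO(numeric_csv), delimiter=",", skiprows=1)
print(data.shape)  # (3, 2)

# unpack=True transposes the result so each column can be unpacked.
t, v = np.loadtxt(io.StringIO(numeric_csv), delimiter=",", skiprows=1, unpack=True)

# A text column cannot be converted to float, so loadtxt raises ValueError.
mixed_csv = "label,v\nalice,1.5\n"
try:
    np.loadtxt(io.StringIO(mixed_csv), delimiter=",", skiprows=1)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```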

## The Answer 11

**This works like a charm…**

import csv
import numpy as np

with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))

# np.float was removed in NumPy 1.24; the builtin float is equivalent here
data = np.array(data, dtype=float)

## The Answer 12

In [329]: %time my_data = genfromtxt('one.csv', delimiter=',')
CPU times: user 19.8 s, sys: 4.58 s, total: 24.4 s
Wall time: 24.4 s
In [330]: %time df = pd.read_csv("one.csv", skiprows=20)
CPU times: user 1.06 s, sys: 312 ms, total: 1.38 s
Wall time: 1.38 s