python – Import multiple csv files into pandas and concatenate into one DataFrame

The Question :

478 people think this question is useful

I would like to read several csv files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:

import glob
import pandas as pd

# get data file names
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

I guess I need some help within the for loop???

The Question Comments :
  • your code does nothing because you are not appending to your dfs list, don’t you want to replace the line data = pd.read_csv(filename) with dfs.append(pd.read_csv(filename). You would then need to loop over the list and concat, I don’t think concat will work on a list of dfs.
  • also you are mixing an alias for the module with the module name in your last line, shouldn’t it be big_frame = pd.concat(dfs, ignore_index=True)?, anyway once you have a list of dataframes you will need to iterate over the list and concat to big_frame
  • Yes, I edited the code, but i’m still not able to build a concatenated dataframe from the csv-files, I’m new to python so I need some more help on this
  • you need to loop over dfs now, so something like for df in dfs: big_frame.concat(df, ignore_index=True) should work, you could also try append instead of concat also.
  • Can you tell more exactly what is not working? Because concat should handle a list of DataFrames just fine like you did. I think this is a very good approach.

The Answer 1

565 people think this answer is useful

If you have same columns in all your csv files then you can try the code below. I have added header=0 so that after reading csv first row can be assigned as the column names.

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

The Answer 2

313 people think this answer is useful

An alternative to darindaCoder’s answer:

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one

The Answer 3

60 people think this answer is useful
import glob, os    
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))

The Answer 4

45 people think this answer is useful

The Dask library can read a dataframe from multiple files:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')

(Source: https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files)

The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.

The Answer 5

40 people think this answer is useful

Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional 3rd party libraries. You can do this in 2 lines using everything Pandas and python (all versions) already have built in.

For a few files – 1 liner:

df = pd.concat(map(pd.read_csv, ['data/d1.csv', 'data/d2.csv','data/d3.csv']))

For many files:

from os import listdir

filepaths = [f for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))


This pandas line which sets the df utilizes 3 things:

  1. Python’s map (function, iterable) sends to the function (the pd.read_csv()) the iterable (our list) which is every csv element in filepaths).
  2. Panda’s read_csv() function reads in each CSV file as normal.
  3. Panda’s concat() brings all these under one df variable.

The Answer 6

25 people think this answer is useful

Easy and Fast

Import two or more csv‘s without having to make a list of names.

import glob

df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))

The Answer 7

16 people think this answer is useful

Edit: I googled my way into https://stackoverflow.com/a/21232849/186078. However of late I am finding it faster to do any manipulation using numpy and then assigning it once to dataframe rather than manipulating the dataframe itself on an iterative basis and it seems to work in this solution too.

I do sincerely want anyone hitting this page to consider this approach, but don’t want to attach this huge piece of code as a comment and making it less readable.

You can leverage numpy to really speed up the dataframe concatenation.

import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))


np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    np_array_list.append(df.as_matrix())

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)

big_frame.columns = ["col1","col2"....]

Timing stats:

total files :192
avg lines per file :8492
--approach 1 without numpy -- 8.248656988143921 seconds ---
total records old :1630571
--approach 2 with numpy -- 2.289292573928833 seconds ---

The Answer 8

11 people think this answer is useful

If you want to search recursively (Python 3.5 or above), you can do the following:

from glob import iglob
import pandas as pd

path = r'C:\user\your\path\**\*.csv'

all_rec = iglob(path, recursive=True)     
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)

Note that the three last lines can be expressed in one single line:

df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)

You can find the documentation of ** here. Also, I used iglobinstead of glob, as it returns an iterator instead of a list.



EDIT: Multiplatform recursive function:

You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:

df = read_df_rec('C:\user\your\path', *.csv)

Here is the function:

from glob import iglob
from os.path import join
import pandas as pd

def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)

The Answer 9

11 people think this answer is useful

one liner using map, but if you’d like to specify additional args, you could do:

import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None), 
                    glob.glob("data/*.csv")))

Note: map by itself does not let you supply additional args.

The Answer 10

5 people think this answer is useful

Another on-liner with list comprehension which allows to use arguments with read_csv.

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

The Answer 11

4 people think this answer is useful

If the multiple csv files are zipped, you may use zipfile to read all and concatenate as below:

import zipfile
import numpy as np
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train=[]

for f in range(0,len(ziptrain.namelist())):
    if (f == 0):
        train = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
    else:
        my_df = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
        train = (pd.DataFrame(np.concatenate((train,my_df),axis=0), 
                          columns=list(my_df.columns.values)))

The Answer 12

4 people think this answer is useful

Alternative using the pathlib library (often preferred over os.path).

This method avoids iterative use of pandas concat()/apped().

From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

import pandas as pd
from pathlib import Path

dir = Path("../relevant_directory")

df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)

The Answer 13

3 people think this answer is useful

Based on @Sid’s good answer.

Before concatenating, you can load csv files into an intermediate dictionary which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, when column names are not aligned for example.

Import modules and locate file paths:

import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

Note: OrderedDict is not necessary, but it’ll keep the order of files which might be useful for analysis.

Load csv files into a dictionary. Then concatenate:

dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)

Keys are file names f and values are the data frame content of csv files. Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the size of the key in the dictionary to only the smaller part that is relevant.

The Answer 14

0 people think this answer is useful

You can do it this way also:

import pandas as pd
import os

new_df = pd.DataFrame()
for r, d, f in os.walk(csv_folder_path):
    for file in f:
        complete_file_path = csv_folder_path+file
        read_file = pd.read_csv(complete_file_path)
        new_df = new_df.append(read_file, ignore_index=True)


new_df.shape

The Answer 15

-2 people think this answer is useful

This is how you can do using Colab on Google Drive

import pandas as pd
import glob

path = r'/content/drive/My Drive/data/actual/comments_only' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True,sort=True)
frame.to_csv('/content/drive/onefile.csv')

The Answer 16

-3 people think this answer is useful
import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")

file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    lsit_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(lsit_df_csv, ignore_index=True)

Add a Comment